Tantivy 0.20: Schemaless

Tantivy 0.20 is out.

People rejoice! A new tantivy release. And it's packed with new features:

  • Schemaless columnar storage (fast fields in tantivy)
  • Faster aggregations and new aggregation types
  • Performance improvements
  • Phrase prefix queries
  • ... and more!

See the CHANGELOG for the full list.

Short reminder: what is tantivy?

Tantivy is a high-performance full-text search engine library written in Rust (benchmarks). The library is inspired by Apache Lucene and acts as a foundation for building a search engine; we use it to build our distributed search engine, Quickwit.

Schemaless Fast Fields

The completely overhauled fast field handling via our tantivy-columnar crate is the next big step towards full support for schemaless data, without compromising on indexing performance, query performance or compression. It comes with:

  • Unified storage for typed/untyped fields
  • Support for fields with mixed types
  • Sparse codec for optional values
  • Cardinality detection (No need to specify the cardinality of a field upfront anymore!)

Stay tuned for a deep dive in an upcoming blog post!
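To make this concrete, here is a minimal sketch of what a schemaless setup can look like: all documents go into a single JSON field, and the FAST flag requests columnar storage for it. The field name "attributes" and the exact option flags are illustrative; check the tantivy 0.20 docs for the precise JSON field options.

```rust
use tantivy::schema::{Schema, FAST, STORED, TEXT};
use tantivy::Index;

fn main() -> tantivy::Result<()> {
    // One JSON field holds arbitrary, schemaless documents.
    // FAST requests columnar (fast field) storage for its values.
    let mut schema_builder = Schema::builder();
    schema_builder.add_json_field("attributes", STORED | TEXT | FAST);
    let schema = schema_builder.build();

    let index = Index::create_in_ram(schema.clone());
    let mut writer = index.writer(50_000_000)?;

    // The same JSON path may carry different types across documents;
    // the columnar storage handles the mixed types.
    let doc1 = schema
        .parse_document(r#"{"attributes": {"response_time": 120, "status": "ok"}}"#)
        .expect("valid JSON document");
    let doc2 = schema
        .parse_document(r#"{"attributes": {"response_time": "timeout", "status": 503}}"#)
        .expect("valid JSON document");
    writer.add_document(doc1)?;
    writer.add_document(doc2)?;
    writer.commit()?;
    Ok(())
}
```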

Aggregations

Aggregations are becoming an essential part of tantivy, and we've made them even better in this release:

  • New aggregation types: date_histogram, percentiles, count, min, max, sum
  • Support for u64, i64, f64 fields in term aggregation
  • Lower memory consumption for aggregations
  • Setting memory limits for aggregations via AggregationLimits
  • Faster aggregation execution, especially for term aggregations
  • Schemaless aggregations via JSON type. Even mixed types are supported!
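To give a feel for the request format, here is a rough sketch: tantivy's aggregation requests are Elasticsearch-compatible JSON that deserializes into the Aggregations request type. The field names "timestamp" and "response_time" are made up, and the exact parameter spellings should be checked against the 0.20 docs.

```rust
use serde_json::json;
use tantivy::aggregation::agg_req::Aggregations;

fn build_agg_request() -> serde_json::Result<Aggregations> {
    // Hypothetical fields; the request mirrors Elasticsearch's aggregation DSL.
    serde_json::from_value(json!({
        "requests_per_day": {
            "date_histogram": { "field": "timestamp", "fixed_interval": "1d" }
        },
        "latency_percentiles": {
            "percentiles": { "field": "response_time" }
        }
    }))
    // The parsed request is then executed through an aggregation collector,
    // which in 0.20 can be given an AggregationLimits memory budget.
}
```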

Performance

This release brings many optimizations to tantivy's indexing and searching performance:

  • Improved tokenizer performance using GATs and fewer allocations #1924 #2062
  • Faster term hash map used during indexing #2058 #1940
  • Batch processing of documents for faster collection (currently only when scores are not needed) #1937
  • Improved fast field range query performance using SIMD #1954

Tokenizer API

Previously, the tokenizer API was part of the tantivy core crate. This meant that whenever a new version of tantivy was released, all existing tokenizers had to be updated. To address this, the tokenizer API has been moved to the tantivy-tokenizer-api crate. This separation ensures that tokenizers remain compatible across tantivy releases as long as the tokenizer API itself is unchanged.

Phrase Prefix Queries

You can now perform prefix queries on phrases using the new PhrasePrefixQuery. The query parser already supports this feature; for instance, "quickwit tan"* will match documents containing "quickwit tantivy" and "quickwit tango".
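As a small, self-contained sketch (the field name "title" and the documents are made up), the query parser syntax from above can be used like this:

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();

    let index = Index::create_in_ram(schema);
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(title => "quickwit tantivy"))?;
    writer.add_document(doc!(title => "quickwit tango"))?;
    writer.add_document(doc!(title => "quickwit search"))?;
    writer.commit()?;

    let searcher = index.reader()?.searcher();

    // The trailing `*` after the quoted phrase turns it into a phrase prefix query.
    let query_parser = QueryParser::for_index(&index, vec![title]);
    let query = query_parser.parse_query(r#""quickwit tan"*"#)?;
    let hits = searcher.search(&query, &TopDocs::with_limit(10))?;
    // Expected to match "quickwit tantivy" and "quickwit tango", but not "quickwit search".
    println!("{} matching documents", hits.len());
    Ok(())
}
```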

More Improvements

  • Regex tokenizer: Tokenization based on regex patterns.
  • Coerce option: Convert values instead of returning an error during indexing.
  • Slop in phrase queries now supports transpositions: "quickwit tantivy"~2 will match "tantivy quickwit".
  • Hungarian stop words
  • Support for configuring the tokenizer on JSON fields

Thanks to all contributors

New Contributors

Thanks and welcome to all new contributors!

  • @mhlakhani made their first contribution in #1733
  • @pinkforest made their first contribution in #1746
  • @DawChihLiou made their first contribution in #1737
  • @mkleen made their first contribution in #1759
  • @lonre made their first contribution in #1803
  • @gyk made their first contribution in #1843
  • @alexcole made their first contribution in #1855
  • @Toasterson made their first contribution in #1944
  • @vsop-479 made their first contribution in #1970
  • @Tony-X made their first contribution in #1985
  • @RTEnzyme made their first contribution in #1999
  • @tottoto made their first contribution in #2018
  • @nyurik made their first contribution in #2038
  • @bazhenov made their first contribution in #1609
  • @lavrd made their first contribution in #1422
  • @tnxbutno made their first contribution in #2069