Tantivy 0.20: Schemaless
People rejoice! A new tantivy release. And it's packed with new features:
- Schemaless columnar storage (fast fields in tantivy)
- Faster and more aggregations
- Performance improvements
- Phrase prefix queries
- ... and more!
See the CHANGELOG for the full list.
Short reminder: what is tantivy?
Tantivy is a high-performance full-text search engine library written in Rust (benchmarks). The library is inspired by Apache Lucene and serves as a foundation for building a search engine; we use it to build our distributed search engine, Quickwit.
Schemaless Fastfields
The completely overhauled fast field handling via our tantivy-columnar crate is the next big step towards full support for schemaless data, without compromising on indexing performance, query performance or compression. It comes with:
- Unified storage for typed/untyped fields
- Support for fields with mixed types
- Sparse codec for optional values
- Cardinality detection (No need to specify the cardinality of a field upfront anymore!)
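To make the last point a bit more concrete, here is a minimal sketch of what detecting cardinality can look like. It is not the tantivy-columnar implementation: the `Cardinality` enum and the `detect_cardinality` function are hypothetical names chosen for this illustration, and the sketch assumes we already know how many values each document holds for the field.

```rust
/// How many values a column holds per document.
/// Hypothetical type for this sketch, not taken from tantivy-columnar.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Cardinality {
    /// Every document has exactly one value.
    Full,
    /// Every document has zero or one value.
    Optional,
    /// At least one document has more than one value.
    Multivalued,
}

/// Infers the cardinality from the number of values seen per document.
/// `values_per_doc[i]` is the number of values document `i` has for the field.
fn detect_cardinality(values_per_doc: &[usize]) -> Cardinality {
    let mut has_missing = false;
    for &count in values_per_doc {
        match count {
            0 => has_missing = true,
            1 => {}
            // A single multi-valued document decides the outcome.
            _ => return Cardinality::Multivalued,
        }
    }
    if has_missing {
        Cardinality::Optional
    } else {
        Cardinality::Full
    }
}
```

Because the decision is made from the data itself, a field can start out with one value per document and turn optional or multi-valued later without any schema change.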
Stay tuned for a deep dive in an upcoming blog post!
Aggregations
Aggregations are becoming an essential part of tantivy, and we've made them even better in this release:
- New aggregation types: date_histogram, percentiles, count, min, max, sum
- Support for u64, i64, f64 fields in term aggregation
- Lower memory consumption for aggregations
- Setting memory limits for aggregations via AggregationLimits
- Faster aggregation execution, especially for term aggregations
- Schemaless aggregations via JSON type. Even mixed types are supported!
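As a rough illustration of what a term aggregation over a mixed-type field computes, the sketch below counts documents per distinct value. It is a conceptual model only, not tantivy's aggregation code: the `Value` enum and the `terms_aggregation` function are hypothetical, and floating-point values are left out to keep the bucket keys orderable.

```rust
use std::collections::BTreeMap;

/// A value as it might appear in a schemaless JSON field.
/// Hypothetical type for this sketch, not a tantivy type.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Value {
    Str(String),
    U64(u64),
    I64(i64),
}

/// Counts documents per distinct value, in the spirit of a term aggregation.
/// `None` means the document has no value for the field.
fn terms_aggregation(column: &[Option<Value>]) -> Vec<(Value, u64)> {
    let mut counts: BTreeMap<Value, u64> = BTreeMap::new();
    // `flatten` skips documents without a value.
    for value in column.iter().flatten() {
        *counts.entry(value.clone()).or_insert(0) += 1;
    }
    let mut buckets: Vec<(Value, u64)> = counts.into_iter().collect();
    // Highest count first; the sort is stable, so ties keep the key order.
    buckets.sort_by(|a, b| b.1.cmp(&a.1));
    buckets
}
```

Values of different types simply end up in different buckets, which is what lets one field hold both strings and numbers.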
Performance
This release brings many optimizations to tantivy's indexing and searching performance:
- Improved tokenizer performance using GATs and fewer allocations #1924 #2062
- Faster term hash map used during indexing #2058 #1940
- Batch processing of documents for faster segment collection (only without score for now) #1937
- Improved fast field range query performance using SIMD #1954
Tokenizer API
Previously, the tokenizer API was part of the tantivy core crate.
This meant that whenever a new version of tantivy was released, all existing tokenizers had to be updated.
To address this issue, the tokenizer API has been moved to the tantivy-tokenizer-api crate.
This separation ensures that tokenizers will remain compatible as long as the API remains unchanged.
Phrase Prefix Queries
You can now perform prefix queries on phrases using the new PhrasePrefixQuery.
The query parser already supports this feature: for instance, "quickwit tan"* will match documents containing "quickwit tantivy" or "quickwit tango".
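The matching rule can be sketched in a few lines of plain Rust. This is a conceptual model over an already tokenized document, not how PhrasePrefixQuery is implemented; the function name `matches_phrase_prefix` is made up for this example.

```rust
/// Returns true if `doc_tokens` contains `phrase` as consecutive tokens,
/// where every term but the last must match exactly and the last term
/// only needs to be a prefix of the token at that position.
fn matches_phrase_prefix(doc_tokens: &[&str], phrase: &[&str]) -> bool {
    let (prefix, exact) = match phrase.split_last() {
        Some(parts) => parts,
        // An empty phrase matches nothing in this sketch.
        None => return false,
    };
    // Slide a window of the phrase length over the document tokens.
    doc_tokens.windows(phrase.len()).any(|window| {
        let (window_exact, window_last) = window.split_at(exact.len());
        window_exact == exact && window_last[0].starts_with(*prefix)
    })
}
```

With the phrase `["quickwit", "tan"]`, a document tokenized as `["quickwit", "tantivy"]` or `["quickwit", "tango"]` matches, while `["tantivy", "quickwit"]` does not, because the exact terms must come first and in order.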
More Improvements
- Regex tokenizer: Tokenization based on regex patterns.
- Coerce option: Convert values instead of returning an error during indexing.
- Slop in phrase queries now supports transpositions: "quickwit tantivy"~2 will match "tantivy quickwit".
- Hungarian stop words
- Support tokenizer on JSON fields
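To see why a transposition needs a slop of 2, here is one simple way to model slop. It is a sketch and not tantivy's code: it follows the position-offset idea used in Lucene-style sloppy phrase matching and assumes each phrase term occurs at most once in the document. The function name `required_slop` is hypothetical.

```rust
/// Smallest slop under which `doc_tokens` matches `phrase`, or `None` if the
/// phrase is empty or one of its terms is missing from the document.
/// Assumption: each phrase term occurs at most once in the document,
/// so only the first occurrence is considered.
fn required_slop(doc_tokens: &[&str], phrase: &[&str]) -> Option<usize> {
    if phrase.is_empty() {
        return None;
    }
    let mut min_shift = i64::MAX;
    let mut max_shift = i64::MIN;
    for (query_pos, term) in phrase.iter().enumerate() {
        let doc_pos = doc_tokens.iter().position(|t| t == term)?;
        // How far the term sits from where an exact phrase match would put it.
        let shift = doc_pos as i64 - query_pos as i64;
        min_shift = min_shift.min(shift);
        max_shift = max_shift.max(shift);
    }
    // The spread between the shifts is the number of moves needed.
    Some((max_shift - min_shift) as usize)
}
```

For the phrase `["quickwit", "tantivy"]`, the document `["quickwit", "tantivy"]` needs a slop of 0, while the transposed `["tantivy", "quickwit"]` needs 2 under this model: one term is shifted by +1 and the other by -1.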
Thanks to all contributors
New Contributors
Thanks and welcome to all new contributors!
- @mhlakhani made their first contribution in #1733
- @pinkforest made their first contribution in #1746
- @DawChihLiou made their first contribution in #1737
- @mkleen made their first contribution in #1759
- @lonre made their first contribution in #1803
- @gyk made their first contribution in #1843
- @alexcole made their first contribution in #1855
- @Toasterson made their first contribution in #1944
- @vsop-479 made their first contribution in #1970
- @Tony-X made their first contribution in #1985
- @RTEnzyme made their first contribution in #1999
- @tottoto made their first contribution in #2018
- @nyurik made their first contribution in #2038
- @bazhenov made their first contribution in #1609
- @lavrd made their first contribution in #1422
- @tnxbutno made their first contribution in #2069