Tantivy 0.24

Exciting news from the search engine world — Tantivy 0.24 has just been released!
Brief Recap: What is Tantivy?
Tantivy is a high-performance full-text search engine library written in Rust, inspired by Apache Lucene. It is designed to serve as a foundation for building search engines, including our distributed search engine, Quickwit.
Here's a detailed look at what's new in Tantivy 0.24 (we skipped 0.23).
Aggregations
Tantivy 0.24 introduces several improvements to its aggregation capabilities:
Cardinality Aggregation
Cardinality aggregation counts the number of distinct values in a field. This is useful for cases such as tracking the number of unique users visiting an endpoint or the number of unique IP addresses accessing a server. The feature leverages HyperLogLog++ to provide a space-efficient estimate for large datasets. #2337 #2446
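To give a feel for how a HyperLogLog-style estimator trades exactness for a fixed memory budget, here is a minimal sketch of plain HyperLogLog using only the Rust standard library. It is not Tantivy's HyperLogLog++ code: the `HllSketch` type, the precision constant `P`, and the use of `DefaultHasher` are assumptions made purely for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Number of index bits: 2^12 = 4096 registers, roughly 1.6% standard error.
const P: u32 = 12;

/// Simplified HyperLogLog sketch (illustrative, not Tantivy's implementation).
pub struct HllSketch {
    registers: Vec<u8>,
}

impl HllSketch {
    pub fn new() -> Self {
        HllSketch { registers: vec![0u8; 1usize << P] }
    }

    /// Hash the value, pick a register from the top P bits, and keep the
    /// largest "rank" (leading zeros + 1) seen in the remaining bits.
    pub fn insert<T: Hash>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let hash = hasher.finish();
        let index = (hash >> (64 - P)) as usize;
        let rest = hash << P;
        let rank = (rest.leading_zeros().min(64 - P) + 1) as u8;
        if rank > self.registers[index] {
            self.registers[index] = rank;
        }
    }

    /// Harmonic-mean estimate, with linear counting for small cardinalities.
    pub fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

The sketch always occupies 4096 bytes, however many values are inserted, and inserting the same value repeatedly never changes the estimate.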
Extended Stats Aggregation
The extended stats aggregation adds statistical measures, such as variance and standard deviation, beyond those provided by the regular stats aggregation. #2247
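As a rough illustration of why these measures are cheap to add, variance and standard deviation can be derived from three running sums that are easy to merge across segments. The sketch below assumes population variance and a numerically naive sum-of-squares formula; the `ExtendedStatsSketch` type is hypothetical and does not reflect how Tantivy computes these values internally.

```rust
/// Running sums that are enough to derive extended statistics in one pass.
#[derive(Debug, Default, Clone, Copy)]
pub struct ExtendedStatsSketch {
    pub count: u64,
    pub sum: f64,
    pub sum_of_squares: f64,
}

impl ExtendedStatsSketch {
    /// Fold one value into the running sums.
    pub fn collect(&mut self, value: f64) {
        self.count += 1;
        self.sum += value;
        self.sum_of_squares += value * value;
    }

    /// Combine partial results, e.g. from two segments.
    pub fn merge(&mut self, other: &ExtendedStatsSketch) {
        self.count += other.count;
        self.sum += other.sum;
        self.sum_of_squares += other.sum_of_squares;
    }

    pub fn mean(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }

    /// Population variance: E[x^2] - (E[x])^2, clamped at zero against rounding.
    pub fn variance_population(&self) -> Option<f64> {
        let mean = self.mean()?;
        Some((self.sum_of_squares / self.count as f64 - mean * mean).max(0.0))
    }

    pub fn std_deviation_population(&self) -> Option<f64> {
        self.variance_population().map(f64::sqrt)
    }
}
```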
Improvements
- Introduces Key::I64 and Key::U64 variants to avoid floating-point precision issues. #2468
- Performance improvement for term aggregation with large Top N. #2447 #2426
- Faster term fetching, and memory consumption reduced by 45%, for the top_hits aggregation. #2447 #2426
- Reduced contention on the shared AggregationLimits and renamed it to AggregationLimitsGuard, since it is now a guard itself. #2495
- Added support for counting non-integer values in aggregations. #2547
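The precision issue behind the new integer key variants can be reproduced with plain Rust: an f64 has a 53-bit significand, so distinct large integers can collapse into the same float. The `BucketKey` enum below is a hypothetical stand-in used for illustration, not Tantivy's actual `Key` definition.

```rust
/// Hypothetical bucket key type; it mirrors the idea of dedicated integer
/// variants, not Tantivy's exact enum.
#[derive(Debug, Clone, PartialEq)]
pub enum BucketKey {
    Str(String),
    I64(i64),
    U64(u64),
    F64(f64),
}

/// Old-style key: every number is squeezed through f64.
pub fn key_via_f64(value: u64) -> BucketKey {
    BucketKey::F64(value as f64)
}

/// Integer key: the value is stored exactly.
pub fn key_exact(value: u64) -> BucketKey {
    BucketKey::U64(value)
}

/// True if the value is unchanged after a trip through f64.
pub fn survives_f64_round_trip(value: u64) -> bool {
    (value as f64) as u64 == value
}
```

With 2^53 and 2^53 + 1 as inputs, the float-based keys compare equal, so two different terms would land in the same bucket, while the integer keys stay distinct.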
RegexPhraseQuery
RegexPhraseQuery has been added, which supports phrase queries with regex. E.g. a RegexPhraseQuery "b.* b.* wolf" matches "big bad wolf".
Slop is supported as well: "bi.* wolf"~2 matches "big bad wolf". #2516
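The matching behaviour in these two examples can be mimicked with a toy matcher. This is only a sketch under strong assumptions: it understands literal tokens and a literal prefix followed by ".*" rather than real regular expressions, it treats slop as the total number of positions that may be skipped between consecutive phrase terms, and the function names are invented here. Tantivy's actual implementation and exact slop semantics may differ.

```rust
/// Toy pattern match: a literal, or a literal prefix followed by ".*".
fn token_matches(pattern: &str, token: &str) -> bool {
    match pattern.strip_suffix(".*") {
        Some(prefix) => token.starts_with(prefix),
        None => pattern == token,
    }
}

/// Try to match the remaining patterns after position `last_pos`,
/// spending at most `slop_left` skipped positions in total.
fn matches_from(patterns: &[&str], tokens: &[&str], last_pos: usize, slop_left: usize) -> bool {
    let Some((first, rest)) = patterns.split_first() else {
        return true; // every pattern has been matched
    };
    for (pos, token) in tokens.iter().enumerate().skip(last_pos + 1) {
        let gap = pos - last_pos - 1;
        if gap > slop_left {
            break;
        }
        if token_matches(first, token) && matches_from(rest, tokens, pos, slop_left - gap) {
            return true;
        }
    }
    false
}

/// True if the tokens contain the pattern sequence in order within the slop.
pub fn phrase_matches(patterns: &[&str], tokens: &[&str], slop: usize) -> bool {
    let Some((first, rest)) = patterns.split_first() else {
        return false;
    };
    tokens
        .iter()
        .enumerate()
        .any(|(pos, token)| token_matches(first, token) && matches_from(rest, tokens, pos, slop))
}
```

Every pattern can expand to many concrete terms in a real index, which is why the query needs postings implementations that cope with very large unions.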
This feature comes with some new Postings implementations to handle the complexity of potentially 100,000 terms. They may be useful for other use cases:
- SimpleUnion: A union docset for a list of docsets. It doesn't do any caching and is therefore well suited for datasets with lots of skipping (phrase search, but intersections in general).
- LoadedPostings: Like SegmentPostings, but all docs and positions are loaded in memory. SegmentPostings uses 1840 bytes per instance with its caches, which is equivalent to 460 docids. LoadedPostings is only used to reduce memory consumption.
- BitSetPostingUnion: Creates a Posting that uses a precalculated bitset for docid hits and the docsets for positions.
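The idea of a union without caching can be sketched over plain sorted vectors: the union holds only one cursor per child, so a seek moves each cursor forward directly and leaves no buffered state to refill. The `SimpleUnionSketch` type below is a hypothetical illustration, not Tantivy's SimpleUnion, and it assumes each input list is sorted and free of duplicates.

```rust
/// Union over sorted doc id lists that keeps no buffer, only cursors.
pub struct SimpleUnionSketch {
    lists: Vec<Vec<u32>>,
    cursors: Vec<usize>,
}

impl SimpleUnionSketch {
    pub fn new(lists: Vec<Vec<u32>>) -> Self {
        let cursors = vec![0; lists.len()];
        SimpleUnionSketch { lists, cursors }
    }

    /// Smallest current doc id across all lists, or None when exhausted.
    pub fn doc(&self) -> Option<u32> {
        self.lists
            .iter()
            .zip(&self.cursors)
            .filter_map(|(list, &cursor)| list.get(cursor).copied())
            .min()
    }

    /// Move every list positioned on the current doc one step forward.
    pub fn advance(&mut self) -> Option<u32> {
        let current = self.doc()?;
        for (list, cursor) in self.lists.iter().zip(self.cursors.iter_mut()) {
            if list.get(*cursor) == Some(&current) {
                *cursor += 1;
            }
        }
        self.doc()
    }

    /// Skip every list to the first doc id >= target.
    pub fn seek(&mut self, target: u32) -> Option<u32> {
        for (list, cursor) in self.lists.iter().zip(self.cursors.iter_mut()) {
            *cursor += list[*cursor..].partition_point(|&doc| doc < target);
        }
        self.doc()
    }
}

/// How many u32 doc ids fit into a given number of bytes.
pub fn docids_fitting(bytes: usize) -> usize {
    bytes / std::mem::size_of::<u32>()
}
```

The helper at the end reproduces the arithmetic from the list above: with 4-byte doc ids, 1840 bytes hold 460 of them.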
Range Query Improvements
- Added support for fast field range queries on JSON fields. #2456
- Range queries on string fields are now supported. #2460 #2452
- Faster Intersections: Modify fastfield range query heuristic for faster intersections with e.g. term queries. #2375
Performance and Memory
- Smaller Documents: Replaces TantivyDocument with CompactDoc, a smaller, more efficient structure that reduces memory usage. #2402
- Multivalue Columnar Index: For mostly empty multivalued indices there was a large overhead during creation when iterating all docids (merge case). This is alleviated by placing an optional index in the multivalued index to mark documents that have values. This will slightly increase space and access time. #2439
- Lifted clauses in Logical AST. #2449
- Recycling in the PrefixPhraseScorer reduces memory allocation overhead. #2443
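One way to picture the multivalue index change from the list above is to compare a dense layout, with one offset per document, against a layout that also records which documents have values at all. The two structs below are a guess at the general shape made for illustration, with hypothetical names, and are not Tantivy's on-disk format.

```rust
use std::ops::Range;

/// Dense layout: one start offset per document, plus a final end offset.
pub struct DenseMultiValueIndex {
    pub start_offsets: Vec<u32>,
}

/// Sparse layout (hypothetical): only documents that have values are listed.
pub struct SparseMultiValueIndex {
    pub docs_with_values: Vec<u32>,
    pub start_offsets: Vec<u32>,
}

impl DenseMultiValueIndex {
    /// Keep only the documents whose value range is non-empty.
    pub fn to_sparse(&self) -> SparseMultiValueIndex {
        let mut docs_with_values = Vec::new();
        let mut start_offsets = Vec::new();
        for (doc, pair) in self.start_offsets.windows(2).enumerate() {
            if pair[1] > pair[0] {
                docs_with_values.push(doc as u32);
                start_offsets.push(pair[0]);
            }
        }
        start_offsets.push(self.start_offsets.last().copied().unwrap_or(0));
        SparseMultiValueIndex { docs_with_values, start_offsets }
    }
}

impl SparseMultiValueIndex {
    /// Value range for a document; the binary search is the extra access cost.
    pub fn range(&self, doc: u32) -> Range<u32> {
        match self.docs_with_values.binary_search(&doc) {
            Ok(i) => self.start_offsets[i]..self.start_offsets[i + 1],
            Err(_) => 0..0,
        }
    }
}
```

In this picture a merge can walk `docs_with_values` directly instead of visiting every doc id, at the cost of a lookup on access and the extra list in storage.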
JSON Field Enhancements
JSON fields now support all value types at the root level; previously, only objects were supported. This enables using a JSON field as a mixed-type field. #2383
Query Parser Enhancements
- Improved support for field grouping. Queries like title:(return AND "pink panther") are now supported. #2333
- Fix colon handling in single quotes, e.g. 'abc:def'. #2416
Bug Fixes
- Merge Loop Fix: Resolved a potential endless loop during merges. #2457
- Out-of-Order SSTable Key Bug: Fixed a bug causing out-of-order keys in SSTable. #2445
Breaking API Changes
- Index sorting has been removed. #2434
Additional Improvements
- DateTime in the doc store was truncated to microseconds previously. We removed this truncation, while still keeping backwards compatibility. #2486
- Added columnar access benchmarks to identify and optimize bottlenecks. #2432
- Add columnar format compatibility tests. #2433
- Catch panics during merges. #2582
- Switch from u32 to usize in bitpacker. This enables multivalued columns larger than 4GB, which crashed during merge before. #2581
- Support minimum_number_should_match in BooleanQuery. #2405
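The 4GB limit mentioned for the bitpacker follows from 32-bit arithmetic: a u32 cannot represent an offset of 2^32 or more. The sketch below shows the general failure mode with checked multiplication; the function names are hypothetical, u64 stands in for usize on a 64-bit target, and the exact computation that overflowed inside Tantivy is not described in this post.

```rust
/// End offset in bytes computed with 32-bit arithmetic; None on overflow.
pub fn end_offset_u32(num_values: u32, bytes_per_value: u32) -> Option<u32> {
    num_values.checked_mul(bytes_per_value)
}

/// The same computation with 64-bit arithmetic (usize on 64-bit targets).
pub fn end_offset_u64(num_values: u64, bytes_per_value: u64) -> Option<u64> {
    num_values.checked_mul(bytes_per_value)
}

/// Largest byte offset a u32 can address, one byte short of 4 GiB.
pub const MAX_U32_OFFSET: u64 = u32::MAX as u64;
```

For 600 million 8-byte values, the 32-bit version reports an overflow while the 64-bit version returns 4,800,000,000 bytes.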
A Big Thank You to Our Contributors!
This release would not have been possible without the hard work and dedication of our contributors:
New Contributors
A special welcome and thanks to our first-time contributors for this release!
- @hamirmahal made their first contribution in #2412
- @wsxiaoys made their first contribution in #2417
- @philippemnoel made their first contribution in #2437
- @raphaelcoeffic made their first contribution in #2337
- @LebranceBW made their first contribution in #2405
- @robcaulk made their first contribution in #2450
- @gezihuzi made their first contribution in #2474
- @coplat made their first contribution in #2483
- @triandco made their first contribution in #2499
- @b41sh made their first contribution in #2512
- @marvin-j97 made their first contribution in #2524
- @jotare made their first contribution in #2528
Check out the CHANGELOG for a detailed list of all the changes. We’re excited to see what you build with Tantivy 0.24!