Exciting news from the search engine world — Tantivy 0.24 has just been released!

Brief Recap: What is Tantivy?

Tantivy is a high-performance full-text search engine library written in Rust, inspired by Apache Lucene. It is designed to serve as a foundation for building search engines, including our distributed search engine, Quickwit.

Here's a detailed look at what's new in Tantivy 0.24 (we skipped 0.23).

Aggregations

Tantivy 0.24 introduces several improvements to its aggregation capabilities:

Cardinality Aggregation

Cardinality aggregation counts the number of distinct values in a field. This is useful for cases such as tracking the number of unique users visiting an endpoint or the number of unique IP addresses accessing a server. The feature uses HyperLogLog++ to provide a space-efficient estimate for large datasets. #2337 #2446
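To give a feel for why such an estimate is space-efficient, here is a minimal sketch of plain HyperLogLog using only the standard library. It is not Tantivy's implementation: it omits the bias correction and sparse representation that HyperLogLog++ adds, and the names HllSketch, insert, and estimate are made up for this illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Number of index bits: 2^12 = 4096 one-byte registers (about 4 KiB in total).
const P: u32 = 12;
const M: usize = 1 << P;

/// Illustrative HyperLogLog sketch (hypothetical type, not part of Tantivy).
pub struct HllSketch {
    registers: Vec<u8>,
}

impl HllSketch {
    pub fn new() -> Self {
        HllSketch { registers: vec![0; M] }
    }

    /// Records one value. Inserting a duplicate leaves the sketch unchanged.
    pub fn insert<T: Hash>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let hash = hasher.finish();
        // The top P bits select a register.
        let idx = (hash >> (64 - P)) as usize;
        // Rank: position of the first 1-bit among the remaining 64 - P bits.
        let rest = hash << P;
        let rank = (rest.leading_zeros().min(64 - P) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    /// Estimates the number of distinct values inserted so far.
    pub fn estimate(&self) -> f64 {
        let m = M as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self
            .registers
            .iter()
            .map(|&r| 2f64.powi(-(r as i32)))
            .sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        // Small-range correction: fall back to linear counting.
        if raw <= 2.5 * m && zeros > 0 {
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

With 4096 registers the sketch stays at a fixed size no matter how many values are inserted, and the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%.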

Extended Stats Aggregation

Extended stats aggregation adds statistical measures such as variance and standard deviation on top of those returned by the regular stats aggregation. #2247
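As a rough illustration of the extra measures, the sketch below computes population variance and standard deviation next to the usual count, sum, min, max, and average. The ExtendedStats struct and its field names are assumptions made for this example and may not match the fields Tantivy returns.

```rust
/// Hypothetical result type; field names are illustrative only.
#[derive(Debug, PartialEq)]
pub struct ExtendedStats {
    pub count: u64,
    pub sum: f64,
    pub min: f64,
    pub max: f64,
    pub avg: f64,
    pub sum_of_squares: f64,
    /// Population variance: E[x^2] - (E[x])^2.
    pub variance: f64,
    /// Population standard deviation: sqrt(variance).
    pub std_deviation: f64,
}

/// Computes the statistics in a single pass. Returns None for empty input.
pub fn extended_stats(values: &[f64]) -> Option<ExtendedStats> {
    if values.is_empty() {
        return None;
    }
    let count = values.len() as u64;
    let n = count as f64;
    let mut sum = 0.0;
    let mut sum_of_squares = 0.0;
    let mut min = f64::INFINITY;
    let mut max = f64::NEG_INFINITY;
    for &v in values {
        sum += v;
        sum_of_squares += v * v;
        min = min.min(v);
        max = max.max(v);
    }
    let avg = sum / n;
    // Clamp at zero to absorb tiny negative results from rounding.
    let variance = (sum_of_squares / n - avg * avg).max(0.0);
    Some(ExtendedStats {
        count,
        sum,
        min,
        max,
        avg,
        sum_of_squares,
        variance,
        std_deviation: variance.sqrt(),
    })
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 this gives an average of 5, a variance of 4, and a standard deviation of 2.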

Improvements

Introduces Key::I64 and Key::U64 variants to avoid floating-point precision issues. #2468
Performance improvement for term aggregation with large Top N. #2447 #2426
Faster term fetching and reduced memory consumption for top_hits aggregation by 45%. #2447 #2426
Reduced contention on the shared AggregationLimits, which has been renamed to AggregationLimitsGuard because it now acts as a guard itself. #2495
Added support for counting non-integer values in aggregations. #2547
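The precision issue behind the new integer key variants is easy to reproduce: an f64 cannot represent every 64-bit integer above 2^53, so two distinct integer keys can collapse into one. The BucketKey enum below is a hypothetical stand-in used only for this illustration, not Tantivy's Key type.

```rust
/// Hypothetical bucket key, loosely modelled on the idea of having
/// dedicated integer variants next to a float variant.
#[derive(Debug, Clone, PartialEq)]
pub enum BucketKey {
    Str(String),
    F64(f64),
    I64(i64),
    U64(u64),
}

/// Round-trips a u64 through f64, as a float-only key would force.
pub fn via_f64(value: u64) -> u64 {
    value as f64 as u64
}

/// Returns two neighbouring integers just above the exactly
/// representable range of f64 (2^53 and 2^53 + 1).
pub fn neighbours_above_2_pow_53() -> (u64, u64) {
    let base = 1u64 << 53;
    (base, base + 1)
}
```

Both neighbours map to the same f64, so as float keys they would land in the same bucket, while BucketKey::U64 keeps them apart.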

RegexPhraseQuery

RegexPhraseQuery has been added, which supports phrase queries whose terms are regular expressions. For example, the RegexPhraseQuery "b.* b.* wolf" matches "big bad wolf". Slop is supported as well: "bi.* wolf"~2 matches "big bad wolf". #2516
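The matching idea can be sketched on a tokenized string. This is a deliberate simplification and not how Tantivy evaluates the query: patterns are limited to literals or a "prefix.*" form, slop is modelled as the total number of tokens that may be skipped between consecutive phrase terms, and the function names are invented for the example. A real implementation would expand each regex against the term dictionary and work on posting lists.

```rust
/// Matches a token against a pattern that is either a literal
/// or of the form `prefix.*` (the only regex shape supported here).
fn token_matches(pattern: &str, token: &str) -> bool {
    match pattern.strip_suffix(".*") {
        Some(prefix) => token.starts_with(prefix),
        None => pattern == token,
    }
}

/// Tries to match the remaining patterns starting at token index `from`,
/// skipping at most `slop` tokens in total.
fn rest_matches(patterns: &[&str], tokens: &[&str], from: usize, slop: usize) -> bool {
    let (first, rest) = match patterns.split_first() {
        Some(parts) => parts,
        None => return true, // every pattern has been matched
    };
    (0..=slop).any(|skipped| {
        let pos = from + skipped;
        pos < tokens.len()
            && token_matches(first, tokens[pos])
            && rest_matches(rest, tokens, pos + 1, slop - skipped)
    })
}

/// Returns true if the patterns occur in order in `text`,
/// with at most `slop` extra tokens between them.
pub fn phrase_matches(patterns: &[&str], text: &str, slop: usize) -> bool {
    if patterns.is_empty() {
        return false;
    }
    let tokens: Vec<&str> = text.split_whitespace().collect();
    (0..tokens.len()).any(|start| {
        token_matches(patterns[0], tokens[start])
            && rest_matches(&patterns[1..], &tokens, start + 1, slop)
    })
}
```

Under these assumptions, the patterns b.*, b.*, wolf match "big bad wolf" with no slop, while bi.*, wolf match it only when the slop allows "bad" to be skipped.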

This feature comes with some new Postings implementations to handle the complexity of potentially 100,000 terms. They may be useful for other use cases:

SimpleUnion: A union docset for a list of docsets. It doesn't do any caching and is therefore well suited for datasets with lots of skipping (phrase search, but intersections in general).
LoadedPostings: Like SegmentPostings, but all docs and positions are loaded in memory. SegmentPostings uses 1840 bytes per instance with its caches, which is equivalent to 460 docids. LoadedPostings is only used to reduce memory consumption.
BitSetPostingUnion: Creates a Posting that uses a precalculated bitset for docid hits and the docsets for positions.
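To illustrate what a cache-free union looks like, here is a small sketch over in-memory docid lists. It is not Tantivy's SimpleUnion: the VecDocSet and SimpleUnionSketch types, and the TERMINATED value used here, are assumptions for this example, with doc, advance, and seek only loosely following the usual docset interface.

```rust
/// Sentinel returned once a docset is exhausted (value chosen for this sketch).
pub const TERMINATED: u32 = u32::MAX;

/// A sorted, in-memory docid list with a cursor, standing in for a posting list.
pub struct VecDocSet {
    docs: Vec<u32>,
    cursor: usize,
}

impl VecDocSet {
    /// `docs` must be sorted in ascending order.
    pub fn new(docs: Vec<u32>) -> Self {
        VecDocSet { docs, cursor: 0 }
    }

    pub fn doc(&self) -> u32 {
        self.docs.get(self.cursor).copied().unwrap_or(TERMINATED)
    }

    pub fn advance(&mut self) -> u32 {
        if self.cursor < self.docs.len() {
            self.cursor += 1;
        }
        self.doc()
    }

    /// Moves to the first docid >= target using binary search.
    pub fn seek(&mut self, target: u32) -> u32 {
        self.cursor += self.docs[self.cursor..].partition_point(|&d| d < target);
        self.doc()
    }
}

/// Union of several docsets that keeps no buffer of upcoming docids:
/// the current doc is simply the minimum over the children.
pub struct SimpleUnionSketch {
    docsets: Vec<VecDocSet>,
    current: u32,
}

impl SimpleUnionSketch {
    pub fn new(docsets: Vec<VecDocSet>) -> Self {
        let current = Self::min_doc(&docsets);
        SimpleUnionSketch { docsets, current }
    }

    fn min_doc(docsets: &[VecDocSet]) -> u32 {
        docsets.iter().map(|d| d.doc()).min().unwrap_or(TERMINATED)
    }

    pub fn doc(&self) -> u32 {
        self.current
    }

    pub fn advance(&mut self) -> u32 {
        let current = self.current;
        if current == TERMINATED {
            return TERMINATED;
        }
        // Step every child that sits on the current doc.
        for docset in self.docsets.iter_mut() {
            if docset.doc() == current {
                docset.advance();
            }
        }
        self.current = Self::min_doc(&self.docsets);
        self.current
    }

    /// Seeking only touches children that are behind the target,
    /// so long skips stay cheap.
    pub fn seek(&mut self, target: u32) -> u32 {
        for docset in self.docsets.iter_mut() {
            if docset.doc() < target {
                docset.seek(target);
            }
        }
        self.current = Self::min_doc(&self.docsets);
        self.current
    }
}
```

Because nothing is buffered ahead of the current doc, a seek discards no precomputed work, which is the property that makes this shape attractive when an intersection skips frequently.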

Range Query Improvements

Added support for fast field range queries on JSON fields. #2456
Range queries on string fields are now supported. #2460 #2452
Faster Intersections: Modify fastfield range query heuristic for faster intersections with e.g. term queries. #2375
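One thing to keep in mind with range queries on strings is that bounds compare lexicographically rather than numerically. The sketch below shows those semantics on a sorted set of terms, assuming byte-wise ordering of terms. It uses only the standard library, and terms_in_range is a made-up helper, not a Tantivy API.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Returns the terms in the half-open range [lower, upper)
/// from a sorted term set, using lexicographic order.
pub fn terms_in_range<'a>(
    terms: &'a BTreeSet<String>,
    lower: &str,
    upper: &str,
) -> Vec<&'a str> {
    // BTreeSet::range panics on inverted bounds, so handle that case first.
    if lower >= upper {
        return Vec::new();
    }
    terms
        .range::<str, _>((Bound::Included(lower), Bound::Excluded(upper)))
        .map(|term| term.as_str())
        .collect()
}
```

Under lexicographic order the string "10" sorts before "9", so numeric-looking values stored as strings will not behave like numbers in a range.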

Performance and Memory

Smaller Documents: Replaces TantivyDocument with CompactDoc, a smaller, more efficient structure that reduces memory usage. #2402
Multivalue Columnar Index: For mostly empty multivalued indices there was a large overhead during creation when iterating all docids (merge case). This is alleviated by placing an optional index in the multivalued index to mark documents that have values. This will slightly increase space and access time. #2439
Lifted clauses in Logical AST. #2449
Recycling in the PrefixPhraseScorer reduces memory allocation overhead. #2443

JSON Field Enhancements

JSON fields now support all values at the root level; previously only objects were supported. This enables usage of JSON as a mixed type field. #2383

Query Parser Enhancements

Improved support for field grouping. Queries like title:(return AND "pink panther") are now supported. #2333
Fixed colon handling in single quotes, e.g. 'abc:def'. #2416
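Conceptually, field grouping applies one field to every clause inside the parentheses. The sketch below spells out that expansion as a string rewrite. It assumes the grouped form is equivalent to repeating the field on each clause, and the Leaf type and expand_field_group function are invented for illustration, not part of Tantivy's query parser.

```rust
/// A leaf inside a field group: a bare term or a quoted phrase.
pub enum Leaf {
    Term(String),
    Phrase(String),
}

/// Expands `field:(a OP "b c")` into `field:a OP field:"b c"`.
pub fn expand_field_group(field: &str, leaves: &[Leaf], operator: &str) -> String {
    let separator = format!(" {} ", operator);
    leaves
        .iter()
        .map(|leaf| match leaf {
            Leaf::Term(term) => format!("{}:{}", field, term),
            Leaf::Phrase(phrase) => format!("{}:\"{}\"", field, phrase),
        })
        .collect::<Vec<String>>()
        .join(&separator)
}
```

Applied to the example above, the group expands to title:return AND title:"pink panther".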

Bug Fixes

Merge Loop Fix: Resolved a potential endless loop during merges. #2457
Out-of-Order SSTable Key Bug: Fixed a bug causing out-of-order keys in SSTable. #2445

Breaking API Changes

Index sorting has been removed. #2434

Additional Improvements

DateTime in the doc store was previously truncated to microseconds. We removed this truncation while keeping backwards compatibility. #2486
Added columnar access benchmarks to identify and optimize bottlenecks. #2432
Added columnar format compatibility tests. #2433
Panics during merges are now caught. #2582
Switched from u32 to usize in the bitpacker. This enables multivalued columns larger than 4GB, which previously crashed during merge. #2581
Added support for minimum_number_should_match in BooleanQuery. #2405
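The idea behind minimum_number_should_match can be sketched as a threshold on the number of matching optional clauses. The function below models only that rule for a single document, with each clause reduced to a boolean. It is a simplified illustration with a made-up name, not the BooleanQuery API.

```rust
/// Decides whether one document matches, given per-clause results.
/// `must` holds the results of required clauses, `should` the results
/// of optional clauses, and at least `minimum_number_should_match`
/// of the optional clauses have to match.
pub fn document_matches(
    must: &[bool],
    should: &[bool],
    minimum_number_should_match: usize,
) -> bool {
    let all_required_match = must.iter().all(|&matched| matched);
    let should_hits = should.iter().filter(|&&matched| matched).count();
    all_required_match && should_hits >= minimum_number_should_match
}
```

With three optional clauses and a minimum of two, a document matching only one of them is rejected even though a plain disjunction would accept it.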

A Big Thank You to Our Contributors!

This release would not have been possible without the hard work and dedication of our contributors:

New Contributors

A special welcome and thanks to our first-time contributors for this release!

@hamirmahal made their first contribution in #2412
@wsxiaoys made their first contribution in #2417
@philippemnoel made their first contribution in #2437
@raphaelcoeffic made their first contribution in #2337
@LebranceBW made their first contribution in #2405
@robcaulk made their first contribution in #2450
@gezihuzi made their first contribution in #2474
@coplat made their first contribution in #2483
@triandco made their first contribution in #2499
@b41sh made their first contribution in #2512
@marvin-j97 made their first contribution in #2524
@jotare made their first contribution in #2528

Check out the CHANGELOG for a detailed list of all the changes. We’re excited to see what you build with Tantivy 0.24!