
Tantivy 0.22

Tantivy 0.22 is out.

Exciting news from the search engine world — Tantivy 0.22 has just been released!

Brief Recap: What is Tantivy?

Tantivy is a high-performance full-text search engine library written in Rust, inspired by Apache Lucene. It is designed to serve as a foundation for building search engines, including our distributed search engine, Quickwit.

Changes

Tantivy 0.22 is backward-compatible with indices created with Tantivy 0.21.

This release comes packed with significant performance improvements, more aggregations, and a new feature that allows custom data types to be indexed without copying.

Aggregations

Tantivy 0.22 further expands its Elasticsearch-compatible aggregation capabilities with the new top_hits aggregation and additional type support for term aggregations.

Top Hits Aggregation

The top hits aggregation allows users to retrieve the top documents of a bucket aggregation. For now it only supports the docvalue_fields parameter to fetch additional data from a fast field. #2198

Term Aggregations

Tantivy 0.22 completes type support for term aggregations, enabling users to aggregate on boolean values, IP addresses, and dates. #2318 #2319 #2172

Deserializing Floating Point Numbers

Aggregations can now deserialize floating point numbers from strings. #2311

Document as a Trait

In previous versions, users had to convert their data into Tantivy's Document type for indexing. This conversion required unnecessary data copies, which hurt indexing performance.

With Document as a trait, users can now index their custom data types directly by implementing the trait. Zero-copy functionality is achieved by using the long-awaited Rust feature, Generic Associated Types. If you want to implement the Document trait for your custom data type, good starting points are the Tantivy docs and the implementation of TantivyDocument in the GitHub repo.

The previous Document struct is renamed to TantivyDocument, which implements the Document trait and can be used like before.
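To illustrate how Generic Associated Types make zero-copy access possible, here is a minimal, self-contained sketch. The names Doc, ValueRef, Article and total_text_bytes are hypothetical and heavily simplified. This is not Tantivy's actual Document trait, only the general shape of the idea: the associated iterator type carries a lifetime, so it can hand out values borrowed from the document.

```rust
// Hypothetical, simplified stand-in for a document trait. NOT Tantivy's API.

/// A borrowed field value; the lifetime ties it to the document it came from.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ValueRef<'a> {
    Str(&'a str),
    U64(u64),
}

pub trait Doc {
    /// Generic associated type: the iterator is allowed to borrow from `self`.
    type Fields<'a>: Iterator<Item = (&'a str, ValueRef<'a>)>
    where
        Self: 'a;

    fn fields<'a>(&'a self) -> Self::Fields<'a>;
}

/// A user-defined type that is handed to the indexer as-is.
pub struct Article {
    pub title: String,
    pub views: u64,
}

impl Doc for Article {
    type Fields<'a> = std::array::IntoIter<(&'a str, ValueRef<'a>), 2>
    where
        Self: 'a;

    fn fields<'a>(&'a self) -> Self::Fields<'a> {
        // The title is exposed as a `&str` borrowed from `self`; no String is cloned.
        IntoIterator::into_iter([
            ("title", ValueRef::Str(self.title.as_str())),
            ("views", ValueRef::U64(self.views)),
        ])
    }
}

/// Stand-in for an indexer: reads borrowed values without copying the document.
pub fn total_text_bytes<D: Doc>(doc: &D) -> usize {
    doc.fields()
        .map(|(_, value)| match value {
            ValueRef::Str(s) => s.len(),
            ValueRef::U64(_) => 0,
        })
        .sum()
}
```

Without a lifetime parameter on the associated type, the iterator could not return references into the document, and an implementation would have to fall back to owned (copied) values.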

Query Parser Improvements

The strict query parser has been updated to allow some combinations of occur and boolean operators, and now supports queries like "aaa ccc AND b" and "a OR -b". #2323

Support for exists queries has been integrated into the query parser and can be used with my_field:*. #2170

Performance Improvements

Tantivy 0.22 comes packed with significant performance optimizations for indexing and querying.

Query Performance

The TopDocs collector received a performance boost by using our TopNComputer instead of a binary heap. Average query performance improved by more than 15% according to our benchmark. Check out the dedicated blog post for a deep dive! #2186
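For readers curious about how a top-N collector can avoid a binary heap, the following is a minimal sketch of one such approach. It assumes a buffer of twice the requested size that is cut back to N elements with a selection step whenever it fills up. The TopN type and its methods are hypothetical and simplified for illustration; they are not the actual TopNComputer code.

```rust
/// Simplified top-N collector (illustrative only, not Tantivy's implementation).
/// Entries are (score, doc id) pairs, compared as tuples; doc ids are assumed unique.
pub struct TopN {
    n: usize,
    buffer: Vec<(u64, u32)>,
    /// The n-th best entry at the time of the last truncation, if any.
    threshold: Option<(u64, u32)>,
}

impl TopN {
    pub fn new(n: usize) -> Self {
        assert!(n > 0, "n must be at least 1");
        TopN { n, buffer: Vec::with_capacity(2 * n), threshold: None }
    }

    pub fn push(&mut self, score: u64, doc: u32) {
        // Cheap rejection: n entries at least this good have already been seen.
        if let Some(threshold) = self.threshold {
            if (score, doc) <= threshold {
                return;
            }
        }
        // Only when the buffer is full do we pay for a selection pass.
        if self.buffer.len() == 2 * self.n {
            self.truncate_to_n();
        }
        self.buffer.push((score, doc));
    }

    fn truncate_to_n(&mut self) {
        let n = self.n;
        // Partial partition in descending order: the n largest entries end up
        // in the first n slots, with the n-th largest at index n - 1.
        self.buffer.select_nth_unstable_by(n - 1, |a, b| b.cmp(a));
        self.buffer.truncate(n);
        self.threshold = Some(self.buffer[n - 1]);
    }

    /// Returns at most n entries, best first.
    pub fn into_sorted_vec(mut self) -> Vec<(u64, u32)> {
        self.buffer.sort_unstable_by(|a, b| b.cmp(a));
        self.buffer.truncate(self.n);
        self.buffer
    }
}
```

Compared with a heap, most pushes here are a comparison and an append, and the selection pass runs only once every N accepted entries.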

Aggregations are faster thanks to a fast path for full columns in fetch_block. This increases performance for certain aggregation queries by 5%. #2328

SSTable loading is now free thanks to using an fst for the SSTable index. In previous Tantivy versions, we had to scan and load the SSTable index into memory, which could take up a significant portion of query time in some cases. SSTables are used as the term dictionary for fast fields on string fields. We added a compatibility layer to continue supporting the old format. #2268

Indexing Performance

Indexing of fast fields has been accelerated by using our specialized term hashmap instead of the default Rust hashmap. Overall indexing throughput increased by ~40% in a benchmark using the GitHub dataset. #2243

We now use significantly less memory while indexing, by using docid deltas instead of direct docids. While indexing 1.1GB of the HDFS dataset, memory usage was reduced from 760MB to 590MB. #2249
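The effect of storing deltas instead of direct doc ids can be illustrated with a small sketch. It assumes a variable-length integer encoding, which is our choice for the illustration and is not stated above; all function names are hypothetical. Small gaps between sorted doc ids fit into fewer bytes than the ids themselves.

```rust
/// Append `value` as a variable-length integer: 7 payload bits per byte,
/// with the high bit set on every byte except the last.
fn write_vint(mut value: u32, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7F) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Read one variable-length integer starting at `*pos`, advancing `*pos`.
fn read_vint(bytes: &[u8], pos: &mut usize) -> u32 {
    let mut value = 0u32;
    let mut shift = 0;
    loop {
        let byte = bytes[*pos];
        *pos += 1;
        value |= ((byte & 0x7F) as u32) << shift;
        if byte & 0x80 == 0 {
            return value;
        }
        shift += 7;
    }
}

/// Baseline: every doc id is stored as-is.
pub fn encode_direct(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    for &doc in doc_ids {
        write_vint(doc, &mut out);
    }
    out
}

/// Store the difference to the previous doc id. Assumes ascending doc ids.
pub fn encode_deltas(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut previous = 0u32;
    for &doc in doc_ids {
        write_vint(doc - previous, &mut out);
        previous = doc;
    }
    out
}

/// Inverse of `encode_deltas`: a running sum restores the doc ids.
pub fn decode_deltas(bytes: &[u8]) -> Vec<u32> {
    let mut doc_ids = Vec::new();
    let mut pos = 0;
    let mut previous = 0u32;
    while pos < bytes.len() {
        previous += read_vint(bytes, &mut pos);
        doc_ids.push(previous);
    }
    doc_ids
}
```

For the doc ids 1000000, 1000003, 1000010 and 1000500, the direct encoding takes 12 bytes in this sketch while the delta encoding takes 7.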

Smaller Indices

Parts of the posting list are now encoded as the value minus one instead of the value itself, which yields slightly smaller inverted indices. We added some compatibility code to support the old format. #2185
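Why subtracting one can save space is easiest to see with bit-packing. The sketch below is an illustration under assumptions that are not spelled out above: that the affected values are gaps between strictly increasing doc ids, and that a block is bit-packed using the width of its largest value. All function names are hypothetical.

```rust
/// Number of bits needed to store `value` (0 needs no bits at all).
fn num_bits(value: u32) -> u32 {
    32 - value.leading_zeros()
}

/// Bit width a bit-packed block needs: enough for its largest value.
pub fn block_bit_width(values: &[u32]) -> u32 {
    values.iter().map(|&v| num_bits(v)).max().unwrap_or(0)
}

/// Gaps between consecutive doc ids, stored directly.
pub fn gaps(doc_ids: &[u32]) -> Vec<u32> {
    doc_ids.windows(2).map(|w| w[1] - w[0]).collect()
}

/// The same gaps stored minus one.
/// Assumes strictly increasing doc ids, so every gap is at least 1.
pub fn gaps_minus_one(doc_ids: &[u32]) -> Vec<u32> {
    doc_ids.windows(2).map(|w| w[1] - w[0] - 1).collect()
}

/// Rebuild the doc ids from the first id and the minus-one gaps.
pub fn restore(first: u32, stored: &[u32]) -> Vec<u32> {
    let mut doc_ids = vec![first];
    let mut current = first;
    for &gap in stored {
        current += gap + 1;
        doc_ids.push(current);
    }
    doc_ids
}
```

For a run of consecutive doc ids such as 10, 11, 12, 13, 14, the direct gaps are all 1 and need one bit each, while the minus-one gaps are all 0 and need no bits. For gaps of 4, the width drops from three bits to two.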

Bug Fixes

A number of critical bugs have been fixed in this release, including:

  • Fixed null byte handling in JSON paths, where null bytes in JSON keys caused a panic during indexing. #2345
  • Avoided creating single-document indices by increasing the minimum memory requirement for indexing to 15MB. #2176
  • Fixed an error in time precision conversion when creating empty histograms, which caused the aggregation to try to allocate too many buckets. #2183
  • Fixed two merge panics when merging JSON fields with mixed types and indexed with positions. #2284 #2253
  • Multiple bug fixes for fast field range queries:
    • Fixed a bug causing get_docids_for_value_range to panic. #2295
    • Fixed an end condition check causing too few results. Only fields with fewer than 1 value per doc are affected. #2226
    • Fixed a panic on some queries caused by out-of-bounds ranges. #2174

Breaking API Changes

  • Rename ReloadPolicy OnCommit to OnCommitWithDelay #2235
  • Move exports from the root into modules #2220
  • Accept String field name instead of Field in FilterCollector. String will be able to address nested JSON fields with the . syntax. #2196
  • Remove deprecated IntOptions and DatePrecision. Use NumericOptions and DateTimePrecision instead. #2353

Planned Removal of Index Sorting in 0.23

We plan to remove index sorting in 0.23. If you have a use case that requires index sorting, please let us know in this issue: https://github.com/quickwit-oss/tantivy/issues/2352

A Big Thank You to Our Contributors!

This release would not have been possible without the hard work and dedication of our contributors.

New Contributors

A special welcome and thanks to our first-time contributors for this release!

  • @imotov made their first contribution in #2160
  • @PingXia-at made their first contribution in #2173
  • @giovannicuccu made their first contribution in #2235
  • @BlackHoleFox made their first contribution in #2265
  • @ditsuke made their first contribution in #2282
  • @MochiXu made their first contribution in #2312

Check out the CHANGELOG for a detailed list of all the changes. We’re excited to see what you build with Tantivy 0.22!