Tantivy 0.22
Exciting news from the search engine world — Tantivy 0.22 has just been released!
Brief Recap: What is Tantivy?
Tantivy is a high-performance full-text search engine library written in Rust, inspired by Apache Lucene. It is designed to serve as a foundation for building search engines, including our distributed search engine, Quickwit.
Changes
Tantivy 0.22 is backward-compatible with indices created with Tantivy 0.21.
This release comes packed with significant performance improvements, more aggregations, and a new feature that allows custom data types to be indexed without copying.
Aggregations
Tantivy 0.22 further expands its Elasticsearch-compatible aggregation capabilities with the new top_hits aggregation and additional support for term aggregations.
Top Hits Aggregation
The top hits aggregation allows users to retrieve the top documents of a bucket aggregation.
For now, it supports only the docvalue_fields parameter, which fetches additional data from a fast field. #2198
Term Aggregations
Tantivy 0.22 completes type support for term aggregations, enabling users to aggregate on boolean values, IP addresses, and dates. #2318 #2319 #2172
Deserializing Floating Point Numbers
Aggregations can now deserialize floating point numbers from strings. #2311
Document as a Trait
In previous versions, users had to convert their data into Tantivy's Document type for indexing.
This conversion required unnecessary data copies, which hurt indexing performance.
With Document as a trait, users can now index their custom data types directly by implementing the trait.
Zero-copy functionality is achieved by using the long-awaited Rust feature, Generic Associated Types.
If you want to implement the Document trait for your custom data type, a good starting point is the Tantivy docs or the implementation of TantivyDocument in the GitHub repo.
The previous Document struct has been renamed to TantivyDocument, which implements the Document trait and can be used as before.
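To make the zero-copy idea concrete, here is a minimal, std-only sketch of how a Generic Associated Type lets a trait hand out field values that borrow from the implementing type. The names BorrowedDoc, ValueRef, LogLine, and total_text_bytes are hypothetical and chosen for illustration only. Tantivy's actual Document trait has a different, richer interface, so refer to the Tantivy docs for the real signatures.

```rust
/// Hypothetical borrowed field value; string data is referenced, never copied.
#[derive(Debug, PartialEq)]
enum ValueRef<'a> {
    Str(&'a str),
    U64(u64),
}

/// Hypothetical trait for illustration (NOT Tantivy's real `Document` trait).
trait BorrowedDoc {
    /// Generic Associated Type: the iterator may borrow from `self` for `'a`.
    type Fields<'a>: Iterator<Item = (&'static str, ValueRef<'a>)>
    where
        Self: 'a;

    /// Yields `(field name, borrowed value)` pairs.
    fn fields(&self) -> Self::Fields<'_>;
}

/// A custom application type that we want to index as-is.
struct LogLine {
    message: String,
    status: u64,
}

impl BorrowedDoc for LogLine {
    type Fields<'a> = std::array::IntoIter<(&'static str, ValueRef<'a>), 2>
    where
        Self: 'a;

    fn fields(&self) -> Self::Fields<'_> {
        // The `&str` points into `self.message`; no `String` is cloned.
        IntoIterator::into_iter([
            ("message", ValueRef::Str(self.message.as_str())),
            ("status", ValueRef::U64(self.status)),
        ])
    }
}

/// Stand-in for an indexer: consumes any document through the trait.
fn total_text_bytes<D: BorrowedDoc>(doc: &D) -> usize {
    doc.fields()
        .map(|(_name, value)| match value {
            ValueRef::Str(text) => text.len(),
            ValueRef::U64(_) => 0,
        })
        .sum()
}
```

Code written against such a trait can read message straight out of a LogLine without first cloning it into an intermediate document type.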
Query Parser Improvements
The strict query parser has been updated to allow some combinations of occur and boolean operators, and now supports queries like "aaa ccc AND b" and "a OR -b". #2323
Support for exists queries has been integrated into the query parser and can be used with my_field:*. #2170
Performance Improvements
Tantivy 0.22 comes packed with significant performance optimizations for indexing and querying.
Query Performance
The TopDocs collector received a performance boost through the use of our TopNComputer instead of a binary heap.
Average query performance improved by more than 15% according to our benchmark. Check out the dedicated blog post for a deep dive! #2186
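The idea of trading a binary heap for a buffer that is pruned in bulk can be sketched as follows. This is a simplified illustration and not Tantivy's TopNComputer itself: the TopN type, its plain u64 scores, and the choice of a 2 * n buffer pruned with select_nth_unstable_by are assumptions made for the example. See #2186 for the real implementation.

```rust
/// Simplified top-n collector: buffers up to 2 * n scores and prunes in bulk.
struct TopN {
    n: usize,
    buffer: Vec<u64>,
    /// Scores at or below this value can no longer change the top n.
    threshold: Option<u64>,
}

impl TopN {
    fn new(n: usize) -> Self {
        assert!(n > 0, "n must be positive");
        TopN { n, buffer: Vec::with_capacity(2 * n), threshold: None }
    }

    fn push(&mut self, score: u64) {
        // Cheap rejection: one comparison for scores that cannot make the cut.
        if let Some(threshold) = self.threshold {
            if score <= threshold {
                return;
            }
        }
        if self.buffer.len() == 2 * self.n {
            self.prune();
        }
        self.buffer.push(score);
    }

    /// Keeps the n largest buffered scores using a linear-time selection.
    fn prune(&mut self) {
        // Descending comparator: afterwards index n - 1 holds the n-th largest
        // score and everything before it is greater than or equal to it.
        self.buffer.select_nth_unstable_by(self.n - 1, |a, b| b.cmp(a));
        self.threshold = Some(self.buffer[self.n - 1]);
        self.buffer.truncate(self.n);
    }

    /// Returns the top n scores, highest first.
    fn into_sorted_vec(mut self) -> Vec<u64> {
        self.buffer.sort_unstable_by(|a, b| b.cmp(a));
        self.buffer.truncate(self.n);
        self.buffer
    }
}
```

A heap pays a logarithmic cost for every document that enters it. This approach appends in constant time, rejects low scores with a single comparison, and pays for a linear-time selection only once every n accepted documents.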
Aggregations are faster thanks to a fast path for full columns in fetch_block.
This increases performance for certain aggregation queries by 5%. #2328
SSTable loading is now free because the sstable index uses an fst.
In previous Tantivy versions, we had to scan and load the sstable index into memory, which could account for a significant portion of query time in some cases.
SSTables are used as the term dictionary for fast fields on string fields. We added a compatibility layer to continue supporting the old format. #2268
Indexing Performance
Indexing of fast fields has been accelerated by using our specialized term hashmap instead of the default Rust hashmap.
Overall indexing throughput increased by ~40% in a benchmark using the GitHub dataset. #2243
We now use significantly less memory while indexing by storing docid deltas instead of direct docids. While indexing 1.1GB of the HDFS dataset, memory usage dropped from 760MB to 590MB. #2249
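To illustrate why deltas take less room than absolute doc ids, here is a small sketch that stores a sorted doc id list with a variable-length integer encoding. The function names and the 7-bits-per-byte encoding are assumptions for the example, not a description of Tantivy's internal in-memory format.

```rust
/// Appends `value` as a variable-length integer (7 payload bits per byte).
fn write_vint(mut value: u32, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push(((value as u8) & 0x7f) | 0x80); // high bit: more bytes follow
        value >>= 7;
    }
    out.push(value as u8);
}

/// Stores every doc id as-is.
fn encode_direct(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    for &doc in doc_ids {
        write_vint(doc, &mut out);
    }
    out
}

/// Stores the gap to the previous doc id; `doc_ids` must be sorted ascending.
fn encode_deltas(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut previous = 0u32;
    for &doc in doc_ids {
        write_vint(doc - previous, &mut out);
        previous = doc;
    }
    out
}

/// Reverses `encode_deltas` by summing the decoded gaps.
fn decode_deltas(bytes: &[u8]) -> Vec<u32> {
    let mut docs = Vec::new();
    let (mut current, mut delta, mut shift) = (0u32, 0u32, 0u32);
    for &byte in bytes {
        delta |= u32::from(byte & 0x7f) << shift;
        if (byte & 0x80) == 0 {
            current += delta;
            docs.push(current);
            delta = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    docs
}
```

Large doc ids need several bytes each when stored directly, while the gaps between neighbouring ids are usually small enough to fit in one or two bytes.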
Smaller Indices
Parts of the posting list are now encoded as the value minus 1 instead of the direct value, which yields slightly smaller inverted indices. We added some compatibility code to support the old format. #2185
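One reading of this change, stated here as an assumption rather than taken from the Tantivy source: doc ids in a posting list are strictly increasing, so every gap is at least 1, and storing the gap minus 1 can lower the number of bits a bit-packed block needs. The helper names below are hypothetical.

```rust
/// Number of bits required to represent `value` (0 needs 0 bits).
fn bits_needed(value: u32) -> u32 {
    32 - value.leading_zeros()
}

/// Bit width needed to bit-pack the gaps of a strictly increasing doc id list.
/// With `minus_one` set, each gap is stored as `gap - 1`.
fn gap_bit_width(doc_ids: &[u32], minus_one: bool) -> u32 {
    let offset = if minus_one { 1 } else { 0 };
    doc_ids
        .windows(2)
        .map(|pair| pair[1] - pair[0] - offset)
        .max()
        .map_or(0, bits_needed)
}
```

For a run of consecutive doc ids, every gap is 1, so the minus-one form needs zero bits per gap in this sketch.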
Bug Fixes
A number of critical bugs have been fixed in this release, including:
- Fixed null byte handling in JSON paths, where null bytes in JSON keys caused a panic during indexing. #2345
- Avoided creating single-document indices by increasing the minimum memory requirement for indexing to 15MB. #2176
- Fixed an error in time precision conversion when creating empty histograms, which caused the aggregation to try to allocate too many buckets. #2183
- Fixed two merge panics when merging JSON fields with mixed types and indexed with positions. #2284 #2253
- Fixed multiple bugs in fast field range queries
Breaking API Changes
- Rename ReloadPolicy OnCommit to OnCommitWithDelay #2235
- Move exports from the root into modules #2220
- Accept a String field name instead of Field in FilterCollector. The String will be able to address nested JSON fields with the dot (.) syntax. #2196
- Remove deprecated IntOptions and DatePrecision. Use NumericOptions and DateTimePrecision instead. #2353
Planned Removal of Index Sorting in 0.23
We plan to remove index sorting in 0.23. If you have a use case that requires index sorting, please let us know in the issue below.
https://github.com/quickwit-oss/tantivy/issues/2352
A Big Thank You to Our Contributors!
This release would not have been possible without the hard work and dedication of our contributors.
New Contributors
A special welcome and thanks to our first-time contributors for this release!
- @imotov made their first contribution in #2160
- @PingXia-at made their first contribution in #2173
- @giovannicuccu made their first contribution in #2235
- @BlackHoleFox made their first contribution in #2265
- @ditsuke made their first contribution in #2282
- @MochiXu made their first contribution in #2312
Check out the CHANGELOG for a detailed list of all the changes. We’re excited to see what you build with Tantivy 0.22!