# Querying

Quickwit provides two endpoints for full-text search queries, identified by the `query` parameter:
- A search endpoint that returns a JSON payload
- A search stream endpoint that returns a stream of the requested field values
A search query received by a searcher is executed using a map-reduce approach, following these steps:
- The searcher identifies the relevant splits based on the request's timestamp interval and tags.
- It distributes the split workload among the other searchers available in the cluster, using rendezvous hashing to optimize caching and load.
- It finally waits for all results, merges them, and returns them to the client.
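The split distribution step can be pictured with a toy rendezvous (highest-random-weight) hashing function. This is an illustrative Python sketch, not Quickwit's actual implementation; the node names and hash choice are made up:

```python
import hashlib

def rendezvous_pick(split_id: str, searchers: list[str]) -> str:
    """Return the searcher with the highest hash score for this split."""
    def score(searcher: str) -> int:
        digest = hashlib.sha256(f"{searcher}:{split_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(searchers, key=score)

searchers = ["searcher-1", "searcher-2", "searcher-3"]
assignment = {split: rendezvous_pick(split, searchers)
              for split in ["split-a", "split-b", "split-c"]}
# Removing one node only remaps the splits that were assigned to it;
# every other split keeps its searcher (and that searcher's warm cache).
```

This stability property is why rendezvous hashing helps caching: the same split keeps landing on the same node as the cluster membership changes.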
A search stream query follows the same execution path as a search query except for the last step: instead of waiting for all results, the searcher streams results to the client as soon as they start arriving from the other searchers.
## Time sharding

On datasets with a time component, Quickwit shards data into timestamp-aware splits. With this feature, Quickwit can filter out most splits before they reach the query processing stage, drastically reducing the amount of data needed to process a query.
The following query parameters apply timestamp pruning to your query:
- `startTimestamp`: restricts search to documents with a `timestamp >= start_timestamp`
- `endTimestamp`: restricts search to documents with a `timestamp < end_timestamp`
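As an illustration, timestamp pruning boils down to an interval-overlap check against each split's metadata. The sketch below is hypothetical (the field names are invented), not Quickwit's code:

```python
from dataclasses import dataclass

@dataclass
class SplitMeta:
    split_id: str
    min_ts: int  # smallest document timestamp in the split
    max_ts: int  # largest document timestamp in the split

def prune(splits, start_ts=None, end_ts=None):
    """Keep only splits whose time range can overlap [start_ts, end_ts)."""
    kept = []
    for s in splits:
        if start_ts is not None and s.max_ts < start_ts:
            continue  # split ends before the query window
        if end_ts is not None and s.min_ts >= end_ts:
            continue  # split starts at/after the query window
        kept.append(s)
    return kept

splits = [SplitMeta("old", 0, 99), SplitMeta("mid", 100, 199), SplitMeta("new", 200, 299)]
# Only "mid" can contain documents with 150 <= timestamp < 180.
```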
## Tag pruning

Quickwit also provides pruning on a second dimension called tags. By setting a field as tagged, Quickwit generates split metadata at indexing time so that, at query time, it can filter out splits that do not match the requested tags. Note that this metadata is only generated when the cardinality of the field is less than 1,000.
Tag pruning is notably useful on multi-tenant datasets.
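A minimal sketch of the idea, assuming tags are stored as `field:value` strings in split metadata (the exact representation here is invented for illustration):

```python
def prune_by_tag(splits_tags: dict[str, set[str]], wanted_tag: str) -> list[str]:
    """Keep only splits whose recorded tags contain the requested value."""
    return [split_id for split_id, tags in splits_tags.items() if wanted_tag in tags]

splits_tags = {
    "split-1": {"tenant_id:tenant_1", "tenant_id:tenant_2"},
    "split-2": {"tenant_id:tenant_3"},
}
# A query filtering on tenant_id:tenant_3 only needs to open "split-2".
```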
## Partitioning
Quickwit makes it possible to route documents into different splits based on a partitioning key.
This feature is especially useful in a context where documents with different tags are all mixed together in the same source (usually a Kafka topic).
In that case, simply marking the field as a tag will have no positive effect on search, as all produced splits will contain almost all tags.
The `partition_key` attribute (defined in the doc mapping) lets you configure the logic Quickwit uses to route documents into isolated splits.
Quickwit also enforces this isolation during merges. This functionality is, in a sense, similar to sharding.
Quickwit supports a simple DSL for partitioning, described in the next section.
Partitions & tags are often used to:
- separate tenants in a multi-tenant application
- separate teams or applications in an observability logging use case.
Emitting many splits can heavily stress an indexer. For this reason, another doc mapping parameter, `max_num_partitions`, acts as a safety valve. If the number of partitions is about to exceed `max_num_partitions`, a single extra partition is created, and all extra partitions are grouped together into this special partition.
If you are expecting 20 partitions, we strongly recommend not setting `max_num_partitions` to 20, but instead using a larger value (200, for instance).
Quickwit handles that number of partitions smoothly, and the headroom prevents documents belonging to different partitions from being grouped together because of a few faulty documents.
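The safety valve can be pictured like this. The exact accounting (when precisely the overflow partition kicks in, and its name) is a guess for illustration, not Quickwit's precise behavior:

```python
def assign_partition(key: str, partitions: dict[str, str], max_num_partitions: int) -> str:
    """Route a partition key, funneling overflow keys into one shared partition."""
    if key in partitions:
        return partitions[key]
    # Reserve one slot for the special overflow partition.
    if len(set(partitions.values())) < max_num_partitions - 1:
        partitions[key] = key             # the key gets its own partition
    else:
        partitions[key] = "__overflow__"  # grouped into the single extra partition
    return partitions[key]

partitions: dict[str, str] = {}
for key in ["tenant_1", "tenant_2", "tenant_3", "tenant_4"]:
    assign_partition(key, partitions, max_num_partitions=3)
# With max_num_partitions=3, the later tenants share the overflow partition
# instead of each producing its own splits.
```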
### Partition key DSL

Quickwit lets you configure how documents are routed with a simple DSL. Here are some sample expressions with a short description of their result:
- `tenant_id`: create one partition per tenant_id
- `tenant_id,app_id`: create one partition per unique combination of tenant_id and app_id
- `tenant_id,hash_mod(app_id, 8)`: for each tenant, create up to 8 partitions, each containing the data of some of the applications
- `hash_mod((tenant_id,app_id), 50)`: create 50 partitions in total, each containing some combinations of tenants and apps.
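To make these examples concrete, here is a toy re-implementation of `hash_mod` routing in Python. Quickwit's actual hash function differs; only the routing logic is the point:

```python
import hashlib

def hash_mod(value, modulo: int) -> int:
    """Hash a value (or tuple of values), keeping only the remainder mod `modulo`."""
    if isinstance(value, tuple):
        value = ",".join(map(str, value))
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulo

# tenant_id,hash_mod(app_id, 8): one partition per (tenant, app-bucket) pair,
# so each tenant's data spreads over at most 8 partitions.
def routing_key(tenant_id: str, app_id: str) -> tuple:
    return (tenant_id, hash_mod(app_id, 8))
```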
The partition key DSL is generated by this grammar:

    RoutingExpr := RoutingSubExpr [ , RoutingExpr ]
    RoutingSubExpr := Identifier [ \( Arguments \) ]
    Identifier := FieldChar [ Identifier ]
    FieldChar := { a..z | A..Z | 0..9 | _ }
    Arguments := Argument [ , Arguments ]
    Argument := { \( RoutingExpr \) | RoutingSubExpr | DirectValue }
    # We may want other DirectValue in the future
    DirectValue := Number
    Number := { 0..9 } [ Number ]
Supported functions are currently:
- `hash_mod(RoutingExpr, Number)`: hash `RoutingExpr` and divide the result by `Number`, keeping only the remainder.
When using `hash_mod` with a tuple of keys, as in `hash_mod((tenant_id,app_id), 50)`, beware that it might route unrelated documents together, which makes tags less effective.
For instance, if tenant_1,app_1 and tenant_2,app_2 are both sent to partition one, but tenant_1,app_2 is sent to partition two, a query for tenant_1,app_2 will still search inside the first partition, as it is tagged with tenant_1, tenant_2, app_1, and app_2. You should therefore prefer a partition key such as `hash_mod(tenant_id, 10),hash_mod(app_id, 5)`, which generates as many splits, but with better tags.
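The pitfall can be demonstrated with a small sketch. The tag bookkeeping below is a simplified assumption (tags stored per split as `field:value` strings), not Quickwit's exact metadata format:

```python
def tags_of_partition(docs):
    """Tags a split would record: every tenant and app value it contains."""
    tags = set()
    for tenant, app in docs:
        tags.add(f"tenant_id:{tenant}")
        tags.add(f"app_id:{app}")
    return tags

# Suppose hash_mod((tenant_id, app_id), 50) sends (tenant_1, app_1) and
# (tenant_2, app_2) to the same partition:
combined_partition = [("tenant_1", "app_1"), ("tenant_2", "app_2")]
tags = tags_of_partition(combined_partition)
# A query for tenant_1 AND app_2 matches the split's tags even though the
# split contains no (tenant_1, app_2) document, so it cannot be pruned.
cannot_prune = {"tenant_id:tenant_1", "app_id:app_2"} <= tags
```

With separate `hash_mod(tenant_id, 10),hash_mod(app_id, 5)` buckets, each split only ever holds one tenant bucket and one app bucket, so its tag set stays tight enough to prune this query.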
## Search stream query limits

Search stream queries can consume a large amount of RAM. Quickwit limits the number of concurrent search streams per split to 100 by default. You can adjust this limit by setting the searcher configuration property `max_num_concurrent_split_streams` in the configuration file.
## Caching

Quickwit caches data in many places to deliver a high-performance query engine:
- Hotcache caching: a static cache that holds information about a split file's internal representation. It speeds up the opening of a split file. Its size can be defined via the `split_footer_cache_capacity` configuration parameter.
- Fast field caching: fast fields tend to be accessed very frequently by users, especially for stream requests. They are cached in RAM, whose size can be limited by the `fast_field_cache_capacity` configuration value.
- Partial request caching: in some cases, such as dashboards, very similar requests might be issued with only the timestamp bounds changing. Partial results can be cached to make these requests faster and to issue fewer requests to the storage. They are cached in RAM, whose size can be limited by the `partial_request_cache_capacity` configuration value.
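For reference, these capacities belong to the searcher section of the node configuration file; the snippet below is a sketch with illustrative values, not recommended settings:

```yaml
# node configuration file -- illustrative values only
searcher:
  split_footer_cache_capacity: 500M
  fast_field_cache_capacity: 1G
  partial_request_cache_capacity: 64M
```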
## Scoring

Quickwit supports sorting documents by their BM25 scores. In order to query by score, fieldnorms must be enabled for the field. By default, BM25 scoring is disabled to improve query latencies, but it can be opted into by setting the `sort_by` option to `_score` in queries.
## Document ID
Each document in Quickwit is assigned a unique document ID, which is a combination of the split ID and the Tantivy DocId within the split. This implies that you cannot assign a custom ID and that the ID changes when splits undergo merges. This ID is used for every search query as sort order (after the explicitly specified sort values) to make the results deterministic.
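The tie-breaking described above can be sketched as a compound sort key: the explicit sort values come first, then the internal document ID. The hit layout and sort directions here are illustrative assumptions:

```python
hits = [
    {"sort_value": 7, "split_id": "split-b", "doc_id": 3},
    {"sort_value": 7, "split_id": "split-a", "doc_id": 9},
    {"sort_value": 9, "split_id": "split-a", "doc_id": 1},
]

# Descending sort value first; the (split_id, doc_id) pair breaks ties so
# the same query always returns hits in the same order.
ordered = sorted(
    hits,
    key=lambda h: (-h["sort_value"], h["split_id"], h["doc_id"]),
)
```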