Source configuration
Quickwit can insert data into an index from one or multiple sources.
A source can be added after index creation using the CLI command `quickwit source create`. It can also be enabled or disabled with the `quickwit source enable` and `quickwit source disable` subcommands.
A source is declared using an object called a source config, which defines the source's settings. It consists of the following parameters:
- source ID
- source type
- source parameters
- input_format
- maximum number of pipelines per indexer (optional)
- desired number of pipelines (optional)
- transform parameters (optional)
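Putting these parameters together, a complete source config might look like the following sketch. It assumes a Kafka source; the source ID, topic, broker address, and VRL script are placeholders for illustration:

```yaml
version: 0.6
source_id: my-kafka-source
source_type: kafka
params:
  topic: my-topic
  client_params:
    bootstrap.servers: localhost:9092
max_num_pipelines_per_indexer: 1
desired_num_pipelines: 1
input_format: json
transform:
  script: |
    .message = downcase(string!(.message))
```

Each parameter is described in detail in the sections below.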
Source ID
The source ID is a string that uniquely identifies the source within an index. It may only contain uppercase or lowercase ASCII letters, digits, hyphens (`-`), and underscores (`_`). Finally, it must start with a letter and contain at least 3 characters but no more than 255.
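These constraints can be sketched as a regular expression. This is an illustrative check, not Quickwit's actual validation code:

```python
import re

# Starts with a letter, followed by letters, digits, hyphens, or
# underscores, for a total length of 3 to 255 characters.
SOURCE_ID_PATTERN = re.compile(r"^[a-zA-Z][a-zA-Z0-9_-]{2,254}$")

def is_valid_source_id(source_id: str) -> bool:
    return SOURCE_ID_PATTERN.match(source_id) is not None

print(is_valid_source_id("my-kafka-source"))  # True
print(is_valid_source_id("1st-source"))       # False: must start with a letter
print(is_valid_source_id("ab"))               # False: fewer than 3 characters
```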
Source type
The source type designates the kind of source being configured. As of version 0.5, available source types are `ingest-api`, `kafka`, `kinesis`, and `pulsar`. The `file` type is also supported, but only for local ingestion from the CLI.
Source parameters
The source parameters indicate how to connect to a data store and are specific to the source type.
File source (CLI only)
A file source reads data from a local file. The file must consist of JSON objects separated by a newline (NDJSON). As of version 0.5, a file source can only be ingested with the CLI command. Compressed files (bz2, gzip, ...) and remote files (Amazon S3, HTTP, ...) are not supported.
./quickwit tool local-ingest --input-path <INPUT_PATH>
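As an illustration, an NDJSON input file contains one JSON object per line (the records below are made up):

```json
{"timestamp": 1684239283, "level": "INFO", "message": "server started"}
{"timestamp": 1684239284, "level": "ERROR", "message": "connection refused"}
```

Such a file can then be passed to the `local-ingest` command via `--input-path`.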
Ingest API source
An ingest API source reads data from the Ingest API. This source is automatically created at index creation and cannot be deleted or disabled.
Kafka source
A Kafka source reads data from a Kafka stream. Each message in the stream must hold a JSON object.
A tutorial is available here.
Kafka source parameters
The Kafka source consumes a `topic` using the client library librdkafka and forwards the key-value pairs carried by the parameter `client_params` to the underlying librdkafka consumer. Common `client_params` options are the bootstrap servers (`bootstrap.servers`) or the security protocol (`security.protocol`). Please refer to the Kafka and librdkafka documentation pages for more advanced options.
| Property | Description | Default value |
|---|---|---|
| `topic` | Name of the topic to consume. | required |
| `client_log_level` | librdkafka client log level. Possible values are: `debug`, `info`, `warn`, `error`. | `info` |
| `client_params` | librdkafka client configuration parameters. | `{}` |
| `enable_backfill_mode` | Backfill mode stops the source after reaching the end of the topic. | `false` |
Kafka client parameters
- `bootstrap.servers`: Comma-separated list of host and port pairs that are the addresses of a subset of the Kafka brokers in the Kafka cluster.
- `enable.auto.commit`: The Kafka source manages commit offsets manually using the checkpoint API and disables auto-commit.
- `group.id`: Kafka-based distributed indexing relies on consumer groups. Unless overridden in the client parameters, the default group ID assigned to each consumer managed by the source is `quickwit-{index_uid}-{source_id}`.
- `max.poll.interval.ms`: Short max poll interval durations may cause a source to crash when back pressure from the indexer occurs. Therefore, Quickwit recommends using the default value of `300000` (5 minutes).
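For instance, a source config could override the consumer group ID while keeping the recommended poll interval. The topic, broker address, and group name below are placeholders:

```yaml
version: 0.6
source_id: my-kafka-source
source_type: kafka
params:
  topic: my-topic
  client_params:
    bootstrap.servers: localhost:9092
    group.id: my-custom-group
    max.poll.interval.ms: 300000
```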
Adding a Kafka source to an index with the CLI
cat << EOF > source-config.yaml
version: 0.6
source_id: my-kafka-source
source_type: kafka
params:
  topic: my-topic
  client_params:
    bootstrap.servers: localhost:9092
    security.protocol: SSL
EOF
./quickwit source create --index my-index --source-config source-config.yaml
Kinesis source
A Kinesis source reads data from an Amazon Kinesis stream. Each message in the stream must hold a JSON object.
A tutorial is available here.
Kinesis source parameters
The Kinesis source consumes a stream identified by a `stream_name` and a `region`.
| Property | Description | Default value |
|---|---|---|
| `stream_name` | Name of the stream to consume. | required |
| `region` | The AWS region of the stream. Mutually exclusive with `endpoint`. | `us-east-1` |
| `endpoint` | Custom endpoint for use with an AWS-compatible Kinesis service. Mutually exclusive with `region`. | optional |
If no region is specified, Quickwit will attempt to find one in multiple other locations, in the following order of precedence:
1. Environment variables (`AWS_REGION`, then `AWS_DEFAULT_REGION`)
2. Config file, typically located at `~/.aws/config`, or otherwise specified by the `AWS_CONFIG_FILE` environment variable if set and not empty
3. Amazon EC2 instance metadata service, determining the region of the currently running Amazon EC2 instance
4. Default value: `us-east-1`
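When targeting an AWS-compatible Kinesis service instead of AWS itself, `endpoint` replaces `region`. The endpoint URL below is a placeholder for illustration:

```yaml
version: 0.6
source_id: my-kinesis-source
source_type: kinesis
params:
  stream_name: my-stream
  endpoint: http://localhost:4566
```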
Adding a Kinesis source to an index with the CLI
cat << EOF > source-config.yaml
version: 0.6
source_id: my-kinesis-source
source_type: kinesis
params:
  stream_name: my-stream
EOF
quickwit source create --index my-index --source-config source-config.yaml
Pulsar source
A Pulsar source reads data from one or several Pulsar topics. Each message in the topic(s) must hold a JSON object.
A tutorial is available here.
Pulsar source parameters
The Pulsar source consumes `topics` using the client library pulsar-rs.
| Property | Description | Default value |
|---|---|---|
| `topics` | List of topics to consume. | required |
| `address` | Pulsar URL (`pulsar://` and `pulsar+ssl://`). | required |
| `consumer_name` | The consumer name to register with the Pulsar source. | `quickwit` |
Adding a Pulsar source to an index with the CLI
cat << EOF > source-config.yaml
version: 0.6
source_id: my-pulsar-source
source_type: pulsar
params:
  topics:
    - my-topic
  address: pulsar://localhost:6650
EOF
./quickwit source create --index my-index --source-config source-config.yaml
Maximum number of pipelines per indexer
The `max_num_pipelines_per_indexer` parameter is only available for sources that can be distributed: Kafka and (coming soon) Pulsar.
The maximum number of indexing pipelines defines the limit of pipelines spawned for the source on a given indexer.
This maximum can only be reached if `desired_num_pipelines` is large enough.
With the following parameters, only one pipeline will run on one indexer.
max_num_pipelines_per_indexer=2
desired_num_pipelines=1
Desired number of pipelines
The `desired_num_pipelines` parameter is only available for sources that can be distributed: Kafka and (coming soon) Pulsar.
The desired number of indexing pipelines defines the number of pipelines to run on a cluster for the source. It is a "desired" number because it cannot be reached if there are not enough indexers in the cluster.
With the following parameters, only one pipeline will start on the sole indexer.
max_num_pipelines_per_indexer=1
desired_num_pipelines=2
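In a source config, these parameters sit alongside the source type and params. The following sketch (a Kafka source with placeholder topic and broker address) asks for two pipelines spread over at least two indexers:

```yaml
version: 0.6
source_id: my-kafka-source
source_type: kafka
params:
  topic: my-topic
  client_params:
    bootstrap.servers: localhost:9092
max_num_pipelines_per_indexer: 1
desired_num_pipelines: 2
```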
Transform parameters
For all source types except `ingest-api`, ingested documents can be transformed before being indexed using Vector Remap Language (VRL) scripts.
| Property | Description | Default value |
|---|---|---|
| `script` | Source code of the VRL program executed to transform documents. | required |
| `timezone` | Timezone used in the VRL program for date and time manipulations. It must be a valid name in the TZ database. | `UTC` |
# Your source config here
# ...
transform:
  script: |
    .message = downcase(string!(.message))
    .timestamp = now()
    del(.username)
  timezone: local
Input format
The `input_format` parameter specifies the expected data format of the source. Two formats are currently supported:
- `json`: JSON, the default
- `plain_text`: unstructured text document

Internally, Quickwit can only index JSON data. To allow the ingestion of plain text documents, Quickwit transforms them on the fly into JSON objects of the form `{"plain_text": "<original plain text document>"}`. They can then optionally be transformed into more complex documents using a VRL script (see the transform feature).
The following is an example of how one could parse and transform a CSV dataset containing a list of users described by 3 attributes: first name, last name, and age.
# Your source config here
# ...
transform:
  script: |
    user = parse_csv!(.plain_text)
    .first_name = user[0]
    .last_name = user[1]
    .age = to_int!(user[2])
    del(.plain_text)
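The effect of this script can be mimicked in plain Python. This is an illustrative re-implementation of the transformation, not how Quickwit executes VRL:

```python
import csv
import io
import json

def transform(doc: dict) -> dict:
    """Mimic the VRL script: parse the CSV line in `plain_text`
    and replace it with structured fields."""
    row = next(csv.reader(io.StringIO(doc["plain_text"])))
    return {
        "first_name": row[0],
        "last_name": row[1],
        "age": int(row[2]),
    }

# A plain text line "John,Doe,42" is first wrapped by Quickwit into:
doc = {"plain_text": "John,Doe,42"}
print(json.dumps(transform(doc)))
# {"first_name": "John", "last_name": "Doe", "age": 42}
```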
Enabling/Disabling a source from an index
A source can be enabled or disabled for an index using the CLI command `quickwit source enable` or `quickwit source disable`:
quickwit source disable --index my-index --source my-source
A source is enabled by default. When disabling a source, the related indexing pipelines will be shut down on each relevant indexer and indexing for this source will be paused.
Deleting a source from an index
A source can be removed from an index using the CLI command `quickwit source delete`:
quickwit source delete --index my-index --source my-source
When deleting a source, the checkpoint associated with the source is also removed.