
S3 with SQS notifications

In this tutorial, we describe how to set up Quickwit to ingest data from S3 with bucket notification events flowing through SQS. We will first create the AWS resources (S3 bucket, SQS queue, notifications) using Terraform. We will then configure the Quickwit index and file source. Finally, we will send some data to the source bucket and verify that it gets indexed.

AWS resources

The complete Terraform script can be downloaded here.

First, create the bucket that will receive the source data files (NDJSON format):

resource "aws_s3_bucket" "file_source" {
bucket_prefix = "qw-tuto-source-bucket"
}

Then set up the SQS queue that will carry the notifications when files are added to the bucket. The queue is configured with a policy that allows the source bucket to write S3 notification messages to it. Also create a dead letter queue (DLQ) to receive the messages that couldn't be processed by the file source (e.g. corrupted files). Messages are moved to the DLQ after 5 indexing attempts.

locals {
  sqs_notification_queue_name = "qw-tuto-s3-event-notifications"
}

data "aws_iam_policy_document" "sqs_notification" {
  statement {
    effect = "Allow"

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    actions   = ["sqs:SendMessage"]
    resources = ["arn:aws:sqs:*:*:${local.sqs_notification_queue_name}"]

    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [aws_s3_bucket.file_source.arn]
    }
  }
}

resource "aws_sqs_queue" "s3_events_deadletter" {
name = "${locals.sqs_notification_queue_name}-deadletter"
}

resource "aws_sqs_queue" "s3_events" {
name = local.sqs_notification_queue_name
policy = data.aws_iam_policy_document.sqs_notification.json

redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.s3_events_deadletter.arn
maxReceiveCount = 5
})
}

resource "aws_sqs_queue_redrive_allow_policy" "s3_events_deadletter" {
queue_url = aws_sqs_queue.s3_events_deadletter.id

redrive_allow_policy = jsonencode({
redrivePermission = "byQueue",
sourceQueueArns = [aws_sqs_queue.s3_events.arn]
})
}
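After the stack is deployed (see the terraform apply step below), you can confirm that the redrive policy is in place by inspecting the queue attributes, replacing <notification_queue_url> with the matching Terraform output:

aws sqs get-queue-attributes \
  --queue-url <notification_queue_url> \
  --attribute-names RedrivePolicy

The returned RedrivePolicy should reference the dead letter queue ARN and show maxReceiveCount set to 5.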

Configure the bucket notification that writes messages to SQS each time a new file is created in the source bucket:

resource "aws_s3_bucket_notification" "bucket_notification" {
bucket = aws_s3_bucket.file_source.id

queue {
queue_arn = aws_sqs_queue.s3_events.arn
events = ["s3:ObjectCreated:*"]
}
}
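Once the Terraform script is applied, you can double-check the notification wiring with the AWS CLI, replacing <source_bucket_name> with the matching Terraform output:

aws s3api get-bucket-notification-configuration --bucket <source_bucket_name>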
note

Only events of type s3:ObjectCreated:* are supported. Other types (e.g. ObjectRemoved) are acknowledged and a warning is logged.

The source needs to have access to both the notification queue and the source bucket. The following policy document contains the minimum permissions required by the source:

data "aws_iam_policy_document" "quickwit_node" {
statement {
effect = "Allow"
actions = [
"sqs:ReceiveMessage",
"sqs:DeleteMessage",
"sqs:ChangeMessageVisibility",
"sqs:GetQueueAttributes",
]
resources = [aws_sqs_queue.s3_events.arn]
}
statement {
effect = "Allow"
actions = ["s3:GetObject"]
resources = ["${aws_s3_bucket.file_source.arn}/*"]
}
}

Create the IAM user and credentials that will be used to associate this policy to your local Quickwit instance:

resource "aws_iam_user" "quickwit_node" {
name = "quickwit-filesource-tutorial"
path = "/system/"
}

resource "aws_iam_user_policy" "quickwit_node" {
name = "quickwit-filesource-tutorial"
user = aws_iam_user.quickwit_node.name
policy = data.aws_iam_policy_document.quickwit_node.json
}

resource "aws_iam_access_key" "quickwit_node" {
user = aws_iam_user.quickwit_node.name
}
danger

We don't recommend using IAM user credentials to run Quickwit nodes in production. This simplified setup is for the sake of the tutorial only. When running on EC2/ECS, attach the policy document to an IAM role instead.

Download the complete Terraform script and deploy it with terraform init and terraform apply. After a successful execution, the outputs required to configure Quickwit will be listed. You can display the values of the sensitive outputs (key id and secret key) with:

terraform output quickwit_node_access_key_id
terraform output quickwit_node_secret_access_key
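If you prefer not to copy the values by hand, terraform output -raw prints them unquoted, so you can export them directly into your shell for the next step:

export AWS_ACCESS_KEY_ID=$(terraform output -raw quickwit_node_access_key_id)
export AWS_SECRET_ACCESS_KEY=$(terraform output -raw quickwit_node_secret_access_key)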

Run Quickwit

Install Quickwit locally, then, from your install directory, run Quickwit with the necessary access rights, replacing <quickwit_node_access_key_id> and <quickwit_node_secret_access_key> with the matching Terraform output values:

AWS_ACCESS_KEY_ID=<quickwit_node_access_key_id> \
AWS_SECRET_ACCESS_KEY=<quickwit_node_secret_access_key> \
AWS_REGION=us-east-1 \
./quickwit run
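Before moving on, you can check that the node is up by querying its health endpoint (this assumes Quickwit's default REST port 7280):

curl http://127.0.0.1:7280/health/livez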

Configure the index and the source

In another terminal, in the Quickwit install directory, create an index:

cat << EOF > tutorial-sqs-file-index.yaml
version: 0.7
index_id: tutorial-sqs-file
doc_mapping:
  mode: dynamic
indexing_settings:
  commit_timeout_secs: 30
EOF

./quickwit index create --index-config tutorial-sqs-file-index.yaml
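You can verify that the index was created with:

./quickwit index list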

Replacing <notification_queue_url> with the corresponding Terraform output value, create a file source for that index:

cat << EOF > tutorial-sqs-file-source.yaml
version: 0.8
source_id: sqs-filesource
source_type: file
num_pipelines: 2
params:
  notifications:
    - type: sqs
      queue_url: <notification_queue_url>
      message_type: s3_notification
EOF

./quickwit source create --index tutorial-sqs-file --source-config tutorial-sqs-file-source.yaml
tip

The num_pipelines configuration controls how many consumers poll the queue in parallel. Choose the number according to the indexer compute resources you want to dedicate to this source. As a rule of thumb, configure 1 pipeline for every 2 cores.
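To confirm the source was registered, you can list the sources attached to the index; sqs-filesource should appear in the output:

./quickwit source list --index tutorial-sqs-file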

Ingest data

We can now ingest data into Quickwit by uploading files to S3. If you have the AWS CLI installed, run the following command, replacing <source_bucket_name> with the associated Terraform output:

curl https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants-10000.json | \
aws s3 cp - s3://<source_bucket_name>/hdfs-logs-multitenants-10000.json

If you prefer not to use the AWS CLI, you can also download the file and upload it manually to the source bucket using the AWS console.
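While the file is being picked up, you can watch the notification flow through the queue, replacing <notification_queue_url> as before. The message counts should drop back to 0 once the file has been indexed and the message deleted:

aws sqs get-queue-attributes \
  --queue-url <notification_queue_url> \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible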

Wait approximately 1 minute and the data should appear in the index:

./quickwit index describe --index tutorial-sqs-file
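You can also run a quick search to check the content. The query below assumes the severity_text field present in the sample dataset; adjust it if you uploaded different data:

./quickwit index search --index tutorial-sqs-file --query "severity_text:ERROR"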

Tear down the resources

The AWS resources instantiated in this tutorial don't incur any fixed costs, but we still recommend deleting them when you are done. In the directory with the Terraform script, run terraform destroy.