Version: 0.8.0

Full-text search on ClickHouse

This guide will help you add full-text search to a well-known OLAP database, ClickHouse, using the Quickwit search streaming feature. Quickwit exposes a REST endpoint that streams the IDs (or any other fast field) of documents matching a search query extremely fast (up to 50 million in 1 second), and ClickHouse can easily consume them in JOIN queries.

We will use the GitHub archive dataset, which gathers more than 3 billion GitHub events: PullRequestEvent, IssuesEvent, and so on. You can dive into the great analysis made by ClickHouse to get a good understanding of the dataset. We also took strong inspiration from that work, and we are very grateful to them for sharing it.

Install

curl -L https://install.quickwit.io | sh
cd quickwit-v*/

Start a Quickwit server

./quickwit run

Create a Quickwit index

After starting Quickwit, we need to create an index configured to receive these events. Let's first look at the data to ingest. Here is an example event:

{
  "id": 11410577343,
  "event_type": "PullRequestEvent",
  "actor_login": "renovate[bot]",
  "repo_name": "dmtrKovalenko/reason-date-fns",
  "created_at": 1580515200000,
  "action": "closed",
  "number": 44,
  "title": "Update dependency rollup to ^1.31.0",
  "labels": [],
  "ref": null,
  "additions": 5,
  "deletions": 5,
  "commit_id": null,
  "body": "This PR contains the following updates..."
}

We don't need to index all the fields described above, as title and body are the fields of interest for our full-text search tutorial. The id field will be helpful for the JOINs in ClickHouse; created_at and event_type may also be beneficial for timestamp pruning and filtering.

gh-archive-index-config.yaml
version: 0.7
index_id: gh-archive
# By default, the index will be stored in your data directory,
# but you can store it on s3 or on a custom path as follows:
# index_uri: s3://my-bucket/gh-archive
# index_uri: file://my-big-ssd-harddrive/
doc_mapping:
  store_source: false
  field_mappings:
    - name: id
      type: u64
      fast: true
    - name: created_at
      type: datetime
      input_formats:
        - unix_timestamp
      output_format: unix_timestamp_secs
      fast_precision: seconds
      fast: true
    - name: event_type
      type: text
      tokenizer: raw
    - name: title
      type: text
      tokenizer: default
      record: position
    - name: body
      type: text
      tokenizer: default
      record: position
  timestamp_field: created_at

search_settings:
  default_search_fields: [title, body]
Download the config file and create the index:

curl -o gh-archive-index-config.yaml https://raw.githubusercontent.com/quickwit-oss/quickwit/main/config/tutorials/gh-archive/index-config-for-clickhouse.yaml
./quickwit index create --index-config gh-archive-index-config.yaml
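To double-check that the index was created with the expected configuration, you can ask Quickwit to describe it (a quick sanity check, assuming the same quickwit binary and working directory as above):

./quickwit index describe --index gh-archive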

Indexing events

The dataset is a compressed NDJSON file. Let's index it.

wget https://quickwit-datasets-public.s3.amazonaws.com/gh-archive/gh-archive-2021-12-text-only.json.gz
gunzip -c gh-archive-2021-12-text-only.json.gz | ./quickwit index ingest --index gh-archive

You can check that it's working by using the search command and looking for the word tantivy:

./quickwit index search --index gh-archive --query "tantivy"

Streaming IDs

We are now ready to fetch some IDs with the search stream endpoint. Let's start by streaming them for a simple query with a CSV output format.

curl "http://127.0.0.1:7280/api/v1/gh-archive/search/stream?query=tantivy&output_format=csv&fast_field=id"

We will use the click_house_row_binary output format in the following sections to speed up queries.
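You can already try that binary format directly; its output is not human-readable, so the sketch below (assuming the same local Quickwit instance on port 7280) just counts the bytes returned. Since each u64 ID is serialized as 8 bytes in RowBinary, dividing the byte count by 8 gives the number of streamed IDs:

curl -s "http://127.0.0.1:7280/api/v1/gh-archive/search/stream?query=tantivy&fast_field=id&output_format=click_house_row_binary" | wc -c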

ClickHouse

Let's leave Quickwit for now, install ClickHouse, and start a ClickHouse server.
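If ClickHouse is not installed yet, one quick way to get a local server is the official standalone binary; this is just a minimal sketch, see the ClickHouse installation docs for other methods:

# Download the ClickHouse standalone binary
curl https://clickhouse.com/ | sh
# Start a local server in the current directory
./clickhouse server

In a separate terminal, ./clickhouse client (or clickhouse-client if ClickHouse is installed system-wide) gives you the SQL client used in the next steps.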

Create database and table

Once installed, just start a client and execute the following SQL statements:

CREATE DATABASE "gh-archive";
USE "gh-archive";


CREATE TABLE github_events
(
    id UInt64,
    event_type Enum('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4,
                    'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8,
                    'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11,
                    'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15,
                    'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19,
                    'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22),
    actor_login LowCardinality(String),
    repo_name LowCardinality(String),
    created_at Int64,
    action Enum('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9,
                'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20),
    comment_id UInt64,
    body String,
    ref LowCardinality(String),
    number UInt32,
    title String,
    labels Array(LowCardinality(String)),
    additions UInt32,
    deletions UInt32,
    commit_id String
) ENGINE = MergeTree ORDER BY (event_type, repo_name, created_at);

Import events

We have created a second dataset, gh-archive-2021-12.json.gz, which gathers all events, even those with no text. It's better to insert this one into ClickHouse, but if you are short on time, you can reuse the gh-archive-2021-12-text-only.json.gz dataset we used for Quickwit.

wget https://quickwit-datasets-public.s3.amazonaws.com/gh-archive/gh-archive-2021-12.json.gz
gunzip -c gh-archive-2021-12.json.gz | clickhouse-client -d gh-archive --query="INSERT INTO github_events FORMAT JSONEachRow"

Let's check that it's working:

-- Top repositories by number of events
SELECT repo_name, count() AS events
FROM github_events
GROUP BY repo_name
ORDER BY events DESC LIMIT 5

┌─repo_name─────────────────────────────────┬─events─┐
│ test-organization-kkjeer/app-test-2       │  16697 │
│ test-organization-kkjeer/bot-validation-2 │  15326 │
│ microsoft/winget-pkgs                     │  14099 │
│ conda-forge/releases                      │  13332 │
│ NixOS/nixpkgs                             │  12860 │
└───────────────────────────────────────────┴────────┘

Use Quickwit search inside ClickHouse

ClickHouse has an exciting feature called the URL Table Engine that queries data from a remote HTTP/HTTPS server. This is precisely what we need: by creating a table that points to the Quickwit search stream endpoint, we can fetch, from ClickHouse, the IDs matching a query.

SELECT count(*) FROM url('http://127.0.0.1:7280/api/v1/gh-archive/search/stream?query=log4j+OR+log4shell&fast_field=id&output_format=click_house_row_binary', RowBinary, 'id UInt64')

┌─count()─┐
│  217469 │
└─────────┘

1 row in set. Elapsed: 0.068 sec. Processed 217.47 thousand rows, 1.74 MB (3.19 million rows/s., 25.55 MB/s.)

We are fetching 217,469 u64 IDs in 0.068 seconds. That's 3.19 million rows per second, which is not bad. And it's possible to increase the throughput further if the fast field values are already cached.
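If you prefer a reusable named table over the inline url() call, the same endpoint can also be wrapped in a URL engine table. A sketch, assuming the same local Quickwit instance and keeping the query fixed in the URL; the table name quickwit_log4j_ids is just an example:

CREATE TABLE quickwit_log4j_ids (id UInt64)
ENGINE = URL('http://127.0.0.1:7280/api/v1/gh-archive/search/stream?query=log4j+OR+log4shell&fast_field=id&output_format=click_house_row_binary', RowBinary);

SELECT count(*) FROM quickwit_log4j_ids;

Each SELECT on this table re-issues the HTTP request to Quickwit, so it always reflects the current content of the index.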

Let's do another example with a more exciting query that matches log4j or log4shell and counts the events per day:

SELECT
    count(*),
    toDate(fromUnixTimestamp64Milli(created_at)) AS date
FROM github_events
WHERE id IN (
    SELECT id
    FROM url('http://127.0.0.1:7280/api/v1/gh-archive/search/stream?query=log4j+OR+log4shell&fast_field=id&output_format=click_house_row_binary', RowBinary, 'id UInt64')
)
GROUP BY date
Query id: 10cb0d5a-7817-424e-8248-820fa2c425b8

┌─count()─┬───────date─┐
│      96 │ 2021-12-01 │
│      66 │ 2021-12-02 │
│      70 │ 2021-12-03 │
│      62 │ 2021-12-04 │
│      67 │ 2021-12-05 │
│     167 │ 2021-12-06 │
│     140 │ 2021-12-07 │
│     104 │ 2021-12-08 │
│     157 │ 2021-12-09 │
│   88110 │ 2021-12-10 │
│    2937 │ 2021-12-11 │
│    1533 │ 2021-12-12 │
│    5935 │ 2021-12-13 │
│  118025 │ 2021-12-14 │
└─────────┴────────────┘

14 rows in set. Elapsed: 0.124 sec. Processed 8.35 million rows, 123.10 MB (67.42 million rows/s., 993.55 MB/s.)

We can see two spikes, on 2021-12-10 and 2021-12-14.
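The same pattern works for any aggregation, not just a per-day count. As a variation, here is a sketch that ranks repositories by the number of matching events, reusing the exact same url() subquery (the results will of course depend on your dataset):

SELECT
    repo_name,
    count(*) AS matching_events
FROM github_events
WHERE id IN (
    SELECT id
    FROM url('http://127.0.0.1:7280/api/v1/gh-archive/search/stream?query=log4j+OR+log4shell&fast_field=id&output_format=click_house_row_binary', RowBinary, 'id UInt64')
)
GROUP BY repo_name
ORDER BY matching_events DESC
LIMIT 10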

Wrapping up

We have just scratched the surface of full-text search from ClickHouse with this small subset of the GitHub archive. You can play with the complete dataset, which you can download from our public S3 bucket: monthly gzipped NDJSON files are available from 2015 until 2021.

The search stream endpoint is powerful enough to stream 100 million IDs to ClickHouse in less than 2 seconds on a multi-TB dataset, so you should be comfortable playing with search streaming on even bigger datasets.