A math puzzle and a better algorithm for top-k
A math brain teaser
An old friend is currently visiting Tokyo with his family. This weekend, we went for a walk with our daughters in Yoyogi park. Knowing I have a sweet tooth for math puzzles, he told me about one he got from a coworker.
While I had never heard exactly that puzzle before, something immediately felt familiar. I did not know the answer but suspiciously rushed in the right direction and solved it a little too quickly. At first, I could not explain this feeling of déjà vu, so I pocketed the petty glory without telling him.
Later, I understood the source of the déjà vu. This math problem is directly related to search, so I had probably derived the same calculation once in the past.
Here is the problem...
Let's put the world population (about 8 billion people) in a line. Each person is said to be the tallest-so-far if they are taller than all the people in front of them.
For instance, we could have a line with people with the following heights:
[175cm, 120cm, 172cm, 90cm, 156cm, 190cm, 192cm, 160cm, 185cm, 125cm, ...]
Here, 175cm, 190cm, and 192cm occupy the positions matching the tallest-so-far property.
The question is: on average, how many people in this line of 8 billion people are tallest-so-far?
In mathy words, what is the expected value of the total number of tallest-so-far people in the line?
Let me insert a rabbit to avoid any spoilers.
The solution relies on a common trick: express the problem as a sum of random variables and rely on the linearity of expectation.
Let's consider the random variables $X_i$, for $i = 1, \dots, N$ (with $N = 8 \times 10^9$), defined by $X_i = 1$ if the $i$-th person in the line is tallest-so-far, and $X_i = 0$ otherwise.

Then, our solution $S$ is equal to the sum of these 8 billion random variables:

$$S = \sum_{i=1}^{N} X_i$$

Even if our random variables are not independent, expectation is still linear:

$$\mathbb{E}[S] = \sum_{i=1}^{N} \mathbb{E}[X_i] = \sum_{i=1}^{N} \mathbb{P}(X_i = 1)$$

The probability for the $i$-th person to be the tallest out of the first $i$ people is $\frac{1}{i}$. So finally, we get:

$$\mathbb{E}[S] = \sum_{i=1}^{N} \frac{1}{i}$$

We recognize here the harmonic series $H_N = \sum_{i=1}^{N} \frac{1}{i}$. It is equivalent to $\ln N$ as $N$ grows. We can even use the Euler-Maclaurin formula to get a very accurate approximation: $H_N \approx \ln N + \gamma$, where $\gamma \approx 0.577$ is the Euler-Mascheroni constant.
Did you have an intuition about the range of the result?
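For a concrete feel of the number, here is a small standalone Rust sketch (illustrative only; the constants and helper names are arbitrary) that estimates the tallest-so-far count by simulation on a smaller shuffled array and compares it with the $\ln n + \gamma$ approximation:

// Illustrative sketch: estimate the expected number of tallest-so-far people
// by simulation on shuffled data and compare it with ln(n) + gamma.

// Minimal xorshift PRNG so the example has no external dependencies.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn tallest_so_far_count(heights: &[u64]) -> usize {
    let mut count = 0;
    let mut max_so_far = 0u64;
    for &height in heights {
        if height > max_so_far {
            count += 1;
            max_so_far = height;
        }
    }
    count
}

fn main() {
    const N: usize = 1_000_000; // 8 billion people would take a while; the math is the same
    const TRIALS: usize = 100;
    let mut heights: Vec<u64> = (1..=N as u64).collect();
    let mut state = 0x9E37_79B9_7F4A_7C15u64;
    let mut total = 0usize;
    for _ in 0..TRIALS {
        // Fisher-Yates shuffle driven by the xorshift PRNG.
        for i in (1..heights.len()).rev() {
            let j = (xorshift(&mut state) % (i as u64 + 1)) as usize;
            heights.swap(i, j);
        }
        total += tallest_so_far_count(&heights);
    }
    let simulated = total as f64 / TRIALS as f64;
    // Euler-Maclaurin / harmonic number approximation: H_n ~ ln(n) + gamma.
    let approx = (N as f64).ln() + 0.5772;
    println!("simulated average: {simulated:.2}, ln(n) + gamma: {approx:.2}");
}

Plugging in $n = 8 \times 10^9$ gives about $23.4$: out of 8 billion people, only around 23 are expected to be tallest-so-far.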
If you squint a little, this little problem also gives the expected number of world records in a sport, in the absence of technical progress or doping, for instance.
And if you squint harder, you will discover this problem is also related to search!
Top-K in search
So how is this problem related to search? Well, search engines usually work by creating an iterator over the matching document IDs and their scores.
Once the iterator is created, we need to iterate through the documents and keep track of the K documents with the highest scores so far. This is a famous algorithmic problem called "Top-K".
The usual solution consists of maintaining a min-heap with the top K documents so far. At all times, the minimum of the heap gives us the threshold above which a document should enter the top K and replace one of the elements of the heap.
A textbook implementation could look like this:
use std::cmp::Ordering;
use std::collections::BinaryHeap;

pub struct Hit {
    pub score: u64,
    pub doc: u32,
}

impl Ord for Hit {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reverse the ordering so that Rust's BinaryHeap (a max-heap)
        // behaves as a min-heap over scores.
        self.score.cmp(&other.score).reverse()
    }
}

/* ... implementations of PartialOrd, Eq, PartialEq ... */

pub fn heap_top_k(mut hits: impl Iterator<Item = Hit>, k: usize) -> Vec<Hit> {
    assert!(k > 0);
    let mut top_k = BinaryHeap::with_capacity(k);
    // Fill the heap with the first k hits.
    top_k.extend((&mut hits).take(k));
    if top_k.is_empty() {
        return Vec::new();
    }
    // The heap's minimum score is the threshold to beat.
    let mut threshold = top_k.peek().unwrap().score;
    for hit in hits {
        if hit.score <= threshold {
            continue;
        }
        // Replace the current minimum with the new hit, then refresh the threshold.
        let mut head = top_k.peek_mut().unwrap();
        *head = hit;
        drop(head);
        threshold = top_k.peek().unwrap().score;
    }
    top_k.into_sorted_vec()
}
The worst-case complexity is obtained for a list sorted by increasing score. In that case, heapreplace is called for every single document, and we get a complexity of $O(n \log k)$, where n is the number of documents and k the number of documents we want to keep.
But what is the average-case complexity? Is it really the same? I have yet to see a textbook mentioning it, and ChatGPT does not seem to know better.
The puzzle comes in handy
Our puzzle is equivalent to computing the average complexity of a top-1 collector.
The property of being the tallest-so-far just means that we exceed the threshold and need to update the heap.
For $k \geq 1$, the computation of the average complexity of top-K is just a generalization of our problem. Let's count the number of calls to heapreplace(..).

We use the following random variable: $X_i = 1$ if the $i$-th document triggers a call to heapreplace (i.e., it belongs to the top K of the first $i$ documents), and $X_i = 0$ otherwise.

The number of times we will call heapreplace is the sum of these random variables:

$$C = \sum_{i=k+1}^{n} X_i$$

For all $i > k$, the probability for the $i$-th element to be in the top K of the first $i$ elements is $\frac{k}{i}$. So we have:

$$\mathbb{E}[C] = \sum_{i=k+1}^{n} \frac{k}{i} \approx k \ln\left(\frac{n}{k}\right)$$

Each call to heapreplace itself has a complexity of $O(\log k)$, so, counting the $O(n)$ cost of scanning all the documents, we end up with an average complexity of $O\!\left(n + k \log(k) \ln\left(\frac{n}{k}\right)\right)$.
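To get a feel for the $k \ln(n/k)$ estimate, here is a small standalone sketch (illustrative only, separate from the heap_top_k implementation above; the constants are arbitrary) that counts how many heap updates a top-K collector performs on randomly ordered scores:

// Illustrative sketch: count how often a top-k collector updates its heap on
// randomly ordered scores, and compare the count with k * ln(n / k).
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn count_heap_updates(scores: &[u64], k: usize) -> usize {
    // Reverse(..) turns Rust's max-heap into a min-heap over scores.
    let mut heap: BinaryHeap<Reverse<u64>> =
        scores[..k].iter().copied().map(Reverse).collect();
    let mut updates = 0;
    for &score in &scores[k..] {
        if score > heap.peek().unwrap().0 {
            // Same effect as heapreplace: swap out the minimum for the new score.
            *heap.peek_mut().unwrap() = Reverse(score);
            updates += 1;
        }
    }
    updates
}

fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    const N: usize = 1_000_000;
    const K: usize = 100;
    let mut scores: Vec<u64> = (0..N as u64).collect();
    // Fisher-Yates shuffle so the scores arrive in random order.
    let mut state = 0xDEADBEEFu64;
    for i in (1..scores.len()).rev() {
        let j = (xorshift(&mut state) % (i as u64 + 1)) as usize;
        scores.swap(i, j);
    }
    let updates = count_heap_updates(&scores, K);
    let expected = K as f64 * (N as f64 / K as f64).ln();
    println!("heap updates: {updates}, k * ln(n/k) ~= {expected:.0}");
}

For $n = 10^6$ and $k = 100$, $k \ln(n/k) \approx 921$: the heap is touched fewer than a thousand times over a million documents.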
A better algorithm
Tantivy actually uses a different solution, with a complexity of $O(n + k \log k)$ even in the worst case. It was taught to us by ChanMin, who was my teammate at Indeed, and was implemented years later by Pascal.
The intuition is that we do not need to update the threshold every time we insert a new element.
Instead of maintaining a perfect threshold at all times using a min-heap, we can update it only about once every k insertions. This can be done efficiently by filling a buffer of size $2k$ with the scored document IDs. When the buffer reaches its capacity, we compute the new threshold using a median-selection algorithm and discard all elements below the median.
We can then rinse and repeat the operation, at most $\frac{n}{k}$ times.
In Rust, the implementation looks like this:
fn top_k(mut hits: impl Iterator<Item = Hit>, k: usize) -> Vec<Hit> {
    assert!(k > 0);
    // Buffer holding between k and 2k candidates.
    let mut top_k = Vec::with_capacity(2 * k);
    top_k.extend((&mut hits).take(k));
    let mut threshold = 0u64;
    for hit in hits {
        if hit.score <= threshold {
            continue;
        }
        top_k.push(hit);
        if top_k.len() == 2 * k {
            // The standard library does all of the heavy lifting here.
            // With Hit's reversed ordering, index k - 1 holds the k-th best score.
            let (_, median_el, _) = top_k.select_nth_unstable(k - 1);
            threshold = median_el.score;
            top_k.truncate(k);
        }
    }
    // Thanks to the reversed Ord on Hit, this sorts by decreasing score.
    top_k.sort_unstable();
    top_k.truncate(k);
    top_k
}
select_nth_unstable is linear in the size of the buffer, i.e. $O(k)$. Also, the number of flush operations is at most $\frac{n}{k}$, so all of the flushes together cost $O(n)$. Finally, we still have to sort our k elements, hence the worst-case complexity: $O(n + k \log k)$.
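Before benchmarking, a quick sanity check that the two implementations agree on randomly ordered input can be written as a test. The sketch below is only illustrative; its helper names are arbitrary and it assumes the PartialOrd/Eq/PartialEq implementations for Hit elided above:

// Sanity check: both implementations should return the same scores on
// randomly ordered input.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn heap_and_median_top_k_agree() {
        // Deterministic pseudo-random scores, no external crates needed.
        let mut state = 0x1234_5678_9ABC_DEF0u64;
        let hits: Vec<Hit> = (0..10_000u32)
            .map(|doc| {
                state ^= state << 13;
                state ^= state >> 7;
                state ^= state << 17;
                Hit { score: state, doc }
            })
            .collect();
        for k in [1usize, 2, 10, 100] {
            let heap_scores: Vec<u64> =
                heap_top_k(hits.iter().map(|h| Hit { score: h.score, doc: h.doc }), k)
                    .iter()
                    .map(|h| h.score)
                    .collect();
            let median_scores: Vec<u64> =
                top_k(hits.iter().map(|h| Hit { score: h.score, doc: h.doc }), k)
                    .iter()
                    .map(|h| h.score)
                    .collect();
            assert_eq!(heap_scores, median_scores);
        }
    }
}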
Benchmarks
Of course, the theory does not capture the characteristics of modern CPUs. Performance is usually guided by the memory hierarchy and the branch predictor, so any decision should be validated by benchmarks.
Let's have a look at the performance of the two algorithms for different values of k.
The code used for the benchmark is slightly different from the one shown in this post. You can find the code here.
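The linked benchmark uses a proper harness; purely as an illustration of the setup (a rough sketch, not the code that produced the numbers below), a minimal hand-rolled timing loop could look like this:

// Rough illustration of the benchmark setup, not the code that produced the
// numbers below: a hand-rolled timing loop around the two implementations.
use std::time::Instant;

fn bench(name: &str, mut run: impl FnMut() -> Vec<Hit>) {
    const RUNS: u32 = 20;
    // One warm-up run so caches and the branch predictor settle a little.
    let _ = run();
    let start = Instant::now();
    for _ in 0..RUNS {
        // black_box prevents the compiler from optimizing the work away.
        std::hint::black_box(run());
    }
    println!("{name}: {:?} per run", start.elapsed() / RUNS);
}

fn main() {
    let n = 1_000_000u32;
    // A cheap bijection on u32 (multiplication by an odd constant) stands in
    // for randomly shuffled scores.
    let make_hits = || (0..n).map(|doc| Hit { score: doc.wrapping_mul(2_654_435_761) as u64, doc });
    for k in [1usize, 16, 256, 2048] {
        bench(&format!("HeapTopK   k={k}"), || heap_top_k(make_hits(), k));
        bench(&format!("MedianTopK k={k}"), || top_k(make_hits(), k));
    }
}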
First, let's have a look at the results for the average case, in which all of the docs are randomly shuffled.
Here is the performance for different values of k (n = 1_000_000).
topk fastest │ slowest │ median │ mean
╰─ shuffled │ │ │
├─ HeapTopK │ │ │
│ ├─ 1 296.1 µs │ 462.8 µs │ 298 µs │ 304.4 µs
│ ├─ 2 294.1 µs │ 773.4 µs │ 299.6 µs │ 312 µs
│ ├─ 4 297.2 µs │ 331.6 µs │ 299.9 µs │ 303.6 µs
│ ├─ 8 299.2 µs │ 347.4 µs │ 301.7 µs │ 304.2 µs
│ ├─ 16 301.9 µs │ 350.4 µs │ 304.2 µs │ 308 µs
│ ├─ 32 307.4 µs │ 355.4 µs │ 310.7 µs │ 313.6 µs
│ ├─ 64 318.4 µs │ 364.4 µs │ 322.5 µs │ 326.6 µs
│ ├─ 128 339.8 µs │ 401.6 µs │ 345.2 µs │ 347.9 µs
│ ├─ 256 384.3 µs │ 438 µs │ 389.3 µs │ 392.1 µs
│ ├─ 512 467.5 µs │ 525.8 µs │ 474.2 µs │ 478.2 µs
│ ├─ 1024 625.7 µs │ 705.3 µs │ 642.9 µs │ 647 µs
│ ╰─ 2048 939.5 µs │ 1.061 ms │ 957.9 µs │ 961.9 µs
╰─ MedianTopK │ │ │
├─ 1 296 µs │ 329.2 µs │ 299.5 µs │ 302.9 µs
├─ 2 296.9 µs │ 335.7 µs │ 298.9 µs │ 301.5 µs
├─ 4 297.4 µs │ 348 µs │ 300 µs │ 303.1 µs
├─ 8 299.5 µs │ 335.6 µs │ 301.5 µs │ 305.2 µs
├─ 16 301.7 µs │ 354.9 µs │ 305.1 µs │ 308.9 µs
├─ 32 306.5 µs │ 346.2 µs │ 309.1 µs │ 312.5 µs
├─ 64 312.8 µs │ 355.5 µs │ 315.5 µs │ 319 µs
├─ 128 322.3 µs │ 369.3 µs │ 326.9 µs │ 330.4 µs
├─ 256 340.4 µs │ 374.6 µs │ 345.1 µs │ 347.2 µs
├─ 512 369.8 µs │ 426.6 µs │ 375.8 µs │ 379.3 µs
├─ 1024 415.8 µs │ 481.6 µs │ 423.3 µs │ 427.8 µs
╰─ 2048 491.5 µs │ 552.9 µs │ 503.8 µs │ 510.2 µs
We see that both algorithms behave relatively nicely. The heap-based algorithm's performance degrades as k increases, but only slowly.
Overall, however, it is already a win for the median-based algorithm.
Now here are the results for the worst case, in which the values arrive already sorted by increasing score:
topk fastest │ slowest │ median │ mean
╰─ sorted │ │ │
├─ HeapTopK │ │ │
│ ├─ 1 752.3 µs │ 833.6 µs │ 760.1 µs │ 765.8 µs
│ ├─ 2 1.677 ms │ 1.781 ms │ 1.684 ms │ 1.689 ms
│ ├─ 4 6.719 ms │ 7.372 ms │ 7.077 ms │ 7.072 ms
│ ├─ 8 11.24 ms │ 11.63 ms │ 11.35 ms │ 11.36 ms
│ ├─ 16 12.43 ms │ 12.88 ms │ 12.58 ms │ 12.59 ms
│ ├─ 32 13.98 ms │ 14.35 ms │ 14.11 ms │ 14.12 ms
│ ├─ 64 14.33 ms │ 28.94 ms │ 14.52 ms │ 14.71 ms
│ ├─ 128 14.33 ms │ 15.23 ms │ 14.51 ms │ 14.51 ms
│ ├─ 256 15.62 ms │ 16.22 ms │ 15.83 ms │ 15.84 ms
│ ├─ 512 18.24 ms │ 18.75 ms │ 18.44 ms │ 18.43 ms
│ ├─ 1024 21.84 ms │ 22.45 ms │ 22.07 ms │ 22.06 ms
│ ╰─ 2048 24.7 ms │ 25.47 ms │ 25.03 ms │ 25.03 ms
╰─ MedianTopK │ │ │
├─ 1 4.255 ms │ 4.487 ms │ 4.283 ms │ 4.293 ms
├─ 2 3.557 ms │ 3.708 ms │ 3.576 ms │ 3.588 ms
├─ 4 4.192 ms │ 4.744 ms │ 4.208 ms │ 4.236 ms
├─ 8 2.88 ms │ 3.06 ms │ 2.897 ms │ 2.908 ms
├─ 16 2.353 ms │ 2.489 ms │ 2.363 ms │ 2.372 ms
├─ 32 2.237 ms │ 2.437 ms │ 2.247 ms │ 2.259 ms
├─ 64 2.13 ms │ 2.328 ms │ 2.15 ms │ 2.159 ms
├─ 128 2.31 ms │ 2.466 ms │ 2.327 ms │ 2.338 ms
├─ 256 2.257 ms │ 2.389 ms │ 2.266 ms │ 2.273 ms
├─ 512 2.22 ms │ 2.386 ms │ 2.23 ms │ 2.243 ms
├─ 1024 2.195 ms │ 2.32 ms │ 2.201 ms │ 2.21 ms
╰─ 2048 2.196 ms │ 2.391 ms │ 2.206 ms │ 2.22 ms
This is where the $O(n + k \log k)$ worst case really delivers!