In learning about Haskell recently, I was introduced to a recursive data structure I had never used before (at least not knowingly) – the binary search tree (BST). The BST is made up of nodes that have three main features:
- They contain a value
- They can refer to another node to the left with a smaller value
- They can refer to another node to the right with a larger value
So let’s throw a simple problem at this: determining whether a given integer is a member of an array.
Can we construct a binary search tree in Ruby that is faster than Ruby’s native C-implemented Array class? (Spoilers: yes.)
1. Implementation
My first implementation of this had an overarching Tree
class, but it soon became clear that it’s entirely unccessary. As a recursive data structure, each node has a tree descending from it, so we just need a Node
. The tree will be implied.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
Great! But also: awful. Instead of constructing the tree manually, we need to be able to treat it as if it were an array. We should be able to apply the shovel operator to the base node of the tree and have the tree place the value wherever it should rightfully or leftfully go.
1 2 3 4 5 6 7 8 9 10 11 |
|
This uses Ruby’s spaceshipesque comparator <=>
to determine if the value to be inserted is greater than, less than, or equal to the value of the current node, and then traverses the tree until …
1 2 |
|
Spectacular! Except: not. We have a node that expected its left value to respond to insert
, to which nil
annoyingly refused. We can redefine more specific insert methods to work around this issue (and, since it’s getting a bit hard to read, a new inspect
method):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
|
The next step is to determine whether our tree contains a given value.
This is where the Binary Search Tree has a reputation for speediness – unlike
iterating over every element of an array and checking for equality, the structure of the
tree provides a sort of automatic index that points us to where the value should be,
and then we can check if it’s there. It’ll look remarkably similar to our insert
method:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
If you were paying attention to the insert
method, you can probably guess that when this method reaches a left
or right
that is nil
, it will fail. Which is really annoying! But since this seems to be a pattern we have stumbled upon, let’s find a better way to solve this rather than peppering the code with nil checks. Enter our second class, EmptyNode
:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
…and make sure instances of this class terminate our tree by default:
1 2 3 4 5 6 7 8 9 10 11 |
|
Note: the (*)
formal argument to these EmptyNode
’s methods simply states that we don’t
care how many arguments are passed to the method, and that we won’t be using them anyway.
The EmptyNode
class is useful in that it provides a meaningful end to the recursive
structure — specifically, that a given range of values in the tree are definitively
not present. Otherwise, it does very little. We don’t allow insert
to do anything with it,
because then it wouldn’t be an empty node. Unfortunately, we can’t simply tell it to replace itself
with a Node
object (as that isn’t possible in Ruby), so we have to change the reference back at the parent node:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
Here, we use the or
control flow operator to perform one of two actions: if the first returns
a falsey value, the second (assign the new Node
object).
Ok! So we now have a binary tree that can insert new values at the correct location and tell you whether or not it contains a given value. Let’s check it out:
1 2 3 4 5 6 7 |
|
2. Benchmarks
Let’s benchmark it! This test populates an array with 5000 random values up to 50,000, that checks every value between 1 and 50,000 to see if the array includes it. The same benchmark is repeated for the binary tree containing an identical set of values.
1 2 3 4 5 6 7 8 9 10 11 12 |
|
user system total real
test_array include 13.230000 0.020000 13.250000 ( 13.283172)
binary tree search 0.140000 0.000000 0.140000 ( 0.139983)
I have to say, I was a little surprised how much faster (~100x) this was. It makes sense when you think about the fact that
to check if an element is included in the Array, Ruby needs to run an equality comparison for up to 5000 values 50000 times.
That’s a lot of overhead, and Arrays simply aren’t optimized for this. Ruby has another built-in data structure that is
explicitly designed for fast lookups of arbitrary values — the venerable Hash
.
Similar to a binary search tree, Ruby’s hash tables follow a defined set of rules that guide it to the
proper places in memory when setting and retrieving values. For an in-depth exploration of what makes Hashes fast, read Pat Shaughnessy’s Ruby Under a Microscope.
Let’s rerun the benchmark again, but this time comparing hash lookups as well. For these purposes, it doesn’t matter what the values
are in the hash, so we’ll just make them all true
:
1 2 3 4 5 6 7 |
|
user system total real
test_array include 13.400000 0.020000 13.420000 ( 13.473588)
binary tree search 0.130000 0.000000 0.130000 ( 0.129013)
test_hash lookup 0.000000 0.000000 0.000000 ( 0.008119)
Ruby’s native C-implemented Hash is around 15 times faster than the Ruby-implemented binary search tree, which is about what I expected.
3. Array Conversions
In order to convert arrays into binary trees and back again, let’s introduce a few new methods. The first will be a module method:
1 2 3 4 5 6 7 |
|
from_array
simply assigns the root node of the tree as the first value of the array, then pushes
all array values on in order. Converting back to an array is a simple matter of traversing the recursive tree.
An interesting side effect is that if done in a particular way, this is equivalent to calling .uniq.sort
on the original array (as far as I know,
it’s impossible to maintain the original order).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
In case it’s not clear how the recursion works, here’s what the array expansion looks like for a simple tree {10:{5:{}|{}}|{15:{}|{}}}
:
- For both 5 and 15,
left.to_a
andright.to_a
are[]
(EmptyNode#to_a
), so the results are[5]
and[15]
respectively - For 10,
left.to_a
is[5]
andright.to_a
is[15]
, giving[5] + [10] + [15]
or[5, 10, 15]
We can test it on an example with more elements:
1 2 3 4 |
|
Interestingly, it’s faster to convert a large array into a binary tree and perform a search than it is to call include?
on the Array.
1 2 3 4 5 6 7 8 9 |
|
user system total real
array#include? 13.160000 0.020000 13.180000 ( 13.235368)
binary search 0.190000 0.000000 0.190000 ( 0.188989)
It takes about 50% longer than just the binary tree search itself, which makes sense because it traverses the tree twice (once to insert values and once to query them). It doesn’t take twice as long, because we start with a small tree (a single node) and build it up gradually as the values are inserted.
4. Why would I use this?
Because of nerdliness?
Honestly cannot think of an instance where this would have been useful to me in a Ruby project, including those where I’m juggling querying enormous quantities of data.
As much as we are loathe to admit it, there are other languages that behave differently from Ruby and are used for different things. Here’s a StackOverflow question describing some of them.
Click here to read part 2: Reimplementing Ruby’s Hash using binary search trees!
Appendix
Here is the code in its entirety: