Introduction to Functional Programming in Ruby: Linked Lists

Dec 15th, 2014

As a Ruby-centric web developer, I am well-versed in Object-Oriented Design. It’s a paradigm that serves its purpose well, but at times it can feel a bit insular. I feel it’s difficult to completely understand how and why a system is put together the way it is without studying systems based on other paradigms. So around six months ago I set myself to the task of understanding functional programming. And looking back on my naive mid-2014 self, I would never have guessed how different pure FP really is, even compared to a function-heavy language like JavaScript.

I chose Haskell as my first functional language for several reasons — it’s often regarded as the ‘purest’ functional language, it’s deeply rooted in academia (so a lot of FP research is based on it), and it has a reasonably mature ecosystem. This article isn’t about Haskell per se, though. It’s meant to be a brief introduction to functional concepts presented in the familiar Ruby syntax. I’ll use Haskell merely as a reference.

What follows is a solution to an exercism.io problem (Simple Linked List), written in a functional style, coming as close as possible to a Haskell implementation but still being generally usable as Ruby code. I hope this isn’t necessary to say, but this is not meant to be used in production code, for almost the same reasons a metal fork is not to be used to pluck bread out of a toaster. Given the differences in laziness, typing, etc, Ruby is not nearly as well suited for this as Haskell is. But my hope is that it impels a bunch of you OO purists to check out FP. It’s a real mind bender when you get started, but it will make you feel smart as hell once you start to understand it.

Who this is for:

Ideally, you’ll be well-versed in Ruby, particularly the way procs work. There’s a bit of OO design at work here, but that’s not a necessary skill to understanding this.

SO:

1. What’s a linked list?

There’s no built-in linked list in Ruby. In contrast, Haskell uses it everywhere. The default List implementation is a linked list with all sorts of fancy goodness attached, but the basic idea is that you have a node with two elements: a datum and a reference to the next node. The list terminates in an empty node. An explicit way to represent a linked list would be:

1 `cons` (2 `cons` (3 `cons` Nil))

In Haskell, cons is implemented as an infix : operator; Nil is []:

1 : 2 : 3 : []

…but it can also be written (and more often is) just like a Ruby array:

[1, 2, 3]

The important point is that it’s a recursive data structure, and most functions that operate on it are defined in a recursive style (or in terms of a lower-level recursively-defined function— we’ll see what that means shortly).

In Ruby, we’ll need a container to act as our list constructor, so let’s make a simple class:

class Element
  attr_reader :datum, :_next
  alias_method :next, :_next

  def initialize(datum, _next = nil)
    @datum, @_next = datum, _next
  end
end

Now we can represent our list as:

list = Element.new(1, Element.new(2, Element.new(3)))
list.next.datum #=> 2
list.next.next.datum #=> 3

It’s not a very compact way to represent our data, so let’s implement a temporary to_a method that converts our LL back to a standard Ruby array:

def to_a
  if _next
    [datum] + _next.to_a
  else
    [datum]
  end
end

Ah, we’re already into recursion territory. But five lines to generate an array! I’d rather just have [datum] + _next.to_a and be done with it, but the nil at the end of the list won’t let me do that. If only there was a way to represent nil as an empty list, with all of the behavior we expect from a linked list.

class Element
  def initialize(datum, _next = nil)
    @datum, @_next = datum, (_next || Null.new)
  end

  def to_a
    [datum] + _next.to_a
  end
end

class Null
  def to_a
    []
  end
end

Ok, so now that our list terminator can respond to the recursive .to_a call, it not only terminates the list, it also terminates the recursion. This is important, unless you have an infinite stack and a very large capacity for hanging around waiting for programs that will never complete.

And now it’s only fair that we should be able to convert a Ruby array into a Linked List. In other words, we need to turn a flat data structure into a recursive one (hint: it will probably involve recursion). In Haskell, an important feature of recursive list functions is pattern matching on the head and tail of lists, where head is the first element, and tail is another list consisting of everything except the head. So the head of 1 : 2 : 3 : [] is 1, and the tail is 2 : 3 : []. This is often represented in the form (head:tail), or more commonly, (x:xs). We can do approximately the same thing with a Ruby array using destructuring assignment: head, *tail = ary. Unfortunately, we can’t pattern match an empty array in Ruby, so we’ll have to introduce a guard clause to end the recursion when the list is empty:

class Element
  class << self
    def from_a(ary)
      head, *tail = ary
      return Null.new unless head
      new(head, from_a(tail))
    end
  end
end

Element.from_a([1, 2, 3]) # => #<Element:0x007f94f1b3e010 @datum=1, @_next=#<Element:0x007f94f1b3e088 @datum=2, @_next=#<Element:0x007f94f1b3e128 @datum=3, @_next=#<Null:0x007f94f1b3e178>>>>

2. Functions!

Now that we’ve got this recursive conversion boilerplate out of the way, it’s time for some actual functional programming.

2.1 `CONS`

The first thing we want to do is express the list constructor as a lambda function. In Haskell, which has first-class functions (pretty much everything’s a function), the list constructor (:) can be passed around as an argument to other functions. (:) is a binary function (meaning it takes two arguments), and if you boot up ghci and get its type signature (:t (:)), you’ll find:

(:) :: a -> [a] -> [a]

Which is to say that (:) is a function that takes an argument of type a, another argument that is a list of as, and it returns a list of as. In terms of our Element class, that’s really just the new class method. Wrapping a lambda around that is pretty trivial:

class Element
  CONS = -> (x, y) { new(x, y) }
end

CONS has most of the features of Haskell’s (:), except homogenous types are not enforced. It takes a value and a list, and prepends that value to the list. Pretty simple. We don’t really have much to do with CONS for the moment, because there aren’t yet any functions to which to feed it.

NOTE: Throughout this exercise, I’ll use a combination of explicitly-defined lambdas and Symbol#to_proc to generate functions. The decision of which one to use rests with maintaining the arity (number of arguments) of the desired function. CONS is a binary function as defined here; :new.to_proc would need Element passed in as an additional argument, making it a trinary function. We’ll see an example of a workable symbol-to-proc conversion later.

2.2 Folds!

Folds are functions that are used to implement a huge range of list operations. There are two basic types: left folds (foldl) and right folds (foldr), named for the side of the list they start with.

Here’s Haskell’s type signature for foldl:

foldl :: (a -> b -> a) -> a -> [b] -> a

That intimidating looking thing, in plain English, says: “foldl takes a function f (that’s the part in parentheses), a value a, and a list of bs, and returns another value a, where f is a binary function that takes an a and a b and returns an a.”

foldr is similar:

foldr :: (a -> b -> b) -> b -> [a] -> b

Or: “foldr takes a function f, a value b, and a list of as, and returns a b, where f is a function that takes an a and a b and returns a b.

Compare that to Ruby’s reduce/inject:

inject(init) { |accumulator, e| ... }

Or: “inject is an enumerable method that takes an initial value and a block (or function) that takes an accumulator and an element, and returns the accumulated result of iterating over the enumerator with that block.” Or: (a -> b -> a) -> a -> [b] -> a. Same deal.

Let’s look at foldl first:

class Element
  # ...

  def foldl(func, init)
    _next.foldl(func, func[init, datum])
  end
end

While in Haskell’s folds, the list is passed as an argument to the function, in Ruby, it’s the receiver of the fold methods (but the effect is the same). There’s a lot of behavior encapsulated in this short definition, so let’s explore the example of summing the elements of a list:

my_list = Element.from_a([1, 2, 3])

# we need a binary function that takes in an initial value and an element, and returns a result:
sum = -> (x, y) { x + y }

# The sum should start with 0, so we'll make that the `init`:
my_list.foldl(sum, 0)

# stepping through the recursion:
_next.foldl(sum, sum[0, 1]) # for (Element 1, ...)
_next.foldl(sum, 1)         # we've already arrived at something that looks
                            # just like our initial call, but for (Element 2, ...).

_next.foldl(sum, sum[1, 2])
_next.foldl(sum, 3) # now we're on (Element 3, ...)
_next.foldl(sum, sum[3, 3])
_next.foldl(sum, 6) # now we're on the terminator ...
# => NoMethodError: undefined method `fold' for #<Null:0...>

Whoops, Null has no idea what to do. Remember back when Null was responsible for terminating the recursive to_a calls? It’ll have a similar responsibility here, and it’ll do it by simply not calling the function. Remember, Null isn’t exactly nil; it represents an empty linked list. So folding it with a function and an initial value should just return the initial value:

class Null
  def foldl(_, init)
    init
  end
end

(The underscore ignores the first argument). Skipping back to the last step in the example, Null.new.foldl(sum, 6) #=> 6. The result of that entire recursive call stack for Element.from_a([1, 2, 3]).foldl(sum, 0) is 6, (which is correct).

foldr looks a lot like foldl, but it starts at the right of the list. Since the right of the list is always Null (always), let’s start with that definition:

class Null
  def foldr(_, init)
    init
  end
end

Not surprisingly, it’s identical to foldl. How about the rest of the list?

class Element
  def foldr(func, init)
    func[datum, _next.foldr(func, init)]
  end
end

I’m not going to step through this one, but if you want to, you’ll find that rather than evaluating the function with the first element and passing it along as the init (as with foldl), foldr passes on the recursion first, than pops the results back up the call stack as inits.

Also, I’d like to point out something cool: Symbol#to_proc works with foldl and foldr!

Element.from_a([1, 2, 3]).foldl(:+.to_proc, 0) #=> 6

A welcome compatibility.

2.3 Length

How would we duplicate the behavior of Array#length in a functional way? In a linked list, there’s no single overarching object to query. A recursive definition would be “1 plus the length of the tail”, where the length of Null is 0:

class Element
  def length
    1 + _next.length
  end
end

class Null
  def length; 0 end
end

Fine, whatever. But we worked so hard on our folds— both foldl and foldr represent abstractions around the idea of recursion, and can be used to define functions that would otherwise recurse. To wit:

class Element
  def length
    foldl ->(acc, _) { acc + 1 }, 0
  end
end

class Null
  # just delete that sucker
end

Now with foldl, we can just say, “starting at 0, add one for every element in the list.” The fold gives us the ability to do something like Ruby’s .each in a data structure that’s pretty unfamiliar with the concept of iterators. And Null doesn’t need to know a thing about it; it’ll simply not apply the function to init, as we’ve told it in its own foldl definition.

Element.from_a([1, 2, 3]).length # => 3

2.4 Map

Here’s Haskell’s type signature for map:

map :: (a -> b) -> [a] -> [b]

In plain English: “map takes a function f and a list of as, and returns a list of bs, where f maps from an a to a b (similar to how you would feed Ruby’s map a block that deals with individual values, and you get the array at the end for free). So if a linked list pops out at the end, map must know how to build it. As we’ve seen with length, there should be a way to define it in terms of a recursive fold. IS THERE SUCH A WAY?

What a silly question. Remember CONS? CONS is a binary function that takes a value and a list, and returns a list, or in plain Haskell type signature terms: (a -> b -> b), where a is the value and b is a list. And THAT type signature looks exactly like the one signing the function passed to foldr ((a -> b -> b) -> b -> [a] -> b). So what happens if we drop it into a foldr?

Element.from_a([1, 2, 3]).foldr(Element::CONS, Null.new) #=> #<Element:0x007fca4b9161e0 @datum=1, @_next=#<Element:0x007fca4b916230 @datum=2...

We get our same list back! Actually, if you compare the memory addresses, you’ll see it’s a new list with identical data. The same will be true for all of our operations — in the spirit of immutability, we’re not going to mutate anything today.

Let’s take this a step further and say that instead of applying CONS directly, we want to apply f to a datum, and THEN CONS it to the rest of the list. To do that, we’ll compose a new function out of f and CONS:

double = -> (x) { x * 2 }
double_cons = -> (x, new_list){ Element::CONS[double[x], new_list] }

# try it out:
double_cons[1, Null.new].to_a # => [2]

double_cons is the type of function (literally) we can feed to foldr, and can be abstracted for use in map. We want to be able to use it with a block like Ruby’s built in map, so let’s define it using &func as the function argument:

class Element
  def map(&func)
    foldr -> (x, accumulator) { CONS[func[x], accumulator] }, Null.new
  end
end

The empty list at right-hand side becomes our initial value, and we prepend the mapped values from the rest of the list in reverse order, leaving us with a mapped version of the original list:

Element.from_a([1, 2, 3]).map {|x| x * 2 }.to_a # => [2, 4, 6]

Again, since Null knows how to foldr, there’s no additional definition needed for mapping over an empty list.

2.5 Reverse

A recursive explanation of reversing a linked list might go something like this: starting from the left, prepend each element of a list onto a new list:

or:

: 2 : 3 : [] ----- []
: 3 : [] ----- 1 : []
: [] ----- 2 : 1 : []
[] ----- 3 : 2 : 1 : []

(It’s almost magical)

“Starting from the left” makes me think of foldl:

class Element
  def reverse
    foldl -> (acc, x) { CONS[x, acc] }, Null.new
  end
end

Element.from_a([1, 2, 3]).reverse.to_a # => [3, 2, 1]

That was pretty easy. Let’s take a closer look at what we’ve done: taken a binary function CONS and wrapped it in an anonymous binary function -> (acc, x) { ... }, the body of which simply calls CONS with the arguments flipped. This is such a common way of composing functions that Haskell includes it as a function called flip ((a -> b -> c) -> b -> a -> c).

Here’s how it looks in Ruby:

class Element
  FLIP = -> (f) { -> (x, y) { f[y, x] } }
end

FLIP is a function that takes a binary function f and returns a new function that takes two arguments, which it then applies in reverse order to f. This allows us to change -> (acc, x) { CONS[x, acc] } into FLIP[CONS], which handily sort of describes what we’re doing in reverse (but that’s just a semantic coincidence).

class Element
  def reverse
    foldl FLIP[CONS], Null.new
  end
end

Element.from_a([1, 2, 3]).reverse.to_a #=> [3, 2, 1]

2.6 Concat and Flatten

I’m including both of these in the same section because the terminology may get confusing. In Haskell, concat flattens a list of lists, where in Ruby, it connects one array to another (same as ary_a + ary_b). Since we’re writing Ruby, I’ll stick with the Ruby definitions.

concat is a method that takes another list and appends its contents to the end of the receiver list. But since we have defined our CONS function in terms of prepending, let’s think of it like this:

: 2 : 3 : [] ----- 4 : 5 : 6 : []
: 2 : [] ----- 3 : 4 : 5 : 6 : []
: [] ----- 2 : 3 : 4 : 5 : 6 : []
[] ----- 1 : 2 : 3 : 4 : 5 : 6 : []

We don’t actually have a way to pull out the second-from-last item in a linked list, so we’ll need to reverse the first list. Then we can successively prepend each item in the first list to the front of the second list (the second list can be used as the init argument to a fold).

class Element
  def concat(other_list)
    reverse.foldl -> (acc, x) { CONS[x, acc] }, other_list
  end
end

That looks suspiciously familiar, as if the folding function were identical to our initial pass at reverse (because it is). Since we reverse the first list to begin with, however, everything works out to the original order in the end. Simplifying:

class Element
  def concat(other_list)
    reverse.foldl FLIP[CONS], other_list
  end
end

Element.from_a([1, 2, 3]).concat(Element.from_a([4, 5, 6])).to_a #=> [1, 2, 3, 4, 5, 6]

flatten can be thought of in similar terms, though instead of CONSing individual elements to a second list, we will concat a list of lists together. Since concat is a ruby method rather than a proc object, we’ll need to turn it into a binary proc before passing it to FLIP:

class Element
  def flatten
    reverse.foldl FLIP[:concat.to_proc], Null.new
  end
end

list_of_lists = Element.from_a([[1,2,3], [4,5,6], [7,8,9]].map {|ary| Element.from_a(ary) })
list_of_lists.flatten.to_a # => [1, 2, 3, 4, 5, 6, 7, 8, 9]

Note: this isn’t quite like Ruby’s flatten, in that it will only flatten 1 level for each call, and will only work on lists where each level is homogenous, meaning valid Ruby arrays like [[1, 2, 3], 4, 5, 6] wouldn’t work correctly.

2.7 Select

Known as filter in Haskell, select returns a sublist of the receiver by only including objects that satisfy some predicate function: (a -> Bool) -> [a] -> [a]. This is fairly simple to define in terms of CONS using a ternary operation:

class Element
  def select(&func)
    foldr -> (x, acc) { func[x] ? CONS[x, acc] : acc }, Null.new
  end
end

Element.from_a([1, 2, 3, 4]).select(&:even?).to_a # => [2, 4]

Or: “If x satisfies the function f, prepend it to the list, otherwise, just return the list.” Again, we can take advantage of Ruby’s built-in to_proc functionality to use pre-defined predicate methods (like :even?) in our functional system.

2.8 `to_a`, redux

Remember way back toward the beginning of this article, I mentioned that to_a was a temporary method? That’s because it’s recursively defined, and we’ve worked hard to eliminate explicit recursion from every function except the folds. Luckily, Ruby’s Array has an instance method that works sort of like CONS in reverse, so as we use foldl to shift values off the front of the linked list, we can push them directly onto the back of a native Ruby array:

: 2 : 3 : [] ----- []
: 3 : [] ----- [1]
: [] ----- [1, 2]
[] ----- [1, 2, 3]

If we convert push to a proc, it’ll work with foldl

class Element
  def to_a
    foldl :push.to_proc, []
  end
end

class Null
  # delete to_a! The recursion terminates in `foldl`.
end

3. The Complete List

Here’s the complete functional linked list implementation. I’ve updated a few methods to make the exercism tests pass (Element::to_a, Element::from_a, and Null#nil?). I’ve also made Null a subclass of Element, so it responds to all of the methods we defined (this necessitates a no-op for initialize).

class Element
  CONS = -> (x, y) { new(x, y) }
  FLIP = -> (f) { -> (x, y) { f[y, x] } }

  class << self
    def to_a(arg)
      Array(arg)
    end

    def from_a(ary)
      head, *tail = Array(ary)
      return Null.new unless head
      new(head, from_a(tail))
    end
  end

  attr_reader :datum, :_next
  alias_method :next, :_next

  def initialize(datum, _next = nil)
    @datum, @_next = datum, (_next || Null.new)
  end

  def foldl(func, init)
    _next.foldl(func, func[init, datum])
  end

  def foldr(func, init)
    func[datum, _next.foldr(func, init)]
  end

  def to_a
    foldl :push.to_proc, []
  end

  def map(&func)
    foldr -> (x,acc) { CONS[func[x], acc] }, Null.new
  end

  def reverse
    foldl FLIP[CONS], Null.new
  end

  def concat(other_list)
    reverse.foldl FLIP[CONS], other_list
  end

  def length
    foldl ->(acc, _) { acc + 1 }, 0
  end

  def flatten
    reverse.foldl FLIP[:concat.to_proc], Null.new
  end

  def select(&func)
    foldr -> (x, acc) { func[x] ? CONS[x, acc] : acc }, Null.new
  end
end

class Null < Element
  def initialize(*);       end
  def nil?;           true end
  def foldl(_, init); init end
  def foldr(_, init); init end
end

4. Next Steps!

As a challenge to the reader, try implementing some additional functions (any?, include?, all?, take, and drop would be good choices).

There’s a lot of material I didn’t cover:

immutability (though none of the code here mutates anything, so it’s immutable in spirit)
type safety (this would be a lot more difficult to deal with in Ruby)
laziness, tail-call optimization and related topics. I may deal with this in a future article. We’ll see.

For now, I hope this has been enough to pique your curiosity about Haskell and FP in general. If you’re interested in learning more, I highly recommend Learn You a Haskell for Great Good to learn the mechanics of the language, and then check out the exercises on exercism.io. There are a few experienced programmers (and one in particular— thanks @etrepum!) who have been immensely helpful to me in learning how to write idiomatic Haskell. And if nothing else, I hope at least this has revealed a bit of wisdom from other programming paradigms. There’s more than one way to skin a monad, as they say (EDIT: no one says that).

Reimplementing Ruby’s Hash Using Binary Search Trees

Apr 28th, 2014

This is Part 2 of Binary Search Trees in Ruby.

Ruby’s native Hash implementation more or less follows the basic Hash table principle, wherein keys are hashed (using Object#hash) and stored in the appropriate ‘buckets’. In this exercise, we will look at an alternate implementation of a hash-like key-value store using binary search trees.

The basic structure of the nodes will be as follows:

We will use the existing hash method to generate integer values from objects. These will be the node values.
The key object itself will be stored in the node as a separate property.
The value object will also be stored in the node.

The binary tree hash will be a simple drop-in replacement for Ruby’s Hash class. It won’t have enumerable methods, but it will have square bracket notation for looking up and storing values, an implementation of fetch, and a default_proc feature.

1. Retrieving a value

We will use our implementation of the basic binary tree as a starting point (but renaming the classes). The first major difference will be that the value of the node will no longer be its address in the tree – instead, we use the hashed key.

module BinaryTree
  class HashNode
    attr_reader :hashed_key, :key, :value
    attr_accessor :left, :right
  end

  def initialize(key, value)
    @value      = value
    @key        = key
    @hashed_key = key.hash
    @left       = EmptyHashNode.new
    @right      = EmptyHashNode.new
  end
end

Ideally, we would like to be able to use this as a drop-in replacement for Hash, so next let’s implement the square-bracket notation used to retrieve a value. Also, why not write tests for this?

require 'minitest/autorun'

describe BinaryTree::HashNode do
  let(:bt_hash){ BinaryTree::HashNode.new(:test, 100) }
  specify { bt_hash[:test].must_equal 100 }
end

module BinaryTree
  class HashNode
    def [](k)
      lookup(k.hash)
    end
  end
end

Done! Almost. lookup will be a protected method very similar to the old include? that we used for the basic binary tree. The square bracket notation will simply be an interface that accepts the raw key, but lookup will work with the hashed key (so we don’t need to run the hashing algorithm for every node we traverse).

module BinaryTree
  class HashNode
    private
      def lookup(hk)
        case hashed_key <=> hk
        when 1 then left.lookup(hk)
        when -1 then right.lookup(hk)
        when 0 then value
        end
      end
  end
end

…and the test passes. Let’s try looking up a key that hasn’t been set:

specify { bt_hash[:missing].must_be_nil }

In order for this test to pass, we will need to traverse the entire tree until we end up at an empty node. That node should respond to the lookup method with nil:

module BinaryTree
  class EmptyHashNode
    def lookup(*)
      nil
    end
  end
end

2. Storing a value

So now let’s try inserting a key-value pair using the []= notation:

specify "inserting a new value" do
  bt_hash[:hello] = 200
  bt_hash[:hello].must_equal 200
end

specify "overwriting an existing value" do
  bt_hash[:test] = 101
  bt_hash[:test].must_equal 101
end

Similar to the [] method, we’ll implement []= as an interface to the protected method store, which will accept the hashed key and raw value:

module BinaryTree
  class HashNode
    def []=(k, v)
      store(k.hash, v)
    end
  end
end

There are two possibilities here: inserting a new value and overwriting an existing value. Both covered by our two new tests. We can handle both of these in one store method, but we need to change value from attr_reader to attr_accessor so we can update its contents:

module BinaryTree
  class HashNode
    attr_accessor :left, :right, :value

    protected

      def store(hk, v)
        case hashed_key <=> hk
        when 1 then store_left(hk, v)
        when -1 then store_right(hk, v)
        when 0 then self.value = v
        end
      end

    #...

But now we have a problem – we’ve been passing the hashed key down the tree looking for the correct place to store it, but when we get there, we don’t have the original key to pass to the constructor of HashNode. In fact, we don’t even really need it, because all lookups and stores operate on the hashed value anyway, but if we want to be able to inspect the tree, it would be nice to see the keys we’ve assumed we were using. It’s not the nicest looking solution, but since we’re operating with protected methods, I’m not very concerned about simply tacking another argument onto the end of store, store_left, and store_right:

# ...
  def []=(k, v)
    store(k.hash, v, k)
  end

protected

  def store(hk, v, k)
    case hashed_key <=> hk
    when 1 then store_left(hk, v, k)
    when -1 then store_right(hk, v, k)
    when 0 then self.value = v
    end
  end

private

  def store_left(hk, v, k)
    left.store(hk, v, k) or self.left = HashNode.new(k, v)
  end

  def store_right(hk, v, k)
    right.store(hk, v, k) or self.right = HashNode.new(k, v)
  end

# ...

3. Fetch

Ok! We’re actually almost done. Let’s add a whole slew of tests and see how well this conforms to Ruby’s native Hash implementation:

specify "storing arbitrary objects as keys" do
  obj = Object.new
  bt_hash[obj] = 1001
  bt_hash[obj].must_equal 1001
end

specify "nesting hashes" do
  other_hash = BinaryTree::HashNode.new(:world, 102)
  bt_hash[:hello] = other_hash
  bt_hash[:hello][:world].must_equal 102
end

Both of these tests pass without any further modifications to the HashNode. However, these:

specify { bt_hash.fetch(:test).must_equal 100 }
specify { bt_hash.fetch(:missing, 101).must_equal 101 }
specify { ->{ bt_hash.fetch(:missing) }.must_raise KeyError }
specify { bt_hash.fetch(:missing) { 101 }.must_equal 101 }

…use the powerful fetch method, which we have not implemented. fetch provides several ways of handling missing values in a hash: specifying a default missing value, returning a value from a block, or raising a KeyError. On its face, fetch is a modified version of the lookup method, implemented as a series of guard clauses:

#...
def fetch(k, default = nil, &block)
  v = lookup(k)
  return v if v
  return default if default
  return block.call if block_given?
  raise KeyError
end
#...

… and the tests pass.

4. `default_proc`

The last hash feature I use regularly is called the default_proc, which stores a block that is called when a given key is not present in the hash table. It is used like this:

hash = Hash.new {|hash, key| hash[key] = [] }
hash[0] << 1 << 2 << 3 #=> { 0 => [1, 2, 3] }

Since our binary tree hash is a distributed recursive data structure, at first glance this presents a special challenge — each node will need to store the default proc and pass it on to new node instances as they are created. But that’s not actually true. The recursive nature of the structure means that lookups bubble up through the tree and are returned through the root node’s lookup function, meaning the proc only has to be stored at the root.

Let’s start with some tests. Our hash will automatically fire up some arrays for us when a new key is given:

let(:defaulting_hash){ BinaryTree::HashNode.new(:test, []){|hash, key| hash[key] = []} }
specify { defaulting_hash[:empty].must_equal [] }

specify "inserting values" do
  defaulting_hash[:my_array] << 1 << 2 << 3
  defaulting_hash[:my_array].must_equal [1, 2, 3]
end

In Ruby’s hash implementation, the default proc is only called for [] lookups — fetch still works the same way regardless of the presence of the constructor block. In addition to altering [], We need to modify initialize and add a new accessor for default_proc:

module BinaryTree
  class HashNode
    attr_reader :default_proc

    def initialize(key, value, &default_proc)
      @value = value
      @key = key
      @hashed_key = key.hash
      @left = EmptyHashNode.new
      @right = EmptyHashNode.new
      @default_proc = default_proc
    end

    def [](k)
      v = lookup(k.hash)
      return v if v
      default_proc.call(self, k) if default_proc
    end
  end
end

The block passed to Hash#new uses the hash and a key as formal arguments, so we can simply call the proc using the current node and the given k. The recursive structure of the binary tree will take care of the rest:

my_hash = BinaryTree::HashNode.new(:test, 100){|hash, key| hash[key] = 100 }
my_hash[:my_array] << 1 << 2 << 3
# => {test => []:{my_array => [1, 2, 3]:{}|{}}|{}}

Appendix

Here is the complete source code for the binary tree hash:

module BinaryTree class EmptyHashNode def to_a [] end def inspect "{}" end def lookup(*) nil end def store(*) false end end class HashNode # our three features: attr_reader :hashed_key, :key, :default_proc attr_accessor :left, :right, :value def initialize(key, value, &block) @value = value @key = key @hashed_key = key.hash @left = EmptyHashNode.new @right = EmptyHashNode.new @default_proc = block end def [](k) v = lookup(k.hash) return v if v default_proc.call(self, k) if default_proc end def fetch(k, default = nil, &block) v = lookup(k.hash) return v if v return default if default return block.call if block_given? raise KeyError end def []=(k, v) store(k.hash, v, k) end def inspect "{#{key} => #{value}:#{left.inspect}|#{right.inspect}}" end protected def lookup(hk) case hashed_key <=> hk when 1 then left.lookup(hk) when -1 then right.lookup(hk) when 0 then value end end def store(hk, v, k) case hashed_key <=> hk when 1 then store_left(hk, v, k) when -1 then store_right(hk, v, k) when 0 then self.value = v end end private def store_left(hk, v, k) left.store(hk, v, k) or self.left = HashNode.new(k, v) end def store_right(hk, v, k) right.store(hk, v, k) or self.right = HashNode.new(k, v) end def left=(v) @left = v end def right=(v) @right = v end end end

And the tests:

require 'minitest/autorun' require 'minitest/pride' describe BinaryTree::HashNode do let(:bt_hash){ BinaryTree::HashNode.new(:test, 100) } specify { bt_hash[:test].must_equal 100 } specify { bt_hash[:missing].must_be_nil } specify "inserting a new value" do bt_hash[:hello] = 200 bt_hash[:hello].must_equal 200 end specify "overwriting an existing value" do bt_hash[:test] = 101 bt_hash[:test].must_equal 101 end specify "storing arbitrary objects as keys" do obj = Object.new bt_hash[obj] = 1001 bt_hash[obj].must_equal 1001 end specify "nesting hashes" do other_hash = BinaryTree::HashNode.new(:world, 102) bt_hash[:hello] = other_hash bt_hash[:hello][:world].must_equal 102 end specify { bt_hash.fetch(:test).must_equal 100 } specify { bt_hash.fetch(:missing, 101).must_equal 101 } specify { ->{ bt_hash.fetch(:missing) }.must_raise KeyError } specify { bt_hash.fetch(:missing) { 101 }.must_equal 101 } let(:defaulting_hash){ BinaryTree::HashNode.new(:test, []){|hash, key| hash[key] = []} } specify { defaulting_hash[:empty].must_equal [] } specify "inserting values" do defaulting_hash[:my_array] << 1 << 2 << 3 defaulting_hash[:my_array].must_equal [1, 2, 3] end end

Binary Search Trees in Ruby

Apr 25th, 2014

In learning about Haskell recently, I was introduced to a recursive data structure I had never used before (at least not knowingly) – the binary search tree (BST). The BST is made up of nodes that have three main features:

They contain a value
They can refer to another node to the left with a smaller value
They can refer to another node to the right with a larger value

So let’s throw a simple problem at this: determining whether a given integer is a member of an array.

Can we construct a binary search tree in Ruby that is faster than Ruby’s native C-implemented Array class? (Spoilers: yes.)

1. Implementation

My first implementation of this had an overarching Tree class, but it soon became clear that it’s entirely unccessary. As a recursive data structure, each node has a tree descending from it, so we just need a Node. The tree will be implied.

module BinaryTree
  class Node
    # our three features:
    attr_reader :value
    attr_accessor :left, :right

    def initialize(v)
      @value = v
    end
  end
end

tree       = BinaryTree::Node.new(10)
tree.left  = BinaryTree::Node.new(5)
tree.right = BinaryTree::Node.new(15)
puts tree.inspect
#<BinaryTree::Node:0x007f9ce207a770 @value=10, 
# @left=#<BinaryTree::Node:0x007f9ce207a748 @value=5>, 
# @right=#<BinaryTree::Node:0x007f9ce207a720 @value=15>>

Great! But also: awful. Instead of constructing the tree manually, we need to be able to treat it as if it were an array. We should be able to apply the shovel operator to the base node of the tree and have the tree place the value wherever it should rightfully or leftfully go.

module BinaryTree
  class Node
    def insert(v)
      case value <=> v
      when 1 then left.insert(v)
      when -1 then right.insert(v)
      when 0 then false # the value is already present
      end
    end
  end
end

This uses Ruby’s spaceshipesque comparator <=> to determine if the value to be inserted is greater than, less than, or equal to the value of the current node, and then traverses the tree until …

tree.insert(3)
# => binary_tree.rb:13:in `insert': undefined method `insert' for nil:NilClass (NoMethodError)

Spectacular! Except: not. We have a node that expected its left value to respond to insert, to which nil annoyingly refused. We can redefine more specific insert methods to work around this issue (and, since it’s getting a bit hard to read, a new inspect method):

module BinaryTree
  class Node
    def insert(v)
      case value <=> v
      when 1 then insert_left(v)
      when -1 then insert_right(v)
      when 0 then false # the value is already present
      end
    end

    def inspect
      "{#{value}::#{left.inspect}|#{right.inspect}}"
    end

    private

      def insert_left(v)
        if left
          left.insert(v)
        else
          self.left = Node.new(v)
        end
      end

      def insert_right(v)
        if right
          right.insert(v)
        else
          self.right = Node.new(v)
        end
      end
  end
end

tree.insert(3)
# => {10:{5:{3:nil|nil}|nil}|{15:nil|nil}}

The next step is to determine whether our tree contains a given value. This is where the Binary Search Tree has a reputation for speediness – unlike iterating over every element of an array and checking for equality, the structure of the tree provides a sort of automatic index that points us to where the value should be, and then we can check if it’s there. It’ll look remarkably similar to our insert method:

module BinaryTree
  class Node

    # named include? to parallel Array#include?
    def include?(v)
      case value <=> v
      when 1 then left.include?(v)
      when -1 then right.include?(v)
      when 0 then true # the current node is equal to the value
      end
    end
  end
end

If you were paying attention to the insert method, you can probably guess that when this method reaches a left or right that is nil, it will fail. Which is really annoying! But since this seems to be a pattern we have stumbled upon, let’s find a better way to solve this rather than peppering the code with nil checks. Enter our second class, EmptyNode:

module BinaryTree
  class EmptyNode
    def include?(*)
      false
    end

    def insert(*)
      false
    end

    def inspect
      "{}"
    end
  end
end

…and make sure instances of this class terminate our tree by default:

module BinaryTree
  class EmptyNode
    def initialize(v)
      @value = v
      @left  = EmptyNode.new
      @right = EmptyNode.new
    end
  end
end

happy = BinaryTree.new(10).left #=> 

Note: the (*) formal argument to these EmptyNode’s methods simply states that we don’t care how many arguments are passed to the method, and that we won’t be using them anyway.

The EmptyNode class is useful in that it provides a meaningful end to the recursive structure — specifically, that a given range of values in the tree are definitively not present. Otherwise, it does very little. We don’t allow insert to do anything with it, because then it wouldn’t be an empty node. Unfortunately, we can’t simply tell it to replace itself with a Node object (as that isn’t possible in Ruby), so we have to change the reference back at the parent node:

module BinaryTree
  class Node
    private
      def insert_left(v)
        left.insert(v) or self.left = Node.new(v)
      end

      def insert_right(v)
        right.insert(v) or self.right = Node.new(v)
      end

  end
end

Here, we use the or control flow operator to perform one of two actions: if the first returns a falsey value, the second (assign the new Node object).

Ok! So we now have a binary tree that can insert new values at the correct location and tell you whether or not it contains a given value. Let’s check it out:

tree = BinaryTree::Node.new(10)               #=> {10:{}|{}}
[5, 15, 3].each {|value| tree.insert(value) } #=> {10:{5:{3:{}|{}}|{}}|{15:{}|{}}}
puts tree.include?(10) #=> true
puts tree.include?(15) #=> true
puts tree.include?(20) #=> false
puts tree.include?(3)  #=> true
puts tree.include?(2)  #=> false

2. Benchmarks

Let’s benchmark it! This test populates an array with 5000 random values up to 50,000, that checks every value between 1 and 50,000 to see if the array includes it. The same benchmark is repeated for the binary tree containing an identical set of values.

require 'benchmark'

test_array = []
5000.times { test_array << (rand 50000) }

tree = BinaryTree::Node.new(test_array.first)
test_array.each {|value| tree.insert(value) }

Benchmark.bm do |benchmark|
  benchmark.report("test_array include"){ (1..50000).each {|n| test_array.include? n } }
  benchmark.report("binary tree search"){ (1..50000).each {|n| tree.include? n } }
end

                        user     system      total        real
test_array include 13.230000   0.020000  13.250000 ( 13.283172)
binary tree search  0.140000   0.000000   0.140000 (  0.139983)

I have to say, I was a little surprised how much faster (~100x) this was. It makes sense when you think about the fact that to check if an element is included in the Array, Ruby needs to run an equality comparison for up to 5000 values 50000 times. That’s a lot of overhead, and Arrays simply aren’t optimized for this. Ruby has another built-in data structure that is explicitly designed for fast lookups of arbitrary values — the venerable Hash. Similar to a binary search tree, Ruby’s hash tables follow a defined set of rules that guide it to the proper places in memory when setting and retrieving values. For an in-depth exploration of what makes Hashes fast, read Pat Shaughnessy’s Ruby Under a Microscope.

Let’s rerun the benchmark again, but this time comparing hash lookups as well. For these purposes, it doesn’t matter what the values are in the hash, so we’ll just make them all true:

test_hash = Hash[test_array.map {|x| [x, true] }]

Benchmark.bm do |benchmark|
  benchmark.report("test_array include"){ (1..50000).each {|n| test_array.include? n } }
  benchmark.report("binary tree search"){ (1..50000).each {|n| tree.include? n } }
  benchmark.report("test_hash lookup"  ){ (1..50000).each {|n| hash.has_key? n } }
end

                         user     system      total        real
 test_array include 13.400000   0.020000  13.420000 ( 13.473588)
 binary tree search  0.130000   0.000000   0.130000 (  0.129013)
 test_hash lookup    0.000000   0.000000   0.000000 (  0.008119)

Ruby’s native C-implemented Hash is around 15 times faster than the Ruby-implemented binary search tree, which is about what I expected.

3. Array Conversions

In order to convert arrays into binary trees and back again, let’s introduce a few new methods. The first will be a module method:

module BinaryTree
  def self.from_array(array)
    Node.new(array.first).tap do |tree|
      array.each {|v| tree.insert v }
    end
  end
end

from_array simply assigns the root node of the tree as the first value of the array, then pushes all array values on in order. Converting back to an array is a simple matter of traversing the recursive tree. An interesting side effect is that if done in a particular way, this is equivalent to calling .uniq.sort on the original array (as far as I know, it’s impossible to maintain the original order).

module BinaryTree
  class Node
    def to_a
      left.to_a + [value] + right.to_a
    end
  end

  class EmptyNode
    # unsurprisingly, an empty node returns an empty array
    def to_a
      []
    end
  end
end

In case it’s not clear how the recursion works, here’s what the array expansion looks like for a simple tree {10:{5:{}|{}}|{15:{}|{}}}:

For both 5 and 15, left.to_a and right.to_a are [] (EmptyNode#to_a), so the results are [5] and [15] respectively
For 10, left.to_a is [5] and right.to_a is [15], giving [5] + [10] + [15] or [5, 10, 15]

We can test it on an example with more elements:

array = [51, 88, 62, 68, 98, 93, 51, 67, 91, 4, 34]
tree = Binary.from_array(array)
# => {51:{4:{}|{34:{}|{}}}|{88:{62:{}|{68:{67:{}|{}}|{}}}|{98:{93:{91:{}|{}}|{}}|{}}}}
tree.to_a #=> [4, 34, 51, 62, 67, 68, 88, 91, 93, 98]

Interestingly, it’s faster to convert a large array into a binary tree and perform a search than it is to call include? on the Array.

array = 5000.times.map { rand 50000 }

Benchmark.bm do |benchmark|
  benchmark.report("array#include?") { (1..50000).each {|v| array.include?(v) }}
  benchmark.report("binary search") do
    tree = BinaryTree.from_array(array)
    (1..50000).each {|v| tree.include?(v) }
  end
end

                    user     system      total        real
array#include? 13.160000   0.020000  13.180000 ( 13.235368)
binary search   0.190000   0.000000   0.190000 (  0.188989)

It takes about 50% longer than just the binary tree search itself, which makes sense because it traverses the tree twice (once to insert values and once to query them). It doesn’t take twice as long, because we start with a small tree (a single node) and build it up gradually as the values are inserted.

4. Why would I use this?

Because of nerdliness?

Honestly cannot think of an instance where this would have been useful to me in a Ruby project, including those where I’m juggling querying enormous quantities of data.

As much as we are loathe to admit it, there are other languages that behave differently from Ruby and are used for different things. Here’s a StackOverflow question describing some of them.

Click here to read part 2: Reimplementing Ruby’s Hash using binary search trees!

Appendix

Here is the code in its entirety:

# https://zvkemp.github.io/blog/2014/04/25/binary-search-trees-in-ruby/ module BinaryTree class EmptyNode def to_a [] end def include?(*) false end def push(*) false end alias_method :<<, :push def inspect "{}" end end class Node # our three features: attr_reader :value attr_accessor :left, :right def initialize(v) @value = v @left = EmptyNode.new @right = EmptyNode.new end def push(v) case value <=> v when 1 then push_left(v) when -1 then push_right(v) when 0 then false # the value is already present end end alias_method :<<, :push def include?(v) case value <=> v when 1 then left.include?(v) when -1 then right.include?(v) when 0 then true # the current node is equal to the value end end def inspect "{#{value}:#{left.inspect}|#{right.inspect}}" end def to_a left.to_a + [value] + right.to_a end private def push_left(v) left.push(v) or self.left = Node.new(v) end def push_right(v) right.push(v) or self.right = Node.new(v) end end end

Blog Archives

Who this is for:

1. What’s a linked list?

2. Functions!

2.1 CONS

2.2 Folds!

2.3 Length

2.4 Map

2.5 Reverse

2.6 Concat and Flatten

2.7 Select

2.8 to_a, redux

3. The Complete List

4. Next Steps!

1. Retrieving a value

2. Storing a value

3. Fetch

4. default_proc

Appendix

1. Implementation

2. Benchmarks

3. Array Conversions

4. Why would I use this?

Appendix

2.1 `CONS`

2.8 `to_a`, redux

4. `default_proc`