Wednesday, July 22, 2015

How to implement a spreadsheet

My friend Lindsey Kuper recently remarked on Twitter that spreadsheets were commonly understood to be the most widely used dataflow programming model, and asked if there was a simple implementation of them.

As chance would have it, this was one of the subjects of my thesis work -- as part of it, I wrote and proved the correctness of a small dataflow programming library. This program has always been one of my favorite little higher-order imperative programs, and in this post I'll walk through the implementation. (You can find the code here.)

As for the proof, you can look at this TLDI paper for some idea of the complexities involved. These days it could all be done more simply, but the pressure of proving everything correct did have a very salutary effect in keeping everything as simple as possible.

The basic idea of a spreadsheet (or other dataflow engine) is that you have a collection of places called cells, each of which contains an expression. An expression is basically a small program, which has the special ability to ask other cells what their value is. The reason cells are interesting is because they do memoization: if you ask a cell for its value twice, it will only evaluate its expression the first time. Furthermore, it's also possible for the user to modify the expression a cell contains (though we don't want cells to modify their code as they execute).

So let's turn this into code. I'll use Ocaml, because ML modules make describing the interface particularly pretty, but it should all translate into Scala or Haskell easily enough. In particular, we'll start by giving a module signature writing down the interace.

 module type CELL = sig

We start by declaring two abstract types, the type 'a cell of cells containing a value of type 'a, and the type 'a exp of expressions returning a value of type 'a.

   type 'a cell
   type 'a exp

Now, the trick we are using in implementing expressions is that we treat them as a monadic type. By re-using our host language as the language of terms that lives inside of a cell, we don't have to implement parsers or interpreters or anything like that. This is a familiar trick to Haskell programmers, but it's still a good trick! So we first give the monadic bind and return operators:

 
   val return : 'a -> 'a exp
   val (>>=) : 'a exp -> ('a -> 'b exp) -> 'b exp

And then we can specify the two operations that are unique to our monadic DSL: reading a cell (which we call get), and creating a new cell (which we call cell). It's a bit unusual to be able to create new cells as a program executes, but it's rather handy.

   val cell : 'a exp -> 'a cell exp 
   val get :  'a cell -> 'a exp

Aside from that, there are no other operations in the monadic expression DSL. Now we can give the operations that don't live in the monad. First is the update operation, which modifies the contents of a cell. This should not be called from within an 'a exp terms --- in Haskell, that might be enforced by giving update an IO type.

 
   val set : 'a cell -> 'a exp -> unit 

Finally, there's the run operation, which we use to run an expression. This is useful mainly for looking at the values of cells from the outside.

   val run : 'a exp -> 'a 
 end

Now, we can move on to the implementation.

 
 module Cell : CELL = struct
The implementation of cells is at the heart of the dataflow engine, and is worth discussing in detail. A cell is a record with five fields:
   type 'a cell = {
     mutable code      : 'a exp;
     mutable value     : 'a option;
     mutable reads     : ecell list;
     mutable observers : ecell list;
     id                : int
   }
  • The code field of this record is the pointer to the expression that the cell contains. This field is mutable because we can alter the contents of a cell!
  • The value field is an option type, which is None if the cell has not been evaluated yet, and Some v if the code had evaluated to v.
  • The reads field is a list containing all of the cells that were read when the code in the code field was executed. If the cell hasn't been evaluated yet, then this is the empty list.
  • The observers field is a list containing all of the cells that have read this cell when they were evaluated. So the reads field lists all the cells this cell depends on, and the observers field lists all the cells which depend on this cell. If this cell hasn't been evaluated yet, then observers will of course be empty.
  • The id contains an integer which is the unique id of each cell.

Both reads and observers store lists of dependent cells, and dependent cells can be of any type. In order to build a list of heterogenous cells, we need to introduce a type ecell, which just hides the cell type under an existential (using Ocaml's new GADT syntax):

   and ecell = Pack : 'a cell -> ecell

We can now also give the concrete type of expressions. We define an element of expression type 'a exp to be a thunk, which when forced returns (a) a value of type 'a, and (b) the list of cells that it read while evaluating:

   and 'a exp = unit -> ('a * ecell list)
 

Next, let's define a couple of helper functions. The id function just returns the id of an existentially packed ecell, and the union function merges two lists of ecells while removing duplicates.

   let id (Pack c) = c.id

   let rec union xs ys =
     match xs with
     | [] -> []
     | x :: xs' ->
       if List.exists (fun y -> id x = id y) ys then
         union xs' ys
       else
         x :: (union xs' ys)

The return function just produces a thunk which returns a value and an empty list of read dependencies, and the monadic bind (>>=) sequentially composes two computations, and returns the union of their read dependencies.

   let return v () = (v, [])
   let (>>=) cmd f () =
     let (a, cs) = cmd () in
     let (b, ds) = f a () in
     (b, union cs ds)

To implement the cell operator, we need a source of fresh id's. So we create an integer reference, and new_id bumps the counter before returning a fresh id.

   let r = ref 0
   let new_id () = incr r; !r 

Now we can implement cell. This function takes an expression exp, and uses new_id to create a unique id for a cell, and then intializes a cell with the appropriate values -- the code field is exp, the value field is None (because the cell is created in an unevaluated state), and the reads and observers fields are empty (because the cell is unevaluated), and the id is set to the value we generated.

This is returned with an empty list of read dependencies because we didn't read anything to construct a fresh cell!

   let cell exp () =
      let n = new_id() in
    let cell = {
      code = exp;
      value = None;
      reads = [];
      observers = [];
      id = n;
     } in
    (cell, [])

To read a cell, we need to implement the get operation. This works a bit like memoization. First, we check to see if the value field already has a value. If it does, then we can return that. If it is None, then we have a bit more work to do.

First, we have to evaluate the expression in the code field, which returns a value v and a list of read dependencies ds. We can update the value field to Some v, and then set the reads field to ds. Then, we add this cell to the observers field of every read dependency in ds, because this cell is observing them now.

Finally, we return the value v as well as a list containing the current cell (which is the only dependency of reading the cell).

  let get c () =
    match c.value with
    | Some v -> (v, [Pack c])
    | None ->
      let (v, ds) = c.code ()  in
      c.value <-Some v;
      c.reads <- ds;
      List.iter (fun (Pack d) -> d.observers <- (Pack c) :: d.observers) ds;
      (v, [Pack c])

This concludes the implementation of the monadic expression language, but our API also includes an operation to modify the code in a cell. This requires more code than just updating a field -- we have to invalidate everything which depends on the cell, too. So we need some helper functions to do that.

The first helper is remove_observer o o'. This removes the cell o from the observers field of o'. It does this by comparing the id field (which was in fact put in for this very purpose).

  let remove_observer o (Pack c) =
    c.observers <- List.filter (fun o' -> id o != id o') c.observers

This function is used to implement invalidate, which takes a cell, marks it as invalid, and then marks everything which transitively depends on it invalid too. It does this by we saving the reads and observers fields into the variables rs and os. Then, it marks the current cell as invalid by setting the value field to None, and setting the observers and reads fields to the empty list. Then, it removes the current cell from the observers list of every cell in the old read set rs, and then it calls invalidate recursively on every observer in os.

  let rec invalidate (Pack c) =
    let os = c.observers in
    let rs = c.reads in 
    c.observers <- [];
    c.value <- None;
    c.reads <- [];
    List.iter (remove_observer (Pack c)) rs;
    List.iter invalidate os

This then makes it easy to implement set -- we just update the code, and then invalidate the cell (since the memoized value is no longer valid).

    
  let set c exp =
    c.code <- exp;
    invalidate (Pack c)

Finally, we can implement the run function by forcing the thunk and throwing away the read dependencies.

  let run cmd = fst (cmd ())
end

That's pretty much it. I think it's quite pleasant to see how little code it takes to implement such an engine.

One thing I like about this program is that it also shows off how gc-unfriendly dataflow is: we track dependencies in both directions, and as a result the whole graph is always reachable. As a result, the usual gc heuristis will collect nothing as long as anything is reachable. You can fix the problem by using weak references to the observers, but weak references are also horribly gc-unfriendly (usually there's a traversal of every weak reference on every collection).

So I think it's very interesting that there are a natural class of programs for which the reachability heuristic just doesn't work, and this indicates that some interesting semantics remains to be done to explain what the correct memory management strategy for these kinds of programs is.

4 comments:

  1. You should have a look at spreadsheet-implementation-technology(https://mitpress.mit.edu/books/spreadsheet-implementation-technology) - a scientific book with an actual implementation in C#. Quite interesting read.

    ReplyDelete
  2. Thanks to Neel for the post and the tiny elegant evaluator, and to weismat for mentioning my book :-)

    Spreadsheets admit function arguments with range references such as SUM(A1:B10000), which means that N cells may have O(N^2) dependencies. For this reason one should not represent the dependencies explicitly as references. The above-mentioned book shows a couple of ways to represent them compactly and symbolically in a way that should be effective for typical spreadsheets. This representation works well with current garbage collectors.

    ReplyDelete
  3. Have you seen Incremental : https://blogs.janestreet.com/introducing-incremental/

    ReplyDelete
  4. There's a bug in the list-as-set union code. The base case should be

    [] -> ys

    ReplyDelete