Almost all the functions are pure–the only impure functions we use read from or ...

Almost all the functions are pure–the only impure functions we use read from or write to a data store. We use Apache Spark, which lets you write pure functions that can operate on data too large to handle on a single box, and it overall works quite well. Eg when designing started this project we wrote something like:

g = f5 . f4 . f3 . f2 . f1

where f1 reads from s3 and f5 writes to a db. Then the implementation work mostly involved breaking these down further, e.g. f1 = h4 . h3 . h2 . h1, where only h1 is stateful and everything else is pure.

Spark is lazily evaluated, and in practice it will stream through many operations–f1 will generally not be done consuming the input by the time f2 starts, although sometimes we force it to for debugging purposes.

Lazy evaluation and the discarding of side effects make logging difficult, which is one of the downsides. There are various monitoring and debugging tools that help but it's still definitely harder than the single machine case.