Over the last decade we’ve experienced the rise and rise of open-source software for big data and big scale. That was indeed a big leap forward, yet I’ve become disillusioned with much of the surrounding hype. Rants aside, here’s where I think the underlying issues are, and most importantly: what we can do about it. It’s my personal take, so perhaps I should open with my own story.
Each of us has our own path in software development, from adolescence to adulthood. I can trace my own first real job to around 1998.
Back then, real servers were still made by companies like Sun, running Solaris on expensive proprietary metal. I recall catching a glimpse of a purchase order for a few Sun Ultra workstations circa 2000. One line item in particular caught my eye, as I could directly compare it to my PC at home: a cool $150 for a CD drive, worth about $230 today. It’s very probable that this drive came in Sun’s iconic purple, thus exuding its commanding authority over the pedestrian CD drive in your home, but still — it didn’t feel right. Savvy young developers started pushing for Linux. People rushed to join the first wave of the Internet bubble.
Fast forward to 2013, when I joined a tiny startup called Dynamic Yield and started working on “web scale” problems in earnest.
Till then, scale for me usually meant tuning a beastly Java application server doing maybe, maybe 100 requests per second, each request typically hitting the relational DB a dozen times within a transaction. Each single request used to mean something important that someone cared about. Now, data became a clickstream from across the web: hardly transaction-worthy, but coming in at thousands of requests per second, and growing fast.
The founding team had already seen their first database (a single MySQL instance) brought to its knees by that traffic, and moved on to Apache HBase for anything non-administrative. Before a year went by, I introduced Redis for fast access to user data and Elasticsearch for ad-hoc analytics to our growing spiderweb of an architecture. We were deep in NoSQL territory now. For batch jobs, custom Java processes and clunky Hadoop MapReduce jobs started making way for the shiny new tech of the day: Apache Spark. Kafka and stream processing frameworks would come next.
Much of that new wealth of tools for scale was directly inspired by Google’s whitepapers. These marked a big shift in how scale and resilience were tackled, or indeed thought about:
First it was about GFS (the Google File System) and MapReduce, then BigTable. It pushed the wider tech community to think in terms of commodity hardware and inexpensive hard disks rather than purple hardware and RAID, relying on multiple replicas for high availability and distributing work. You realized hardware would fail and keep failing, and what you needed were tools that would happily skip over such road bumps with minimal fuss and grow in capacity as fast as you could throw new machines at ’em.
Not having to pay a hefty per-server license fee to Larry (or his colleagues) was a big plus. The concept of GFS inspired Apache HDFS, Google MapReduce’s principles were recast as Apache Hadoop M/R, and BigTable was remade as Apache HBase. Later on, Google’s Dremel (which you probably know today in its SaaS incarnation — BigQuery) inspired the Parquet file format and a new generation of distributed query engines.
To me, these were exciting times! Being able to scale so much with open source, provisioning new servers in hours or minutes…
The promises of elasticity, of cost efficiency and high availability were to a large extent realized — especially if you’ve been used to waiting months for the servers, for IT priests to install a pricey NAS, for procurement of more licenses of some commercial middleware.
I guess this goes against the grain of nostalgia permeating so much of the discourse. Usually it runs along the lines of ‘we did so much in the old days with so little memory and no-nonsense, hand-polished code! Oh, these kids and their multi-megabyte JS bundles will never know it…’. I like to compare that to parenting: there is always someone who’s achieved so much more with a lot less and is just eager to tell you all about it; it never ends.
The Burden of Complexity
You only get a look at the real, gnarly underbelly of a cluster at the worst of times, it seems: during peak traffic, at night, on holidays.
There is a complexity to a cluster’s state management that just does not go away, instead waiting to explode at some point. Here’s how it usually happens, to me at least: whether you’re managing a random-access database cluster, a batch processing or a stream processing cluster, you get it to work initially and for a while it seems to work. You don’t really need to know how it elects a master, how it allocates and deallocates shards, how precious metadata is kept in sync, or what ZooKeeper actually does in there.
At some point, you hit some unexpected threshold that’s really hard to predict — because it’s related to your usage patterns, your specific load spikes, and things go awry — sometimes the system gets completely stuck, sometimes limping along and crying out loud in the log files. Often, it’s not a question of data size or volume as you might expect, but a more obscure limit being crossed. Your Elasticsearch cluster might be fine with 2,382 indices, but one day you get to 2,407 and nodes start breaking, pulling the rest with them down to misery lane. I just made up these numbers — it’s gonna be different metrics and different thresholds for you.
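In practice, the only defense I’ve found is to track those obscure metrics yourself and alert well before the cliff. Here is a minimal sketch of such a guardrail; the metric names and limits are made up (just like the index counts above), and real values would come from your cluster’s stats APIs:

```python
# Hypothetical guardrail: warn well before a cluster metric reaches the point
# where nodes start failing. Metric names and soft limits below are invented
# for illustration; in practice you'd pull them from your cluster's stats API.

def headroom_alerts(metrics, soft_limits, warn_ratio=0.8):
    """Return the metrics that have crossed warn_ratio of their soft limit."""
    alerts = {}
    for name, value in metrics.items():
        limit = soft_limits.get(name)
        if limit and value >= warn_ratio * limit:
            alerts[name] = f"{name} at {value}/{limit} ({value / limit:.0%})"
    return alerts

# Note it's the index count that trips the alert here, not disk usage.
alerts = headroom_alerts(
    metrics={"indices": 2382, "shards_per_node": 510, "disk_used_pct": 41},
    soft_limits={"indices": 2500, "shards_per_node": 1000, "disk_used_pct": 85},
)
```

The hard part, of course, is knowing which metrics to watch before they bite you for the first time.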
In the best-case scenario, you solve the issue at hand on the same day, but often the same issue will repeat — just give it time. Sometimes it takes weeks to get things back to stable, or months to hunt down a recurring issue. Over time, it gets to be a big time drain. Sometimes, multiple and seemingly unrelated incidents happen to all blow up over a short period of time, and the team gets fatigued from fighting the fires. Eventually you’ve put enough work hours and thrown enough extra resources at the problem (“let’s provision a few more servers, give it more headroom”) and it goes away, but you know it will return. As head of R&D this can get especially tiring, because whether the fire is at team A, B, or C — it’s ultimately your fire as well, every time. Better adopt some personal philosophy to make it sustainable for you.
Sometimes people will respond to your troubles with how great and reliable their tooling is/was at company X: “we’re using this awesome database Y, and never had any significant issue!”, or: “you’re over-complicating, just use Z like us and be happy.” Often people are simply ignorant of the complexity and nuance of your workload, or just acting a bit douchey. However, I’ve come to think about this as mostly a matter of time: systems can work fine for years till something happens. During that time you get bragging rights for how well things work, but adopt more tools, give it more time and feed the system with growth — and you will get your firefighting alright.
Operational complexity is thus an enduring burden, but I should also mention limited elasticity and its friend, high cost. While tools do get better in that respect, what may have been considered “elastic” in 2012 is simply not what I’d call elastic today.
Let’s take Apache Spark for example. There’s still a steady stream of people going into Big Data who think it will solve all their problems, but if you’ve spent significant time with it, you’d know that for your jobs to work well you need to carefully adjust the amount of RAM vs. CPU for each job, to dive into config settings, perhaps tinker a bit with Garbage Collection. You’d need to analyze the “naive” DAG it generated for your code to find choke points, then modify the code so it’s a bit less about the cleanest functional abstraction and more about actual efficiency. In our case, we also needed to override some classes. But such is life: care-free big data processing by high-level abstractions never lived up to the hype.
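To give a feel for that RAM-vs-CPU juggling: even basic executor sizing tends to follow hand-me-down heuristics rather than anything automatic. A sketch of one common rule of thumb — the specific numbers (5 cores per executor, ~10% off-heap overhead) are widely repeated conventions, not Spark requirements, and your jobs may well need different ones:

```python
import math

def executor_sizing(node_cores, node_mem_gb, cores_per_executor=5,
                    os_reserved_cores=1, os_reserved_gb=1,
                    overhead_fraction=0.10):
    """A common rule-of-thumb for sizing Spark executors on a worker node:
    reserve a core and some memory for the OS, cap executors at ~5 cores
    each for I/O throughput, and leave ~10% of memory for off-heap overhead.
    Returns (executors_per_node, heap_gb_per_executor)."""
    usable_cores = node_cores - os_reserved_cores
    executors = usable_cores // cores_per_executor
    mem_per_executor = (node_mem_gb - os_reserved_gb) / executors
    heap_gb = math.floor(mem_per_executor * (1 - overhead_fraction))
    return executors, heap_gb

# e.g. a 16-core / 64 GB worker node:
executors, heap_gb = executor_sizing(16, 64)  # -> (3, 18)
```

And that’s just the starting point, before any job has actually run and shown you where it spills to disk or chokes on a skewed partition.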
The challenge goes further, though.
One idea that Hadoop pushed for since its inception was that you could build a nice big cluster of commodity servers, and then throw a whole bunch of jobs from different teams at different times at its resource manager (nowadays known as YARN), letting it figure out which resources each job needs and how to fit all these jobs nicely. The concept was to think about the capacity of the cluster as an aggregate of its resources: this much CPU, storage, memory. You would scale that out as necessary, while R&D teams kept churning out new jobs to submit. That idea pretty much made its way into Spark as well.
The problem is that a single cluster really doesn’t like juggling multiple jobs with very different needs in terms of compute, memory and storage. What you get instead is a constant battle for resources, delays and occasional failures. Your cluster will typically be in one of two modes: either it’s currently over-provisioned to have room for extra work (read: idle resources that you pay for), or it’s under-provisioned and you wish it was easy or fast to scale depending on how well jobs progress right at this moment. I’m not saying it’s not possible, but it’s definitely not easy, quick or out-of-the-box.
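Here’s a toy illustration of why the aggregate-capacity view misleads: a set of jobs can fit the cluster’s total resources on paper while no single node can actually host the memory-hungry one. This is a gross simplification of what YARN’s scheduler really does (jobs run as many containers, not one), but the shape of the problem is real:

```python
def fits_aggregate(jobs, totals):
    """The naive capacity view: sum of demands vs. sum of resources."""
    return all(sum(j[k] for j in jobs) <= totals[k] for k in totals)

def fits_per_node(jobs, node, num_nodes):
    """Greedy placement, one container per job (a big simplification of
    YARN/Spark scheduling): each job must fit on some single node."""
    nodes = [dict(node) for _ in range(num_nodes)]
    for job in sorted(jobs, key=lambda j: -j["mem_gb"]):
        for n in nodes:
            if all(n[k] >= v for k, v in job.items()):
                for k, v in job.items():
                    n[k] -= v
                break
        else:
            return False  # no node has room for this job
    return True

# Two 8-core / 32 GB nodes; one memory-hungry job, one CPU-hungry job.
jobs = [{"cores": 2, "mem_gb": 48}, {"cores": 14, "mem_gb": 8}]
ok_on_paper = fits_aggregate(jobs, {"cores": 16, "mem_gb": 64})      # True
ok_in_practice = fits_per_node(jobs, {"cores": 8, "mem_gb": 32}, 2)  # False
```

In real clusters the mismatch is messier still, since each job’s demands shift from stage to stage while it runs.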
If you’re used to scaling, say, web servers via a single metric you’ll be surprised at this. Web servers usually “have one job” and you can measure their maximum capacity as X requests per second. They rely on databases and backend servers, but they don’t need to know about each other, let alone pass huge chunks of data between them over the network or thrash the disks. Data processing clusters do all of these different things, concurrently, and never on quite the same schedule.
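For contrast, the single-metric autoscaling that stateless web tiers enjoy fits in one line. Nothing like it exists for a multi-tenant processing cluster, where the bottleneck resource changes from stage to stage. (The 20% headroom here is just an illustrative choice.)

```python
import math

def desired_replicas(current_rps, capacity_rps_per_server, headroom=0.2):
    """Classic stateless-tier autoscaling: one load metric, one capacity
    number per server, plus some safety headroom."""
    return math.ceil(current_rps * (1 + headroom) / capacity_rps_per_server)

# e.g. 4,500 req/s against servers that each handle ~1,000 req/s:
replicas = desired_replicas(4500, 1000)  # -> 6
```

That simplicity is exactly what makes web-tier elasticity feel solved, and data-cluster elasticity not.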
If you run into such problems, you’ll probably start managing multiple clusters, each configured to its needs and with its own headroom. However, provisioning these multiple clusters takes precious time and operating them is still a hassle. Even a single job in isolation typically has multiple stages which stress different resource types (CPU, memory, disk, network), which makes optimal resource planning a challenge even with multiple dedicated clusters.
You could get some of that work off your shoulders if you go for a managed service, but the “management tax” can easily reach ludicrous amounts of money at scale. If you’re already paying, say, $1 million a year for self-managed resources on the cloud, would you pay 1.5x-2x that to get a managed-to-an-extent service?
New technologies (e.g. k8s operators) do provide us with faster provisioning and better resource utilization. Here’s one thing they cannot solve: spending precious engineering time to thoroughly profile and tune a misbehaving component is always an issue, which makes it very tempting to throw ever more compute resources at the problem. Over time these inefficiencies accumulate, and as the organization grows they add up to a pretty significant dollar amount across the system.
Once your Spark cluster is properly set up and running well, it can output huge amounts of data quickly when it reaches the stage of writing results. If the cluster is running multiple jobs, these writes come in periodic bursts.
Now assume you want these results written into external systems, e.g. SQL databases, Redis, Elasticsearch, Cassandra. It’s all too easy for Spark to overwhelm or significantly impair any database with these big writes. I’ve even seen it break a cluster’s internal replication.
You can’t really expect Elasticsearch to grow from, say, 200 cores to 1,000 for exactly the duration where you need to index stuff in bulk, then drop back to 200 immediately afterwards. Instead, there are various things you could do:
- Aggressively throttle down the output from Spark — i.e. spend money being near-idle on the Spark side, or:
- Write and manage a different component to read Spark’s output and perform the indexing (meaning dev time and operations), and/or:
- Over-provision Elasticsearch (more money).
In other words, not really the elasticity I was hoping for.
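The first option, throttling, usually boils down to putting a token bucket in front of the bulk writes. A minimal sketch — the rate and capacity are illustrative, and a real implementation would also need backpressure and retries:

```python
import time

class TokenBucket:
    """Throttle bulk writes so a downstream database isn't overwhelmed:
    a burst of output from Spark gets smoothed to ~rate units/second,
    with bursts of at most `capacity` allowed when the bucket is full."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.clock = clock        # injectable for testing
        self.last = clock()

    def try_consume(self, n):
        """Consume n tokens if available; return whether the write may proceed."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

The catch is right there in the lead-in to this sketch: every moment the writer spends waiting on the bucket is a moment your Spark executors sit near-idle, billing away.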
Over time, patterns have evolved to alleviate some of these pains: Apache Kafka is frequently used as the “great decoupler”, allowing multiple receiving ends to consume data at their own pace, go down for maintenance, etc. Kafka, though, is another expensive piece of the puzzle that is definitely not as auto-magically scalable or resilient as the initial hype had me believing.
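The decoupling idea itself is simple and worth keeping even where Kafka the implementation feels heavyweight: producers append to a shared log, and each consumer group advances its own offset, so a slow or offline consumer delays no one else. A toy in-memory model of that idea (this is not Kafka’s actual API, and it ignores partitions, persistence and everything else that makes Kafka hard):

```python
class ToyLog:
    """An append-only log where each consumer group keeps its own offset,
    so consumers can fall behind or pause without blocking one another."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer group -> next offset to read

    def append(self, record):
        self.records.append(record)

    def poll(self, group, max_records=10):
        """Return the group's next batch and advance its offset."""
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

log = ToyLog()
for i in range(5):
    log.append(i)

log.poll("indexer", max_records=2)     # -> [0, 1]
log.poll("analytics", max_records=10)  # -> [0, 1, 2, 3, 4]; own offset
log.poll("indexer", max_records=10)    # -> [2, 3, 4]; resumes where it left off
```

Per-consumer offsets are what buy you the freedom to index into Elasticsearch at a sustainable pace while an analytics consumer races ahead.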
On the storage front there have been improvements as well: instead of using HDFS, we switched to S3 for all batch outputs so we don’t need to worry about HDD size or filesystem metadata processes. That means giving up on the once-touted idea of “data locality” which, in hindsight, was a big mismatch: storage size tends to go only one way: up and up, and you just want it to work. At the same time you want compute to be as elastic as possible, utilizing as much or as little as you need right now. Marrying both was always quite clumsy. Fortunately AWS got its act together over time, improved intra-datacenter network performance considerably and finally sorted out strong consistency in S3 (oh, the pain, the workarounds!). That brought it in line with Google Cloud on these points, making the storage-compute decoupling viable on AWS as well.
That last note is important: the building blocks offered by a cloud provider may encourage a good architecture, or push you away from one.
Things to Want
This isn’t an article in the style of “Stop using $X, just use $Y”. I believe you can create a wildly successful product, or fail spectacularly, with any popular technology. The question, for me, is what constructs we need to make systems easier, faster, cheaper. Here’s a partial list:
- I want to launch jobs with better isolation from each other — each getting the resources it needs to run unhindered, rather than needing to fight over resources from a very limited pool. I want to avoid getting to a domino effect because a single job has derailed.
- I want the needed resources to be (nearly) instantly available. Yeah, I guess I’ve become that spoiled kid…
- I want to only pay for the actual work done, from the millisecond my actual code started running to the millisecond it ended.
- I want to push the complexity of orchestrating hardware and software to battle-tested components whose focus is exactly that, not any applicative logic. Using these lower-level constructs, I could build higher-level orchestration that is way more straightforward. Simple (rather than easy) is robust.
- I want jobs to run in multiple modes: it could be “serverless” for fast, easy scaling of bursty short-lived work and interactive user requests at the expense of higher price per compute unit/second. Or, it could be utilizing spot/preemptive instances — somewhat slower to launch and scale, but very cheap for scheduled bulk workloads.
I’m not inventing any new concept here, of course. The building blocks are by now pretty much available; open software that takes full advantage of them, however, is not quite there yet. To demonstrate what I mean, let’s tackle a real challenge guided by these principles.
In the next part I will dive into Funnel Rocket, an open-source query engine that is my attempt at it. It was built to solve a very specific pain point we needed to address at Dynamic Yield, but as I worked on it I’ve realized it can also become a testbed for much more.
To part 2: Funnel Rocket: A Cloud Native Query Engine.
*) a note on the title: “All Clusters Must Die” is a paraphrase of “All Men Must Die” from the show Game of Thrones, a reminder of the ephemeral nature of life and all things. See: Memento Mori. Our case is a bit more upbeat, I think: if you know how and why things are fragile, you can better build for resilience.