Data Management, Facebook-style
Jeff Hammerbacher, the former lead of the Data Team at Facebook and now VP of Product at Cloudera, put up some great slides on the evolution of Facebook's data management strategy.
They're very interesting from many perspectives, so take a look and then stay tuned for my two cents.
Growing With Data
Jeff was at Facebook for about two and a half years and saw it grow from a company dealing with gigabytes of data per day to one dealing with terabytes of data per day. It was his job to guide the process of making sense of this pile of semi-structured data.
The technical aspects are interesting, but what's more interesting to me is the story. A good title for the presentation might be "Growing With Data."
The Three Stages
The most interesting part to me was how Facebook's data initiatives evolved over time to meet the company's growing needs.
At first they did what everyone does: periodic offline batch processing. But we all know this doesn't scale forever, especially if your data is growing at an exponential rate.
Eventually you wind up in a situation where you produce more data in an hour than you can process in an hour. You can try to scale vertically, getting more bandwidth, more processing power, faster disks, etc., but the exponential nature of the situation will win in the end.
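To make the "exponential wins in the end" point concrete, here is a minimal sketch in Python. The function name and all of the numbers are made up for illustration; none of them come from Jeff's slides. It just counts how many days pass before a daily data volume that grows by a fixed percentage outruns a fixed processing capacity.

```python
def first_day_over_capacity(initial_gb, daily_growth, capacity_gb, max_days=10_000):
    """Return the first day on which daily volume exceeds processing capacity.

    initial_gb   -- data produced on day 0 (hypothetical figure)
    daily_growth -- fractional growth per day, e.g. 1.0 means doubling
    capacity_gb  -- how much data the batch job can process per day
    Returns None if capacity is never exceeded within max_days.
    """
    volume = initial_gb
    for day in range(max_days):
        if volume > capacity_gb:
            return day
        volume *= 1 + daily_growth
    return None


# Hypothetical workload: 1 GB on day 0, doubling every day.
baseline = first_day_over_capacity(1, 1.0, 8)
# Same workload after buying twice the hardware.
upgraded = first_day_over_capacity(1, 1.0, 16)
```

In this toy setup, doubling capacity delays the crossover by exactly one doubling period. That is the whole problem with scaling vertically against exponential growth: each upgrade buys a fixed amount of time, no matter how big the upgrade is.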
Once the ad hoc ETL system no longer met their needs, they built a system for distributed logging. Unfortunately, it didn't provide the flexibility they needed: analysts couldn't run SQL against it, and maintaining the system was difficult.
Eventually they hit upon Hadoop, an open-source implementation of Google's MapReduce. On top of it they built Hive, a system for querying datasets stored in Hadoop files. This means you get the scalability of Hadoop and the flexibility of a SQL-like language. It's very slick.
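If you haven't seen MapReduce before, here is a rough single-machine sketch in Python of the idea behind a SQL-like query such as SELECT page, COUNT(*) FROM views GROUP BY page. To be clear, this is not how Hive or Hadoop is actually implemented, and the table, field, and function names are invented for illustration. It only shows the shape of the computation: a map step that emits key/value pairs, a shuffle that groups them by key, and a reduce step that aggregates each group.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (page, 1) pair for every page-view record."""
    for record in records:
        yield record["page"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(records):
    """Toy equivalent of: SELECT page, COUNT(*) FROM views GROUP BY page."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical log records.
views = [{"page": "home"}, {"page": "profile"}, {"page": "home"}]
counts = count_by_page(views)
```

In a real cluster the map and reduce steps run in parallel across many machines, which is where the scalability comes from. The appeal of a SQL-like layer is that an analyst writes the one-line query and never has to think about the three functions above.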
They also built Cassandra, which provides a BigTable-like system for storing massive amounts of structured data.
Evolution, not Revolution
What I like most is the story. They didn't start by building these complex tools; the tools evolved to fit a growing need within the company. Beyond that, I like that their approach to Hive was so customer-centric. The analysts wanted SQL, so the team built a SQL-like language on top of their fancy distributed technology. Very cool.
There's a lot more where that came from over at the Cloudera blog, so check it out. The future is data.