Is your data out of shape? Why is it that you can never find the information you need? You're surely familiar with that sinking feeling: you spot a glaring error in the numbers, but you have no idea where it's coming from. Your data lake is bloated beyond belief, so your team doesn't know where to start looking. You're now in panic mode, with a report deadline looming and a presentation due shortly. How are you going to fix things in short order and maintain your credibility?
Here's the bad news – it's too late at this point. You've let bad data habits seep in over time, and you no longer have the nimbleness to respond to contingencies. It's just like what happens when you overeat and don't exercise. I see many parallels between personal fitness and data hygiene: similar incentives fuel vicious cycles of overconsumption and underutilization, and you need similarly disciplined approaches to break free from those traps.
There’s so much data!
We have so much data. How did this happen? The simple answer is economics. When something gets cheap, we consume a lot of it. Just as Nixon-era farm subsidies gave us an America flooded with cheap calories, rapid advances in storage technology brought us the mind-boggling amount of cheap storage that we take for granted today.
Cheap storage drives data volumes. We no longer have to be selective about what we store. On a global level, 2.5 exabytes (2.5 billion gigabytes) of data are produced every day. Take a moment to think about that: in 1995, one gigabyte of hard disk storage cost $250. If that were still the case today, would we be collecting so much data without first having a plan for it, just in case some of it might prove useful one day? No. But at current prices of $0.03 a gigabyte, cost is no longer a constraint.
Which means we can collect all the data we want. The promise of a data lake that stores all of our raw data forever – data that we can process at any later time – is irresistible.
So where’s the issue?
Houston, we have a problem
Here's the downside to having a lot of data: you will not use it unless you understand what's actually in it. When data gets old, it's easy to forget the context under which it was collected – all the more so when you have many rapidly changing sources of data. You need some way to address context drift; otherwise you're going to struggle to connect the dots between old data and new. You'll often find that your data loses all its value as it ages – unless you have a way to capture and preserve the signal in it.
Another risk to watch out for is data quality. Many analytical errors can be traced back to bad data – incomplete, duplicated or corrupted records. There's no way you can manually inspect petabytes of data, so it's easy for errors to slip past unnoticed – unless you have automated ways of detecting and fixing them.
Commit to a fitness plan
You need a fitness program to whip your data into shape. Instead of a sluggish, lumbering hulk, you want something that's lean and mean. Your raw data needs to be processed into a form that makes it fast and simple to study high-level trends, and also to conduct needle-in-a-haystack searches.
It’s tempting to go the quick and dirty route of ad-hoc scripts for data processing. But as you introduce more sources and dependencies, it doesn’t take long for your data pipelines to devolve into a tangled mess. It then becomes impossible to guarantee correctness or troubleshoot data errors. You’re now back at square one.
Personal trainers make you consistent
What you want is a personal trainer for your data: a transparent, flexible and consistent system that enforces a data fitness regimen. You can specify a data processing plan and schedule, and trust that it will be executed on time. You can also monitor data quality, and get alerted as soon as issues arise. Data pipelines can be replayed after those errors are fixed. If any assumptions about your data change, or if you want to add more data sources, the system is flexible enough to accommodate those modifications easily.
There are many options to choose from for such a data processing system – Spotify’s Luigi, Pinterest’s Pinball, StreamSets, AWS Data Pipeline and Google Dataflow to name a few. I’m going to show you an example using Airflow, an awesome data processing engine open sourced by Airbnb. It traces its roots to Dataswarm, a similar internal system that I enjoyed using at Facebook.
Case Study: Airflow
I recently gave a talk on Airflow at SF Bay Area Data Ingest. While the Airflow documentation is great, I found the examples rather abstract, so I put together a more concrete toy example for the talk to illustrate Airflow's usefulness.
Tech Industry Sentiment Analysis
Let’s say you want to monitor the sentiment of the tech industry by following what’s popular on Hacker News. You could do that as follows:
- Pull top news articles for the day using the Hacker News API.
- Run sentiment analysis on the headline and link content using IBM Watson’s Sentiment Analysis API.
- Store the data in a reporting database and point your dashboarding software to it. I used Airbnb’s Caravel for the dashboard demo.
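The ingest side of the steps above can be sketched in a few lines of Python. The Hacker News endpoints below are the real public API; the staging-row shape and the `limit` parameter are my own simplifications, and I've left out the Watson call since its client setup depends on your account.

```python
import json
import urllib.request

HN_API = "https://hacker-news.firebaseio.com/v0"

def top_story_ids(limit=30):
    """Fetch the IDs of the current top Hacker News stories."""
    with urllib.request.urlopen(f"{HN_API}/topstories.json") as resp:
        return json.load(resp)[:limit]

def to_staging_row(item):
    """Trim a raw HN item down to the fields the sentiment task needs."""
    return {
        "id": item["id"],
        "title": item.get("title", ""),
        "url": item.get("url", ""),
    }
```

Each story ID can then be fetched from `{HN_API}/item/<id>.json` and written to the staging table as a row.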
How do you configure this processing flow in Airflow? You set up a daily job with two tasks: one pulls the latest Hacker News feed and stores it in a staging table, and the other runs sentiment analysis on the staging-table data. You declare the data dependency between the two tasks, so that the sentiment analysis task will not run before the news feed pull is complete. Splitting the job into two tasks this way gives you the flexibility to monitor and ensure data quality for each stage separately.
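In Airflow terms, that configuration is a DAG with two operators and one declared dependency. The sketch below is a minimal illustration rather than a drop-in DAG file: `pull_hn_feed` and `score_sentiment` are hypothetical callables you would supply yourself, and names like `hn_sentiment` and `mypipeline` are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical user-supplied functions: the first writes the day's top
# stories to a staging table, the second scores them for sentiment.
from mypipeline.tasks import pull_hn_feed, score_sentiment

dag = DAG(
    dag_id="hn_sentiment",
    start_date=datetime(2016, 6, 1),
    schedule_interval="@daily",  # run once per day
)

pull = PythonOperator(task_id="pull_hn_feed",
                      python_callable=pull_hn_feed, dag=dag)
score = PythonOperator(task_id="score_sentiment",
                       python_callable=score_sentiment, dag=dag)

# Declare the dependency: sentiment scoring waits for the feed pull.
pull.set_downstream(score)
```

The `set_downstream` call is what keeps the sentiment task from running before the staging table is populated.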
You can now put together a dashboard that points to the reporting table and track trends in tech industry sentiment.
Now let's say you decide to write your own sentiment analysis algorithm. You can easily extend your job with a third task and run it on all historical staging data, using an operation known as a backfill. You can then compare the results of your algorithm with Watson's. On future runs, Airflow will execute both sentiment tasks after the news feed pull completes, so you can keep comparing performance trends without manual intervention.
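With Airflow's CLI, a backfill is a one-liner; the DAG name and date range below are placeholders for your own.

```shell
# Re-run the job (including the new task) over all historical dates
airflow backfill hn_sentiment -s 2016-01-01 -e 2016-06-01
```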
That wraps up the case study. Feel free to reach out to me if you want details on how I set up Airflow, or how I put together the data analysis pipeline.
Walk in the footsteps of giants
Hopefully I’ve convinced you of the value of a flexible data processing system to keep your data fit and nimble. Tech giants like Google and Facebook have been using such systems for years, and now they are available for everyone to use. There’s no excuse for the statistic that only 27% of companies believe that they are getting any value out of their data. I think anyone who wants to benefit and keep benefiting from their data for years to come needs to follow in the footsteps of these giants.