The previous post described some points that need to be considered when creating a Big Data Architecture. Here the focus is on what is arguably the central activity: designing a data pipeline using tools from the Hadoop Ecosystem. The result is necessarily generic, though based on experience, in order to avoid revealing client-specific and potentially sensitive information.
A generic Data Pipeline
A data pipeline moves data from a data source through a series of transformations that reduce the quantity and improve the quality of the data at each stage, with the latency of the data generally increasing. At each stage the reduced data is stored in case future insights require further processing.
|A generic common sense based data pipeline|
- In the ETL (Extract, Transform, Load) stage the data is loaded into HDFS.
- The data is then cleaned, transformed into a form suitable for analysis, and stored.
- The cleaned data is then subjected to standard analyses and the results again stored.
- Finally the data is stored in a form suitable for analysis by standard Business Intelligence tools or for visualisation.
The final stage may involve data virtualisation, that is, making the data look like it came from a relational database or other conventional store.
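To make the stage-by-stage storage concrete, the sketch below shows one possible way of laying out a directory per stage in HDFS so that later experiments can start from any stage. The paths, stage names and date-partitioning scheme are purely illustrative assumptions, not part of any particular tool.

```python
# Illustrative only: one possible HDFS layout, one directory per pipeline stage.
# All paths and the date-partitioning scheme are hypothetical.
from datetime import date

HDFS_ROOT = "/data"
STAGES = ["raw", "cleaned", "analysed", "published"]

def stage_path(stage: str, run_date: date) -> str:
    """Return the HDFS directory for one stage of one daily pipeline run."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return f"{HDFS_ROOT}/{stage}/dt={run_date.isoformat()}"

# e.g. stage_path("cleaned", date(2015, 6, 1)) -> "/data/cleaned/dt=2015-06-01"
```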
Raw data may come from a variety of sources: sensors, real-time feeds, relational databases or log files. The data has to be stored in HDFS. Here only data from relational databases and log files is considered, though there are ways to handle continuous data feeds. The contents of a relational database can be imported using Apache Sqoop and the contents of log files can be imported using Apache Flume. Choosing the exact means of ingestion requires knowing the maximum acceptable latency; in some cases this may be as long as a day, in others 15 minutes or less. Storm, an open source alternative to Flume, may be a good choice when latency requirements are tight.
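As a rough illustration of the relational-database path, the Python sketch below wraps a nightly `sqoop import` call. The JDBC URL, table name and HDFS target directory are hypothetical, and in practice a scheduler such as cron or Oozie would normally drive this step.

```python
# A minimal sketch of invoking a Sqoop import from Python.
# Connection string, table and target directory are hypothetical.
import subprocess

def sqoop_import(jdbc_url: str, table: str, target_dir: str) -> None:
    """Run 'sqoop import' and raise if it fails."""
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,        # e.g. jdbc:mysql://dbhost/sales
        "--table", table,
        "--target-dir", target_dir,   # lands in HDFS, e.g. /data/raw/orders
        "--num-mappers", "4",         # parallelism; tune to what the source DB can bear
    ]
    subprocess.run(cmd, check=True)

# sqoop_import("jdbc:mysql://dbhost/sales", "orders", "/data/raw/orders/dt=2015-06-01")
```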
Don't try to build the perfect system right away: you may need to start with something that takes a day to update HDFS and then refine the architecture.
After the data is loaded and stored in HDFS it must be cleaned and pruned. Since moving data along the pipeline can be expensive, the data should be reduced as much as possible at each step. Start by removing data that is not needed; this can dramatically cut down future processing times. Next remove duplicate records, and finally remove “dirty” records. Do not be in a hurry to apply statistical analysis and remove outliers: it may be better to perform some ad-hoc analyses, at least initially, to identify outliers and determine whether they should be removed.
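One possible way to implement this pruning step is a small Spark job, sketched below; the same logic could equally be written as a Pig script or a MapReduce job. The column names, paths and the rule defining a “dirty” record are invented purely for illustration.

```python
# A sketch of the pruning/cleaning step using PySpark.
# Columns, paths and the "dirty record" rule are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean-raw-data").getOrCreate()

raw = spark.read.parquet("/data/raw/orders/dt=2015-06-01")

cleaned = (
    raw
    .select("order_id", "customer_id", "amount", "order_ts")        # drop unused columns first
    .dropDuplicates(["order_id"])                                    # remove duplicate records
    .filter(F.col("amount").isNotNull() & (F.col("amount") >= 0))    # drop "dirty" records
)

cleaned.write.mode("overwrite").parquet("/data/cleaned/orders/dt=2015-06-01")
```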
With this truncated, cleaned data safely stored, standard analyses can be run. The results should be small enough to be handled by a conventional database such as MySQL or a NoSQL database such as MongoDB. Alternatively, Hive or HBase can be used, perhaps behind a data virtualisation engine, allowing analysts to carry out experiments using tools with which they are familiar.
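The sketch below shows one example of such a “standard analysis”: aggregate the cleaned data and push the much smaller result into a conventional database over JDBC. The aggregation, connection details and table name are hypothetical; writing to Hive or HBase instead would follow the same pattern with a different sink.

```python
# A sketch of an analysis whose small result is pushed to a relational database.
# Aggregation, table name and MySQL connection details are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-order-summary").getOrCreate()

cleaned = spark.read.parquet("/data/cleaned/orders/dt=2015-06-01")

summary = (
    cleaned
    .groupBy("customer_id")
    .agg(F.count("order_id").alias("n_orders"),
         F.sum("amount").alias("total_amount"))
)

(summary.write
    .format("jdbc")
    .option("url", "jdbc:mysql://dbhost/reports")   # hypothetical reporting database
    .option("dbtable", "daily_order_summary")
    .option("user", "etl")
    .option("password", "***")
    .mode("append")
    .save())
```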
|General Data Pipeline based on Hadoop Ecosystem|
This last stage is the end of the pipeline. Make sure everyone who needs the data knows where the output from each stage is stored, since experiments may have to be run starting at, or combining data from, different stages. Make sure the data cannot be overwritten by an enthusiastic new starter, and get your backup strategy right: the author remembers a project where the main database vanished and had not been backed up, and another where an entire hard disc, again not backed up for months, was accidentally wiped. In this field, paranoia is a survival trait.
The outline design of a generic Data Pipeline was discussed. This is a good starting point for most projects but should be considered only one possibility: avoid the “to a man with a hammer everything looks like a nail” syndrome. The kind of architecture described here can be used to help the client clarify or even formulate their requirements. Think of it initially as a kind of Rorschach test that enables the client and architect to gain insight into the project.
The author is a freelance Data Scientist with long experience of software development and is always interested in hearing about potential new collaborations, preferably in mainland Europe or SE Asia.