The previous post described some points
that need considering when creating a Big Data Architecture. Here
the focus is on what is arguably the central activity, designing a
data pipeline using tools from the Hadoop Ecosystem. The result is
necessarily generic, though based on experience, in order to avoid
revealing client-specific and potentially sensitive information.
A generic Data Pipeline
A data pipeline moves data from a data
source through a series of transformations that reduce the quantity
and improve the quality of the data at each stage, with the latency
of the data generally increasing. At each stage the reduced data is
stored in case future insights require further processing.
Figure: A generic, common-sense data pipeline
- In the ETL (Extract, Transform, Load) stage the data is loaded into HDFS.
- The data is then cleaned, transformed into a form suitable for analysis, and stored.
- The cleaned data is then subjected to standard analyses and the results again stored.
- Finally the data is stored in a form suitable for analysis by standard Business Intelligence tools or for visualisation.
The final stage may involve data
virtualisation, that is, making the data look like it came from a
relational database or other conventional store.
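
The four steps listed above can be sketched as a chain of functions whose
output is written to disk after every stage. This is only a minimal
illustration: the stage functions, the JSON files and the local directory
are assumptions standing in for real jobs and HDFS paths.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(records, stages, out_dir):
    """Run each stage in order and persist its output before moving on."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sizes = {}
    for position, (name, stage) in enumerate(stages, start=1):
        records = stage(records)
        # Keep every intermediate result so later work can restart from any stage.
        path = out_dir / f"{position:02d}_{name}.json"
        path.write_text(json.dumps(records))
        sizes[name] = len(records)
    return records, sizes


# Hypothetical stages, one per step in the list above.
def load(rows):
    return list(rows)


def clean(rows):
    return [row for row in rows if row.get("value") is not None]


def analyse(rows):
    return [{"count": len(rows), "total": sum(row["value"] for row in rows)}]


def publish(rows):
    return rows


STAGES = [("load", load), ("clean", clean), ("analyse", analyse), ("publish", publish)]

if __name__ == "__main__":
    raw = [{"value": 3}, {"value": None}, {"value": 4}]
    with tempfile.TemporaryDirectory() as tmp:
        result, sizes = run_pipeline(raw, STAGES, tmp)
        print(result, sizes)
```

The record count returned for each stage makes it easy to check that the
data really does shrink as it moves along the pipeline.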
Raw data may come from a variety of
sources: sensors, real-time feeds, relational databases or log files.
Whatever the source, the data has to be stored in HDFS. Here only data
from relational databases and log files is considered, though there are
ways to handle continuous data feeds. The content of a relational database
can be imported using Apache Sqoop, and the content of log files can
be imported using Apache Flume. Choosing the exact means of data ingestion
requires knowing the maximum acceptable latency. In some cases this
may be as long as a day, in others 15 minutes or less. Storm, another
open source tool, may be a good alternative to Flume [1].
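
As a rough sketch of the relational database case, the snippet below
builds the argument list for a Sqoop 1 table import. The JDBC URL, table
name, user, password file and target directory are all hypothetical, and
the options should be checked against the Sqoop version actually
installed.

```python
import shlex


def sqoop_import_command(jdbc_url, table, target_dir, username, password_file, mappers=4):
    """Build the argument list for a Sqoop 1 table import into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


# Hypothetical connection details, for illustration only.
cmd = sqoop_import_command(
    jdbc_url="jdbc:mysql://db.example.com/sales",
    table="orders",
    target_dir="/data/raw/orders",
    username="etl",
    password_file="/user/etl/.db-password",
)
print(shlex.join(cmd))
```

On a machine with Sqoop installed the list could be passed to
`subprocess.run(cmd, check=True)`, for example from a daily scheduled job
when a latency of a day is acceptable.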
Don't try to build the perfect system
right away. You may need to start with something that takes a day to
update HDFS, then refine the architecture.
After the data is loaded and stored in
HDFS it must be cleaned and pruned. Since moving data along the
pipeline can be expensive, the data should be reduced as much as
possible at each step. Start by removing data that is not needed.
This can dramatically cut down future processing times. Next, remove
duplicate records, and finally remove “dirty” records. Do not be
in a hurry to apply statistical analysis and remove outliers: it may
be better to perform some ad-hoc analyses, at least initially, to
identify outliers and determine whether they should be removed.
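
The order of the cleaning steps can be illustrated with a small,
in-memory sketch. The field names and the rule for what counts as
“dirty” are assumptions made for the example; on a real cluster the same
logic would run as a Hive, Pig or MapReduce job. Outliers are
deliberately left alone.

```python
def prune_and_clean(records, wanted_fields, is_dirty):
    """Keep only wanted fields, drop duplicates, then drop dirty records."""
    seen = set()
    cleaned = []
    for record in records:
        # Step 1: remove data that is not needed.
        slim = {field: record.get(field) for field in wanted_fields}
        # Step 2: remove duplicates, judged on the retained fields only.
        key = tuple(slim[field] for field in wanted_fields)
        if key in seen:
            continue
        seen.add(key)
        # Step 3: remove "dirty" records.
        if is_dirty(slim):
            continue
        cleaned.append(slim)
    return cleaned


def missing_or_negative(record):
    """Hypothetical rule: a field is missing or the amount is negative."""
    if any(value is None for value in record.values()):
        return True
    return record["amount"] < 0


rows = [
    {"id": 1, "amount": 10.0, "debug": "x"},
    {"id": 1, "amount": 10.0, "debug": "y"},  # duplicate once "debug" is dropped
    {"id": 2, "amount": None, "debug": "z"},  # dirty: missing amount
    {"id": 3, "amount": -5.0, "debug": "w"},  # dirty: negative amount
    {"id": 4, "amount": 7.5, "debug": "v"},
]
result = prune_and_clean(rows, ["id", "amount"], missing_or_negative)
print(result)
```

Dropping unneeded fields first means duplicates are detected on the data
that actually matters, and every later step handles smaller records.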
With this truncated, cleaned data
safely stored, standard analyses can be run. The results should be
small enough to be handled by a conventional database like MySQL or
a NoSQL database like MongoDB. Alternatively, Hive or HBase can be
used, perhaps behind a data virtualisation engine, which will allow
analysts to carry out experiments using tools with which they are
familiar.
Figure: General data pipeline based on the Hadoop Ecosystem
This last stage is the end of the
pipeline. Make sure everyone who needs the data knows where the
output from each stage is stored, since experiments may have to be
run starting at, or combining data from, different stages. Make sure
the data cannot be overwritten by an enthusiastic new starter, and get
your backup strategy right: the author remembers a project where the
main database vanished and had not been backed up, and one where an
entire hard disc, again not backed up for months, was accidentally
wiped. In this field paranoia is a survival trait.
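
One simple guard against accidental overwrites, shown here for a local
file system only, is to write each stage's output in exclusive-create
mode so that a second write to the same path fails loudly. The helper
name and the file name are hypothetical, and this is no substitute for
proper permissions and backups.

```python
import os
import tempfile


def write_once(path, data):
    """Write stage output, refusing to replace a file that already exists."""
    # Mode "x" creates the file and raises FileExistsError if it is already there.
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(data)
    return path


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "stage2_cleaned.csv")
    write_once(target, "id,amount\n1,10.0\n")
    try:
        write_once(target, "oops")
    except FileExistsError:
        print("refused to overwrite", target)
```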
The wrap
The outline design of a generic data
pipeline was discussed. This is a good starting point for most
projects but should be considered only one possibility: avoid the “To
a man with a hammer everything looks like a nail” syndrome. The
kind of architecture described here can be used to help the client
clarify or even formulate their requirements. Think of it initially
as a kind of Rorschach test that enables the client and architect to
gain insight into the project.
The author is a freelance data scientist with long experience of software development and is always interested in hearing about potential new collaborations, preferably in mainland Europe or SE Asia.






