Tuesday, 23 April 2013

Designing a Data Pipeline: A general approach


The previous post described some points that need considering when creating a Big Data Architecture. Here the focus is on what is arguably the central activity, designing a data pipeline using tools from the Hadoop Ecosystem. The result is necessarily generic, though based on experience, in order to avoid revealing client specific and potentially sensitive information.

A generic Data Pipeline

A data pipeline moves data from a data source through a series of transformations that reduce the quantity and improve the quality of the data at each stage, with the latency of the data generally increasing. At each stage the reduced data is stored in case future insights require further processing.
A generic common sense based data pipeline


  • In the ETL (Extract, Transform, Load) stage the data is loaded into HDFS.
  • The data is then cleaned, transformed into a form suitable for analysis, and stored.
  • The cleaned data is then subjected to standard analyses and the results again stored.
  • Finally the data is stored in a form suitable for analysis by standard Business Intelligence tools or for visualisation.
The final stage may involve data virtualisation, that is, making the data look like it came from a relational database or other conventional store.


Raw data may come from a variety of sources, sensors, real time feeds, relational databases or log files. The data has to be stored in HDFS. Here only data from relational databases and log files is considered, though there are ways to handle continuous data feeds. The content of a Relational Database can be imported using Apache SQOOP and the content of log files can be imported using Apache FLUME. The exact means of data ingestion requires knowing the maximum acceptable latency. In some cases this may be as long as a day in others 15 minutes or less. Storm, an open source alternative to FLUME may be a good alternative [1].

Don't try to build the perfect system right away, you may need to start with something that takes a day to update HDFS then refine the architecture.

After the data is loaded and stored in HDFS it must be cleaned and pruned. Since moving data along the pipeline can be expensive the data should be reduced as much as possible at each step. Start by removing data that is not needed. This can dramatically cut down future processing times. Next remove duplicate records and finally remove “dirty” records. Do not be in a hurry to apply statistical analysis and remove outliers: it may be better to perform some ad-hoc analyses, at least initially, to identify outliers and determine whether they should be removed.

With this truncated, cleaned data safely stored standard analyses can be run. The results should be small enough to be handled by a conventional database like MySQL or a NoSQL Database like Mongo DB. Alternatively Hive or HBASE can be used perhaps behind a data virtualisation engine which will allow analysts to carry out experiments using tools with which they are familiar.



General Data Pipeline  based on Hadoop Ecosystem



This last stage is the end of the pipeline. Make sure everyone who needs the data knows where the output from each stage is stored since experiments may have to be run starting at or combining data from different stages. Make sure the data cannot be overwritten by an enthusiastic new starter and get your backup strategy right: the author remembers a project where the main database vanished and had not been backed up, and one where an entire hard disc, again not backed up for months, was accidentally wiped. In this field Paranoia is a survival trait.


The wrap

The outline design of a generic Data Pipeline was discussed. This is a good starting point for most projects but should be considered only one possibility: Avoid the “To a man with a hammer everything looks line a nail” syndrome. The kind of architecture described here can be used to help the client clarify or even formulate their requirements. Think of it initially as a kind of Rorsach test that enables the client and architect to gain insight into the project.

The Author is a freelance Data Scientist with long experience of Software development and is always interested in hearing  about potential new  collaborations preferably in Mainland Europe,  or SE Asia.


Sunday, 14 April 2013

Designing a Big Data Architecture


Big Data has been characterised as data that is too big, changes too fast, or is too varied to be handled by conventional tools. To a great extent there has always been such data earlier generations did not need to handle petabytes of information, information from a multiplicity of sensors, and process images and videos. At one time sensor data, such as missile or torpedo telemetry data could be analysed by humans converting analogue traces to numbers with a ruler and comptometer. In a day a fast operator could convert the amount of information an average system generates in less than a second. The growth in the volume of data may be outstripping Moores Law, and in any case the variety and speed with which it arrives together with the variety of sources that need to be linked and integrated, perhaps in real time, require special treatment and novel tools.

Business feels that the hoard of data they own contains new insights that will allow them to improve customer loyalty and spot risks and opportunities. Government, obsessed as always by the insane desire to monitor everything and the delusion that they can, wants to use the vast amounts of data they have for various purposes, crime fighting, health care and the identification of subversives (defined in the UK by one Labour Home secretary as “Anyone who causes problems for the government”. Doubtless organised crime, already an economy of more than $3 Trillion a year is also gearing up to steal and mine data for blackmail and fraud and to identify targets for the BBC (Bribery, Blackmail, and Coercion in case you were wondering) business model. Whatever the aims of the customer Big Data is here to stay. It is probably just leaving the summit of hype and entering the trough of disillusionment before hitting the platform of productivity.

Specialist tools have been developed to handle Big Data and Architects are getting to grips with the need to integrate these tools into a coherent structure for handling data flow. Here the focus is on the batch-oriented Hadoop ecosystem.


The Importance of Being Earnest about Architecture

Architecture is like Physics, it sets the ground rules for just about everything. When it comes to Big Data the architecture is determined by the clients requirements and not only sets the ground rules for developers but may also affect performance and scalability. Many problems can be avoided by using a cloud and problems with the cloud can be sidestepped by using a cloud provider. Architecture is challenging because, to paraphrase one of the chief architects of the Bush Administration, it involves the known, the unknown and the unknown unknown. At the start of a project nothing is known and factors can shift between these three categories very rapidly.



Hadoop

Hadoop is a term given to a variety of tools developed by the Apache foundation for handling Embarrassingly Parallel data processing problems using the Map-Reduce paradigm pioneered by Google. In this paradigm data items are fed into a pipeline and transformed to key value pairs ( The MAP portion) which are then combined in the REDUCE portion. This is inherently Batch Oriented, though moves are afoot to change this. Data may be subject to multiple Map-Reduce transforms before being finally stored ready for processing. Here the environment in mind is Hadoop though much of what is said is not specific to Hadoop.

A Generic Big Data Architecture

At the start of a project all that can be identified is a generic architecture with a stage that parallels the ETL ( Extract, Transform, Load) stage of traditional data warehousing, an analysis stage where data is transformed into knowledge and stored, and a final stage, perhaps carried out with traditional analysis tools where the knowledge is transformed into insight and ,with luck, wisdom. At this stage the high level architecture corresponds to common sense, an invaluable commodity for which no one is willing to pay. As time passes, which it does if left to its own devices, the boxes can be filled in more precisely. The ambiguities in the client's needs mean the architecture expands to look like it includes everything, including the kitchen sink, and is then reduced to a skeleton of its former self, and is just about finished when a developer ( most architects are also very good, perhaps excellent developers) can look at it and say “I think I could implement a prototype”.

The Generic Architecture may thus use SQOOP to pull data from relational databases and store it in the Hadoop Distributed File System (HDFS) or to pull it from HDFS into a relational database, and use Apache Flume to extract data from log files and store it in HDFS. Custom Map-Reduce code will then be used to clean the data. This may be a mixture of PIG scripts (PIG is a Map-Reduce scripting language) and raw Map-Reduce code. The cleaned data will then be available for analysis and the original data available for re cleaning in the light of future knowledge.

Once standard analyses have been carried out on the cleaned data the results are again stored and made available to data analysts and others. Some form of data virtualisation may be required, or analysts may need to retrain with new tools such as Alpine Data Miner, Rapid Miner, Precog, or Knime which either work with Hadoop directly out of the box, or have extensions that allow this. Whichever tools are used it is a good idea to ban analysts from programming, or better still use a tool that does not require programming (PIG is acceptable and R may be acceptable in moderation). If programming is needed hire developers. This frees analysts to think like analysts not developers.

One option for final storage is to use one of the Hadoop databases, either Hbase or HIVE which allow easier data virtualisation.

Cleaning the data

Cleaning the data, like politics, is a dirty business which may leave permanent scars on the soul. Considering log files records may be polluted by incomplete records, test data or the inevitable errors that sometimes occur in transmission and recording: One error every billion lines mounts up when terabytes a day come in. Data may be missing or wrong and there is often a legal requirement to remove data that can be used to identify an individual (Organised Crime would love to get this). There are estimates that 80% of the time processing Big Data is spent cleaning it up.

A quick tip: try to ensure a way of labelling records that identifies test data, for example from mobile phone application testing. It will save a lot of grief.

Moving on from Log data sensor data, pictures, videos, soundtracks and humanly input data such as Tweets will have their own problems. The architect needs to be aware of the formats and problems of such data but there will always be the unexpected.


Best Practises

Best Practises for Big Data Architecture are evolving though there are some guidelines [2] which the author found useful and motivated much of this paper. Until the architect has gained some experience these, like the rules of composition in photography, should be followed as much as possible.

  • Structure data round analytics not ad hoc queries or standard reporting.
  • Corral Chaos: Don't even think of building an architecture that will be there in 25 years.
  • Prefer Anarchy at the borders: Let analysts play in sandboxes with their favourite tools then turn their techniques over to a specialist team who integrate them into the standard processes.
  • Start simple: a “simple” task like ETL is enough for a start.

Best Practises for the Architect include

  • Build a data pipeline or highway which starts with raw data and moves to ever more stable data. Kimball recommend up to 5 stages or caches ending with an enterprise data warehouse (EDW)
  • Use analytics at one stage to extract facts to input to the next stage
  • Integrate data from as many sources as possible
  • Make sure data quality gets better as it moves along the pipe.
  • Reduce the data as much as possible at each stage: filter, clean, prune and diagnose all the time

The wrap

This has been a brutally short introduction to a complex subject, based on emerging best practises and on the author's experience. The take home points here are

  • Expect to have to design in the face of uncertainty.
  • Have a generic design in mind and be aware it will change soon and often.
  • Try to design for data quality as well as performance

    The author is a Data Scientist specialising in Big Data, with an emphasis on Hadoop and Map Reduce and  has an interest in Visualisation, Analytics and Architecture.  The author is always interested in hearing about potential collaboration.


    Further Reading

    [1] 15 Global Challenges for the next Decade: Jerome C Glenn, in Futures: Visions for a Better World, BBVA 2012 p. 86


Friday, 29 March 2013

Decisions and Decision Trees


Learning is about classification and prediction and the gap between the two can get a little hazy.
As a motivating example consider an archeologist with a heap of pottery shards. They can describe the shards in terms of a number of features ( also known as attributes or variables) and apply a machine learning algorithm such as clustering to classify the shards into types. Alternatively the learning algorithm can be a predictive model that will assign a shard found in the future to one of a number of types based on the attributes rather than on the cluster. One such method is the decision tree.

This note provides a simple look at the use of decision trees in data mining to build predictors. For technical details consult Wikipedia or similar sources.

What is a decision tree?

Everyone has created a decision tree at some point. Basically start with a decision to be made and make that the centre of a radiating branch of labelled lines, one for each option. If there are forther decisions to be made for each option draw a square on the end of the line and again draw lines for each option. If chance events could affect the outcome draw a circle. When you have reached y goal or final decision stop. Label each line with how good or bad the outcome is and how likely it is.

Try it now: assume you want to go out. You have three choices. Beach, Bar or Shops. Draw a tree for these options. How does the chance of bad weather affect your decision. Will you still go to the bar if all your friends went to the beach? How likely is it the shops will include one that has what you need? And so on.

A decision tree shows in graphical form the options, the likely outcome of the options and the way chance events will affect the outcome (you won't head for the beach of it is likely to be raining there). It can be a useful teaching and learning tool by means of which an expert can pass on not only their knowledge but also their way of thinking.

Using Decision Trees for prediction.

Using a learning algorithm for prediction needs data about the past and an assumption that the past will statistically at least be like the future. The data comes in the form of a records divided into a vector of data attributes and a vector of attributes known as the target. The goal is, for a future vector of data attributes, to be able to predict the value of the target. In the pottery example above the goal is to be able to assign an unknown shard to a date and place of origin and a style.

The textbook way to do this is to split the data into subsets that best predict the target. For example the best predictor of survival of the sinking of the Titanic is whether the person is male or female. And, if male, whether they had a family with them or not. If the attributes are simple binary variables, for example Male/female Young/old First class/economy class then for N attributes the tree will have 2^N end nodes and the same number of paths. Each end node will contain the probability of survival and the percentage of sample points that correspond to the end node. It is then simple, in theory at least predict the target value for a new datum.

Statistics and testing

The data used for learning has to be split into a large subset of training data and a smaller subset of test data. There are some technical issues involved in sampling the data but these are not relevant here, though if possible one should start using stratified sampling. The model is built using the training data and then used to “predict” the target for each element of the test data. Success means scoring significantly above chance.

Since the results from one run could be  a statistical fluke the learning and testing process  must be run a number of times to reduce the probability of flukes and spurious correlations. The larger the data set the more reliable the result, but also the probability of a fluke increases.

In the above it was assumed that all data attributes were relevant to the prediction. In real life this is not the case and removing irrelevant attributes, known as confounding factors, is important. One way to achieve this is to run the learning process with different subsets of attributes and pick the smallest set of attributes that performs well. This is the basis of a technique called the Random Forest.

Decision tree advantages and disadvantages
A decision tree whether generated by humans or a machine is easy to understand, little data cleaning is needed and they can handle numerical and category data and statistical tests can determine the reliability of the model. They are also robust: the model generated may differ quite a bit from the true model and still work well ( so insights gained from a decision tree must be checke in other ways) and the simplicity of calculation means a decision tree algorithm can eat large amounts of data and process it without raising a sweat.

On the other hand learning a decision tree is an NP complete problem and practical algorithms uses some form of heuristic that means it is not possible to guarantee getting the best possible tree. Advanced techniques such as Simulated Annealing are hard to apply and slow the learning process considerably. They also overfit the data unless care is taken and some concepts are hard to learn because they do not fit well into the decision tree paradigm.


The Wrap

Decision trees are a useful tool for predictive analytics but must be used with care. Ideally the results they give must be tested for reliability, tested against other methods, tested against expert opinion and finally tested against that rare quality, common sense. This applies whether the results are counter intuitive or not. Data analysis packages such as Knime and Rapid Miner  include decision tree learners but the user should  ideally be aware of the  algorithms they use. 

About the Author
At the time of writing the author was a Big Data Consultant and Architect for a Bank in Spain and is always interested in hearing about Data and Big Data opportunities. The author has many years experience in software development and data analysis and is skilled in Java, competent in Python and able to use Hadoop Map Reduce and PIG.

[3] Introduction to Decision Science, Jeffrey Stanton, Syracuse University 2012







Saturday, 16 March 2013

Cyclic behaviour in a finite state automaton with Python and Mlutithreading


Partly as a result of having been burned outsourcing coding clients are increasingly requiring canddates to complete a coding test this one was hard to solve elegantly in the thirty minutes available but revealed some interesting aspects when analysed at leisure including a hint of a Hadoop MapReduce based solution. As a side note research shows such tests predict short term performance not long term performance and the author notes they are not designed to elicit creative solutions from candidates

The problem

Given an Array of digits A(k) and the function f: k ---> A(K) count the number of distinct cycles of the function f(k). A cycle is an iterative sequence of values of f(k), starting from a given initial value that ends back where it started.

Think of the array as a tape with a series of digits on it. A chess piece such as a pawn placed on a digit jumps to the location specified by that digit.

Suppose the tape has the following digits

2,3,1,1,3,5,7,6

Then a pawn placed on location zero jumps to location 2. Landing there it jumps to location 1 it then jumps to location 3 and then back to one. There is a cycle (1<-->3) another cycle is at the end where the pawn stays where it is, perhaps restlessly wanting to be out of the pawnshop. Two cycles are distinct if they do not cover the same set of locations.

Analysis

A deterministic finite state Automation is formally a set S of states {s}, a set I of inputs {I} that includes a null input ( that is no input) and a rule that specifies which state the automaton jumps to for each input.

The array and the function f can be considered as the resting behaviour of a a deterministic finite state automaton when there is no input.

Minsky's theorem that says any deterministic finite state automaton with no inputs will either cycle or stop ( a cycle where the state does not change). With a bit of thought this is obvious: there are only a finite number of states and eventually the automaton will return to an already visited state. This is the case considered here

A few things are obvious.
  1. The cycles are like magnets. If the system enters a cycle it can't get out. for aa system with a continuous set of states a cycle becomes a set called an attractor.
  2. No two cycles can have any states in common for if they did the next state rule would be ambiguous. In particular a cycle cannot contain a smaller cycle.
  3. Two cycles of different length are distinct

It follows a cycle is a subset of the set of states and the set of cycles is a set of disjoint subsets of the set of states. Not all subsets are cycles.

The number of cycles is at most 2N where N is the number of states. The largest possible cycle is of length N.

One elegant and stupid way to tackle the problem is to list all the subsets and check each to see whether it is a cycle. Don't try this.

The Simple Algorithm

More sensible is to look at each state and see if the transition function generates a cycle. One need only look at N +1 transitions, and if the system is not in the starting state by then it has become trapped in a cycle that does not contain the starting state.

The following python code applies the transition rule for at most N+1 hops from a given starting position and returns every state in the cycle if the starting point is in a cycle and None if the starting point is not in a cycle. It could take a long time (N**2) if the number of states is large, but in any practical case if you are writing a state machine with say a million states you are probably insane (“It never did THAT before”).

There is always a faster algorithm, and I may find it later. Starting at the given state the states traversed are stored in a list. If the system returns to the start state the list is returned. If not the value None is returned.

The simplest algorithm just uses this to collect all cycles and then eliminate duplicates.

def findcycle(self, start, data):
        currentplace = start
     thecycle = [start]
        
     for times in range(1,len(data)):
         currentplace= data[currentplace]
         if currentplace== start:
         return thecycle
         thecycle.append(currentplace)
     return None


Threaded Python code

The simple algorithm can perhaps be speeded up by assigning each potential start point to a separate thread. Threading is painful and Python threading is especially unpleasant but by setting up the problem as a producer-consumer architecture the problems of race conditions and deadlocks can be avoided. The following code was developed from a tutorial example byBeazly,


The code here was written to look like a Map-Reduce problem, so solving it with Hadoop will speed up execution.  

Start with the producer.

Detector is an instance of a class cycledetector that has the findcycle method above . This method checks each state and, if the state is the starting state of a cycle ( a cycle can start at any of its states) a list representing the cycle is placed in the output queue. This is almost identical to a Hadoop map() function in a multiprocessor environment, the difference being that a key is neither needed nor generated and, of course that it runs on a single machine.

def producer(sequence, output_q):
    detector = cycledetector(sequence)
    for item in sequence:
       thecycle = detector.findcycle(item, sequence)
     if thecycle:
        output_q.put(thecycle)
     return None

The consumer takes a cycle and compares it with all cycles found so far. If it is a new cycle it is added to the list of cycles found so far. Again the codes should be obvious. It uses the fact that no two cycles can share a common state. It is like the reduce() function in the Hadoop Map-Reduce paradigm except for the absence of a key and the fact it is running on a single processor. The results are sent to standard IO but it is a minor matter to store the list where it is available to the main execution thread.

def consumer(input_q,sequence):
    distinctcycles=[]
    while True:
        thecycle = input_q.get()
        # check the cycle against all cycles found so far. 
        #If it is in a cycle
        # then continue. 
        # If the first state of the cycle is in an existing cycle
        # the cycle is not new.
        isnew=True
              for acycle in distinctcycles:
              if thecycle[0] in acycle:
                   isnew=False
                   break
     # add a new cycle to the list of cycles found
     if isnew:
        distinctcycles.append(thecycle)
     print thecycle
     # signal that the task is finished
input_q.task_done()


Now the main code. Again the code should be self explanatory.

from multiprocessing import Process,JoinableQueue
from threadedcycledetector import threadedcycledetector

     if __name__ == '__main__':
        detector = threadedcycledetector();
        sequence = range(100)
        sequence = [2,3,1,1,3,5,7,6]
        # start the consumer as a demon thread running for ever
        thequeue = JoinableQueue()
        cons_p = Process(target=consumer, args=(thequeue,sequence))
        cons_p.daemon = True
        cons_p.start()

       # start the producer
        producer(sequence,thequeue)
       # wait for both threads to finish
        thequeue.join()
        print "finished"

Summary

It felt good to have solved a problem that seemed to have no practical use. It is an interesting problem. It is a good example of a Deterministic Finite Automaton, it illustrates Minsky's theorem and using Python's multiprocessing queues it can be formulated as a Map-Reduce algorithm indicating that if used in a practical case, should one exist, execution can be speeded up using Hadoop or a similar framework, for example MongoDB Map-Reduce - which would allow solution in C# as well as Java and a number of other languages. It is not clear whether Python is the best language for solving this problem- Erlang may be a better choice as it is designed to handle concurrency.  

Sunday, 10 February 2013

An unreversed list of ways to reverse a List.



Reversing a list is simple. But the difference between good and average developers is, the point at which you stop asking if there is another way to do things.

There is some interest in list reversal on the internet, at least in the context of Java, and this prompted a decision to play the geek for a while.

The possibilities include in place and not in place reversal, reversal via Stacks, Queues and Threads, permutation based reversals and remembering Parallel Data Transforms, a useful subgroup of permutations of 2**n elements also came into play. Note that in all cases if the list is shared between threads the reversal method should be synchronized


Recursive and non-recursive reversal

Start off by letting startpoint be the start of the list, and endpoint be the end of the list.

  1. Swap the two ends of the list,
  2. Increment startpoint and decrement endpoint
  3. repeat from 1 while startpoint is less than endpoint.

This is an in-place reversal and needs minimal memory. The non recursive and recursive reversals are almost identical. Both take about the same time to run. There are several ways to implement this in Java: which you use is a matter of taste.

Stack Based Reversal

A Stack is a last in first out (LIFO) Collection.

Read the list into the stack
Read everything back from the stack into the list.

This is not an in-place reversal so it takes more memory. It was a lot less efficient and the execution time was very variable. This may mean that the time taken was long enough that the probability of a garbage collection sweep became high. It is elegant but not very efficient.

Producer Consumer Threads

Start a thread that reads the list into a Queue ( a First in First out Collection) starting from the back. Let another thread use the same Queue and pop elements from the Queue into the list. This let the author explore the Producer Consumer pattern in Java, something they had already done on Python to find cycles in the behaviour of an automaton.

Reversal as a permutation.

The next step is to treat reversal as a permutation, which it is, and led to thinking about Parallel Data Transforms.

To apply a permutation just specify where each element of the list is to go. This cannot be done in place. We can then write a permutation very easily for example

{1,2,3,4,5,6,7,0} is a cyclic shift of the list [0,1,2,3,4,5,6,7] and {7,6,5,4,3,2,1,0} is the reversal of the list. This is easy to implement.


Shuffle reversal
One useful permutation is a shuffle. As a concrete example consider a set of 8 elements

[0,1,2,3,4,5,6,7]

There are two ways to interleave the elements as in a perfect card shuffle

[0,4,1,5,2,6,3,7]
and
{4,0,5,1,6,2,7,3]

The respective permutations are {0,2,4,6,1,3,5,7} and {1,3,5,7,0,2,4,6};

The second one is of interest here since all elements change position.

After three applications of this permutation to the original set the order of the elements is reversed. Which brings us to parallel data transforms.

Parallel Data Transforms

Stuart Reddaway originally developed Parallel Data transforms for the ICL Distributed Array Processor, a Single Instruction Multiple Data computer comprising a 64 by 64 grid of single bit processors with nearest neighbour connections. Reddaway deserves great respect not only for devising them but retaining his sanity while doing so and finding efficient ways to implement these transforms. As a side note the DAP was eventually purchased by a US company and then hailed in the British Press as an American Invention.

The indices of an array of 2N elements may be treated as N-bit unsigned integers and a list of the indices represents a permutation as defined above (if there are no duplications and the integers are all in range). The subgroup of the permutations of the elements we get this way is very useful. The first shuffle above corresponds to a cyclic shift of the bits of each index, and the second to flipping the most significant bit of the index then cyclically shifting the bits. Reversal simply means flipping all bits of each index. The operations on the N bits of the indices yield N! Permutations without bit flipping and when bit flipping is allowed the total number of permutations rises to N!log2N.

Obviously if we keep flipping the most significant bit of an N-digit number and cyclically shifting the result then N repetitions result in a reversal. On a serial machine this takes N Log2N operations but if there is an efficient hardware unit available to carry out the shuffles in negligible time, for a fixed array of 2N elements then reversal will take Log2N parallel operations. Of course there is limited demand for reversal but these shuffles have been used as the basis for parallel sorting algorithms and, as I recall, for parallel Fast Fourier Transform algorithms.

The important point to note, especially for Architects, is that use of specialised hardware may radically change the characteristics of algorithms which in turn will affect the design of the system.

The Wrap

Various ways to reversing a List have been outlines. They sidestep problems with reversing Java's Linked Lists and start with simple reversal and proceed to more elegant methods.
By creating dedicated hardware to perform shuffles the complexity of the task can be reduced to log2N.

The take home points for architects are that they need to experiment with different ways of carrying out a task “Is there another way” and that dedicated hardware may radically alter the complexity of a task. Dedicated hardware can also reduce total lifecycle cost by eliminating maintenance for the tasks carried out by the hardware, once the hardware has been thoroughly debugged.


You can see code samples at this location



Tuesday, 15 January 2013

Java Object Serialization: Risks and benefits


Serialization allows writing a Java object to persistent storage and reading it back when needed. This is a useful technique for saving time and space but has dangers in terms of protecting sensitive data,, transferring objects between applications and the stale code bug pattern.



What is serialization and why do it

Serialization lets you write a java object as a series of bytes either to persistent storage or across a network. Enterprise Java beans (EJB) transparently serialize objects in order to transfer them between a server and a client.

Some automatic code generation Frameworks may not produce serializable code and may thus be unsuitable for applications with serialization needs or for use with EJB

Sometimes an object can take a long time to create but rarely needs to be created. In this case serialization can save time, memory and bandwidth.
If all the objects fields are serializable serializing and deserializing an object is one way to create a deep clone of the object.

Essentially, in Java, the object is written to an object output stream and read from an object input stream. 

Confidential data

Serialized objects may contain sensitive data, for example credit card numbers. The risks may be minimised by marking sensitive variables as transient, for transient fields are not included in the serialization.

Serialization may include any fields that are also objects hence these should be examined for sensitive data. Fields that are not serializable will result in a NonSerializableException.

Since a serialized object is merely a string of bytes the string should be encrypted before sending it over an unprotected transmission channel. If this is done obviously the risk in sending confidential data is minimised


The Stale Code Bug Pattern

The Stale Code Bug Pattern arises if the code you are running is not the code you think you are running. Often this arises when using the wrong version of a jar file, though this risk is minimised if the project is managed with Maven. This bug pattern can also happen with serialized objects.

Serialization produces a frozen snapshot of a Java object. If the functionality of the object's class changes but the object is not updated the behaviour of the reconstituted object read from storage is hard to predict and the problem is often hard to identify, leading to a lot of deleted expletives from developers who wonder why the debugger is taking them into unexpected territory till the penny drops. The other risk is wrongly identifying a bug as an instance of this pattern.

A more obvious problem can arise in development when there are multiple functionally identical copies of a class in different packages. The decision to standardise on one package is obvious and may be sensible but serialized objects created from the discarded versions of the class in different packages give rise to a ClassCastException on deserialization. This problem can be resolved by opening (a copy of) the serialized object in a text editor, locating the class name and changing it to the fully qualified name of the class to be used. Other metadata may have to be changed depending on the application.

The Wrap

Serialization is a powerful and useful technique that comes with risks and dangers, not all of which are apparent before starting development, though an architect should be aware of these and communicate them to developers who may not have used this technique before.

The design phase of a project should identify which classes and fields should be serialized and identify procedures for preventing the Stale Code Bug Pattern from causing problems in production systems, and tests to identify the subtler versions of this bug pattern. Where possible Maven should be used in order to specify all dependencies.

The author is a Data Scientist specialising in Big Data, with an emphasis on Hadoop and Map Reduce and  has an interest in Visualisation, Analytics and Architecture and  is always interested in hearing about potential collaboration.



Sunday, 13 January 2013

How to think about sorting data


It may be exaggerating to say sorting is what computers are for, but sorting data is important and some tasks, such as searching, can be done much faster on sorted data than on random data and if summing an array of floating point numbers problems with rounding error can be minimised by sorting the data in ascending order then doing the summation starting with the smallest numbers . number of programming languages provide support for sorting but a knowledge of the issues involved can be valuable in deciding the algorithm or even the programming language to use.

What does sorting involve?
Sorting means rearranging a set to conform to some ordering rule. Any non trivial set of data allows several possible orderings, For example a list of cities can be sorted alphabetically, by size, by distance or travel time from your current location, or by cost of living.

Sorting data needs a way to compare any two datas items with respect to the chosen ordering. If there is no way to compare two elements the data cannot be sorted. Regardless of the language used the first thing to do is specify the rules for comparing data. In Java one can define a comparison method that takes a pair of data items, taking one as the first and the other as the second and returns -1 if the first is less than the second, +1 if the first is greater than the second, and 0 otherwise then use the sort methods the language provides.

There are basically two kinds of sort, one where the data is presented in one go and another where the data may arrive at random intervals. If the data is already presented the sort may have to be done in place by swapping data items, or as an insertion, where a new sorted version of the data may be created: this is almost mandatory if the data arrives at random intervals. Duplicate values may be allowed or forbidden.

The bottom line is when faced with a sorting problem you may have to consider a number of issues. Having said that the author has found that most sorting problems involve in place sorting of data already present.
A few algorithms

The simplest and least efficient way to sort a collection of data is called the bubble sort.

In this case one just passes through the data swapping any two elements that are out of order. Eventually the set will be sorted but in the worst case sorting a set of N elements will involve N2 passes through the data. This is not good. A thousand elements will need a million passes and a million elements (a small dataset by commercial standards) will need a trillion passes.

Nevertheless the bubble sort has one attractive feature: It is an in-place sort which makes it a contender for memory limited environments with small data sets. for example in a mobile phone or personal organiser. There are more efficient in-place sorts but these may need far more code.

Divide and conquer algorithm roughly sort the data into smaller heaps, sort the small heaps and then combine the results. Such algorithms are much faster than a bubble sort but need more memory. On the other hand with a little care they can be used in a parallel processing environment.

A key concept with divide and conquer algorithms is PIVOTING. Pivoting means selecting a member of the set as a pivot and splitting the set into two parts, one of which contains all members less than or equal to the pivot, and the other contains the rest. The two parts can then be sorted and the results combined.

One algorithm, quicksort, uses divide and conquer recursively by pivoting the data on (for example) the first element, pivoting each subset in the same way till only two elements are left then recombining the results. This is easy to program with modern languages and much faster. Technically the sort time increases as N.Log2(n). Thus a million elements will not need a trillion passes, but only about 20 million. If each pass takes a microsecond ( a fairly long time on modern hardware) the time is reduced from 15 minutes to about 1.5 seconds.

Another type of sort is the insertion sort where each element is examined and inserted into the right place in a sorted data set. This requires a single pass through the data but determining where to place it can require log(N) tests of the sorted data set using a binary search algorithm. Another option might be to create a hash for each incoming datum and place the datun into an appropriate collection, known as a bucket which can be located using the hash. The main problem with this approach is ensuring an even distribution of data over the collection of buckets but the sort is then basically constant time.

The insertion sort is useful when data is streaming in from an external source, possibly at irregular times, provided the time to place the data in the sorted set is less than the minimum time between the arrival of one datum and another. When dealing with this type of problem a buffer may be needed to hold incoming data and a guarantee is needed that the average time between additions to the buffer is significantly greater than the time taken to insert the data in the sorted set.

An algorithm described in Dr Dobbs Journal a few years ago illustrates how a lateral approach may result in a novel solution. The problem was to sort an array of phone numbers and remove duplicates. The solution was to set up an array of single bits ( possible in some languages) of size equal to the largest phone number, considered as a number. then in a single pass through the array for each phone number P set bit P of the array to be true. This just needed two passes through the data and one pass through the bit set to generate the sorted number list. It would be straightforward to modify this to count any duplicate, for example for data visualisation.

Another option, illustrated for numbers here and using an SIMD parallel processing architecture is first to pivot on the most significant bit, regrouping the data in a single parallel operation, and then on the next most significant bit. This results in an algorithm that depends on the size in bits of each data item. It seems straightforward to extend this to strings or complex structures.

Practical matters

Divide and conquer sorts generally need a time proportional to N.log2(N). The time taken also depends on the time taken to move the data. When each datum is large. Or in a parallel architecture this can be a bottleneck and it is more efficient to create a list of addresses for each datum and then move the addresses around, if necessary rearranging the data once the sort is complete. The saving can be considerable.

When faced with a sorting need ask whether an acceptable solution already exists. Java and Javascript for example support sorting collections in place given a valid comparison function. It is reasonable to assume that the language developers have had time to do a better job than you can (don't be arrogant here ), and you have better uses for your time.

Other points to consider include how the data is obtained? Is it an array or a stream? how large is each datum? If the data are floating point numbers is rounding error likely to be a problem? If the dataset is very large can it be processed in parallel (perhaps via a pipeline) and the sorted data recombined?

For most people the information given above is likely to remain academic but it is something of which one needs to be aware. Hopefully this note has sharpened awareness of issues involved in sorting and will prevent a few bugs at levels from architecture to code.