Saturday, 24 August 2013

Simple Particle Motion Simulation using D3

D3 is a good library for data visualisation. I wanted to visualise an ideal particle in a rectangular 2 dimensional box bouncing off the walls then progress to including gravity in the solution. The mathematics for such a particle is simple. In the process I learned a few things about D3, for example use of map attribute to initialise data and the use of the timer function.

A particle moving in an otherwise empty two dimensional box has a velocity v = (vx,vy). The walls are taken to be parallel to the x and y axes. When the particle hits a wall parallel to the x axis vy → - vy.
When it hits a wall parallel to the y axis vx → - vx .

Clearly such a particle will bounce off the walls forever. In a package like Director this was easy to implement but proved a little tricky in D3

The effect of gravity is included by noting that for the y velocity

dv = g.dt and setting dt = 1. 

Computational Aspects

A real particle moves continuously. In the simulation the virtual particle moves in discrete steps and may never “hit” the virtual wall of the box. Suppose it moves vx pixels along the x axis per time step then if at the end of a step it is vx/2 pixels from the wall it will go through the wall unless told to stop.

So the particle must be told to reverse its motion when it is about to cross a wall. This risks the virtual particle bouncing before it touches the wall, which is unconvincing, though it can be prevented. Here the problem of non rectangular boxes is ignored.

Some code details

The first step is to initialise the data.

// Map the range to an array of data of the same size
// For a large number of balls produces seemingly random motion instantly

var numballs=<your chosen value>;

var acceleration = <your chosen value>;

var data = d3.range(numballs).map(
// Assign random x and y velocities to each particle
var vx = randomVelocity(1);
var vy = randomVelocity(1);

// make some velocity components negative
if(Math.random()<0.5) vx = -vx;
if(Math.random()<0.5) vy= -vy;

// Assign random positions to the particles.
var itsx = itsx = 10*Math.random() -5;
var itsy = itsy =10*Math.random()-5;
return {xloc: itsx, yloc: itsy, xvel: vx, yvel: vy};

Scaling the data

var x = d3.scale.linear().domain([-5, 5]).range([0, width]);
var y = d3.scale.linear().domain([-5, 5])range([0, height]);

This puts the origin at the centre of the screen.

The Timer Function
The key to this exercise was the timer function which simple repeats what it is told to do for ever.


data.forEach( function(d) { update(d);} );

// Move each circle; define a coordinate transform
// translate circle by scaled coordinates
circle.attr("transform", function(d)
{ return "translate(" + x(d.xloc) + "," + y(d.yloc) + ")"; }
// just for fun shrink circle radius near the origin
.attr("r", function(d)
{return Math.sqrt(d.xloc*d.xloc +d.yloc*d.yloc)} ) ;

Updating coordinates

function update(d)

// record old positions, calculate new positions, update positions.
var oldx = d.xloc;
var newx = oldx + d.xvel;
d.xloc= newx;

var oldy = d.yloc
var newy= d.yloc + d.yvel;
d.yloc = newy;

// The upper and lower limits were established empirically-
// to convincing right bounce action
var lowerlimit = -5.1;
var upperlimit = 4.7;

// iscrossing returns true if the ball is about to hit a wall.
xcrossing = iscrossing(oldx,newx,upperlimit,lowerlimit);
ycrossing= iscrossing(oldy,newy,upperlimit,lowerlimit);
// reverse appropriate velocities if about to hit a wall.
if(ycrossing )
d.yvel = -d.yvel;
d.yloc += d.yvel;

if(xcrossing )
d.xvel = -d.xvel;
d.xloc += d.xvel;

Here is the function that decides if the particle is about to hit a wall.
Basically if it is about to cross a wall reverse the velocity component.

function iscrossing(old, new, upperlimit,lowerlimit)
var crossing = old< upperlimit && new > upperlimit;
crossing = crossing || old > upperlimit && new < upperlimit;
crossing = crossing || old < lowerlimit && new>= lowerlimit;
crossing = crossing || old > lowerlimit && new<=lowerlimit;
return crossing;

Including Gravity

The effect of gravity was included by updating the y velocity immediately before updating the position. Failure to do this resulted in unrealistic behaviour.

var oldy = d.yloc
// The velocity update needs to be applied before not after updating position
d.yvel += acceleration;
var newy= d.yloc + d.yvel;
d.yloc = newy;

The initial value of the acceleration was set at 0.01.


After some experimentation with the parameters I managed to get realistic behaviour. For an initially localised swarm of particles the swarm gradually diverged and appeared eventually to reach a state close to random motion, but with some sets of particles moving in unison.

The behaviour was improved, as would be expected, by assigning each particle a random initial velocity and position.

Introducing acceleration led to an apparent bunching of particles at “ground” level while high accelerations resulted in the swarm rising and falling together something like the motion of waves on the sea

The Wrap
The code snippets above were combined, together with standard boilerplate to create a virtual particle moving in a virtual box using d3. The best way to learn is to modify something that works and the code here was adapted from Bostock's SVG Swarm example 

The results so far show that an originally localised swarm would slowly tend to approximate particles in an ideal gas but assigning random initial positions resulted in almost instant approximation to an ideal gas.

They also showed that watching the swarm is hypnotic. 

Monday, 19 August 2013

Computing the mean and variance of a dataset from the mean and variance of two subsets.

The aspects of Big Data are that Volume, Velocity and Variety. Some types of data are adequately summarised by the mean and variance of the dataset. But if the volume is large enough computing these parameters could take too long, and if the data comes from multiple sources aggregating the data before computing the mean and variance could also exceed the computational deadline for the task. The mean and variance of multiple datasets can however be computed in parallel, possibly at different times, and the mean and variance of the total data set, or combinations of the component datasets can be obtained.

It is possible, given the means and variances of two datasets to calculate the mean and variance of the aggregate of the two sets. Here the method is shown for two datasets, but it is obvious how to extend it to multiple datasets. It is theoretically possible to compute this from the variances alone if the covariance of the datasets is known but this is not possible here.


We have two datasets, x of size N and y of size M   and and we know the mean and variance of each set. Using for the aggregate of the two sets, we want its mean and variance.

We know the average:

<x> = sum(x)/N


sum(x) = N<x>

And the estimate of the combined average is

(N<x> + M<y>)/(M+N)

The variance is a bit harder. The variance is given by

var(x) =  <x^2> - <x>^2


  <x^2>  = var(x) +  <x>^2

With analogous results for y

These results can be manipulated into a formula fo the combined variance if desired. It is easier however to use the last result and say

N<x^2> = sum(x^2)

and combine the sums for the two datasets. with the combined average

var(x+y) = N<x^2> + M<y^2>  + (N<x> +M<y>)/(M+N)

This Python code returns the combined mean and variance

# given the sample mean and variance of two datasets returns
# mean and variance of a dataset made by combining the two
# datasets
# may suffer from numerical problems in some circumstances.
def combine(meanx, varx, N, meany, vary, M):
totalsize = N + M;
combinedaverage = meanx*N + meany*M;
combinedaverage = combinedaverage/totalsize;
sumxsq= (varx + meanx**2);
sumxsq = sumxsq*N;
sumysq= vary + meany**2;
sumysq = sumysq*M;
combinedvariance= (sumxsq+ sumysq)/(totalsize) - combinedaverage**2;
return combinedaverage,combinedvariance;

The code was tested by taking two datasets and computing the mean and variance of the aggregated dataset directly and comparing it with the result returned by the method above. This is proof of concept code so issues of numerical stability and edge cases have not been addressed. Use at your own risk. 

If the size of the datasets used to compute the means and variances is not known this method cannot be used. Putting in random values showed that the result the method returns becomes highly inaccurate. 

The Wrap
A way of computing the mean and variance of a dataset given the mean and variance of two subsets of the dataset is presented. The extension to multiple subsets is obvious. If necessary datasets can be combined by recursive halving on a parallel architecture: lining up the pairs of means and variances and adding each even value to each odd value. This is repeated, each repetition halving the number of mean-variance pairs till only one is left. For large datasets this could represent a big speedup of the time needed to compute the mean and variance.

And it was fun  working this out

Sunday, 28 July 2013

Drawing a linegraph with D3

Current Specialties
I was first introduced to D3 late in 2011 and after some experimentation I decided it was great way to visualise data, and it had the added virtue of allowing multiple representations of the same data. It was too late for the project I was on, migrating compley text documents to the web, which was about to finish,but realised it could have simplified maintenance of tables and graphs in the migrated documents since the data owners could be left to maintain spreadsheets of their data and D3 would then present the data. Decoupling of business and technical functions seems line an increasingly good idea. One problem I had with D3 was that the examples on the web were mainly write only code and even tutorial application were harder to follow. I therefore spent a lot of time refactoring my code in an effort to ease maintenance. On the way I found a couple of issues that the code here adresses.

As always any code here comes with no warranty: It worked for me.

The Project

More recently a small web project required me to use D3 to display a history of Exchange rate data. I managed to get the example in
to do what I wanted but  eventually wanted a general purpose linegraph plotting routine. So once the project was essentially finished I returned to the web example and created something closer to what I wanted. In particular I wanted the routine to accept arbitrary x and y datasets, whereas the example above complicated matters by assuming the x values were simply the integers up to a maximum value.

The first step was to separate javascript and html. Following the rule that paranoia is never enough I created a simple function bridge() which initially only threw up an alert to say I had connected properly to the javascript library when the page was loaded. After that bridge was responsible for setting up the parameters of the file, creating data and calling the plot routine, in effect it was a test harness. In the real application bridge would be replaced by an AJAX method that got transformed data from a URL into a form D3 could use.

function bridge()
// create simple X and Y data arrays that we'll plot with a line
var datay = [3, 6, 2, 7, 5, 2, 0, 3, 8, 9, 2, 5, 9, 3, 6, 3, 6, 2, 7, 5, 2, 1, 3, 8, 9, 2, 5, 9, 2, 7,0];
var datax= d3.range(0,datay.length,1);
// define dimensions of graph
var m = [80, 80, 80, 80]; // margins of graph xleft, xright, ytop,ybottom
var w = 1000 - m[1] - m[3]; // width of graph
var h = 400 - m[0] - m[2]; // height of graph
var adiv= "#graph"; // #id of the div that holds the graph. The # is vital
drawgraph(datax,datay,m,w,h, adiv);

Now I could test the function drawgraph(). Drawing a graph involves

  • Creating/reserving an svg area inside the specified div
  • Scaling the data and drawing axes
  • Adding legends to the graph
  • drawing the trace.

Creating the Graph area

function creategraphbackground(adiv,aheight,awidth,margins)
// clear the div that will hold the graph. Without this it is not possible to show another
// graph in the same div or update the existing graph with new data.'');
// Add an SVG element with the desired dimensions and margin.
var svg ="svg");
var fullwidth= awidth + margins[1] + margins[3];
var fullheight = aheight+ margins[0] + margins[2];
// Data will be show in this area
var graph = svg
.attr("width",fullwidth )
.attr("height", fullheight)
.append("svg:g") // append an svg group
.attr("transform", "translate(" + margins[3] + "," + margins[0] +")");

// Put a green rectangle into the svg area
.attr('width', awidth).attr('height', aheight).attr('x', 0).attr('y', 0)
.style('fill', 'lightgreen')
.attr('stroke', 'black')
// put legends on the X and y axes.
graph.append('text').text(' X ☞ ')
.attr('x', awidth/2)
.attr('y', aheight + margins[3]/2)
graph.append('text').text(' Y☝ ')
.attr('x', 20 -margins[0])
.attr('y', aheight - margins[3])
return graph;

At this point all that will be seen if the prgoram pauses is an empty graph with axes and legends
Note that the legends include unicode characters and this needs the line
<meta charset="utf-8">
in the <head> of the html file. I have not yet solved the general problem of positioning the x and y legends nicely.

Adding the axes
Drawing axes requires creating a d3 scale, which appears to be an array of scaled data values and using these when drawing the axes. Creating the scales required knowing the maximum and minimum values of the data in x and y. For this note using D3's max and min functions is enough but in the application one set of data seemed to throw these off and I had to write my own code to find these.

function addaxes(datax,datay,awidth, aheight, margins,graph)
ymin= d3.min(datay);
ymax= d3.max(datay);
// X scale will fit all values from datax[] within pixels 0-w
var x = d3.scale.linear().domain([0, datax.length]).range([0, awidth]);
// Y scale will fit values from ymin to ymax within pixels h-0
// (Note the inverted domain for the y-scale: bigger is up!)
var y = d3.scale.linear().domain([ymin, ymax]).range([aheight, 0]);
// create xAxis
var xAxis = d3.svg.axis().scale(x).tickSize(-aheight).tickSubdivide(true);
// Add the x-axis.
.attr("class", "x axis")
.attr("transform", "translate(0," + aheight + ")")
// create yAxis to left
var yAxisLeft = d3.svg.axis().scale(y).ticks(5).orient("left");
.attr("class", "y axis")
.attr("transform", "translate(-" + awidth-margins[0] + ",0)")
return { x : x, y :y}

There is still some “magic” here to internalise but the interesting point is the return statement which allows multiple results to be returned from the function.

Putting it together
The final result: so far

The final routine for drawing the graph uses the above methods, the code should now be fairly clear

function drawgraph(datax,datay, margins,awidth,aheight,adiv)
var graph = creategraphbackground(adiv,aheight,awidth,margins);
var result = addaxes(datax,datay,awidth, aheight,margins,graph);
var xscale = result.x;
var yscale = result.y;
// create a line function to convert dataset (merged datax and datay) into D3 forma
var line = d3.svg.line()
.x(function(d) { return xscale(d[0])}) // return the scaled X coordinate where we want to plot this datapoint})
.y(function(d) { return yscale(d[1]); });// return the scaled Y coordinate where we want to plot this datapoint })
// D3 requires the dataset to be a list of lists of form [x,y] and merge converts datax and datay into this form
var dataset= merge(datax,datay);
// Add the line by appending an svg:path element with the line function created above
// do this AFTER the axes above so that the line is above the tick-lines
graph.append("svg:path").attr("d", line(dataset));

and that is it

In brief
The code above differs from the original code in that
  • It takes x and y data arrays rather than making assumptions about the data
  • It includes legends on the axes
  • It clears the html div before plotting the graph allowing redrawing the graph in the same div

The interesting point for those not too familiar with javascript is the returning of multiple parameters from a function.

The code is better structured than the original but I would have taken far longer to develop it without such a good starting point. From here I can move onto polar, logarithmic and other plots

Sunday, 21 July 2013

Cleaning a dataset of UFO sightings using python

I found a file, in json format, containing details of UFO sightings on my disc in a folder used for MongoDB experiments. I later found the same dataset on Infochimps [1] but the one I had also holds latitude and longitude data for some sightings. I therefore decided to concentrate on this dataset and, as is normal, start by cleaning up the data. Since JSON is not the pleasantest format for humans I wanted to clean it and transform it into a TSV file. I chose to use Python for this since it allows much more compact code than Java. I hope someone finds this useful.

Unicode and JSON

Python supports jason conversion via the json library allowing conversion of the input to a python dictionary and in outline this is simple

import json;

pythondictionary = json.loads(jsonstring);

As an example this input line

{ "_id" : { "$oid" : "4eda7403a232d6a083b3abed" },
"description" : "descriptive text",
"duration" : "5-7 min.",
"geoloc" : [ 31.3382406, -94.729097 ],
"location" : "Lufkin, TX",
"reported_at" : "19970116",
"shape" : " delta",
"sighted_at" : "19930209" }

becomes the Python Dictionary

{u'description': u'descriptive text',
u'reported_at': u'19970116',
u'geoloc': [31.338240599999999,-94.729096999999996],
u'shape': u' delta',
u'location': u'Lufkin, TX',
u'duration': u'5-7 min.',
u'sighted_at': u'19930209',
u'_id': {u'$oid': u'4eda7403a232d6a083b3abed'}}

And extracting the data becomes trivial. Except for a few things like missing fields.

thefile = "path to input file";
content = open(thefile,'r',);

output = "path to output file";
results = open(output,"w",);

# write a header or go insane
results.write("description" + "\t" + "duration" + "\t" + "lat" + "\t" + "lon" + "\t" + "city" + "\t" + "state" + "\t" + "shape" + "\t" + "sighted" + "\t" + "reported" + "\n");

for line in content :
record = json.loads(line);
outputrecord= cleanRecord(record);
results.write(outputrecord + "\n");

the cleanRecord method does the hard work. It handles extracting the data and replacing html entities in the original with unicode characters.

The first or second line of the file MUST be a comment specifying the encoding and this one worked

# This Python file uses the following encoding: utf-8

the cleanRecord method needs a bit of refactoring ( a task for later) but looks like this

# cleans a single record from the file and converts it to TSV format
def cleanRecord(record):
description = str(record['description']);
duration = str(record["duration"]);
location = str(record['location']);
# separate city and state
city = location.split(",")[0]
state = location.split(",")[1];
reported = str(record['reported_at']);
shape = str(record['shape']);
sighted = str(record['sighted_at']);
# some records do not have a latitude and longitude. This results in a KeyError
# the obvious solution is to use Python's exception mechanism
geoloc = record['geoloc'];
except KeyError:
# use a nonsense value. Lat and long cannot, by definition be this high
lat = str(geoloc[0]);
lon = str(geoloc[1]);
# replace html entities in the description with unicode characters.
# these are not recognised by python or w3c
description = description.replace("&apos;", "\'");
description = description.replace("&quot;", "\"");
description = description.replace("&amp;", " and ");
# convert the rest
# unescape was gratefully taken from
description = unescape(description);

# if nothing prints out then we got all the entities.
if "&" in description: print description;
# If no shape reported use a query. Do the same for reporting dates
if shape == "": shape="?";
if reported == "": reported="?";
# generate the output record
outputrecord = description + "\t" + duration + "\t" + lat + "\t" + lon + "\t" + city + "\t" + state + "\t" + shape + "\t" + sighted + "\t" + reported;

return outputrecord;

Basic data statistics

The next stage was to see how dirty the data is. The results were

number of reports 61067
number of records with report date before sighting date: 979 (1.6%)
Number of records with no sighting date 254 (0.4%)
maximum gap between sighting and reporting 606 years mean gap = 1383 days
number of reports without geolocation: 23446 (38.3%)

The last statistic was lower than I expected. This suggests the data is clean enough to use. Obtaining these statistics revealed more problems with the data, for example records with no sighting date which raised a ValueError because the dataset compiler had used a value that would scream when processed.

Lessons (re)learned and next steps
Reading json is simple
Specify character encoding at start of the file as a comment.
Anything humans type will neither be clean nor complete
It takes longer to clean the data than to analyse it.
The neatness of say json can hide a lot of problems.
When data is missing use a sensible default value
Some hand editing may be unavoidable

The next step is to start looking at the descriptions, plot locations on a map (probably using R and shapefiles). I expect to spend a while finding decent open source software to handle natural language mining. This could be challenging since not all descriptions are in English.

One interesting experience was that while this dataset is only 75MB the TSV file would not display properly in Open Office. I still do not know why.

Sunday, 23 June 2013

Writing a single page Javascript web application using Ajax, CSS, Javascript and HTML5

Rough output. first stage: works but is ugly. 

I was recently asked to write a single page weather forecasting application using
My current specialties
Javascript, Ajax, HTML5 and CSS3 as a test for a contract position. Since I have been a backend Java developer for a long time this was mildly challenging. However it proved simpler than expected, though the research needed took longer than desired and the impression I had was that the client had deliberately introduced some problems into the project

This note describes the way the project was developed and includes code snippets but not a full application.
The tasks fell into the following phases

  • Making an ajax call to a URL
  • Parsing the xml returned
  • Writing the data into an HTML Page
Subsidiary tasks involved getting the CSS right and laying out the HTML page. I found the w3schools site and stack overflow helpful

Getting the data with Ajax.
To get the data it is necessary to create an xmlHTTPrequest object and use that to make an asynchronous request for the data.

Creating the request object

The code needed for this is simple, though as always Microsoft Internet Explorer causes problems. I note that clients are increasingly adopting the solution of not supporting old versions of Explorer. Here however it is easy to use the factory pattern to allow this.

function getRequestObject()
var xmlhttp;
if (window.XMLHttpRequest)
{// code for IE7+, Firefox, Chrome, Opera, Safari
xmlhttp=new XMLHttpRequest();
{// code for IE6, IE5
xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
return xmlhttp;

Sending the request

Sending the request is equally simple.

// create the request object, open the URL and send a request
var xmlhttp=getRequestObject();"GET",locationsURL,true);

This is an asynchronous request. This means the application has to wait till the request has been serviced. To do this one has to create a function on the request object. When the request has been serviced the returned data must be processed in this function.

if (xmlhttp.readyState==4 && xmlhttp.status==200)
// deal with response XML
var locations = xmlhttp.responseXML;
// parse the XML response
// The locations are of form
// <location code="xxx" name="yyy" country="zzz"/>
// this call returns an array of <weatherlocation> tags
// In this case there is only one but we must still treat it as an array
weatherlocations = locations.getElementsByTagName("weatherlocations")[0];
// this call returns all the <locations> tags. There are a lot.
displayLocation(weatherlocations,0, 'code');

// create specified number of buttons in a div on the page. The argument says add all buttons

This, at a high level is how to handle AJAX data.

Processing the XML data

Once the data is received it has to be processed and put into the HTML page. Putting it into the page just involved writing to the an element's InnerHTML

The call

var locations = xmlhttp.responseXML;

returns an XML object. This object looks like
<location code="xxx" name="yyy" country="zzz"/>

To simplify matters I created a method to return the value of an attribute of a tag in the array created by geElementsByTagName

// returns the named attribute of the specified tag in the array, starting from 0.
// example getAttribute(channels,1,'name'); returns the value of the name attribute for tag 1
function getAttribute(aTagArray,anIndex, anAttribute)
return aTagArray[anIndex].getAttribute(anAttribute);

This is fine but an XML tag may have a value as well as attributes in this case you need to navigate to the node and extract its value for example


If you choose to use the response or reponseText methods on the request writing the data to an element is not so easy.
The code below outlines how this can be done

var response = xmlhttp.response;
var xmlTextNode = document.createTextNode(response);
var parentDiv = document.getElementById('myDiv');

Generating HTML
Development was simplified and made more flexible by generating html. This involves creating a string as a template with place holders and replacing the placeholders with parameters or if it is short, creating the string and writing it to the page for example to create an image dynamically

var image = "<img src=\"" + name + "\"/>";

The Wrap

These notes are intended to get you started, not give you a complete application. Hopefully they achieved this goal and left enough for you to do to stimulate your creativity