Tuesday, 14 October 2014

Cyber Attacks for Beginners: Cross Site Request Forgery (CSRF)


What is it
Cross site request forgery is a special case of session hijacking. When you log on to a site it starts a session for you. If you navigate away without signing off you may still have a session on that site. If you then visit a malicious site, that site can send your browser code that impersonates you on the first site. For example, if you visit a bank and close the page without signing off, a malicious site you visit afterwards could send the bank a POST request to transfer money from your account to theirs without your knowledge. As I said in a previous post, an attacker could also make it look like you are committing some sort of crime and then report it to the police.

Such an attack is not easy. It requires you to navigate away from a site without signing off and to visit a malicious site before your session on the first site expires. It also requires getting past any confirmation pages the bank or other site puts up. This makes it a numbers game for the attacker: all they need is a couple of successful visits a day and they are in business, at no extra cost.

Why is it Dangerous
This attack becomes very dangerous when used with Javascript and AJAX, which let the attacker send asynchronous POST requests without your knowledge. When combined with Cross Site Scripting it risks your machine being turned into a zombie controlled by the attacker.

Signs of vulnerability include accepting HTTP requests from an authenticated user without any control to verify that the request is unique to the user's session, and very long session timeouts, which increase the chance that an attack is made while the session is still valid.
Defences
This section outlines some basic defences against CSRF. A full set of defences is given in (1) below.

One powerful defence is to send, on each request, a random secret shared with the server: something an attacker cannot access and cannot guess.

More precisely

A request from a Web application should include a hidden input parameter (token), with a common name such as "CSRFToken", whose value is a long random value generated by a cryptographically strong random number generator whenever a new session starts. An alternative is a secure hash, Base64 encoded for transmission. Since randomness and uniqueness must still go into the data that is hashed, this has little advantage over a secure random number generator, though local considerations might make it attractive.
The token should only be sent via POST requests, and server side actions that change state should respond only to POST requests (this is referred to as HTTP Method Scoping). More details (lots) in (1).
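The token scheme can be sketched in a few lines of Python; the function names and the in-memory session store here are illustrative, not from any particular framework:

```python
import hmac
import secrets

# Illustrative in-memory store; a real application would keep the token
# in its server-side session state.
_session_tokens = {}

def issue_token(session_id):
    # A long random value from a cryptographically strong RNG,
    # generated when the session starts.
    token = secrets.token_urlsafe(32)
    _session_tokens[session_id] = token
    return token

def check_token(session_id, submitted):
    # Reject any state-changing POST whose token does not match.
    expected = _session_tokens.get(session_id)
    if expected is None or submitted is None:
        return False
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(expected, submitted)
```

The issued token would be embedded as the hidden "CSRFToken" input and checked on every state-changing POST.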

Another method is Double Cookie Submission, where a random token is sent both as a request parameter and as a cookie. The server compares the two to check they are equal. By default the attacker cannot read any data sent by the server or modify cookie values: this is a consequence of the same origin policy, which takes some skill and effort to disable in the browser (just don't do it, right?).
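The server side of double cookie submission is a one-line comparison; a minimal sketch (function names are illustrative):

```python
import hmac
import secrets

def new_double_submit_token():
    # The same random value is sent to the browser twice:
    # once as a cookie and once as a value for a hidden request parameter.
    return secrets.token_urlsafe(32)

def is_request_valid(cookie_value, param_value):
    # A forger can make the browser send the cookie, but cannot read it
    # to copy it into the request parameter, so the two values will differ.
    if cookie_value is None or param_value is None:
        return False
    return hmac.compare_digest(cookie_value, param_value)
```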

Note that any cross-site scripting (XSS) vulnerabilities (2) can be used to defeat the defences above, but there are defences XSS cannot evade, such as Captchas, reauthentication and one-off passwords such as those generated by RSA tokens.

Do it Yourself protection
  1. Log off immediately after using a web application (especially online banking)
  2. Do not let your browser store user names and passwords (though the risk is less if the password is stored encrypted)
  3. Using the same browser for sensitive applications and general surfing is a bad idea and leaves you dependent on the security of the site you are visiting. This may be fairly safe for online banking, but not for watching porn.
  4. Use a plugin that disables JavaScript wherever possible so an attacker cannot submit an attack unless they persuade you to submit a form manually.
  5. The above recommendations come from (1). In addition I suggest using an Incognito window, if your browser allows it, so passwords etc do not hang around in your cache.


The Wrap
There is no sure defence against any attack, since attackers and defenders both evolve their techniques and every new advance in technology brings new weaknesses and strengths. The defences outlined here may, however, change the economics of an attack.

References

  1. https://www.owasp.org/index.php/Cross-Site_Request_Forgery_(CSRF) Attacks, Defences and how to review code with CSRF in mind

Thursday, 9 October 2014

Cyber Attacks for Beginners: Cross Site Scripting
Keep your eyes open when developing applications

What is it
Cross Site Scripting (XSS) occurs when a website sends untrusted malicious data, for example HTML or Javascript, to a browser that then runs the code. A typical point of attack is where user input is reflected back to the user, for example when you input your name and the next page says “Welcome ” followed by your name. (Examples are given in the links at the bottom of this article.)

There are three types of cross site scripting

  1. Reflected XSS, where malicious data is embedded in the page that is returned to the browser immediately following the request. One example is where an attacker tricks a victim into loading a URL containing malicious code into their browser. This is sent to a legitimate server and a response containing the malicious code is returned to the victim's browser, where it is executed, perhaps sending the victim to the attacker's site where an effort may be made to rob them.
  2. Stored XSS, where malicious script an attacker previously managed to get stored on a server is sent to all users at some later time.
  3. DOM based XSS, where malicious code is injected into the page's DOM.

Reflected and persistent XSS assume that the payload moves from the browser to the server and back: if it goes back to the same browser it is reflected, if it goes to different browsers it is stored. DOM based XSS does not have this limitation.

Why is it Dangerous
XSS is dangerous because the code injected into the browser could do almost anything. The response could contain malicious code invisible to the user (as in DOM based Cross Site Scripting) that is then executed. The possibilities are endless. The page could display an image containing some form of malware that is triggered from the page. The code could redirect the user to a site that downloads malware, or it could send details of what the user does to an attacker, whether in government or private crime. Or it could frame the user, making it look like they were committing a crime. Fortunately almost all attackers are only in it for the money, which simplifies the defender's job immensely.

DOM Based Cross site Scripting
When a browser executes Javascript it makes a number of Javascript objects available to the code. These represent the Document Object Model (DOM), the browser's view of the page. The DOM is populated according to the browser's understanding of the page: for example document.URL and document.location are populated with the URL of the page, as the browser understands it, and are invisible to the user. Sometimes the page holds Javascript that parses document.URL and decides on an action, for example writing the value of a parameter to the page.
The danger arises when the original request sends a parameter value that contains malicious code. If the malicious code is hidden behind a # (known technically as a fragment identifier), the code after the # may not even be sent to the server, so server side checks never see it.

Defences
This section outlines some basic defences against XSS. A full set of defences is given in the cheat sheets below. Mostly they rely on escaping and encoding, which is best handled by a trusted third party library.

Some browsers provide some protection against cross site scripting, for example by encoding special characters that Javascript uses, such as “<” and “>”, into safe forms such as “%3C” and “%3E”, but these protections can be evaded by an attack that does not need the raw forms of these characters. Encoding provides a useful layer of defence, and there are third party libraries that provide this function, but it is not a silver bullet.
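Output encoding of this kind is easy to apply server side; as an illustration, Python's standard library provides HTML output encoding that turns the dangerous characters into inert entities:

```python
from html import escape

# A classic reflected-XSS payload, rendered harmless by output encoding.
user_input = '<script>alert("pwned")</script>'
safe = escape(user_input, quote=True)
print(safe)
# &lt;script&gt;alert(&quot;pwned&quot;)&lt;/script&gt;
```

The encoded string displays as text in the page instead of executing as script.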
Apart from encoding all input and output (the encoding needed differs according to the context), two techniques are useful: blacklisting, where the request is rejected if it contains dangerous characters, and whitelisting, where the request is rejected unless it contains only safe characters.

Generally speaking both blacklisting and whitelisting should be used. For example a request containing a name can be rejected if it contains “<” and accepted if it contains only alphanumeric characters. Of course whitelisting gets more complicated for languages like Chinese, Hebrew or Arabic, and can be vetoed by budget conscious managers, but the principles remain valid.

In practice white and blacklisting tend to rely on regular expressions, and these need to be tested thoroughly before a product is released.
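A minimal sketch of the two techniques in Python; the exact character sets are illustrative, and a real application would choose them per field and per context:

```python
import re

# Illustrative character sets: tune them per field and per context.
BLACKLIST = re.compile(r'[<>"\'&]')       # reject if any of these appear
WHITELIST = re.compile(r'[A-Za-z0-9 ]+')  # accept only these characters

def is_acceptable(value):
    # Apply both checks, as the text recommends: blacklist first,
    # then require the whole value to match the whitelist.
    if BLACKLIST.search(value):
        return False
    return WHITELIST.fullmatch(value) is not None
```

For example, is_acceptable("John Smith") is True while is_acceptable("&lt;script&gt;") is False; the regular expressions themselves are exactly the kind of thing that needs thorough testing before release.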

In brief, an effective defence against Stored and Reflected Cross Site Scripting is server side data validation. Both types can be detected by manual fault injection, e.g. typing an alert scriptlet into a field. An effective defence against DOM based XSS is client side validation of all DOM objects as they are used, or changing server side logic to avoid using DOM properties. Since client pages are often server supplied pages it is again the responsibility of the server to protect the user.

Another, perhaps weaker, defence is to include an anti-XSS header in the HTTP response supplied by the server. This can be done either programmatically, in a servlet filter, or in the web server configuration. Which to use is often a matter of taste: filters still need to be configured in the server or application configuration files, but if you have a number of filters already the extra cost is marginal. This option is not supported by all browsers, and in particular not by old browsers.

DOM based XSS is a bit trickier to handle. Defences include
  • Avoiding parsing and manipulating DOM objects
  • Sanitising and handling references to DOM objects carefully.
The Wrap

The constant arms race between attackers and defenders means there is no sure defence against XSS, or any other attack. The defences here and in the references may however change the economics of an attack so that it is not worthwhile for the attacker who may decide other tactics, like setting up a “legitimate” server that provides a service with a bit of theft on the side, would be easier and more profitable. This seems to be why we have phishing sites.

If you want to try out XSS attacks it is best to do so on a site you own. Doing so on a site you do not own could lead to a knock on the door at 6am. This would tend to ruin your day.

References

http://excess-xss.com/ A clear explanation of how XSS works

http://www.webappsec.org/projects/articles/071105.shtml DOM Based XSS or XSS of the Third Kind

https://www.owasp.org/index.php/DOM_based_XSS_Prevention_Cheat_Sheet DOM Based Cross Site Scripting Prevention Cheat Sheet

http://digitalsteampunk.blogspot.co.uk/2014/10/cyber-attacks-for-beginners-cross-site.html Beginners guide to Cross Site Scripting (XSS)

Saturday, 24 August 2013

Simple Particle Motion Simulation using D3



D3 is a good library for data visualisation. I wanted to visualise an ideal particle bouncing off the walls of a rectangular two dimensional box, then progress to including gravity in the solution. The mathematics for such a particle is simple. In the process I learned a few things about D3, for example the use of map to initialise data and the use of the timer function.
Theory


A particle moving in an otherwise empty two dimensional box has a velocity v = (vx,vy). The walls are taken to be parallel to the x and y axes. When the particle hits a wall parallel to the x axis vy → - vy.
When it hits a wall parallel to the y axis vx → - vx .


Clearly such a particle will bounce off the walls forever. In a package like Director this was easy to implement, but it proved a little tricky in D3.

The effect of gravity is included by noting that for the y velocity

dv = g.dt and setting dt = 1. 


Computational Aspects


A real particle moves continuously. In the simulation the virtual particle moves in discrete steps and may never exactly “hit” the virtual wall of the box. Suppose it moves vx pixels along the x axis per time step; then if at the end of a step it is, say, vx/2 pixels from the wall it will pass through the wall on the next step unless told to stop.


So the particle must be told to reverse its motion when it is about to cross a wall. This risks the virtual particle bouncing before it touches the wall, which is unconvincing, though it can be prevented. Here the problem of non rectangular boxes is ignored.
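The step-and-reverse logic is language independent; here is a minimal sketch of it in Python for a single axis (the names and wall positions are illustrative):

```python
def step(pos, vel, lower=-5.0, upper=5.0):
    """Advance one time step along one axis, reversing the velocity
    component if the particle is about to cross a wall."""
    nxt = pos + vel
    if nxt > upper or nxt < lower:
        vel = -vel        # bounce
        nxt = pos + vel   # take the step in the reversed direction instead
    return nxt, vel

# A particle 0.1 from the right-hand wall, moving right at 0.3, bounces.
pos, vel = step(4.9, 0.3)   # pos is now (about) 4.6, vel is -0.3
```

Note that this reverses the particle just before it reaches the wall, which is the source of the slightly unconvincing early bounce mentioned above.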


Some code details


The first step is to initialise the data.


// Map the range to an array of data of the same size
// For a large number of balls produces seemingly random motion instantly


var numballs=<your chosen value>;

var acceleration = <your chosen value>;


var data = d3.range(numballs).map(
function(d,i)
{
// Assign random x and y velocities to each particle
var vx = randomVelocity(1);
var vy = randomVelocity(1);


// make some velocity components negative
if(Math.random()<0.5) vx = -vx;
if(Math.random()<0.5) vy= -vy;


// Assign random positions to the particles.
var itsx = 10*Math.random() - 5;
var itsy = 10*Math.random() - 5;
return {xloc: itsx, yloc: itsy, xvel: vx, yvel: vy};
}
);


Scaling the data

var x = d3.scale.linear().domain([-5, 5]).range([0, width]);
var y = d3.scale.linear().domain([-5, 5]).range([0, height]);


This puts the origin at the centre of the screen.
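Under the hood a linear scale is just an affine map from the domain onto the range; a sketch of the equivalent calculation in Python:

```python
def linear_scale(domain, rng):
    # Map v linearly so that domain[0] -> rng[0] and domain[1] -> rng[1].
    d0, d1 = domain
    r0, r1 = rng
    return lambda v: r0 + (v - d0) * (r1 - r0) / (d1 - d0)

x = linear_scale((-5, 5), (0, 1000))  # assuming a width of 1000 pixels
x(0)   # 500.0: the origin lands at the centre of the drawing area
x(-5)  # 0.0: the left wall maps to the left edge
```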




The Timer Function
The key to this exercise was the timer function, which simply repeats what it is told to do for ever.


d3.timer(


function()
{
data.forEach( function(d) { update(d);} );


// Move each circle; define a coordinate transform
// translate circle by scaled coordinates
circle.attr("transform", function(d)
{ return "translate(" + x(d.xloc) + "," + y(d.yloc) + ")"; }
)
// just for fun shrink circle radius near the origin
.attr("r", function(d)
{return Math.sqrt(d.xloc*d.xloc +d.yloc*d.yloc)} ) ;
}
);


Updating coordinates


function update(d)
{


// record old positions, calculate new positions, update positions.
var oldx = d.xloc;
var newx = oldx + d.xvel;
d.xloc= newx;

var oldy = d.yloc
var newy= d.yloc + d.yvel;
d.yloc = newy;

// The upper and lower limits were established empirically
// to give convincing bounce action at the walls.
var lowerlimit = -5.1;
var upperlimit = 4.7;

// iscrossing returns true if the ball is about to hit a wall.
var xcrossing = iscrossing(oldx, newx, upperlimit, lowerlimit);
var ycrossing = iscrossing(oldy, newy, upperlimit, lowerlimit);
// reverse appropriate velocities if about to hit a wall.
if(ycrossing )
{
d.yvel = -d.yvel;
d.yloc += d.yvel;
}

if(xcrossing )
{
d.xvel = -d.xvel;
d.xloc += d.xvel;
}
}




Here is the function that decides if the particle is about to hit a wall.
Basically if it is about to cross a wall reverse the velocity component.


function iscrossing(oldpos, newpos, upperlimit, lowerlimit)
{
// note: a parameter cannot be called "new", which is a reserved word in Javascript
var crossing = oldpos < upperlimit && newpos > upperlimit;
crossing = crossing || oldpos > upperlimit && newpos < upperlimit;
crossing = crossing || oldpos < lowerlimit && newpos >= lowerlimit;
crossing = crossing || oldpos > lowerlimit && newpos <= lowerlimit;
return crossing;
}


Including Gravity


The effect of gravity was included by updating the y velocity immediately before updating the position. Failure to do this resulted in unrealistic behaviour.


var oldy = d.yloc
// The velocity update needs to be applied before not after updating position
d.yvel += acceleration;
var newy= d.yloc + d.yvel;
d.yloc = newy;


The initial value of the acceleration was set at 0.01.


Results


After some experimentation with the parameters I managed to get realistic behaviour. For an initially localised swarm of particles the swarm gradually diverged and appeared eventually to reach a state close to random motion, but with some sets of particles moving in unison.


The behaviour was improved, as would be expected, by assigning each particle a random initial velocity and position.


Introducing acceleration led to an apparent bunching of particles at “ground” level, while high accelerations resulted in the swarm rising and falling together, something like the motion of waves on the sea.


The Wrap
The code snippets above were combined with standard boilerplate to create a virtual particle moving in a virtual box using D3. The best way to learn is to modify something that works, and the code here was adapted from Bostock's SVG Swarm example.


The results so far show that an originally localised swarm would slowly tend to approximate particles in an ideal gas, while assigning random initial positions resulted in an almost instant approximation to an ideal gas.



They also showed that watching the swarm is hypnotic. 

Monday, 19 August 2013

Computing the mean and variance of a dataset from the mean and variance of two subsets.

The defining aspects of Big Data are Volume, Velocity and Variety. Some types of data are adequately summarised by the mean and variance of the dataset. But if the volume is large enough, computing these parameters could take too long; and if the data comes from multiple sources, aggregating the data before computing the mean and variance could also exceed the computational deadline for the task. The means and variances of multiple datasets can however be computed in parallel, possibly at different times, and the mean and variance of the total dataset, or of combinations of the component datasets, can be obtained from them.


It is possible, given the means and variances of two datasets, to calculate the mean and variance of the aggregate of the two sets. Here the method is shown for two datasets, but it is obvious how to extend it to multiple datasets. (It would in theory be possible to work from the variances alone if the covariance of the datasets were known, but that is not available here.)


Scenario

We have two datasets, x of size N and y of size M, and we know the mean and variance of each set. We want the mean and variance of the aggregate of the two sets.


We know the average:

<x> = sum(x)/N

Hence


sum(x) = N<x>

And the estimate of the combined average is

(N<x> + M<y>)/(M+N)

The variance is a bit harder. The variance is given by

var(x) =  <x^2> - <x>^2


Whence

  <x^2>  = var(x) +  <x>^2



With analogous results for y



These results can be manipulated into a formula for the combined variance if desired. It is easier, however, to use the last result and say

N<x^2> = sum(x^2)

and combine the sums for the two datasets with the combined average:

var(combined) = (N<x^2> + M<y^2>)/(M+N) - ((N<x> + M<y>)/(M+N))^2


This Python code returns the combined mean and variance


# Given the sample mean and variance of two datasets, returns the
# mean and variance of a dataset made by combining the two datasets.
# May suffer from numerical problems in some circumstances.
def combine(meanx, varx, N, meany, vary, M):
    totalsize = N + M
    combinedaverage = meanx*N + meany*M
    combinedaverage = combinedaverage/totalsize
    sumxsq = (varx + meanx**2)
    sumxsq = sumxsq*N
    sumysq = vary + meany**2
    sumysq = sumysq*M
    combinedvariance = (sumxsq + sumysq)/totalsize - combinedaverage**2
    return combinedaverage, combinedvariance

The code was tested by taking two datasets and computing the mean and variance of the aggregated dataset directly and comparing it with the result returned by the method above. This is proof of concept code so issues of numerical stability and edge cases have not been addressed. Use at your own risk. 
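A sketch of such a check in Python, using the statistics module (pvariance, the population variance, matches the N-in-the-denominator convention used above; the combine function is repeated so the sketch runs on its own):

```python
import statistics

def combine(meanx, varx, N, meany, vary, M):
    # Same formulas as above, repeated so this sketch is self-contained.
    total = N + M
    mean = (meanx * N + meany * M) / total
    var = (N * (varx + meanx**2) + M * (vary + meany**2)) / total - mean**2
    return mean, var

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 12.0, 19.0]

# Combine from summary statistics only...
mean, var = combine(statistics.mean(x), statistics.pvariance(x), len(x),
                    statistics.mean(y), statistics.pvariance(y), len(y))

# ...and compare with a direct computation over the aggregated data.
assert abs(mean - statistics.mean(x + y)) < 1e-9
assert abs(var - statistics.pvariance(x + y)) < 1e-9
```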

If the size of the datasets used to compute the means and variances is not known this method cannot be used. Putting in random values showed that the result the method returns becomes highly inaccurate. 

The Wrap
A way of computing the mean and variance of a dataset given the mean and variance of two subsets of the dataset is presented. The extension to multiple subsets is obvious. If necessary datasets can be combined by recursive halving on a parallel architecture: lining up the pairs of means and variances and adding each even value to each odd value. This is repeated, each repetition halving the number of mean-variance pairs till only one is left. For large datasets this could represent a big speedup of the time needed to compute the mean and variance.

And it was fun working this out.


Sunday, 28 July 2013

Drawing a linegraph with D3

I was first introduced to D3 late in 2011 and after some experimentation I decided it was a great way to visualise data, with the added virtue of allowing multiple representations of the same data. It was too late for the project I was on, migrating complex text documents to the web, which was about to finish, but I realised it could have simplified maintenance of tables and graphs in the migrated documents: the data owners could be left to maintain spreadsheets of their data and D3 would then present it. Decoupling of business and technical functions seems like an increasingly good idea.

One problem I had with D3 was that the examples on the web were mainly write-only code, and even tutorial applications were harder to follow than they needed to be. I therefore spent a lot of time refactoring my code in an effort to ease maintenance. On the way I found a couple of issues that the code here addresses.

As always any code here comes with no warranty: It worked for me.

The Project

More recently a small web project required me to use D3 to display a history of exchange rate data. I managed to get the example in http://benjchristensen.com/2012/05/02/line-graphs-using-d3-js/
to do what I wanted, but eventually wanted a general purpose linegraph plotting routine. So once the project was essentially finished I returned to the web example and created something closer to what I wanted. In particular I wanted the routine to accept arbitrary x and y datasets, whereas the example above complicated matters by assuming the x values were simply the integers up to a maximum value.

The first step was to separate the javascript and the html. Following the rule that paranoia is never enough, I created a simple function bridge() which initially only threw up an alert to say I had connected properly to the javascript library when the page was loaded. After that, bridge was responsible for setting up the parameters, creating data and calling the plot routine; in effect it was a test harness. In the real application bridge would be replaced by an AJAX method that transformed data fetched from a URL into a form D3 could use.

function bridge()
{
// create simple X and Y data arrays that we'll plot with a line
var datay = [3, 6, 2, 7, 5, 2, 0, 3, 8, 9, 2, 5, 9, 3, 6, 3, 6, 2, 7, 5, 2, 1, 3, 8, 9, 2, 5, 9, 2, 7,0];
var datax= d3.range(0,datay.length,1);
// define dimensions of graph
var m = [80, 80, 80, 80]; // margins of the graph: top, right, bottom, left
var w = 1000 - m[1] - m[3]; // width of graph
var h = 400 - m[0] - m[2]; // height of graph
var adiv= "#graph"; // #id of the div that holds the graph. The # is vital
drawgraph(datax,datay,m,w,h, adiv);
}

Now I could test the function drawgraph(). Drawing a graph involves

  • Creating/reserving an svg area inside the specified div
  • Scaling the data and drawing axes
  • Adding legends to the graph
  • drawing the trace.

Creating the Graph area

function creategraphbackground(adiv,aheight,awidth,margins)
{
// clear the div that will hold the graph. Without this it is not possible to show another
// graph in the same div or update the existing graph with new data.
d3.select(adiv).html('');
// Add an SVG element with the desired dimensions and margin.
var svg = d3.select(adiv).append("svg");
var fullwidth= awidth + margins[1] + margins[3];
var fullheight = aheight+ margins[0] + margins[2];
// Data will be shown in this area
var graph = svg
.attr("width",fullwidth )
.attr("height", fullheight)
.append("svg:g") // append an svg group
.attr("transform", "translate(" + margins[3] + "," + margins[0] +")");

// Put a green rectangle into the svg area
graph.append('rect')
.transition()
.duration(5)
.attr('width', awidth).attr('height', aheight).attr('x', 0).attr('y', 0)
.style('fill', 'lightgreen')
.attr('stroke', 'black');
// put legends on the X and y axes.
graph.append('text').text(' X ☞ ')
.attr('x', awidth/2)
.attr('y', aheight + margins[3]/2)
.attr("fill","blue")
.attr("font-size","20px");
graph.append('text').text(' Y☝ ')
.attr('x', 20 -margins[0])
.attr('y', aheight - margins[3])
.attr("fill","red")
.attr("font-size","20px");
return graph;
}

At this point all that will be seen if the program pauses is an empty graph with axes and legends.
Note that the legends include unicode characters and this needs the line
<meta charset="utf-8">
in the <head> of the html file. I have not yet solved the general problem of positioning the x and y legends nicely.

Adding the axes
Drawing axes requires creating a D3 scale, a function that maps data values to pixel positions, and using these scales when drawing the axes. Creating the scales requires knowing the maximum and minimum values of the data in x and y. Usually D3's max and min functions are enough, but in the application one set of data seemed to throw these off and I had to write my own code to find them.

function addaxes(datax,datay,awidth, aheight, margins,graph)
{
var ymin = d3.min(datay);
var ymax = d3.max(datay);
// X scale will fit all values from datax[] within pixels 0-w
var x = d3.scale.linear().domain([0, datax.length]).range([0, awidth]);
// Y scale will fit values from ymin to ymax within pixels h-0
// (Note the inverted domain for the y-scale: bigger is up!)
var y = d3.scale.linear().domain([ymin, ymax]).range([aheight, 0]);
// create xAxis
var xAxis = d3.svg.axis().scale(x).tickSize(-aheight).tickSubdivide(true);
// Add the x-axis.
graph.append("svg:g")
.attr("class", "x axis")
.attr("transform", "translate(0," + aheight + ")")
.call(xAxis);
// create yAxis to left
var yAxisLeft = d3.svg.axis().scale(y).ticks(5).orient("left");
graph.append("svg:g")
.attr("class", "y axis")
.attr("transform", "translate(0,0)") // left-oriented axis at the left edge of the plot area
.call(yAxisLeft);
return { x : x, y :y}
}

There is still some “magic” here to internalise but the interesting point is the return statement which allows multiple results to be returned from the function.

Putting it together
The final result: so far

The final routine for drawing the graph uses the above methods; the code should now be fairly clear.

function drawgraph(datax,datay, margins,awidth,aheight,adiv)
{
var graph = creategraphbackground(adiv,aheight,awidth,margins);
var result = addaxes(datax,datay,awidth, aheight,margins,graph);
var xscale = result.x;
var yscale = result.y;
// create a line function to convert the dataset (merged datax and datay) into D3 format
var line = d3.svg.line()
.x(function(d) { return xscale(d[0]); }) // the scaled X coordinate for this datapoint
.y(function(d) { return yscale(d[1]); }); // the scaled Y coordinate for this datapoint
// D3 requires the dataset to be a list of lists of form [x,y] and merge converts datax and datay into this form
var dataset= merge(datax,datay);
// Add the line by appending an svg:path element with the line function created above
// do this AFTER the axes above so that the line is above the tick-lines
graph.append("svg:path").attr("d", line(dataset));
}

and that is it
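The merge helper is not shown above; it simply pairs corresponding x and y values into [x, y] lists. What it has to produce can be sketched in Python (the behaviour is inferred from how the line function indexes d[0] and d[1]):

```python
def merge(datax, datay):
    # Pair each x with its y: [x0, x1, ...], [y0, y1, ...] -> [[x0, y0], [x1, y1], ...]
    return [[x, y] for x, y in zip(datax, datay)]

merge([0, 1, 2], [3, 6, 2])  # [[0, 3], [1, 6], [2, 2]]
```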

In brief
The code above differs from the original code in that
  • It takes x and y data arrays rather than making assumptions about the data
  • It includes legends on the axes
  • It clears the html div before plotting the graph allowing redrawing the graph in the same div

The interesting point for those not too familiar with javascript is the returning of multiple parameters from a function.


The code is better structured than the original, but I would have taken far longer to develop it without such a good starting point. From here I can move on to polar, logarithmic and other plots.

Sunday, 21 July 2013

Cleaning a dataset of UFO sightings using python


I found a file in JSON format containing details of UFO sightings on my disc, in a folder used for MongoDB experiments. I later found the same dataset on Infochimps [1], but the one I had also holds latitude and longitude data for some sightings. I therefore decided to concentrate on this dataset and, as is normal, start by cleaning up the data. Since JSON is not the pleasantest format for humans I wanted to clean it and transform it into a TSV file. I chose Python for this since it allows much more compact code than Java. I hope someone finds this useful.

Unicode and JSON

Python supports JSON conversion via the json library, allowing conversion of the input to a Python dictionary, and in outline this is simple:

import json

pythondictionary = json.loads(jsonstring)



As an example this input line

{ "_id" : { "$oid" : "4eda7403a232d6a083b3abed" },
"description" : "descriptive text",
"duration" : "5-7 min.",
"geoloc" : [ 31.3382406, -94.729097 ],
"location" : "Lufkin, TX",
"reported_at" : "19970116",
"shape" : " delta",
"sighted_at" : "19930209" }

becomes the Python Dictionary

{u'description': u'descriptive text',
u'reported_at': u'19970116',
u'geoloc': [31.338240599999999,-94.729096999999996],
u'shape': u' delta',
u'location': u'Lufkin, TX',
u'duration': u'5-7 min.',
u'sighted_at': u'19930209',
u'_id': {u'$oid': u'4eda7403a232d6a083b3abed'}}

And extracting the data becomes trivial, except for a few things like missing fields.

thefile = "path to input file"
content = open(thefile, 'r')

output = "path to output file"
results = open(output, "w")

# write a header or go insane
results.write("description" + "\t" + "duration" + "\t" + "lat" + "\t" + "lon" + "\t" + "city" + "\t" + "state" + "\t" + "shape" + "\t" + "sighted" + "\t" + "reported" + "\n")

for line in content:
    record = json.loads(line)
    outputrecord = cleanRecord(record)
    results.write(outputrecord + "\n")

The cleanRecord method does the hard work: it extracts the data and replaces HTML entities in the original with unicode characters.

The first or second line of the file MUST be a comment specifying the encoding, and this one worked:

# This Python file uses the following encoding: utf-8

The cleanRecord method needs a bit of refactoring (a task for later) but looks like this:

# cleans a single record from the file and converts it to TSV format
def cleanRecord(record):
    description = str(record['description'])
    duration = str(record["duration"])
    location = str(record['location'])
    # separate city and state
    city = location.split(",")[0]
    state = location.split(",")[1]
    reported = str(record['reported_at'])
    shape = str(record['shape'])
    sighted = str(record['sighted_at'])
    # some records do not have a latitude and longitude. This results in a KeyError;
    # the obvious solution is to use Python's exception mechanism
    try:
        geoloc = record['geoloc']
    except KeyError:
        # use a nonsense value. Lat and long cannot, by definition, be this high
        geoloc = [999.999, 999.999]
    lat = str(geoloc[0])
    lon = str(geoloc[1])
    # replace html entities in the description with unicode characters.
    # these are not recognised by python or w3c
    description = description.replace("&apos;", "\'")
    description = description.replace("&quot;", "\"")
    description = description.replace("&amp;", " and ")
    # convert the rest
    # unescape was gratefully taken from http://effbot.org/zone/re-sub.htm#unescape-html
    description = unescape(description)

    # if nothing prints out then we got all the entities.
    if "&" in description: print description
    # If no shape reported use a query. Do the same for reporting dates
    if shape == "": shape = "?"
    if reported == "": reported = "?"
    # generate the output record
    outputrecord = description + "\t" + duration + "\t" + lat + "\t" + lon + "\t" + city + "\t" + state + "\t" + shape + "\t" + sighted + "\t" + reported

    return outputrecord
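As an aside, on Python 3 the entity handling above, including the borrowed unescape, can be replaced by the standard library's html.unescape; a sketch:

```python
from html import unescape

# html.unescape handles &apos;, &quot;, &amp; and all other
# HTML5 named character references in one call.
raw = "A &quot;delta&quot; shape &amp; it wasn&apos;t a balloon"
print(unescape(raw))
# A "delta" shape & it wasn't a balloon
```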

Basic data statistics

The next stage was to see how dirty the data is. The results were


number of reports: 61067
number of records with report date before sighting date: 979 (1.6%)
number of records with no sighting date: 254 (0.4%)
maximum gap between sighting and reporting: 606 years; mean gap: 1383 days
number of reports without geolocation: 23446 (38.3%)

The last statistic was lower than I expected, which suggests the data is clean enough to use. Obtaining these statistics revealed more problems with the data, for example records with no sighting date, which raised a ValueError because the dataset compiler had used a value that would scream when processed.

Lessons (re)learned and next steps
  • Reading JSON is simple.
  • Specify the character encoding at the start of the file as a comment.
  • Anything humans type will be neither clean nor complete.
  • It takes longer to clean the data than to analyse it.
  • The neatness of, say, JSON can hide a lot of problems.
  • When data is missing use a sensible default value.
  • Some hand editing may be unavoidable.

The next step is to start looking at the descriptions, plot locations on a map (probably using R and shapefiles). I expect to spend a while finding decent open source software to handle natural language mining. This could be challenging since not all descriptions are in English.

One interesting experience was that while this dataset is only 75MB the TSV file would not display properly in Open Office. I still do not know why.