Thursday, 19 November 2015

K nearest neighbour classification in Pyspark

K-nearest neighbours (KNN) is a supervised classification technique that looks at the nearest neighbours, in a training set of classified instances, of an unclassified instance in order to identify the class to which it belongs; for example, it may be desired to determine the probable date and origin of a shard of pottery. There are several variations and refinements of KNN classification [1]. Some, perhaps most, of these are not needed in a parallel batch processing environment such as Hadoop or Spark, but may become relevant for the rapid processing portion of a lambda architecture.

The simplest KNN algorithm was implemented in Pyspark, the Python API to Spark, a map-reduce framework claimed to be some ten times faster than Hadoop when run from disk and 100 times faster when run in memory. The well known Iris dataset was used to verify the implementation. No effort was made to allow for the effects of skew in the training set. Before Pyspark coding began the algorithm was implemented in a conventional way based on [2], with some slight improvements to the code, for example using list comprehensions rather than loops. Doing this allowed a deeper understanding of the algorithm and gave a reference point for assessing the Pyspark implementation.

Both implementations were developed in Eclipse with a view to eventual command line use of the Pyspark implementation. Eclipse/Pydev was configured as in [3] with the log output sent to a file. Eclipse being Eclipse, occasionally the log output did not go to the file, but did on rerunning the programme.

The Pyspark code is shorter and more easily understood (once the basics of Pyspark are mastered) than the serial version. Please note that any code presented here is proof of concept code and is used at your own risk.

The Algorithm
A set of known instances, in this case Irises, is used to train the algorithm. Since not many people go round measuring Iris details for fun, the data set was randomly split into a training set and a test set, with roughly two thirds of the data used as the training set.

The algorithm classifies an item in the test set by computing the distance from the test item to each member of the training set, taking the K nearest neighbours and assigning the test item to the class most represented in the K nearest neighbours.

The steps in the algorithm are

  1. Load data (here data is loaded from a file)
  2. Split the data into test and training sets
  3. Compute the distance from a test instance (a member of the test set) to all members of the training set
  4. Select the K nearest neighbours of the test instance
  5. Assign the test instance to the class most represented in these nearest neighbours

Running the algorithm against every member of the test set and determining the percentage of correct classifications gives an estimate of the accuracy. Since the algorithm randomly splits the data into test and training sets, the result varies every time the test is run, but the accuracy was around 90%, with occasional extremes of 85% and 97%.
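Before looking at the Pyspark version, steps 3 to 5 above can be sketched as a plain (serial) Python reference implementation. The tiny in-memory training set below stands in for the Iris file (numeric fields first, class name last); the helper names are illustrative:

```python
from collections import Counter

def distance_abs(a, b, num_fields):
    # Step 3: sum of absolute component differences
    return sum(abs(x - y) for x, y in zip(a[:num_fields], b[:num_fields]))

def knn_classify(test_item, training_set, k, num_fields):
    # Step 4: the K training instances nearest to the test item
    neighbours = sorted(training_set,
                        key=lambda t: distance_abs(test_item, t, num_fields))[:k]
    # Step 5: the class most represented among those neighbours
    return Counter(t[-1] for t in neighbours).most_common(1)[0][0]

training = [(5.0, 3.4, 1.6, 0.4, "Iris-setosa"),
            (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
            (6.7, 3.1, 4.4, 1.4, "Iris-versicolor"),
            (6.0, 2.9, 4.5, 1.5, "Iris-versicolor")]
print(knn_classify((5.0, 3.3, 1.4, 0.2, "?"), training, k=3, num_fields=4))
# prints Iris-setosa
```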

Loading the data

Loading data from a file involves creating a spark context and using that to load the data. You can only have one spark context running at a time.

# Create spark context (needs: from pyspark import SparkContext, and import sys)
sc = SparkContext(appName="PysparkKnearestNeigbours")
# Read in lines from the file named as the first command line argument.
records = sc.textFile(sys.argv[1])

This creates an RDD (Resilient Distributed Dataset) holding all data records. For this exploratory exercise the first (header) row was removed. Since RDDs are not iterable, they must be transformed into lists using collect(), for example

recordlist = records.collect()

The number of nearest neighbours is similarly read from the command line

numNearestNeigbours = int(sys.argv[2])
numfieldsInDistance = int(sys.argv[3])

Splitting data into test and training sets

# Split data into test and training sets in ratio 1:2
testset,trainingset = records.randomSplit([1,2])

This does what it says, it splits the data randomly into test and training sets with about one third in the test set.
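For intuition, randomSplit() can be mimicked in plain Python: each record independently lands in a split with probability proportional to the corresponding weight, so the sizes are only approximately in the requested ratio. A sketch (the function name and seed are illustrative, not Spark's implementation):

```python
import random

def random_split(records, weights, seed=17):
    # Assign each record to a split with probability weight / sum(weights)
    rng = random.Random(seed)
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for r in records:
        x = rng.random()
        for i, b in enumerate(bounds):
            if x <= b:
                splits[i].append(r)
                break
    return splits

testset, trainingset = random_split(range(150), [1, 2])
print(len(testset), len(trainingset))
```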

The distance function

The distance function is an integral part of the algorithm. Various distance functions can be used, but here, instead of the standard Euclidean distance, the sum of the absolute values of the differences in each component was used in order to minimise numerical problems

d(x,y) = sum_i( |x_i - y_i| )
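A minimal version of this distance function, assuming each record is a semicolon-separated string with the numeric fields first (as in the Iris file used here):

```python
def distanceAbs(training, test, numfields):
    # Parse the leading numeric fields of each semicolon-separated record
    t1 = [float(f) for f in training.split(";")[:numfields]]
    t2 = [float(f) for f in test.split(";")[:numfields]]
    # d(x,y) = sum of absolute component differences
    return sum(abs(a - b) for a, b in zip(t1, t2))

print(distanceAbs("5;3.4;1.6;0.4;Iris-setosa", "5;3.4;1.5;0.4;Iris-setosa", 4))
```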

Finding the nearest neighbours

Finding the nearest neighbours ran into the restriction that RDD operations cannot be nested (one RDD cannot be referenced inside a transformation of another). Overcoming this involved creating an RDD comprising all pairs (training instance, test instance) using cartesian().

nearestNeigbours = trainingset.cartesian(testinstance) \
    .map(lambda (training, test): (training, distanceAbs(training, test, numfields))) \
    .sortBy(lambda (trainingInstance, distance): distance) \
    .take(numNearestNeigbours)

cartesian() creates an RDD with the required K-V pairs. The map statement creates a pair (training instance, distance to test instance). The sortBy() method then sorts these pairs by value (the distance) in ascending order, and take(..) crops the result to the desired number of nearest neighbours.

Assigning the test instance to a class

The code below transforms the nearest neighbours into a list of class names; the assigned class is then the most common name in that list.

# (kv) pair is typically (u'5;3.4;1.6;0.4;Iris-setosa', 0.08000000000000014)
# training = u'5;3.4;1.6;0.4;Iris-setosa'
# training.split(";")[-1] = Iris-setosa
neighbourNames = map(lambda (trainingInstance, distance): trainingInstance.split(";")[-1], nearestNeigbours)

Check the accuracy by looping over the test set comparing the assigned class in the training set to the actual class in the test set.
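That check can be sketched as follows, assuming a classify() function wrapping the steps above returns the predicted class name for one test record (the names and toy classifier are illustrative):

```python
def accuracy(testset, classify):
    # Compare the predicted class with the actual class (last field) of
    # each test record and return the percentage classified correctly
    correct = sum(1 for record in testset
                  if classify(record) == record.split(";")[-1])
    return 100.0 * correct / len(testset)

# Illustrative: a "classifier" that always answers Iris-setosa
tests = ["5;3.4;1.6;0.4;Iris-setosa", "6;2.9;4.5;1.5;Iris-versicolor"]
print(accuracy(tests, lambda r: "Iris-setosa"))  # prints 50.0
```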

Wrapping up

The steps needed to implement the KNN classification algorithm have been outlined. The two major Gotchas encountered were being able to run only one spark context at a time and being unable to use nested RDDs. The first problem was not a major one as the python methods have access to the global scope. The second one was overcome using the cartesian() method which may prove expensive for very large training sets.

The algorithm worked well on the iris dataset, but might not work so well on less well ordered sets. One improvement would be to deal with a skewed dataset ( where one class dominates) by weighting the data accordingly.

  1. Some notes on configuring Pyspark to run code from Eclipse or from the command line

Tuesday, 10 November 2015

Don't Blame Agile, blame the culture

Agile has had a lot of criticism; it has even been claimed that Agile can be used to destroy creativity.

I am unaware of any successful efforts to apply Agile outside software development and outside the business context. Thus Architects, Academics and Accountants, to name but three, do not have to handle the daily status updates, backlog grooming etc. that tend to dominate developers' days. This is partly because their work does not fit the standard pattern of atomised tiny tasks described in baby language, partly because they tend to work more as individuals than as teams, relying on their colleagues only for occasional support, and partly because the Agile Manifesto was not designed to include them. In military terms a software development team is like a platoon of squaddies, whereas the higher, more functional levels are more like commandos, sometimes banding together to use each one's individual skills but often acting as lone agents.

It is one of the tragedies of the software industry that development teams consider themselves the equivalent of the SAS or the Seals but are, by statistical necessity and the Dunning-Kruger effect, more often like a bunch of grunts.

The Agile Manifesto
The agile manifesto states

We are uncovering better ways of developing
software by doing it and helping others do it.
Through this work we have come to value:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

It is immediately obvious why it has been applied only to development teams and to business.

It states clearly that it applies to the development of software; secondly it explicitly mentions customer collaboration, contract negotiation and working software. This explains why the agile process is applied only to development. It was always intended to apply only to software development.

The manifesto can be rephrased without reference to software

We are uncovering better ways of working
Through this research we have come to value:

Individuals and interaction over processes and tools
Responding to change over following a plan

Note that nowhere in either version is there any notion of atomised user stories, work tracking etc., and the second, non-software, non-business version reads as people oriented common sense. Of course processes and tools and even plans are needed, but these are guides, not, as so often happens, straitjackets. When evaluating an implementation of Agile we must ask, as Perceval first failed to ask in the saga of the Holy Grail, "Whom does the process serve?". Unfortunately the process seldom serves the team, seldom serves the enterprise, and often serves only line managers.

Agile as a problem solving technique and as a process

As a problem solving technique Agile involves breaking a big task into smaller manageable tasks and replanning as change (more often called manure) happens. Since every solution of a wicked problem, for example software development, reveals new aspects of the problem which may invalidate the solution found, replanning is unavoidable. Architects, Academics and Accountants in the forensic arena meet these problems all the time and apply an agile methodology.

When we move from the individual to the team communications become vital and an Agile Process has evolved requiring communications within and between teams. Hence in the software development world JIRAs and Scrum meetings arose. And since senior management no longer walks the floors of today's code factories this process became used for management tracking and the team began to be expected to discipline itself and its own members. At the same time the process of team building allowed managers to build teams in their own image: A detail and process-obsessed manager with the soul of a sixties soviet civil servant will build a team of people like themselves. A laid back manager concerned mainly with results and innovation rather than process and an instinct for when process should be abandoned rather than followed blindly will probably build a more innovative team.

And this is the crux. Most critiques of Agile confuse the process with the problem solving technique. It is the implementation of the process that magnifies dysfunctionality within the organisation, which in turn reproduces the dysfunctionality of the industry, which in turn reproduces the dysfunctionality of business culture, which still tries to emulate a top down military organisation. At the top of this chain is the dysfunctionality of society, especially western society, with its notion of rulers (politicians, prime ministers, civil servants and employers) and the ruled. This tends to turn democracies into fake democracies, and businesses that attempt workplace democracy tend to produce a sham, with "empowerment" of workers a tool for further exploitation, regardless of the good intentions of those at the top.

So don't blame Agile, blame your corporate culture. Then blame the business culture that arose when the rise of factories in the 18th century, together with the clearance of the commons and the Highland clearances, destroyed the previous workscape of small cottage based industries. Yes, industrial capitalism raised living standards greatly, but at a great human and social cost. What would we have today if the first factory owners had started their enterprises as cooperatives, not employer dominated satanic mills?

Now, get back to debugging

Friday, 30 October 2015

Running Spark with Python under OS-X from command line and in Eclipse with Pydev

Spark is a "next generation" map reduce platform that claims to run much faster than Hadoop. Apache offers support for programming in Java, Scala and Python.

This article describes setting up Pyspark (Spark with Python) on OS X Yosemite, though much of this should also work under Linux. For Windows you will have to write your own scripts.

The two scenarios described are development ( Under Eclipse) and “Production” which here means running from the command line. In both cases the standard log output is usually noise and was diverted to a file which can be reviewed in case of problems. This makes it easier to see what the results were.

The standard word count program supplied with the pyspark distribution was used to verify configuration. It is assumed you have Eclipse and Pydev installed with a Python 2.7 or greater interpreter and can already develop in Python under Eclipse.

Running from the command line
  1. Download Pyspark from – I currently use spark 1.3.1 for Hadoop 2.6
  2. Note where you installed it (call it SPARK_ROOT for convenience but do not adjust your bash_profile)
  3. Create the following Script (called for convenience)
export SUBMIT_JOB=$SPARK_ROOT/bin/spark-submit
export NUM_CORES=<number of cores on your machine>
$SUBMIT_JOB --master local[$NUM_CORES] $*
# To run: $1 is the python file to run, $2 is the input file etc
# --master local[NUM_CORES] means the job runs locally on NUM_CORES cores.

No Argument check as that is best done inside each program.

Test by running <python program to run> <input files>

The next step is to get the log output into a file instead of the console. From the command line this can be done in standard fashion: $1 $2 2> <your logfile location>

where $1 is the file with the code to run. 
It proved convenient to put this into a second script called just runspark

This now gives a clean console output, and the desired program output can be sent to a file in the normal way

Development (Eclipse)
First read and follow the instructions in [1], which are excellently laid out. Just be sure you type everything in right.

  1. I use the Anaconda Python interpreter with no problems
  2. When creating an Eclipse project use standard package name conventions (com.XXX etc.) otherwise you may accidentally give the package the name of a module the code is trying to import.
  3. It is a good idea to get things working from the command line first so any problems when shifting to Eclipse will probably be Eclipse problems.
  4. Make sure the code runs ok in Eclipse

The next stage is to divert the log output to a file

  1. If you have followed the instructions you will have a conf directory at the top level of your Eclipse project which holds a log4j properties file, and the environment variable SPARK_CONF_DIR in your Eclipse run configuration will point to this directory
  2. Create a folder Logs at the same level as the conf directory
  3. Replace the contents of the log4j properties file with the following
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=Logs/<logfile name>
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

Running again should (will) give you a clean console output, with the log output available for review in your log file

Note that to run the script from the command line and send the log output to a named log file you need to create a log4j properties file in the conf directory of your Spark installation and change the log4j.appender.file.File line above to point to your desired standard log file


Friday, 23 October 2015

Changing Xmind Default Font size OS X

I have used Xmind on OS X for some months now and found the fonts too small.

I went to Google and found an old answer ( ) that related to Ubuntu but was a little unclear

Luckily OS X is based on fairly standard Unix

Here are the steps I took. Italics indicate something to type
  1. locate defaultStyles.xml 
  2. In Terminal navigate to the folder indicated:
     cd <path to>/Contents/Resources/plugins/org.xmind.ui.resources_3.5.1.201411201906/styles
  3. cp defaultStyles.xml defaultStyles.old
  4. open -a defaultStyles.xml   (replace with your favourite editor)
    There are style definitions for each level of topic: Central Topic, Main Topic, Subtopic etc;
    they start fo:font-size=......
  5. Change the font sizes to taste
  6. Double check your edits
  7. Save defaultStyles.xml
  8. Close and restart Xmind

And that was it.  The process for Windows should be similar.

Looking round inside the folders under the folder suggested above, it seemed it would be possible to do a lot more, but this did the business

Final step: create an Xmind sheet with a link to the styles folder for future reference, and put all the above in a note in that sheet

Sunday, 2 August 2015

Fixing outlook when it would not start

Trying to delete a large file in Outlook (20MB) led to the window greying out, and the only way to recover was to restart Outlook.

After a few restarts a message came up saying Outlook could not start, and pinpointed a file with a .ost suffix as the problem.

Outlook uses this file to recreate your mailbox. It could not be deleted as Outlook was using it, even after trying to stop it.

Since it would have taken longer to work out how to close the service and then reboot (you can tell I am a UNIX guy), the solution went as follows

1. Rename the .ost file: I gave it a second .old suffix (Did I say I was a UNIX guy :) )
2. Restart Outlook. A message flashed up saying that Outlook was being prepared for first use
3. Outlook started with an empty inbox. PANIC
4. Remove the new .ost file and remove the .old suffix from the old .ost file
5. Restart Outlook

SUCCESS: all old mail appeared. Back in business

Your mileage may vary. This worked for me

Sunday, 26 April 2015

The problems involved securing a file

Security: a human as well as a technical problem
This note is a rumination on security which shows that absolute security is unattainable. Some of the complexities of achieving a high level of security are presented. This level of security is only needed for highly sensitive documents, and part of the task of a secure systems architect must be to establish the level of security needed and determine whether the necessary hindrance of work is justified.
It is not suggested that this is a definitive list of the issues involved in securing a single file, let alone a system, more a consciousness raising effort.


A file can be considered secure if
  1. CRUD (Create, Read, Update, Delete) access is available only to authorised users (Humans or processes)
  2. Only authorised users may modify a user's CRUD access
    1. As a corollary attackers may not deny authorised users access.
  3. Authorised users cannot compromise file security
The last condition is probably impossible to fulfil: anyone with a key to a safe can get at the contents of the safe.

I will have two scenarios in mind.
  1. The file is on a device (laptop, phone etc) to which a user has direct access.
  2. The file is on a server and accessible via a web service or web application.

Perimeter Security

Here the file is open to anyone who can access the device or the web application. Generally this means they need a password and username. This can be made more difficult by requiring a token or using biometrics (and the pain of trying to register, say, fingerprints makes this option unattractive).
But if an attacker has physical access to the device all bets are off.
Password cracking is hard especially if there is a lockout after too many attempts. Shoulder snooping a legitimate user or using malware to install a key logger is possible but as much a matter of luck as skill. Of course the defenders may have installed a key logger to monitor access and malware may be able to get hold of those records.
But removing the hard disc, making a bit level copy and scraping all files off the copy is much easier. It also has the added bonus of getting all the files on the disc, with any valuable information they might contain.
Alternatively an attacker, having identified the location of the file, could change the content on the original disc or make the file unusable.
So why not hide the file itself on a secure server and only allow remote access? As long as no local copies can be made (for example in a browser cache) the content is secure, right? The user still needs an ID and password but extra security like a VPN tunnel and/or using the latest secure transport protocol plus two factor authentication should be enough?
The problem here is that you have to be sure the application is secure. Web Applications are not easy to secure properly, and just one flaw in the security protocol could let an attacker get hold of the data, for example via a directory traversal attack. At the worst text could be scraped from a browser screen.
To prevent man in the middle attacks you also need to ensure the file is encrypted in transit using TLS (Transport layer security) or ssh for internal access and ensure the user needs to submit a password before getting access. And make sure no plain text copies are left lying around.
The bottom line is that perimeter security, like a cylinder lock, deters only honest people. The measures here make it harder to breach security but the more valuable the data the harder attackers will try to get to it. The goal of security is to make the cost of getting the data more than the value of the data to the attacker.

Encrypting the file is an obvious next step. Legitimate users will be able to view the contents and the device can be shared with others not authorised to see the content. This is like having a locked cabinet marked “Top Secret” in the middle of an office. It tells attackers where to look for the gold.
Problems with encryption include back doors and insecure implementations not to mention ensuring the password is not inadvertently compromised (The more secure the password the more likely it is to be written down. Few people can remember passwords like sASks1029”))”!, and the problem gets worse the more such passwords people need to remember. Secure password wallets simply concentrate all passwords in a single weak spot.
For really sensitive material two users could be needed to unlock the file, each having half the password. But then backup users need to be available in case one person is ill or on vacation. And the more people who know the parts of the password, the more likely a breach will occur.
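The "half the password each" idea can be done a little more safely than literally splitting the string: XOR secret splitting gives each user a share that reveals nothing on its own, and both shares together recover the secret. A sketch (function names are illustrative):

```python
import os

def split_secret(secret):
    # One share is random; the other is the XOR of the secret with it.
    # Neither share alone reveals anything about the secret.
    share1 = os.urandom(len(secret))
    share2 = bytes(a ^ b for a, b in zip(secret, share1))
    return share1, share2

def combine(share1, share2):
    # XORing the two shares recovers the original secret
    return bytes(a ^ b for a, b in zip(share1, share2))

s1, s2 = split_secret(b"vault password")
assert combine(s1, s2) == b"vault password"
```

The organisational problems (backup holders, rotation) remain; this only removes the weakness that each half of a literally split password narrows the search space.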
Encryption is a useful aid to security then, but not a silver bullet. You need to be sure the algorithm is correct and securely implemented (no local copies in plain text in obscure directories) and that the users can remember the password.
Access Control
Granting or denying users access is a different kettle of fish. In a role based access system this can be delegated to users in a special administrator role, and access, and the right to control access, can be granted or removed by an administrator. As before there should be at least two administrators, and ideally changes in access control would need a second administrator's approval. The question of who can create the special administrators raises the spectre of an infinite regress. This is a problem that has to be solved organisationally, though technology may help.

The Wrap
This has just scratched the surface of the problems involved in the apparently simple task of securing a file. It turns out that perimeter security deters only the honest, physical access lets attackers do whatever they want, and encryption has its own minefield of problems. Finally, the problems of granting and denying access need to be solved organisationally rather than technically. For really sensitive content a "four eyes" principle, whereby two users must collaborate to access the content, will minimise the risk of rogue users giving the content to unauthorised recipients.
The bottom line here is that total security is impossible, and the amount of effort devoted to securing a document should be proportional to the value of the content and the cost of losing it or having it leaked to the wrong people. One should always have the hierarchy of secrecy in mind (Restricted, Confidential, Secret, Top Secret, List Access Only and Embarrassing) and assign content to one of these before deciding the effort needed to keep it secret.

Sunday, 21 December 2014

Basic Cyber defences: Secure cookies, Http headers and Content Security Policy

There is no such thing as total security either in real life or online. All security can do is make the cost of defeating it greater than the reward. Security measures, while not worthless, can only reduce the risk.

For cyber security some HTTP headers can be used to reduce the risk, and these are supported by all major modern browsers. A list of useful headers is given in [2]

Three defences will reduce risk considerably

a. Setting the HttpOnly cookie flag
b. Setting X-XSS-Protection: 1
c. Using Content-Security-Policy

None of these are magic bullets but they are valuable parts of a total security package.

The HTTP only cookie flag

The HTTP only flag is like a Yale lock. It keeps out the amateurs and the lazy bad guys, but not the determined attacker. One common cross site scripting attack is cookie theft, especially of a session ID, and one way to reduce this risk is to use the HttpOnly flag, which is supported by all modern browsers.

This ensures cookie values cannot be accessed by client side scripts e.g JavaScript.

The simple way to do this is to append "; HttpOnly" to the cookie value.

It is possible to get past this flag by using a combination of cross site scripting and cross site request forgery to force users to generate requests, which means attackers do not need to access the cookie. Other techniques to get round HttpOnly have been considered, such as using the HTTP TRACE verb. But this flag is cheap to use and stops simple attacks.
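For comparison with the servlet examples that follow, the flag can also be set with Python's standard library cookie support; the cookie name and value here are illustrative:

```python
from http.cookies import SimpleCookie

# Build a Set-Cookie header carrying the HttpOnly (and Secure) flags
cookie = SimpleCookie()
cookie["sessionid"] = "abc123"
cookie["sessionid"]["httponly"] = True
cookie["sessionid"]["secure"] = True
header = cookie.output(header="Set-Cookie:")
print(header)
```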

In servlet 3.0 you can configure this in web.xml as follows
<session-config>
  <cookie-config>
    <http-only>true</http-only>
  </cookie-config>
</session-config>

Older versions of Tomcat (before 7) only allow this to be set in the server.xml file.
In Tomcat 7.0 this attribute is enabled by default, which means your JSESSIONID will be HTTP-only unless you change the default behaviour in server.xml as well.

You can also programmatically add an HTTP only cookie directly to the response as follows

String cookie = "mycookie=test; Secure; HttpOnly";
response.addHeader("Set-Cookie", cookie);

The servlet 3.0 API adds the convenience method setHttpOnly() for adding this flag.

The Cross Site Scripting header

X-XSS-Protection is a header first created by Microsoft to block common reflected cross site scripting. It is enabled by default in Internet Explorer, Safari and Chrome, but not in Firefox.

There are three modes:

1) the value of one is the default behaviour and tells the browser to modify the response to block detected cross site scripting attacks

2) the value of zero will disable cross site scripting protection completely

3) specifying the value of one with mode=block tells the browser to block the attack and also prevent the page rendering entirely. This means users will see only an empty browser page.
This mode should only be used after usability testing.

You can also set this header in Java, for example

X-XSS-Protection: 1
response.addHeader("X-XSS-Protection", "1");

X-XSS-Protection: 0
response.addHeader("X-XSS-Protection", "0");

X-XSS-Protection: 1; mode=block
response.addHeader("X-XSS-Protection", "1; mode=block");

It may be better to set this from the web server configuration, which will vary with the server [1] 

Content Security Policy

A Content Security Policy is like a higher level whitelist. The Content Security Policy (CSP) mechanism lets a site define trusted sources of content and reject content from other sources. Unfortunately this involves various restrictions which mean, for example, that JavaScript cannot be run inline or even appear in the page, and the style in which JavaScript is written has to change, mainly by adding event handlers, in separate files, to page elements [3]. This means CSP can be expensive to integrate, and a site that uses it must ensure its developers know enough about CSP and the JavaScript changes needed that they will not be confused by, for example, <button>a button</button> with no apparent handling code.

CSP is not meant to be a frontline defence against attacks, but a defence in depth to minimise harm caused by content injection attacks [4]

CSP prevents man in the middle attacks, which are undetectable over HTTP, as well as most cross site scripting attacks other than those through user input, which can be escaped before use, though with DOM injection attacks may still be possible [3]

How To set up CSP

Include a CSP header in your response which looks like one of the following (note the colon)

    Content-Security-Policy: policy
    Content-Security-Policy-Report-Only: policy

where policy is a string of policy directives separated by semicolons.

A policy directive is a policy directive name followed by a list of URLs separated by spaces; directives are separated by semicolons:
Content-Security-Policy: policy-directive-1 url1 url2; policy-directive-2 url3 url4; etc.

A full list of policy directives is given in [4]. The most important is default-src, which specifies the default sources for all types of content. It is not obligatory but is a good idea. It is also a good idea to include the script-src directive, which specifies the sites from which to accept scripts. The 'unsafe-inline' source keyword should not be used without a very good reason.

The default-src directive acts as a fallback for the following directives:
  • script-src
  • object-src
  • style-src
  • img-src
  • media-src
  • frame-src
  • font-src
  • connect-src
If not specified explicitly in the policy, the directives listed above fall back to the sources given by default-src; if default-src is missing too, the browser allows all sources for them.

1. Content-Security-Policy: default-src <url>
 Documents can only be fetched over HTTPS from the stated URL.

2. Content-Security-Policy: default-src 'self'; img-src *; media-src <url1> <url2>; script-src <url3>

 Content is only allowed from the document's original host, with these exceptions:
  1. images can come from anywhere
  2. media can only come from <url1> and <url2>
  3. scripts can only come from <url3>

3. Content-Security-Policy-Report-Only: default-src 'self'; report-uri http://localhost/policyviolations.html
    Action will not be taken but violations will be reported to the specified URI and will show
        a) document-uri: the document where the attack took place,
        b) violated-directive: which directive was violated
        c) script-sample: a portion of the XSS attack
        d) line-number: the line number for debugging and research.
4. The header can also be set programmatically:
    response.setHeader("Content-Security-Policy-Report-Only", "default-src 'self'; report-uri http://someaccessibleuri");
Note that you can have both the CSP header and the CSP-Report-Only header active, so you can enforce one policy while testing another.

The Wrap
To reduce XSS risk
Set the HttpOnly flag on cookies to prevent session hijacking
Set X-XSS-Protection: 1
Plan to use Content-Security-Policy

The HttpOnly flag is not a header in its own right, but setting it will prevent a lot of common XSS attacks.

The X-XSS-Protection header is not supported by all browsers, but is enabled by default in IE, Chrome and Safari and prevents reflected XSS attacks

CSP is a full strength protective approach that allows definition of trusted sources for various types of content but, since it forbids placing JavaScript inline in a page, it may break existing functionality unless extensively tested. Introduction of CSP should be regarded as a major change and planned thoroughly.


  2. Style changes needed with content Security policy