Monday, 4 January 2016

Karojisatsu: Mental health and suicide risks posed by IT culture for Managers and Developers

Karoshi is a Japanese term meaning “death from overwork”. While most of the discussion around Karoshi deals with older workers dying from heart attacks and strokes, the Japanese have also identified “Karōjisatsu”, suicide due to overwork [4]. It is not unknown in the USA [4] and may occur in the UK.

Information Technology is not a profession where you can easily stay healthy, sane, fit and have a life outside work. Some of the problems are to do with the culture of the industry, some to do with business culture in general and some to do with Western culture and the Protestant Work Ethic that regards even valueless work as valuable in itself and has mutated into the Western Employment Ethic, work not being regarded as work unless an employer is involved.

I have discussed the physical problems and touched on the mental problems elsewhere [6] but the psychic problems and risk of burnout, breakdown or even suicide need further exploration. I can only comment on the Technology Industry in the UK from experience but I get the impression that similar issues arise in Finance, though some may be peculiar to IT. I note the deterioration in the appearance, and possibly the health, of senior politicians. I also once went for an interview for a university lectureship, which I did not get. A year later, for reasons I forget, I had to register for a course taken by the successful candidate and noted that at the interview they had looked 30 and after a year in the job they looked 45, with grey hair and signs of stress. Maybe my guardian spirit was looking after me. Fifteen years' experience as a contractor in various European countries strongly suggested that the problems I mention here are much rarer in mainland Europe than in the UK and USA (though mainland European countries have their own corporate and industry dysfunctionalities). It also suggested contracting is, apart from the regular financial crises involved, better for mental health.

Some of the causes of Karoshi are excessive hours, all night work and holiday work, plus stress caused by being unable to meet company goals and by screwed up management.

Managers are not immune. They may have to lay off staff and feel guilty for being unable to protect their staff.

All this reduces morale and performance, often for no reason other than overly aggressive deadlines and macho posturing.

Long Hours

There is a longstanding consensus that a forty hour week is optimal, which may be true for physical work but is almost certainly untrue for intense mental work. In Europe lorry drivers have restrictions on the number of hours they can drive because a fatigued driver is a hazard. Companies should likewise restrict the hours their IT staff and other brainworkers put in, since IT work is much more tiring than driving. Some employers in Sweden have recently trialled a thirty hour week and report an increase in productivity and profit. The economy is likely to boom as well since workers have more free time in which to spend money.

Young IT professionals tend to regard burnout as a badge of honour, or at the very least a rite of passage, and try for regular 100 hour weeks. That is more than 14 hours a day, seven days a week. Nobody can keep that up. Nobody can maintain good performance like that. Nobody can stay healthy like that. Ironically, at one company a couple of contractors each put in time sheets for 360 hours in one month and the management response was to install time clocks, and in most places I worked contractors were not allowed to bill more than 40 hours a week without approval. In the UK and US time clocks would be used to note who was working long hours and to demonise others as slackers or uncommitted.

Moves to eliminate a long hours culture tend to be resisted by those who have benefited from it, whether in Finance, IT or when considering the 80-120 hours worked by Junior Doctors in the UK. The response is inevitably of the form “It never did me any harm” (how do they know?). In the case of Doctors the risk to the patients is ignored, for as the old saying goes “Lawyers bill their mistakes, Doctors bury theirs”. Sometimes it requires a lawsuit for the company to change its ways.

Impostor Syndrome

Impostor Syndrome is the reverse of the Dunning-Kruger effect. To quote Wikipedia:

Impostor Syndrome is a term coined in the 1970s by psychologists and researchers to informally describe people who are unable to internalize their accomplishments. Despite external evidence of their competence, those exhibiting the syndrome remain convinced that they are frauds and do not deserve the success they have achieved. Proof of success is dismissed as luck, timing, or as a result of deceiving others into thinking they are more intelligent and competent than they believe themselves to be.

The Dunning-Kruger effect is where people regard themselves as better than they are. Typically young IT professionals overrate themselves and more experienced professionals underrate themselves.

The risk is that programmers think they need to work harder to become good enough. That means spending more time coding, every waking minute, and taking on an increasing number of projects. And that leads to burnout and possibly suicide.

The incidence of impostor syndrome is around 40%, with a lifetime incidence of 70%, and men and women are probably equally affected. It is common in professions where work is peer reviewed, for example software development [9], though it seems to be rarer in Academia where reviews of a paper are expected, anonymous and regarded as helpful. It also helps that Academic papers, other than conference papers, rarely have deadlines.

Impostor Syndrome is not a mental disorder, more a reaction to certain situations. Undue susceptibility to Impostor Syndrome can be identified through personality tests but does not seem to be a distinct personality trait. Sufferers tend to reflect and dwell upon extreme failure, mistakes and negative feedback from others. If not addressed, impostor syndrome can limit exploration and the courage to delve into new experiences, for fear of exposing failure. High achievers, or those who have achieved a lot in the past, may well experience it in a new role.

A number of management options are available to ease impostor syndrome. The best is to discuss the topic with other individuals early on in the career path. Most sufferers are unaware others feel inadequate as well. Once this is addressed, victims no longer feel alone in their negative experience. Listing accomplishments, positive feedback (a simple “well done” from managers) and success stories will also help manage impostor syndrome. Finally, developing a strong support system that provides feedback on performance and has discussions about impostor syndrome on a regular basis is imperative for sufferers.

The Real Programmer Syndrome

The Real Programmer [11] is a cultural stereotype originating, possibly as satire, in 1983. Real programmers disdain such luxuries as IDEs and, where possible, any high level language, and sometimes even disdain assembler, preferring microcode.

A Real Programmer codes all the time and doesn't consider it work. They live to code.

A Real Programmer volunteers to work 60 to 80 hour weeks for no extra monetary compensation, because it's "fun"...

Management love Real Programmers and the image of the Real Programmer is now in the DNA of the Tech Industry. IT has always had a long hours culture but now, unlike Finance or a Japanese company, workers are supposed to do it out of the enjoyment of the work.

Impostor Syndrome can lead people to think they have to work harder to become good, so they overload themselves. Then they slowly burn out. Sometimes they kill themselves [4], though doubtless some would say they would have to have been mentally unstable rather than blame the long hours.

The Older IT worker

It is no secret that the Technology Industry is Ageist. Mark Zuckerberg claimed young people are smarter but is doubtless redefining “young” with every passing year. By a strange coincidence the average age of Facebook employees matches his age exactly.

Older workers have more experience and this lets them be more productive, but the long hours they had to put in when younger and fitter make their bodies less resistant to the stresses of The Job, in particular long hours. In some trades the younger people “carry” the older ones, for example in heavily physical jobs the older worker may find his team shift him to lighter tasks. This does not happen in IT. And so the older worker gets stressed because of having to demonstrate their “Commitment” and “Passion” not only to management but to younger “Real Programmer” wannabes.

Trying to balance family and work while learning new technologies on their own (almost all companies refuse to fund training for their staff, reasoning that it is cheaper to hire a young person with new skills and a bit of experience than train an older person) is a tall order. This leads to burnout and heart attacks [3].

Adam Smith, in “The Wealth of Nations” (1776), stated the risks of overwork admirably when discussing piece workers. This applies to Real Programmers and the Technology Industry generally, even though Tech workers are not hourly paid or paid by lines of code. Note the eight year limit.

“Some workers, indeed, when they can earn in four days what will maintain them through the week, will be idle the other three. This, however, is by no means the case with the greater part. Workers, on the contrary, when they are liberally paid by the piece, are very apt to overwork themselves, and to ruin their health and constitution in a few years. A carpenter in London, and in some other places, is not supposed to last in his utmost vigour above eight years. Something of the same kind happens in many other trades, in which the workers are paid by the piece, as they generally are in manufactures, and even in country labour, wherever wages are higher than ordinary. Almost every class of artificers is subject to some peculiar infirmity occasioned by excessive application to their peculiar species of work.”

To The Management

Discourage long hours wherever possible. Set an example and treat yourself well. Look out for unexpected changes in performance, whether sudden or gradual, especially for the worse. Remember that long hours are bad. Some Swedish employers have recently trialled a six hour work day, and this is about the limit a brain worker, such as a developer, or you yourself, can handle. The loss to the company of an experienced developer, Architect or Sysadmin who is burning out can be significant. You can manage such a person out and hire another to replace them without damage to your career once, maybe twice, but ultimately the resultant missed deadlines and reduced project scope will be traced back to you. Look after your reports and they will look after the business.

To the Worker

If you are in a company with a long hours culture, and this can be subtle, for instance with only employees working long hours being promoted, try to get your manager's support in balancing work and life. If they cannot or will not help you should quit. If you are well paid remember money is not always worth the price you pay to get it. Just look at any successful politician.

You may be starting to burn out without noticing it. If you come to hate Mondays, especially if the idea of going to work on Monday spoils your Saturday morning, something needs to be done, fast.

And do not rely on your manager to look after your health. Only you can do that.

I have done long hours as a contractor and damaged my health but recovered. I did long hours again in a permanent role. But as a contractor I never suffered burnout, only as a permanent employee. The periods on the bench as a contractor let me recover and preserved my liking for coding, though the increasingly regular financial crises as I got older took a lot of the fun out of contracting. The relentless pace of the permanent role (fast paced, aggressive schedule etc.) nearly killed me and it took some months to start recovering. I no longer have much desire to code, except in my head where I can devise algorithms easily, and am trying for a hands off managerial role or returning to university for a career change if I can afford it.

Learn from my experience. For your sake.

And if you need to talk to someone the author of [4] has pledged themselves to help anyone who needs it. If I could I would help others.

Summing Up

Technology Industry culture is dysfunctional. Some companies, bless their little cotton socks, do their best to look after their workers, but are trapped in this culture and unable to see their chains: a phenomenon best exemplified by those Nazis who said “But some of my best friends are Jews”, or the cemetery in Ireland where Protestant and Catholic are buried in the same graveyard but an underground wall separates Protestant and Catholic graves, and this was presented as a move to break down barriers between the two sects.

Aspects of Technology Industry dysfunctionality, most notably a long hours culture, are shared by other industries, but there is no equivalent of Real Programmer Syndrome: an accountant may have to work long hours but is not expected to do so because it is “fun”.

The manager should be on the lookout for Impostor and Real Programmer Syndrome and take steps to prevent or cure them. The Real Programmer may be good at coding but unable to see the big picture and design good code, let alone handle architecture. The employee should not trust their manager to look after their health.

  7. Yahoo have recently decided to abolish QA and found a decrease in the number of bugs in production. Together with evidence that code reviews are no more effective at finding bugs than requiring a developer to set their code aside for a while and then review it, this suggests that the code review process is damaging to developers and should at least be restructured. But that is a different topic.
  8. IT professional nails the Real Programmer.
  9. The Real Programmer
  10. Dangers of Overwork

Monday, 28 December 2015

Health Risks of IT for managers and workers.

You are a young software developer. Eight hours a day fly by, nine hours a day you notice but hey, your code nearly works. Finally it works and you are winding down when you get an email from over the pond about an urgent problem. You fix the problem and finally depart after eleven or twelve hours. Most nights you also work on side projects or, since technology moves faster than politicians chasing bribes, investigate what you think will be the next big thing (you are probably wrong) and hope The Management adopt it. You dream you are coding.

Next day you do it all again, and the day after that. The management notice you and you are employee of the month.

Fast forward ten years and an extra five stone and 12 inches on your waist. The work is getting harder but your experience makes it go faster. Shame about the knee and back pain. You should exercise but you groan at the idea and it's an interruption from coding. Shame you are still single and no one in sight. Shame you have no conversation other than your work. You survive on food from the vending machines and cheap kebabs. You have become one of the devs from CommitStrip. You dream you are coding.

You get married. For a while, but as with police work, your spouse comes second to The Job and eventually they leave you. They take the dog. You enter clinical depression and end up sleeping on the street, dreaming of coding.

This is slightly exaggerated but programming can hijack your brain and make you autistic. There is always pressure, from peers or management, to work long hours, which you don't even notice because you love coding, right?

Offices are not the healthiest of work environments, and, like doctors, software people do not have the healthiest lifestyles. There are physical, mental and even social risks involved. Some of the hazards of the office are shared with other office workers, albeit experienced more strongly.

The physical damage caused by flying a desk can, if caught early, be reversed by small changes consistently applied. The mental, social and relationship problems may be impossible to fix.

To The Manager

You are not immune to this damage, even if your work is different. There is a longstanding tradition that middle managers work long hours. Senior managers work long hours networking with other senior managers on the golf course and in the bar (criminals spend a lot of time in bars networking: to them that IS their office) while sending memos about timekeeping to the lower orders.

Look after yourself before looking after your reports: set an example by sticking to your contracted hours. Take breaks (note that face to face meetings give you a hint of exercise walking to a meeting room, and can help keep you healthy). If you must work overtime one day, compensate for it as soon as possible.

Take time to relax: switch off the phone after work.

Learn to say NO when overloaded.

Monitor your mental and physical health but avoid hypochondria. Maybe keep a diary.

Look after your reports. Make it clear that you expect long hours to be an occasional emergency measure. Avoid Death Marches. If need be, occasionally wander round in the evening and send any late working programmers home, unless they are working on a time sensitive critical issue, in which case try to work out how to prevent this happening again.

Watch out for signs of physical and mental issues and any change in the performance of your reports. There could be any number of reasons for this and you are duty bound to investigate.

Remember that problems like stress can be insidious and by the time they become obvious may be irreversible.

And do not feel embarrassed discussing the health of one of your reports with them. And try not to stigmatise someone with psychological issues.

To the Developer

Do not rely on your manager to look after your health. The culture in IT and business generally is to extract more and more from workers - and developers are increasingly becoming commodities to exploit – and the manager has their own pressures from above. If it is a choice between you and them guess which they will choose.

Stick to your contracted hours and take any time off in lieu of overtime as soon as possible (or get your manager's agreement to taking it as paid leave). If need be keep a diary of the hours you work. Remember some places feel that if you cannot do your job in the normal number of hours you are probably not up to the job.

Learn to say NO.

Take regular screen breaks. If need be walk around for ten minutes every hour.

Coding divides into Analysis, Inspiration and Perspiration. The latter is when you have a solution and are turning it into code. Inspiration comes when you are away from the machine, often at the least convenient times (Bed, Bath and Bus are the traditional places). So take on board that you do not need to be in front of a keyboard all the time and may become less productive if you are.

Without becoming hypochondriac, monitor your mental and physical health. If possible buddy with a trusted colleague to look after each other. Ask your partner, siblings or children to be brutal in pointing out changes in your health and personality. “My god you have got fat” may be impolite but it could save your job, career or even your life. “You seem to be slowing down and your memory is worse” is an even bigger red light, possibly pointing to Alzheimer's in later life.

Physical Health Issues

You can lose muscle mass sitting in front of a computer all day. This messes up your ability to remain the light, slim athlete you used to be because muscle is far more effective at metabolizing calories than fat. Muscle also weighs more so you can lose weight but still expand sideways while your face turns into a balloon. Your risk of diabetes goes through the roof. Poor diet and lack of exercise can cause cardiovascular disease which not only affects arteries round the heart but can also affect blood flow to the legs and other extremities resulting in peripheral vascular disease, a serious condition that can lead to a heart attack, stroke or diabetes.

Moving up to the gut, now expanded dramatically through poor diet and exercise aversion, you have a 1 in 8 chance of having gained 20 pounds or more and a one in three chance of gaining more than 10 pounds. This is less than the bloat that affects those in financial services (one of the few professions with a culture of longer hours than IT). You may weigh less than when you started your career, but muscle is denser than fat so you may be having problems opening a jar of jam or even walking up two flights of stairs. Lunch at the local Pizza Parlour or bar makes matters worse.

Your beer gut, which would be appropriate in a darts player or Sumo wrestler, leaves you open to heart disease, diabetes and other problems. Diabetes leads to a vast number of other physical and mental problems: blindness, sores that do not heal, testosterone deficiency and erectile dysfunction (impotence). But hey, no need to worry till you are over 45, right? WRONG. Diabetes is moving down the age ladder and even children are getting it now.

By now you may be around 50 and thinking “if I had known I would live this long I would have looked after myself better”.

They say the road to a man's heart is through his stomach. Heart problems and diabetes follow this road carved by your expanding stomach or, worse, hidden fat round your internal organs. Some of the risk factors cannot be controlled but others, like smoking, exercise and diet, can be managed to reduce your risk.

Your upper extremities also suffer, though not from lack of exercise. Repetitive strain injury is likely. Unless you consciously compensate, activities like texting, using a mouse, typing etc. can cause you to tense shoulders and upper arms, which reduces circulation to the forearm just when the thumb and fingers need more blood. Typing all the time increases the risk of arthritis.

If you don't set up your workspace properly you risk back, spine and shoulder problems. Exercising can actually make matters worse if you do not follow a balanced workout and thereby create imbalances.

Poor posture, an easily adopted habit, can set you up for a host of musculo-skeletal problems, indigestion and constipation (which can lead to colon cancer), and lung problems when your posture makes it harder to breathe.

You are also likely to have eye problems and laser surgery to correct these can result in career ending after effects.

We haven't even started on the mental problems yet.

Mental Problems

Long hours and permanent connectivity to your office email overstimulate the brain. So does thinking about coding problems. Lack of time for relaxation and exercise bumps up your stress, as if you do not have enough already. Chronic and excess stress can harm the immune and cardiovascular systems, and increase vulnerability to heart disease, depression, exhaustion, sleep deprivation and overall malaise. Undue stress can also trigger anxiety, with its own symptoms, including stomach pain, dizziness, muscle tension and headaches, decreased concentration, irritability and sexual problems. Extreme anxiety can even increase the risk of cardiovascular diseases, psychological problems, suicide and some cancers.

When stressed you get sleep problems, which again lead to a higher risk of diabetes, obesity, high blood pressure and other health problems. As a result you may suffer from fatigue which, like stress, leads to poor performance and bad decisions. This can impact your job and career, if not more important things.

Overstimulating the brain has other risks. At least one savant has trained so hard at mental problems that they made themselves autistic and some of the thumbnail photos of contestants in competition coding sites show definite signs of autism.

If you get diabetes you can look forward to a host of other mental problems, ranging from major depression through anxiety to bipolar disorder, plus possible mental decline associated with some medications.

Stress can of course mean you perform less well which means more risk of ending up on the streets. In the finance industry many high pressure workers take to drugs to deal with the stress. DON'T. If you find you need alcohol to relax pile on the brakes and see a professional.


Other than leaving IT, the solutions to many of the physical problems are the same as for handling diabetes without medication: diet, exercise and weight control. Other physical problems can be combated or prevented by attention to posture and properly setting up your workspace.

The mental problems require stress management, maintaining a decent work life balance and getting a life away from the keyboard plus making time for sleep. Side projects are NOT a good idea. If you can, stick to your contracted hours. If not find another job.

Take frequent breaks, drink loads of water (coffee in moderation) and try not to get stressed: it's only work. Take up a sport and train three times a week.

The Wrap

Desk work is not healthy. Computer related work is even worse. The pressure to work long hours, whether internal or external, leads to a lifestyle that can spawn a host of physical and mental problems. Constant vigilance is needed to prevent these problems, including reverting to a separation between work and leisure that some may feel is quaintly last century.

Managers should look after themselves first then their reports. Developers cannot assume managers will look after them.

Wider issues such as the role of the culture of the IT industry and the Protestant Work Ethic in ensuring the unhealthy lifestyle of IT workers, and more generally office workers, are left for future research.

Sunday, 20 December 2015

Java 8 Interfaces Lambda-expressions and Streams

My last workplace refused to move from Java 7 after experiments found that some code failed at runtime when the compiler version was raised. Now, as a relatively late adopter, I am able to look at some of the new features of Java 8. First impressions are that Java 8 is best considered a new language, or at worst a superset of Classical Java (the 21st Century version of Cobol), just as C++ is technically a superset of C. The changes could reduce the resulting code and class bloat. As usual the new features raise questions about the underlying implementation which need further research.

Interface changes can reduce the number of classes needed in an application. Lambda expressions can reduce the number of lines of code, though the new syntax can be confusing at first. Streams provide support for parallelism, though only for stateless operations. The new language will leave many trapped in backward looking organisations floundering when they finally emerge.

Interfaces and Default Methods

A useful design/architecture pattern in Classical Java is the Interface-Abstract Class-Implementing class pattern where an abstract class implements an interface and holds default methods expected to be common to all implementations. The New Java makes abstract classes much less useful and I would expect them to be deprecated at some future point. As a result a typical implementation may hold fewer classes as abstract classes will not be needed. The eventual outcome should be simpler designs with less code to harbour bugs.

An interface can also hold static methods (which cannot be overridden in implementing classes), which may allow movement of utility functions into an interface when this is appropriate.

Here is an example

interface Greeter {
    void saySomething();

    // Default method: implementing classes inherit this body.
    default void sayHi() {
        System.out.println("Hi there");
    }
}

public class HelloWorld implements Greeter {
    public static void main(String[] args) {
        HelloWorld hello = new HelloWorld();
        hello.sayHi();         // inherited from Greeter
        hello.saySomething();  // implemented below
    }

    // saySomething() is not a default method so still needs to be implemented.
    @Override
    public void saySomething() {
        System.out.println("Say What?");
    }
}

Interface static methods are useful for providing utility methods and utility classes can be replaced with interfaces that contain static methods.
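As a rough sketch of that idea, the following replaces a would-be utility class with an interface holding a static method, and lets a default method reuse it. The names StaticInterfaceDemo, Words, shout and Greeter are made up for illustration and do not come from any library.

```java
import java.util.Locale;

class StaticInterfaceDemo {

    // Hypothetical utility interface: the static method belongs to the
    // interface itself and is not inherited by implementing classes.
    interface Words {
        static String shout(String word) {
            return word.toUpperCase(Locale.ROOT) + "!";
        }

        // A default method may call a static method on the same interface.
        default String greet(String name) {
            return shout("hi " + name);
        }
    }

    // Picks up greet() without needing an abstract base class.
    static class Greeter implements Words { }

    public static void main(String[] args) {
        System.out.println(Words.shout("hello"));        // called on the interface
        System.out.println(new Greeter().greet("bob"));  // inherited default method
    }
}
```

Note that shout has to be called as Words.shout(...); unlike a static method on a superclass it cannot be reached through Greeter.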

Functional interfaces (interfaces with exactly one abstract method) are a new concept in Java 8.

A new annotation @FunctionalInterface has been introduced to mark an interface as a Functional Interface and avoid accidental addition of abstract methods in the functional interfaces. Functional Interfaces allow the use of lambda expressions for instantiating them.
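A small sketch of how this might look in practice. Transformer and twice are invented names used only for this example; the annotation itself is the standard java.lang.FunctionalInterface.

```java
class FunctionalInterfaceDemo {

    // Hypothetical functional interface: exactly one abstract method.
    // With the annotation present, adding a second abstract method
    // becomes a compile-time error.
    @FunctionalInterface
    interface Transformer {
        String apply(String input);

        // Default methods do not count towards the single-method limit.
        default Transformer twice() {
            return s -> apply(apply(s));
        }
    }

    public static void main(String[] args) {
        // The lambda supplies the body of apply().
        Transformer exclaim = s -> s + "!";
        System.out.println(exclaim.apply("Hi"));
        System.out.println(exclaim.twice().apply("Hi"));
    }
}
```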
Default and static interface methods alone should simplify existing code bases dramatically.

Lambda Expressions

One of the painful aspects of Classical Java is the sheer verbosity needed for Object Orientation. Java 8 allows the walls of text needed even to start a simple thread to be replaced by more compact, initially cryptic, expressions.

Consider this

Runnable r1 = () -> System.out.println("My Runnable");

Equivalent to

Runnable r = new Runnable() {
    @Override
    public void run() {
        System.out.println("My Runnable");
    }
};

What the Lambda expression is doing is instantiating a new Runnable with a run() method that just prints out a message. 
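To show the two forms doing the same job, here is a minimal sketch that starts one thread from an anonymous class and one from a lambda, then counts how many tasks ran. The class name LambdaThreadDemo and the helper runBoth are assumptions made for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LambdaThreadDemo {

    // Starts one thread per style and returns how many tasks completed.
    static int runBoth() {
        AtomicInteger runs = new AtomicInteger();

        // Classical Java: an anonymous inner class.
        Thread classic = new Thread(new Runnable() {
            @Override
            public void run() {
                runs.incrementAndGet();
            }
        });

        // Java 8: the lambda supplies the body of run().
        Thread compact = new Thread(() -> runs.incrementAndGet());

        classic.start();
        compact.start();
        try {
            classic.join();
            compact.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("Tasks run: " + runBoth());
    }
}
```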

Using Lambdas to inject behaviour

It is now possible to inject behaviours into a method using predicates, for example:

private static Stream<String> extract(List<String> words, Predicate<String> predicate) {
    return words.parallelStream().filter(predicate);
}

Predicate<String> catpred = word -> word.startsWith("cat");

Stream<String> catstream = extract(words, catpred);

And one can be even lazier with method references

private static boolean isdog(String target) {
    return target.startsWith("dog");
}

Stream<String> dogstream = extract(words,BehaviourInjector::isdog);
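Pulling the fragments above together, a self-contained version might look like the sketch below. It collects to a List so the result can be printed, uses a sequential stream for simplicity, and also shows predicates being combined with or() and negate(). The word list and the class name BehaviourInjectorDemo are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class BehaviourInjectorDemo {

    // The caller supplies the test; this method only knows how to apply it.
    static List<String> extract(List<String> words, Predicate<String> predicate) {
        return words.stream().filter(predicate).collect(Collectors.toList());
    }

    static boolean isDog(String target) {
        return target.startsWith("dog");
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("catalogue", "dogma", "cathedral", "doge", "bird");

        Predicate<String> isCat = w -> w.startsWith("cat");
        Predicate<String> isDogRef = BehaviourInjectorDemo::isDog;  // method reference

        System.out.println(extract(words, isCat));
        System.out.println(extract(words, isDogRef));
        // Predicates compose: neither cat nor dog.
        System.out.println(extract(words, isCat.or(isDogRef).negate()));
    }
}
```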

This feature, as just one example, renders the Classical Java methods for custom sorting pretty much irrelevant. For example, as in [2], given a class Person with fields name and age we can sort a list of persons by name:

Collections.sort(persons, (pers1, pers2) -> pers1.getName().compareTo(pers2.getName()));

Here the lambda takes the place of a hand-written Comparator<Person> class.

(The official Oracle docs still require a comparator for custom sorting at the time of writing)
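For comparison, here is a sketch of custom sorting written both with an explicit lambda and with the Comparator factory methods added in Java 8. The Person class below is a stand-in invented for this example (the original one from [2] is not reproduced here), and the sample data is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortDemo {

    // Hypothetical Person class with the name and age fields the text mentions.
    static class Person {
        private final String name;
        private final int age;

        Person(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
        @Override public String toString() { return name + "(" + age + ")"; }
    }

    // The lambda is a Comparator<Person>: it returns an int, not a boolean.
    static List<Person> sortedByName(List<Person> people) {
        List<Person> copy = new ArrayList<>(people);
        copy.sort((p1, p2) -> p1.getName().compareTo(p2.getName()));
        return copy;
    }

    // Comparator factory methods express multi-key sorts without any lambda body.
    static List<Person> sortedByAgeThenName(List<Person> people) {
        List<Person> copy = new ArrayList<>(people);
        copy.sort(Comparator.comparingInt(Person::getAge).thenComparing(Person::getName));
        return copy;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person("Carol", 41), new Person("Alice", 41), new Person("Bob", 29));
        System.out.println(sortedByName(people));
        System.out.println(sortedByAgeThenName(people));
    }
}
```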


Streams

Streams bear some resemblance to PySpark RDDs (Resilient Distributed Datasets) which represent “an immutable, partitioned collection of elements that can be operated on in parallel.” The difference seems to be that a Java 8 Stream is computed on demand by operating on a specified data source whereas the RDD seems to exist independently, but the PySpark documentation seems not too clear on this.

Streams are created on demand and produce a pipeline of data from a specified source. This pipeline can be operated on sequentially or in parallel, with, according to Oracle, the parallelism being largely transparent to the user. Because a stream is a one-shot pipeline rather than a stored collection, it cannot be reused.

There are limitations to streams, for example:

  • The operations on a stream must not depend on the order in which they are performed
  • Streams, once consumed, may not be reused
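Both limitations can be seen in a short sketch. The first method shows that a second terminal operation on the same stream is rejected with an IllegalStateException; the second shows a stateless, order-independent operation giving the same answer sequentially and in parallel. The class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class StreamLimitsDemo {

    // A stream can be consumed once; a second terminal operation fails.
    static boolean reuseFails() {
        Stream<String> stream = Stream.of("a", "b", "c");
        stream.count();                 // first terminal operation consumes the stream
        try {
            stream.count();             // second use of the same stream
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    // Summing lengths does not depend on encounter order, so it is safe
    // to run either sequentially or in parallel.
    static int sumOfLengths(List<String> words, boolean parallel) {
        Stream<String> s = parallel ? words.parallelStream() : words.stream();
        return s.mapToInt(String::length).sum();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "dog", "horse");
        System.out.println("Reuse rejected: " + reuseFails());
        System.out.println("Sequential sum: " + sumOfLengths(words, false));
        System.out.println("Parallel sum:   " + sumOfLengths(words, true));
    }
}
```

To run a second query, build a fresh stream from the source collection rather than holding on to the old one.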

Wrapping up

Java 8 was launched as a major update to Java. In many ways the functional programming aspects introduced make it a new language. The changes to interfaces could, properly used, reduce the number of classes in a typical application, Lambda functions will reduce the number of lines of code and improve readability and Streams will go a long way to provide transparent support for parallelism, at least on single multicore machines.

It seems unlikely that Java 8 will render large scale parallel programming frameworks such as Hadoop and (Py)Spark redundant since the use cases seem different, nor does it seem likely that New Java (Java 8 and beyond) will do anything to impact the popularity of languages such as Python which also support both functional and object oriented programming but require much less boilerplate code.

Time will tell but one pleasant possibility is that Java will evolve into a functional language with residual support for Object Orientation. If that happens it will probably be when another programming paradigm pops up and becomes popular.

A more likely scenario is that various bits of cruft will build up. Intellectual fashions, the standard developer intellectual arrogance (I have to admit having been guilty of this in the past and plead immaturity), the desire to show off (ditto), the love of developing complex solutions where simple ones will suffice [4] and the tendency to adopt technological fashions uncritically will then lead to a situation where the simplifications Java 8 has brought are lost and another simplifying paradigm is needed.

  1. In all fairness, the first solution of a problem is usually too complex and refactoring simplifies it while improving non-functional aspects, but the current desire to shorten time to market makes time for such refactoring hard to find. Developers are not to blame for everything.

Saturday, 12 December 2015

When Code flow went wrong

This note describes an old, rare problem which, given the cyclic nature of technological time, where yesterday's solution is forgotten and tomorrow solved anew to a fanfare of acclaim, will doubtless reappear, laughing.
The first time I encountered this was in 2004 using Eclipse. Control, as shown by the debugger, was going to somewhere unexpected for no apparent reason. There was no error message or warning.
I had to write a quick program, run from the command line, that exercised the errant code and mimicked as far as possible runtime conditions. When this was done a NoSuchMethodException appeared. It turned out that the jar file used lacked the method being called. Using a different version of the jar file fixed the problem.
One would think this could not happen now given the rise of Maven, which requires the Jar file version to be specified. Not so.
The second time I found this problem it arose because the developer did not have the right Maven POM file.
One would hope the problem in Eclipse that prevented a no such method exception being thrown has been fixed.
I am not sure why this behaviour arose. The most likely possibility is that the compiled class contains a reference to a specific location in the Jar file and if the Jar file is incorrect the location is incorrect. The fact I got an exception using a command line test harness but not in Eclipse also suggests an exception was being swallowed.
The lesson is you cannot just replace a Jar file and expect the system to work: you need the correct version. The POM file must specify the correct version and the developer who uses the component must have the correct POM file. In the heat of development this cannot always be guaranteed. Fortunately the problem is easily recognised after experiencing it once, and unit tests should catch most problems.
Java developers of course have this ingrained in their DNA by now, but the same problem could occur with any library in any language. Log files here were useless.
More seriously, this could be a security hole: an attacker, whether an insider or a man in the middle, could have replaced the jar file with one containing malicious code, and this code could be active a long time before it was discovered. This means that in critical areas the provenance of a jar file must be verified and if possible the code should be decompiled and checked for vulnerabilities.
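One modest part of verifying provenance is checking that the file on disk matches a checksum published by a source you trust. The sketch below is only an illustration of that idea in Python; the function names sha256_of_file and jar_matches are hypothetical, and it assumes a trusted SHA-256 digest is available from somewhere other than the place the jar was downloaded from.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large jar files need not fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jar_matches(path, expected_hex):
    """True if the file's digest equals the digest obtained from a trusted source."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

A matching checksum only shows the file is the one the publisher released. It says nothing about whether that release is itself free of vulnerabilities.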
Perhaps the bottom line is that, when using an external jar file or library, third party code must be treated as guilty until proven innocent, and even if that code is thoroughly tested it may not play well with your code, especially if the version changes.
Realising the code flow at runtime may not be that indicated by the code can lead to a deep sense of insecurity.

Thursday, 19 November 2015

K nearest neighbour classification in Pyspark

K-nearest neighbour (KNN) classification is a supervised technique that looks at the nearest neighbours, in a training set of classified instances, of an unclassified instance in order to identify the class to which it belongs. For example, it may be desired to determine the probable date and origin of a shard of pottery. There are several variations and refinements of KNN [1]. Some, perhaps most, of these are not needed in a parallel batch processing environment such as Hadoop or Spark but may become relevant for the rapid processing portion of a lambda architecture.

The simplest KNN algorithm was implemented in Pyspark, the Python API for Spark, a cluster computing framework claimed to be some ten times faster than Hadoop when run from disk and 100 times faster when run in memory. The well known Iris dataset was used to verify the implementation. No effort was made to allow for the effects of skew in the training set. Before Pyspark coding began the algorithm was implemented in a conventional, serial way based on [2], with some slight improvements to the code, for example using list comprehensions rather than loops. Doing this allowed a deeper understanding of the algorithm and gave a reference point for assessing the Pyspark implementation.

Both implementations were developed in Eclipse with a view to eventual command line use of the Pyspark implementation. Eclipse/Pydev was configured as in [3] with the log output sent to a file. Eclipse being Eclipse, occasionally the log output did not go to the file, but did go to the file on rerunning the programme.

The Pyspark code is shorter and more easily understood (once the basics of Pyspark are mastered) than the serial version. Please note that any code presented here is proof of concept code, to be used at your own risk.

The Algorithm
A set of known instances, in this case Irises, is used to train the algorithm. Since not many people go round measuring Iris details for fun, the data set was split randomly into a training set and a test set, with roughly two thirds of the data used as the training set.

The algorithm classifies an item in the test set by computing the distance from the test item to each member of the training set, taking the K nearest neighbours and assigning the test item to the class most represented in the K nearest neighbours.

The steps in the algorithm are

  1. Load data (here data is loaded from a file)
  2. Split the data into test and training sets
  3. Compute the distance from a test instance (a member of the test set) to all members of the training set
  4. Select the K nearest neighbours of the test instance
  5. Assign the test instance to the class most represented in these nearest neighbours
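Steps 3 to 5 can be sketched serially in a few lines of plain Python. This is not the serial implementation from [2], only a minimal illustration; the names manhattan and classify are hypothetical, the training rows are a handful of illustrative Iris-style measurements, and the distance is the sum of absolute differences.

```python
from collections import Counter


def manhattan(a, b):
    """Sum of absolute differences between two equal-length numeric sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(test_features, training, k):
    """Rank training instances by distance, then take a majority vote.

    `training` is a list of (features, label) pairs.
    """
    # Step 3: the distance to every training instance drives the sort order
    ranked = sorted(training, key=lambda item: manhattan(item[0], test_features))
    # Step 4: keep the labels of the k nearest neighbours
    votes = Counter(label for _, label in ranked[:k])
    # Step 5: the most represented class wins
    return votes.most_common(1)[0][0]


# Illustrative rows only, not the full data set
training = [
    ((5.0, 3.4, 1.6, 0.4), "Iris-setosa"),
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((6.4, 3.2, 4.5, 1.5), "Iris-versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "Iris-versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris-virginica"),
]
print(classify((5.0, 3.3, 1.5, 0.3), training, k=3))  # Iris-setosa
```

With k=3 the neighbours are two setosa rows and one versicolor row, so the vote goes to Iris-setosa.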

Running the algorithm against every member of the test set and determining the percentage of correct classifications gives an estimate of the accuracy. Since the algorithm randomly splits the data into a test and training set, the result will vary every time the test is run, but the accuracy was around 90% with occasional extremes of 85% and 97%.

Loading the data

Loading data from a file involves creating a spark context and using that to load the data. You can only have one spark context running at a time.

import sys
from pyspark import SparkContext

# Create spark context
sc = SparkContext(appName="PysparkKnearestNeigbours")
# Read in lines from file.
records = sc.textFile(sys.argv[1])

This creates an RDD (Resilient Distributed Dataset) holding all data records. For this exploratory exercise the first (header) row was removed. Since RDDs are not iterable, they must be transformed into lists using collect(), for example:

recordlist = records.collect()

The number of nearest neighbours and the number of fields used in the distance calculation are similarly read from the command line

numNearestNeigbours = int(sys.argv[2])
numfieldsInDistance = int(sys.argv[3])

Splitting data into test and training sets

# Split data into test and training sets in ratio 1:2
testset,trainingset = records.randomSplit([1,2])

This does what it says: it splits the data randomly into test and training sets with about one third in the test set.
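The word "about" matters here. As I understand it, a weighted random split assigns each record independently, so the set sizes are only approximately in the requested ratio. The plain-Python sketch below illustrates that idea; it is not Spark's implementation, and random_split is a hypothetical name.

```python
import random


def random_split(records, weights, seed=None):
    """Assign each record independently to one bucket, with probability
    proportional to that bucket's weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    buckets = [[] for _ in weights]
    for record in records:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                buckets[index].append(record)
                break
        else:
            # Guard against floating point edge cases
            buckets[-1].append(record)
    return buckets


# 150 stand-in records split in ratio 1:2; the sizes vary with the seed
testset, trainingset = random_split(range(150), [1, 2], seed=42)
print(len(testset), len(trainingset))
```

Because every record is assigned by its own random draw, two runs with different seeds will usually give test sets of slightly different sizes, which is one reason the measured accuracy varies from run to run.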

The distance function

The distance function is an integral part of the algorithm. Various distance functions can be used but here, instead of the standard Euclidean distance, the sum of the absolute values of the differences in each component (the Manhattan distance) was used in order to minimise numerical problems.

d(x,y) = sum( |xi-yi|)
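The pipeline later in this post calls a function distanceAbs(training, test, numfields) whose body is not shown. A minimal sketch of what such a function might look like follows; it assumes each record is a semicolon-delimited string with the class name last, as in the sample record quoted further down, and it is a guess at the shape of the function rather than the code actually used.

```python
def distanceAbs(training, test, numfields):
    """Sum of absolute differences over the first `numfields` fields of two
    semicolon-delimited records; trailing fields such as the class name are ignored."""
    training_values = [float(v) for v in training.split(";")[:numfields]]
    test_values = [float(v) for v in test.split(";")[:numfields]]
    return sum(abs(a - b) for a, b in zip(training_values, test_values))


# Two illustrative records; the result is roughly 0.6, up to floating point rounding
print(distanceAbs("5;3.4;1.6;0.4;Iris-setosa", "5.1;3.5;1.4;0.2;Iris-setosa", 4))
```

Passing a smaller numfields restricts the distance to the leading measurements, which is presumably why the number of fields is read from the command line.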

Finding the nearest neighbours

Finding the nearest neighbours ran into the restriction that only one RDD can be transformed at a time. Overcoming this involved creating an RDD comprising all pairs (training instance, test instance) using cartesian().

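# Each pair is (training instance, test instance); indexing is used because Python 3 rejects tuple-unpacking lambdas
nearestNeigbours = trainingset.cartesian(testinstance) \
.map(lambda pair: (pair[0], distanceAbs(pair[0], pair[1], numfields))) \
.sortBy(lambda pair: pair[1]) \
.take(numNearestNeigbours)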

cartesian() creates an RDD with the required K-V pairs. The map statement turns each into a pair (training instance, distance to test instance). The sortBy() method then sorts these pairs by value (the distance) in ascending order, and take(..) crops the result to the desired number of nearest neighbours.

Assigning the test instance to a class

The code below should be self-explanatory. It transforms the neighbours into an RDD of class names. To get the assigned class, convert neighbourNames to a list as above and take the first element.

# (kv) pair is typically (u'5;3.4;1.6;0.4;Iris-setosa', 0.08000000000000014)
# training = u'5;3.4;1.6;0.4;Iris-setosa'
# training.split(";")[-1] = Iris-setosa
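# Assumes nearestNeigbours is the list returned by take(), so it is parallelised before mapping
neighbourNames = sc.parallelize(nearestNeigbours).map(lambda pair: pair[0].split(";")[-1])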

Check the accuracy by looping over the test set, comparing the class assigned to each test instance from the training set with its actual class in the test set.
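That accuracy check might look something like the sketch below. The function name accuracy and the sample records are illustrative assumptions; the only thing taken from the post is that the actual class is the last semicolon-delimited field of each record.

```python
def accuracy(assigned, actual):
    """Percentage of test instances whose assigned class equals the actual class."""
    if len(assigned) != len(actual):
        raise ValueError("assigned and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(1 for a, b in zip(assigned, actual) if a == b)
    return 100.0 * correct / len(actual)


# Illustrative test records; the actual class is the last field of each
test_records = [
    "5.1;3.5;1.4;0.2;Iris-setosa",
    "6.4;3.2;4.5;1.5;Iris-versicolor",
    "6.3;3.3;6.0;2.5;Iris-virginica",
    "5.9;3.0;5.1;1.8;Iris-virginica",
]
actual = [record.split(";")[-1] for record in test_records]
# Hypothetical classifier output with one mistake in the last position
assigned = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-versicolor"]
print(accuracy(assigned, actual))  # 75.0
```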

Wrapping up

The steps needed to implement the KNN classification algorithm have been outlined. The two major gotchas encountered were being able to run only one Spark context at a time and being unable to use nested RDDs. The first problem was not a major one as the Python methods have access to the global scope. The second was overcome using the cartesian() method, which may prove expensive for very large training sets.

The algorithm worked well on the Iris dataset, but might not work so well on less well ordered sets. One improvement would be to deal with a skewed dataset (where one class dominates) by weighting the data accordingly.

  1. Some notes on configuring Pyspark to run code from Eclipse or from the command line