Posts Tagged ‘programming’

Make sure your continuous testing is continuous

One of the key features of the Aristotle Metadata Registry is its extensive test suite. Every time code is checked in, the test suites are run, and about 20 minutes later I get a notification saying everything is fine… or so it should be.

I recently made a small change to the test suite that altered no application code and only changed some of the reporting. This shouldn’t have affected how the tests were run, so they should have completed without problems – but that wasn’t the case.

After a short investigation, I discovered that a library used in the Aristotle admin interface had changed in a big way. Unfortunately, I haven’t been able to work on Aristotle as frequently as I’d like over the past few months, so this had gone completely unnoticed. Since the test environment is rebuilt every time the tests are run, it was using the most recent version, while my code depended on an earlier one.

Since Aristotle is still in beta, the result wasn’t disastrous, but it still highlights (for me at least) the risk of relying on a green tick from the test suite to say everything is alright – the tests might pass at that point in time, but that is prone to change.

So if you have to put down a project for a few weeks, or longer, make sure to nudge your code periodically, just to make sure everything is still running ok.

As for how it was fixed, a short alteration to the requirements file got the tests passing again, and a newer version that incorporates the updated library will be coming shortly.
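The fix is worth spelling out: pinning versions in the requirements file means a rebuilt test environment installs the library versions the code was developed against, rather than whatever is newest. A sketch of the idea (the package names here are illustrative, not Aristotle’s actual dependencies):

```
# requirements.txt – pin to known-good versions so a CI rebuild
# can't silently pull in a breaking release.
some-admin-library==1.6.2    # hypothetical package: exact pin
Django>=1.6,<1.7             # allow patch releases only
```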

Request for comments/volunteers for the Aristotle Metadata Registry

This is a request for comments and volunteers for an open source ISO 11179 metadata registry I have been working on called the Aristotle Metadata Registry (Aristotle-MDR). Aristotle-MDR is a Django/Python application that provides an authoring environment for a wide variety of 11179-compliant metadata objects with a focus on being multilingual. As such, I’m hoping to raise interest among bug checkers, translators, experienced HTML and Python programmers, and data modelers for mapping ISO 11179 to DDI3.2 (and potentially other formats).

For the eager:


Aristotle-MDR is based on the Australian Institute of Health and Welfare’s METeOR Registry, an ISO 11179 compliant authoring tool that manages several thousand metadata items for tracking health, community services, hospital and primary care statistics. I have undertaken the Aristotle-MDR project to build upon the ideas behind METeOR, and extend it to improve compliance with 11179, but also to allow for access and discovery using other standards, including DDI and GSIM.

Aristotle-MDR is built on a number of existing open source frameworks, including Django, Haystack, Bootstrap and jQuery, which allows it to easily scale from mobile to desktop on the client side, and from small shared hosting to full-scale enterprise environments on the server side. Alongside the in-built authoring suite is the Haystack search platform, which allows for a range of searching solutions, from enterprise search engines such as Solr or Elasticsearch down to smaller-scale search engines.

The goal of the Aristotle-MDR is to conform to the ISO/IEC 11179 standard as closely as possible, so while it has a limited range of metadata objects, much like the 11179 standard it allows for the easy extension and inclusion of additional items. Among those already available are extensions for:

Information on how to create custom objects can be found in the documentation:

Due to the wide variety of needs for users to access information, there is a download extension API that allows for the creation of a wide variety of download formats. Included is the ability to generate PDF versions of content from simple HTML templates, but an additional module allows for the creation of DDI3.2 (at the moment this supports a small number of objects only):

As mentioned, this is a call for comments and volunteers. First and foremost I’d appreciate as much help as possible with my mapping of 11179 objects in DDI3.2 (or earlier versions), but also with the translations for the user interface – which is currently available in English and Swedish (thanks to Olof Olsson). Partial translations into other languages are available thanks to translations in the Django source code, but additional translations around technical terms would be appreciated. More information on how to contribute to translating is available on the wiki:

To aid with this I’ve added a few blank translation files in common languages. Once the repository is forked, it should be relatively straightforward to edit these in GitHub and send a pull request back without having to pull down the entire codebase. These are listed by ISO 639-1 code, and if you don’t see your own listed, let me know and I can quickly pop a boilerplate translation file in.

If you find bugs or identify areas of work, feel free to raise them either by emailing me or by raising a bug on GitHub:

The public release of “A Case Against the Skip Statement”

A few years ago I wrote a paper titled “A Case Against the Skip Statement” on the logic construction of questionnaires that was awarded second place in the 2012 Young Statisticians Awards of the International Association of Official Statistics.

It went through two or three rounds of review over the course of a year, but due to shifting organisational aims, I was never able to find the time to polish it to the point of publication before changing jobs. So for the past few years I have quietly emailed it around, received some positive feedback, and had a few requests to have it published so it could be cited. I have referred back to it in conferences and other papers, but never formally cited it myself. I have also used this article as a reason why the study of ‘classical’ articles in computer science is still important, for the simple fact that while Dijkstra’s “Go To Statement Considered Harmful” is dated in traditional computer science, its methods and its mathematical and logical reasoning can still be useful, as seen in the comparison of programming languages and the logic of questionnaires.

As a compromise to those requests, I have released the full text online, with references and a ready-to-use BibTeX citation for those who are interested. The abstract follows the BibTeX reference:

@misc{Spencer2012,
    title = {A Case Against the Skip Statement},
    author = {Samuel Spencer},
    year = 2012,
    howpublished = {\url{}},
    note = {[Date downloaded]}
}

or using BibLaTeX:

@online{Spencer2012,
    author = {Samuel Spencer},
    title = {A Case Against the Skip Statement},
    year = 2012,
    url = {},
    urldate = {[Date downloaded]}
}

With statistical agencies facing shrinking budgets and a desire to support evidence-based policy in a rapidly changing world, statistical surveys must become more agile. One possible way to improve productivity and responsiveness is through the automation of questionnaire design, reducing the time necessary to produce complex and valid questionnaires. However, despite computer enhancements to many facets of survey research, questionnaire logic is often managed using templates that are interpreted by specialised staff, reducing efficiency. It must then be asked why, in spite of such benefits, is automation so difficult?

This paper suggests that the stalling point of further automation within questionnaire design is the ‘skip statement’. An artifact of paper questionnaires, skip statements are still used in the specification of computer-aided instruments, complicating the understanding of questionnaires and impeding their transition to computer systems. By examining questionnaire logic in isolation we can analyse the structural similarity to computer programming and examine the applicability of hierarchical patterns described in the structured programming theorem, laying a foundation for more structured patterns in questionnaire logic, which in time will help realise the benefits of automation.

“FingerTabs” – Horizontal Tabs with Horizontal Text in PyQt

On the advice of someone far more experienced in user interfaces, I was given some feedback on Canard (a questionnaire specification editor) and was pointed in the direction of FingerTabs. Although it’s not a widely used term (unless you are an archer), I couldn’t find any other term for vertically stacked tabs (in PyQt, west- or east-positioned tabs) with horizontal labels. FingerTabs are called that because they look like a bunch of long little fingers, although this visual metaphor breaks down if you have more than 5 tabs, or a Lovecraftian imagination. For example:

Normal (or top aligned) tabs.

Left (or west) aligned tabs with PyQt default text orientation.

Left aligned tabs with normal (or horizontal) oriented text.

If you want to go down the path of stacked tabs, the last option is probably the best, as it is much easier to read and you can fit more tabs into the vertical space with little loss of horizontal space, as the examples above illustrate. Interestingly enough, getting the last one is quite easy, although not well publicised.

Changing from the default to the last example is just a matter of extending the QTabBar of the QTabWidget and overriding the default paintEvent and sizeHint. This allows you to override the original text orientation and insert the text in a more readable fashion. The difficult bit was determining how to reuse the default tab styling (lines 10 and 17 in the code below).
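The original gist isn’t reproduced here, but a rough sketch of the approach (assuming PyQt4, which Canard used at the time – class and method structure are mine, not the original 38 lines) looks something like this:

```python
from PyQt4 import QtCore, QtGui

class FingerTabBar(QtGui.QTabBar):
    def tabSizeHint(self, index):
        size = QtGui.QTabBar.tabSizeHint(self, index)
        size.transpose()  # west tabs: trade height for width
        return size

    def paintEvent(self, event):
        painter = QtGui.QStylePainter(self)
        option = QtGui.QStyleOptionTab()
        for index in range(self.count()):
            self.initStyleOption(option, index)
            # Reuse the default tab styling for the tab shape...
            painter.drawControl(QtGui.QStyle.CE_TabBarTabShape, option)
            # ...then paint the label horizontally ourselves.
            painter.drawText(self.tabRect(index),
                             QtCore.Qt.AlignCenter | QtCore.Qt.TextDontClip,
                             self.tabText(index))

class FingerTabWidget(QtGui.QTabWidget):
    """A QTabWidget with west-aligned tabs and horizontal labels."""
    def __init__(self, parent=None):
        QtGui.QTabWidget.__init__(self, parent)
        self.setTabBar(FingerTabBar(self))
        self.setTabPosition(QtGui.QTabWidget.West)
```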

For what it’s worth, the original 38 lines of code took about 4 hours to write – a staggering rate of 1 line of code every 6 minutes (and 20 seconds).

Thanks go to two threads on StackOverflow, where the first answer got me close enough to implement the above:

Sing a song of software, bubbles full of lies; 4 and 20 years of stocks audio-lised with Py

I’ve recently been toying with the idea of using music as a format for exploratory data analysis. While the use of sound to monitor data isn’t new, it’s still relatively uncommon. As I occasionally find myself trying to make sense of large datasets, finding a way to quickly analyse them to locate the points of interest can be quite tedious. So I thought about ways someone with no musical skill could generate sound from data, and produce something relatively melodic and useful for highlighting patterns and anomalies in the data.

Sing a song of software,

To test this out I put together a little tune that covers the past 24 years of stock information from Microsoft (piano), Apple (clarinet) and Google (xylophone). The pitch is proportional to the price of the stock, with low tones being low prices and high tones high prices. The volume of each instrument is proportional to the volume of sales over the period, so a quiet sound is a low-volume trading day, while a loud note is a period of higher trading volume.

There are two versions of the music available:

A shorter 2 minute, up-tempo version using weekly stock prices: OGG, Midi – This one is short and to the point, but some of the nuances, like big daily trade spikes are missed.

A longer, 17 minute version using daily prices: OGG, Midi – This one is a little monotonous at the start, but you can hear much more clearly how Apple comes from a tiny instrument in the background to a larger force. It also lets you hear some of Apple’s big trading days.

Bubbles full of lies;

A few things to listen out for:

  • Early on, listen for Microsoft’s speedy ascent during the 2000s tech boom, and an even quicker decline. (About 1:00 in on the quicker version)
  • Apple has, for a long time, a consistently low trade volume; however, occasionally you will hear loud notes starting from the early 2000s. (About x minutes in.) These are peaks of stock sales, probably around MacWorld and iPod/iPhone/iPad announcements.
  • After about 2005, you can hear Apple and Google slowly rise in volume and stock price, while Microsoft remains in a consistent range throughout the same period. (After 2:00 in the short version)

4 and 20 years of stocks,

The data that all of this was pulled from was the historical stock price datasets available on Google Finance. Why 24 years’ worth? Because it fit the theme of the nursery rhyme I was trying to mimic. It’s pretty touch and go as to what data you can download from Google Finance, but to be fair, from my understanding this is an issue with the exchanges rather than with Google.

Audio-lised with Py(thon).

So the nitty gritty on how it works:

It’s a Python script that loops through a set of files of output data from Google Finance and, using midiutil, creates a MIDI file. Each day’s (or week’s) datapoint is weighted so the values remain within a specific range for a specific instrument, and the volumes are adjusted so that each instrument can be detected. Without either of these it really is quite a mish-mash of sound.
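The weighting step can be sketched in a few lines of plain Python. This is not the original script, just an illustration of the idea: map each price linearly into a MIDI pitch range for the instrument, and each period’s trade volume into a note velocity (the pitch ranges and velocity floor are assumptions):

```python
def to_pitch(price, lo, hi, pitch_lo=48, pitch_hi=84):
    """Map a price into an instrument's MIDI pitch range (C3-C6 here)."""
    if hi == lo:
        return (pitch_lo + pitch_hi) // 2
    return round(pitch_lo + (price - lo) / float(hi - lo) * (pitch_hi - pitch_lo))

def to_velocity(volume, max_volume, floor=20, ceiling=127):
    """Map trade volume to note velocity, keeping quiet days audible."""
    return round(floor + (volume / float(max_volume)) * (ceiling - floor))

# Each (pitch, velocity) pair would then be passed to something like
# midiutil's MIDIFile.addNote on the instrument's track.
prices = [21.0, 35.5, 50.0]
pitches = [to_pitch(p, min(prices), max(prices)) for p in prices]
```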

This output MIDI file is then run through Timidity++ to create an Ogg Vorbis file. Converting to Ogg is only necessary for consistency, but both the MIDI and the Ogg are available.
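Assuming Timidity++ is installed, the conversion is a one-liner (the filenames here are illustrative):

```shell
# -Ov selects Ogg Vorbis output; -o names the output file.
timidity stocks.mid -Ov -o stocks.ogg
```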

Future work and ideas

Well, the goal is to be able to use a technique like this to listen to large multivariate datasets that have either a time dimension or a continuous dependent variable (heights, weights, etc.). As long as one dimension has values that are relatively evenly and closely distributed, with few overlaps and a wide enough spread, it should be possible to ‘graph’ almost any dataset meaningfully as audio.

How you can and why you should learn to program.

Often people ask me how I learned to program and why I did. The answer is simple: lots of practice, because it’s a useful skill – and it also helps that I enjoy it.

Not everyone will find programming fun or interesting; however, at some point almost everyone will come up against a problem that computers were made to solve. Contrary to popular belief, computers are dumb – at least in the sense that they can only do what they are told. What they can do is perform these dumb things very, very quickly. So much so that they can fool you into believing they are smart – more than smart, magic even. In fact, if you are even a mediocre programmer, people can become convinced you are a magician.

So why should you learn to program?

Mostly, because your time is valuable. Not to me, but to you. Unlike a computer’s, your time is finite, and if you can make a machine that gives you more time to do something, isn’t that in your interest? Even if it isn’t at work, even if it’s just sorting your taxes or writing a script to check your email for you, there are plenty of small, repetitive tasks that you probably do that a machine can do quicker. If you enjoy doing repetitive tasks, then there isn’t much I can do for you. But if you want to spend more time understanding why you do these things, then read on…

Where do you begin?

Well, I think there are 3 programs everyone must be able to write, because if you can write these 3 programs and adapt them to your needs, you can do most of the big, boring tasks that will come your way.

There are 3 programs you need to learn to start to become a programmer, and as I explain them I’ll show you a brief example to edit, play with, and ultimately understand. These examples are written in Python, a free programming language with a very user-friendly syntax.

Hello, World!

“Hello world” is traditionally the first program many new programmers will write. It is simple: when the program is run, the computer prints “Hello, World!”. In essence, this simple program introduces programming syntax and demonstrates how to display text to a user.

print "Hello, World!"

There isn’t much to this, but it’s a starting point. It teaches you some basic syntax, and with a lot of languages understanding syntax is important – computers don’t speak English, and to make them useful you need to learn to talk to them, more than the other way around.

Simple user interaction

The second is a simple string manipulator. The goal is to create a simple, persistent user interface, with some error checking, that fulfils a task. Here we see an example that performs actions on a given string based on a command given to it.

while True:
    input = raw_input("> ")

    try:
        cmd, text = input.split(":", 1)
        if cmd == "uc":
            print text.upper()
        elif cmd == "lc":
            print text.lower()
        elif cmd == "rev":
            print text[::-1]
        elif cmd == "quit":
            print "Bye"
            break
        else:
            print "Command not recognised"
    except:
        print "Syntax error: enter a command, a colon (:), then a string"

Firstly, we start the loop and set it to never stop looping. As long as the user wants to play with strings, this program will keep going. Next we ask for some input from the user.
Now things get a little more complex: first we try to split the input around a colon, into a command and the text. If the user doesn’t enter a colon, we throw an error and give them some help text (after the line that says “except:”).
If they do enter a command and text, we set the text to upper case or lower case, or reverse it. If they tell us to “quit:” we quit by breaking out of the loop (the break command); otherwise we tell them we didn’t recognise the command.

It’s not perfect, but it gives us an understanding of errors, handling user input and basic user interaction – not bad for fewer than 20 lines of code.

File manipulation

The last and most important program is a simple file manipulation tool. Again, what the tool does to the file is irrelevant – it might merge files, look for spelling errors, count lines, anything. Perhaps we are looking for entries in a large diary file that start with numbers (like dates), and only want to view these.

file = open('test')
for line in file:
    if line[0].isdigit():
        print line
file.close()

Here we open the file, and then line by line we search through it. When we find a line whose first character is a digit, we print the line (line[0] essentially means the 0th character from the start – it’s complicated, but it’s how almost every language deals with indexing into lists). Lastly, we clean everything up by closing the file.

Again, by no means the best implementation, but easy enough to read and alter. This time, in 5 lines we have a simple script that could help us find our tax information, search our diary, look up phone numbers, or anything like this.

I’ve done this, what now?

Do whatever task it is that needs doing. Odds are, having read these short code snippets, you can get an idea of what can be done. It’s just a matter of getting out and doing it. You may not be the next Bill Gates/Mark Zuckerberg/whoever, but if you learn to paint a wall you won’t be the next Da Vinci either. Learn how to do what you need to do, and keep plugging away. Programming is about pulling together pieces of logic to help us do simple tasks easily and reproducibly. So think lazy: find all the tasks that you can automate, and get something else to do them for you.

Virgil UI 0.0.1 Beta now live!!

After months of development, testing, coding and crying…. Virgil UI version 0.0.1b is now available for public beta testing.

This release sees the first public testing of a fully functional, classification- and codelist-specific editor based on and supporting the DDI Lifecycle XML format (DLML).

Features in this release of Virgil include:

Known issues in the 0.0.1b release that will be fixed in a future release:

  • Codes or languages cannot be removed once added.
  • New CodeSchemes cannot be added manually, only when importing from CSV.

Also new is an updated version of the standalone CSV to DDI converter tool, which fixes some outstanding bugs in multilingual imports and corrects a few mistakes when writing the DLML.

For more information on Virgil-UI there is a list of blog posts outlining the development process, or you can check out the Google Code page, view all the downloads, or submit bugs.

Virgil UI – Converting from legacy to CSV to DDI

While my main machine has been out of action, I’ve devoted more time to one of the first use cases that prompted the development of Virgil – transforming legacy CSVs into DDI 3.1.

One of the main features of Virgil is the ability to help users transition from legacy systems using non-standard formats to using DDI as the main data language for managing codes, categories and classifications. Unfortunately, there is no way for any one system to support every format for classifications; however, by targeting a lowest common denominator we can process the bulk of the work. In this case, the lowest common denominator is the CSV.

If a user or developer of a legacy system is able to transform their legacy format into one of several different CSV formats supported by Virgil, then they will be able to import, at the least the basic structure and metadata of their codes and classifications into DDI. With most of the code for the conversion tools done, I’ve begun putting together the wizard interface for Virgil UI, which will also form part of a standalone conversion tool. Within the next few weeks the standalone conversion tool will be ready for release, and made available as open-source with the supporting code.

Below is a list of questions that users and developers may have about how to prepare CSVs for conversion to DDI, covering the convertible metadata, preferred CSV structure and developer support. Although the CSV structuring options for conversion are restricted, if there is a need to expand the formats or metadata available for conversion, make your needs known and this can be incorporated into future development.

What metadata will be supported?

A user will be able to import the code values and the hierarchy of a classification, as well as labels and descriptions of categories. Labels and descriptions can be multilingual, and multiple languages per item are able to be imported.

Will I have to use Virgil-UI to use this converter?

No. This converter will be available as a wizard within Virgil, but the UI for the wizard will also be available as a standalone program for users who need to convert from a legacy system to DDI. Additionally, as the code will be entirely open-sourced, the Python module that performs the transformations will be able to be imported into any other piece of Python software. Lastly, since the converter module is written entirely using modules from the Python standard library, it will be usable by programs in languages that have compatible Python implementations – such as Java using Jython or .Net using IronPython.

In summary there will be at least four ways developers and users will be able to implement the Virgil CSV-DDI converter tools.

What ‘formats’ of CSV will be supported?

CSVs are generally without structure and are just a basic way of storing tabular data, but conversion is possible using a simple combination of the following code and category column forms within a CSV. When picking a structure, it is important that the ‘code’ columns come before any ‘category’ columns. However, any combination of a code and category column format, if created correctly, should convert from CSV to DDI without trouble.

Column options for importing codes and their hierarchy

Referential CSV Codelist
Order: Code , Parent
Notes: This can be reversed to go Parent, Code. If a parent is blank, it is assumed that this node is a top-level code in a CodeScheme.

A, ,
B, ,
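As a rough illustration of how the referential form could be read (a sketch of the idea, not Virgil’s actual converter code):

```python
import csv

def parse_referential(rows):
    """Read Code,Parent rows into a {code: parent} map plus a list of
    top-level codes (those with a blank parent)."""
    parents, top_level = {}, []
    for row in rows:
        code = row[0].strip()
        parent = row[1].strip() if len(row) > 1 else ""
        if not code:
            continue
        if parent:
            parents[code] = parent
        else:
            top_level.append(code)
    return parents, top_level

# 'A' is a top-level code; 'B' and 'C' sit beneath it.
rows = csv.reader(["A,", "B,A", "C,A"])
parents, top = parse_referential(rows)
```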

Semi-structured CSV Codelist

Order: (Empty,)*Code,
Notes: If the code is the first entry in a row then it is considered a top code in the CodeScheme. Any children of a code should be indented by only one column. The columns for labels and descriptions start in different columns depending on the level of the hierarchy.


Aligned Semi-structured CSV Codelist

Order: (Empty,)*Code,(Empty,)*
Notes: If the code is the first entry in a row then it is considered a top code in the CodeScheme. Any children of a code should be indented by only one column. All nodes should be padded so that the columns for labels and descriptions start in the same columns.

A, ,
B, ,

Column options for importing multilingual categories

Prefix-embedded Language

Order: (Label,Description)+
Notes: As many languages as needed can be repeated within the column as long as they have unique language codes.

Example: en-au;Chocolate,en-au;Confectionery based on the seed of the cacao plant,fr;Chocolat,fr; Confiseries à base de la graine de la plante de cacao
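Splitting prefix-embedded cells is straightforward; a minimal sketch (not the converter’s actual code) of turning alternating label/description cells into per-language pairs:

```python
def parse_prefixed(cells):
    """Turn alternating 'lang;Label', 'lang;Description' cells into
    a {lang: (label, description)} mapping."""
    translations = {}
    for label_cell, desc_cell in zip(cells[::2], cells[1::2]):
        lang, label = label_cell.split(";", 1)
        _, description = desc_cell.split(";", 1)
        translations[lang.strip()] = (label.strip(), description.strip())
    return translations

cells = ["en-au;Chocolate",
         "en-au;Confectionery based on the seed of the cacao plant",
         "fr;Chocolat",
         "fr; Confiseries à base de la graine de la plante de cacao"]
result = parse_prefixed(cells)
```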

Pre-defined Column

Order: (language,Label,Description)+
Notes: As many languages as needed can be repeated within the column as long as they have unique language codes.

Example: en-au,Strawberries,Tasty fruit that isn't a true berry,fr,Fraises,Fruit savoureux qui n'est pas une vraie baie


Single Language

Order: (Label,Description)
Notes: When only importing a single language that isn’t expressed in the CSV a default language will need to be given when invoking the converter.

Example: Vegemite,A yeast extract spread only edible by people from Australia. No other translations exist because no one else can stand it.

Can this tool support tab-separated files?

Yes. In the wizard, users will be given the opportunity to select from a range of delimiter options or enter their own delimiting character. When using this module in other code, it will also support any delimiter as long as it is specified when calling the module.

How should a developer write CSV for the converter?

With no agreed-upon standard for CSVs, it’s hard for developers to write ‘standard’ CSVs. To simplify development and be as lenient as possible, the Virgil CSV-DDI converter uses the Python csv module. If you are writing your own CSV writer, I’d suggest testing it against this module to make sure it works.

In a nutshell though – leading and trailing whitespace is trimmed, and any entry that contains a comma (or the specified delimiter) should be quoted with double (“) or single (‘) quote marks.
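Both points can be sketched with the csv module directly (the helper function here is illustrative, not the converter’s real API):

```python
import csv

def read_rows(lines, delimiter=","):
    """Parse delimited lines leniently: quoted cells may contain the
    delimiter, and stray whitespace around cells is trimmed."""
    reader = csv.reader(lines, delimiter=delimiter, quotechar='"')
    return [[cell.strip() for cell in row] for row in reader]

# A quoted cell keeps its internal comma intact...
rows = read_rows(['A,"Confectionery, chocolate based"'])
# ...and tab-separated files just need a different delimiter.
tab_rows = read_rows(["A\tChocolate"], delimiter="\t")
```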

What will the wizard and standalone converter look like?

Something like this:

PyQt mockup of the CSV/DDI converter.

How crowdsourcing will drive open data

Over the past year there has been a global shift in data policies within governments worldwide – opening up data that once cost money or was hidden from public view. Some government organisations have gone so far as to put together incentives to encourage people to make use of this newly freed data. Australian examples include state initiatives such as Victoria’s AppMyState and New South Wales’ apps4NSW, and the Federal Government’s Government 2.0 Taskforce’s Mashup Australia. What’s important to realise, though, is that these promotions should be seen more as a means to an end, rather than an end in and of themselves.

Data collectors and maintainers need to exhibit a relatively low level of bias and high levels of independence. If any data collected is called into question, the validity of the entire archive, and of future collections, can also be questioned. The reason for this stems from the logical (albeit poorly grounded) assumption that if one collection has been tampered with, then there are grounds to suggest that all other collections may suffer the same bias.

It is for this reason that the data exposed from these sources may be little more than aggregated and weighted data, with explanatory metadata and appropriate notes on the methodology of the study – no exposition of the links between data, and no overtly controversial or political hypotheses attempting to explain the data. Anything more may suggest bias or influence: for example, a statistical agency may present statistics regarding life expectancy in collection regions, and in another study may present statistics around polluting industry in the same regions, but would be remiss to begin drawing correlations between them. The same agency may hold data around economic indicators, but would never try to correlate them with political policy.

So while the agency may hold this data and the expertise in explaining the data, they would be reluctant to go further than collection and dissemination. However, the public is unencumbered by such ideals.

Contests such as those listed above give public agencies the opportunity to drive public use of public data. This again drives straight to the goals of data agencies: public use of data reinforces their relevance. It gives the public the opportunity to ask potentially controversial questions, backed by official data, while giving the data agencies recognition for their continued importance in society.