Archive for the ‘ Data ’ Category

Upcoming improvements to the DDI Website

EDDI has generated a lot of discussion around DDI, and the area I have been most interested in, and have been guiding discussions around, is how to improve the DDI Alliance website. As the Web Maintenance Chair, it would be great to rest on my laurels, declare the website perfect and leave it at that.

However, it isn’t and I won’t.

So throughout EDDI, I have been compiling a list of gripes and grumbles (as well as positive remarks and suggestions) regarding the website. In the new year I will be sending out a short survey to DDI Users looking at how people use the website, their issues (both positive and negative) and how they think the DDI Website should look in the future.

One main issue that will definitely be addressed as a part of this exercise is the lack of positive examples of DDI available on the web. This is because the survey itself, and all its metadata, will be made available for people to download and study. This will not be an easy job, but I look forward to contacting researchers and developers across the DDI community to help piece this together and make improvements for everyone.

Sing a song of software, bubbles full of lies; 4 and 20 years of stocks audio-lised with Py

I’ve recently been toying with the idea of using music as a form of exploratory data analysis. While the use of sound to monitor data isn’t new, it’s still relatively uncommon. I occasionally find myself trying to make sense of large data sets, and finding a way to quickly analyse them and pick out the points of interest can be quite tedious. So I thought about ways someone with no musical skill could generate sound from data and produce something relatively melodic that is useful for highlighting patterns and anomalies in the data.

Sing a song of software,

To test this out I put together a little tune that covers the past 24 years of stock information from Microsoft (Piano), Apple (Clarinet) and Google (Xylophone). The pitch is proportional to the price of the stock, with low tones being low prices and high tones being high prices. The volume of each instrument is proportional to the volume of sales over the period, so a quiet note is a low-volume day, while a loud note is a period of higher trading volume.

There are two versions of the music available:

A shorter, 2-minute, up-tempo version using weekly stock prices: OGG, Midi – This one is short and to the point, but some of the nuances, like big daily trade spikes, are missed.

A longer, 17-minute version using daily prices: OGG, Midi – This one is a little monotonous at the start, but you can hear Apple grow from a tiny instrument in the background into a larger force much better. It also lets you hear some of Apple’s big trading days.

Bubbles full of lies;

A few things to listen out for:

  • Early on, listen for Microsoft’s speedy ascent during the 2000s tech boom, and an even quicker decline. (About 1:00 in on the quicker version.)
  • Apple has had a consistently low trade volume for a long time; however, from the early 2000s you will occasionally hear loud strikes. (About x minutes in.) These are peaks in stock sales, probably around MacWorld and iPod/Phone/Pad announcements.
  • After about 2005, you can hear Apple and Google slowly rise in volume and stock price, while Microsoft remains in a consistent range throughout the same period. (After 2:00 in the short version.)

4 and 20 years of stocks,

The data for all of this was pulled from the historical stock price data sets available on Google Finance. Why 24 years’ worth? Because it fit with the theme of the nursery rhyme I was trying to mimic. It’s pretty touch and go as to what data you can download from Google Finance, but to be fair, from my understanding this is an issue with the exchanges rather than with Google.

Audio-lised with Py(thon).

So the nitty gritty on how it works:

It’s a Python script that loops through a set of output files from Google Finance and uses midiutil to create a Midi file. Each day’s (or week’s) data point is weighted so the values remain within a specific range for a specific instrument, and the volumes are adjusted so that each instrument can be detected. Without either of these it really is quite a mish-mash of sound.
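The weighting step described above can be sketched roughly as follows. This is a minimal illustration rather than the original script: the function names, the pitch and velocity ranges and the one-note-per-beat layout are all assumptions, and only the final helper touches MIDIUtil (imported lazily so the scaling code runs without it).

```python
def scale(value, lo, hi, out_lo, out_hi):
    """Linearly rescale value from [lo, hi] onto the integer range [out_lo, out_hi]."""
    if hi == lo:
        return out_lo
    fraction = float(value - lo) / (hi - lo)
    return int(round(out_lo + fraction * (out_hi - out_lo)))


def rows_to_notes(rows, pitch_range=(48, 84), velocity_range=(40, 120)):
    """Turn (price, trade_volume) rows, oldest first, into MIDI-style notes.

    Returns (pitch, start_beat, duration, velocity) tuples. The ranges are
    hypothetical defaults chosen to keep an instrument audible and in tune.
    """
    prices = [price for price, _ in rows]
    volumes = [volume for _, volume in rows]
    notes = []
    for beat, (price, volume) in enumerate(rows):
        pitch = scale(price, min(prices), max(prices), *pitch_range)
        velocity = scale(volume, min(volumes), max(volumes), *velocity_range)
        notes.append((pitch, beat, 1, velocity))
    return notes


def write_midi(notes, path, program=0, tempo=240):
    """Write a single-instrument track with MIDIUtil (pip install MIDIUtil).

    program is a zero-based General MIDI number, e.g. 0 piano,
    71 clarinet, 13 xylophone.
    """
    from midiutil import MIDIFile  # lazy import: only needed when writing a file

    midi = MIDIFile(1)
    midi.addTempo(0, 0, tempo)
    midi.addProgramChange(0, 0, 0, program)
    for pitch, start, duration, velocity in notes:
        midi.addNote(0, 0, pitch, start, duration, velocity)
    with open(path, "wb") as handle:
        midi.writeFile(handle)
```

Calling rows_to_notes once per stock and writing each result with a different program number would give one instrument per company.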

This output Midi file is then run through Timidity++ to create an Ogg/Vorbis file. Converting to Ogg is only necessary for consistency, but both the Midi and the Ogg are available.
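For anyone wanting to script the conversion step, a small wrapper might look like the sketch below. It assumes a timidity binary on the PATH that was built with Ogg Vorbis support; the function names are made up for this example.

```python
import subprocess


def timidity_command(midi_path, ogg_path):
    """Build the TiMidity++ command line for a Midi to Ogg Vorbis conversion."""
    # -Ov selects Ogg Vorbis output mode, -o names the output file.
    return ["timidity", midi_path, "-Ov", "-o", ogg_path]


def convert(midi_path, ogg_path):
    """Run the conversion, raising CalledProcessError if timidity fails."""
    subprocess.check_call(timidity_command(midi_path, ogg_path))
```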

Future work and ideas

Well, the goal is to be able to use a technique like this to listen to large multivariate datasets that have either a time dimension or a continuous dependent variable (heights, weights, etc…). As long as one dimension has values that are relatively evenly and closely distributed, with few overlaps and a wide enough spread, it should be possible to ‘graph’ probably any dataset meaningfully as audio.

3 quick questions to identify children in financially risky families

  1. Do you live with mommy and daddy? 
  2. Does mommy or daddy smoke?
  3. Do you eat a lot of fruit or nuts at home?

The Australian Bureau of Statistics recently released data cubes on Household Expenditure, giving an insight into how Australians use their money. What is of interest is that they provide a breakdown of average weekly expenditure for a variety of products, tabulated against the number of financial stressors. The ABS has defined financial stressors as events such as a household being unable to pay bills or going without meals.

The reason this is interesting is that very few factors correlate as strikingly with the number of financial stresses within a household as those below:

Number of indicators of financial stress experienced

Risk factors                                                        0      1      2      3     4+
Tobacco products ($/week)                                        8.89  13.01  14.63  18.02  21.45
Newspapers ($/week)                                              3.17   2.80   2.64   1.47   1.48
Fruit and nuts ($/week)                                         14.06  12.81  11.41  10.62   7.98
One parent family with dependent children (%)                     1.9    4.7    8.3    9.8   19.3
All Renters (%)                                                  18.6   28.3   34.0   39.3   54.4
Main source of income – Government pensions and allowances (%)   17.8   22.1   25.2   28.3   52.1

Note that for newspaper and fruit and nut weekly expenditure, the correlation is negative.
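As a rough check on the direction and strength of these relationships, the three expenditure rows can be run through a plain Pearson correlation. This is only a sketch on the aggregate figures from the table: it treats the "4+" column as exactly 4, which is an assumption, and five points per series is far too few to say anything about significance.

```python
from math import sqrt

# Number of financial stress indicators; "4+" is treated as 4 (an assumption).
STRESSORS = [0, 1, 2, 3, 4]

# Average weekly expenditure ($/week), copied from the table above.
EXPENDITURE = {
    "Tobacco products": [8.89, 13.01, 14.63, 18.02, 21.45],
    "Newspapers": [3.17, 2.80, 2.64, 1.47, 1.48],
    "Fruit and nuts": [14.06, 12.81, 11.41, 10.62, 7.98],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


CORRELATIONS = {name: pearson(STRESSORS, values) for name, values in EXPENDITURE.items()}

for name in sorted(CORRELATIONS):
    print("%-18s r = %+.2f" % (name, CORRELATIONS[name]))
```

Tobacco comes out strongly positive, while newspapers and fruit and nuts come out strongly negative, matching the note above.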

The problem with some of these metrics, though, is that while they are all strongly correlated with financial stress, they may not all be obvious to children. For example, a child may not know about their parents’ income or whether they live in a rental property, and certain actions may be hidden from the child, such as a parent buying a newspaper on the way to work.

Other metrics, however, are more obvious for children to notice and report, such as who they live with, obvious activities of their parents (like smoking) and their own diets. This leads us to the three questions listed above:

  1. Do you live with mommy and daddy?
  2. Does mommy or daddy smoke?
  3. Do you eat a lot of fruit or nuts at home?

Now while these may not account for same-sex couples, a child in a two-parent same-sex household would, given sufficient prompting, probably indicate they had two parents. Furthermore, this is based on aggregate information, although there is a good chance unit records would back these correlations up. Lastly, this is looking at correlation for risk factors, and cannot be used to suggest causation. Together, however, these three questions can quickly give a strong indication of the risk of financial stress within a child’s household.

Why we’ve gotten better at chess and skateboarding


Chess game by niallkennedy, on Flickr

There was a recent post on the Freakonomics blog that mentions a study from the University at Buffalo that examines the relative strengths of top-ranking chess masters over the years to answer the question “are chess ratings over-inflated, or are we really getting better?”

The results of the study showed that chess ratings have not inflated, and we are in fact getting better at chess.

I am in two minds about this. At first glance it seems like a real “well-duh” moment in science, but after a little thought it actually comes across as an extremely interesting study.

It seems obvious that we as a society are advancing. Regardless of your perceptions of the “youth of today” or the “invasion of foreigners” (it always seems that conservatives think that young or alien people are just going to destroy society), it makes perfect sense that we are generally getting better at stuff.

The best example of this is the progression of skills in skateboarding. While chess has been around for hundreds of years, skateboarding started in the 1940s, but only got big in the 70s. This means people have been around long enough to see vast improvements in the sport. For the longest time most skateboarding tricks were flatland based, that is, they were either tricks one did while rolling, carving in pools, or handstands. It wasn’t until the early 80s that the ollie (or jump) became a phenomenon. Without the ollie, almost no modern skateboarding tricks are possible.

It took 40 years for skateboarders to learn how to jump! Nowadays, it only takes a solid afternoon of practice. This means that anyone who starts skating today can benefit from the knowledge gained over 20 years of study of the ollie.

The same should hold true for almost any discipline, especially chess. Unlike skateboarders, chess players are very meticulous about recording and researching their hobby. In fact Chess.com has records of games earlier than 1940, and chess books will often have games from the 1800s. So while it may take a lot of practice to advance in chess, usually much more than an afternoon, there is an exhaustive wealth of knowledge for newcomers to tap into if they wish to progress, providing the proverbial shoulders for them to stand upon to improve the field.

The reason this study is fascinating is that the above really is anecdote. Sure, we can see that we’ve improved at skateboarding or chess. This study takes the vast amount of data available for study in chess and demonstrates that this shared phenomenon of improvement is a reality. We as a society are advancing (at least at chess and skateboarding), and although there is no definite proof of causation, this provides support to our anecdotes.

This post also appears on Chess.com.

420 convert classifications everyday

With the recent release of the new Australian Standard Classification of Drugs of Concern from the ABS, there was the opportunity to field test the Virgil CSV to DDI converter with real data to see how it held up. Fortunately, the classification was released as an Excel data cube that conformed almost entirely with the structures that Virgil supports. After a little cleaning of the CSV, it ran through the converter with few issues at all. Incidentally, the most major error highlighted the massive oversight that the converter failed to add values for the codes! However, this has been corrected and changes have been pushed to the svn, and a new version of the Windows tool will be pushed out this weekend.

A screen shot of Virgil with the converted classification


Opening the newly created DDI file in the Virgil DDI CodeList Editor was another story and pointed out a few flaws with how it handles empty data. With the structure from the Excel file not containing descriptions for any category or any labels for the CodeScheme, there were a few small corrections made to accommodate freshly created DDI, but many of these problems will be ironed out by the time the CodeList editor is available for download.

While the converter hasn’t been fully integrated into the CodeList Editor, it will shortly be possible to create a single DDI file and import numerous CSV files to create a series of classificatory codelists in a single package. A practical and soon to be realised example would be the Australian Standard Classification of Drugs of Concern with the lists of drugs of concern, forms of drug and methods of consumption codelists all contained in a single machine processable DDI package.

For those who haven’t been able to download or run the converter, the output from this example is available for testing.

Virgil UI – CSV to DDI converter now available for Windows

The day is finally here – Virgil c2d is available for Windows. You can download the zip archive from Google Code. In future this will be the place where new versions of the tool are made available, and I am hoping that as people start using it and bugs do get noticed there will be activity, so be sure to check back often to see if changes are available.

For the time being though, download a copy of the beta, check out some of the example CSVs and learn about how the different CSV types look.

If you have issues getting the application to run, check the converter_ui.exe.log log file for any errors and be sure to raise a bug through the issue tracker. If there are issues getting a file to convert, check that the structure settings are correct, and check the line that the error dialog indicates may be causing the issue. If you are still unable to get the CSV to convert, raise an issue and attach the offending CSV file and I’ll see if the problem can be resolved.

When checking out the example CSVs the filenames give some hints to the structure of the data in them:

  • ss: semi-structured
  • mono: monolingual
  • pd: pre-defined language
  • pe: prefix embedded language

The other files have the following types:

  • anzsic 2006 – codes and titles.csv: Semi-structured, Monolingual
  • anzsic.csv: Semi-structured, Monolingual
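The filename hints can also be decoded programmatically. The helper below is a hypothetical convenience, not part of Virgil; it assumes the hints sit between the base name and the extension, separated by hyphens, as in anzsic.ss-pd.csv.

```python
# Abbreviations used in the example CSV filenames.
HINTS = {
    "ss": "semi-structured",
    "mono": "monolingual",
    "pd": "pre-defined language",
    "pe": "prefix embedded language",
}


def describe(filename):
    """Return the structure hints embedded in a name like 'anzsic.ss-pd.csv'.

    Files without a hint segment (e.g. 'anzsic.csv') give an empty list.
    """
    parts = filename.split(".")
    if len(parts) < 3:
        return []
    tokens = parts[-2].split("-")
    return [HINTS[token] for token in tokens if token in HINTS]
```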

 

Virgil UI – CSV Converter UI Files now up

Over the week I’ve been coding away, wrapping the CSV to DDI converter module with a nice user interface. Well, after a weekend of work it has a user interface; whether it is nice is in the eye of the beholder. As with the rest of the Virgil project, the Python code for this tool is available on Google Code. Unfortunately I haven’t had time to compile this into a Windows executable suitable for novice use, but interested parties are again welcome to download and test the tool from source.

For the curious, I’ve again recorded a demonstration and put it up on YouTube, which is embedded below:

Again there is no audio, but I’ve included a brief transcription below so people can get a better idea of what the demonstration is trying to illustrate:

  • Open the anzsic.csv file to briefly view the contents of the CSV holding the labels and some descriptions of categories in the 2006 Australia and New Zealand Standard Industrial Classification.
  • Execute the conversion tool, and load the anzsic.csv file.
  • Select the correct structure options for the CSV, as per the allowed structures described in a previous post.
  • Add a default language code and ID prefix for the DDI Instance and all codes and categories.
  • Demo the preview table, showing how the header row can be ignored.
  • Convert the file, in the background you can see debug text for each code encountered.
  • Open a folder to save, and confirm the folder is empty.
  • Open the newly created file.
  • Add some line breaks to the automatically created XML  and search for a term from the original CSV.

Hopefully by this time next week there will be a fully downloadable Windows executable available for people to try.

Monday Funday – Challenge: De-obfuscate some bad DDI

The solution is now available below

DDI can be a harsh mistress sometimes, and mistakes can sometimes be made when trying to use it. As a data format it is flexible enough to handle most situations, but this flexibility can sometimes be a shortcoming.

Below is a poorly written chunk of DDI that forms part of a survey instrument. The good news is it can be written in a much better way. The challenge is how to rewrite it:

<d:Sequence id="MainSequence">
    <d:ComputationItem id="CompItem1">
        <d:Code>
            <r:Code programmingLanguage="pseudoCode">SET X = X + 1</r:Code>
        </d:Code>
    </d:ComputationItem>
    <d:IfThenElse id="ifblock1">
        <d:IfCondition>
            <d:Code>
                <r:Code programmingLanguage="pseudoCode">X == Y</r:Code>
            </d:Code>
        </d:IfCondition>
        <d:ThenConstructReference>
            <r:ID>A_different_sequence</r:ID>
            <r:ID>MainSequence</r:ID>
        </d:ThenConstructReference>
    </d:IfThenElse>
</d:Sequence>

If you think you can correct this code, email a solution to theodore.therone at gmail.com. At the end of the week (when I wake up this Saturday AEST) I’ll select a correct solution at random and give away a $15 voucher for 5senses coffee.

If you need clarification on anything in the example code, post it in the comments and I’ll clear it up.

 


Unfortunately, there were no correct responses, so the voucher will go to the next challenge, but the answer is still available below.

Solution – this was a simple computer science riddle wrapped in a layer of DDI. It was a loop rewritten as a recursive if-branch. Rewritten as a loop it comes out as this:

<d:Loop id="MainLoop">
    <d:LoopWhile>
        <d:Code>
            <r:Code programmingLanguage="pseudoCode">X == Y</r:Code>
        </d:Code>
    </d:LoopWhile>
    <d:StepValue>
        <d:Code>
            <r:Code programmingLanguage="pseudoCode">SET X = X + 1</r:Code>
        </d:Code>
    </d:StepValue>
    <d:ControlConstructReference>
        <r:ID>A_different_sequence</r:ID>
    </d:ControlConstructReference>
</d:Loop>

A much cleaner solution!
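For anyone who finds the idea easier to follow outside of XML, the same riddle can be sketched in Python. This is an analogy only: the function names are made up, the visits list stands in for A_different_sequence, and the test is x < y rather than the X == Y of the DDI above so that the loop runs more than once.

```python
def recursive_form(x, y, visits=None):
    """A sequence that steps, tests, runs a body and then re-enters itself."""
    visits = [] if visits is None else visits
    x += 1                                   # the ComputationItem
    if x < y:                                # the IfCondition (hypothetical test)
        visits.append(x)                     # stands in for A_different_sequence
        return recursive_form(x, y, visits)  # the reference back to the sequence
    return visits


def loop_form(x, y):
    """The same behaviour written as an ordinary loop with a step value."""
    visits = []
    x += 1
    while x < y:          # the LoopWhile test
        visits.append(x)  # the loop body
        x += 1            # the StepValue
    return visits
```

Both forms visit the body for the same values, but the loop states the intent directly and does not rely on a sequence referring to itself.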

Virgil UI – Site is live and converter code is now available

Virgil UI is getting ready for release to the public. The first step is that it now has a Google Code project site, which is starting to fill up.

The first code to go up in the public repository is the utility code for the CSV to DDI conversion described in the last post. This includes the specialised CSV parser, library to create DDI 3.1 Codes and Categories and a sample command line interface to pull it all together.

For those who can’t wait for either a GUI tool or an executable command-line app, or just want to play with the code, feel free to grab the source. Keep in mind that this is very much pre-beta and is under active development, but if (or more likely when?) you come across a bug, be sure to report it through the issue tracker on the site. To help folks along, a few sample CSVs are included to give developers an idea of the format the converter consumes, but that isn’t an exhaustive list of all the possible combinations of code and category list types.

In lieu of actual usage documentation (which will be added during the week) below is a sample execution:

python2.7 ./converter_cli.py -i ./test_files/anzsic.ss-pd.csv -c SemiStructured -C PreDefinedColumn -o outfile.xml -d DDIInstance_ID --codeSchemeID=Test_codeSchemeID --categorySchemeID=Test_categorySchemeID
python2.7 ./converter_cli.py                - Needed to execute the script
-i ./test_files/anzsic.ss-pd.csv            - The CSV file to transform
-c SemiStructured                           - The CSV CodeList type (see previous blogpost for more info)
-C PreDefinedColumn                         - The CSV Category type (see previous blogpost for more info)
-o outfile.xml                              - File to save the DDI to, if blank output to console
-d DDIInstance_ID                           - The ID for the new parent DDIInstance of the resultant file
--codeSchemeID=Test_codeSchemeID            - ID for the CodeScheme that will hold the codes - is also part of the prefix for all DDI code IDs
--categorySchemeID=Test_categorySchemeID    - ID for the CategoryScheme that will hold the categories - is also part of the prefix for all DDI category IDs

Note: this needs Python 2.7 in order to use some of the more advanced XPath options in ElementTree that the DDI module relies on.
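To make the option list a little more concrete, here is how a command line with the same flags could be parsed using argparse. This is an illustrative reconstruction, not the code in converter_cli.py: the dest names, defaults and help strings are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser accepting the flags shown in the sample execution."""
    parser = argparse.ArgumentParser(description="CSV to DDI converter (illustrative sketch)")
    parser.add_argument("-i", dest="infile", required=True, help="CSV file to transform")
    parser.add_argument("-c", dest="codelist_type", help="CSV CodeList type, e.g. SemiStructured")
    parser.add_argument("-C", dest="category_type", help="CSV Category type, e.g. PreDefinedColumn")
    parser.add_argument("-o", dest="outfile", default=None, help="output file; console if omitted")
    parser.add_argument("-d", dest="instance_id", help="ID for the parent DDIInstance")
    parser.add_argument("--codeSchemeID", dest="code_scheme_id", help="ID for the CodeScheme")
    parser.add_argument("--categorySchemeID", dest="category_scheme_id", help="ID for the CategoryScheme")
    return parser


# Parse the same arguments as the sample execution above.
SAMPLE_ARGS = build_parser().parse_args([
    "-i", "./test_files/anzsic.ss-pd.csv",
    "-c", "SemiStructured",
    "-C", "PreDefinedColumn",
    "-o", "outfile.xml",
    "-d", "DDIInstance_ID",
    "--codeSchemeID=Test_codeSchemeID",
    "--categorySchemeID=Test_categorySchemeID",
])
```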

Virgil UI – Announcement and Pre-alpha demonstration

When I’m not writing about writing code, I occasionally get to hop into a terminal and tear out a few lines of code. While Ramona was a bit of a bust that needs to revisit the drawing board before it’s ready to leave the nest, Virgil has taken off. Virgil is something I’ve been doing in-between other tasks with the sole purpose of allowing users to edit and manage CodeLists stored in DDI. This is based on work I did mid-last year to turn DDI Code and Category Schemes into interactive webpages. To support this I’ve been working on a tool to allow users to properly edit CodeLists in DDI.

A CodeList is a combination of two DDI objects, a CodeScheme and a CategoryScheme, and enables users to manage complex hierarchies of coded information, from something as small as codifying “Yes/No” responses to managing large industrial classifications.

To demonstrate how this may be done, I’ve uploaded a screencast of Virgil-UI in action, opening a DDI version of the coded hierarchy from the Australian and New Zealand Standard Industrial Classification (ANZSIC), then editing and saving the file.
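As a rough picture of how the two schemes fit together, the sketch below builds a tiny “Yes/No” CodeList with ElementTree. It is a simplified illustration and not Virgil’s code: the element layout follows my reading of DDI 3.1, the ID format is made up, and the output is not schema-valid on its own since it omits the surrounding DDIInstance and required identification attributes.

```python
import xml.etree.ElementTree as ET

# DDI 3.1 namespaces for the logical product and reusable modules.
L_NS = "ddi:logicalproduct:3_1"
R_NS = "ddi:reusable:3_1"
ET.register_namespace("l", L_NS)
ET.register_namespace("r", R_NS)


def build_codelist(pairs, code_scheme_id, category_scheme_id):
    """Build a CategoryScheme and a CodeScheme from (value, label) pairs.

    Each Code holds a value and a reference to the Category carrying its label.
    The '<scheme id>_<value>' category ID format is a hypothetical convention.
    """
    category_scheme = ET.Element("{%s}CategoryScheme" % L_NS, id=category_scheme_id)
    code_scheme = ET.Element("{%s}CodeScheme" % L_NS, id=code_scheme_id)
    for value, label in pairs:
        category_id = "%s_%s" % (category_scheme_id, value)
        category = ET.SubElement(category_scheme, "{%s}Category" % L_NS, id=category_id)
        ET.SubElement(category, "{%s}Label" % R_NS).text = label

        code = ET.SubElement(code_scheme, "{%s}Code" % L_NS)
        reference = ET.SubElement(code, "{%s}CategoryReference" % L_NS)
        ET.SubElement(reference, "{%s}ID" % R_NS).text = category_id
        ET.SubElement(code, "{%s}Value" % L_NS).text = value
    return category_scheme, code_scheme


CATEGORIES, CODES = build_codelist([("1", "Yes"), ("2", "No")], "YesNoCodes", "YesNoCats")
```

Serialising either element with ET.tostring shows the l: and r: prefixed XML.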


The video demonstration is available on YouTube – here.

The video got downscaled when it was uploaded (pressing the expand button helps), but for those having trouble understanding what’s in the video, the features demoed are:

  • The ANZSIC DDI file is opened in the Vim text editor and searched for the term “LOOK HERE”. This search term isn’t in the file… yet
  • Virgil-UI is run and the same file is loaded
  • Data from the DDI File for a Category is loaded and is displayed in English and German
  • The term “LOOK HERE” is added to the description of a category and the file is saved
  • The file is then reloaded in the Vim text editor and the term “LOOK HERE” searched for
  • The search term “LOOK HERE” is found

When ready (hopefully mid-August for open beta), Virgil-UI will be released under a free open-source licence and will support the following features (** indicates a feature that is already fully or partially implemented):
** Complete multilingual support, for both the UI and multilingual DDI files.
** DDI3.x file support
** Full rich-text editing for DDI Descriptions and Labels
** Support for Windows, Mac and Linux
* Export support for Virgil-Web, an existing tool for generating web pages from DDI CodeLists
* Import from CSV
* Drag-and-drop re-ordering of CodeLists

Planned features after the initial release include:
* DDI2.x file support
* DDI3.x support from a custom-built repository
* DDI3.x support from a Colectica repository