Posts Tagged ‘Data’

Virgil UI – Announcement and Pre-alpha demonstration

When I’m not writing about writing code, I occasionally get to hop into a terminal and tear out a few lines of code. While Ramona was a bit of a bust that needs to revisit the drawing board before it’s ready to leave the nest, Virgil has taken off. Virgil is something I’ve been working on in between other tasks, with the sole purpose of allowing users to edit and manage CodeLists stored in DDI. It builds on work I did in the middle of last year to turn DDI Code and Category Schemes into interactive webpages. To support this, I’ve been working on a tool that allows users to properly edit CodeLists in DDI.

A CodeList is a combination of two DDI objects, a CodeScheme and a CategoryScheme, and enables users to manage complex hierarchies of coded information, from something as small as codifying “Yes/No” responses to managing large industrial classifications.
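As a rough illustration of that pairing, here is a minimal Python sketch. It is not Virgil's code and it does not follow the DDI XML schema; every class and field name is hypothetical. It only shows the idea of codes that point at categories, where categories carry labels in several languages.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Category:
    """A described concept, with labels keyed by language code (hypothetical layout)."""
    id: str
    labels: Dict[str, str]


@dataclass
class Code:
    """A coded value that points at a Category and may nest child codes."""
    value: str
    category_id: str
    children: List["Code"] = field(default_factory=list)


def flatten(codes, categories, lang="en", depth=0):
    """Walk the code hierarchy depth-first, yielding (depth, value, label) rows."""
    for code in codes:
        label = categories[code.category_id].labels.get(lang, "")
        yield depth, code.value, label
        yield from flatten(code.children, categories, lang, depth + 1)


# The "Yes/No" case: two codes, two categories, two languages.
categories = {
    "cat_yes": Category("cat_yes", {"en": "Yes", "de": "Ja"}),
    "cat_no": Category("cat_no", {"en": "No", "de": "Nein"}),
}
code_scheme = [Code("1", "cat_yes"), Code("2", "cat_no")]

rows = list(flatten(code_scheme, categories, lang="de"))
for depth, value, label in rows:
    print("  " * depth + value, label)
```

A large classification would use the same two structures, just with codes nested under their parents through `children`.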

To demonstrate how this may be done, I’ve uploaded a screencast of Virgil-UI in action: opening a DDI version of the coded hierarchy from the Australian and New Zealand Standard Industrial Classification (ANZSIC), editing it and saving the file.


The video demonstration is available on YouTube – here.

The video got downscaled when it was uploaded (pressing the expand button helps), but for those having trouble making out what’s in the video, the features demonstrated are:

  • The ANZSIC DDI file is opened in the Vim text editor and searched for the term “LOOK HERE”. This search term isn’t in the file… yet
  • Virgil-UI is run and the same file is loaded
  • Data from the DDI File for a Category is loaded and is displayed in English and German
  • The term “LOOK HERE” is added to the description of a category and the file is saved
  • The file is then reloaded in the Vim text editor and the term “LOOK HERE” is searched for
  • The search term “LOOK HERE” is found

When ready (hopefully mid-August for open beta), Virgil-UI will be released under a free open-source licence and will support the following features (** indicates a feature that is already fully or partially implemented):
** Complete multilingual support, for both the UI and multilingual DDI files.
** DDI3.x file support
** Full rich-text editing for DDI Descriptions and Labels
** Support for Windows, Mac and Linux
* Export support for Virgil-Web, an existing tool for generating web pages from DDI CodeLists
* Import from CSV
* Drag-and-drop re-ordering of CodeLists

Planned features after the initial release include:
* DDI2.x file support
* DDI3.x support from a custom-built repository
* DDI3.x support from a Colectica repository

63 years of Australian CPI data

The latest Australian CPI figures were released last week by the Australian Bureau of Statistics.

Fortunately, each release includes re-weighted indices for each indicator. These cover 11 major topics, including food, clothing, housing and education. Unfortunately, the Excel spreadsheets these are distributed in aren’t the easiest format to process data from. This is because exporting data from Excel can be time-consuming, and the data as stored in Excel neglects the hierarchical structure of the CPI indicators.

To help change this, I wrote a few scripts (and lovingly hand-crafted some XML) to help transform the CPI from Excel into a DSPL dataset, and have uploaded this into the Google Public Data Explorer.

The dataset that has been uploaded is based on Table 12 from the Downloads section of the latest CPI page. This dataset includes the indices for each capital city, and for Australia as a whole, at each level of indicator: from the broad total CPI to, for example, the more specific “Food”, “Bread and Cereals” and finally “Breads”. There are 12 broad topics covered in total, including a miscellaneous group of indices that exclude some of the 11 topics, and there are 144 topics at the finest level of detail.

The end result is something like this:

If you want to play with the whole dataset, it is available on the Google Public Data Explorer, or if you would like to download the full DSPL dataset, that is available on the DSPL-R downloads page.

Probably one of the more interesting parts was how to create the hierarchical CPI indicators category in DSPL, but I’ll be following this post up later in the week with a tutorial on how to work with complex datasets.

Update: With the help of a kind statistician from Google, the dataset is now much better structured. The updated dataset is available here: http://code.google.com/p/dspl-r/downloads/list

Using metadata within statistical software

Today marks the beta release of a package I am writing for the R statistical environment, to make it easier for researchers to utilise metadata within R and to make it more worthwhile for statisticians to provide metadata.

Most of the methods for importing data into R deal solely with undocumented data; in fact, one of the most common ways to import data is through raw CSVs. However, with the release of DSPL.R it is now possible to browse the metadata of a dataset within a statistical package.

For example, the following is example output from the US Retail Sales dataset provided by Google:

> print (prep.dspl("~/example/census-retail-sales.zip"))
DSPL Dataset - For more info see: [www.kidstrythisathome.com/dspl.r]
------------                  or: [code.google.com/apis/publicdata/]

Name : Retail Sales in the U.S.
Description : Monthly Retail Trade and Food Services report
            for the United States. This dataset was prepared by Google based
            on data downloaded from the U.S. Census Bureau.
Concepts : 3  -  Type of business, Seasonality, Retail Sales Volume
Slices   : 1  -  retail_sales_business
Tables   : 3  -  businesses, seasonalities, retail_sales_business_tbl
Topics   : 3  -  Industry, Business, Gender

As this example shows, a user is able to load a new dataset and get an immediate sense of what it contains. By letting users understand the meaning behind a dataset without having to leave the statistical environment, the package lets them work seamlessly with their data and metadata within the same interface.

While DSPL is seen as a newcomer to the statistical world, and R is perceived (albeit wrongly) to be inferior to more established commercial statistical tools, the agility of R and the brevity of the DSPL standard are a strong indicator of how, given time, statistical metadata could become an integral part of all statistical processes.

DSPL, SDMX and the future of Data

It was recently announced that Google has made their Public Data Explorer open to the public, so now anyone can upload data. While the data that they have made available is interesting, what is more interesting is the much subtler announcement of their DataSet Publication Language (DSPL).

DSPL is a data/metadata language specification that basically allows people to describe multi-dimensional, aggregate datasets, along with their appropriate metadata, in a structured way. For those of you playing along at home, this is almost identical to the ideas behind SDMX. Both languages support datasets by essentially defining compound keys of dimensions and their associated measures (dimensions and metrics in DSPL, key families and measures in SDMX). However, there are four significant differences between the two standards that will affect which one sees wider adoption.

1: SDMX was a collaborative effort to meet the needs of a wide banking and statistical community, DSPL was made solely by Google to accommodate the Public Data Explorer.
For good or bad, when making DSPL, Google only had to meet their own needs: design a standard that helps people write data for the Public Data Explorer, ideally one that is easy to use. SDMX, on the other hand, has had a lot of hands pulling at it, which means the standard, as well written as it is, is overly complex and meets more needs than any one agency is ever likely to encounter. The important thing about the size of the spec is that it has a near-linear relationship to the size of the documentation, which means…

2: DSPL’s documentation is vastly shorter than SDMX’s.
When I first began looking at SDMX, the documentation was enormous. Going through the simple task of making a dataset and adding its metadata required reading through reams of documentation. In the end it was possible, but not without a significant investment of time, and for many people, if the time to produce something useful is too long, they will start to look elsewhere. The documentation for DSPL, by contrast, is light and easy to understand, so it took me much less time to start working with the standard productively *. However, one of the main reasons the documentation is so small is because…

3: SDMX manages all data and metadata in XML, DSPL uses XML and CSV
This leads to a huge difference in understanding how to start working with data. DSPL works from the same idea as SDMX: that data should be defined according to its dimensions, and that these dimensions should be described in XML to allow people and machines to understand them. However, DSPL manages all its data, i.e. the actual numbers, as tables in Comma Separated Value (CSV) files, while SDMX serialises the data in XML constructs. This is a massive difference.

Any programmer worth their salt can process CSVs: readline, split by commas, do things with values, repeat. There are standard libraries for managing CSVs, they can be imported into most RDBMSs, and the format is almost a de facto standard for data; even Excel supports it. This means that to write a tool to work with DSPL, one only needs to learn one new thing, rather than two. It also makes DSPL an improvement to an existing technology, rather than a completely new idea.
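To show how little is involved, here is a minimal Python sketch of that read-split-repeat loop using the standard `csv` module. The column names and numbers are made up for illustration and are not taken from any real DSPL dataset; the point is only that each row becomes an observation keyed by its dimension values, the same compound-key idea described earlier.

```python
import csv
import io

# Toy data only: the column names and values are placeholders, not real figures.
SAMPLE = """region,year,value
A,2009,1.0
A,2010,2.0
B,2009,3.0
B,2010,4.0
"""


def load_observations(handle, dimensions, metric):
    """Read a CSV and key each metric value by a tuple of its dimension values."""
    observations = {}
    for row in csv.DictReader(handle):
        key = tuple(row[name] for name in dimensions)
        observations[key] = float(row[metric])
    return observations


observations = load_observations(io.StringIO(SAMPLE), ["region", "year"], "value")
print(observations[("A", "2010")])
```

Swapping `io.StringIO(SAMPLE)` for an open file handle is all it takes to read a table from disk; the XML side of a DSPL dataset, which describes what the columns mean, is not handled here.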

4: DSPL is backed by Google, SDMX isn’t
This is probably the biggest difference of all. Although SDMX is managed by some pretty heavyweight players, next to Google they look like small fry. Google also has the advantage of visibility with developers, and that will make the big difference between the two. A standard could be the best designed, easiest to learn, and most useful, but if no-one is writing tools for it, if no-one is promoting it, if there aren’t the resources to work with it, then it is doomed to fail. Google has the ability, experience and resources to drive this groundswell.

So where does this leave our two standards?

Well, hopefully, it leaves them in a position to work with each other. DSPL acts as a great entry point into the more complex SDMX metadata requirements, an SDMX-Lite almost. This is especially of note considering that, with such a large overlap, it would take little effort to write a tool to convert data between the two standards, albeit with a lossy conversion from SDMX to DSPL.

However, for either to ignore that the other exists is a dangerous proposition. With each format having the potential to tap into slightly different markets, a division between them would present a split between people who ultimately have similar goals.

* One could argue that I was able to understand DSPL so quickly because I had already learned how SDMX worked. One wouldn’t be wrong, but the point about the difference in documentation holds.

Examining the factors of Indigenous participation in Crime

That’s the gist of my statistics thesis that’s taking up so much of my time right now.

I’m currently working with Anna Ferrante from the UWA Crime Research Centre on a project to examine some Australian Bureau of Statistics data from 2008. I’m reading my way through reams of papers on the subject, and so far the answer is a resounding “we don’t quite know.”


Or rather, there is no easy answer. From the reading I have done so far, social biases, substance abuse and socioeconomic inequities all contribute.

The bad news from this perspective is that at the moment there appears to be no silver bullet for this hot button issue.

However, with the untouched data of over 13,000 anonymised persons at my fingertips I can only hope that this analysis proves to be helpful to this already wide body of research.

And with any luck I will get a chance to carry some of this experience to when I start my Masters in Statistics.

Twitter Sparkline Generator using Unicode

NB: This post uses examples of Unicode that may not show up in some browsers.

One of my main gripes with Twitter is that you can only add text. People often want to share small snippets of data, but to no avail. The ideal way to share data in such tiny chunks is Edward Tufte’s idea of sparklines.

For those of you disinclined to read the Wikipedia page, sparklines are “data-intense, design-simple, word-sized graphics”, designed to be placed inline with text, at a similar height, to help illustrate an idea.

Now I am not the first person to suggest putting sparklines into Twitter; in fact, the second result of a Google search for sparkline turns up Alex Kerin’s article. However, there are two slight problems with Kerin’s implementation. Firstly, the Unicode block characters he is using are not designed to be lined up, and the examples shown on his page demonstrate this. To be fair, this isn’t his fault at all, as Unicode compliance isn’t 100%. The second is that a bar and a line can give two very different perceptions: bar charts are generally used to display discrete data (or continuous data shown as discrete) and line charts are used to show continuous data. For the record, there is no good time to use a pie chart.

To this end I have created a tool for producing two different types of sparkline from an input data source: a crude line graph and a five-figure box plot.

Here is an example using the June 30th 2010 Perth weather data from the Bureau of Meteorology, with bars delimiting 3 hour blocks:

The weather yesterday in Perth was quite cool (4.1┣▇▇|▇━━┫17.7) with a maximum of 17.7 degrees occurring around 2pm, before quickly cooling down until 3pm. (⣤⣤⣀⎸⣀⣀⣀⎸⣀⣀⡤⎸⠴⠚⠛⎸⠛⠛⠙⎸⠒⠒⠒⎸⠒⠲⠶⎸⠶⠶⠶).

Condensing this example further to fit within Twitter’s 140 characters:

Perth 30/06/10: Cool (4.1┣▇▇|▇━━┫17.7), max at 2pm, cooling to around 13°C after 3pm, steady afterwards. (⣤⣤⣀⎸⣀⣀⣀⎸⣀⣀⡤⎸⠴⠚⠛⎸⠛⠛⠙⎸⠒⠒⠒⎸⠒⠲⠶⎸⠶⠶⠶)

This is a 115 character weather report, leaving 25 characters for a URL to the full data. This may be for temperature only, but it shows the potential: two datasets can be placed in a Twitter post, with commentary.

I think the boxplots look quite good. The tool does take a few liberties with the braille layout, relying on people to read a pair of vertical dots as a value in between the two, but it conveys the message quite well in a limited, text-based format.
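For anyone curious how this kind of output can be produced, below is a simplified Python sketch. It is not the tool itself: it packs two values into each braille character using only four height levels (no in-between dot pairs), and the function names, the `width` parameter and the quartile method are my own assumptions for illustration.

```python
import statistics

# Braille dot bits for U+2800 + bits, ordered bottom row to top row.
LEFT = [0x40, 0x04, 0x02, 0x01]   # dots 7, 3, 2, 1 (left column)
RIGHT = [0x80, 0x20, 0x10, 0x08]  # dots 8, 6, 5, 4 (right column)


def braille_sparkline(values):
    """Crude line graph: two values per braille character, four height levels."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1
    levels = [min(3, int((v - lo) * 4 / span)) for v in values]
    chars = []
    for i in range(0, len(levels), 2):
        bits = LEFT[levels[i]]
        if i + 1 < len(levels):
            bits |= RIGHT[levels[i + 1]]
        chars.append(chr(0x2800 + bits))
    return "".join(chars)


def text_boxplot(values, width=6):
    """Five-figure box plot: min, box from Q1 to Q3 with a median mark, max."""
    lo, hi = min(values), max(values)
    q1, med, q3 = statistics.quantiles(values, n=4)
    span = (hi - lo) or 1

    def col(x):
        # Map a value onto one of `width` character columns.
        return min(width - 1, int((x - lo) * width / span))

    c1, cm, c3 = col(q1), col(med), col(q3)
    cells = []
    for i in range(width):
        if i == cm:
            cells.append("|")
        elif c1 <= i <= c3:
            cells.append("▇")
        else:
            cells.append("━")
    return f"{lo}┣{''.join(cells)}┫{hi}"


print(braille_sparkline([0, 1, 2, 3]))
print(text_boxplot(list(range(11)), width=10))
```

The 3 hour separators in the examples above could be added by inserting a bar character after every few braille characters; that step is left out to keep the sketch short.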

Hating <table>s considered harmful

Apologies to Dijkstra for butchering that quote again, but the rage against <table>s is getting out of hand. Back in the dark ages, when it was impossible to get consistent HTML rendering across browsers and platforms, someone decided to use the <table> element, originally designed to mark up tabular data, to lay out webpages.

Ever since that moment <table>s have been an unfairly shunned part of the HTML spec. When CSS became a stable and supported spec, many people started screaming from the rooftops “Stop using <table>, it ruins webpages”, and too many people have taken this to heart, so much so that any time tabular data needs to be presented, bizarre alternatives are used.

This was an example of a bad table of data that caught my eye a while ago:

See the "pre" section below

A screenshot for posterity

It’s supposed to give an idea of the cumulative revenue of iPhone apps over time, but it’s hard to tell, as there is no title. Good data tells a story; good metadata explains that story. There is no metadata here to explain any of these numbers. Here is what a search engine sees when it looks at the HTML on that page:

Period ending.....Period downloads.....Cumulative downloads....Period revenues
Jun 2008............no apps...................no apps........................no revenues
Dec 2008.............600 M......................600 M..........................$ 172 M
Jun 2009..............800 M....................1.4 B.............................$  228 M
Dec 2009..........1.6 B.........................3.0 B............................$  458 M
Jun 2010...........2.0 B.........................5.0 B............................$  542 M
Total.................5.0 B.........................5.0 B............................$1.4 B

Accessibility and search robots aside, there is no context to any of this: the units change between rows, and the number of dots changes each time. It’s hard to say what this data is without purposefully reading the text around it. I could drone on about the lack of metadata describing the data in this table, but instead I’ll counter this with a better example.

Here is the same table, rewritten as an actual <table>, using only what’s defined in the HTML4 and above W3C specifications:

A few things instantly stand out: the columns all match up nicely, both in display and in the table model; each of the years is clearly marked; the table is relatively self-explanatory; and lastly, it appears as a table to anyone who reads it. Furthermore, all of the formatting is done using CSS. That’s right: the presentation is left to the CSS and the descriptive markup is done using a <table>, just how it is intended to be done. But even this is only a fraction of the possible metadata to place into an HTML table.
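As a sketch of the kind of markup involved, here is a small Python script that renders the figures from the dotted text above into a <table> with a caption, column headers and row headers. This is not the markup used for the table on this page; the caption wording and the function name are my own assumptions, and styling is deliberately left out because it belongs in the CSS.

```python
from html import escape

HEADERS = ["Period ending", "Period downloads", "Cumulative downloads", "Period revenues"]
ROWS = [
    ["Jun 2008", "no apps", "no apps", "no revenues"],
    ["Dec 2008", "600 M", "600 M", "$172 M"],
    ["Jun 2009", "800 M", "1.4 B", "$228 M"],
    ["Dec 2009", "1.6 B", "3.0 B", "$458 M"],
    ["Jun 2010", "2.0 B", "5.0 B", "$542 M"],
    ["Total", "5.0 B", "5.0 B", "$1.4 B"],
]


def render_table(caption, headers, rows):
    """Build a plain HTML table: caption, header row, and a row header per line."""
    lines = ["<table>", f"  <caption>{escape(caption)}</caption>", "  <thead>"]
    head = "".join(f'<th scope="col">{escape(h)}</th>' for h in headers)
    lines.append(f"    <tr>{head}</tr>")
    lines += ["  </thead>", "  <tbody>"]
    for cells in rows:
        # The first cell names the period, so it is marked up as a row header.
        label = f'<th scope="row">{escape(cells[0])}</th>'
        data = "".join(f"<td>{escape(c)}</td>" for c in cells[1:])
        lines.append(f"    <tr>{label}{data}</tr>")
    lines += ["  </tbody>", "</table>"]
    return "\n".join(lines)


# Caption text is an assumption; the original page gave the data no title.
html_out = render_table("iPhone app downloads and revenues by period", HEADERS, ROWS)
print(html_out)
```

The `scope` attributes are what tell a screen reader or crawler which header each figure belongs to, which is exactly the context the dotted version throws away.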

However, this is not the worst example of a ‘table’ I have seen. The one that sent me over the edge was this large image that highlights the relative pros and cons of different smartphones:

A very large table comparing different smartphones

This 'table' is too big to show inline, it needed to be reduced to a third of its size to fit

The main issue is that this image of a table is mostly text, quite large, and static. You cannot easily copy text out of it, you can’t easily reorder the table, and unless you have a large screen it would be quite difficult to compare two products that weren’t very close to each other. Unlike the table of iPhone revenues, I am not going to go to the effort of transcribing this into a proper HTML table.

Compare this with a similar table from Wikipedia, also providing a comparison of smartphones, and the differences are obvious. First of all, the table is now web-crawlable: whatever data is in this table is now indexed in Google (instant bonus), and the user can easily search through the page to find what they need to know. There is also a whole lot of JavaScript on the page. For example, clicking the boxes next to each of the column titles reorders the whole table around that column. In the first table this isn’t too noticeable, with most entries being either ‘Yes’ or ‘No’, but the Hardware and OS table from further down the same page is full of figures, and now I can easily find the lightest phone (it’s the HP iPAQ Voice Messenger, something that would be much harder to find out from the image ‘table’).

Furthermore, there are plenty of ways to easily add functionality to tables, letting readers reorder, hide, expand or change them, all of which makes your tables far more useful.
The web is supposed to be full of life and allow us to use our data in exciting new ways, so don’t stick with static images or unstructured text for your tables when there is a much more useful alternative out there.

Cornell Trip Report Days 3-5

So, I’m sitting in the Sydney Qantas Business Lounge, with plenty of time until my connecting flight home, so I’ll recap the last few days.

From what people had to say my presentation went quite well. The script I worked from is now available online, and the slides will soon follow. As soon as I have updated the slides with the transcript I will upload them and update this post.

There were however plenty of excellent talks at the IASSIST Conference this year. Due to the theme there was an abundance of social networking talks, including the keynote speaker on Friday, and they all presented excellent angles on the same question: “How can we get more people interacting with our data better?”

I think this was a great take on where data and metadata agencies are heading. As I have said before, the more people interact with data agencies and use the data and metadata they provide, the more relevant those agencies become. The fact that so many agencies are asking this question, working on their own solutions and willing to share those ideas is a great step forward for open data.

Cornell Trip Report – Day 1 & 2

So in the midst of preparing for Cornell, Uni work and wedding planning I haven’t had time to update, but taking a breather in a beautiful hotel room is a perfect time to get back to blogging.

I arrived in Ithaca yesterday, and the trip up was gorgeous. I would say pun intended, but I didn’t know Ithaca was famous for its gorges until I got here. I spent the afternoon with folks from the DDI Alliance Expert Committee. That was great: I didn’t say much, but listening to some extraordinarily smart people, and having them do the same for me, was an honour. Sadly, after dinner I had to retire with a killer fog of jetlag and nearly immediately fell asleep.

This morning was much more eventful. The lovely staff at the Statler Hotel were able to get me a cord to charge my camera with, and with that I set off to explore the campus and surrounds. I walked round the campus before setting off for Collegetown. As my camera only had a short charge, all my shots of Collegetown are on my phone, whose cord is absent, so until I find it or figure out how to get the shots off, there are no photos from there yet.

But that’s ok, as the best shots were down by Cascadilla Gorge. The shots were great, and by the bend down the end there is (according to locals) a little wading area. The track along the creek from there was spectacular and green after last night’s rain.

The only downside is that I twisted my ankle some time during the walk, but I’m all strapped up and about to head off for lunch and see Jeremy Iverson from Algenta talk about Colectica.

How crowdsourcing will drive open data

Over the past year there has been a global shift in data policies within governments worldwide: opening up data that once cost money or was hidden from public view. Some government organisations have gone so far as to put together incentives to encourage people to make use of this newly freed data. Australian examples include state initiatives such as Victoria’s AppMyState and New South Wales’ apps4NSW, and the Federal Government’s Government 2.0 Taskforce’s Mashup Australia. What’s important to realise though is that these promotions should be seen more as a means to an end, rather than an end in and of themselves.

Data collectors and maintainers need to exhibit a relatively low level of bias and high levels of independence. If any data collected is called into question, the validity of the entire archive, and future collections, can also be questioned. The reason for this stems from the logical (albeit poorly grounded) assumption that if one collection had been tampered with then there are grounds to suggest that all other collections may suffer the same bias.

It is for this reason that the data exposed from these sources may be little more than the aggregated and weighted data, with explanatory metadata and appropriate notes on the methodology of the study. No exposition into the links between data, or overtly controversial or political hypotheses attempting to explain the data. Anything more may suggest bias or influence: for example, a statistical agency may present statistics regarding life expectancy in collection regions, and in another study may present statistics around polluting industry in the same regions, but it would be amiss for it to begin drawing correlations between them. The same agency may hold data around economic indicators, but would never try to correlate them to political policy.

So while the agency may hold this data and the expertise in explaining the data, they would be reluctant to go further than collection and dissemination. However, the public is unencumbered by such ideals.

Contests such as those listed above give public agencies the opportunity to drive public use of public data. This again drives straight to the goals of data agencies: public use of data reinforces their relevance. This gives the public the opportunity to ask potentially controversial questions, backed by official data, while giving the data agencies recognition for their continued importance in society.