Archive for the ‘Data’ Category

URLs should be meaningful

In computing, a Uniform Resource Locator (URL) is a Uniform Resource Identifier (URI) that specifies where an identified resource is available and the mechanism for retrieving it.

- wikipedia.com – Uniform Resource Locator

A URL is, by rights, the first piece of information a user interacts with on your site, because without it they cannot get there. While most users will follow a link to find your site, URLs still take centre stage in every browser UI.

So why is it that so many web-developers neglect their usefulness completely?

Take, for example, the URL of this blog post:

http://www.kidstrythisathome.com/2010/10/urls-should-be-meaningful/

There is quite a bit of information here: this is a blog post, and it was made in October of 2010. If you remove the last section so the URL looks like this:

http://www.kidstrythisathome.com/2010/10/

you have a related, but brand new URL, and you now have a link to a page of all blog posts from October 2010. Remove the “10/” and you have a page of all posts from 2010, and remove “2010/” and you are at the main site again.

This should be expected behaviour, and users should be encouraged or at least allowed to have this basic interaction with your site.
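As a rough illustration of the "work backwards" behaviour described above, here is a minimal Python sketch that lists every URL you get by deleting one trailing path section at a time. The function name parent_urls is made up for this example; it only builds the candidate URLs and does not check that a site actually serves them.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Return the URLs reached by repeatedly removing the last path section."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Drop one trailing section at a time, finishing at the site root.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth:
            path += "/"
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


for candidate in parent_urls(
    "http://www.kidstrythisathome.com/2010/10/urls-should-be-meaningful/"
):
    print(candidate)
```

On a site with meaningful URLs, each printed address should resolve to a useful page: the month archive, the year archive, and the main site.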

Now, let’s look at an example of a “bad” URL:

http://www.canberra.edu.au/courses-units/m-coursework/information-studies/online/mis

At first glance this URL is very descriptive. From it you can tell that I was looking at courses at Canberra University, examining the units available in Masters by coursework (/m-coursework) in the field of information studies, and looking at the online Masters in Information Studies (/online/mis).

By rights, I should be able to work backwards through this URL, deleting the last section and finding valid information all the way back up the directory structure. For example

http://www.canberra.edu.au/courses-units/m-coursework/

should take us to the Canberra University webpage outlining all coursework Masters, and likewise for the other URLs you get by shortening the one above. However, it does not. There is no valid URL that can be made by deleting sections from the end of the given URL, where “valid” means a URL that resolves to a page with useful and relevant information.

By using meaningful URLs on your site you add another feature for users, and help provide a stronger semantic structure to your pages. That said, if your URL looks like this:

http://example.com/Directory/exm@.nsf/ProductsbyTopic/D8CBDEC90255GBF7CA2575E70119DFAA?OpenDocument

then you have bigger problems than URL structure and need to start reexamining your entire design.

Can we find all the Roman toilets in ancient Jordan?

So Sam, what exactly do you do at uni?

Well, I’m glad you asked, hypothetical reader: I solve practical problems. For example, for the last 2 months I, along with 5 of my peers, have been working with the Archaeology department at the University of Western Australia to design a tool to help field researchers track and visualise dig sites using Google Earth.

This work appears to have caught the eye of the publishing group at UWA and it was mentioned in a recent edition of the UWA News.

The tool will allow researchers to track and upload site information (including such information as “was this site a toilet”) via a webpage and have this be instantly viewable back here by researchers at UWA. So yes, in the near future UWA researchers will be able to easily find out where ancient Romans pooped when they were holidaying in Jordan!

Examining the factors of Indigenous participation in Crime

That’s the gist of my statistics thesis that’s taking up so much of my time right now.

I’m currently working with Anna Ferrante from the UWA Crime Research Centre on a project to examine some Australian Bureau of Statistics data from 2008. I am reading my way through reams of papers on the subject, and so far the answer is a resounding “we don’t quite know.”

Or rather, there is no real easy answer. From the basic reading I have done, social biases, substance abuse and socioeconomic inequities all contribute.

The bad news from this perspective is that at the moment there appears to be no silver bullet for this hot button issue.

However, with the untouched data of over 13,000 anonymised persons at my fingertips I can only hope that this analysis proves to be helpful to this already wide body of research.

And with any luck I will get a chance to carry some of this experience into my Masters in Statistics when I start it.

Twitter Sparkline Generator using Unicode

NB: This post uses examples of Unicode that may not show up in some browsers.

One of my main gripes with Twitter is that you can only add text. People often want to share small snippets of data, but to no avail. The ideal way to share data in such tiny chunks is Edward Tufte’s idea of sparklines.

For those of you disinclined to read the wikipedia page, sparklines are “data-intense, design-simple, word-sized graphics”, designed to be entered inline with text, at similar height to help illustrate an idea.

Now I am not the first person to suggest entering sparklines into Twitter; in fact the second entry of a Google search for sparkline turns up Alex Kerin’s article. However, there are two slight problems with Kerin’s implementation. Firstly, the Unicode block characters he is using are not designed to be lined up, and examples that are shown on his page demonstrate this. To be fair, this isn’t his fault at all, as Unicode compliance isn’t 100%. The second is that a bar and a line can provide two very different perceptions: bar charts generally being used to display discrete data (or continuous data being shown as discrete) and line charts being used to show continuous data – for the record there is no good time to use a pie chart.

To this end I have created a tool for producing two different types of sparkline from an input data source: a crude line graph and a 5-figure box-plot.

Here is an example showing both, using the June 30th 2010 Perth weather data from the Bureau of Meteorology, with bars delimiting 3 hour blocks:

The weather yesterday in Perth was quite cool (4.1┣▇▇|▇━━┫17.7) with a maximum of 17.7 degrees occurring around 2pm, before quickly cooling down until 3pm. (⣤⣤⣀⎸⣀⣀⣀⎸⣀⣀⡤⎸⠴⠚⠛⎸⠛⠛⠙⎸⠒⠒⠒⎸⠒⠲⠶⎸⠶⠶⠶).

Limiting this example further, restricting ourselves to the 140 characters of twitter:

Perth 30/06/10: Cool (4.1┣▇▇|▇━━┫17.7), max at 2pm, cooling to around 13°C after 3pm, steady afterwards. (⣤⣤⣀⎸⣀⣀⣀⎸⣀⣀⡤⎸⠴⠚⠛⎸⠛⠛⠙⎸⠒⠒⠒⎸⠒⠲⠶⎸⠶⠶⠶)

This is a 115 character weather report, leaving 25 characters for a URL to the full data. This may be for temperature only, but it shows the potential: you can place 2 datasets in a Twitter post along with commentary.
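To give a feel for how a text box-plot of this kind could be produced, here is a minimal Python sketch. It is not the tool described in this post: the function name box_sparkline, the width parameter and the choice of the "inclusive" quartile method are all assumptions made for illustration. It needs at least two data points.

```python
import statistics


def box_sparkline(data, width=8):
    """Render a five-number summary as a one-line text box plot."""
    lo, hi = min(data), max(data)
    # Quartiles via the standard library; "inclusive" treats data as a full population.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    span = hi - lo
    if span == 0:
        # All values identical: nothing to draw except the median marker.
        return f"{lo:g}┣|┫{hi:g}"
    # Index of the cell that contains the median.
    median_cell = min(int((median - lo) / span * width), width - 1)
    cells = []
    for i in range(width):
        centre = lo + (i + 0.5) / width * span
        if i == median_cell:
            cells.append("|")   # median marker
        elif q1 <= centre <= q3:
            cells.append("▇")   # inside the box (Q1 to Q3)
        else:
            cells.append("━")   # whisker
    body = "".join(cells)
    return f"{lo:g}┣{body}┫{hi:g}"


print(box_sparkline([1, 2, 3, 4, 5]))  # 1┣━━▇▇|▇━━┫5
```

A wider plot gives finer resolution at the cost of characters, which matters when the whole message has to fit in 140.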

I think the boxplots look quite good. The tool does, however, take a few liberties with the braille layout, relying on people to see a pair of vertical dots as a value in between the two, but it helps convey the message quite well in a limited, text-based format.
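For the curious, a crude braille line graph can be sketched in a few lines of Python. This is a simplified illustration rather than the tool itself: it places exactly one dot per value on one of four rows and does not use the "pair of vertical dots" trick for in-between values. The name braille_sparkline is made up for the example. It relies on the Unicode braille block starting at U+2800, where each dot in the 2×4 cell corresponds to one bit.

```python
# Bit values for the dots of a Unicode braille cell, top row to bottom row.
LEFT_DOTS = (0x01, 0x02, 0x04, 0x40)   # left column: dots 1, 2, 3, 7
RIGHT_DOTS = (0x08, 0x10, 0x20, 0x80)  # right column: dots 4, 5, 6, 8


def braille_sparkline(values):
    """Draw a crude line graph, two data points per braille character."""
    lo, hi = min(values), max(values)
    span = hi - lo
    rows = []
    for v in values:
        # Scale each value to a level 0 (lowest) .. 3 (highest).
        level = round((v - lo) / span * 3) if span else 0
        rows.append(3 - level)  # row 0 is the top of the cell
    chars = []
    for i in range(0, len(rows), 2):
        mask = LEFT_DOTS[rows[i]]
        if i + 1 < len(rows):
            mask |= RIGHT_DOTS[rows[i + 1]]
        chars.append(chr(0x2800 + mask))
    return "".join(chars)


print(braille_sparkline([0, 1, 2, 3, 3, 2, 1, 0]))
```

With only four levels per character the result is coarse, which is why any tool of this kind ends up taking liberties with the layout.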

Hating <table>s considered harmful

Apologies to Dijkstra for butchering that quote again, but the rage against <table>s is getting out of hand. Back in the dark ages, when it was impossible to get consistent HTML rendering across browsers and platforms, someone decided to use the <table> element, originally designed to mark up tabular data, to lay out webpages.

Ever since that moment <table>s have been an unfairly shunned part of the HTML spec. When CSS became a stable and supported spec many people started screaming from the rooftops “Stop using <table> it ruins webpages”, and too many people have taken this to heart, so much so that any time tabular data needs to be presented bizarre alternatives are used.

This was an example of a bad table of data that caught my eye a while ago:

See the "pre" section below

A screenshot for posterity

It’s supposed to give an idea of the cumulative revenue of iPhone Apps over time, but it’s hard to tell that as there is no title. Good data tells a story, good metadata explains that story. There is no metadata here to explain any of these numbers. Here is what a search engine sees when it looks at the HTML on that page:

Period ending.....Period downloads.....Cumulative downloads....Period revenues
Jun 2008............no apps...................no apps........................no revenues
Dec 2008.............600 M......................600 M..........................$ 172 M
Jun 2009..............800 M....................1.4 B.............................$  228 M
Dec 2009..........1.6 B.........................3.0 B............................$  458 M
Jun 2010...........2.0 B.........................5.0 B............................$  542 M
Total.................5.0 B.........................5.0 B............................$1.4 B

Accessibility and search robots aside, there is no context to any of this: the units change between rows and the number of dots changes each time. It’s hard to say what this data is without purposefully reading the text around it. I could drone on about the lack of metadata describing the data in this table, but instead I’ll counter this with a better example.

Here is the same table, rewritten as an actual <table>, using only what’s defined in the HTML4 and above W3C specifications:

A few things instantly stand out: the columns all match up nicely, both in display and in the table-model; each of the years is clearly marked; the table is relatively self-explanatory; and lastly it appears as a table to anyone who reads it. Furthermore, all of the formatting is done using CSS; that’s right, the presentation is left to the CSS and the descriptive markup is done using a <table>, just how it is intended to be done. But even this is only a fraction of the possible metadata to place into an HTML table.
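As a sketch of what such descriptive markup can look like, the Python below generates a <table> from the figures quoted earlier, using a <caption>, a <thead> and <th scope="..."> header cells. It is not the markup used in this post: the caption wording and the function names are assumptions made for the example, and the cell values are copied as text from the dotted version above.

```python
from html import escape

# Caption text is an assumption; the original data had no title at all.
CAPTION = "iPhone app downloads and revenue by half-year period"
HEADERS = ["Period ending", "Period downloads", "Cumulative downloads", "Period revenues"]
ROWS = [
    ["Jun 2008", "no apps", "no apps", "no revenues"],
    ["Dec 2008", "600 M", "600 M", "$172 M"],
    ["Jun 2009", "800 M", "1.4 B", "$228 M"],
    ["Dec 2009", "1.6 B", "3.0 B", "$458 M"],
    ["Jun 2010", "2.0 B", "5.0 B", "$542 M"],
    ["Total", "5.0 B", "5.0 B", "$1.4 B"],
]


def row_html(cells):
    """One body row: the first cell is a row header, the rest are data cells."""
    head = f'<th scope="row">{escape(cells[0])}</th>'
    data = "".join(f"<td>{escape(c)}</td>" for c in cells[1:])
    return f"    <tr>{head}{data}</tr>"


def table_html(caption, headers, rows):
    """Build a table whose caption and header cells describe the data."""
    column_heads = "".join(f'<th scope="col">{escape(h)}</th>' for h in headers)
    lines = [
        "<table>",
        f"  <caption>{escape(caption)}</caption>",
        "  <thead>",
        f"    <tr>{column_heads}</tr>",
        "  </thead>",
        "  <tbody>",
    ]
    lines.extend(row_html(r) for r in rows)
    lines.extend(["  </tbody>", "</table>"])
    return "\n".join(lines)


print(table_html(CAPTION, HEADERS, ROWS))
```

The output contains no styling at all; alignment, borders and fonts would be left to a stylesheet.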

However, this is not the worst example of a ‘table’ I have seen. The one that sent me over the edge was this large image that highlights the relative pros and cons of different smart phones:

A very large table comparing different smartphones

This 'table' is too big to show inline, it needed to be reduced to a third of its size to fit

The main issue with this is that the image of a table is mostly text, and is both quite large and static. You cannot easily copy text out of it, you can’t easily reorder the table, and unless you have a large screen it would be quite difficult to compare two products that weren’t both very close to each other. Unlike the table of iPhone revenues, I am not going to go to the effort of transcribing this into a proper HTML table.

Compare this with a similar table from wikipedia also providing a comparison of smartphones and the differences are obvious. First of all, the table is now web-crawlable – whatever data is in this table is now indexed in Google, instant bonus, and the user can easily search through the page to find what they need to know. There is also a whole lot of JavaScript on the page. For example, clicking the boxes next to each of the column titles reorders the whole table around that column. In the first table this isn’t too noticeable with most entries having either ‘Yes’ or ‘No’, but the Hardware and OS table from further down the same page is full of figures, and now I can easily find the lightest phone (it’s the HP iPAQ Voice Messenger, something that would be much harder to find out on the image ‘table’).

Furthermore, there are plenty of ways that you can easily add functionality to tables, such as reordering, hiding or expanding rows and columns, to make your tables far more useful.
The web is supposed to be full of life and allow us to use our data in exciting new ways, so don’t stick with static images or unstructured text for your tables when there is a much more useful alternative out there.

Cornell Trip Report Days 3-5

So, I’m sitting in the Sydney Qantas Business Lounge, with plenty of time until my connecting flight home, so I’ll recap the last few days.

From what people had to say my presentation went quite well. The script I worked from is now available online, and the slides will soon follow. As soon as I have updated the slides with the transcript I will upload them and update this post.

There were however plenty of excellent talks at the IASSIST Conference this year. Due to the theme there was an abundance of social networking talks, including the keynote speaker on Friday, and they all presented excellent angles on the same question: “How can we get more people interacting with our data better?”

I think this was a great take on where data and metadata agencies are heading because, as I have said before, the more we have people interacting with data agencies, and using the data and metadata they provide, the more relevant those agencies become. The fact that there are so many agencies asking this question, working on their own solutions and willing to share those ideas is a great step forward for open data.

How crowdsourcing will drive open data

Over the past year there has been a global shift in data policies within governments world-wide – opening up data that once cost money or was hidden from public view. Some government organisations have gone so far as to put together incentives to encourage people to make use of this newly freed data. Australian examples include state initiatives such as Victoria’s AppMyState and New South Wales’ apps4NSW, and the Federal Government’s Government 2.0 taskforce’s Mashup Australia. What’s important to realise though is that these promotions should be seen more as a means to an end, rather than an end in and of themselves.

Data collectors and maintainers need to exhibit a relatively low level of bias and high levels of independence. If any data collected is called into question, the validity of the entire archive, and future collections, can also be questioned. The reason for this stems from the logical (albeit poorly grounded) assumption that if one collection had been tampered with then there are grounds to suggest that all other collections may suffer the same bias.

It is for this reason that the data exposed from these sources may be little more than the aggregated and weighted data, with explanatory metadata and appropriate notes on the methodology of the study. There is no exposition of the links between data, and no overtly controversial or political hypotheses attempting to explain the data. Anything more may suggest bias or influence: for example, a statistical agency may present statistics regarding life expectancy in collection regions, and in another study may present statistics around polluting industry in the same regions, but would be remiss to begin to draw correlations between them. The same may hold for data around economic indicators: the agency would never try to correlate them to political policy.

So while the agency may hold this data and the expertise in explaining the data, they would be reluctant to go further than collection and dissemination. However, the public is unencumbered by such ideals.

Contests such as those listed above give public agencies the opportunity to drive public use of public data. This again goes straight to the goals of data agencies: public use of data reinforces their relevance. This gives the public the opportunity to ask potentially controversial questions, backed by official data, while giving the data agencies recognition for their continued importance in society.