Archive for the ‘ Metadata ’ Category

Help users find your data with the “Data Discovery Cycle”

Prologue – A story of data discovery

Users routinely search for information on the web, and that search is no different from any other form of discovery, be it digital or physical. To help cement this idea, consider this simple chain of events.

  1. A website author adds tags and text to a page appropriate to the content
  2. A user searches for terms, dates and locations in a search engine, such as Google or Bing
  3. After getting the results, the user reads the descriptions of the appropriate sites to decide which sites to read or ignore
  4. The user clicks the link to a page they are interested in and reads more.

By going through these steps the user has been able to quickly narrow down potentially millions of pages of information and find the most appropriate one for their needs. Could you say that a user searching through your data could do the same?

The Data Discovery Cycle

Good metadata is about leading users through the Data Discovery Cycle – Discovery, Description and Identification, or DDI *. These are the three steps users go through to find the data they need. If you know where something is, you won’t search for it. If you don’t know what something is you won’t care where it is.

Discovery

Discovery metadata is the first piece of information that helps a user find what they want. When a user is at the Discovery stage of the Data Discovery Cycle they know what they want to find, but don’t know what or where it is yet. As a data provider, it is your role to help a user find what that is, and it starts by answering the five W’s that users may want to know about data. For example, a user may need to know some of the following to begin narrowing down the data:

  • Who gave (or collected) the data?
  • What data was gathered?
  • When was the data collected?
  • Where was the data gathered?
  • Why was this data collected?

‘How’ has been left off this list because how something occurred can be very complicated or domain-specific, and Discovery metadata should be relatively standardised across domains. Also of note is that the questions a user asks can be very specific, and although there are standard ways to express this kind of information, building effective services over these standards comes down to understanding who is trying to find your data.

An example of an excellent discovery metadata standard is Dublin Core. Using just 15 base fields, Dublin Core lets a provider capture the essence of a piece of data, so widely differing pieces of information can be placed in one registry and users can find what they need across many different areas.
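As a rough sketch of how the five W's might line up with those 15 fields, the Python below checks which questions a record leaves unanswered. The element names are the standard Dublin Core ones, but the mapping from each W to particular elements is my own assumption, and the record itself is made up.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Assumed mapping from the five W's to Dublin Core elements (illustrative only).
FIVE_WS = {
    "who": ["creator", "publisher", "contributor"],
    "what": ["title", "subject", "type"],
    "when": ["date"],
    "where": ["coverage"],
    "why": ["description"],
}


def unanswered_ws(record):
    """Return the W questions for which the record has no non-empty element."""
    return [
        w for w, elements in FIVE_WS.items()
        if not any(record.get(element) for element in elements)
    ]


# Hypothetical record; every value here is invented.
record = {
    "title": "Example household survey",
    "creator": "Example Statistics Agency",
    "date": "2010",
    "coverage": "Western Australia",
}

print(unanswered_ws(record))
```

A registry could run a check like this at submission time to nudge providers into filling the gaps before the data is published.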

Description

Descriptive metadata is information that helps a user narrow down what they are trying to find. At the Description phase a user has begun narrowing the field of data, and begins investigating specific data sources. Once a user has culled the whole field of data down to a short list of related information, they can examine the descriptions of the data that matches their criteria and narrow the field of results even further.

Examples of descriptive metadata that help at this stage include:

  • The name of the data (which is different from an identifier)
  • Brief and in-depth descriptions of the data
  • Specific labels that users may know the data by

This is the least machine-actionable of the three steps, and the most likely to be in a written language. What is important is that this information helps users get a better feel for what the data is about and allows them to cull or keep the data that is most relevant. Once a user has narrowed down what they need, they can move on to actually retrieving it.

Identification

At its simplest, Identification metadata can be a single Uniform Resource Identifier; at its most complex, a series of compound identifiers. No matter how identification metadata is managed, it is still able to pinpoint a single piece of information. Once a user has reached the Identification stage they have found what they want, and now need to be able to locate and retrieve the actual data. Previous pieces of metadata that helped them search through data can take on new roles outside of the Data Discovery Cycle.

Probably the best identification standard was mentioned above, the humble URI or Uniform Resource Identifier. Standard, easily resolved, infinitely extendable and widely used.

What about everything else?

Don’t take this simple division of information to mean that everything else is unimportant. Metadata that doesn’t fall into the above roles shouldn’t be discounted. For domain-specific reasons, the lists of information above can become what the user is after. However, for the process of helping a user go through the Data Discovery Cycle, everything else really does become less important, and the above distinction can help you narrow down what is the most important information to help users search through a registry of information.

So what is the point?

The point is to look at how specific metadata helps or hinders a user’s ability to find what they are after. Too much and they become overwhelmed with options; too little and it becomes too difficult to find what they need. Likewise, if users are restricted to specific values for their metadata, they may misinterpret the meaning of the controlled vocabulary. But again, if they are given too much freedom, it may become impossible for anyone to find anything.

Epilogue – A story of discovery revisited

Let’s revisit our earlier story and look at how this maps to the Data Discovery Cycle:

  1. A website author adds tags and text to a page appropriate to the content
  2. A user searches for terms, dates and locations in a search engine, such as Google or Bing (Discovery)
  3. After getting the results, the user reads the descriptions of the appropriate sites to decide which sites to read or ignore (Description)
  4. The user clicks the link to a page they are interested in (Identification)
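To make that mapping a little more concrete, here is a minimal Python sketch of the three stages running over a tiny in-memory registry. The registry entries, field names and function names are all invented for illustration; a real registry would sit behind a proper search service.

```python
# Hypothetical registry; field names and entries are invented for illustration.
REGISTRY = [
    {"uri": "http://example.org/data/rainfall-2009",
     "name": "City rainfall 2009",
     "description": "Monthly rainfall totals for each capital city.",
     "keywords": {"rainfall", "weather"}, "year": 2009},
    {"uri": "http://example.org/data/rainfall-2010",
     "name": "City rainfall 2010",
     "description": "Monthly rainfall totals for each capital city.",
     "keywords": {"rainfall", "weather"}, "year": 2010},
    {"uri": "http://example.org/data/prices-2010",
     "name": "Retail prices 2010",
     "description": "Average retail prices for common goods.",
     "keywords": {"prices"}, "year": 2010},
]


def discover(registry, keyword, year):
    """Discovery: narrow the whole registry using standardised fields."""
    return [e for e in registry
            if keyword in e["keywords"] and e["year"] == year]


def describe(entries):
    """Description: human-readable summaries the user can read and cull."""
    return [f'{e["name"]}: {e["description"]}' for e in entries]


def identify(entry):
    """Identification: the single identifier used to retrieve the data."""
    return entry["uri"]


candidates = discover(REGISTRY, keyword="rainfall", year=2010)
for line in describe(candidates):
    print(line)
print(identify(candidates[0]))
```

Each function only touches the fields that belong to its stage, which is the division of labour the cycle is meant to encourage.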

In summary, good discovery metadata is about finding the balance of information needed to help users find what they need with minimal effort and maximum results. Ultimately, however, this means understanding who your users are, what they are trying to find, and how they want to search for data – but people can be a lot harder to understand than data, I’m afraid.

* Although not the DDI you might be thinking of, which is a good metadata standard but doesn’t explain how users search for their data. However, when it comes to standards that help statistical data providers describe their work, the Data Documentation Initiative is probably the best tool for helping providers produce the information needed to help users through the Data Discovery Cycle.

63 years of Australian CPI data

The latest Australian CPI figures were released last week by the Australian Bureau of Statistics.

Fortunately, each release includes re-weighted indices for each indicator. These cover 11 major topics, including food, clothing, housing and education. Unfortunately, the Excel spreadsheets these are distributed in aren’t the easiest format to process data from. This is because exporting data from Excel can be time consuming, and the data as it is stored in Excel neglects the hierarchical structure of the CPI indicators.

To help change this, I wrote a few scripts (and lovingly hand-crafted some XML) to help transform the CPI from Excel into a DSPL dataset, and have uploaded this into the Google Public Data Explorer.

The dataset that has been uploaded is based on Table 12 from the Downloads section of the latest CPI page. This dataset includes the indices for each capital city, and Australia, for each level of indicator – from the broad total CPI to, for example, the more specific “Food”, “Bread and Cereals” and finally “Breads”. There are 12 broad topics covered in total, including a miscellaneous group of indices that exclude some of the 11 topics, and there are 144 topics at the finest detail.

The end result is something like this:

If you want to play with the whole dataset, it is available on the Google Public Data Explorer, or if you would like to download the full DSPL dataset, that is available on the DSPL-R downloads page.

Probably one of the more interesting parts was how to create the hierarchical CPI indicators category in DSPL, but I’ll be following this post up later in the week with a tutorial on how to work with complex datasets.
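In the meantime, here is a small Python sketch of one common way to hold a hierarchy like this: a table where each topic points at its parent. This is an illustration of the idea only, not necessarily how the uploaded DSPL dataset encodes it, and the root label "Total CPI" is my own shorthand.

```python
# child -> parent; None marks the root of the hierarchy.
# Only the topic names quoted in this post are used; "Total CPI" is shorthand.
PARENT = {
    "Total CPI": None,
    "Food": "Total CPI",
    "Bread and Cereals": "Food",
    "Breads": "Bread and Cereals",
}


def path_from_root(topic):
    """Walk the parent pointers and return the path from the root to topic."""
    path = []
    while topic is not None:
        path.append(topic)
        topic = PARENT[topic]
    return path[::-1]


def children(topic):
    """Return the topics whose parent is the given topic."""
    return [child for child, parent in PARENT.items() if parent == topic]


print(" > ".join(path_from_root("Breads")))
print(children("Food"))
```

A flat parent column like this fits in a single CSV file, which is handy when the rest of the data is tabular too.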

Update: With the help of a kind statistician from Google the dataset is now much better structured. The updated dataset is available here: http://code.google.com/p/dspl-r/downloads/list

Using metadata within statistical software

Today is the release of a beta of a package I am writing for the R statistical package to make it easier for researchers to utilise metadata within R and to make it more worthwhile for statisticians to provide metadata.

Most of the methods for importing data into R rely solely on undocumented data; in fact, one of the most common ways to import data is through raw CSVs. However, with the release of DSPL.R it is now possible to browse the metadata of a dataset within a statistical package.

For example, the following is output from the US Retail Sales dataset provided by Google:

> print (prep.dspl("~/example/census-retail-sales.zip"))
DSPL Dataset - For more info see: [www.kidstrythisathome.com/dspl.r]
------------                  or: [code.google.com/apis/publicdata/]

Name : Retail Sales in the U.S.
Description : Monthly Retail Trade and Food Services report
            for the United States. This dataset was prepared by Google based
            on data downloaded from the U.S. Census Bureau.
Concepts : 3  -  Type of business, Seasonality, Retail Sales Volume
Slices   : 1  -  retail_sales_business
Tables   : 3  -  businesses, seasonalities, retail_sales_business_tbl
Topics   : 3  -  Industry, Business, Gender

As this example shows, a user is able to load in a new dataset and get an immediate sense of what the dataset contains. By letting users understand the meaning behind a dataset without leaving the statistical environment, the package allows them to work seamlessly with their data and metadata within the same interface.

While DSPL is seen as a newcomer to the statistical world, and R is perceived (albeit wrongly) to be inferior to more established commercial statistical tools, the agility of R and the brevity of the DSPL standard act as a strong indicator of how, given time, statistical metadata could become an integral part of all statistical processes.

DSPL, SDMX and the future of Data

It was recently announced that Google has made their Public Data Explorer open to the public, so now anyone can upload data. While the data that they have made available is interesting, what is more interesting is the much subtler announcement of their DataSet Publication Language (DSPL).

DSPL is a data/metadata language specification that basically allows people to describe multi-dimensional, aggregate datasets, along with their appropriate metadata, in a structured way. For those of you playing along at home, this is almost identical to the ideas behind SDMX. Both languages support datasets by essentially defining compound keys of dimensions and their associated measures – dimensions and metrics in DSPL and key families and measures in SDMX. However, there are four significant differences between the two standards which will affect which one sees wider adoption.

1: SDMX was a collaborative effort to meet the needs of a wide banking and statistical community, DSPL was made solely by Google to accommodate the Public Data Explorer.
For good or bad, when making DSPL, Google only had to meet their own needs: design a standard that helps people write data for the Public Data Explorer, ideally one that is easy to use. SDMX, on the other hand, has had a lot of hands pulling at it, which means the standard, as well written as it is, is overly complex and meets more needs than any one agency is ever likely to encounter. The important thing about the size of the spec is that it has a near-linear relationship to the size of the documentation, which means…

2: DSPL’s documentation is vastly shorter than SDMX’s.
When I first began looking at SDMX, the documentation was enormous. To be able to go through the simple task of making a dataset, and adding its metadata required reading through reams of documentation. In the end it was possible, but not without a significant investment of time, and for many people if the time to produce something productive is too long they will start to look elsewhere. However, the documentation for DSPL is light, and easy to understand, such that it took me much less time to start working with the standard productively *. However, one of the main reasons the documentation is so small is because…

3: SDMX manages all data and metadata in XML, DSPL uses XML and CSV
This leads to a huge difference in understanding how to start working with data. DSPL works from the same idea as SDMX: that data should be defined according to its dimensions, and that these dimensions should be described in XML to allow people and machines to understand them. However, DSPL manages all its data, i.e. the actual numbers, as tables in Comma Separated Value (CSV) files, while SDMX serialises the data in XML constructs – this is a massive difference.

Any programmer worth their salt can process CSVs: readline, split by commas, do things with values, repeat. There are standard libraries for managing CSVs, they can be imported into most RDBMSs, and the format is almost a de facto standard for data – even Excel supports it. This means that to write a tool to work with DSPL, one only needs to learn one new thing, rather than two. It also acts as an improvement to an existing technology, rather than a completely new idea.
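To show just how little is involved, here is a minimal Python sketch that reads a small table with the standard library's csv module and sums one column. The column names and figures are invented stand-ins, not taken from any real DSPL dataset.

```python
import csv
import io

# Stand-in for a data table; the column names and values are invented.
SAMPLE = """city,year,index
Perth,2009,100.0
Perth,2010,102.5
Sydney,2010,103.0
"""


def read_rows(handle):
    """Read a CSV table into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


rows = read_rows(io.StringIO(SAMPLE))

# "Do things with values": add up the index for every 2010 row.
total_2010 = sum(float(row["index"]) for row in rows if row["year"] == "2010")
print(total_2010)
```

Swapping io.StringIO for open("some_table.csv", newline="") is all it takes to read from a file instead.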

4: DSPL is backed by Google, SDMX isn’t
This is probably the biggest difference of all. Despite the fact that SDMX may be managed by some pretty heavyweight players, next to Google they look like small fry. Google also has the advantage that it has visibility with developers and that will make the big difference between the two. A standard could be the best designed, easiest to learn, and most useful, but if no-one is writing tools for it, if no-one is promoting it, if there aren’t the resources to work with it, then it is doomed to fail. Google has the ability, experience and resources to drive this groundswell.

So where does this leave our two standards?

Well, hopefully, it leaves them in a position to work with each other. DSPL acts as a great entry point into the more complex SDMX metadata requirements – an SDMX-Lite almost. This is especially of note, considering that with such a large overlap, it would take little effort to write a tool to convert data between the two standards – albeit with a lossy conversion from SDMX to DSPL.

However, for either to ignore that the other exists is a dangerous proposition. With each format having the potential to tap into slightly different markets, a division between them would present a split between people who ultimately have similar goals.

* One could argue that I was able to understand DSPL so quickly because I had already learned how SDMX worked. One wouldn’t be wrong, but the point about the difference in documentation holds.

Reality Doesn’t Fit RDF

I’ve recently been researching how to map an information model into a useable data format, and one of the first ideas that crossed my mind was using RDF as the serialization format. While I spent a lot of time trying to figure out how best to map the model to RDF, and it was possible, it just wasn’t technically feasible.

One of the main issues I found with RDF was that while there is a lot of theory behind it (and there is a lot of theory), there is very little practice to back it up. You can create an ontology to describe an information model, but I found no way to validate a graph against it. Triple-stores, while fast, each relied on their own slightly different implementation of SPARQL. And while RDF graphs support the idea of concise descriptions of objects, these aren’t quite the same as having a complex object model.

Probably the biggest issue I had was the idea of anonymous nodes, and how they are tolerated to allow complex objects. The basic example of this is a name made of multiple parts, eg:

me:Sam foaf:name _:name1
_:name1 foaf:first_name "Sam"
_:name1 foaf:last_name "Spencer"

In the above example “_:name1” is a node all on its own. It can be referred to, and some triple-stores even allow it to be used by multiple identified nodes, except that it doesn’t really exist. It’s a fake, a phoney, a thing that doesn’t really exist outside of the relationship “Sam has a name”. Granted, other people will be called “Sam Spencer” and you may even want to be able to find all people with the same name, but in a lot of instances a name is just a very abstract concept, a semi-unique identifier.

Wouldn’t it be simpler to go:

thisSam = {name: {first: "Sam",
                  last:  "Spencer"}
          }

Here we can see that this name belongs to the object “thisSam”, and thisSam’s first name is thisSam.name.first, or more simply “Sam”. The ownership of this sub-property is very specific about where it goes.

As the saying goes, this name is mine, there are many like it, but this one is mine.
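The two shapes are closely related, which a short Python sketch can show by flattening the nested object into triples and inventing a blank-node label for every nested part. The function name and the "_:b1" labelling scheme are my own; this is not how any particular triple-store does it.

```python
import itertools


def to_triples(subject, obj, counter=None):
    """Flatten a nested dict into (subject, predicate, object) triples,
    minting a blank-node label for every nested dict."""
    counter = counter if counter is not None else itertools.count(1)
    triples = []
    for key, value in obj.items():
        if isinstance(value, dict):
            blank = f"_:b{next(counter)}"  # anonymous node for the nested part
            triples.append((subject, key, blank))
            triples.extend(to_triples(blank, value, counter))
        else:
            triples.append((subject, key, value))
    return triples


this_sam = {"name": {"first": "Sam", "last": "Spencer"}}

for triple in to_triples("me:Sam", this_sam):
    print(triple)
```

The nested object needs no invented label at all, while the triple form has to make one up just to tie the parts of the name back to their owner.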

Can we find all the Roman toilets in ancient Jordan?

So Sam, what exactly do you do at uni?

Well, I’m glad you asked hypothetical reader, I solve practical problems. For example, for the last 2 months I along with 5 of my peers have been working with the Archaeology department at the University of Western Australia to design a tool to help field researchers track and visualise dig sites using Google Earth.

This work appears to have caught the eye of the publishing group at UWA and it was mentioned in a recent edition of the UWA News.

The tool will allow researchers to track and upload site information (including such information as “was this site a toilet”) via a webpage and have this be instantly viewable back here by researchers at UWA. So yes, in the near future UWA researchers will be able to easily find out where ancient Romans pooped when they were holidaying in Jordan!

Hating <table>s considered harmful

Apologies to Dijkstra for butchering that quote again, but the rage against <table>s is getting out of hand. Back in the dark ages, when it was impossible to get consistent HTML rendering across browsers and platforms, someone decided to use the <table> element, originally designed to mark up tabular data, to lay out webpages.

Ever since that moment <table>s have been an unfairly shunned part of the HTML spec. When CSS became a stable and supported spec many people started screaming from the rooftops “Stop using <table>, it ruins webpages”, and too many people have taken this to heart, so much so that any time tabular data needs to be presented, bizarre alternatives are used.

This was an example of a bad table of data that caught my eye a while ago:

[Screenshot of the original table, kept for posterity]

It’s supposed to give an idea of the cumulative revenue of iPhone Apps over time, but it’s hard to tell that as there is no title. Good data tells a story, good metadata explains that story. There is no metadata here to explain any of these numbers. Here is what a search engine sees when it looks at the HTML on that page:

Period ending.....Period downloads.....Cumulative downloads....Period revenues
Jun 2008............no apps...................no apps........................no revenues
Dec 2008.............600 M......................600 M..........................$ 172 M
Jun 2009..............800 M....................1.4 B.............................$  228 M
Dec 2009..........1.6 B.........................3.0 B............................$  458 M
Jun 2010...........2.0 B.........................5.0 B............................$  542 M
Total.................5.0 B.........................5.0 B............................$1.4 B

Accessibility and search robots aside, there is no context to any of this: the units change between rows and the number of dots changes each time. It’s hard to say what this data is without purposefully reading the text around it. I could drone on about the lack of metadata describing the data in this table, but instead I’ll counter this with a better example.

Here is the same table, rewritten as an actual <table>, using only what’s defined in the HTML4 and above W3C specifications:

A few things instantly stand out: the columns all match up nicely, both in display and in the table-model; each of the years is clearly marked; the table is relatively self-explanatory; and lastly it appears as a table to anyone who reads it. Furthermore, all of the formatting is done using CSS; that’s right, the presentation is left to the CSS and the descriptive markup is done using a <table>, just how it is intended to be done. But even this is only a fraction of the possible metadata to place into an HTML table.
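For anyone who would rather generate that kind of markup than type it, here is a rough Python sketch that builds a plain <table> from the figures in the dotted version above. The caption text is my own guess at a title, since the original had none, and the function name is made up.

```python
from html import escape

# Caption text is an assumption; the original page gave the data no title.
CAPTION = "iPhone app downloads and revenues by period"
HEADERS = ["Period ending", "Period downloads",
           "Cumulative downloads", "Period revenues"]
ROWS = [
    ["Jun 2008", "no apps", "no apps", "no revenues"],
    ["Dec 2008", "600 M", "600 M", "$172 M"],
    ["Jun 2009", "800 M", "1.4 B", "$228 M"],
    ["Dec 2009", "1.6 B", "3.0 B", "$458 M"],
    ["Jun 2010", "2.0 B", "5.0 B", "$542 M"],
    ["Total", "5.0 B", "5.0 B", "$1.4 B"],
]


def build_table(caption, headers, rows):
    """Return an HTML table with a caption, column headers and row headers."""
    head = "".join(f'<th scope="col">{escape(h)}</th>' for h in headers)
    body = []
    for first, *rest in rows:
        cells = "".join(f"<td>{escape(c)}</td>" for c in rest)
        body.append(f'<tr><th scope="row">{escape(first)}</th>{cells}</tr>')
    return (
        "<table>\n"
        f"<caption>{escape(caption)}</caption>\n"
        f"<thead><tr>{head}</tr></thead>\n"
        "<tbody>\n" + "\n".join(body) + "\n</tbody>\n"
        "</table>"
    )


table_markup = build_table(CAPTION, HEADERS, ROWS)
print(table_markup)
```

The caption and the scope attributes are the metadata here: they tell a browser, a screen reader or a crawler what the table is about and which cell labels which.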

However, this is not the worst example of a ‘table’ I have seen; the one that sent me over the edge was this large image that highlights the relative pros and cons of different smart phones:

A very large table comparing different smartphones

This 'table' is too big to show inline, it needed to be reduced to a third of its size to fit

The main issue with this is that the image of a table is mostly text, and is quite large and static. You cannot easily copy text out of it, you can’t easily reorder the table, and unless you have a large screen it would be quite difficult to compare two products that weren’t both very close to each other. Unlike the table of iPhone revenues, I am not going to go to the effort of transcribing this into a proper HTML table.

Compare this with a similar table from Wikipedia also providing a comparison of smartphones and the differences are obvious. First of all, the table is now web-crawlable – whatever data is in this table is now indexed in Google, instant bonus, and the user can easily search through the page to find what they need to know. There is also a whole lot of JavaScript on the page. For example, clicking the boxes next to each of the column titles reorders the whole table around that column. In the first table this isn’t too noticeable with most entries having either ‘Yes’ or ‘No’, but the Hardware and OS table from further down the same page is full of figures, and now I can easily find the lightest phone (it’s the HP iPAQ Voice Messenger, something that would be much harder to find out on the image ‘table’).

Furthermore, there are plenty of ways to easily add functionality to tables, such as reordering, hiding or expanding rows and columns, that make your tables far more useful.
The web is supposed to be full of life and allow us to use our data in exciting new ways, so don’t stick with static images, or unstructured text for your tables when there is a much more useful alternative out there.

Cornell Trip Report Days 3-5

So, I’m sitting in the Sydney Qantas Business Lounge, with plenty of time until my connecting flight home, so I’ll recap the last few days.

From what people had to say my presentation went quite well. The script I worked from is now available online, and the slides will soon follow. As soon as I have updated the slides with the transcript I will upload them and update this post.

There were however plenty of excellent talks at the IASSIST Conference this year. Due to the theme there was an abundance of social networking talks, including the keynote speaker on Friday, and they all presented excellent angles on the same question: “How can we get more people interacting with our data better?”

I think this was a great take on where data and metadata agencies are heading because, as I have said before, the more we have people interacting with data agencies, and using the data and metadata they provide, the more relevant those agencies become. The fact that there are so many agencies asking this question, working on their own solutions and willing to share those ideas is a great step forward for open data.

Cornell Trip Report – Day 1 & 2

So in the midst of preparing for Cornell, Uni work and wedding planning I haven’t had time to update, but taking a breather in a beautiful hotel room is a perfect time to get back to blogging.

I arrived in Ithaca yesterday, and the trip up was gorgeous. I would say pun intended, but I didn’t know Ithaca was famous for its gorges until I got here. Spent the afternoon with folks from the DDI Alliance Expert Committee. That was great; I didn’t say much, but listening to some extraordinarily smart people and having them do the same to me was an honour. Sadly, after dinner I had to retire with a killer fog of jetlag and nearly immediately fell asleep.

This morning was much more eventful. The lovely staff at the Statler Hotel were able to get me a cord to charge my camera with, and with that I set off to explore the campus and surrounds. I walked round the campus before setting off for Collegetown. As my camera only had a short charge all my shots of Collegetown are on my phone, whose cord is absent, so until I find it or figure out how to get the shots off, no photos from there yet.

But that’s ok, as the best shots were down by Cascadilla Gorge. The shots were great, and by the bend down the end there is (according to locals) a little wading area. The track along the creek from there was spectacular and green after last night’s rain.

The only downside is I twisted my ankle some time during the walk, but I’m all strapped up and about to head off for lunch and see Jeremy Iverson from Algenta talk about Colectica.