Posts Tagged ‘Metadata’

Virgil UI – Beta demo video

Just a quick update that was supposed to have gone up last night. There is a video up on YouTube now, showing off some of the more finalised features of Virgil-UI.

This shows three big features – CSV import, drag-and-drop reordering of classifications and multilingual support for editing. This means a classification with a multilingual component, for example a Canadian Industry Classification, could have its English and French components edited simultaneously.

As stated in the last post, a Windows binary release of a beta version of Virgil-UI and an updated version of the converter tool should be released in early September.

Virgil UI – Converting from legacy to CSV to DDI

While my main machine has been out of action, I’ve devoted more time to one of the first use cases that prompted the development of Virgil – transforming legacy CSVs into DDI 3.1.

One of the main features of Virgil is the ability to help users transition from legacy systems using non-standard formats to using DDI as the main data language for managing codes, categories and classifications. Unfortunately, there is no way for any one system to support every format for classifications. However, by targeting a lowest common denominator we can process the bulk of the work. In this case the lowest common denominator is CSV.

If a user or developer of a legacy system is able to transform their legacy format into one of several different CSV formats supported by Virgil, then they will be able to import at least the basic structure and metadata of their codes and classifications into DDI. With most of the code for the conversion tools done, I’ve begun putting together the wizard interface for Virgil UI, which will also form part of a standalone conversion tool. Within the next few weeks the standalone conversion tool will be ready for release, and made available as open source with the supporting code.

Below is a list of questions that users and developers may have about how to prepare CSVs for conversion to DDI, covering the convertible metadata, preferred CSV structure and developer support. Although the CSV structuring options for conversion are limited, if there is a need to expand the formats or metadata available for conversion, make your needs known and this can be incorporated into future development.


What metadata will be supported?

A user will be able to import the code values and the hierarchy of a classification, as well as labels and descriptions of categories. Labels and descriptions can be multilingual, and multiple languages per item are able to be imported.

Will I have to use Virgil-UI to use this converter?

No. This converter will be available as a wizard within Virgil, but the UI for the wizard will also be available as a standalone program for users who need to convert from a legacy system to DDI. In addition, as the code will be entirely open-sourced, the Python module that performs the transformations will be importable into any other piece of Python software. Lastly, since the converter module is written entirely using modules from the Python standard library, it will be usable from other languages that have compatible Python implementations – such as Java using Jython[http://www.jython.org/] or .Net using IronPython[http://ironpython.net/].

In summary there will be at least four ways developers and users will be able to implement the Virgil CSV-DDI converter tools.

What ‘formats’ of CSV will be supported?

CSVs are generally without structure, and are just a basic way of storing tabular data, but a simple combination of the following code and category forms gives a CSV enough structure to be converted. When picking a structure, it is important that the ‘code’ columns come before any ‘category’ columns. However, any combination of a code and category column format, if created correctly, should convert from CSV to DDI without trouble.

Column options for importing codes and their hierarchy

Referential CSV Codelist
Order: Code , Parent
Notes: This can be reversed to go Parent, Code. If a parent is blank it is assumed that this node is a top level code in a CodeScheme

Example:
A, ,
1,A,
2,A,
B, ,
3,B,
4,B,
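
As a rough sketch of how a referential codelist like the one above could be read (this is not Virgil's actual converter code; the function name `parse_referential` and the returned dictionary shape are assumptions made for illustration), the Python standard library is enough:

```python
import csv
import io


def parse_referential(text, delimiter=","):
    """Map each parent code (None for top-level codes) to its child codes.

    Assumes the 'Code, Parent' column order described above.
    """
    children = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        cells = [cell.strip() for cell in row]
        if not cells or not cells[0]:
            continue  # skip blank lines and rows without a code
        code = cells[0]
        # A blank parent marks a top-level code in the CodeScheme.
        parent = cells[1] if len(cells) > 1 and cells[1] else None
        children.setdefault(parent, []).append(code)
    return children


sample = "A, ,\n1,A,\n2,A,\nB, ,\n3,B,\n4,B,\n"
tree = parse_referential(sample)
print(tree)
```

Keying the result by parent keeps the sketch short; a real converter would presumably build DDI code elements instead of a plain dictionary.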

Semi-structured CSV Codelist

Order: (Empty,)*Code,
Notes: If the code is the first entry in a row then it is considered a top code in the CodeScheme. Any children of a code should be indented by only one column. The columns for labels and descriptions start in different columns depending on level of the hierarchy.

Example:
A,
 ,1,
 ,2,
B,
 ,3,
 ,4,
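
A minimal sketch of reading this indented form, assuming the first non-empty cell in a row is always the code and its column index is its depth in the hierarchy. The function name `parse_indented` is hypothetical, not part of Virgil:

```python
import csv
import io


def parse_indented(text, delimiter=","):
    """Return (code, parent) pairs, using the code's column index as its depth."""
    pairs = []
    stack = []  # stack[d] is the most recent code seen at depth d
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        cells = [cell.strip() for cell in row]
        depth = next((i for i, cell in enumerate(cells) if cell), None)
        if depth is None:
            continue  # nothing on this row
        if depth > len(stack):
            # Children may only be indented one column past their parent.
            raise ValueError("row indented too far: %r" % (row,))
        code = cells[depth]
        del stack[depth:]  # forget codes at this depth or deeper
        parent = stack[-1] if stack else None
        pairs.append((code, parent))
        stack.append(code)
    return pairs


sample = "A,\n ,1,\n ,2,\nB,\n ,3,\n ,4,\n"
pairs = parse_indented(sample)
print(pairs)
```

Because only the position of the first non-empty cell matters, the same approach should also cope with rows that carry extra padding or label columns after the code.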

Aligned Semi-structured CSV Codelist

Order: (Empty,)*Code,(Empty,)*
Notes: If the code is the first entry in a row then it is considered a top code in the CodeScheme. Any children of a code should be indented by only one column. All nodes should be padded so that the columns for labels and descriptions start in the same columns.

Example:
A, ,
 ,1,
 ,2,
B, ,
 ,3,
 ,4,

Column options for importing multilingual categories

Prefix-embedded Language

Order: (Label,Description)+
Notes: As many languages as needed can be repeated within the row as long as they have unique language codes.

Example: en-au;Chocolate,en-au;Confectionery based on the seed of the cacao plant,fr;Chocolat,fr; Confiseries à base de la graine de la plante de cacao
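
To illustrate, here is a small sketch of splitting prefix-embedded cells into per-language labels and descriptions. It assumes, based only on the example above, that the language code is separated from the text by a semicolon and that cells come in label/description pairs; `parse_prefixed` is a made-up name for this post:

```python
import csv
import io


def parse_prefixed(cells):
    """Group 'lang;text' cells (label, then description) by language code."""
    result = {}
    for i in range(0, len(cells) - 1, 2):
        lang, label = cells[i].split(";", 1)
        desc_lang, description = cells[i + 1].split(";", 1)
        if lang.strip() != desc_lang.strip():
            raise ValueError("label and description languages differ")
        result[lang.strip()] = {
            "label": label.strip(),
            "description": description.strip(),
        }
    return result


line = (
    "en-au;Chocolate,en-au;Confectionery based on the seed of the cacao plant,"
    "fr;Chocolat,fr; Confiseries à base de la graine de la plante de cacao"
)
cells = next(csv.reader(io.StringIO(line)))
categories = parse_prefixed(cells)
print(categories["fr"]["label"])
```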

Pre-defined Column

Order: (language,Label,Description)+
Notes: As many languages as needed can be repeated within the row as long as they have unique language codes.

Example: en-au,Strawberries,Tasty fruit that isn't a true berry,fr,Fraise,Fruits savoureux qui n'est pas une baie vrai
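
The pre-defined column form can be read in groups of three cells. A minimal sketch, assuming the (language, Label, Description) order given above; `parse_language_columns` is a hypothetical helper, not the converter's real interface:

```python
import csv
import io


def parse_language_columns(cells):
    """Read repeated (language, label, description) triples into a dict."""
    result = {}
    for i in range(0, len(cells) - 2, 3):
        lang, label, description = (cell.strip() for cell in cells[i:i + 3])
        if lang in result:
            # Each language code may only appear once per row.
            raise ValueError("duplicate language code: %s" % lang)
        result[lang] = {"label": label, "description": description}
    return result


line = (
    "en-au,Strawberries,Tasty fruit that isn't a true berry,"
    "fr,Fraise,Fruits savoureux qui n'est pas une baie vrai"
)
cells = next(csv.reader(io.StringIO(line)))
labels = parse_language_columns(cells)
print(labels["en-au"]["label"])
```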

Monolingual

Order: (Label,Description)
Notes: When only importing a single language that isn’t expressed in the CSV a default language will need to be given when invoking the converter.

Example: Vegemite,A yeast extract spread only edible by people from Australia. No other translations exist because no one else can stand it.

Can this tool support tab-separated files?

Yes. In the wizard users will be given the opportunity to select from a range of delimiter options or enter their own delimiting character. When using this module in other code, it will also support any delimiter as long as it is specified when calling the module.
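
The converter's own calling convention isn't shown here, but the underlying idea can be sketched with the standard `csv` module, which accepts a `delimiter` argument. The `read_rows` helper and the sample labels below are invented for illustration:

```python
import csv
import io


def read_rows(text, delimiter=","):
    """Parse delimited text into rows of whitespace-trimmed cells."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [[cell.strip() for cell in row] for row in reader]


# A tab-separated referential codelist with a label column (made-up data).
tab_sample = "A\t\tAgriculture\n1\tA\tCrop growing\n"
rows = read_rows(tab_sample, delimiter="\t")
print(rows)
```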

How should a developer write CSV for the converter?

With no agreed-upon standard for CSVs it’s hard for developers to try and write ‘standard’ CSVs. To simplify development and be as lenient as possible, the Virgil CSV-DDI converter uses the Python CSV module[http://docs.python.org/library/csv.html]. If you are writing your own CSV writer I’d suggest testing it against this module to make sure it works.

In a nutshell though – leading and trailing whitespace is trimmed, and any entry that contains a comma (or the specified delimiter) should be quoted with double (“) or single (‘) quote marks.
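
One way to test a writer against the Python CSV module is a simple round trip: write the rows, read them back, and check nothing changed. This is only a sketch with made-up data, and it uses the module's default double-quote character:

```python
import csv
import io


def write_rows(rows):
    """Write rows as CSV text, quoting only the cells that need it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# The label contains a comma, so it has to be quoted to survive the trip.
rows = [["A", "", "Agriculture, forestry and fishing"]]
text = write_rows(rows)
round_trip = list(csv.reader(io.StringIO(text)))
print(text)
```

If `round_trip` differs from `rows`, the writer is producing something the reader will not interpret the way you intended.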

What will the wizard and standalone converter look like?

Something like this:

PyQT Mockup of the CSV/DDI Converter

Help users find your data with the “Data Discovery Cycle”

Prologue – A story of data discovery

Users routinely search for information on the web, and it is no different to any other form of discovery, be it digital or physical. To help cement this idea, consider this simple chain of events.

  1. A website author adds tags and text to a page appropriate to the content
  2. A user searches for terms, dates and locations in a search engine, such as Google or Bing
  3. After getting the results, the user reads the descriptions of the appropriate sites to decide which sites to read or ignore
  4. The user clicks the link to a page they are interested in and reads more.

By going through these steps the user has been able to quickly narrow down potentially millions of pages of information, and find the most appropriate one for their needs. Could you say that a user searching through your data could do the same?

The Data Discovery Cycle

Good metadata is about leading users through the Data Discovery Cycle – Discovery, Description and Identification, or DDI *. These are the three steps users go through to find the data they need. If you know where something is, you won’t search for it. If you don’t know what something is you won’t care where it is.

Discovery

Discovery metadata is the first piece of information that helps a user find what they want. When a user is at the Discovery stage of the Data Discovery Cycle they know what they want to find, but don’t know what or where it is yet. As a data provider, it is your role to help a user find what that is, and it starts by answering the five W’s that users may want to know about data. For example, a user may need to know some of the following to begin narrowing down the data:

  • Who gave (or collected) the data?
  • What data was gathered?
  • When was the data collected?
  • Where was the data gathered?
  • Why was this data collected?

‘How’ has been left off this list because how something occurred can be very complicated or domain-specific, and Discovery metadata should be relatively standardised across domains. Also of note is that the questions a user asks can be very specific, and although there are standard ways to express this kind of information, building effective services over these standards comes down to understanding who is trying to find your data.

An example of an excellent discovery metadata standard is Dublin Core. Using just 15 base fields, Dublin Core allows a provider to capture the essence of a piece of data, so that widely differing pieces of information can be placed in one registry and users can find what they need across many different areas.

Description

Descriptive metadata is information that helps a user narrow down what they are trying to find. At the Description phase a user has begun narrowing the field of data, and begins investigating specific data sources. Once a user has culled the whole field of data down to a short list of related information they want, they can examine the descriptions of data that matches their criteria and narrow the field of results even further.


Examples of descriptive metadata that help at this stage include:

  • The name of the data (which is different from an identifier)
  • Brief and in-depth descriptions of the data
  • Specific labels users may know the data as

This is the least machine-actionable of all three steps, and the most likely to be in a written language. What is important is that this information helps users get a better feel for what the data is about and allows them to cull or keep the data that is most relevant. Once a user has narrowed down what they need, they can move on to actually retrieving it.

Identification


At its simplest, Identification metadata can be a single Uniform Resource Identifier or a series of complex identifiers. No matter how identification metadata is managed, it is still able to pinpoint a single piece of information. Once a user has reached the Identification stage they have found what they want, and now need to be able to locate and retrieve the actual data. Previous pieces of metadata that helped them search through data can take on new roles outside of the Data Discovery Cycle.

Probably the best identification standard was mentioned above, the humble URI or Uniform Resource Identifier. Standard, easily resolved, infinitely extendable and widely used.

What about everything else?

Don’t take this simple division of information to mean that everything else is unimportant. Metadata that doesn’t fall into the above roles shouldn’t be discounted. For domain-specific reasons, the lists of information above can become what the user is after. However, for the process of helping a user go through the Data Discovery Cycle, everything else really does become less important, and the above distinction can help you narrow down what is the most important information to help users search through a registry of information.

So what is the point?

The point is to look at how specific metadata helps or hinders a user’s ability to find what they are after. Too much and they become overwhelmed with options; too little and it becomes too difficult to find what they need. Likewise, if users are restricted to specific values for their metadata they may misinterpret the meaning of the controlled vocabulary. But again, if they are given too much freedom, it may become impossible for anyone to find anything.

Epilogue – A story of discovery revisited

Let’s revisit our earlier story and look at how this maps to the Data Discovery Cycle:

  1. A website author adds tags and text to a page appropriate to the content
  2. A user searches for terms, dates and locations in a search engine, such as Google or Bing (Discovery)
  3. After getting the results, the user reads the descriptions of the appropriate sites to decide which sites to read or ignore (Description)
  4. The user clicks the link to a page they are interested in (Identification)

In summary, good discovery metadata is about finding the balance of information needed to help users find what they need with minimal effort and maximum results. However, ultimately this means understanding who your users are, what they are trying to find, and how they want to search for data – but people can be a lot harder to understand than data I’m afraid.

* Although not the DDI you might be thinking of, which is a good metadata standard but doesn’t explain how users search for their data. However, when it comes to standards that help statistical data providers describe their work, the Data Documentation Initiative is probably the best tool for helping providers produce the information needed to help users through the Data Discovery Cycle.

Using metadata within statistical software

Today marks the beta release of a package I am writing for the R statistical environment, to make it easier for researchers to utilise metadata within R and to make it more worthwhile for statisticians to provide metadata.

Most of the methods R offers for importing data handle only undocumented data; in fact, one of the most common ways to import data is through raw CSVs. However, with the release of DSPL.R it is now possible to browse the metadata of a dataset within a statistical package.

For example, the following is example output from the US Retail Sales dataset provided by Google:

> print (prep.dspl("~/example/census-retail-sales.zip"))
DSPL Dataset - For more info see: [www.kidstrythisathome.com/dspl.r]
------------                  or: [code.google.com/apis/publicdata/]

Name : Retail Sales in the U.S.
Description : Monthly Retail Trade and Food Services report
            for the United States. This dataset was prepared by Google based
            on data downloaded from the U.S. Census Bureau.
Concepts : 3  -  Type of business, Seasonality, Retail Sales Volume
Slices   : 1  -  retail_sales_business
Tables   : 3  -  businesses, seasonalities, retail_sales_business_tbl
Topics   : 3  -  Industry, Business, Gender

As this example shows, a user is able to load in a new dataset and get an immediate sense of what the dataset contains. Because users can understand the meaning behind a dataset without leaving the statistical environment, they are able to work seamlessly with their data and metadata within the same interface.

While DSPL is seen as a newcomer to the statistical world, and R is perceived (albeit wrongly) to be inferior to more established commercial statistical tools, the agility of R and the brevity of the DSPL standard act as a strong indicator of how, given time, statistical metadata could become an integral part of all statistical processes.

Hating <table>s considered harmful

Apologies to Dijkstra for butchering that quote again, but the rage against <table>s is getting out of hand. Back in the dark ages, when it was impossible to get consistent HTML rendering across browsers and platforms, someone decided to use the <table> element, originally designed to mark up tabular data, to lay out webpages.

Ever since that moment <table>s have been an unfairly shunned part of the HTML spec. When CSS became a stable and supported spec many people started screaming from the rooftops “Stop using <table>, it ruins webpages”, and too many people have taken this to heart, so much so that any time tabular data needs to be presented, bizarre alternatives are used.

This was an example of a bad table of data that caught my eye a while ago:

See the "pre" section below

A screenshot for posterity

It’s supposed to give an idea of the cumulative revenue of iPhone Apps over time, but it’s hard to tell that as there is no title. Good data tells a story, good metadata explains that story. There is no metadata here to explain any of these numbers. Here is what a search engine sees when it looks at the HTML on that page:

Period ending.....Period downloads.....Cumulative downloads....Period revenues
Jun 2008............no apps...................no apps........................no revenues
Dec 2008.............600 M......................600 M..........................$ 172 M
Jun 2009..............800 M....................1.4 B.............................$  228 M
Dec 2009..........1.6 B.........................3.0 B............................$  458 M
Jun 2010...........2.0 B.........................5.0 B............................$  542 M
Total.................5.0 B.........................5.0 B............................$1.4 B

Accessibility and search robots aside, there is no context to any of this: the units change between rows, and the number of dots changes each time. It’s hard to say what this data is without purposefully reading the text around it. I could drone on about the lack of metadata describing the data in this table, but instead I’ll counter this with a better example.

Here is the same table, rewritten as an actual <table>, using only what’s defined in the HTML4 and above W3C specifications:

A few things instantly stand out: the columns all match up nicely, both in display and in the table-model; each of the years is clearly marked; the table is relatively self-explanatory; and lastly it appears as a table to anyone who reads it. Furthermore, all of the formatting is done using CSS; that’s right, the presentation is left to the CSS and the descriptive markup is done using a <table>, just how it is intended to be done. But even this is only a fraction of the possible metadata to place into an HTML table.
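
As a rough sketch of the kind of markup described above (this is not the source of the original table, and the caption text is my own guess at a suitable title), a few lines of Python can turn the figures from the dotted version into a real <table> with a caption, column headers and row headers:

```python
from html import escape

HEADERS = ["Period ending", "Period downloads", "Cumulative downloads", "Period revenues"]
ROWS = [
    ["Jun 2008", "no apps", "no apps", "no revenues"],
    ["Dec 2008", "600 M", "600 M", "$172 M"],
    ["Jun 2009", "800 M", "1.4 B", "$228 M"],
    ["Dec 2009", "1.6 B", "3.0 B", "$458 M"],
    ["Jun 2010", "2.0 B", "5.0 B", "$542 M"],
    ["Total", "5.0 B", "5.0 B", "$1.4 B"],
]


def render_table(caption, headers, rows):
    """Build an HTML <table> with a caption, column headers and row headers."""
    out = ["<table>", "  <caption>%s</caption>" % escape(caption)]
    out += ["  <thead>", "    <tr>"]
    out += ['      <th scope="col">%s</th>' % escape(h) for h in headers]
    out += ["    </tr>", "  </thead>", "  <tbody>"]
    for first, *rest in rows:
        out.append("    <tr>")
        # The first cell names the row, so it is marked up as a header too.
        out.append('      <th scope="row">%s</th>' % escape(first))
        out += ["      <td>%s</td>" % escape(cell) for cell in rest]
        out.append("    </tr>")
    out += ["  </tbody>", "</table>"]
    return "\n".join(out)


markup = render_table("iPhone app downloads and revenue by period", HEADERS, ROWS)
print(markup)
```

All of the styling would still live in CSS; the markup only says what each cell is.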

However, this is not the worst example of a ‘table’ I have seen. The one that sent me over the edge was this large image that highlights the relative pros and cons of different smartphones:

A very large table comparing different smartphones

This 'table' is too big to show inline, it needed to be reduced to a third of its size to fit

The main issue is that this image of a table is mostly text, quite large and static. You cannot easily copy text out of it, you can’t easily reorder the table, and unless you have a large screen it would be quite difficult to compare two products that weren’t very close to each other. Unlike the table of iPhone revenues, I am not going to go to the effort of transcribing this into a proper HTML table.

Compare this with a similar table from Wikipedia also providing a comparison of smartphones and the differences are obvious. First of all, the table is now web-crawlable – whatever data is in this table is now indexed in Google, instant bonus, and the user can easily search through the page to find what they needed to know. There is also a whole lot of JavaScript on the page. For example, clicking the boxes next to each of the column titles reorders the whole table around that column. In the first table this isn’t too noticeable with most entries having either ‘Yes’ or ‘No’, but the Hardware and OS table from further down the same page is full of figures, and now I can easily find the lightest phone (it’s the HP iPAQ Voice Messenger, something that would be much harder to find out on the image ‘table’).

Furthermore, there are plenty of ways to easily add functionality to tables, such as reordering, hiding, expanding or changing them, to make your tables more useful.
The web is supposed to be full of life and allow us to use our data in exciting new ways, so don’t stick with static images or unstructured text for your tables when there is a much more useful alternative out there.

Cornell Trip Report Days 3-5

So, I’m sitting in the Sydney Qantas Business Lounge, with plenty of time until my connecting flight home, so I’ll recap the last few days.

From what people had to say my presentation went quite well. The script I worked from is now available online, and the slides will soon follow. As soon as I have updated the slides with the transcript I will upload them and update this post.

There were however plenty of excellent talks at the IASSIST Conference this year. Due to the theme there was an abundance of social networking talks, including the keynote speaker on Friday, and they all presented excellent angles on the same question: “How can we get more people interacting with our data better?”

I think this was a great take on where data and metadata agencies are heading. As I have said before, the more we have people interacting with data agencies, and using the data and metadata they provide, the more relevant those agencies become. The fact that there are so many agencies asking this question, working on their own solutions and willing to share those ideas is a great step forward for open data.

Cornell Trip Report – Day 1 & 2

So in the midst of preparing for Cornell, Uni work and wedding planning I haven’t had time to update, but taking a breather in a beautiful hotel room is a perfect time to get back to blogging.

I arrived in Ithaca yesterday, and the trip up was gorgeous. I would say pun intended, but I didn’t know Ithaca was famous for its gorges until I got here. Spent the afternoon with folks from the DDI Alliance Expert Committee. That was great; I didn’t say much, but listening to some extraordinarily smart people and having them do the same to me was an honour. Sadly after dinner I had to retire with a killer fog of jetlag and nearly immediately fell asleep.

This morning was much more eventful. The lovely staff at the Statler Hotel were able to get me a cord to charge my camera with, and with that I set off to explore the campus and surrounds. I walked round the campus before setting off for Collegetown. As my camera only had a short charge all my shots of Collegetown are on my phone, whose cord is absent, so until I find it or figure out how to get the shots off, no photos from there yet.

But that’s OK, as the best shots were down by Cascadilla Gorge. The shots were great, and by the bend down the end there is (according to locals) a little wading area. The track along the creek from there was spectacular and green after last night’s rain.

The only downside is I twisted my ankle some time during the walk, but I’m all strapped up, about to head off for lunch and see Jeremy Iverson from Algenta talk about Colectica.