Help users find your data with the “Data Discovery Cycle”

Prologue – A story of data discovery

Users routinely search for information on the web, and doing so is no different to any other form of discovery, be it digital or physical. To help cement this idea, consider this simple chain of events.

A website author adds tags and text to a page appropriate to the content
A user searches for terms, dates and locations in a search engine, such as Google or Bing
After getting the results, the user reads the descriptions of the appropriate sites to decide which sites to read or ignore
The user clicks the link to a page they are interested in and reads more.

By going through these steps the user has been able to quickly narrow down potentially millions of pages of information, and find the most appropriate one for their needs. Could you say that a user searching through your data could do the same?

The Data Discovery Cycle

Good metadata is about leading users through the Data Discovery Cycle – Discovery, Description and Identification, or DDI *. These are the three steps users go through to find the data they need. If you know where something is, you won’t search for it. If you don’t know what something is you won’t care where it is.

Discovery

Discovery metadata is the first piece of information that helps a user find what they want. When a user is at the Discovery stage of the Data Discovery Cycle they know what they want to find, but don’t know what or where it is yet. As a data provider, it is your role to help a user find what that is, and it starts by answering the five W’s that users may want to know about data. For example, a user may need to know some of the following to begin narrowing down the data:

Who gave (or collected) the data?
What data was gathered?
When was the data collected?
Where was the data gathered?
Why was this data collected?

‘How’ has been left off this list because how something occurred can be very complicated or domain-specific, and Discovery metadata should be relatively standardised across domains. It is also worth noting that the questions a user asks can be very specific, and although there are standard ways to express this kind of information, building effective services over these standards comes down to understanding who is trying to find your data.

An example of an excellent discovery metadata standard is Dublin Core. Using just 15 base fields, Dublin Core allows a provider to capture the essence of a piece of data. Widely differing pieces of information can then be placed in one registry, so users can find what they need across many different areas.
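To make this a little more concrete, here is a minimal sketch of a discovery record built from Dublin Core element names. The mapping from the five W’s to elements is my own assumption for illustration (Dublin Core has no dedicated ‘why’ field, so ‘description’ stands in), and the record itself is made up.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})

# Hypothetical mapping from the five W's to Dublin Core elements.
# This is an illustrative assumption, not part of the standard.
FIVE_WS = {
    "who": "creator",
    "what": "subject",
    "when": "date",
    "where": "coverage",
    "why": "description",
}

# A made-up discovery record for illustration only.
record = {
    "title": "Regional rainfall observations",
    "creator": "Example Weather Agency",
    "subject": "rainfall",
    "date": "2010",
    "coverage": "Tasmania",
    "description": "Collected to monitor drought conditions.",
    "identifier": "http://example.org/data/rainfall-2010",
}


def unknown_elements(rec):
    """Return, sorted, any keys in the record that are not Dublin Core elements."""
    return sorted(set(rec) - DUBLIN_CORE_ELEMENTS)


def answer(rec, question):
    """Look up the element assumed to answer one of the five W's (None if absent)."""
    return rec.get(FIVE_WS[question.lower()])


print(answer(record, "Where"))
print(unknown_elements(record))
```

Because every record uses the same small set of fields, a rainfall dataset and a census table can sit in the same registry and be searched in the same way.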

Description

Descriptive metadata is information that helps a user narrow down what they are trying to find. At the Description phase a user has begun narrowing the field of data, and begins investigating specific data sources. Once a user has culled the whole field of data down to a short list of related information, they can examine the descriptions of the data that matches their criteria and narrow the field of results even further.

Examples of descriptive metadata that help at this stage include:

The name of the data (which is different from an identifier)
Brief and in-depth descriptions of the data
Specific labels that users may know the data by

This is the least machine-actionable of all three steps, and the most likely to be in a written language. What is important is that this information helps users get a better feel for what the data is about and allows them to cull or keep the data that is most relevant. Once a user has narrowed down what they need, they can move on to actually retrieving it.
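Even though descriptions are written for people, a registry can still help with the culling. The sketch below is one simple way to do it, assuming each entry carries a name, a description and a list of labels; the field names, the entries and the plain substring match are all illustrative assumptions rather than a recommended search design.

```python
def matches(entry, term):
    """True if the term appears in the entry's name, description or labels."""
    term = term.casefold()
    texts = [entry["name"], entry["description"], *entry.get("labels", [])]
    return any(term in text.casefold() for text in texts)


def shortlist(entries, term):
    """Keep only the entries whose descriptive metadata mentions the term."""
    return [entry for entry in entries if matches(entry, term)]


# Made-up registry entries for illustration only.
entries = [
    {
        "name": "Regional rainfall observations",
        "description": "Daily rainfall totals from rural gauges.",
        "labels": ["rain gauge data"],
    },
    {
        "name": "Census population counts",
        "description": "Population by region.",
        "labels": ["census"],
    },
    {
        "name": "Drought index",
        "description": "Monthly index derived from rainfall deficits.",
        "labels": [],
    },
]

for entry in shortlist(entries, "rainfall"):
    print(entry["name"], "-", entry["description"])
```

The final decision still belongs to the user, who reads the printed names and descriptions and decides what to keep.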

Identification

At its simplest, Identification metadata can be a single Uniform Resource Identifier; at its most complex, it can be a series of identifiers. No matter how identification metadata is managed, it is still able to pinpoint a single piece of information. Once a user has reached the Identification stage they have found what they want, and now need to be able to locate and retrieve the actual data. Metadata that previously helped them search through data can take on new roles outside of the Data Discovery Cycle.

Probably the best identification standard was mentioned above: the humble URI, or Uniform Resource Identifier. It is standard, easily resolved, infinitely extensible and widely used.
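As a rough sketch of what pinpointing a single piece of information means in practice, the code below indexes records by their URI and refuses duplicates. The record layout and the example.org URIs are hypothetical, and checking for a URI scheme with urllib.parse is only a light sanity check, not full URI validation.

```python
from urllib.parse import urlparse


def has_scheme(identifier):
    """Light sanity check: an absolute URI starts with a scheme such as http or urn."""
    return bool(urlparse(identifier).scheme)


def build_index(records):
    """Index records by URI so each identifier pinpoints exactly one record."""
    index = {}
    for rec in records:
        uri = rec["identifier"]
        if not has_scheme(uri):
            raise ValueError(f"not an absolute URI: {uri!r}")
        if uri in index:
            raise ValueError(f"duplicate identifier: {uri!r}")
        index[uri] = rec
    return index


# Made-up records for illustration only.
records = [
    {"identifier": "http://example.org/data/rainfall-2010", "title": "Rainfall 2010"},
    {"identifier": "http://example.org/data/census-2011", "title": "Census 2011"},
]

index = build_index(records)
print(index["http://example.org/data/census-2011"]["title"])
```

Rejecting duplicates up front is a design choice: it keeps the promise that one identifier always leads to one thing.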

What about everything else?

Don’t take this simple division of information to mean that everything else is unimportant. Metadata that doesn’t fall into the above roles shouldn’t be discounted. For domain-specific reasons, information outside the lists above can become exactly what the user is after. However, for the process of helping a user go through the Data Discovery Cycle, everything else really does become less important, and the above distinction can help you narrow down what is the most important information to help users search through a registry of information.

So what is the point?

The point is to look at how specific metadata helps or hinders a user’s ability to find what they are after. Too much and they become overwhelmed with options; too little and it becomes too difficult to find what they need. Likewise, if users are restricted to specific values for their metadata they may misinterpret the meaning of the controlled vocabulary. But again, if they are given too much freedom, it may become impossible for anyone to find anything.
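One possible middle ground between a strict controlled vocabulary and complete freedom is to accept free-text tags and map them onto controlled terms where you can. The sketch below assumes a tiny made-up vocabulary and synonym table; both are hypothetical and would need to come from understanding your own users.

```python
# Hypothetical controlled vocabulary and synonym table, for illustration only.
CONTROLLED_TERMS = {"rainfall", "temperature", "population"}
SYNONYMS = {
    "rain": "rainfall",
    "precipitation": "rainfall",
    "temp": "temperature",
}


def normalise(tag):
    """Map a free-text tag to a controlled term, or None if it is not recognised."""
    key = tag.strip().casefold()
    key = SYNONYMS.get(key, key)
    return key if key in CONTROLLED_TERMS else None


for tag in [" Rain ", "Temperature", "weather"]:
    print(repr(tag), "->", normalise(tag))
```

Tags that come back as None are worth reviewing rather than discarding, since they show the words users actually reach for.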

Epilogue – A story of discovery revisited

Let’s revisit our earlier story and look at how this maps to the Data Discovery Cycle:

A website author adds tags and text to a page appropriate to the content
A user searches for terms, dates and locations in a search engine, such as Google or Bing (Discovery)
After getting the results, the user reads the descriptions of the appropriate sites to decide which sites to read or ignore (Description)
The user clicks the link to a page they are interested in (Identification)

In summary, good discovery metadata is about finding the balance of information needed to help users find what they need with minimal effort and maximum results. Ultimately, this means understanding who your users are, what they are trying to find, and how they want to search for data – but people can be a lot harder to understand than data, I’m afraid.

* Not the DDI you might be thinking of, though. The Data Documentation Initiative is a good metadata standard, but it doesn’t explain how users search for their data. However, when it comes to standards that help statistical data providers describe their work, the Data Documentation Initiative is probably the best tool for helping providers produce the information needed to lead users through the Data Discovery Cycle.