Archive for the ‘Metadata’ Category

Make sure your continuous testing is continuous

One of the key features of the Aristotle Metadata Registry is its extensive test suite. Every time code is checked in, the test suites are run, and about 20 minutes later I get a notification saying everything is fine… or so it should be.

I recently made a small change to the test suite that altered no code and just changed some of the reporting. This shouldn’t have changed how the tests were run, so they should have completed without problems, but this wasn’t the case.

After a short investigation, I discovered that a library used in the Aristotle admin interface had changed in a big way. Unfortunately, I haven’t been able to work on Aristotle as frequently as I’d like over the past few months, so this had gone completely unnoticed. Since the test environment is rebuilt every time the tests are run, it was using the most recent version of the library, while my code depended on an earlier version.

Since Aristotle is still in beta, the result wasn’t disastrous, but it still highlights (for me at least) an issue with relying on a green tick in the test suite to say everything is alright – because while the tests might pass at that point in time, that result is prone to change.

So if you have to put down a project for a few weeks, or longer, nudge your code periodically, just to make sure everything is still running OK.

As for how it was fixed: a short alteration to the requirements file got the tests passing again, and a newer version that incorporates the updated library will be coming shortly.
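
If a project is going to sit idle, pinning exact versions in the requirements file is the simplest guard against this kind of surprise. A minimal sketch, with hypothetical package names and version numbers:

Django>=1.6,<1.7
django-admin-widgets==2.1  # hypothetical pin: the 3.x series changed its API

With pins like these, the rebuilt test environment installs the same versions the code was written against, so a green tick this month still means a green tick next month.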

Django-spaghetti-and-meatballs now available on pypi and through pip

The title covers most of it: django-spaghetti-and-meatballs (a little library I’ve been making for producing entity-relationship diagrams from Django models) is now packaged and available on PyPI, which means it can be installed via pip super easily:

pip install django-spaghetti-and-meatballs

There is even documentation for django-spaghetti-and-meatballs available on ReadTheDocs, so it’s all super stable and ready to use. So get it while it’s still fresh!
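
Configuration lives in your Django settings file. A minimal sketch based on the documented settings – the app labels and URL prefix here are examples, so check the docs for the exact options:

INSTALLED_APPS += ['django_spaghetti']

SPAGHETTI_SAUCE = {
    'apps': ['auth', 'polls'],      # app labels whose models get graphed
    'show_fields': False,           # hide per-model field lists to cut clutter
    'exclude': {'auth': ['user']},  # models to leave off the plate
}

With that in place, including django_spaghetti.urls at a URL of your choosing (for example r'^plate/') serves the interactive diagram.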

There is a live demo on the Aristotle Metadata Registry site, or you can check out the static version below:

[Image: a sample ERD]

Two new projects for data management with django

I’ve recently been working on two new projects through work that I’ve been able to make open source. These are designed to make data and metadata management with Django much easier. While I’m not in a position to talk about the main work yet, I can talk about the libraries that have sprung out of it:

The first is the “Django Data Interrogator”, a Django app for 1.7 and up that allows anyone to create tables of information from a database that stores Django models. I can see this being handy when you are storing lists of people, products or events and want to produce ad-hoc reports such as “people with the number of sales made” or “products with the highest sales, grouped by region”. At this stage this is done by giving a list of relations from a base ‘class’; more information is available on the Git repo. I should give apologies to a more well-known project with the same acronym – I didn’t pick the name, and will never acronymise this project.
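
For a sense of the kind of query this wraps, the first report above looks like this in plain Django ORM, with hypothetical Person and Sale models:

from django.db.models import Count

from people.models import Person  # hypothetical app and model

# “People with the number of sales made”
report = Person.objects.annotate(sale_count=Count('sale')).values('name', 'sale_count')

The Interrogator builds queries of this shape from a plain list of relations, so reports can be defined without touching the ORM.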

The second is “Django Spaghetti and Meatballs”, a tool to produce ERD-like diagrams from Django projects that, depending on the colors and number of models, look kind of like a plate of spaghetti. Once given a list of Django apps, it mines the Django content types table and produces an interactive JavaScript representation using the lovely VisJS library. This has been really useful for prototyping the database: while Django code is very readable, as the number of models and cross-app connections grew, this gave us a good understanding of how the wider picture looked. The other big advantage is that it uses Python docstrings, Django help text and field definitions to produce all the text in the diagrams. The example below shows a few models in three apps: Django’s built-in Auth models, and the Django notifications and revision apps:

[Image: A sample plate of spicy meatballs – Ingredients: Django Auth, Notifications and Revisions]

Request for comments/volunteers for the Aristotle Metadata Registry

This is a request for comments and volunteers for an open-source ISO 11179 metadata registry I have been working on, called the Aristotle Metadata Registry (Aristotle-MDR). Aristotle-MDR is a Django/Python application that provides an authoring environment for a wide variety of 11179-compliant metadata objects, with a focus on being multilingual. As such, I’m hoping to raise interest from bug checkers, translators, experienced HTML and Python programmers, and data modelers for mapping ISO 11179 to DDI 3.2 (and potentially other formats).

Aristotle-MDR is based on the Australian Institute of Health and Welfare’s METeOR registry, an ISO 11179-compliant authoring tool that manages several thousand metadata items for tracking health, community services, hospital and primary care statistics. I have undertaken the Aristotle-MDR project to build upon the ideas behind METeOR and extend it to improve compliance with 11179, but also to allow for access and discovery using other standards, including DDI and GSIM.

Aristotle-MDR is built on a number of existing open-source frameworks, including Django, Haystack, Bootstrap and jQuery, which allows it to scale from mobile to desktop on the client side, and from small shared hosting to full-scale enterprise environments on the server side. Alongside the in-built authoring suite is the Haystack search platform, which allows for a range of search solutions, from enterprise engines such as Solr or Elasticsearch to smaller-scale search engines.
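
As an illustration of that scaling, Haystack puts the backend behind a single settings switch. A sketch of a small-scale configuration using the file-based Whoosh engine; swapping to Solr or Elasticsearch is a matter of changing the ENGINE entry (the index path here is an example):

import os

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'PATH': os.path.join(os.path.dirname(__file__), 'whoosh_index'),
    },
}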

The goal of Aristotle-MDR is to conform to the ISO/IEC 11179 standard as closely as possible, so while it has a limited range of metadata objects, much like the 11179 standard itself it allows for the easy extension and inclusion of additional items, and a number of extensions are already available.

Information on how to create custom objects can be found in the documentation.

Due to the wide variety of needs users have for accessing information, there is a download extension API that allows for the creation of a wide variety of download formats. Included is the ability to generate PDF versions of content from simple HTML templates, and an additional module allows for the creation of DDI 3.2 (at the moment this supports a small number of objects only).
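
The PDF side of this is essentially “render a simple HTML template, then print it to PDF”. A hypothetical sketch of the idea – this is not Aristotle-MDR’s actual download API, and WeasyPrint here stands in for whichever HTML-to-PDF library a deployment uses:

from django.http import HttpResponse
from django.template.loader import render_to_string
from weasyprint import HTML  # third-party HTML-to-PDF converter

def download_pdf(request, item):  # hypothetical view, not the real extension API
    # Render the item through a simple HTML template, then print it to PDF
    html = render_to_string('downloads/pdf/item.html', {'item': item})
    return HttpResponse(HTML(string=html).write_pdf(), content_type='application/pdf')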

As mentioned, this is a call for comments and volunteers. First and foremost I’d appreciate as much help as possible with my mapping of 11179 objects to DDI 3.2 (or earlier versions), but also with translations for the user interface, which is currently available in English and Swedish (thanks to Olof Olsson). Partial translations into other languages are available thanks to translations in the Django source code, but additional translations of technical terms would be appreciated. More information on how to contribute to translating is available on the wiki.

To aid with this I’ve added a few blank translation files in common languages. Once the repository is forked, it should be relatively straightforward to edit these in Github and send a pull request back without having to pull down the entire codebase. These are listed by ISO 639-1 code, and if you don’t see your own listed let me know and I can quickly pop a boilerplate translation file in.
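
For anyone new to Django translation, these blank files follow the standard gettext layout: each msgid from the source is paired with a msgstr in the target language. A hypothetical entry (the Swedish rendering is purely illustrative):

msgid "Data Element"
msgstr "Dataelement"

Once the entries are filled in, running django-admin compilemessages builds the binary files Django serves at runtime.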

If you find bugs or identify areas of work, feel free to raise them, either by emailing me or by raising a bug on Github.

Aristotle Metadata Registry now has a Github organisation

This weekend’s task has been upgrading Aristotle from a single-user repository to a Github organisation. The new Aristotle-MDR organisation holds the main code for the Aristotle Metadata Registry, but alongside that it also has the DDI Utilities codebase and some additional extensions, along with the new “Aristotle Glossary” extension.

This new extension pulls the glossary code out of the core codebase to improve its status as a “pure” ISO/IEC 11179 implementation, as stated in the Aristotle-MDR mission statement. It will also provide additional Django post-save hooks to give easy look-ups from glossary items to any item that requires the glossary item in its definition.
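
In Django terms, post-save hooks means the signals framework. A hypothetical sketch of the shape such a hook could take – GlossaryItem and the helper methods are placeholders, not the extension’s actual API:

from django.db.models.signals import post_save
from django.dispatch import receiver

from glossary.models import GlossaryItem  # placeholder import for illustration

@receiver(post_save, sender=GlossaryItem)
def rebuild_glossary_links(sender, instance, **kwargs):
    # After a glossary item is saved, refresh the look-ups of every item
    # that cites it in its definition (placeholder query and method names).
    for item in instance.items_referencing_me():
        item.refresh_glossary_links()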

If you are curious about the procedure for migrating an existing project from a personal repository to an organisation, I’ve written a step-by-step guide on StackExchange that runs through all of the steps and potential issues.

Aristotle-Metadata-Registry – My worst kept secret

About 6 months ago I stopped blogging regularly, as I began work on a project that was not quite ready for a wider audience, but today that period comes to a close.

Over the past year, I have been working on a new piece of open-source software – an ISO/IEC 11179 metadata registry. This originally grew from my experiences working on the METeOR metadata registry, which gave me an in-depth understanding of the systems and governance issues around the management of metadata across large-scale organisations. I believe Aristotle-MDR provides one of the closest open-source implementations of the information model of Part 6 and the registration workflows of Part 3, in an easy-to-use and easy-to-install piece of open-source software.

In that time, Aristotle-MDR has grown to several thousand lines of code – most substantially, over 5000 lines of rigorously tested Python, backed by a suite of over 500 regression tests and rich documentation covering installation, configuration and extension. From a front-end perspective, Aristotle-MDR uses the Bootstrap, CKEditor and jQuery libraries to provide a seamless, responsive experience; the Haystack search engine provides scalable and accurate search; and custom wizards encourage the discovery and reuse of metadata at the point of content creation.

One of the guiding principles of Aristotle-MDR has been to not only model 11179 in a straightforward fashion, but to do so in a way that complies with the extension principles of the standard itself. To this end, while the data model of Aristotle-MDR is and will remain quite bare-bones, it provides a robust, tested framework on which extensions can be built. A number of such extensions are already being built, including those for the management of datasets, questionnaires and performance indicators, and for the sharing of information in the Data Documentation Initiative XML format.

In the last 12 months, I have learned a lot as a systems developer, had the opportunity to contribute to several Django-based projects, and I look forward to sharing Aristotle, especially at IASSIST 2015 where I aim to present Aristotle-MDR as a stable 1.0 release. In the interim, there is a demonstration server for Aristotle available, with two guest accounts and a few hundred example items for people to use, test and possibly break.

Why Linus Torvalds is wrong about XML

Linus Torvalds is one of the most revered figures in modern computer science and has made the kind of contributions to the world that I hope to achieve. However, given his global audience, his recent statements about XML give me pause for reflection.

I have worked with XML in a number of jobs, helped with the specification of international XML formats, written tutorials on their use, and even made my own XML format (with reason I might add). And I must say, in reply to Linus’s statement that

XML is the worst format ever designed

XML isn’t the problem; the problem is bad programmers. Computer Science is a broad field, covering not just the creation of programs, but also the correct specification of information for computation. The lack of appreciation for that second aspect has seen the recent rise of “Data Science” as a field – a mash of statistics, data management and programming.

While it is undeniable that many programmers write bad XML, this is because of poor understanding and discipline. One could equally say people write bad code, so let’s stop them writing code. People will always make mistakes or cut corners; the solution is education, not reinventing the wheel.

Linus and the rest of the Subsurface team are well within their rights to use the data formats they choose, and I am eager to see what new formats he can design. But with that in mind, I will address some of the critiques from Linus and others about XML and point out their issues, followed by some handy tips for programmers looking at using XML.

XML should be human readable

I did the best that I could with XML, and I suspect the subsurface XML is about as pretty and human-readable as you can make that crap

CSV isn’t very readable; C, Perl and Python aren’t very human-readable. What is “human-readable” is very subjective, as even English isn’t human-readable to non-English speakers.

Restricting ourselves to just technology, CSV isn’t very readable for any non-trivial amount of data, as the header will scroll off the top of the screen and data will overflow onto the next line or outside the horizontal boundaries of the screen. One could argue that it’s possible in Excel, OpenOffice or a Vim/Emacs plugin to lock the headers to the top of the screen – and now we have used a tool to overcome limitations in the format.

Likewise, the same can be said for computer code: code-folding, auto-completion of long function and variable names, and syntax highlighting are all software features that overcome failures in the format and make the output more “human-readable”. Plain text supports none of the above, yet no one would recommend writing code in Notepad for its lack of features.

Likewise, I would never, ever recommend writing XML in a non-XML editor. Auto-adding of closing tags, checking schemas as you type, easy access to the schema via hotlinks from elements and attributes, and XPath query-and-replace are all vital functions of a good XML editor. All of these make writing XML much easier and more approachable, and compared to code or CSV, a programmer should spend only as much time in an XML editor as it takes to understand the format well enough to make writing XML in code easier.

While it can be said that a poor craftsman blames his tools, a good craftsman knows when to use the right tools as well.

XML files should stand alone

This is most visible in this bug raised in Subsurface where it is stated that:

Subsurface only ever stores metric units. But our goal is to create files that make sense and can be read and understood without additional information.

Now, examination of a sample of the XML from Subsurface shows a glaring contradiction: there is nothing in the file that says the units are metric. The distance ‘m’ could equally stand for ‘miles’, and while the order of magnitude would make misinterpretation hard for a human, a dive computer with an incorrect understanding may miscalculate the required oxygen pressure, leading to potential death. To accurately understand this file, I need to find the documentation – i.e. additional information. The reason for schemas is to explicitly describe a data file.

Additionally, because data is stored as “human-readable” strings, I could validly put in “thirty metres” instead of “30.0 m” as a depth. At this point the program might fail, but as someone writing the data elsewhere I’d have no idea why. Apart from being a description of the data, a schema exists as a contract: if you say the data is of this form, then these are the rules you must conform to. When you are looking at sharing data between programs or organisations, this ability to lean on technical enforcement is invaluable, as making “bad” data becomes that much harder.
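
To make the contract concrete, here is a minimal sketch in Python with lxml, using a cut-down, hypothetical dive schema: declaring depth as xs:decimal means “30.0” passes validation and “thirty metres” is rejected before it ever reaches a dive computer.

from lxml import etree

# A toy schema: a dive element whose depth must be a decimal number
schema = etree.XMLSchema(etree.XML("""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="dive">
    <xs:complexType>
      <xs:attribute name="depth" type="xs:decimal" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""))

print(schema.validate(etree.XML('<dive depth="30.0"/>')))           # True
print(schema.validate(etree.XML('<dive depth="thirty metres"/>')))  # False: contract broken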

XML shouldn’t need other formats

This is a tricky one, as when people think of XML, even if they have made a schema, their mind stops there. XML isn’t just a format; it’s a suite of related formats that can make handling and manipulating information easier.

It’s worth noting that people have raised databases within that thread as an alternative – but SQL is only a query language, and requires a formal Data Definition Language to describe the data and an engine to query over it. Likewise, HTML without CSS, Javascript or any number of the programming and templating languages that power the web would be much less useful to the general public.

Similarly, isolating XML from XML schemas means your data has no structure. Isolating XML from XQuery and XPath means you have no way of querying your data. Without XSLT there is no easy, declarative way to transform XML – and having done this with both traditional languages and XSLT, the latter makes using and transforming XML much easier. Ultimately, using XML without taking advantage of the technologies that exist across the XML landscape is not using the technology to its best.

Tips for good XML

With all of that aside, XML, like all technologies, can be used poorly. However, when done well and documented properly, a good XML format with an appropriate schema can reduce errors and provide vital metadata that gives data context and longevity. So I present a few handy tips for using XML well.

  1. Only use XML when appropriate. XML is best suited to complex data, especially hierarchical data. As Linus (and others) point out in the linked thread, tabular data is much better suited to CSV or more structured tabular formats, simple key-values can be stored in ini files, and marked-up text can be done in HTML, Markdown or any number of other formats.
  2. Look for other formats. If you are thinking of using XML for your tool, stop and see what others have already done. The world doesn’t need another format, so if you are thinking of making one you should have a very, very good reason to do so.
  3. Use a schema or doctype. If you do choose to make your own format, this is the most important point: make a schema. How you capture it – Doctype, XSD Schema, Schematron, Relax NG – is largely irrelevant. What is important is that your data format is documented. There are even tools that can automate creating schema stubs from documents, so there is no excuse not to. As stated above, an XML schema is the formal contract about what your data is, and lets others know that if the data doesn’t conform to this format then it is broken.
  4. Use XML datatypes. XML already has specifications for text, numeric, datetime and identification data. Use these as a starting point for your data.
  5. Store one type of data per field. While the difference between <dive duration="30:00 mins"> and <dive duration="30" durationUnit="mins"> is minimal, the former uses a single string for two pieces of data, while the latter uses two fields – a number and an enumerable – each storing one piece of data. An even better solution is the XML duration datatype, <dive duration="PT30M">, based on the existing ISO 8601 standard; a parsing sketch follows this list.
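
As the sketch promised above: because PT30M is a standard xs:duration, it maps straight onto native time types. This assumes the third-party isodate package (any ISO 8601 parser would do):

import xml.etree.ElementTree as ET

import isodate  # third-party ISO 8601 parser: pip install isodate

dive = ET.fromstring('<dive duration="PT30M"/>')
duration = isodate.parse_duration(dive.get('duration'))
print(duration)  # 0:30:00 – a datetime.timedelta, ready for arithmetic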

A Request for Comments on a new XML Questionnaire Specification Format (SQBL)

This is an announcement and Request for Comments on SQBL, a new open-source XML format for the cross-platform development of questionnaire specifications. The design decisions behind SQBL and additional details are the subject of a paper to be presented in 2 weeks at the 2013 IASSIST conference in Cologne, Germany:
– Do We Need a Perfect Metadata Standard or is “Good Enough” Good Enough?
However, to ensure people are well-informed ahead of time, I am releasing details ahead of the conference.

The gist

SQBL – The Structured (or Simple) Questionnaire Building Language is an emerging XML format designed to allow survey researchers of all fields to easily produce questionnaire specifications with the required structure to enable deployment to any questionnaire platform – including, but not limited to, Blaise, DDI, LimeSurvey, XForms and paper surveys.

The problem

Analysing the current state of questionnaire design and development shows that there are relatively few tools available that are capable of allowing a survey designer to easily create questionnaire specifications in a simple manner, whilst providing the structure necessary to verify respondent routing and provide a reliable input to the automation of questionnaire deployment.

Of the current questionnaire creation tools available, they either:
* prevent the sharing of content (such as closed tools like SurveyMonkey),
* require extensive programming experience (such as Blaise or CASES),
* or use formats that make transformation difficult (such as those based on DDI).
Given the high cost of questionnaire design – in the creation, testing and deployment of final questionnaires – a format that can reduce the cost in any or all of these areas will have positive effects for researchers.

Furthermore, by providing researchers with the easy tools necessary to create questionnaires, they will consequently create structured metadata, thus reducing the well-understood documentation burden for archivists.

Structured questionnaire design

Last year, I wrote a paper, “The Case Against the Skip Statement”, that described the computational theory of questionnaire logic – namely the structures used to describe skips and routing logic in questionnaires. This paper was awarded 3rd place in the International Association for Official Statistics ‘2013 Young Statistician Prize’. The paper is awaiting publication, but can be made available for private reading on request. It proposed that the routing logic in questionnaires is structurally identical to that of computer programs. Following this assertion, it stated that a higher-order language can be created that acts as a “high-level questionnaire specification logic” that can be compiled to any questionnaire platform, in much the same way that computer programming languages are compiled to machine language. Unfortunately, while some existing formats incorporate some of the principles of Structured Questionnaire Design, they are incomplete or too complex to provide the proposed benefits.

SQBL – The Structured (or Simple) Questionnaire Building Language

SQBL is an XML format that acts as a high-level language for describing questionnaire logic. Small and simple, but powerful, it incorporates XML technologies to reduce the barrier to entry and make the description of questionnaire specifications readable, even in raw XML. Underlying this simplicity is a strict schema that enforces single solutions to problems, meaning SQBL can be transformed into a format for any survey tool that has a published specification.

Furthermore, because of its small schema and incorporation of XML and HTTP core technologies, it is easier for developers to work with. In turn, this makes survey design more comprehensible through the creation of easier tools, and will help remove the need for costly, specialised instrument programmers through automation.

Canard – the SQBL Question Module Editor

Announced alongside the Request for Comments on SQBL is an early beta release of the SQBL-based Canard Question Module Editor. Canard is designed as a proof-of-concept tool to illustrate how questionnaire specifications can be generated in an easy-to-use drag-and-drop interface. This is achieved by providing designers with instant feedback on changes to specifications through its two-panel design, which allows researchers to see the logical specification, routing paths and example questionnaires all within the same tool.

SQBL and other standards

SQBL is not a competitor to any existing standard, mainly because a structured approach to questionnaire design based on solid theory has never been attempted before. SQBL fills a niche that other standards don’t yet serve well.

For example, while DDI can archive any questionnaire as-is, this is because of the loose structure necessary for being able to archive uncontrolled metadata. However, if we want to be able to make questionnaire specifications that can be used to drive processes, what is needed is the strict structure of a format like SQBL.

Similarly, SQBL has loose couplings to other information through standard HTTP URIs, allowing linkages to any networked standard. For example, Data Elements may be described in a DDI registry, which a SQBL question can reference via its DDI-URI. Additionally, to support automation, a survey instrument described inside a DDI Data Collection can, rather than pointing to a DDI Sequence containing the instrument details, use existing linkages to external standards to point to a SQBL document via a standard URL. Once data collection is complete, harmonisation can be performed, as each SQBL module has questions pointing to variables, so data has comparability downstream.

SQBL in action

The SQBL XML schemas are available on GitHub, which also contains examples and files from the video tutorials. There is also a website that provides more information on the format and on some of the principles of Structured Questionnaire Design.

If you don’t like getting your hands dirty with XML, you can download the Windows version of the Canard Question Module Editor from Dropbox and start producing questionnaire specifications immediately. All that needs to be done is to unzip the file and run the executable inside. Due to missing dependencies, flowcharts may not be immediately available; however, this can be fixed by installing the free third-party graphing tool Graphviz.

Lastly, there is a growing number of tutorial videos on how to use Canard on Youtube.

Video 1 – Basic Questions (2:17 min)
Video 2 – Complex Responses (2:17 min)
Video 3 – Simple Logic (4:11 min)

There is also an early beta video that runs through creating an entire questionnaire, showing the side-by-side preview. (13:21 mins)

Joining the SQBL community

First of all, there is a mailing list for SQBL hosted by Google Groups (the ‘sqbl’ forum).

Along with this, each of the GitHub repositories includes an issue tracker. Both Canard and SQBL are in early design stages, so there is an opportunity for feedback and input to ensure both SQBL and Canard support the needs of all questionnaire designers.

Lastly, while there are initial examples of conversion tools to transform SQBL into DDI-Lifecycle 3.1 and XForms, there is room for growth. Given the proliferation of customised solutions to deploy both paper and web forms, there is a need for developers to support the creation of transformations from SQBL into formats such as Blaise, LimeSurvey, CASES and more.

If you have made it this far, thank you for reading all the way through, and I look forward to all the feedback people have to offer.

Cheers and I look forward to feedback now or at IASSIST,

Samuel Spencer.
SQBL & Canard Lead Developer
IASSIST Asia/Pacific Regional Secretary

Beginning the soft launch of SQBL and Canard

Over the past week I’ve started finalising a version of Canard and SQBL ready for early-beta testing and public review ahead of IASSIST2013. While I’ll be putting together more documentation later in the week, below is the first of a series of short tutorials on how Canard will eventually be used.

Also, later this week will see the source code for Canard, as shown in the video below, released on GitHub, as well as a beta binary for ease of use during testing. The SQBL schemas can already be seen on GitHub, and the main SQBL website contains more information. For now, enjoy the two videos below to see how a strict structure can make questionnaire design easier than ever before!

Why I’ve chosen to make a new XML standard for questionnaires

[Image: XKCD #927 – Standards]

Normally I don’t like XKCD, but this is so true.

I’ve made no secret of the fact that I’ve been working on a new format for questionnaires. I recently registered a domain for the Structured Questionnaire Building Language, and have been releasing screenshots and a video of a new tool for questionnaire design that I’m working on. Considering that I’ll be covering this work at at least one conference this year, and given my close ties in a few technical communities, I felt it would be good to discuss why this is the case, and answer a few questions that people may have.

Why is a new format for questionnaire design necessary?

Over the past few years I’ve done a lot of research analysing how questionnaires are structured in a very generic sense. Given the simplistic nature of the logic traditionally found in paper and electronic questionnaires, and its logical similarity to computer programming, I’ve theorised that it should be possible to use the same methods (and thus the same tools) to support all questionnaires – including the oft-ignored paper questionnaire. Unfortunately, attempts to improve questionnaires have focused on proprietary or limited use cases, which is why tools and formats such as Blaise, CASES and queXML exist but generally only support telephone or web surveys. Likewise, all of these attempts have ignored the logical structure in various ways and discouraged questionnaire designers from becoming intimately, and necessarily, familiar with the logic of their questionnaires.

SQBL, on the other hand, is an attempt at designing a specialised format to support the capture of the generic information that describes a questionnaire. Likewise, Canard is a parallel development of a tool that allows a researcher to quickly create this information, as a way to help them create their questionnaire rather than just document it afterwards.

As a quick aside, if you are interested in this research on Structured Questionnaire Design, it is still awaiting publication, but if you email me directly, I’ll be glad to forward you as much as you care to read – and probably more.

Why not just use DDI?

Given the superficial overlap between SQBL and DDI, this is not an uncommon question, even at this early stage. I’ve written previously that writing software for DDI isn’t easy: when trying to write software that is user-friendly, can handle all of the edge cases that DDI does, and operates using the referential structures that make DDI so powerful, it’s hard. Really hard. Given that a format is nothing without the tools to support it, I have written a three-part essay on how to extend DDI in the ways necessary to support complex questionnaires. However, even this is fraught with trouble, as software that writes these extensions would have trouble reading “un-extended” DDI. What is needed is a tool powerful enough to capture the content required of well-structured questionnaires in a user-friendly way, and it seemed increasingly unlikely that this was possible in DDI.

A counterpoint is to also ask “why DDI?” DDI 2 and 3 are exemplary formats for archival and discovery, but this is because both are very flexible and can capture any and every possible use case – which is absolutely vital when working in an archive to capture what was done. However, when we turn this around and look at formats that can be predictably and reliably written and read, what is needed is rigidity and strict structure. While such rigidity could be applied to DDI, it risks fracturing the user base, leading to “archival DDI”, “questionnaire DDI” and who knows what else.

Thus I deemed the decision to start again, with a strict, narrow use case, uncomfortable but necessary.

What about DDI?

I did some soul-searching on this (as much soul-searching as one can do around picking sides in a ‘standards war’), and realised that there really is no point in “picking sides”. SQBL isn’t perfect and isn’t yet complete, and more to the point it supports a very narrow use case. If I view DDI as a flexible archival format, there is a lot of work necessary to support conversion into and out of it to support discovery and reuse. Likewise, if I view SQBL as a rigid living format for creating questionnaires, the question becomes how to link this relatively limited content with other vital survey information. By definition SQBL has a limited useful timeframe, and once data has been collected (if not earlier) it is no longer necessary, so conversion or linkages to other formats become required.

Somewhere between these overlaps is where DDI and SQBL will handshake, and perhaps in future standards this handshake will be formalised. This means there is a lot of work on both sides of the fence, in which I look forward to playing an active part. But in the interim, and for questionnaire design, I believe SQBL will prove to be a necessary new addition to the wide world of survey research standards.