The unspoken financial benefits of open-source software

Image: Stone soup (https://commons.wikimedia.org/wiki/File:Stone-soup-ii-pawn-nitichan.jpg)

I have recently been applying for travel funding from an employer to attend the 2015 IASSIST conference to present the Aristotle Metadata Registry, and after adding up the costs I started thinking about the benefits that would justify the expense.

Since the call for comments went out, two people have offered to provide translations for Aristotle-MDR, which started me thinking about the unaccounted-for benefits I've already received. For argument's sake, let's assume a typical conference registration cost of $500 (AUD or USD), with accommodation and travel adding another $1000. For attendance to be worthwhile, you'd want to see at least $1500 in return.

I started by looking at professional translation costs, which can run as high as $100 per hour. So if translating a portion of the project takes an hour, then for the two languages that are (or will soon be) available, Aristotle has received about $200 of volunteer effort. With this in mind, I started thinking about how little support needs to be rallied to provide a return on an investment in attending a conference.

If we assume freelance developers can be hired for about $50 per hour, then to recoup our $1500 conference we'd need around 30 hours of work – not a small amount, especially when done for free. But broken down across multiple attendees this shrinks dramatically. If a talk can encourage moderate participation from as few as three people in an audience, this becomes 10 hours of work each. Spread again across the course of a year, that is under an hour a month!

Given the rough numbers above, convincing three attendees to provide an hour of work a month yields a very rough approximation of $1800 of service – a 20% return on investment.
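
Laid out explicitly, the back-of-envelope arithmetic is simple (all figures are the assumed round numbers above, not real quotes):

# Back-of-envelope conference ROI, using the assumed figures above.
investment = 500 + 1000            # registration + travel/accommodation
hourly_rate = 50                   # assumed freelance developer rate
volunteers = 3                     # attendees convinced to contribute
hours_each = 1 * 12                # one hour a month for a year

value = volunteers * hours_each * hourly_rate   # 3 * 12 * 50 = 1800
roi = (value - investment) / investment         # 300 / 1500 = 0.2
print(f"${value} of service, a {roi:.0%} return")  # $1800 of service, a 20% return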

Along with programming or user-interface development, there are other metrics to consider when calculating the value generated from open source. As a developer, I know the intrinsic value of a well-written bug report, so even discovered bugs that lead to improvements are highly valuable to a project. This means the number of filed and closed bugs can be used as a rough metric (albeit a very, very rough one) for positive contributions.

Ultimately, while there are strong ideological reasons for contributing to open source, when developing open-source projects within a business context these need to be backed by a solid financial rationale.

Request for comments/volunteers for the Aristotle Metadata Registry

This is a request for comments and volunteers for an open-source ISO 11179 metadata registry I have been working on, called the Aristotle Metadata Registry (Aristotle-MDR). Aristotle-MDR is a Django/Python application that provides an authoring environment for a wide variety of 11179-compliant metadata objects, with a focus on being multilingual. As such, I'm hoping to raise interest among bug hunters, translators, experienced HTML and Python programmers, and data modelers for the mapping of ISO 11179 to DDI3.2 (and potentially other formats).

For the eager, the code and documentation are already available:

https://github.com/aristotle-mdr/aristotle-metadata-registry
http://aristotle-metadata-registry.readthedocs.org/en/latest/

Background

Aristotle-MDR is based on the Australian Institute of Health and Welfare's METeOR Registry, an ISO 11179-compliant authoring tool that manages several thousand metadata items used to track health, community services, hospital and primary care statistics. I have undertaken the Aristotle-MDR project to build upon the ideas behind METeOR and extend it to improve compliance with 11179, but also to allow for access and discovery using other standards, including DDI and GSIM.

Aristotle-MDR is built on a number of existing open-source frameworks, including Django, Haystack, Bootstrap and jQuery, which allows it to scale easily from mobile to desktop on the client side, and from small shared hosting to full-scale enterprise environments on the server side. Alongside the built-in authoring suite is the Haystack search platform, which allows for a range of search solutions, from enterprise engines such as Solr or Elasticsearch down to smaller-scale search backends.
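
As an illustration of that server-side flexibility, swapping Haystack backends is a single settings change. A minimal sketch of a Django settings entry (the paths and URLs here are placeholders, not Aristotle-MDR defaults):

# settings.py – sketch of a swappable django-haystack backend.
# Small deployment: a file-based Whoosh index with no extra services.
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'PATH': '/var/search/aristotle_index',  # placeholder path
    },
}

# Enterprise deployment: point the same setting at a Solr server instead.
# HAYSTACK_CONNECTIONS = {
#     'default': {
#         'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
#         'URL': 'http://127.0.0.1:8983/solr/aristotle',  # placeholder URL
#     },
# }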

The goal of Aristotle-MDR is to conform to the ISO/IEC 11179 standard as closely as possible, so while it has a limited range of metadata objects, much like the 11179 standard it allows for the easy extension and inclusion of additional items. Among those already available are extensions for describing datasets, questionnaires and performance indicators.

Information on how to create custom objects can be found in the documentation: http://aristotle-metadata-registry.readthedocs.org/en/latest/extensions/index.html

Due to the wide variety of ways users need to access information, there is a download extension API that allows for the creation of a wide variety of download formats. Included is the ability to generate PDF versions of content from simple HTML templates, and an additional module allows for the creation of DDI3.2 (at the moment this supports a small number of objects only): https://github.com/aristotle-mdr/aristotle-ddi-utils

As mentioned, this is a call for comments and volunteers. First and foremost, I'd appreciate as much help as possible with my mapping of 11179 objects to DDI3.2 (or earlier versions), but also with translations for the user interface – which is currently available in English and Swedish (thanks to Olof Olsson). Partial translations into other languages are available thanks to the translations in the Django source code, but additional translations of technical terms would be appreciated. More information on how to contribute to translating is available on the wiki: https://github.com/aristotle-mdr/aristotle-metadata-registry/wiki/Providing-translations.

To aid with this I've added a few blank translation files in common languages. Once the repository is forked, it should be relatively straightforward to edit these on GitHub and send a pull request back without having to pull down the entire codebase. These are listed by ISO 639-1 code, and if you don't see your own listed, let me know and I can quickly pop a boilerplate translation file in.

https://github.com/aristotle-mdr/aristotle-metadata-registry/tree/master/aristotle_mdr/locale
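
For anyone unfamiliar with Django translation files, each entry in a .po file simply pairs a source string with its translation, so contributing is mostly a matter of filling in blanks. A hypothetical entry (the path and strings here are illustrative, not taken from the actual files):

#: aristotle_mdr/templates/search.html (illustrative path)
msgid "Search the registry"
msgstr "Sök i registret"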

If you find bugs or identify areas for work, feel free to raise them either by emailing me or by opening an issue on GitHub: https://github.com/aristotle-mdr/aristotle-metadata-registry/issues

Aristotle Metadata Registry now has a GitHub organisation

This weekend's task has been upgrading Aristotle from a single-user repository to a GitHub organisation. The new Aristotle-MDR organisation holds the main code for the Aristotle Metadata Registry, but alongside that it also has the DDI Utilities codebase and some additional extensions, along with the new "Aristotle Glossary" extension.

This new extension pulls the glossary codebase out of the core code to improve Aristotle-MDR's status as a "pure" ISO/IEC 11179 implementation, as stated in the Aristotle-MDR mission statement. It will also provide additional Django post-save hooks to provide easy look-ups from glossary items to any item that requires the glossary item in its definition.
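
For the curious, the hook itself is standard Django signal machinery. A rough sketch of the idea (the model and field names here are hypothetical, not the actual Aristotle-MDR API):

# Sketch only: hypothetical model names illustrating a Django post-save
# hook that refreshes glossary look-ups whenever an item is saved.
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=MetadataItem)       # MetadataItem: hypothetical
def relink_glossary_items(sender, instance, **kwargs):
    # Link every glossary entry whose URL appears in the saved
    # item's definition text (a simple substring scan).
    for entry in GlossaryItem.objects.all():    # GlossaryItem: hypothetical
        if entry.get_absolute_url() in instance.definition:
            entry.referenced_by.add(instance)   # hypothetical reverse M2M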

If you are curious about the procedure for migrating an existing project from a personal repository to an organisation, I’ve written a step-by-step guide on StackExchange that runs through all of the steps and potential issues.

Aristotle-Metadata-Registry – My worst-kept secret

About six months ago I stopped blogging frequently, as I began work on a project that was not quite ready for a wider audience, but today that period comes to a close.

Over the past year, I have been working on a new piece of open-source software – an ISO/IEC 11179 metadata registry. This originally grew from my experiences working on the METeOR Metadata Registry, which gave me an in-depth understanding of the systems and governance issues around the management of metadata across large-scale organisations. I believe Aristotle-MDR provides one of the closest open-source implementations of the information model of Part 6 and the registration workflows of Part 3, in an easy-to-use and easy-to-install piece of open-source software.

In that time, Aristotle-MDR has grown to several thousand lines of code – most substantially, over 5,000 lines of rigorously tested Python code, covered by a suite of over 500 regression tests – and rich documentation covering installation, configuration and extension. From a front-end perspective, Aristotle-MDR uses the Bootstrap, CKEditor and jQuery libraries to provide a seamless, responsive experience; the Haystack search engine provides scalable and accurate search; and custom wizards encourage the discovery and reuse of metadata at the point of content creation.

One of the guiding principles of Aristotle-MDR has been to not only model 11179 in a straightforward fashion, but to do so in a way that complies with the extension principles of the standard itself. To this end, while the data model of Aristotle-MDR is and will remain quite bare-bones, it provides a robust, tested framework on which extensions can be built. A number of such extensions are already being built, including those for the management of datasets, questionnaires and performance indicators, and for the sharing of information in the Data Documentation Initiative XML format.

In the last 12 months, I have learned a lot as a systems developer, had the opportunity to contribute to several Django-based projects, and look forward to sharing Aristotle, especially at IASSIST 2015, where I aim to present Aristotle-MDR as a stable 1.0 release. In the interim, there is a demonstration server for Aristotle available, with two guest accounts and a few hundred example items for people to use, test and possibly break.

The public release of “A Case Against the Skip Statement”

A few years ago I wrote a paper titled "A Case Against the Skip Statement" on the logical construction of questionnaires, which was awarded second place in the 2012 Young Statisticians Awards of the International Association of Official Statistics.

It went through two or three rounds of review over the course of a year, but due to shifting organisational aims, I was never able to find the time to polish it to the point of publication before changing jobs. So for the past few years I have quietly emailed it around, received some positive feedback, and had a few requests to have it published so it could be cited. I have even referred back to it in conferences and other papers, but never formally cited it myself. I have also used this article as an example of why the study of 'classical' articles in computer science is still important, for the simple fact that while Dijkstra's "Go To Statement Considered Harmful" is dated in traditional computer science, its mathematical and logical reasoning can still be useful, as seen in the comparison of programming languages and the logic of questionnaires.

As a compromise to those requests, I have released the full text online, with references and a ready-to-use BibTeX citation for those who are interested. The abstract follows the BibTeX references below:

@misc{CaseAgainstSkip,
    title = {A Case Against the Skip Statement},
    author = {Samuel Spencer},
    year = 2012,
    howpublished = {\url{http://bit.ly/CaseAgainstSkip}},
    note = {[Date downloaded]}
}

or using BibLaTeX:

@online{CaseAgainstSkip,
    author = {Samuel Spencer},
    title = {A Case Against the Skip Statement},
    year = 2012,
    url = {http://bit.ly/CaseAgainstSkip},
    urldate = {[Date downloaded]}
}

With statistical agencies facing shrinking budgets and a desire to support evidence-based policy in a rapidly changing world, statistical surveys must become more agile. One possible way to improve productivity and responsiveness is through the automation of questionnaire design, reducing the time necessary to produce complex and valid questionnaires. However, despite computer enhancements to many facets of survey research, questionnaire logic is often managed using templates that are interpreted by specialised staff, reducing efficiency. It must then be asked why, in spite of such benefits, is automation so difficult?

This paper suggests that the stalling point of further automation within questionnaire design is the 'skip statement'. An artifact of paper questionnaires, skip statements are still used in the specification of computer-aided instruments, complicating the understanding of questionnaires and impeding their transition to computer systems. By examining questionnaire logic in isolation we can analyse the structural similarity to computer programming and examine the applicability of hierarchical patterns described in the structured programming theorem, laying a foundation for more structured patterns in questionnaire logic, which in time will help realise the benefits of automation.

Making a login badge with font-awesome

On a new site I'm building, I was looking for a way to include a nice login badge, similar to those on Google login pages. In fact, I found some nice-looking Bootstrap login templates that included the Google login image below directly.

Google login guy

Turns out the image is actually a square with a CSS border-radius applied, so given that the page already loads the whole set of Font Awesome icons, I wondered if it was possible to replicate this without loading an image… and it is.

Font-awesome login guy

There were issues getting the user silhouette to sit nicely over a standard Font Awesome circle, so I went down the route of using a border-radius. The colours don't match exactly, but the advantage of this approach is that they can be customised to match any theme very quickly. The complete code is pretty straightforward, and the gist is below:

 
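Something like the sketch below captures the approach – the exact sizes and colours in the gist may differ, so treat these values as placeholders:

<span class="login-badge"><i class="fa fa-user"></i></span>

.login-badge {
    display: inline-block;
    width: 100px;
    height: 100px;
    border-radius: 50%;            /* turn the square into a circle */
    background-color: #f2f2f2;     /* placeholder; match your theme */
    text-align: center;
}
.login-badge .fa-user {
    font-size: 70px;
    line-height: 100px;            /* vertically centre the silhouette */
    color: #bcbcbc;                /* placeholder silhouette colour */
}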

A cacophony of Canard and SQBL updates.

Late last month, Canard version 0.2.2 was packaged up and released in time for the 2014 IASSIST conference. This new version changed the Canard Question Module Editor from running on 64-bit Python to 32-bit, as requested by a few people, as well as including import and export plugins for the new 3.2 release of the DDI-Lifecycle XML format.

The timing of this release was due to me presenting Canard during the DDI tools session as well as the poster session. At both of these sessions Canard was well received, and there was interest in a wide range of applications, from teaching undergraduates about survey design and metadata management, to researchers documenting surveys with lots of live feedback, to archivists recapturing questionnaire metadata.

Until recently, Canard (and its underlying data model, SQBL) followed a very ad hoc release and change process due to their ongoing development. However, given the growing interest in the Canard tool, it's in everyone's interest for this to be made clearer. So, first of all, a proper release version of the Simple Questionnaire Building Language schema is available on GitHub.

New versions of SQBL will be released in a form that remains backwards compatible across minor and patch version numbers – in essence, no new mandatory elements or attributes will be added, and no old ones removed. Canard release versions will also be tied to these version numbers: starting from Canard v0.2.2, the Canard minor version number will be one greater than the minor version of the SQBL releases it supports. It's likely that Canard will have more frequent releases than SQBL; however, given how patch versions may be released, it's best to use the latest version of Canard to ensure compatibility with the applicable minor version family of SQBL.

In short, this guarantees that in future the latest Canard version 0.2 release will open any SQBL v0.1 document made by a prior version of Canard in the 0.2 version family, which makes upgrading a safer option.
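
Expressed as a quick sketch, the rule is simple (the version tuples here are purely illustrative):

# Sketch of the compatibility rule: a Canard release supports the SQBL
# minor version one below its own, regardless of patch level.
def canard_supports(canard_version, sqbl_version):
    c_major, c_minor = canard_version[0], canard_version[1]
    s_major, s_minor = sqbl_version[0], sqbl_version[1]
    return c_major == s_major and c_minor == s_minor + 1

print(canard_supports((0, 2, 2), (0, 1, 0)))  # True: Canard 0.2.x opens SQBL 0.1.x
print(canard_supports((0, 3, 0), (0, 1, 0)))  # False: a 0.3 release targets SQBL 0.2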

However, this doesn't mean that SQBL is 'fixed' – there are already certain issues that will need to be addressed, especially around describing calculated values in questionnaires – just that changes will be more predictable. Nor does it mean that the development of Canard will slow, as there are already issues that have been opened (and some closed) since the last release.

It's likely that patch-level changes to SQBL will happen every three to four patch releases of Canard. This will give people time to adjust, to find and identify any issues in the software or the schema, and to migrate import and export plugins.

As always, feature requests and bug reports are more than welcome, so if you are using Canard, please feel free to raise an issue on GitHub. Likewise, if there are missing features that require changes to the underlying data model, or deviations from similar standards like DDI, raise an issue with the SQBL schema, or join the SQBL mailing list and see if there are ways a problem can be captured in the current schema.

Why Linus Torvalds is wrong about XML

Linus Torvalds is one of the most revered figures in modern computer science and has made the kind of contributions to the world that I hope to achieve. However, given his global audience, his recent statements about XML give me pause for reflection.

I have worked with XML in a number of jobs, helped with the specification of international XML formats, written tutorials on their use, and even made my own XML format (with reason I might add). And I must say, in reply to Linus’s statement that

XML is the worst format ever designed

XML isn't the problem; rather, the problem is bad programmers. Computer science is a broad field, covering not just the creation of programs but also the correct specification of information for computation. The lack of appreciation for that second aspect has seen the recent rise of "Data Science" as a field – a mash of statistics, data management and programming.

While it is undeniable that many programmers write bad XML, this is because of poor understanding and discipline. One could equally say "people write bad code, let's stop them writing code". People will always make mistakes or cut corners; the solution is education, not reinventing the wheel.

Linus and the rest of the Subsurface team are well within their rights to use the data formats they choose, and I am eager to see what new formats he can design. But with that in mind, I will address some of the critiques of Linus and others about XML and point out their issues, followed by some handy tips for programmers looking at using XML.

XML should be human readable

I did the best that I could with XML, and I suspect the subsurface XML is about as pretty and human-readable as you can make that crap

CSV isn't very readable; C, Perl and Python aren't very human-readable either. What is "human-readable" is very subjective – even English isn't human-readable to non-English speakers.

Restricting ourselves to just technology, CSV isn't very readable for any non-trivial amount of data, as the header will scroll off the top of the screen and data will overflow onto the next line or past the horizontal boundaries of the screen. One could argue that it's possible in Excel, OpenOffice or a Vim/Emacs plugin to lock the headers to the top of the screen – and now we have used a tool to overcome limitations in the format.

Likewise, the same can be said for computer code: code folding, auto-completion of long function and variable names, and syntax highlighting are all software features that overcome failures in the format and make the output more "human-readable". Plain text supports none of the above, yet no one would recommend writing code in Notepad for its lack of features.

Likewise, I would never, ever recommend writing XML in a non-XML editor. Auto-adding of closing tags, checking the schema as you type, easy access to the schema via hotlinks from elements and attributes, and XPath query-and-replace are all vital functions of a good XML editor. All of these make writing XML much easier and more approachable, and compared to code or CSV, a programmer should spend only as much time in an XML editor as is needed to understand the format and make writing XML in code easier.

While it can be said that a poor craftsman blames his tools, a good craftsman knows when to use the right tools as well.

XML files should stand alone

This is most visible in this bug raised against Subsurface, where it is stated that:

Subsurface only ever stores metric units. But our goal is to create files that make sense and can be read and understood without additional information.

Now, examination of a sample of the XML from Subsurface shows a glaring contradiction: there is nothing in this file that says the units are metric. The distance unit 'm' could equally stand for 'miles', and while the order of magnitude would make misinterpretation hard for a human, a dive computer with an incorrect understanding could miscalculate the required oxygen pressure, leading to potential death. To accurately understand this file, I need to find the documentation, i.e. additional information. The reason for schemas is to explicitly describe a data file.

Additionally, because data is stored as "human-readable" strings, I could validly put in "thirty metres" instead of "30.0 m" as a depth. At this point the program might fail, but as someone writing the data elsewhere I'd have no idea why. Apart from being a description of the data, a schema exists as a contract: if you say the data is of this form, then these are the rules you must conform to. When you are looking at sharing data between programs or organisations, this ability to lean on technical enforcement is invaluable, as making "bad" data becomes that much harder.
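
To make that contract concrete, here is a small sketch using Python's lxml (the element and attribute names are invented for illustration, not taken from Subsurface): a schema that demands a decimal depth rejects "thirty metres" outright.

# Sketch: an XML schema acting as a contract. Names are illustrative.
from lxml import etree

schema = etree.XMLSchema(etree.XML("""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="dive">
    <xs:complexType>
      <xs:attribute name="depth" type="xs:decimal" use="required"/>
      <xs:attribute name="depthUnit" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="m"/>
            <xs:enumeration value="ft"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>"""))

print(schema.validate(etree.XML('<dive depth="30.0" depthUnit="m"/>')))           # True
print(schema.validate(etree.XML('<dive depth="thirty metres" depthUnit="m"/>')))  # False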

XML shouldn’t need other formats

This is a tricky one, as when people think of XML, even if they have made a schema, their mind stops there. XML isn't just a format; it's more a suite of related formats that can make handling and manipulating information easier.

It's worth noting that people in that thread raised databases as an alternative – but SQL is only a query language; it requires a formal Data Definition Language to describe the data and an engine to query over it. Likewise, HTML without CSS, JavaScript or any number of the programming and templating languages that power the web would be much less useful to the general public.

Similarly, isolating XML from XML schemas means your data has no enforced structure. Isolating XML from XQuery and XPath means you have no way of querying your data. Without XSLT there is no easy, declarative way to transform XML – and having done this with both traditional languages and XSLT, the latter makes using and transforming XML much easier. Ultimately, using XML without taking advantage of the many technologies in the wider XML landscape is not using the technology at its best.
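
For example, with XPath a query over a document is one line rather than a hand-rolled parser (the document structure below is the same illustrative one as above):

# Sketch: querying XML with XPath via lxml. Structure is illustrative.
from lxml import etree

log = etree.XML('<divelog><dive depth="30.0"/><dive depth="12.5"/></divelog>')
print(log.xpath('//dive/@depth'))             # ['30.0', '12.5']
print(len(log.xpath('//dive[@depth > 20]')))  # 1 dive deeper than 20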

Tips for good XML

With all of that said, XML, like all technologies, can be used poorly. However, when done well and documented properly, a good XML format with an appropriate schema can reduce errors and provide the vital metadata that gives data context and longevity. So I present a few handy tips for using XML well.

  1. Only use XML when appropriate. XML is best suited to complex data, especially hierarchical data. As Linus (and others) points out in the linked thread, tabular data is much better suited to CSV or more structured tabular formats, simple key-values can be stored in INI files, and marked-up text can be done in HTML, Markdown or any number of other formats.
  2. Look for other formats. If you are thinking of using XML for your tool, stop and see what others have already done. The world doesn't need another format, so if you are thinking of making one you should have a very, very good reason to do so.
  3. Use a schema or doctype. If you choose to make your own format, this is the most important point. If you choose to use XML, make a schema. How you capture this (Doctype, XSD Schema, Schematron, Relax NG) is largely irrelevant; what is important is that your data format is documented. There are even tools that can automate creating schema stubs from documents, so there is no excuse not to. As stated above, an XML schema is the formal contract for what your data is, and it lets others know that if the data doesn't conform to the format then it is broken.
  4. Use XML datatypes. XML already has specifications for text, numeric, datetime and identification data. Use these as a starting point for your data.
  5. Store one type of data per field. While the difference between <dive duration="30:00 mins"> and <dive duration="30" durationUnit="mins"> looks minimal, the former uses a single string for two pieces of data, while the latter uses two fields – a number and an enumerable – each storing one piece of data. An even better solution is the XML duration datatype, <dive duration="PT30M">, based on the existing ISO 8601 standard (see the schema sketch after this list).
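
Tying points 3 to 5 together, enforcing that last example takes a single declaration in a schema, after which validating parsers reject malformed durations for free (a sketch, with illustrative names):

<!-- Sketch: declaring the attribute with a built-in XML datatype -->
<xs:attribute name="duration" type="xs:duration" use="required"/>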

Book Review: H.G. Wells The Time Machine

After finishing the Harry Potter series (which doesn't need another poorly written review), I decided on something older, shorter and a little more mature, and chose to tackle some books I've been meaning to read for a long time – namely, H.G. Wells' better-known works.

I fortunately found an anthology – if two stories can be called that – of H.G. Wells' "The Time Machine" and "The Invisible Man". First of all, the story is a thinly veiled critique of the aristocracy of Victorian Britain: the idea that the segmentation and stagnation of class would lead to the eventual dulling of the minds of the leisure class, while the underclasses become more violent and brutish.

However, the one thing I found most interesting is the relationship between the unnamed Time Traveller, the Narrator and Weena – an Eloi woman of the future. The story can be broken into two sections: the parts where the Narrator introduces the cast at the start and closes the story at the end, each about a chapter long, and the bulk of the novel in between, which is the Narrator's recollection of the Time Traveller's account of their travels. Here, I say 'their' for a very important reason: by the Narrator's account the Time Traveller is a man, while the Time Traveller does nothing to speak of their gender or race during the story.

The reason this is fascinating is that during their travels the Time Traveller meets a meek, child-like Eloi woman who accompanies them on their future adventure. Reading the Time Traveller as intended, as a man, presents a view of Victorian gender relations, where women were treated as child-like and in need of a man for protection. However, I found myself attempting to read the story as though the Time Traveller could have been a woman and comparing the subtext. Read that way, Weena becomes more of a child-like companion, and the story reads less like a patriarchal reinforcement of a man overseeing the safety of a woman and more like that of a woman balancing her maternal instincts against her scientific drive and desire to return home – and funnily enough, this reminded me of Aliens.

Given that people often lament the lack of female leads in fiction and film, this seems to be one story where the role could easily be played or read as a woman with very little change to the original text.

Next up, The Handmaid's Tale and then The Invisible Man.

New release: Canard Question Module Editor v0.2.1

It's been a while since my last post, and while a few emails and updates to GitHub have happened, I haven't written a formal announcement here. The big news is that two releases of the Canard Question Module Editor have been published and made available.

Version 0.2 went out on October 1st, and version 0.2.1 went out on the 9th. Normally there'd be a larger gap between releases; however, very shortly after version 0.2 went out the door, a handful of troublesome bugs were discovered in the handling of multilingual text. So a second version followed very shortly after, including bug fixes and some new features that made dealing with multiple languages easier.

Canard Screenshot

Additionally, the QFingerTabsWidget discussed in an earlier post has seen a number of updates to allow for icons and a more responsive design of horizontal text in horizontally aligned tabs. This has become larger than just a gist and will, in the coming weeks, be spun off into its own GitHub repository.

Lastly, with Canard in a relatively stable state and the "plug-in architecture" usable, I'm going to focus on creating SQBL transformations for input, output and visualisation. Suggestions for target formats and languages, or updates or changes to existing plugins, would be welcome in the comments.

If anyone is interested in following updates to SQBL or Canard, I'd strongly recommend joining the SQBL mailing list to keep up to date with releases and new plugins as they are published.