Archive for the ‘ Statistics ’ Category

The public release of “A Case Against the Skip Statement”

A few years ago I wrote a paper titled “A Case Against the Skip Statement” on the logical construction of questionnaires, which was awarded second place in the 2012 Young Statisticians Awards of the International Association for Official Statistics.

It went through two or three rounds of review over the course of a year, but due to shifting organisational aims I was never able to find the time to polish it to the point of publication before changing jobs. So for the past few years I have quietly emailed it around, received some positive feedback, and had a few requests to have it published so it could be cited. I have even referred back to it in conferences and other papers, but never formally cited it myself. I have also used this article as an argument for why the study of ‘classical’ articles in computer science is still important: while Dijkstra’s “Go To Statement Considered Harmful” is dated in traditional computer science, its methods and its mathematical and logical reasoning can still be useful, as seen in the comparison of programming languages and the logic of questionnaires.

As a compromise to those requests, I have released the full text online, with references and a ready-to-use BibTeX citation for those who are interested. The abstract follows the BibTeX references below:

    @misc{Spencer2012,
        title = {A Case Against the Skip Statement},
        author = {Samuel Spencer},
        year = 2012,
        howpublished = {\url{}},
        note = {[Date downloaded]}
    }
or using BibLaTeX:

   @online{Spencer2012,
       author = {Samuel Spencer},
       title = {A Case Against the Skip Statement},
       year = 2012,
       url = {},
       urldate = {[Date downloaded]}
   }

With statistical agencies facing shrinking budgets and a desire to support evidence-based policy in a rapidly changing world, statistical surveys must become more agile. One possible way to improve productivity and responsiveness is through the automation of questionnaire design, reducing the time necessary to produce complex and valid questionnaires. However, despite computer enhancements to many facets of survey research, questionnaire logic is often managed using templates that are interpreted by specialised staff, reducing efficiency. It must then be asked why, in spite of such benefits, automation is so difficult.

This paper suggests that the stalling point of further automation within questionnaire design is the ‘skip statement’. An artifact of paper questionnaires, skip statements are still used in the specification of computer-aided instruments, complicating the understanding of questionnaires and impeding their transition to computer systems. By examining questionnaire logic in isolation we can analyse the structural similarity to computer programming and examine the applicability of hierarchical patterns described in the structured programming theorem, laying a foundation for more structured patterns in questionnaire logic, which in time will help realise the benefits of automation.
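To make the abstract’s central claim concrete, here is a small illustrative sketch (mine, not from the paper itself): the same routing rule expressed first as a paper-style skip statement with explicit jumps, then as structured logic. The question wording and data structures are invented for illustration.

```python
def run_skip_style(answers):
    """Skip-statement style: 'If No, go to Q3' expressed as explicit jumps."""
    order = []
    q = 1
    while q <= 3:
        if q == 1:
            order.append("Q1: Are you employed?")
            if answers["Q1"] == "No":
                q = 3  # the skip statement: jump past Q2
                continue
        elif q == 2:
            order.append("Q2: What is your occupation?")
        elif q == 3:
            order.append("Q3: What is your age?")
        q += 1
    return order

def run_structured(answers):
    """Structured style: the same routing as a nested conditional, no jumps."""
    order = ["Q1: Are you employed?"]
    if answers["Q1"] == "Yes":
        order.append("Q2: What is your occupation?")
    order.append("Q3: What is your age?")
    return order

# Both styles produce identical question paths for every respondent;
# the structured form is simply easier to analyse and automate.
assert run_skip_style({"Q1": "No"}) == run_structured({"Q1": "No"})
assert run_skip_style({"Q1": "Yes"}) == run_structured({"Q1": "Yes"})
```

The equivalence is exactly the point the structured programming theorem makes for programs: jumps add nothing that nesting cannot express, while making the flow harder to reason about.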

A Request for Comments on a new XML Questionnaire Specification Format (SQBL)

This is an announcement and Request for Comments on SQBL, a new
open-source XML format for the cross-platform development of questionnaire
specifications. The design decisions behind SQBL and additional details are the
subject of a paper to be presented in two weeks at the 2013 IASSIST conference in
Cologne, Germany:
– Do We Need a Perfect Metadata Standard or is “Good Enough” Good Enough?
However, to ensure people are well-informed ahead of time, I am releasing details
ahead of the conference.

The gist

SQBL – The Structured (or Simple) Questionnaire Building Language is an
emerging XML format designed to allow survey researchers of all fields to
easily produce questionnaire specifications with the required structure to
enable deployment to any questionnaire platform – including, but not limited
to, Blaise, DDI, LimeSurvey, XForms and paper surveys.

The problem

Analysing the current state of questionnaire design and development shows that
there are relatively few tools available that are capable of allowing a survey
designer to easily create questionnaire specifications in a simple manner,
whilst providing the structure necessary to verify respondent routing and
provide a reliable input to the automation of questionnaire deployment.

Current questionnaire creation tools either:
* prevent the sharing of content (such as closed tools like SurveyMonkey),
* require extensive programming experience (such as Blaise or CASES),
* or use formats that make transformation difficult (such as those based on DDI).
Given the high cost of questionnaire design across the creation, testing and
deployment of final questionnaires, a format that can reduce the cost in any or
all of these areas will have positive effects for researchers.

Furthermore, giving researchers easy tools for creating questionnaires means
they will consequently create structured metadata, reducing the well-understood
documentation burden for archivists.

Structured questionnaire design

Last year I wrote a paper, “The Case Against the Skip Statement”, that
described the computational theory of questionnaire logic, namely the
structures used to describe skips and routing logic in questionnaires. This
paper was awarded 3rd place in the International Association for Official
Statistics ‘2013 Young Statistician Prize’. It is awaiting publication, but
can be made available for private reading on request. It proposed that the
routing logic in questionnaires is structurally identical to that of computer
programs. Following this assertion, it stated that a higher-order language can
be created that acts as a “high-level questionnaire specification logic” and
can be compiled to any questionnaire platform, in much the same way that
computer programming languages are compiled to machine language.
Unfortunately, while some existing formats incorporate some of the principles
of Structured Questionnaire Design, they are incomplete or too complex to
provide the proposed benefits.
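The “compilation” analogy above can be sketched in a few lines. This is a hypothetical illustration, not actual SQBL: the dict layout and the `ask_if` guard are invented names, showing how a structured specification can be rendered down to a flat, numbered paper instrument with generated skip instructions.

```python
# Invented, minimal structured specification: each question may carry an
# 'ask_if' guard naming the question and answer it depends on.
spec = [
    {"id": "smokes", "text": "Do you smoke?"},
    {"id": "brand", "text": "What brand do you smoke?", "ask_if": ("smokes", "Yes")},
    {"id": "age", "text": "What is your age?"},
]

def compile_to_paper(spec):
    """Render the structured spec as a flat, numbered instrument,
    turning each 'ask_if' guard into a paper-style skip instruction."""
    number = {q["id"]: i + 1 for i, q in enumerate(spec)}
    lines = []
    for i, q in enumerate(spec):
        guard = q.get("ask_if")
        if guard is None:
            lines.append(f"Q{i + 1}. {q['text']}")
        else:
            dep, value = guard
            lines.append(f"Q{i + 1}. {q['text']} "
                         f"[only if Q{number[dep]} = {value}; otherwise skip to Q{i + 2}]")
    return lines

for line in compile_to_paper(spec):
    print(line)
```

The same spec could equally be rendered to a web form or a CATI script; the structured source stays the single point of truth, and the skips are a generated artifact of one target.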

SQBL – The Structured (or Simple) Questionnaire Building Language

SQBL is an XML format that acts as a high-level language for
describing questionnaire logic. Small and simple yet powerful, it incorporates
XML technologies to reduce the barrier to entry and make questionnaire
specifications readable, even in raw XML. Underlying this simplicity is a
strict schema that enforces single solutions to problems, meaning SQBL can be
transformed into a format for any survey tool that has a published
specification.

Furthermore, because of its small schema and its use of core XML and HTTP
technologies, SQBL is easier for developers to work with. In turn, this makes
survey design more comprehensible through the creation of easier tools, and
will help remove the need for costly, specialised instrument programmers
through automation.

Canard – the SQBL Question Module Editor

Announced alongside the Request for Comments on SQBL is an early beta release of
the SQBL-based Canard Question Module Editor. Canard is
designed as a proof-of-concept tool to illustrate how questionnaire
specifications can be generated in an easy-to-use drag-and-drop interface. This
is achieved by providing designers with instant feedback on changes to
specifications through its two-panel design, which allows researchers to see the
logical specification, routing paths and example questionnaires all within the
same tool.

SQBL and other standards

SQBL is not a competitor to any existing standard, mainly because a structured
approach to questionnaire design based on solid theory has never been attempted
before. SQBL fills a niche that other standards don’t yet fill well.

For example, while DDI can archive any questionnaire as-is, this is because
of the loose structure necessary for archiving uncontrolled metadata. However,
if we want to make questionnaire specifications that can be used to drive
processes, what is needed is the strict structure of a format like SQBL.

Similarly, SQBL is loosely coupled to other information through standard HTTP
URIs, allowing linkages to any networked standard. For example, Data Elements
may be described in a DDI registry, which a SQBL question can reference via its
DDI URI. Additionally, to support automation, a survey instrument described
inside a DDI Data Collection can use existing linkages to external standards to
point to a SQBL document via a standard URL, rather than pointing to a DDI
Sequence containing the instrument details. Once data collection is complete,
harmonisation can be performed, as each SQBL module has questions pointing to
variables, so data remains comparable downstream.

SQBL in action

The SQBL XML schemas are available on GitHub, along with examples and files
from the video tutorials. There is also a website on the format that provides
more information on some of the principles of Structured Questionnaire Design.

If you don’t like getting your hands dirty with XML, you can download the
Windows version of the Canard Question Module Editor from Dropbox and start
producing questionnaire specifications immediately. All that needs to be done
is to unzip the file and run the executable inside. Due to dependencies,
flowcharts may not be immediately available; however, this can be fixed by
installing the free third-party graphing tool Graphviz.

Lastly, there is a growing number of tutorial videos on how to use Canard on YouTube.

Video 1 – Basic Questions (2:17 min)
Video 2 – Complex Responses (2:17 min)
Video 3 – Simple Logic (4:11 min)

There is also an early beta video that runs through creating an entire
questionnaire showing the side-by-side preview. (13:21 mins)

Joining the SQBL community

First of all, there is a mailing list for SQBL hosted by Google Groups.

Along with this, each of the GitHub repositories includes an issue tracker. Both
Canard and SQBL are in early design stages, so there is an opportunity for
feedback and input to ensure that both support the needs of all questionnaire
designers.

Lastly, while there are initial examples of conversion tools to transform SQBL
into DDI-Lifecycle 3.1 and XForms, there is room for growth. Given the
proliferation of customised solutions to deploy both paper and web-forms there
is a need for developers to support the creation of transformations from SQBL
into formats such as Blaise, LimeSurvey, CASES and more.

If you have made it this far thank you for reading all the way through, and I
look forward to all the feedback people have to offer.

Cheers and I look forward to feedback now or at IASSIST,

Samuel Spencer.
SQBL & Canard Lead Developer
IASSIST Asia/Pacific Regional Secretary

Sir Roland Wilson – The man who burned the census

“Bloody well do what we tell you and you’ll be fine.”
Roland Wilson (Secretary of the Treasury) speaking to Billy Wilson (Treasurer)

Sir Roland Wilson headed up the Commonwealth Bureau of Census and Statistics (precursor to the Australian Bureau of Statistics), the Treasury and the Department of Labour and National Service, amongst others. If you are interested, you can read about his wide-ranging career (including his history as an early student of the Chicago School of Economics) in his well-written obituary. While I won’t try to rewrite this, what it doesn’t capture is my personal favourite story of Sir Roland’s career, during his time as Commonwealth Statistician.

This story comes from Informing a Nation: the Evolution of the Australian Bureau of Statistics, which details the core principles at the heart of the public service: trust and service to the community. The quote below goes into detail, but the short version is that when faced with having to relinquish confidential information on individuals, violating the trust between the Bureau and the public, Roland Wilson chose to torch the records and defy Cabinet rather than violate the trust of the public.

That, I think, is the level of bravery and commitment that should stand at the heart of all public servants. It’s just a shame that they don’t sell little silicone wristbands branded WWRWD to remind us all to ask “What Would Roland Wilson Do?”

Throughout the history of the Bureau, its statisticians have preserved the confidentiality of the information provided by individuals and businesses. Today, the Census and Statistics Act protects the confidentiality of data reported to the Bureau. However its statisticians through the decades have always ensured that the data reported to them by individual respondents remained confidential.

For example, Sir Roland Wilson (Commonwealth Statistician 1936–1940 and 1946–1948) once told the story of how legislation for a Census of Wealth was hastily drawn up in the early days of World War II. The legislation was badly drafted and mentioned that the Commissioner of Taxation could have access to the data – without making it clear that he could only access the collated information.

Subsequently, during a tax evasion case, the Commissioner of Taxation formed the view that he could win the case by accessing the defendant’s individual Census of Wealth data.

‘[He] … came storming into my office one day and demanded this bloke’s wealth card and I said he couldn’t have it. “Why?” “Because they are confidential and if it was used in a court case it could wreck our reputation”.

The Commissioner of Taxation, not content with this reply, took the matter to Cabinet and convinced it to approve his access to the individual’s data. Then he went back to Wilson to collect the information.

‘Oh, he was on the seventh heaven of delight and he came storming along with his two Deputies, waved the Cabinet decision at me and said, “You’ve got to hand those cards over to me”. “I’m sorry … I can’t.” [Said Wilson] “What do you mean? I’ve got a Cabinet decision!” [The Commissioner exclaimed]. ‘[Wilson replied] “You’re about a week too late. I piled them onto two trucks last week, sent them down to Sydney and incinerated them”.

- Sir Roland Wilson, interviewed in 1984.

Why I’ve chosen to make a new XML standard for questionnaires

XKCD #927

Normally I don’t like XKCD, but this is so true.

I’ve made no secret of the fact that I’ve been working on a new format for questionnaires. I recently registered a domain for the Structured Questionnaire Building Language, and have been releasing screenshots and a video of a new tool for questionnaire design that I’m working on. Considering that I’ll be covering this work at at least one conference this year, and given my close ties to a few technical communities, I felt it would be good to discuss why this is the case and answer a few questions people may have.

Why is a new format for questionnaire design necessary?

Over the past few years I’ve done a lot of research analysing how questionnaires are structured in a very generic sense. Given the simplistic nature of the logic traditionally found in paper and electronic questionnaires, and its logical similarity to computer programming, I’ve theorised that it should be possible to use the same methods (and thus the same tools) to support all questionnaires – including the oft-ignored paper questionnaire. Unfortunately, attempts to improve questionnaires have focused on proprietary or limited use cases, which is why tools and formats such as Blaise, CASES and queXML exist but generally only support telephone or web surveys. Likewise, all of these attempts have ignored the logical structure in various ways and discouraged questionnaire designers from becoming intimately, and necessarily, familiar with the logic of their questionnaires.

SQBL on the other hand is an attempt at designing a specialised format to support the capture of the generic information that describes a questionnaire. Likewise, Canard is a parallel development of a tool to allow a researcher to quickly create this information, as a way to help them create their questionnaire, rather than just document it afterwards.

As a quick aside, if you are interested in this research on Structured Questionnaire Design, it is still awaiting publication, but if you email me directly I’ll be glad to forward you as much as you care to read – and probably more.

Why not just use DDI?

Given the superficial overlap between SQBL and DDI, this is not an uncommon question, even at this early stage. I’ve written previously that writing software for DDI isn’t easy: writing software that is user friendly, can handle all of the edge cases that DDI does, and operates using the referential structures that make DDI so powerful is hard. Really hard. Given that a format is nothing without the tools to support it, I previously wrote a three-part essay on how to extend DDI in the necessary ways to support complex questionnaires. However, even this is fraught with trouble, as software that writes these extensions would have trouble reading “un-extended” DDI. What is needed is a tool powerful enough to capture the content required of well-structured questionnaires in a user-friendly way, and it seemed increasingly unlikely that this was possible in DDI.

A counterpoint is to ask “why DDI?” DDI 2 and 3 are exemplary formats for archival and discovery, precisely because both are very flexible and can capture any and every possible use case – which is absolutely vital when working in an archive to record what was done. However, when we turn this around and look at formats that can be predictably and reliably written and read, what is needed is rigidity and strict structure. While such rigidity could be applied to DDI, it risks fracturing the user base, leading to “archival DDI”, “questionnaire DDI” and who knows what else.

Thus I deemed the decision to start again, with a strict, narrow use case, uncomfortable but necessary.

What about DDI?

I did some soul searching on this (as much soul searching as one can do around picking sides in a ‘standards war’), and realised that there really is no point in “picking sides”. SQBL isn’t perfect and isn’t yet complete, and more to the point it supports a very narrow use case. If I view DDI as a flexible archival format, there is a lot of work necessary to support conversion into and out of it for discovery and reuse. Likewise, if I view SQBL as a rigid living format for creating questionnaires, the question becomes how to link this relatively limited content with other vital survey information. By definition SQBL has a limited useful timeframe, and once data has been collected (if not earlier) it is no longer necessary, so conversions or linkages to other formats become required.

Somewhere between these overlaps is where DDI and SQBL will handshake, and perhaps in future standards this handshake will be formalised. This means there is a lot of work on both sides of the fence, in which I look forward to playing an active part. But in the interim, and for questionnaire design, I believe SQBL will prove to be a necessary new addition to the wide world of survey research standards.

3 quick questions to identify children in financially risky families

  1. Do you live with mommy and daddy? 
  2. Does mommy or daddy smoke?
  3. Do you eat a lot of fruit or nuts at home?

The Australian Bureau of Statistics recently released data cubes on Household Expenditure, giving an insight into how Australians use their money. What is of interest is that they provide a breakdown of average weekly expenditure for a variety of products, tabulated against the number of financial stressors. The ABS defines financial stressors as events where a household is unable to pay bills or goes without meals.

The reason this is interesting is that there are very few factors with as striking a correlation to the number of financial stressors within a household as those below:

Number of indicators of financial stress experienced
Risk factors                                                      0      1      2      3     4+
Tobacco products ($/week)                                      8.89  13.01  14.63  18.02  21.45
Newspapers ($/week)                                            3.17   2.80   2.64   1.47   1.48
Fruit and nuts ($/week)                                       14.06  12.81  11.41  10.62   7.98
One parent family with dependent children (%)                   1.9    4.7    8.3    9.8   19.3
All renters (%)                                                18.6   28.3   34.0   39.3   54.4
Main source of income: government pensions and allowances (%)  17.8   22.1   25.2   28.3   52.1

Note that for newspaper and fruit-and-nut weekly expenditure, the correlation is negative.

The problem with some of these metrics is that while they are all strongly correlated with financial stress, they may not all be obvious to children. For example, a child may not know about their parents’ income or whether they live in a rental property, and certain actions may be hidden from the child, such as a parent buying a newspaper on the way to work.

Other metrics, however, are more obvious for children to notice and report, such as who they live with, obvious activities of their parents (like smoking) and their own diets. This leads us to the three questions listed above:

  1. Do you live with mommy and daddy?
  2. Does mommy or daddy smoke?
  3. Do you eat a lot of fruit or nuts at home?

Now, while these may not account for same-sex couples, a child in a two-parent same-sex household would, given sufficient prompting, probably indicate they had two parents. Furthermore, this is based on aggregate information; however, there is a good chance unit records may back these correlations up. Lastly, this looks at correlations for risk factors and cannot be used to suggest causation. Together, however, these three questions can quickly give a strong indication of the risk of financial stress within a child’s household.

How statistics can overstate the risk of youth suicide

Update: I have renamed this post, as it was pointed out that the original title, in my attempt to make it short enough for Twitter, came across as a little antagonistic. The original title is still visible in the URL so old links don’t break.

In the words of the new (and poorly named) “Soften the Fuck Up” campaign

Suicide is the leading cause of death amongst young folks and most of them are blokes.

People have recently been regurgitating figures from the Australian Bureau of Statistics about how the leading cause of death for males aged 15-45 is suicide. I was briefly taken aback and shocked at such a thought. I mean, I’m a male aged 15-45; is suicide in my near future? Until I came to the realisation: what other causes of death would there be for someone my age?

Combined deaths for males across selected (aggregated) causes (source ABS)


What the above graph shows is the combined number of deaths in each age bracket for the most prolific causes of death across the age range. A few things instantly stand out. Firstly, the number of suicides is relatively steady across people’s lifespans. This indicates that suicide isn’t a youth issue, it’s a people issue, but we will go into depth on this later. The fact is that fewer young people die overall compared to older age groups, coupled with a relatively steady suicide rate across the whole lifespan. So it is to be expected that suicide is more common in younger and healthier demographics, because there are few other large causes of death.

The positive news is that only 60 people died from assault in 2009, but sadly such ‘good’ news is rarely newsworthy.

Secondly, the number of deaths for young people is relatively low compared to older people. The leading causes of death for people aged 35 and older (heart disease and cancer) are relatively non-existent in people under 35, and when these common and natural causes of death are removed, there is little left to cause death in younger populations. Furthermore, when preparing this graph it became apparent that the leading cause of death for those aged 15-24 isn’t suicide, but traffic accidents: when car and motorcycle deaths are combined into a single figure, these numbered greater than the number of suicides for the same year for the 15-24 age bracket.

Digging a little deeper we can take a stronger look at the relation between suicide and age, and we come up with a graph like that shown below:

Suicide rates for males aged 15-65 for the years 2000-2009 (source ABS)

From this we can see again that suicide peaks at around forty, before tapering off. Although this view sets aside the point highlighted above, that suicide is relatively steady across ages, we can draw a quite positive message from it: over the last ten years, suicide has fallen for every group except those over the age of 55. Again, a far cry from the youth warnings we are accustomed to.

Lastly, with suicide portrayed as primarily a young male problem, the comparison with women across the lifecycle warrants attention.

Suicide as a percentage of deaths (source ABS)

As the graph shows, suicide rates are closest to equal at young ages, compared with the growing disparity as people age.

The problem with people’s interpretation of these mortality statistics released by the ABS is that the figures are segregated by age, which complicates their interpretation. When they are aggregated into larger blocks without accounting for the natural underlying growth in death rates, the disparity in suicide rates between the age groups is almost hidden.
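The arithmetic behind this effect can be demonstrated with deliberately simplified, invented figures (these are NOT ABS data): hold the suicide count roughly steady across brackets while total deaths grow with age, and suicide’s share of deaths is automatically largest in the youngest bracket.

```python
# Invented illustrative figures, NOT ABS data: steady suicide counts against
# rising total deaths make suicide a 'leading cause' among the young purely
# because few young people die of anything at all.
suicides   = {"15-24": 200, "35-44": 220, "55-64": 180}
all_deaths = {"15-24": 900, "35-44": 3000, "55-64": 12000}

for bracket in suicides:
    share = 100 * suicides[bracket] / all_deaths[bracket]
    print(f"{bracket}: suicide accounts for {share:.1f}% of deaths")
# Similar raw counts in every bracket; only the denominator changes the picture.
```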

On a personal note, 200 males aged 14-24 died of ‘intentional self-harm’ in 2009, and I knew one of them. He was not a statistic, or a piece of data in a chart; he was a friend to many. What happened was a tragedy, but what I, and a lot of people, learned was that it was an unpredictable and unexpected event.

Suggesting that suicide or self-harm is a common or obvious event does a disservice to anyone who has been affected by suicide. We are fortunate enough to live in a society where suicide is relatively rare among all demographics. That is why it hurts us so strongly when it happens, because it is so uncommon.

I am not suggesting that we should not be vigilant with those close to us. Suicide can be prevented, but pulling out “scare statistics” that suggest that what happened should have been obvious does nothing to help those who are left. Suicide is not a subject that should be taken lightly, and baseless, uneducated statistics have the potential to hurt a lot more than the event itself.

420 convert classifications everyday

With the recent release of the new Australian Standard Classification of Drugs of Concern from the ABS, there was an opportunity to field test the Virgil CSV to DDI converter with real data to see how it held up. Fortunately, the classification was released as an Excel data cube that conformed almost entirely with the structures that Virgil supports. After a little cleaning of the CSV, it was able to run through the converter with few issues at all. Incidentally, the most major error highlighted a massive oversight: the converter failed to add values for the codes! This has been corrected, the changes have been pushed to the SVN, and a new version of the Windows tool will be pushed out this weekend.

A screen shot of Virgil with the converted classification

A screen shot of Virgil with the converted classification

Opening the newly created DDI file in the Virgil DDI CodeList Editor was another story, and pointed out a few flaws in how it handles empty data. With the structure from the Excel file not containing descriptions for any category or any labels for the CodeScheme, a few small corrections were made to accommodate freshly created DDI, but many of these problems will be ironed out by the time the CodeList Editor is available for download.

While the converter hasn’t been fully integrated into the CodeList Editor, it will shortly be possible to create a single DDI file and import numerous CSV files to create a series of classificatory codelists in a single package. A practical and soon-to-be-realised example would be the Australian Standard Classification of Drugs of Concern, with the list of drugs of concern, forms of drug and methods of consumption codelists all contained in a single machine-processable DDI package.
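As a rough illustration of the CSV-to-codelist step described above (a sketch under assumptions, not Virgil’s actual code: the two-column layout and the labels are invented for illustration):

```python
import csv
import io

# A cleaned, two-column CSV of codes and labels; real classification data
# cubes need more cleaning than this before they reach this stage.
sample = io.StringIO(
    "code,label\n"
    "1,Example drug group A\n"
    "2,Example drug group B\n"
)

def read_codelist(fileobj):
    """Read code/label pairs into a codelist mapping. Every code must carry
    both its value and its label, which was exactly the oversight the
    field test caught."""
    return {row["code"]: row["label"] for row in csv.DictReader(fileobj)}

codelist = read_codelist(sample)
print(codelist)  # {'1': 'Example drug group A', '2': 'Example drug group B'}
```

A mapping like this is then straightforward to serialise into a CodeScheme, one Code element per entry.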

For those who haven’t been able to download or run the converter, the output from this example is available for testing.

Questionnaire design with DDI – Part 5: Can it be done?

This is the fifth in a five-part series on working with questionnaires and surveys managed using the Data Documentation Initiative XML standard. With DDI being an emerging technology, it is important that users are provided with best practices to ensure that they use the standard in a way that is logical, coherent and, most importantly, usable and reusable. This series of tutorials and discussions is aimed at users who have some knowledge of DDI and would like to know how to effectively mark up existing and future questionnaires in DDI.

Part 5: Can it be done?

On the eve of IASSIST 2011, we are going to look back at the last few tutorials and ask the big question: Can DDI 3.1 be used as a viable format for the creation of online survey instruments? The simple answer is yes, but there are a few caveats.

Caveat 1: Only use the transformation of a finalised survey

This is a very basic idea, but one that should be reiterated. When working to create a web survey, there will more than likely be an iterative process: design DDI, transform, examine the form, refine the DDI, and repeat until done. This is a necessary part of development, as no one ever gets it right the first time. However, once a form has been transformed, there should be organisational sign-off to ensure that the form is never changed again. While the DDI will be versioned, meaning the original DDI that the form was based on would still be there, altering and overriding an in-production form can only lead to data issues.

Caveat 2: Transforming DDI into an e-form is a destructive transformation

InstrumentML as presented in Part 2 is not, and will never be, a part of the DDI 3.1 specification. It is a way to transform the logical structure of DDI into a form easily usable in certain situations. InstrumentML is a way of caching the implied instrument flow to make machine processing easier. Likewise, the transformation from DDI to e-form is done to present the information in the DDI in a way that is easier for users to understand.

Caveat 3: Any changes to a HTML instance of a form will not be pushed back to DDI

It is a fact that no transformation that turns a logical structure into a visual page will be perfect, so it is reasonable to assume that web pages created from DDI may need to be altered. Perhaps a list of response options is better as radio buttons than as a drop-down list, a line break is needed to alter word flow, or the dimensions of a text box need to be precise. These types of issues will arise, and there are three cascading ways of resolving them:

  1. Alter the DDI to match the expected output,
  2. Alter the CSS to alter the presentation,
  3. If neither of the above produce the needed results, then and only then edit the generated HTML.

With DDI allowing XHTML tags in most labels for semantic markup, altering the DDI is the first and most appropriate way to alter pages. If what needs to change is the wording, this is also the only place to edit.

An example of where editing the HTML is needed is a question based on age in a survey measuring labour figures. This is a numeric question, but in this case we have decided the populations below 18 and over 65 will be too small, so we are going to code everything outside these ranges to the codes ‘<18’ and ‘>65’. So we have maximums and minimums, and their codes, but on advice from a survey methodologist we won’t be restricting users from entering figures outside these ranges.
However, our pre-generated HTML (based on the suggestions in Part 4) will bring these restrictions across, so in this case we can edit this question to move the minimum from 18 to 0, as age still needs to be above 0, and remove the upper restriction.
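As a rough sketch of the kind of hand-edit described above (the attribute names and the generated markup are illustrative, not what any particular DDI transform actually emits):

```python
# Hypothetical <input> emitted by the DDI-to-HTML transform,
# carrying the 18-65 bounds across from the DDI metadata.
generated = '<input type="number" name="age" min="18" max="65"/>'

# Hand-edit: keep age non-negative, but drop the upper restriction so
# respondents outside 18-65 can still answer (per the methodologist's
# advice); out-of-range values are coded to '<18' and '>65' later.
edited = generated.replace('min="18"', 'min="0"').replace(' max="65"', '')

print(edited)  # <input type="number" name="age" min="0"/>
```

As the caveat states, an edit like this lives only in the generated HTML and is never pushed back into the DDI.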

Caveat 4: Final transformations should be packageable to be stored with the DDI

If a DDI instrument was transformed into a PDF format for printing, no good archivist would think twice about storing a copy of this as a manifestation of the instrument. Likewise, when transforming from DDI to other formats to ease machine use, keeping a copy of the result is paramount. Software will change, and how one version of a tool transforms the DDI may not always be the same as how another transforms it.

In the instance of Ramona, a package consists of the HTML used to semantically mark the questions, the InstrumentML containing the logical flow, and the CSS of the presentation layer. By storing this information, examining the survey at a later date becomes easier for researchers who may no longer have access to the software that did the original transformation.
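A minimal sketch of such a package, using the standard library's zip support (the file names and contents here are hypothetical placeholders):

```python
import io
import zipfile

# Illustrative package of a transformed instrument: the semantic HTML,
# the cached InstrumentML flow, and the presentation CSS, stored
# together so the survey can later be examined without the original
# transformation tooling.
artifacts = {
    "instrument.html": "<form>...</form>",
    "instrument-flow.xml": "<instrumentML>...</instrumentML>",
    "presentation.css": "form { margin: 1em; }",
}

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    for name, content in artifacts.items():
        package.writestr(name, content)

# The archive now holds all three manifestations side by side.
with zipfile.ZipFile(buffer) as package:
    print(sorted(package.namelist()))
```

Any hand-edits of the kind described in Caveat 3 would be captured here too, since the packaged HTML is the final, corrected version.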

What is also important is making sure any valid alterations to a form, like those described in the previous caveat, are stored as a part of this package.

What we have is proof that it can be done, but it’s not done yet

As a closing point, people may be wondering if they can see the code that a lot of this information is based on. At this stage I’m not planning on releasing the code for Ramona just yet, mostly because it’s not very good. These tutorials are based on things I learned as I went about trying to create a DDI web-form viewer, and as such it is less production code and more a bloody narrative showing my battles and trials against what can be a very tough and complex standard.

Over the next few weeks I’ll be stripping the code down, separating the wheat from the chaff, and rewriting large sections of it, and with luck, by the time EDDI2012 rolls around there should be a fully functional DDI e-forms tool ready to roll out into production.

This brings to a close this series of tutorials dealing with online survey development in DDI. In the coming weeks I’ll also start putting together a short introduction to DDI for new users, and spending the bulk of my free time actually writing code, rather than writing about writing code.

And lastly, enjoy IASSIST 2011 everyone!

Questionnaire design with DDI – Part 2: Where am I and where do I go next?

This is the second in a 5 part series on working with questionnaires and surveys managed using the Data Documentation Initiative XML standard. With DDI being an emerging technology, it is important that users are provided with best practices to ensure that they use the standard in a way that is logical, coherent and, most importantly, usable and reusable. This series of tutorials and discussions is aimed toward users who have some knowledge of DDI and would like to know how to effectively design and mark up existing and future questionnaires in DDI.

Part 2 : Where am I and where do I go next?

In the previous post on questionnaires using DDI we looked at how it is possible in DDI to repeat sequences of questions, potentially leading to endless loops of questions within a survey. In a paper survey this would be quickly picked up, but in an electronic survey this might not always be the case. To help identify these issues, a tool called Sheri was introduced that consumes a DDI instrument and checks for potential issues of repeated questions and endless loops. Sheri is also able to generate a non-standard XML format representing an entire survey marked up in DDI. But this leads to the question: “if one were to design a survey in DDI, why introduce another XML format?”

During research into DDI as a data capture standard it quickly became apparent that, for all of its benefits in designing surveys, DDI was very difficult for programs to consume to create instances of these instruments. This is due to how DDI describes an Instrument. In DDI a whole survey instrument is represented by a single tag – Instrument – which in turn references a single Sequence within a ControlStructureScheme. This is simple enough to understand, but once the first Sequence is found the structure becomes quite interesting. Every Sequence, or other ControlConstruct, maintains references to its child sequences. For example, a Sequence can reference multiple other Sequences, which reference more Sequences, and so on. Likewise, a Loop maintains a list of linked ControlStructures to loop over, and conditional IfThenElse tags contain references in both their Then and Else clauses that refer to other ControlStructures. What this means is that although there is an implied hierarchy, it is stored as a flat list of data structures with references between them.

What this leads to is difficulty in determining where exactly in the hierarchy one is when given only the id of the current structure, especially when dealing with the stateless nature of the web. For example, the test software I was writing, Ramona, would take a sequence id as an argument and render the corresponding DDI sequence as a form on a webpage, with web controls for each question. What quickly became an issue was determining what the next page to display was when a user was done with a sequence.

In a DDI ControlStructure, the ID of a child element is stored as the text of an ID element under a ControlConstructReference. What this means is that to determine the next possible sequence given a sequence id, you need to look for references to the object, not the object itself. Then, from the reference, search back for an appropriate ancestor element and return its next sibling to show the correct form. This quickly becomes complicated, and when dealing with large DDI test files, doing plain-text searches throughout an unmanaged* hierarchy proved too slow and complex for real-time returns.
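The awkwardness of this reverse lookup can be sketched on a heavily simplified, illustrative layout (the tag names below are stripped down and are not the full DDI 3.1 schema):

```python
import xml.etree.ElementTree as ET

# A stripped-down sketch of the flat DDI layout described above: a
# parent Sequence holds ControlConstructReference elements whose ID
# text points at other Sequences stored as siblings, not children.
doc = ET.fromstring("""
<ControlStructureScheme>
  <Sequence id="Main">
    <ControlConstructReference><ID>Seq1</ID></ControlConstructReference>
    <ControlConstructReference><ID>Seq2</ID></ControlConstructReference>
  </Sequence>
  <Sequence id="Seq1"/>
  <Sequence id="Seq2"/>
</ControlStructureScheme>
""")

def next_sequence(current_id):
    # To find what follows `current_id`, every Sequence's reference
    # list must be scanned for the ID text, then the following sibling
    # reference taken -- a reverse lookup, not a direct tree step.
    for parent in doc.iter("Sequence"):
        refs = [r.findtext("ID")
                for r in parent.findall("ControlConstructReference")]
        if current_id in refs:
            i = refs.index(current_id)
            return refs[i + 1] if i + 1 < len(refs) else None
    return None

print(next_sequence("Seq1"))  # Seq2
print(next_sequence("Seq2"))  # None -> end of the parent sequence
```

Even in this toy form, answering "what comes next?" means scanning every reference list in the document rather than stepping to a sibling node.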

A further complication arises if a survey reuses an object via multiple references (as opposed to using Loops). In such cases, using the above method, it becomes impossible to easily determine which is the correct parent, because a reverse lookup for an object referenced multiple times will only return a list of referring objects with no context about which one we need to follow.

For example, the following sequence:

Sequence id="Seq 1"
    sequence reference: Seq 3
Sequence id="Seq 2"
    sequence reference: Seq 3
Sequence id="Seq 3"
    question reference: Q1: Where did I come from?

resolves to a structure like:

Sequence id="Seq 1"
    Sequence id="Seq 3"
        question reference: Q1: Where did I come from?
Sequence id="Seq 2"
    Sequence id="Seq 3"
        question reference: Q1: Where did I come from?

But, given just the id of Sequence 3, we can’t determine whether, after answering the question, the survey is over (if we arrived at Sequence 3 from Sequence 2), or whether we still have to go to Sequence 2.

To solve these issues of speed, development complexity and determining location, it is therefore necessary for applications using DDI to “pre-compile” Instruments into a traditional hierarchical form. In both Ramona and Sheri, the solution is to resolve the references and copy the referenced elements into the parent structure. This allows much of the DDI metadata to be retained and used.

This is not valid DDI, and it is unlikely that it ever will be.

This is also not a problem. DDI is useful as an archival and transportation language for statistical metadata, and the flexibility that the current structure provides is quite useful. However, when looking at data collection, it can be safely assumed that the DDI Instrument will be relatively stable. If best practices are followed, once a DDI Instance is published it will never change. Thus it is a perfectly valid action to transform an Instrument in this way, as long as two conditions are met: the transformation is seen as a one-way destructive transformation, and the resulting pseudo-DDI instrument is never changed. To provide another example of why this is a normal situation, there are tools being developed to transform DDI into PDF questionnaires. This is very much the same process: the PDF is seen as a projection of the original DDI to make it easier for people to use, but not the actual ‘source of truth’. Transforming DDI Instruments into a dereferenced pseudo-DDI Instrument is exactly the same: a transformation of complex metadata into a form that machines can easily work with.
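A minimal sketch of the dereferencing step, using an illustrative dictionary layout rather than real DDI XML, and assuming the instrument has already been checked for endless loops (as discussed in Part 1, since naive resolution would recurse forever on a cycle):

```python
# Flat, reference-based layout, mirroring the Seq 1/Seq 2/Seq 3
# example above. The data layout is illustrative, not the DDI schema.
flat = {
    "Seq1": {"refs": ["Seq3"]},
    "Seq2": {"refs": ["Seq3"]},
    "Seq3": {"refs": [], "question": "Q1: Where did I come from?"},
}

def dereference(seq_id):
    # Resolve each reference by copying the referenced structure into
    # its parent, turning the flat list into a real hierarchy.
    seq = flat[seq_id]
    node = {"id": seq_id,
            "children": [dereference(r) for r in seq["refs"]]}
    if "question" in seq:
        node["question"] = seq["question"]
    return node

# Seq3 is copied under both parents, matching the resolved structure
# shown in the example above.
tree1 = dereference("Seq1")
tree2 = dereference("Seq2")
print(tree1["children"][0]["id"], tree2["children"][0]["id"])  # Seq3 Seq3
```

Once resolved like this, ordinary tree traversal works, at the cost of duplicating any structure that was referenced more than once.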

There is one issue that these pseudo-DDI Instruments will have: when a single data structure is referenced multiple times, it will occur multiple times in the resultant tree. In cases like this, it is still difficult to determine which element is the correct one when given just an ID. There are two possible solutions. The first is that instead of managing state based on a single ID, it is managed as the full XPath of the element, possibly speeding up traversal, but also presenting the possibility that the structure of the form could be exposed to users – which may or may not be a security issue depending on the form. Alternatively, as discussed in part one of these tutorials, Instruments can be restructured so that no structure is referenced more than once, making the form easier to traverse as well as limiting user frustration.
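The first solution can be sketched as follows: identifying each occurrence of a structure by its full path rather than its bare ID, so that a sequence copied into two places yields two distinct states (the tree layout is the illustrative one from the earlier example, not real DDI):

```python
# Resolved pseudo-DDI hierarchy in which Seq3 occurs twice.
tree = {
    "id": "Main",
    "children": [
        {"id": "Seq1", "children": [{"id": "Seq3", "children": []}]},
        {"id": "Seq2", "children": [{"id": "Seq3", "children": []}]},
    ],
}

def paths(node, prefix=""):
    # Yield an XPath-like identifier for every node; duplicated
    # structures get distinct paths even though their IDs collide.
    here = f"{prefix}/{node['id']}"
    yield here
    for child in node["children"]:
        yield from paths(child, here)

for p in paths(tree):
    print(p)
# /Main
# /Main/Seq1
# /Main/Seq1/Seq3
# /Main/Seq2
# /Main/Seq2/Seq3
```

The two occurrences of Seq3 are now unambiguous, at the cost of the state token revealing the form's structure, as noted above.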

It should be noted that restructuring is not a necessity for making DDI Instruments processable by software, just something that makes them easier to use with traditional methods and existing XML libraries, and it can provide benefits to execution times if that is an issue. However, when dealing with desktop software, many of the issues of web development with regard to stateless vs. stateful systems or the scalability of systems with concurrent users cease to apply. In such situations, it is quite possible to work with the DDI Instrument directly, without manipulating the data structure, and in some cases this may be preferable.

In conclusion, how strictly we manage the DDI metadata structure depends very strongly on the role the metadata plays in a system. In some cases, such as computer-aided interviewing where the instrument should be extremely stable, the transformation of DDI to a format that is more easily processed by systems, or even users, can be preferable to using plain DDI. What is important is to focus on DDI as a tool for increasing transparency and reusability in statistical processing; as long as the methods used to transform DDI into intermediate forms are well documented, such transformations do not violate existing best practice.

Next up… Questionnaire Design with DDI – Part 3: What am I doing here? – A look at best practices for what control structures to include in sequences and how to deal with logical structures and questions.

* Unmanaged in the sense that the hierarchy is stored as references between XML elements, and not as a traditional XML hierarchy, and as such traditional tree traversal methods for XML cannot be used.

Questionnaire design with DDI – Part 1: Will this survey ever end?

This is the first in a 5 part series on working with questionnaires and surveys managed using the Data Documentation Initiative XML standard. With DDI being an emerging technology, it is important that users are provided with best practices to ensure that they use the standard in a way that is logical, coherent and, most importantly, usable and reusable. This series of tutorials and discussions is aimed toward users who have some knowledge of DDI and would like to know how to effectively design and mark up existing and future questionnaires in DDI.

Part 1: Will this survey ever end?

This is a classic question that survey takers often ask, and not usually that politely. When a survey will end is an important question for users and designers alike, as the time users have to spend filling out a form can impact their likelihood of finishing or even starting a survey. So determining if a survey will end is of vital importance for users of DDI.
Unfortunately, the answer, based on simple analysis, is that you can never tell. DDI uses a methodology of complex referencing between sequencing objects to build up the structure of a questionnaire in a way that is highly reusable. Take for example the following pseudo-survey:

Survey: All about You!
    Part 1: Feelings
        Q1: Are you feeling happy today?
        Q2: How do warm summer days make you feel?
    Part 2: Favourites
        Q3: What is your favourite ice-cream?
        Q4: What is your favourite animal?

Now, restructuring this in a way more analogous with DDI gives:

Sequence id="Part 1" title="Feelings"
    question reference: Q1: Are you feeling happy today?
    question reference: Q2: How do warm summer days make you feel?
Sequence id="Part 2" title="Favourites"
    question reference: Q3: What is your favourite ice-cream?
    question reference: Q4: What is your favourite animal?
Instrument id="survey" title="All about You!"
    sequence reference: Part 1
    sequence reference: Part 2

The important thing to notice is that Parts 1 and 2 and the main survey are all the same structure: a DDI Sequence. In DDI, to increase reusability, sequences can be included by reference within each other. However, an issue arises when a designer (either a person or a piece of software) fails to account for this, sending the survey taker into an endless loop. For example, restructuring a different survey in the style of DDI gives:

Sequence id="Verse 1"
    question reference: Q2n-1: Is this the song that never ends?
    sequence reference: Verse 2
Sequence id="Verse 2"
    question reference: Q2n: Does it go on and on my friends?
    sequence reference: Verse 1
# Apologies to Lamb Chop

An even shorter (but much less likely) example would be:

Sequence id="infinity"
    question reference: Qn: Do you believe how vastly, hugely, mind-bogglingly big infinity is?
    sequence reference: infinity
# Apologies to Douglas Adams

What these two examples show is how it is possible to send a user through the same sequence twice, potentially leading to an endless loop of questioning. With DDI providing appropriate mechanisms for Looping over questions, it is quite likely that this kind of structure will always be the result of a mistake. However, it does highlight how the flexibility DDI provides to allow users to reuse metadata can, when used inappropriately, cause issues.
Fortunately, as part of the research in developing a DDI web application for data collection, it was necessary to create a module able to determine the implied structure and possible ending of a DDI questionnaire. A web-service for this module (aka. Shari) is available online at
One of the notable features of Shari is that it will create a hierarchical non-DDI XML serialisation of a given questionnaire (the reasons for this will be covered in Part 2 of this series, “Where am I?”). Along with this it will check that no two sequences have references to the same sequence, to confirm that the questionnaire will halt. However, this ‘halting’ of the survey is based on the provision that any DDI Loops within the survey are also able to end, and that is a true ‘halting problem’.
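That duplicate-reference check can be sketched as follows. The data layout is illustrative (Shari itself works on DDI XML); the key point is that incoming references are counted, including the one from the Instrument root, so the "song that never ends" example is flagged:

```python
from collections import Counter

# Each structure maps to the list of sequences it references. The
# Instrument's own reference to Verse 1 is included, so the cycle
# below gives Verse 1 two incoming references.
references = {
    "Instrument": ["Verse 1"],
    "Verse 1": ["Verse 2"],
    "Verse 2": ["Verse 1"],   # second reference to Verse 1 -> flagged
}

def repeated_targets(refs):
    # Any sequence referenced more than once breaks the implied tree
    # structure and signals a possible endless loop.
    counts = Counter(t for targets in refs.values() for t in targets)
    return sorted(t for t, n in counts.items() if n > 1)

print(repeated_targets(references))  # ['Verse 1']
```

If every sequence reachable from the root is referenced exactly once, the implied structure is a tree and the survey must end, provided any DDI Loops also terminate.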
By throwing an error whenever a sequence is referenced twice, it may seem that there is an issue: two sequences in different branches could link to a third sequence without causing an endless loop, yet still be rejected by the system. However, it is possible to rewrite any instrument that relies on multiple references to a sequence to ensure convergence of a survey, in a way that eliminates the duplicated references. For example:

Instrument id="Main"
    sequence reference: Seq 1
Sequence id="Seq 1"
    question reference: Q1: Do you like A or B?
    if Q1 = A then goto Seq 2a else goto Seq 2b
Sequence id="Seq 2a"
    question reference: Q2a: Why do you hate B?
    sequence reference: Seq 3
Sequence id="Seq 2b"
    question reference: Q2b: Why do you hate A?
    sequence reference: Seq 3
Sequence id="Seq 3"
    question reference: Q3: Wouldn't it be better if everyone got along?

In this minimal example, it is trivial to see that this survey will always end. What is important to note is that this can be rewritten as:

Instrument id="Main"
    sequence reference: Seq 1
    sequence reference: Seq 3
Sequence id="Seq 1"
    question reference: Q1: Do you like A or B?
    if Q1 = A then goto Seq 2a else goto Seq 2b
Sequence id="Seq 2a"
    question reference: Q2a: Why do you hate B?
Sequence id="Seq 2b"
    question reference: Q2b: Why do you hate A?
Sequence id="Seq 3"
    question reference: Q3: Wouldn't it be better if everyone got along?

As the user steps through the branch, after either of the sequences 2a or 2b ends, the survey steps ‘back’ from the inner sequence to Seq 1; not finding a following sibling for the If branch, it steps back again to the parent instrument, finds that Seq 1 has a following sibling, and then steps into Seq 3. What should be taken away from this example is that it is important, before finalising a DDI questionnaire, to understand the implied structure of the instrument and refactor it so that it is minimal and logically correct as well as having the structure required by the survey designer.
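The ‘step back’ traversal just described can be sketched on an illustrative nested layout of the rewritten instrument (the IfThenElse branch holding Seq 2b is omitted for brevity, and the data layout is not real DDI):

```python
# Resolved hierarchy of the rewritten instrument: Seq 3 is a sibling
# of Seq 1 under the instrument, not a child of the branches.
instrument = {
    "id": "Main",
    "children": [
        {"id": "Seq 1", "children": [
            {"id": "Seq 2a", "children": []},
            # Seq 2b sits in the other If branch; omitted here.
        ]},
        {"id": "Seq 3", "children": []},
    ],
}

def build_parents(node, parent=None, table=None):
    # Map each id to its parent node (ids are unique in this sketch).
    table = {} if table is None else table
    table[node["id"]] = parent
    for child in node["children"]:
        build_parents(child, node, table)
    return table

def next_after(root, finished_id):
    # Walk up from the finished sequence until an ancestor has an
    # unvisited following sibling; reaching the root ends the survey.
    parents = build_parents(root)
    node_id = finished_id
    while parents[node_id] is not None:
        parent = parents[node_id]
        siblings = [c["id"] for c in parent["children"]]
        i = siblings.index(node_id)
        if i + 1 < len(siblings):
            return siblings[i + 1]
        node_id = parent["id"]   # no following sibling: step up again
    return None                  # reached the root: survey is over

print(next_after(instrument, "Seq 2a"))  # Seq 3
print(next_after(instrument, "Seq 3"))   # None -> end of survey
```

Because Seq 3 is referenced only once, this upward walk is unambiguous, which is exactly what the rewrite buys us.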

In conclusion, it should now be a little clearer how to tame the flexibility that DDI allows when creating questionnaires and how to create logically correct survey instruments.

Next up… Questionnaire Design with DDI – Part 2: Where am I? – Examining why DDI has issues with non-predictability of movement through an instrument, and how to work around this.