Archive for the ‘Statistics’ Category

3 quick questions to identify children in financially risky families

  1. Do you live with mommy and daddy? 
  2. Does mommy or daddy smoke?
  3. Do you eat a lot of fruit or nuts at home?

The Australian Bureau of Statistics recently released data cubes on Household Expenditure, giving an insight into how Australians use their money. Of particular interest is a breakdown of average weekly expenditure for a variety of products, tabulated against the number of financial stressors. The ABS defines financial stressors as events such as a household being unable to pay bills or going without meals.

The reason this is interesting is that very few factors correlate as strikingly with the number of financial stresses within a household as those below:

Number of indicators of financial stress experienced
Risk factors                                                        0      1      2      3     4+
Tobacco products ($/week)                                        8.89  13.01  14.63  18.02  21.45
Newspapers ($/week)                                              3.17   2.80   2.64   1.47   1.48
Fruit and nuts ($/week)                                         14.06  12.81  11.41  10.62   7.98
One parent family with dependent children (%)                     1.9    4.7    8.3    9.8   19.3
All renters (%)                                                  18.6   28.3   34.0   39.3   54.4
Main source of income – Government pensions and allowances (%)   17.8   22.1   25.2   28.3   52.1

Note that weekly expenditure on newspapers and on fruit and nuts is negatively correlated with financial stress.
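
The direction of these relationships can be checked directly from the table. The following is a minimal sketch, assuming the "4+" column can be treated as exactly 4 stressors; that is a simplification of mine, not something the ABS data states. It correlates five group averages rather than individual households, so it only illustrates the sign and rough strength of each trend.

```python
from math import sqrt

# Number of financial stress indicators; "4+" is treated as 4 (an assumption).
stressors = [0, 1, 2, 3, 4]

# Average weekly expenditure ($/week), copied from the table above.
spending = {
    "Tobacco products": [8.89, 13.01, 14.63, 18.02, 21.45],
    "Newspapers": [3.17, 2.80, 2.64, 1.47, 1.48],
    "Fruit and nuts": [14.06, 12.81, 11.41, 10.62, 7.98],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


correlations = {name: pearson(stressors, values) for name, values in spending.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:+.2f}")
```

Tobacco comes out strongly positive, while newspapers and fruit and nuts come out strongly negative, matching the note above.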

The problem with some of these metrics, though, is that while they are all strongly correlated with financial stress, they may not all be obvious to children. For example, a child may not know about their parents’ income or whether they live in a rental property, and certain actions may be hidden from the child, such as a parent buying a newspaper on the way to work.

Other metrics, however, are more obvious for children to notice and report, such as who they live with, obvious activities of their parents (like smoking) and their own diets. This leads us to the three questions listed above:

  1. Do you live with mommy and daddy?
  2. Does mommy or daddy smoke?
  3. Do you eat a lot of fruit or nuts at home?

Now, while these questions may not account for same-sex couples, a child in a two-parent same-sex household would, given sufficient prompting, probably indicate that they had two parents. Furthermore, this is based on aggregate information, although there is a good chance unit records would back these correlations up. Lastly, this looks at correlation for risk factors and cannot be used to suggest causation. Together, however, these three questions can quickly give a strong indication of the risk of financial stress within a child’s household.

How statistics can overstate the risk of youth suicide.

Update: I have renamed this post, as it was pointed out that, in my attempt to keep the title short enough to fit on Twitter, the original came across as a little antagonistic. The original title is still visible in the URL so old links don’t break.

In the words of the new (and poorly named) “Soften the Fuck Up” campaign:

Suicide is the leading cause of death amongst young folks and most of them are blokes.

People have recently been regurgitating figures from the Australian Bureau of Statistics showing that the leading cause of death for males aged 15-45 is suicide. I was briefly taken aback and shocked at such a thought. I mean, I’m a male aged 15-45: is suicide in my near future? Then I came to the realisation: what else would the causes of death for someone my age be?

Combined deaths for males across selected (aggregated) causes (source ABS)

What the above graph shows is the combined number of deaths in each age bracket for the most common causes of death across the age range. From this a few things instantly stand out. Firstly, the number of suicides is relatively steady across people’s lifespans. What this indicates is that suicide isn’t a youth issue, it’s a people issue, but we will go into more depth on this later. The fact is that fewer young people die overall compared to older age groups, while the number of suicides stays relatively steady across the whole lifespan. So it is to be expected that suicide accounts for a larger share of deaths in younger and healthier demographics, because they have few other major causes of death.

Secondly, the number of deaths for young people is relatively low compared to older people. The leading causes of death for people aged 35 and older (heart disease and cancer) are almost non-existent in people under 35. When these common and natural causes of death are removed, there is little left to cause death in younger populations. The positive news is that only 60 people died from assault in 2009, but sadly such ‘good’ news is rarely newsworthy. Furthermore, when preparing this graph it became apparent that the leading cause of death for those aged 15-24 isn’t suicide, but traffic accidents. When car and motorcycle deaths are combined into a single figure, they outnumber suicides in the 15-24 age bracket for the same year.
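
The arithmetic behind this argument can be sketched with invented numbers. The counts below are purely hypothetical and are not ABS figures. They only show how a cause whose count barely changes with age ends up with the largest share of deaths in the brackets where little else causes death.

```python
# Hypothetical counts of deaths per age bracket, invented for illustration only.
# "steady" stands for a cause whose count barely changes with age;
# "age_related" stands for causes that grow sharply with age.
brackets = ["15-24", "25-34", "35-44", "45-54", "55-64"]
steady = [100, 100, 100, 100, 100]
age_related = [20, 50, 200, 800, 2500]


def share_of_deaths(cause_counts, other_counts):
    """Fraction of all deaths in each bracket attributable to one cause."""
    return [c / (c + o) for c, o in zip(cause_counts, other_counts)]


shares = share_of_deaths(steady, age_related)
for bracket, share in zip(brackets, shares):
    print(f"{bracket}: {share:.0%} of deaths")
```

The steady cause never changes in absolute terms, yet its share falls from the youngest bracket to the oldest simply because the other causes grow.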

Digging a little deeper, we can take a closer look at the relationship between suicide and age, and we come up with a graph like that shown below:

Suicide rates for males aged 15-65 for the years 2000-2009 (source ABS)

From this, we can see that suicide peaks at around forty, before tapering off again. Although this qualifies the earlier point that suicide is relatively steady across ages, we can still draw a quite positive message: over the last ten years, suicide has fallen for every group except those over the age of 55. Again, a far cry from the youth warning we are accustomed to.

Lastly, with suicide portrayed as a primarily young male problem, the comparison with women across the lifecycle warrants attention.

Suicide as a percentage of deaths (Source ABS)

As the graph shows, suicide as a percentage of deaths is closest to equal between men and women at young ages, with the disparity growing as people age.

The problem with people’s interpretation of these mortality statistics released by the ABS is that the figures are segregated by age, which complicates their interpretation. By aggregating them into larger blocks, without accounting for the natural underlying growth in death rates, the disparity in suicide rates between the age groups is almost hidden.


On a personal note, 200 males aged 14-24 died of ‘intentional self-harm’ in 2009, and I knew one of them. He was not a statistic, or a piece of data in a chart; he was a friend to many. What happened was a tragedy, but what I, and a lot of people, learned was that it was an unpredictable and unexpected event.

Suggesting that suicide or self-harm is a common or obvious event does a disservice to anyone who has been affected by suicide. We are fortunate enough to live in a society where suicide is relatively rare among all demographics. That is why it hurts so much when it touches us: because it is so uncommon.

I am not suggesting that we should not be vigilant with those close to us. Suicide can be prevented, but pulling out “scare statistics” that suggest that what happened should have been obvious does nothing to help those who are left. Suicide is not a subject that should be taken lightly, and baseless, uneducated statistics have the potential to hurt a lot more than the event itself.

420 convert classifications everyday

With the recent release of the new Australian Standard Classification of Drugs of Concern from the ABS, there was an opportunity to field test the Virgil CSV to DDI converter with real data to see how it held up. Fortunately, the classification was released as an Excel data cube that conformed almost entirely with the structures that Virgil supports. After a little cleaning of the CSV, it ran through the converter with few issues at all. Incidentally, the most significant error highlighted a massive oversight: the converter failed to add values for the codes! However, this has been corrected and the changes have been pushed to svn, and a new version of the Windows tool will be pushed out this weekend.

A screen shot of Virgil with the converted classification

Opening the newly created DDI file in the Virgil DDI CodeList Editor was another story, and pointed out a few flaws in how it handles empty data. Because the structure from the Excel file contained no descriptions for any category and no labels for the CodeScheme, a few small corrections were needed to accommodate freshly created DDI, but many of these problems will be ironed out by the time the CodeList Editor is available for download.

While the converter hasn’t been fully integrated into the CodeList Editor, it will shortly be possible to create a single DDI file and import numerous CSV files to create a series of classificatory codelists in a single package. A practical and soon to be realised example would be the Australian Standard Classification of Drugs of Concern with the lists of drugs of concern, forms of drug and methods of consumption codelists all contained in a single machine processable DDI package.
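
As a rough illustration of the idea of importing several CSV files into one package, here is a minimal sketch. It is not Virgil's code and it does not produce valid DDI: the element names (`CodeListPackage`, `CodeList`, `Code`, `Value`, `Label`), the list names and the CSV rows are all hypothetical placeholders chosen for readability.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Placeholder CSV content; real input would come from the cleaned spreadsheet.
csv_sources = {
    "example_list_a": "code,label\n1,First category\n2,Second category\n",
    "example_list_b": "code,label\nA,Another category\n",
}


def build_package(sources):
    """Combine several code,label CSVs into one XML tree.

    The element names used here are hypothetical and are NOT the DDI schema.
    """
    package = ET.Element("CodeListPackage")
    for list_name, text in sources.items():
        code_list = ET.SubElement(package, "CodeList", name=list_name)
        for row in csv.DictReader(io.StringIO(text)):
            code = ET.SubElement(code_list, "Code")
            # Keep the code value itself as well as its human-readable label.
            ET.SubElement(code, "Value").text = row["code"]
            ET.SubElement(code, "Label").text = row["label"]
    return package


package = build_package(csv_sources)
print(ET.tostring(package, encoding="unicode"))
```

Writing the `Value` element explicitly is the part that mirrors the oversight mentioned earlier: a codelist with labels but no code values is of little use to a machine.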

For those who haven’t been able to download or run the converter, the output from this example is available for testing.

Questionnaire design with DDI – Part 5: Can it be done?

This is the fifth in a 5 part series on working with questionnaires and surveys managed using the Data Documentation Initiative XML standard. With DDI being an emerging technology, it is important that users are provided with best practices to ensure that they use the standard in a way that is logical, coherent and, most importantly, usable and reusable. This series of tutorials and discussions is aimed at users who have some knowledge of DDI and would like to know how to effectively design and mark up existing and future questionnaires in DDI.

Part 5: Can it be done?

On the eve of IASSIST 2011, we are going to look back at the last few tutorials and ask the big question: Can DDI 3.1 be used as a viable format for the creation of online survey instruments? The simple answer is yes, but there are a few caveats.

Caveat 1: Only use the transformation of a finalised survey

This is a very basic idea, but one that should be reiterated. When working to create a web survey, there will more than likely be an iterative process: design DDI, transform, examine form, refine DDI, and repeat until done. This is a necessary part of development, as no one ever gets it right the first time. However, once a form has been transformed there should be organisational sign-off to ensure that the form is never changed again. While the DDI will be versioned, meaning the original DDI that the form is based on will still be there, altering and overriding an in-production form can only lead to data issues.

Caveat 2: Transforming DDI into an e-form is a destructive transformation

InstrumentML, as presented in Part 2, is not, and will never be, a part of the DDI 3.1 specification. It is a way to transform the logical structure of DDI into a form easily usable in certain situations. InstrumentML is a way of caching the implied instrument flow to make machine processing easier. Likewise, the transformation from DDI to e-form is done to present the information in the DDI in a way that is easier for users to understand.

Caveat 3: Any changes to a HTML instance of a form will not be pushed back to DDI

It is a fact that no transformation of a logical structure into a visual page will be perfect. So it is reasonable to assume that web pages created from DDI may need to be altered. Perhaps a list of response options works better as radio buttons than as a drop-down list, a line break is needed to alter word flow, or the dimensions of a text box need to be precise. These types of issues will arise, and there are 3 cascading ways of resolving them:

  1. Alter the DDI to match the expected output,
  2. Alter the CSS to alter the presentation,
  3. If neither of the above produces the needed results, then and only then edit the generated HTML.

With DDI allowing XHTML tags in most labels to allow semantic markup, altering the DDI is the first and most appropriate way to alter pages. If what needs to be changed is wording, this is also the only place to edit.

An example of where editing the HTML is needed is a question based on age where we are measuring labour figures. This is a numeric question, but in this case we have decided that the populations below 18 and over 65 will be too small, so we are going to code everything outside this range to the codes ‘<18’ and ‘>65’. So we have a maximum and a minimum, and their codes, but on advice from a survey methodologist we won’t be restricting users from entering figures outside this range.
However, our pre-generated HTML (based on the suggestions in Part 4) will bring these limits across, so in this case we can edit the question to move the minimum from 18 to 0, as age cannot be negative, and remove the upper restriction.
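
The separation between what the form accepts and what gets stored can be sketched as follows. This is only an illustration of the age question described above; the function name `recode_age` is hypothetical and nothing like it is claimed to exist in Ramona.

```python
def recode_age(age: int) -> str:
    """Map a raw age to the code stored for analysis.

    The form itself accepts any non-negative age; only the stored code
    collapses the small populations below 18 and above 65.
    """
    if age < 0:
        raise ValueError("age cannot be negative")
    if age < 18:
        return "<18"
    if age > 65:
        return ">65"
    return str(age)


for raw in (5, 18, 40, 65, 90):
    print(raw, "->", recode_age(raw))
```

Keeping the recoding out of the input validation is what lets the respondent enter their real age while the dataset still only carries the agreed codes.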

Caveat 4: Final transformations should be packageable to be stored with the DDI

If a DDI instrument was transformed into a PDF format for printing, no good archivist would think twice about storing a copy of this as a manifestation of the instrument. Likewise, when transforming from DDI to other formats to ease machine use, keeping a copy is paramount. Software will change, and how one version of a tool transforms the DDI may not always be the same as how another transforms it.

In the instance of Ramona, a package consists of the HTML used to semantically mark up the questions, the InstrumentML to contain the logical flow and the CSS of the presentation layer. By storing this information, examining the survey at a later date becomes easier for researchers who may no longer have access to the software that did the original transformation.

What is also important is making sure any valid alterations to a form, like those described in the previous caveat, are stored as a part of this package.
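
One simple way to keep these pieces together is a single archive per transformed instrument. The sketch below assumes a zip file is an acceptable container; the file names and placeholder contents are hypothetical, and this is not a description of how Ramona actually stores its packages.

```python
import io
import zipfile

# Hypothetical file names and placeholder contents for one transformed instrument.
package_files = {
    "form.html": "<form><!-- generated question markup --></form>",
    "instrument.xml": "<instrument><!-- cached logical flow --></instrument>",
    "style.css": "/* presentation layer */",
    # A record of manual alterations made after the transformation.
    "CHANGES.txt": "Age question: minimum moved from 18 to 0, maximum removed.",
}


def build_archive(files):
    """Write every file into a single in-memory zip archive and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


archive_bytes = build_archive(package_files)
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    stored_names = archive.namelist()
print(sorted(stored_names))
```

Storing the change record alongside the generated files means a later reader can tell which parts came straight from the DDI and which were edited by hand.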

What we have is proof that it can be done, but it’s not done yet

As a closing point, people may be wondering if they can see the code that a lot of this information is based on. At this stage I’m not planning on releasing the code for Ramona just yet, mostly because it’s not very good. These tutorials are based on things I learned as I went about trying to create a DDI Web-form viewer, and as such it is less production code, and more a bloody narrative showing my battles and trials against what can be a very tough and complex standard.

Over the next few weeks I’ll be stripping the code down, separating the wheat from the chaff and rewriting large sections of it, and with luck by the time EDDI2012 rolls around there should be a fully functional DDI eforms tool ready to roll out into production.

This brings to a close this series of tutorials dealing with online survey development in DDI. In the coming weeks I’ll also start putting together a short introduction to DDI for new users, and spending the bulk of my free time actually writing code, rather than writing about writing code.

And lastly, enjoy IASSIST 2011 everyone!

Questionnaire design with DDI – Part 2: Where am I and where do I go next?

This is the second in a 5 part series on working with questionnaires and surveys managed using the Data Documentation Initiative XML standard. With DDI being an emerging technology, it is important that users are provided with best practices to ensure that they use the standard in a way that is logical, coherent and, most importantly, usable and reusable. This series of tutorials and discussions is aimed at users who have some knowledge of DDI and would like to know how to effectively design and mark up existing and future questionnaires in DDI.

Part 2 : Where am I and where do I go next?

In the previous post on questionnaires using DDI we looked at how it is possible in DDI to repeat sequences of questions, potentially leading to endless loops of questions within a survey. In a paper survey this would be quickly picked up, but in an electronic survey this might not always be the case. To help identify these issues, a tool called Sheri was introduced that consumes a DDI instrument and checks for potential issues of repeated questions and endless loops. Sheri is also able to generate a non-standard XML format representing an entire survey marked up in DDI. But this leads to the question: “if one were to design a survey in DDI, why introduce another XML format?”

During research into DDI as a data capture standard it quickly became apparent that, for all of its benefits in designing surveys, DDI was very difficult for programs to consume to create instances of these instruments. This is due to how DDI describes an Instrument. In DDI a whole survey instrument is represented by a single tag, Instrument, which in turn references a single Sequence within a ControlStructureScheme. This is simple enough to understand, but once the first Sequence is found the structure becomes quite interesting. Every Sequence, or other ControlConstruct, maintains references to its child sequences. For example, a Sequence can reference multiple other Sequences, which reference more Sequences and so on. Likewise, a Loop maintains a list of linked ControlStructures to loop over, and conditional IfThenElse tags contain references in both their Then and Else clauses that refer to other ControlStructures. What this means is that although there is an implied hierarchy, it is stored as a flat list of data structures with references between them.

This leads to difficulty in determining where exactly in the hierarchy one is when given only the id of the current structure, especially when dealing with the stateless nature of the web. For example, the test software I was writing, Ramona, would take a sequence id as an argument and render the corresponding DDI sequence as a form on a webpage, with web controls for each question. What quickly became an issue was determining what the next page to display was when a user was done with a sequence.

In a DDI ControlStructure, the ID of a child element is stored as the text of an ID element under a ControlConstructReference. What this means is that to determine the next possible sequence given a sequence id, you need to look for references to the object, not the object itself. Then, from the reference, search back for an appropriate ancestor element and return its next sibling to show the correct form. This quickly becomes complicated, and when doing plain text searches through large DDI test files with an unmanaged * hierarchy it became apparent that this method was too slow and complex for real-time returns.

A further complication arises if a survey reuses an object using multiple references (as opposed to using Loops). In cases such as this, the above method cannot easily determine which is the correct parent. The reason is that a reverse lookup of referring objects, for an object referenced multiple times, will only return a list of referring objects with no context about which one we need to follow.

For example, the following set of sequences:

Sequence id="Seq 1"
    sequence reference: Seq 3
Sequence id="Seq 2"
    sequence reference: Seq 3
Sequence id="Seq 3"
    question reference: Q1: Where did I come from?

resolves to a structure like:

Sequence id="Seq 1"
    Sequence id="Seq 3"
        question reference: Q1: Where did I come from?
Sequence id="Seq 2"
    Sequence id="Seq 3"
        question reference: Q1: Where did I come from?

But given just the id of Sequence 3, we can’t determine whether, after answering the question, the survey is over (if we arrived at Sequence 3 from Sequence 2) or whether we still have to go to Sequence 2.
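
The ambiguity can be shown with a small sketch of the flat, reference-based storage. The dictionary layout below is my own simplification of the example above, not DDI's actual XML structure.

```python
# Flat store mirroring the example above: each sequence lists the ids it references.
sequences = {
    "Seq 1": ["Seq 3"],
    "Seq 2": ["Seq 3"],
    "Seq 3": ["Q1"],
}


def referrers(target_id, store):
    """Return the ids of every sequence that holds a reference to target_id."""
    return [seq_id for seq_id, children in store.items() if target_id in children]


parents = referrers("Seq 3", sequences)
print(parents)  # ['Seq 1', 'Seq 2']
```

The reverse lookup returns both referring sequences, and nothing in the result says which of them the respondent actually came from.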

To solve these issues of speed, development effort and determining location, it is therefore necessary for applications using DDI to “pre-compile” Instruments into a traditional hierarchical form. In both Ramona and Sheri, the solution is to resolve the references and copy the referenced elements into the parent structure. This allows much of the DDI metadata to be retained and used.
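
A minimal sketch of this kind of dereferencing is shown below. It is not the code used in Ramona or Sheri; it works on a simplified dictionary of ids rather than DDI XML, and the guard against self-referencing chains is an addition of mine so that the recursion always terminates.

```python
# Flat store: each sequence lists the ids it references (simplified, not DDI XML).
sequences = {
    "Seq 1": ["Seq 3"],
    "Seq 2": ["Seq 3"],
    "Seq 3": ["Q1"],
}


def resolve(node_id, store, ancestors=()):
    """Return a nested copy of node_id with every reference replaced by its target."""
    if node_id in ancestors:
        raise ValueError(f"endless loop: {node_id} is its own ancestor")
    if node_id not in store:
        # Anything that is not a sequence (here, a question) is a leaf.
        return {"id": node_id, "children": []}
    children = [resolve(child, store, ancestors + (node_id,)) for child in store[node_id]]
    return {"id": node_id, "children": children}


tree = [resolve(top, sequences) for top in ("Seq 1", "Seq 2")]
print(tree)
```

Each call builds fresh dictionaries, so Seq 3 ends up as two separate copies, one under each parent, which is exactly the duplication described in the resolved listing above.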

This is not valid DDI, and it is unlikely that it ever will be.

This is also not a problem. DDI is useful as an archival and transportation language for statistical metadata, and the flexibility that the current structure provides is quite useful. However, when looking at data collection, it can be safely assumed that the DDI Instrument would be relatively stable. If best practices are followed, once a DDI Instance is published it will never change. Thus it is a perfectly valid action to transform an Instrument in this way, as long as two conditions are met: this is seen as a one-way destructive transformation, and that the resulting pseudo-DDI instrument is never changed. To provide another example of why this is a normal situation, there are tools that are being developed to transform DDI into PDF questionnaires: this is very much the same process, the PDF is seen as a projection of the original DDI to make it easier for people to use, but not the actual ‘source-of-truth’. Transforming DDI Instruments into a dereferenced pseudo-DDI Instrument is exactly the same, a transform of complex metadata into a form that machines can easily work with.

There is one issue that these pseudo-DDI Instruments will have: when a single data structure is referenced multiple times, it will occur multiple times in the resultant tree. In cases like this, it is still difficult to determine which element is the correct one when given just an ID. There are two possible solutions to this issue. The first is that, instead of managing state based on a single ID, it is managed as the full XPath of the element, possibly speeding up traversal, but also presenting the possibility that the structure of the form could be shared with users, which may or may not be a security issue depending on the form. Alternatively, as discussed in part one of these tutorials, Instruments can be restructured so that no structure is referenced more than once, making it easier to traverse the form as well as limiting user frustration.
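
The first option can be sketched as follows. Slash-joined ids stand in for a real XPath expression here, and the wrapping "Instrument" node is an assumption added so the tree has a single root; neither is taken from Ramona.

```python
# Dereferenced tree for the earlier example, under an assumed single root.
tree = {
    "id": "Instrument",
    "children": [
        {"id": "Seq 1", "children": [
            {"id": "Seq 3", "children": [{"id": "Q1", "children": []}]},
        ]},
        {"id": "Seq 2", "children": [
            {"id": "Seq 3", "children": [{"id": "Q1", "children": []}]},
        ]},
    ],
}


def paths_to(target_id, node, prefix=()):
    """Yield the full path (a tuple of ids) to every occurrence of target_id."""
    path = prefix + (node["id"],)
    if node["id"] == target_id:
        yield path
    for child in node["children"]:
        yield from paths_to(target_id, child, path)


occurrences = list(paths_to("Seq 3", tree))
for path in occurrences:
    print("/".join(path))
```

The bare id matches two places, but each full path is unique, so a path carried between requests identifies exactly one position in the form.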

It should be noted that restructuring is not a necessity in making DDI Instruments processable by software, just something that makes them easier to use with traditional methods and existing XML libraries, and that can improve execution times if that is an issue. However, when dealing with desktop software, many of the issues of web development, such as stateless vs. stateful systems or scalability with concurrent users, cease to be an issue. In such situations, it is quite possible to work with the DDI Instrument directly, without manipulating the data structure, and in some cases this may be preferable.

In conclusion, how strictly we manage the DDI metadata structure depends very strongly on the role the metadata plays in a system. In some cases, such as computer-aided interviewing where the instrument should be extremely stable, transforming DDI into a format that is more easily processed by systems, or even users, can be preferable to using plain DDI. What is important is to focus on DDI as a tool for increasing transparency and reusability in statistical processing. As long as the methods used to transform DDI into intermediate forms are well documented, there is no reason why this cannot be done, as the use of well-documented transformations would not violate existing best practice.

Next up… Questionnaire Design with DDI – Part 3: What am I doing here? - A look at best practices for what control structures to include in sequences and how to deal with logical structures and questions.

* Unmanaged in the sense that the hierarchy is stored as references between XML elements, and not as a traditional XML hierarchy, and as such traditional tree traversal methods for XML cannot be used.

Questionnaire design with DDI – Part 1: Will this survey ever end?

This is the first in a 5 part series on working with questionnaires and surveys managed using the Data Documentation Initiative XML standard. With DDI being an emerging technology, it is important that users are provided with best practices to ensure that they use the standard in a way that is logical, coherent and, most importantly, usable and reusable. This series of tutorials and discussions is aimed at users who have some knowledge of DDI and would like to know how to effectively design and mark up existing and future questionnaires in DDI.

Part 1: Will this survey ever end?

This is a classic question that survey takers often ask, and not usually that politely. When a survey will end is an important question for users and designers alike, as the time users have to spend filling out a form can impact their likelihood of finishing or even starting a survey. So determining if a survey will end is of vital importance for users of DDI.
Unfortunately, the answer, based on simple analysis, is that you can never tell. DDI takes a distinctive approach, using complex referencing between sequencing objects to build up the structure of a questionnaire in a way that is highly reusable. Take for example the following pseudo-survey:

Survey: All about You!
    Part 1: Feelings
        Q1: Are you feeling happy today?
        Q2: How do warm summer days make you feel?
    Part 2: Favourites
        Q3: What is your favourite ice-cream?
        Q4: What is your favourite animal?

Now, restructuring this in a way more analogous to DDI gives:

Sequence id="Part 1" title="Feelings"
    question reference: Q1: Are you feeling happy today?
    question reference: Q2: How do warm summer days make you feel?
Sequence id="Part 2" title="Favourites"
    question reference: Q3: What is your favourite ice-cream?
    question reference: Q4: What is your favourite animal?
Instrument id="survey" title="All about You!"
    sequence reference: Part 1
    sequence reference: Part 2

The important thing to notice is that Part 1, Part 2 and the main survey are all the same structure: a DDI Sequence. In DDI, to increase reusability, sequences can be included by reference within each other. However, an issue arises when a designer (either a person or a piece of software) fails to account for this, sending the survey taker into an endless loop. For example, a different survey structure written in the same DDI-like form gives:

Sequence id="Verse 1"
    question reference: Q2n-1: Is this the song that never ends?
    sequence reference: Verse 2
Sequence id="Verse 2"
    question reference: Q2n: Does it go on and on my friends?
    sequence reference: Verse 1
# Apologies to Lamb Chop

An even shorter (but much less likely) example would be:

Sequence id="infinity"
    question reference: Qn: Do you believe how vastly, hugely, mind-bogglingly big Infinity is?
    sequence reference: infinity
# Apologies to Douglas Adams

What these two examples show is how it is possible to send a user through the same sequence twice, potentially leading to an endless loop of questioning. With DDI providing appropriate mechanisms for Looping over questions, it is quite likely that this kind of structure will always be the result of a mistake. However, what it does highlight is how the flexibility DDI provides to allow users to reuse metadata can, when used inappropriately, cause issues.
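
Loops of this kind can be found mechanically. The following is a minimal sketch using a depth-first search over a simplified dictionary of sequence references. It is an illustration of the general technique rather than the check used by any particular DDI tool, and it ignores conditional branches and DDI Loops entirely.

```python
# Sequence id -> ids of the sequences it references (questions are left out).
looping = {"Verse 1": ["Verse 2"], "Verse 2": ["Verse 1"]}
self_referencing = {"infinity": ["infinity"]}
finite = {"survey": ["Part 1", "Part 2"], "Part 1": [], "Part 2": []}


def has_endless_loop(start, references):
    """Depth-first search reporting whether any path from start revisits a sequence."""
    def visit(seq_id, on_path):
        if seq_id in on_path:
            return True
        return any(
            visit(child, on_path | {seq_id}) for child in references.get(seq_id, [])
        )
    return visit(start, frozenset())


print(has_endless_loop("Verse 1", looping))            # True
print(has_endless_loop("infinity", self_referencing))  # True
print(has_endless_loop("survey", finite))              # False
```
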
Fortunately, as part of the research in developing a DDI web application for data collection, it was necessary to create a module that was able to determine the implied structure and possible ending of a DDI Questionnaire. A web-service for this module (aka Sheri) is available online at http://sandbox.kidstrythisathome.com/dditools/sheri.
One of the notable features of Sheri is that it will create a hierarchical, non-DDI XML serialisation of a given questionnaire (the reasons for this will be covered in Part 2 of this series, “Where am I?”). Along with this, it will check that no two sequences have references to the same sequence, to confirm that the questionnaire will halt. However, the ‘halting’ of the survey rests on the provision that any DDI Loops within the survey are also able to end, and determining that is a true ‘halting problem’.
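
The duplicate-reference check described above might look something like the sketch below. This is my reading of the description, not Sheri's source code, and the sequence names are hypothetical.

```python
from collections import Counter

# Sequence id -> ids of the sequences it references (hypothetical instrument).
references = {
    "Intro": ["Branch A", "Branch B"],
    "Branch A": ["Wrap-up"],
    "Branch B": ["Wrap-up"],
    "Wrap-up": [],
}


def referenced_more_than_once(refs):
    """Return the ids of sequences that are the target of two or more references."""
    counts = Counter(target for targets in refs.values() for target in targets)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


duplicates = referenced_more_than_once(references)
print(duplicates)  # ['Wrap-up']
```
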
Because an error is thrown whenever a sequence is referenced twice, some may see an issue: an instrument in which two sequences in different branches link to a third sequence would be rejected by the system even though it contains no endless loop. However, it is possible to rewrite any instrument that relies on multiple references to a sequence to ensure convergence of a survey in a way that eliminates the duplicated references. For example:

Instrument id="Main"
    sequence reference: Seq 1
Sequence id="Seq 1"
    question reference: Q1: Do you like A or B?
    if Q1 = A then goto sequence 2a else goto sequence 2b
Sequence id="Seq 2a"
    question reference: Q2a: Why do you hate B?
    sequence reference: Seq 3
Sequence id="Seq 2b"
    question reference: Q2b: Why do you hate A?
    sequence reference: Seq 3
Sequence id="Seq 3"
    question reference: Q3: Wouldn't it be better if every one got along?

In this minimal example, it is trivial to see that this survey will always end. What is important to note is that this can be rewritten as:

Instrument id="Main"
    sequence reference: Seq 1
    sequence reference: Seq 3
Sequence id="Seq 1"
    question reference: Q1: Do you like A or B?
    if Q1 = A then goto sequence 2a else goto sequence 2b
Sequence id="Seq 2a"
    question reference: Q2a: Why do you hate B?
Sequence id="Seq 2b"
    question reference: Q2b: Why do you hate A?
Sequence id="Seq 3"
    question reference: Q3: Wouldn't it be better if every one got along?

As the user steps through the branch, after either of the sequences 2a or 2b ends, the survey steps ‘back’ from the inner sequence to Seq 1. Not finding a following sibling for the If branch, it steps back again to the parent instrument, finds that Seq 1 has a following sibling and then steps into Seq 3.

What should be taken away from this example is that it is important, before finalising a DDI questionnaire, to understand the implied structure of the instrument and refactor it to ensure that it is minimal and logically correct, as well as having the structure required by the survey designer.
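
The stepping behaviour can be sketched with a recursive walk over the rewritten instrument. The nested tuple layout is my own shorthand, and the respondent's answers are supplied up front for simplicity, which a real form would not do.

```python
# The rewritten instrument as nested nodes; the branch is an explicit "if" node.
instrument = [
    ("sequence", "Seq 1", [
        ("question", "Q1"),
        ("if", "Q1", "A",
         [("sequence", "Seq 2a", [("question", "Q2a")])],
         [("sequence", "Seq 2b", [("question", "Q2b")])]),
    ]),
    ("sequence", "Seq 3", [("question", "Q3")]),
]


def walk(nodes, answers):
    """Yield question ids in the order a respondent would see them."""
    for node in nodes:
        kind = node[0]
        if kind == "question":
            yield node[1]
        elif kind == "sequence":
            # Returning from this call is the 'step back' to the parent.
            yield from walk(node[2], answers)
        elif kind == "if":
            _, question, expected, then_branch, else_branch = node
            branch = then_branch if answers.get(question) == expected else else_branch
            yield from walk(branch, answers)


print(list(walk(instrument, {"Q1": "A"})))  # ['Q1', 'Q2a', 'Q3']
print(list(walk(instrument, {"Q1": "B"})))  # ['Q1', 'Q2b', 'Q3']
```

Both branches converge on Q3 even though Seq 3 is referenced only once, from the instrument itself.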

In conclusion, it should now be a little clearer how to tame the flexibility that DDI allows when creating questionnaires, and how to create logically correct survey instruments.

Next up… Questionnaire Design with DDI – Part 2: Where am I?- Examining why DDI has issues with non-predictability of movement through an instrument, and how to work around this.

63 years of Australian CPI data

The latest Australian CPI figures were released last week by the Australian Bureau of Statistics.

Fortunately, each release includes re-weighted indices for each indicator. These cover 11 major topics, including food, clothing, housing and education. Unfortunately, the Excel spreadsheets these are distributed in aren’t the easiest format to process data from. This is because exporting data from Excel can be time consuming, and the data as it is stored in Excel neglects the hierarchical structure of the CPI indicators.

To help change this, I wrote a few scripts (and lovingly hand-crafted some XML) to help transform the CPI from Excel into a DSPL dataset, and have uploaded this into the Google Public Data Explorer.

The dataset that has been uploaded is based on Table 12 from the Downloads section of the latest CPI page. This dataset includes the indices for each capital city, and Australia, for each level of indicator: from the broad total CPI to, for example, the more specific “Food”, “Bread and Cereals” and finally “Breads”. There are 12 broad topics covered in total, including a miscellaneous group of indices that exclude some of the 11 topics, and there are 144 topics at the finest detail.
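
One common way to keep a hierarchy like this in flat, tabular data is to give each category a pointer to its parent. The sketch below uses the example chain named above; the ids are hypothetical, the assumption that “Food” sits directly under the total CPI is mine, and this is not a description of the DSPL format itself.

```python
# Parent pointers for the example chain; ids are hypothetical.
categories = {
    "total": {"name": "Total CPI", "parent": None},
    "food": {"name": "Food", "parent": "total"},
    "bread_cereals": {"name": "Bread and Cereals", "parent": "food"},
    "breads": {"name": "Breads", "parent": "bread_cereals"},
}


def lineage(category_id, table):
    """Return the category names from the broadest level down to category_id."""
    names = []
    while category_id is not None:
        entry = table[category_id]
        names.append(entry["name"])
        category_id = entry["parent"]
    return list(reversed(names))


path = lineage("breads", categories)
print(" > ".join(path))  # Total CPI > Food > Bread and Cereals > Breads
```
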

The end result is something like this:

If you want to play with the whole dataset, it is available on the Google Public Data Explorer, or if you would like to download the full DSPL dataset, that is available on the DSPL-R downloads page.

Probably one of the more interesting parts was how to create the hierarchical CPI indicators category in DSPL, but I’ll be following this post up later in the week with a tutorial on how to work with complex datasets.
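As a rough idea of what preserving the hierarchy involves, here is a minimal sketch that flattens a nested set of indicators into a table where each row points at its parent. It is not the script used for the uploaded dataset: the `id`, `name` and `parent` column names and the `slug` helper are assumptions made for illustration, and the exact columns DSPL expects should be checked against its documentation. The indicator names are the ones quoted above.

```python
import csv
import io

# A small fragment of the indicator hierarchy, using the names quoted in the post.
HIERARCHY = {
    "Total CPI": {
        "Food": {
            "Bread and Cereals": {
                "Breads": {},
            },
        },
    },
}


def slug(name):
    """Turn a display name into a simple identifier (hypothetical naming scheme)."""
    return name.lower().replace(" ", "_")


def flatten(tree, parent_id=""):
    """Yield one row per indicator, each carrying the id of its parent."""
    for name, children in tree.items():
        node_id = slug(name)
        yield {"id": node_id, "name": name, "parent": parent_id}
        yield from flatten(children, node_id)


def to_csv(tree):
    """Write the flattened hierarchy as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["id", "name", "parent"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(flatten(tree))
    return buf.getvalue()
```

Keeping the parent as a column means the flat CSV still carries the structure that the Excel layout loses, so a tool reading it can rebuild the tree.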

Update: With the help of a kind statistician from Google the dataset is now much better structured. The updated dataset is available here: http://code.google.com/p/dspl-r/downloads/list

Using metadata within statistical software

Today marks the beta release of a package I am writing for the R statistical environment, intended to make it easier for researchers to utilise metadata within R and to make it more worthwhile for statisticians to provide metadata.

Most of the methods R offers for importing data bring in only undocumented data; in fact, one of the most common ways to import data is through raw CSVs. However, with the release of DSPL.R it is now possible to browse the metadata of a dataset from within the statistical package.

For example, the following is the output for the US Retail Sales dataset provided by Google:

> print (prep.dspl("~/example/census-retail-sales.zip"))
DSPL Dataset - For more info see: [www.kidstrythisathome.com/dspl.r]
------------                  or: [code.google.com/apis/publicdata/]

Name : Retail Sales in the U.S.
Description : Monthly Retail Trade and Food Services report
            for the United States. This dataset was prepared by Google based
            on data downloaded from the U.S. Census Bureau.
Concepts : 3  -  Type of business, Seasonality, Retail Sales Volume
Slices   : 1  -  retail_sales_business
Tables   : 3  -  businesses, seasonalities, retail_sales_business_tbl
Topics   : 3  -  Industry, Business, Gender

As this example shows, a user is able to load a new dataset and get an immediate sense of what it contains. Because users can understand the meaning behind a dataset without leaving the statistical environment, they can work seamlessly with their data and metadata within the same interface.
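To give a feel for the kind of summary shown above, here is a rough, language-neutral sketch in Python rather than R. It is not how DSPL.R works internally. It assumes a dataset is a zip archive holding one XML metadata file whose top-level sections are named `concepts`, `slices`, `tables` and `topics`, with an `id` attribute on each entry; treat those names as assumptions to verify against the DSPL documentation. The sample XML and its namespace are made up so that the example is self-contained.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Made-up stand-in for a dataset's metadata file (not a real DSPL file).
SAMPLE_XML = """<dspl xmlns="http://example.org/made-up-namespace">
  <concepts><concept id="business"/><concept id="sales"/></concepts>
  <slices><slice id="retail_sales_business"/></slices>
  <tables><table id="businesses"/></tables>
</dspl>"""

SECTIONS = ("concepts", "slices", "tables", "topics")  # assumed section names


def make_zip(xml_text):
    """Pack the sample metadata into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("dataset.xml", xml_text)
    return buf.getvalue()


def local(tag):
    """Drop the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def summarise(zip_bytes):
    """List the ids found under each known metadata section."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        xml_name = next(n for n in zf.namelist() if n.endswith(".xml"))
        root = ET.fromstring(zf.read(xml_name))
    summary = {}
    for section in root:
        name = local(section.tag)
        if name in SECTIONS:
            summary[name] = [child.get("id") for child in section]
    return summary
```

A summary like this is what lets a user see what a dataset contains before touching any of the data itself.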

While DSPL is seen as a newcomer to the statistical world, and R is perceived (albeit wrongly) to be inferior to more established commercial statistical tools, the agility of R and the brevity of the DSPL standard are a strong indicator of how, given time, statistical metadata could become an integral part of all statistical processes.

Those who survived count themselves lucky, but they will not count themselves.

The following is a letter I wrote with the intention of forwarding it to the heads of Statistics New Zealand and those behind the decision to cancel the 2011 New Zealand Census. However, I think it rings true for all concerned about the future of the census in all nations.

————————–

The earthquakes that recently devastated Christchurch were grave, terrible events and while New Zealand and its close neighbour Australia mourn for the loss of life, I implore you: halting the census will not ease this grief. In fact, it will have a much graver impact on the statistical world, outside our small Pacific community.

I do not need to educate any of the intended recipients of this letter about the role of the census, but for completeness I feel it is necessary to remind all how vital the census is to the community.


The data from a census provide nations with the ability to better understand their past and present and dictate their future. Each census is a fully comprehensive snapshot of the entire population; by comparing them, we reflect on how nations change over time. Studying how demographics evolve, we as a society are able to create public policy that ensures all citizens are represented and provided the resources they need to thrive. Census information helps to create fair electoral boundaries, encourages the creation of schools, hospitals and roads where they are needed by the community, and even assists medical research. There is a reason censuses have been going on for thousands of years, and why the establishment of a census bureau is one of the first actions of a just and educated nation.


While the role of these agencies may have been clear at first, over time citizens are forgetting how the census benefits them more than it benefits the state. Both national censuses and the agencies that run them are already under threat from many parties. There are those who seek to cripple the census, either to superficially save tax dollars or to skew the demographic information in favour of their positions. The government of the Right Honourable Stephen Harper, Prime Minister of Canada, recently made changes to the 2011 Canadian Census to make portions of the census voluntary. This move was surrounded by controversy and outcry from the statistics community. Much of this controversy focused on the methodological problems a voluntary ‘census’ would cause for the data and how this would impact the community, as well as on the move being a direct challenge by the conservative government to the agency’s independence. Ultimately, this led to the resignation of Mr. Munir Sheikh, then Chief Statistician of Canada, as an act of protest regarding the autonomy of Statistics Canada.

The actions of those who wish to harm our profession are not confined to Canada. In the United States of America, the 2010 Census came under fire from political conservatives over its relevance, including calls from influential political icons for boycotts of the census. These critiques ultimately ended in the removal of the “long form”, which may lead to immeasurable problems around the reduced data quality of the census. Likewise, in the past decade, the Office for National Statistics in the United Kingdom has faced similar issues over its autonomy and independence from government, leading community leaders to question its impartiality as a statistics provider.


The 2011 New Zealand Census will be the first census cancelled by New Zealand since World War Two, and joins a very short list of censuses cancelled by a Commonwealth nation in that time. While cancelling the 2011 census is an act of compassion, we run the risk that those who wish to call for an end to the census will use this tragedy as an example of why the census is no longer a required instrument of a just and democratic society. I assure you that the census is no less vital today than it was in 1911 when the Australian public was first counted, and it will still be relevant in 2051 when New Zealand marks 200 years of its census.

The statistics bureaus of Australia and New Zealand hold a special position amongst Commonwealth and anglophone countries, as two of the most autonomous official statistical agencies, but this comes at a price; we should look upon ourselves as role models in the statistical community and go as far as it takes to ensure our independence and relevance, but above all the unquestionable quality of our data. While there are deep emotional reasons for cancelling the census, based purely on respect for those most hurt by the tragic events of the Christchurch earthquakes, sometimes we must look at the rational reasons to push ahead with painful policies.

Right now Christchurch is in mourning and recovery, but soon it will be time to rebuild your great city. The reconstruction works will employ many of those put out of work by the disasters due to the loss of lives, homes or businesses, but will not employ them all. By running the census you are able to employ additional thousands to prepare for its great counting, thousands who may not otherwise have been employed. These will be temporary positions that will allow people to immediately begin rebuilding their lives and city. Rather than just receiving government benefits to help the rebuilding, you will be employing people in stable short-term work that starts them on the road to normal lives again, work that will benefit them and their country.

News reports are already indicating that Statistics New Zealand will honour its contracts and agreements with those employed throughout the census. If the decision has been made to spend this money, then spend it helping your fellow countrymen. Not only will you be employing those who may have been jobless, but you will also effectively be helping to push money into a damaged economy. Each census worker will need shelter, food and clothes, all things that can be bought within their town, pushing more funds into the area and speeding the recovery.


With the Australian people also due to run a census in 2011, the Australian Bureau of Statistics is already in preparations for what is said to be “Australia’s largest peacetime operation”. In peace, as in war, ANZACs should stand shoulder to shoulder, ready to help each other regardless of our mission or the circumstances. Australia as a nation has already given much to help our Tasman neighbours, and we should stand ready to give more. With infrastructure in place for several weeks prior to the additional census, and with so many standards designed in collaboration between our nations, Australia is more than ready to help you perform your civil duties.

Lastly, just as the census in Australia falls so close to the flooding in Brisbane, running a census in the shadow of a grave disaster such as the one that befell you will help us measure the damage that has occurred and will improve future estimates of damage. Performing this census will allow communities to better understand the damage that can befall them, and to better prepare for the worst.


Honourable readers, Prime Ministers, Commonwealth Statisticians and fellow citizens, the census is not just an exercise in counting; it is the snapshot of a nation and the prime way that we as a nation can better understand ourselves and our place in the world, and it can’t be replaced by a mere sample. It is one of the most noble acts of a society as we make every citizen count. In a census, all are equal, the rich or the poor, the young or the old, every minority enumerated for all to see. It is the candle in the darkness that shines a light on society and exposes our weaknesses and triumphs. But there are those who wish to extinguish this flame of impartiality for their own ends; they are those who wish to hide the inequalities of life, shun the impoverished and disenfranchised and maintain the bigotry that hurts us all.

When we are besieged by such evil forces, we cannot falter, and by cancelling the New Zealand census, you give these malcontents the fuel they need to further jeopardise the ability of all statistical agencies to fulfil their vital role. If you continue with this action you harm us all, and I plead with you: continue the census. Give the census a firm date and give the people of your nation a glimmer of light and show them that not even shattered earth will stop a statistician’s resolve. Those who survived may count themselves lucky, but without your enumeration their efforts will be for nought.

Can we find all the Roman toilets in ancient Jordan?

So Sam, what exactly do you do at uni?

Well, I’m glad you asked, hypothetical reader: I solve practical problems. For example, for the last 2 months I, along with 5 of my peers, have been working with the Archaeology department at the University of Western Australia to design a tool to help field researchers track and visualise dig sites using Google Earth.

This work appears to have caught the eye of the publishing group at UWA and it was mentioned in a recent edition of the UWA News.

The tool will allow researchers to track and upload site information (including such information as “was this site a toilet”) via a webpage and have this be instantly viewable back here by researchers at UWA. So yes, in the near future UWA researchers will be able to easily find out where ancient Romans pooped when they were holidaying in Jordan!