Posts Tagged ‘Virgil’

Always double check the standard before writing code

A few weeks ago, I had the privilege of presenting at a collection of DDI Developers in Gothenburg at EDDI. There I presented one of my larger pieces of work, the Virgil-UI DDI Codelist Editor, for critique. While there I received advice, praise and most importantly constructive criticism for which I am grateful. However, this has brought to light a rather large problem.

It was pointed out that I made a small error when dealing with <Code> elements in DDI and accidentally gave them @id attributes, and it was noted that this should be an easy fix. Unfortunately, because I missed this very early in the development of Virgil, the underlying model relies on Codes having ids to easily make connections between the hierarchical user interface, the <Code>s and the <Category>s that give them meaning.

What this means is that the DDI coming out of Virgil is invalid, and valid DDI cannot actually be read by Virgil. Essentially, the Virgil model for handling DDI is broken and needs to be almost entirely rewritten, and this might take quite a while.

Unfortunately, at this stage rewriting also means re-examining a lot of the initial ideas about what Virgil should be and has highlighted some interesting questions about the DDI model and DDI software, such as:

  1. Is abstracting the DDI model away from a user a good approach to software design? Yes.
    This was the crux of my talk at EDDI, and I still feel that abstracting the DDI model away from day-to-day users is necessary. The DDI model is complex and covers a wide range of tasks. I believe that designing software that helps users relate the model to specific tasks they are trying to do is a key to getting people to use DDI and think about how they can make their metadata support themselves and those around them.
  2. Is DDI a standard that is suitable to use for day to day management of information? Probably.
    In practice, the DDI standard needs to be able to be passed between software if it is to move from an archival standard to a practical statistical metadata standard. One of the things I wanted to achieve with Virgil was a tool that not only produced DDI, but could also consume it from other sources. In the simplest case, this to me meant being able to take a DDI file and edit the contents of part of it, leaving the rest untouched, and in a lot of cases this is possible with DDI. However, since having to rethink how to manage classifications using DDI, I have realised that there are some objects that are not captured well within DDI, and unfortunately classifications are one such example.
  3. Is the DDI model for managing codelists and classifications good enough? Sadly not.
    One of the reasons I relied so heavily on the invalid <Code> @ids was that I needed a hook to tie codes and categories together and without this it becomes very difficult to manage what a ‘classification’ is in DDI. Furthermore, classifications don’t exist in DDI per se, but are a rather loose agreement that if you combine <CodeScheme>s and <CategoryScheme>s you get a good approximation. However, this falls apart when we try to document the classification itself.
    For example, where do you store the name of a whole classification? There are three viable places (excuse the XPath) – as a //CodeScheme/Label (being the label of the hierarchy), as a //CategoryScheme/Label (being the label of the collection of classifying categories) or as a //LogicalProduct/Label (the label of the immediate parent that contains both the hierarchies and the categories).
    However, each of these approaches has inherent issues, as none of them is the documented way to manage this information, and if three different agencies approached the problem in different ways, their metadata would become incomparable. This needs to be discussed further, as it will become a bigger issue as more tools start to manage a piece of metadata that is so important and sits so early in the lifecycle.
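To make the ambiguity concrete, here is a minimal sketch that looks for a classification name in all three places. The XML is a simplified, namespace-free stand-in that I have made up for illustration (real DDI 3.1 elements are namespaced), and the function name is hypothetical rather than anything from Virgil.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for a DDI fragment.
# This is an assumption for illustration only; real DDI 3.1 is namespaced.
SAMPLE = """
<LogicalProduct>
  <Label>Industry classification (product)</Label>
  <CategoryScheme>
    <Label>Industry classification (categories)</Label>
  </CategoryScheme>
  <CodeScheme>
    <Label>Industry classification (hierarchy)</Label>
  </CodeScheme>
</LogicalProduct>
"""


def candidate_names(root):
    """Return the first direct Label found under each candidate element."""
    found = {}
    for tag in ("LogicalProduct", "CategoryScheme", "CodeScheme"):
        # iter() also yields the root itself when its tag matches.
        for element in root.iter(tag):
            label = element.find("Label")  # direct children only
            if label is not None and label.text:
                found.setdefault(tag, label.text.strip())
    return found


names = candidate_names(ET.fromstring(SAMPLE))
for tag, text in names.items():
    print(tag, "->", text)
```

A tool reading someone else's DDI has no way of knowing which of the three results is "the" name, which is exactly the comparability problem described above.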

It should be noted that these issues don’t excuse overlooking the actual standard, which is what led to this predicament. However, having to re-examine how to correct the problem in Virgil also gives me a chance to examine some of the issues I came across while trying to maintain classifications within DDI. Over the coming month or so I am going to continue writing up some of the issues I identified with classifications within DDI 3.1, how to work around these in the short term, and ways to correct the problem in future versions of the standard.

Lastly, in the short term there will be an update to correct the Code/id problem in the CSV to DDI conversion, so the original use case of being able to mine legacy systems to produce valid DDI will still be fulfilled.

Thanks again to everyone at EDDI for their input and company.

Virgil UI 0.0.1 Beta now live!!

After months of development, testing, coding and crying…. Virgil UI version 0.0.1b is now available for public beta testing.

This release sees the first public testing of a fully functional, classification- and codelist-specific editor based on and supporting the DDI Lifecycle XML format (DLML).

Features in this release of Virgil include:

Known issues in the 0.0.1b release that will be fixed in a future release:

  • Codes or languages cannot be removed once added.
  • New CodeSchemes cannot be added manually, only when importing from CSV.
Also new is an updated version of the standalone CSV to DDI converter tool that fixes some outstanding bugs in multilingual imports and corrects a few mistakes when writing the DLML.
For more information on Virgil-UI there is a list of blog posts outlining the development process, or you can check out the Google Code page, view all the downloads, or submit bugs.

Virgil UI – Beta demo video

Just a quick update that was supposed to have gone up last night. There is a video up on YouTube now, showing off some of the more finalised features of Virgil-UI.

This shows three big features – CSV import, drag-and-drop reordering of classifications and multilingual support for editing. This means a classification with a multilingual component, for example a Canadian Industry Classification could have the English and French components edited simultaneously.

As stated in the last post, a Windows binary release of a beta version of Virgil-UI and an updated version of the converter tool should be released in early September.

Microupdate – Virgil-UI now has improved multilingual support

After a weekend spent literally fighting to get multilingual support working for me in Python and Qt, Virgil-UI now has wide-ranging support for multiple languages – both in the editor and when importing from CSVs with unusual character sets.

With this, and the previously unmentioned drag and drop support for reordering classifications, Virgil is approaching a point where it is almost ready for beta testing. In the coming week, I’ll be making a few small tweaks, along with a demonstration video. Hopefully, by early September it should be packaged up as a beta, for widespread testing amongst the DDI community – just in time for the close of submissions for this year’s European DDI Users meeting.

Below are a few screenshots showing off the two main multilingual support features in Virgil-UI – the ability to add and edit the labels and descriptions of codes, and the ability to view the classification tree in any language that has been added.

Along with all of these changes a number of bugs in the CSV to DDI import tool have been corrected and I’ll be pushing out an updated Windows binary of that alongside the main release of Virgil-UI.

Showing off the basic language editing functionality

Users can even select which language they want the tree structure of the classification to be displayed in.

 

Updates to the Virgil CSV to DDI Converter

A short and sweet update:

There was an oversight with the CSV converter not converting coded values to the proper place in the created DDI XML. This has been fixed and the changes have been pushed into SVN and a new version (0.0.2b) of the executable has been released on Google Code.

420 convert classifications everyday

With the recent release of the new Australian Standard Classification of Drugs of Concern from the ABS, there was the opportunity to field test the Virgil CSV to DDI converter with real data to see how it held up. Fortunately, the classification was released as an Excel data cube that conformed almost entirely with the structures that Virgil supports. After a little cleaning of the CSV, it was able to run through the converter with very few issues. Incidentally, the most major error highlighted the massive oversight that the converter failed to add values for the codes! However, this has been corrected, the changes have been pushed to SVN, and a new version of the Windows tool will be pushed out this weekend.

A screen shot of Virgil with the converted classification


Opening the newly created DDI file in the Virgil DDI CodeList Editor was another story and pointed out a few flaws with how it handles empty data. With the structure from the Excel file not containing descriptions for any category or any labels for the CodeScheme, there were a few small corrections made to accommodate freshly created DDI, but many of these problems will be ironed out by the time the CodeList editor is available for download.

While the converter hasn’t been fully integrated into the CodeList Editor, it will shortly be possible to create a single DDI file and import numerous CSV files to create a series of classificatory codelists in a single package. A practical and soon to be realised example would be the Australian Standard Classification of Drugs of Concern with the lists of drugs of concern, forms of drug and methods of consumption codelists all contained in a single machine processable DDI package.

For those who haven’t been able to download or run the converter, the output from this example is available for testing.

Virgil UI – CSV to DDI converter now available for Windows

The day is finally here – Virgil c2d is available for Windows. You can download the zip archive from Google Code. In future this will be the place where new versions of the tool are made available, and I am hoping that as people start using it and bugs get noticed there will be activity, so be sure to check back often to see if changes are available.

For the time being though, download a copy of the beta, check out some of the example CSVs and learn about how the different CSV types look.

If you have issues getting the application to run, check the converter_ui.exe.log log file for any errors and be sure to raise a bug through the issue tracker. If there are issues getting a file to convert, check the structure settings are correct, and check the line that the error dialog indicates may be causing the issue. If you are still unable to get the CSV to convert, raise an issue and attach the offending CSV file and I’ll see if the problem can be resolved.

When checking out the example CSVs the filenames give some hints to the structure of the data in them:

  • ss: semi-structured
  • mono: monolingual
  • pd: pre-defined language
  • pe: prefix embedded language

The other example files have the following types:

  • anzsic 2006 – codes and titles.csv – Semi-structured, Monolingual
  • anzsic.csv – Semi-structured, Monolingual
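As a small illustration of the naming hints, the sketch below decodes a filename such as anzsic.ss-pd.csv. It assumes the hints sit between the last two dots and are joined by a hyphen, which is my reading of the example filenames rather than a documented rule, and the function name is hypothetical.

```python
import os

# Abbreviations as listed above.
HINTS = {
    "ss": "semi-structured",
    "mono": "monolingual",
    "pd": "pre-defined language",
    "pe": "prefix embedded language",
}


def structure_hints(filename):
    """Return the structure hints embedded in a filename, or None if absent."""
    stem = os.path.splitext(os.path.basename(filename))[0]  # e.g. 'anzsic.ss-pd'
    if "." not in stem:
        return None  # nothing between the name and the extension
    hint_part = stem.rsplit(".", 1)[1]  # e.g. 'ss-pd'
    return [HINTS[part] for part in hint_part.split("-") if part in HINTS]


print(structure_hints("anzsic.ss-pd.csv"))
print(structure_hints("anzsic.csv"))
```

Files without an embedded hint, like the two listed above, return None and need their type looked up by hand.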

 

Virgil UI – CSV Converter UI Files now up

Over the week I’ve been coding away wrapping the CSV to DDI converter module with a nice user interface. Well, after a weekend of work it has a user interface; whether it is nice is in the eye of the beholder. As with the rest of the Virgil project, the Python code for this tool is available on Google Code. Unfortunately I haven’t had time to compile this into a Windows executable suitable for novice use, but interested parties are again welcome to download and test the tool from source.

For the curious, I’ve again recorded a demonstration and put it up on YouTube, which is embedded below:

Again there is no audio, but I’ve included a brief transcription below so people can get a better idea of what the demonstration is trying to illustrate:

  • Open the anzsic.csv file to briefly view the contents of the CSV holding the labels and some descriptions of categories in the 2006 Australia and New Zealand Standard Industrial Classification.
  • Execute the conversion tool, and load the anzsic.csv file.
  • Select the correct structure options for the CSV, as per the allowed structures described in a previous post.
  • Add a default language code and ID prefix for the DDI Instance and all codes and categories.
  • Demonstrate the preview table, showing how the header row can be ignored.
  • Convert the file; in the background you can see debug text for each code encountered.
  • Open a folder to save, and confirm the folder is empty.
  • Open the newly created file.
  • Add some line breaks to the automatically created XML and search for a term from the original CSV.

Hopefully by this time next week there will be a fully downloadable Windows executable available for people to try.

Virgil UI – Site is live and converter code is now available

Virgil UI is now starting to get ready for release to the public, with the first step being that it now has a Google Code project site, which is starting to include.

The first code to go up in the public repository is the utility code for the CSV to DDI conversion described in the last post. This includes the specialised CSV parser, library to create DDI 3.1 Codes and Categories and a sample command line interface to pull it all together.

For those who can’t wait for either a GUI tool or an executable command-line app, or just want to play with the code, feel free to grab the source. Keep in mind that this is very much pre-beta and is under active development, but if (or more likely when?) you come across a bug, be sure to report it through the issue tracker on the site. To help folks along, a few sample CSVs are included to give developers an idea of the required format for the converter to consume, but that isn’t an exhaustive list of all the possible combinations of code and category list types.

In lieu of actual usage documentation (which will be added during the week) below is a sample execution:

python2.7 ./converter_cli.py -i ./test_files/anzsic.ss-pd.csv -c SemiStructured -C PreDefinedColumn -o outfile.xml -d DDIInstance_ID --codeSchemeID=Test_codeSchemeID --categorySchemeID=Test_categorySchemeID
python2.7 ./converter_cli.py                - Needed to execute the script
-i ./test_files/anzsic.ss-pd.csv            - The CSV file to transform
-c SemiStructured                           - The CSV CodeList type (see previous blogpost for more info)
-C PreDefinedColumn                         - The CSV Category type (see previous blogpost for more info)
-o outfile.xml                              - File to save the DDI to, if blank output to console
-d DDIInstance_ID                           - The ID for the new parent DDIInstance of the resultant file
--codeSchemeID=Test_codeSchemeID            - ID for the CodeScheme that will hold the codes - is also part of the prefix for all DDI code IDs
--categorySchemeID=Test_categorySchemeID    - ID for the CategoryScheme that will hold the categories - is also part of the prefix for all DDI category IDs

Note: this does need Python 2.7, as the DDI module uses some of the more advanced XPath options in ElementTree.
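For anyone wanting to wrap the same options in their own script, the flags above could be parsed along these lines. This is only a sketch using argparse from the standard library; it is not the actual contents of converter_cli.py, and the dest names are my own.

```python
import argparse


def build_parser():
    """Build a parser that accepts the flags shown in the sample execution."""
    parser = argparse.ArgumentParser(description="CSV to DDI converter (sketch)")
    parser.add_argument("-i", dest="infile", required=True,
                        help="the CSV file to transform")
    parser.add_argument("-c", dest="codelist_type",
                        help="the CSV CodeList type, e.g. SemiStructured")
    parser.add_argument("-C", dest="category_type",
                        help="the CSV Category type, e.g. PreDefinedColumn")
    parser.add_argument("-o", dest="outfile", default=None,
                        help="file to save the DDI to; console if omitted")
    parser.add_argument("-d", dest="instance_id",
                        help="ID for the new parent DDIInstance")
    parser.add_argument("--codeSchemeID", help="ID for the CodeScheme")
    parser.add_argument("--categorySchemeID", help="ID for the CategoryScheme")
    return parser


args = build_parser().parse_args([
    "-i", "./test_files/anzsic.ss-pd.csv",
    "-c", "SemiStructured",
    "-C", "PreDefinedColumn",
    "-o", "outfile.xml",
    "-d", "DDIInstance_ID",
    "--codeSchemeID=Test_codeSchemeID",
    "--categorySchemeID=Test_categorySchemeID",
])
print(args.infile, args.codelist_type, args.category_type)
```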

Virgil UI – Converting from legacy to CSV to DDI

While my main machine has been out of action, I’ve devoted more time to one of the first use cases that prompted the development of Virgil – transforming legacy CSVs into DDI 3.1.

One of the main features of Virgil is the ability to help users transition from legacy systems that use non-standard formats to using DDI as the main data language for managing codes, categories and classifications. Unfortunately, there is no way for any one system to support every format for classifications; however, by targeting a lowest common denominator we can process the bulk of the work. In this case the lowest common denominator is CSVs.

If a user or developer of a legacy system is able to transform their legacy format into one of several different CSV formats supported by Virgil, then they will be able to import, at the least, the basic structure and metadata of their codes and classifications into DDI. With most of the code for the conversion tools done, I’ve begun putting together the wizard interface for Virgil UI, which will also form part of a standalone conversion tool. Within the next few weeks the standalone conversion tool will be ready for release, and made available as open source with the supporting code.

Below is a list of questions that users and developers may have about how to prepare CSVs for conversion to DDI, covering the convertible metadata, preferred CSV structure and developer support. Although the CSV structuring options for conversion are restricted, if there is a need to expand the formats or metadata available for conversion, make your needs known and this can be incorporated into future development.


What metadata will be supported?

A user will be able to import the code values and the hierarchy of a classification, as well as labels and descriptions of categories. Labels and descriptions can be multilingual, and multiple languages per item are able to be imported.

Will I have to use Virgil-UI to use this converter?

No. This converter will be available as a wizard within Virgil, but the UI for the wizard will also be available as a standalone program for users who need to convert from a legacy system to DDI. In addition, as the code will be entirely open source, the Python module that performs the transformations will be able to be imported into any other piece of Python software. Lastly, since the converter module is written entirely using modules from the Python standard library, it will be usable from languages that have compatible Python implementations, such as Java using Jython[http://www.jython.org/] or .Net using IronPython[http://ironpython.net/].

In summary there will be at least four ways developers and users will be able to implement the Virgil CSV-DDI converter tools.

What ‘formats’ of CSV will be supported?

CSVs are generally without structure and are just a basic way of storing tabular data, but a simple combination of the following code and category forms gives a CSV enough structure to be converted. When picking a structure, it is important that the ‘code’ columns come before any ‘category’ columns. However, any combination of a code and category column format, if created correctly, should convert from CSV to DDI without trouble.

Column options for importing codes and their hierarchy

Referential CSV Codelist
Order: Code, Parent
Notes: This can be reversed to go Parent, Code. If a parent is blank it is assumed that this node is a top level code in a CodeScheme.

Example:
A, ,
1,A,
2,A,
B, ,
3,B,
4,B,
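A referential list like this can be turned into a parent-to-children mapping in a few lines. The sketch below is written for Python 3 and uses only the standard csv module; the function name and the parent_first switch are my own illustrations, not the converter's actual API.

```python
import csv
import io


def parse_referential(text, parent_first=False):
    """Map each parent code (None for top level) to its list of child codes."""
    children = {}
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        if len(cells) < 2 or not any(cells):
            continue  # skip blank or malformed rows
        if parent_first:
            parent, code = cells[0], cells[1]
        else:
            code, parent = cells[0], cells[1]
        # A blank parent means the code sits at the top of the CodeScheme.
        children.setdefault(parent or None, []).append(code)
    return children


SAMPLE = "A, ,\n1,A,\n2,A,\nB, ,\n3,B,\n4,B,\n"
tree = parse_referential(SAMPLE)
print(tree)
```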

Semi-structured CSV Codelist

Order: (Empty,)*Code,
Notes: If the code is the first entry in a row then it is considered a top code in the CodeScheme. Any children of a code should be indented by only one column. The columns for labels and descriptions start in different columns depending on level of the hierarchy.

Example:
A,
 ,1,
 ,2,
B,
 ,3,
 ,4,
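For the semi-structured form, the depth of a code is simply the number of empty cells before it. The following is a minimal Python 3 sketch of that idea, with a hypothetical function name; it tracks the most recent code at each depth so every row can be given a parent.

```python
import csv
import io


def parse_semi_structured(text):
    """Return (depth, code, parent) tuples; depth counts leading empty cells."""
    results = []
    stack = []  # stack[d] holds the most recent code seen at depth d
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        depth = 0
        while depth < len(cells) and cells[depth] == "":
            depth += 1
        if depth == len(cells):
            continue  # row has no code at all
        if depth > len(stack):
            raise ValueError("row indented more than one column: %r" % (row,))
        code = cells[depth]
        parent = stack[depth - 1] if depth else None
        del stack[depth:]  # forget deeper codes from earlier branches
        stack.append(code)
        results.append((depth, code, parent))
    return results


SAMPLE = "A,\n ,1,\n ,2,\nB,\n ,3,\n ,4,\n"
rows = parse_semi_structured(SAMPLE)
for depth, code, parent in rows:
    print("  " * depth + code, "(parent: %s)" % parent)
```

The ValueError enforces the rule that children are indented by only one column relative to their parent.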

Aligned Semi-structured CSV Codelist

Order: (Empty,)*Code,(Empty,)*
Notes: If the code is the first entry in a row then it is considered a top code in the CodeScheme. Any children of a code should be indented by only one column. All nodes should be padded so that the columns for labels and descriptions start in the same columns.

Example:
A, ,
 ,1,
 ,2,
B, ,
 ,3,
 ,4,
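With the aligned form the number of hierarchy columns is fixed, so a row can be split into its code and its category cells at a known position. A rough Python 3 sketch, assuming the caller knows how many columns are reserved for the hierarchy (the function name, parameter and sample labels are hypothetical):

```python
def split_aligned_row(cells, hierarchy_columns):
    """Split an aligned row into (depth, code, remaining category cells)."""
    head = [cell.strip() for cell in cells[:hierarchy_columns]]
    filled = [index for index, cell in enumerate(head) if cell]
    if len(filled) != 1:
        raise ValueError("expected exactly one code in the hierarchy columns")
    depth = filled[0]
    rest = [cell.strip() for cell in cells[hierarchy_columns:]]
    return depth, head[depth], rest


print(split_aligned_row(["A", " ", "Agriculture"], 2))
print(split_aligned_row([" ", "1", "Farming"], 2))
```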

Column options for importing multilingual categories

Prefix-embedded Language

Order: (Label,Description)+
Notes: As many languages as needed can be repeated within a row as long as they have unique language codes.

Example: en-au;Chocolate,en-au;Confectionery based on the seed of the cacao plant,fr;Chocolat,fr; Confiseries à base de la graine de la plante de cacao
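Reading the prefix-embedded form comes down to splitting each cell on its first semicolon. The sketch below (Python 3, hypothetical function name) assumes, as in the example, that each label is followed by its description and that both carry the same language prefix.

```python
import csv
import io


def parse_prefix_embedded(cells):
    """Return {language: (label, description)} from 'lang;text' cell pairs."""
    if len(cells) % 2:
        raise ValueError("cells must come in (label, description) pairs")
    result = {}
    for index in range(0, len(cells), 2):
        lang, label = cells[index].split(";", 1)
        desc_lang, description = cells[index + 1].split(";", 1)
        lang, desc_lang = lang.strip(), desc_lang.strip()
        if lang != desc_lang:
            raise ValueError("label and description languages differ")
        if lang in result:
            raise ValueError("language code repeated: %s" % lang)
        result[lang] = (label.strip(), description.strip())
    return result


ROW = ("en-au;Chocolate,en-au;Confectionery based on the seed of the cacao plant,"
       "fr;Chocolat,fr; Confiseries à base de la graine de la plante de cacao")
categories = parse_prefix_embedded(next(csv.reader(io.StringIO(ROW))))
print(categories["fr"][0])
```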

Pre-defined Column

Order: (language,Label,Description)+
Notes: As many languages as needed can be repeated within a row as long as they have unique language codes.

Example: en-au,Strawberries,Tasty fruit that isn't a true berry,fr,Fraise,Fruits savoureux qui n'est pas une baie vrai
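The pre-defined column form can be read by walking the category cells three at a time. A minimal Python 3 sketch, with a hypothetical function name:

```python
import csv
import io


def parse_predefined(cells):
    """Return {language: (label, description)} from language,label,description triples."""
    if len(cells) % 3:
        raise ValueError("cells must come in (language, label, description) triples")
    result = {}
    for index in range(0, len(cells), 3):
        lang, label, description = (cell.strip() for cell in cells[index:index + 3])
        if lang in result:
            raise ValueError("language code repeated: %s" % lang)
        result[lang] = (label, description)
    return result


ROW = ("en-au,Strawberries,Tasty fruit that isn't a true berry,"
       "fr,Fraise,Fruits savoureux qui n'est pas une baie vrai")
categories = parse_predefined(next(csv.reader(io.StringIO(ROW))))
print(sorted(categories))
```

Note that a description containing a comma would need to be quoted in the CSV, otherwise the triples fall out of step.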

Monolingual

Order: (Label,Description)
Notes: When only importing a single language that isn’t expressed in the CSV a default language will need to be given when invoking the converter.

Example: Vegemite,A yeast extract spread only edible by people from Australia. No other translations exist because no one else can stand it.

Can this tool support tab-separated files?

Yes. In the wizard users will be given the opportunity to select from a range of delimiter options or enter their own delimiting character. When using this module in other code, it will also support any delimiter as long as it is specified when calling the module.
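With the Python csv module, switching delimiters is a single argument. The helper below is a Python 3 sketch with a name of my own choosing, not the converter's interface, and the sample rows are made up.

```python
import csv
import io


def read_rows(text, delimiter=","):
    """Read delimited text into rows of whitespace-trimmed cells."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [[cell.strip() for cell in row] for row in reader]


TSV = "A\t\tAgriculture\n\t1\tFarming\n"
rows = read_rows(TSV, delimiter="\t")
print(rows)
```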

How should a developer write CSV for the converter?

With no agreed upon standard for CSVs, it’s hard for developers to try and write ‘standard’ CSVs. To simplify development and be as lenient as possible, the Virgil CSV-DDI converter uses the Python CSV module[http://docs.python.org/library/csv.html]. If you are writing your own CSV writer I’d suggest testing it against this module to make sure it works.

In a nutshell though – leading and trailing whitespace is trimmed and any entry that contains a comma (or specified delimiter) should be quoted with double (“) or single (‘) quote marks.
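One quick way to test a writer against the csv module is a round trip: write a row, read it back and compare. The sketch below is Python 3 and the helper name and sample values are my own. The csv module does not trim whitespace by itself, so the trimming is done explicitly here, and single-quoted fields only parse if quotechar is set.

```python
import csv
import io


def round_trip(row):
    """Write a row with csv.writer, read it back, and trim each cell."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)  # fields containing the delimiter get quoted
    buffer.seek(0)
    parsed = next(csv.reader(buffer))
    return [cell.strip() for cell in parsed]


row = ["A", "Agriculture, Forestry and Fishing", "  padded  "]
print(round_trip(row))

# Single-quoted input needs the quote character stated explicitly.
single_quoted = next(csv.reader(io.StringIO("A,'B, C'"), quotechar="'"))
print(single_quoted)
```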

What will the wizard and standalone converter look like?

Something like this:

PyQT Mockup of the CSV/DDI Converter