Excerpts from email conversation between Lars Vilhuber and George Alter:
Variable provenance/transformation: I saw with interest at NADDI Jared and Jeremy's presentation on progress six months in. I had a couple of points that at least Jeremy (the main presenter) hadn't quite realized or had good answers to. You may have thought of these already, or they may not have come up yet:
1) You would want, I think, to assess as part of the transformation process (at least during development) whether your transformation is being correctly captured. Your co-authors mentioned "human verification", but that seems incomplete to me. In multilingual questionnaire design, you back-translate. That would seem a reasonably good technique here as well, since for each expression in VTL (the canonical transformation), there should exist a canonical expression in each language (i.e., if you can translate Stata's "g", "gen", and "generate" into some VTL expression, going back might yield "generate"). Translating back and re-running the transformed Stata program should then yield the exact same result. THAT is machine-verifiable, and it would allow you to test accuracy against a far wider range of programs than "human verification" ever could (not that the human element is redundant, but it does not scale here).
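To make the round-trip idea concrete, here is a minimal sketch in Python. Everything in it is illustrative: the "VTL-like" syntax (`var := expr`), the translator functions, and the toy row-wise evaluator are all invented for this example and are not real VTL or a real Stata parser. The point is only the shape of the check: translate forward, translate back, re-run both programs, and compare the resulting data.

```python
import re

# Toy Stata-like statement: "g|gen|generate var = expr"
STATA_GEN = re.compile(r"^(g|gen|generate)\s+(\w+)\s*=\s*(.+)$")

def stata_to_vtl(line):
    """Translate a toy Stata 'generate' statement into an invented VTL-like form."""
    m = STATA_GEN.match(line.strip())
    if not m:
        raise ValueError(f"untranslated fragment: {line!r}")
    _, var, expr = m.groups()
    return f"{var} := {expr}"

def vtl_to_stata(line):
    """Back-translate: always emits the canonical spelling 'generate'."""
    var, expr = (s.strip() for s in line.split(":=", 1))
    return f"generate {var} = {expr}"

def run_stata_like(line, data):
    """Toy evaluator: apply 'generate var = expr' row-wise to a dict of columns."""
    m = STATA_GEN.match(line.strip())
    _, var, expr = m.groups()
    n = len(next(iter(data.values())))
    data = dict(data)
    data[var] = [eval(expr, {}, {k: v[i] for k, v in data.items()})
                 for i in range(n)]
    return data

# The machine-verifiable check: original -> VTL -> back, then compare results.
original = "g total = price * qty"
round_trip = vtl_to_stata(stata_to_vtl(original))  # "generate total = price * qty"

data = {"price": [2.0, 3.0], "qty": [4, 5]}
assert run_stata_like(original, data) == run_stata_like(round_trip, data)
```

Note that the back-translated text ("generate ...") differs from the original ("g ..."): the verification compares the data each version produces, not the program text.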
2) Two corollaries then arise. First: in order to reconstruct the program for back-translation verification, you need to capture the "untranslated" parts of the programs in some fashion and store them. These blobs of non-understood Stata/SPSS/etc. code do SOME transformation, and you may at minimum be able to associate them with named variables (unless, of course, the code is run from within a Stata program/SAS macro/etc... that's a different kettle of worms).
3) Corollary two: if you do, in fact, succeed with back translation, you will have created the Rosetta Stone of programming languages. Even if you do not succeed in translating all parts of the language, this would be a VERY interesting product in and of itself. Here's the use-case scenario for the "limited" Rosetta Stone: remote tabulation/processing systems (NCHS, StatCan, German IAB = Josua) typically allow only a quite limited set of commands (a whitelist), typically in only one programming language. Plug a "rosetta stone" in front, and you open these up to a greater variety of researchers, who can then submit Stata programs to a SAS submission system; or, conversely, the data provider can translate their whitelist into other languages.
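The whitelist use case could be sketched as a lookup through a canonical intermediate. The command pairings below are illustrative placeholders, not a validated Stata-to-SAS correspondence, and a real system would translate whole programs, not just command names.

```python
# Assumed, illustrative mappings only: Stata command -> canonical operation.
CANONICAL = {
    "generate": "derive",
    "tabulate": "frequency",
    "summarize": "summary",
}

# Canonical operation -> SAS-side counterpart (again, illustrative).
SAS_FOR = {
    "derive": "DATA step assignment",
    "frequency": "PROC FREQ",
    "summary": "PROC MEANS",
}

def translate_whitelist(stata_whitelist):
    """Express a Stata command whitelist in SAS terms via the canonical layer.

    Commands with no canonical mapping are silently dropped, i.e. they stay
    outside the whitelist on the other side.
    """
    return sorted({SAS_FOR[CANONICAL[cmd]]
                   for cmd in stata_whitelist
                   if cmd in CANONICAL})

assert translate_whitelist(["generate", "tabulate"]) == ["DATA step assignment", "PROC FREQ"]
```

Either direction works from the same tables: a submission system could check an incoming Stata program against the translated whitelist, or a provider could publish its whitelist in several languages at once.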
I think the key response is: the back translation need not be textually identical (yes, a recode might come back as if/then statements) as long as it yields the same data result. That is the idea behind back translation in questionnaires too: that the concept behind the question is correctly captured (which may mean slightly different translations, or even different words). I think you would still need it. I don't think your requirement of "DDI+VTL the same -> underlying data is the same" will get you quite there, because you first need to generate the two data sets in two different ways (Stata, SPSS). You'll get more power with the back translation, I would think. That being said, they are almost the same expression, just different implementations.
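The recode example above can be illustrated with two deliberately different expressions of the same transformation. Both functions below are invented stand-ins (neither is real Stata or VTL output): one mimics a table-style recode, the other the if/then chain a back translation might produce, and the equivalence check is on their results.

```python
def recode_original(age):
    # Stand-in for a table-style recode, e.g. (0/17 = 1) (18/64 = 2) (65/max = 3).
    bins = [(0, 17, 1), (18, 64, 2), (65, float("inf"), 3)]
    return next(code for lo, hi, code in bins if lo <= age <= hi)

def recode_back_translated(age):
    # Stand-in for the same recode coming back as if/then statements.
    if age <= 17:
        return 1
    elif age <= 64:
        return 2
    else:
        return 3

# Different program text, same data result -- which is all the check requires.
ages = [0, 17, 18, 40, 64, 65, 99]
assert [recode_original(a) for a in ages] == [recode_back_translated(a) for a in ages]
```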