Excerpts from email conversation between Lars Vilhuber and George Alter:
Variable provenance/transformation: I saw with interest at NADDI Jared and Jeremy's presentation on progress six months in. I had a couple of points that at least Jeremy (the main presenter) hadn't quite realized or had good answers to. You may have thought of these already, or they may not have come up yet:
1) You would want, I think, to assess as part of the transformation process (at least during development) whether your transformation is being correctly captured. Your co-authors mentioned "human verification", but that seems incomplete to me. In multilingual questionnaire design, you back-translate. That would seem a reasonably good technique here as well, since for each expression in VTL (the canonical transformation), there should exist a canonical expression in each language (i.e., if you can translate Stata's "g", "gen", and "generate" into some VTL expression, going back might yield "generate"). Translating back and re-running the transformed Stata program should then yield the exact same result. THAT is machine-verifiable, and it would allow you to test accuracy against a far wider range of programs than "human verification" ever could (not that the human element is redundant, but it does not scale here).
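To make the round-trip idea concrete, here is a minimal sketch in Python. Everything in it is illustrative: the "VTL-like" syntax (`var := expr`), the translator functions, and the toy row-wise evaluator are all invented for this example and are not real VTL or a real Stata parser. The point is only the shape of the check: translate forward, translate back, re-run both programs, and compare the resulting data.

```python
import re

# Toy Stata-like statement: "g|gen|generate var = expr"
STATA_GEN = re.compile(r"^(g|gen|generate)\s+(\w+)\s*=\s*(.+)$")

def stata_to_vtl(line):
    """Translate a toy Stata 'generate' statement into an invented VTL-like form."""
    m = STATA_GEN.match(line.strip())
    if not m:
        raise ValueError(f"untranslated fragment: {line!r}")
    _, var, expr = m.groups()
    return f"{var} := {expr}"

def vtl_to_stata(line):
    """Back-translate: always emits the canonical spelling 'generate'."""
    var, expr = (s.strip() for s in line.split(":=", 1))
    return f"generate {var} = {expr}"

def run_stata_like(line, data):
    """Toy evaluator: apply 'generate var = expr' row-wise to a dict of columns."""
    m = STATA_GEN.match(line.strip())
    _, var, expr = m.groups()
    n = len(next(iter(data.values())))
    data = dict(data)
    data[var] = [eval(expr, {}, {k: v[i] for k, v in data.items()})
                 for i in range(n)]
    return data

# The machine-verifiable check: original -> VTL -> back, then compare results.
original = "g total = price * qty"
round_trip = vtl_to_stata(stata_to_vtl(original))  # "generate total = price * qty"

data = {"price": [2.0, 3.0], "qty": [4, 5]}
assert run_stata_like(original, data) == run_stata_like(round_trip, data)
```

Note that the back-translated text ("generate ...") differs from the original ("g ..."): the verification compares the data each version produces, not the program text.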
2) Two corollaries then arise. First: in order to reconstruct the program for back-translation verification, you need to capture the "untranslated" parts of the programs in some fashion and store them. These blobs of non-understood Stata/SPSS/etc. code do SOME transformation, and you may at minimum be able to associate them with named variables (unless, of course, the code is run from within a Stata program/SAS macro/etc... that's a different kettle of worms).
3) Corollary two: if you do, in fact, succeed with back translation, you will have created the Rosetta Stone of programming languages. Even if you do not succeed in translating all parts of the language, this would be a VERY interesting product in and of itself. Here's the use-case scenario for the "limited" Rosetta Stone: remote tabulation/processing systems (NCHS, StatCan, German IAB = Josua) typically allow only a quite limited set of commands (a whitelist), typically in only one programming language. Plug a "rosetta stone" in front, and you open these up to a greater variety of researchers, who can then submit Stata programs to a SAS submission system; or, conversely, the data provider can translate their whitelist into other languages.
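The whitelist use case could be sketched as a lookup through a canonical intermediate. The command pairings below are illustrative placeholders, not a validated Stata-to-SAS correspondence, and a real system would translate whole programs, not just command names.

```python
# Assumed, illustrative mappings only: Stata command -> canonical operation.
CANONICAL = {
    "generate": "derive",
    "tabulate": "frequency",
    "summarize": "summary",
}

# Canonical operation -> SAS-side counterpart (again, illustrative).
SAS_FOR = {
    "derive": "DATA step assignment",
    "frequency": "PROC FREQ",
    "summary": "PROC MEANS",
}

def translate_whitelist(stata_whitelist):
    """Express a Stata command whitelist in SAS terms via the canonical layer.

    Commands with no canonical mapping are silently dropped, i.e. they stay
    outside the whitelist on the other side.
    """
    return sorted({SAS_FOR[CANONICAL[cmd]]
                   for cmd in stata_whitelist
                   if cmd in CANONICAL})

assert translate_whitelist(["generate", "tabulate"]) == ["DATA step assignment", "PROC FREQ"]
```

Either direction works from the same tables: a submission system could check an incoming Stata program against the translated whitelist, or a provider could publish its whitelist in several languages at once.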
I think the key response is: the back translation need not be textually identical (yes, a recode might come back as if/then statements) as long as it yields the same data result. That is the idea behind back translation in questionnaires too: that the concept behind the question is correctly captured (which may mean slightly different translations, or even different words). I think you would still need it. I don't think your requirement of "DDI+VTL the same -> underlying data is the same" will get you quite there, because you first need to generate the two data sets in two different ways (Stata, SPSS). You'll get more power with the back translation, I would think. That being said, they are almost the same expression, just different implementations.
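The recode example above can be illustrated with two deliberately different expressions of the same transformation. Both functions below are invented stand-ins (neither is real Stata or VTL output): one mimics a table-style recode, the other the if/then chain a back translation might produce, and the equivalence check is on their results.

```python
def recode_original(age):
    # Stand-in for a table-style recode, e.g. (0/17 = 1) (18/64 = 2) (65/max = 3).
    bins = [(0, 17, 1), (18, 64, 2), (65, float("inf"), 3)]
    return next(code for lo, hi, code in bins if lo <= age <= hi)

def recode_back_translated(age):
    # Stand-in for the same recode coming back as if/then statements.
    if age <= 17:
        return 1
    elif age <= 64:
        return 2
    else:
        return 3

# Different program text, same data result -- which is all the check requires.
ages = [0, 17, 18, 40, 64, 65, 99]
assert [recode_original(a) for a in ages] == [recode_back_translated(a) for a in ages]
```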