About the Project


Articles about the project:

Automating the Capture of Data Transformation Metadata from Statistical Analysis Software

Provenance Metadata for Statistical Data: An Introduction to Structured Data Transformation Language (SDTL)


Poster on the Continuous Capture of Metadata project

Download poster as PDF


As the research community responds to increasing demands for public access to scientific data, the need for better data documentation has become critical. Accurate and complete metadata is essential for data sharing and for interoperability across different data types. However, describing and documenting scientific data remains a tedious, manual process even when data collection is fully automated. Researchers are often reluctant to share data even with close colleagues, because creating documentation takes so much time.

This project will greatly reduce the cost and increase the completeness of metadata by creating tools to capture data transformations from general-purpose statistical analysis packages. Researchers in many fields use the main statistics packages (SPSS®, SAS®, Stata®, R) for data management as well as analysis, but these packages lack tools for documenting variable transformations in the manner of a workflow system or even a database. At best, the operations performed by the statistical package are described in a script, which more often than not is unavailable to future data users.

This project will develop new tools that work with common statistical packages to automate the capture of metadata at the granularity of individual data transformations. In addition to providing much more detailed documentation than is currently available, these tools will create efficiencies and reduce the costs of data collection, preparation, and re-use. Software-independent data transformation descriptions will be used to update metadata in two internationally accepted standards, the Data Documentation Initiative (DDI) and Ecological Metadata Language (EML). Our project targets two research communities that have strong metadata standards and rely heavily on statistical analysis software (the social and behavioral sciences and the earth observation sciences), but the approach is generalizable to other domains, such as biomedical research.