rwebdb 07-July-2010
Yesterday I completed work on a Perl program that creates an XML file of all variable codes and their descriptions for all 25 years of IPEDS directory data. This completes the initial set of metadata work for the data warehouse project.
NCES releases IPEDS metadata in the form of files appropriate for statistical use. There are versions for SAS, SPSS, and Stata. Each survey year has its own set of these statistical programs. So for the 25 years in our time series, this means 25 distinct statistical programs.
Combining IPEDS metadata into a comprehensive XML file requires scraping data from each of the 25 statistical programs.
I chose to use the SPSS files. My last post demonstrated how to combine these 25 files into a single XML file with metadata on variable names, data type, data width, and variable descriptions.
The work I just completed uses the same 25 SPSS statistical programs to create a single XML file with metadata on variable codes and the labels that describe the meaning of these codes.
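For readers who want a feel for the scraping step, here is a minimal sketch in Python. It is not the Perl program itself. It assumes each SPSS program contains a VALUE LABELS block laid out like the made-up sample below (a variable name on its own line, then one code and quoted label per line, with a lone period ending the block). The real NCES files may differ. The element and attribute names in the XML (variables, variable, code, name, year, value) and the year 2008 are placeholders I picked for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical excerpt of an SPSS program; the layout is an assumption.
SAMPLE_SPS = """\
VALUE LABELS
 control
 1 'Public'
 2 'Private nonprofit'
 3 'Private for-profit'
 /sector
 0 'Administrative unit'
 1 'Public, 4-year or above'
 .
"""

# A code line is a code value followed by a single-quoted label.
CODE_LINE = re.compile(r"^\s*(-?\w+)\s+'(.*)'\s*$")


def parse_value_labels(text):
    """Return {variable: [(code, label), ...]} from one VALUE LABELS block."""
    labels = {}
    current = None
    in_block = False
    for raw in text.splitlines():
        line = raw.strip()
        if not in_block:
            # Skip everything until the VALUE LABELS command starts.
            if line.upper().startswith("VALUE LABELS"):
                in_block = True
            continue
        if line == ".":
            break  # a lone period ends the command
        match = CODE_LINE.match(line)
        if match:
            if current is not None:
                labels[current].append((match.group(1), match.group(2)))
        elif line:
            # Any other non-blank line names the next variable.
            current = line.lstrip("/").strip().lower()
            labels.setdefault(current, [])
    return labels


def to_xml(labels_by_year):
    """Build one XML tree from {year: {variable: [(code, label), ...]}}."""
    root = ET.Element("variables")
    for year, labels in sorted(labels_by_year.items()):
        for var, codes in labels.items():
            var_el = ET.SubElement(root, "variable", name=var, year=str(year))
            for code, label in codes:
                code_el = ET.SubElement(var_el, "code", value=code)
                code_el.text = label
    return root


if __name__ == "__main__":
    tree = to_xml({2008: parse_value_labels(SAMPLE_SPS)})
    print(ET.tostring(tree, encoding="unicode"))
```

In practice the parse step would run once per survey year, and the results would be collected into a single dictionary keyed by year before the XML is written.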
We now have metadata in comprehensive XML files that will help during the warehouse design and deployment phases.
For example, take a look at a small snippet of the XML metadata on codes and labels. It shows the variable called control (i.e., public or private) for all 25 years in the time series. If the control variable seems important enough to include in the warehouse (it is), then this XML can help make sense of how the variable's codes changed in meaning over the time series.
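As a rough illustration of how such a file can surface changes in meaning, the Python sketch below reports each year in which a variable's code-to-label mapping differs from the year before. The XML layout, the years, and the labels in the sample are invented for the example and are not the actual IPEDS history of the control variable.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the schema and the values are assumptions.
SAMPLE_XML = """\
<variables>
  <variable name="control" year="1986">
    <code value="1">Public</code>
    <code value="2">Private</code>
  </variable>
  <variable name="control" year="1987">
    <code value="1">Public</code>
    <code value="2">Private nonprofit</code>
    <code value="3">Private for-profit</code>
  </variable>
  <variable name="control" year="1988">
    <code value="1">Public</code>
    <code value="2">Private nonprofit</code>
    <code value="3">Private for-profit</code>
  </variable>
</variables>
"""


def code_changes(root, name):
    """Return [(year, {code: label})] for each year the mapping changes."""
    variables = [v for v in root.iter("variable") if v.get("name") == name]
    changes = []
    previous = None
    for var in sorted(variables, key=lambda v: int(v.get("year"))):
        codes = {c.get("value"): (c.text or "").strip() for c in var.findall("code")}
        if codes != previous:
            # The first year always counts as a change from "nothing".
            changes.append((int(var.get("year")), codes))
            previous = codes
    return changes


if __name__ == "__main__":
    for year, codes in code_changes(ET.fromstring(SAMPLE_XML), "control"):
        print(year, codes)
```

A report like this makes it easy to see which years need a recoding rule before the variable can be loaded into a single warehouse column.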
Our next task is to begin designing the warehouse. This involves identifying a population of colleges and universities to include. It also involves identifying what data to include about the population.
I should make clear that we’ve only scratched the surface of the IPEDS data files by confining attention to the 25 directory files (i.e., files that identify colleges and universities). These represent only about 4% of the roughly 600 IPEDS data files available. Somewhere in those files is the financial data on expenses and revenues that we’ll need to address the question of why college costs so much.
One nice thing is that the Perl and XQuery code already written for the directory data files is easy to adapt to other data files. So, for example, when we create metadata on financial variables, the tools are already in place.
Even so, there is still much to do. Very fun. And I do hope I’ve demonstrated that you don’t just mash up data files like IPEDS. It requires care, attention to detail, help from others, and time. There are no shortcuts.
New
1. gen_ipedsVars_metadata_codes.pl
Creates an XML file called ipedsVars_metadata_codes.xml that contains metadata on the code values and labels used by IPEDS for all directory variables and all survey years.
2. example_ipedsVars_metadata_codes.txt
An example of the XML output from running gen_ipedsVars_metadata_codes.pl.
