rwebdb – 30-June-2010

Another piece of IPEDS metadata was completed today. This one pulls the roughly 600 variables that appear in IPEDS directory files into one XML file. For each variable and year, the XML records the variable name, data type (numeric or alpha), field width, and description, along with the number of years each variable appears in the 25 years of surveys.

Take a look at example_ipedsVars_metadata.xml to see a small portion of the metadata file. One nice feature is that you can write XQuery programs against the metadata to check for inconsistencies and changes in variable specifications over time. For instance, the control variable in the example shows several changes over the 25 years in which it appears. The most significant is probably an error in 1997, when the field width is listed as 101 characters. Not likely. Years from 1980 to 1986 also have variable descriptions that differ from those in later years. This may mean nothing. Or it may indicate a change in the wording of the survey question, and therefore in the meaning that should be attached to this variable. The year 1986-87 (see timeline) was the transition from HEGIS to IPEDS, so changes in meaning could very well have occurred then.
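As a rough illustration of that kind of consistency check, here is a minimal sketch. It is written in Python rather than XQuery, and everything about the XML layout (the `variables`, `variable`, and `year` elements and their attributes) is an assumption made for the example, not the actual structure of example_ipedsVars_metadata.xml. The sample values are also made up, apart from the 1997 width of 101 mentioned above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical layout and placeholder values; the real metadata file may differ.
SAMPLE = """<variables>
  <variable name="control">
    <year value="1980" type="numeric" width="1" desc="placeholder earlier wording"/>
    <year value="1996" type="numeric" width="1" desc="placeholder later wording"/>
    <year value="1997" type="numeric" width="101" desc="placeholder later wording"/>
    <year value="1998" type="numeric" width="1" desc="placeholder later wording"/>
  </variable>
</variables>"""


def sorted_years(var):
    """Return a variable's <year> elements in chronological order."""
    return sorted(var.findall("year"), key=lambda y: int(y.get("value")))


def width_outliers(root):
    """Flag years whose field width differs from the variable's most common width."""
    flagged = []
    for var in root.iter("variable"):
        years = sorted_years(var)
        if not years:
            continue
        usual = Counter(int(y.get("width")) for y in years).most_common(1)[0][0]
        for y in years:
            width = int(y.get("width"))
            if width != usual:
                flagged.append((var.get("name"), int(y.get("value")), width))
    return flagged


def description_changes(root):
    """Flag years whose description differs from the previous recorded year."""
    changes = []
    for var in root.iter("variable"):
        previous = None
        for y in sorted_years(var):
            desc = y.get("desc")
            if previous is not None and desc != previous:
                changes.append((var.get("name"), int(y.get("value"))))
            previous = desc
    return changes


root = ET.fromstring(SAMPLE)
print(width_outliers(root))
print(description_changes(root))
```

A flagged width is only a candidate for review. Whether it is a typo in the source file or a real change in the survey still has to be decided by hand.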

At present my intent is merely to gather variable metadata in one convenient XML file. When we actually get around to identifying variables to include in the data warehouse, we will need to go through a “cleaning” or “transformation” stage. The metadata will be enormously helpful at that point as we play detective and try to make the warehouse data the best it can be for time series analysis.

One other metadata XML file awaits. This one will capture in one place the variable code values and their definitions over time. For example, in the control variable the code ‘2’ means ‘Private-only’ in 1980 but ‘Private not-for-profit’ in 2008. Again, changes like these can tell us a lot about the variables and their meaning, and this will help when deciding what data actually gets included in the warehouse.
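To make the idea concrete, here is a small sketch of how label changes for a code value might be traced across years. The code metadata file does not exist yet, so the `codes`/`code` layout below is purely hypothetical. Python is used for illustration only. The two sample rows are the control example from the paragraph above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical layout for the planned code metadata file.
SAMPLE_CODES = """<codes variable="control">
  <code year="1980" value="2" label="Private-only"/>
  <code year="2008" value="2" label="Private not-for-profit"/>
</codes>"""


def label_history(root):
    """Map each code value to the (year, label) pairs where its label changed."""
    history = defaultdict(list)
    for code in sorted(root.iter("code"), key=lambda c: int(c.get("year"))):
        entries = history[code.get("value")]
        label = code.get("label")
        # Record a year only when the label differs from the last one seen.
        if not entries or entries[-1][1] != label:
            entries.append((int(code.get("year")), label))
    return dict(history)


codes_root = ET.fromstring(SAMPLE_CODES)
for value, entries in label_history(codes_root).items():
    print(value, entries)
```

Any code value with more than one entry in its history is one whose definition shifted at some point and deserves a closer look before it goes into the warehouse.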

So the next step is code metadata. And after that I’ll begin to play with warehouse design just to get that started.

New
1. gen_fileList.xq
Generates a list of the IPEDS files of a specified type (choices are SPS, SAS, DO) that contain variable metadata on data type, field width, and descriptions. This list is then processed by the Perl program gen_ipedsVars_fmt_lbl.pl.

2. gen_ipedsVars_fmt_lbl.pl
Creates an XML file with metadata on variable data type, field width, and descriptions. The output, ipedsVars_fmt_lbl.xml, is the input for gen_ipedsVars_metadata.xq.

3. gen_ipedsVars_metadata.xq
Uses the variable metadata in ipedsVars_fmt_lbl.xml to produce ipedsVars_metadata.xml. Metadata includes: variable name, year, data type, field width, description, and the number of years that the variable appeared in IPEDS surveys.

4. example_ipedsVars_metadata.xml
A small portion of ipedsVars_metadata.xml, provided as an example of the metadata it contains.
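The aggregation in step 3, grouping per-year records by variable and counting the years in which each one appears, could look roughly like the sketch below. This is not the actual gen_ipedsVars_metadata.xq. It is a Python approximation, and the record fields, element names, sample values, and the variable `example_var` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_metadata(records):
    """Group per-year records by variable and attach a count of distinct years."""
    by_var = defaultdict(list)
    for rec in records:
        by_var[rec["name"]].append(rec)

    root = ET.Element("variables")
    for name in sorted(by_var):
        recs = sorted(by_var[name], key=lambda r: r["year"])
        n_years = len({r["year"] for r in recs})
        var = ET.SubElement(root, "variable", name=name, years=str(n_years))
        for r in recs:
            ET.SubElement(
                var, "year",
                value=str(r["year"]), type=r["type"],
                width=str(r["width"]), desc=r["desc"],
            )
    return root


# Placeholder records standing in for the contents of ipedsVars_fmt_lbl.xml.
records = [
    {"name": "control", "year": 1980, "type": "numeric", "width": 1, "desc": "placeholder"},
    {"name": "control", "year": 2008, "type": "numeric", "width": 1, "desc": "placeholder"},
    {"name": "example_var", "year": 2008, "type": "alpha", "width": 10, "desc": "placeholder"},
]
meta_root = build_metadata(records)
print(ET.tostring(meta_root, encoding="unicode"))
```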