rwebdb 11-June-2010
There are two things you’ve heard me say ad nauseam in this rwebdb series. One is the need to maintain data integrity by taking the time to live in the data until you know it. The second is the need for metadata to successfully manage a warehouse project of even modest complexity.
This post concerns metadata. Listed below are five links to new XQuery programs I wrote. The first two merely illustrate IPEDS metadata used to make my life as the warehouse designer easier. Utility scripts like these are essential because they eliminate the amount of manual effort involved in a project. Over time the number of utility scripts will increase and probably at some point, particularly in a larger project, I’d end up creating metadata about these metadata utilities in an order to organize them for easier use.
The last three XQuery start building the metadata about IPEDS data variables. Here I mean things like:
- variable names (eg, unitid);
- variable descriptions (eg, “unique identification number for an institution”);
- variable type (eg, numeric or alpha);
- data format (eg, integer or decimal if numeric; string length if alpha);
- variable codes and descriptions (eg, in 2008 the type of institutional control is coded 1 for Public, 2 for Private not-for-profit, and 3 for Private for-profit).
Some of this metadata will end up appearing in the warehouse. But it can also help during design to better understand the meaning of the data and to identify critical junctures in time when the meaning changed. Code values and their meanings can, and frequently do, change over time. Likewise for variables and their meaning. The designer best realize this and find a sensible way to deal with it. That’s where we’re headed on this tack toward variable metadata.
The results of the three XQuery metadata programs are interesting. Recall that there are about 600 IPEDS data files available for download. To date I’ve only downloaded the 25 files containing institutional directory data over the period from 1980 to 2008. I wondered how many variables we’re dealing with in these 25 files. One XQuery (#4) provides an answer. The distinct (ie, unique) number of directory variables is 577. Let’s see … 577 variables in 25 files means ??? [lots] of variables in 600 files. Not bad. Fun.
I also wondered how many of the 577 directory variables appear in all or most of the 25 different years and how many appear only a few times. One XQuery program (#5) plays with that question. Here’s a link to the XML output from the program. Note that there are only 10 variables that appear in all 25 years. You can bet that these 10 variables are awfully important. In fact, given the magnitude of the changes in the surveys over time, I’d guess that some of these 10 variables were added ex post by IPEDS staff to facilitate easier time series analysis. Nice.
Some of you may be interested in the initial XML tree structure I used for the variable metadata. See XQuery (#3). It does not yet have variable codes and code descriptions, but there is room for these. I’m not certain how well this data structure will work in practice. Experience will provide an answer. And I’ll iterate the design if necessary. Ease of use is the critical factor determining the suitability of this metadata design.
New
1. list_ipedsName.xq
Lists the names that IPEDS gave to their data files.
2. count_unitid_by_year.xq
Illustrates the use of metadata files to get a count of unitids (ie, institutions) in each of the IPEDS directory data files.
3. gen_ipedsVars.xq
Generates an xml file called ipedsVars.xml that contains metadata on IPEDS variables.
4. check_ipedsVars_counts.xq
Get basic counts of IPEDS variables in directory data files.
5. check_ipedsVars_freq.xq
Returns a frequency distribution showing the number of variables by the number of years they are represented in the IPEDS directory data files.
