rwebdb – Warehouse Build Begins
Another small milestone. The warehouse data file now contains its first two variables.
One variable is the institution unitid that uniquely identifies each institution. That was just a straight load from the IPEDS data files, so not much exciting there. The second variable, called rControl, is more typical of the work ahead. It is a consolidation of the IPEDS control variable (ie, public or private control) across all 25 years of survey data. The codes used in the IPEDS control variable vary by survey year. The rControl variable makes the codes consistent across all surveys.
rControl represents the data cleanup process needed to construct a unified time series. Over the next couple months, I’ll add new warehouse variables. Some will be classificatory variables like rControl. Others will be financial variables.
Expect to see bumps in the road created by discontinuities in the surveys. Code changes are a simple change that occurs over time. Far more serious are changes in variable definition or meaning (eg, changes in legislation or regulation that alter the entire process by which values for a variable are determined).
The point being that a lot of data curation remains.
The good news, however, is that rControl required lots of experimentation with procedures for moving data from the IPEDS data files to the warehouse XML. The sequence of XQuery programs shown below illustrates these procedures. Basically it’s a variation on the classic Extraction, Transformation, Load (ETL) of any warehouse build.
I started by creating a nearly empty XML shell to hold the warehouse data, extracted IPEDS data for the 25 survey years into a staging XML, did the transformations needed to make rControl consistent over all survey years, populated the warehouse XML with the staging data, and then verified the rControl data in the warehouse with data from the original IPEDS data files.
Note that the warehouse verification done in these steps is preliminary. At a later time we’ll do a more rigorous test by verifying the warehouse data against summary results published by the National Center for Education Statistics (NCES). One such example is Table 265 in the Digest of Education Statistics 2009 already discussed earlier.
Lots yet to do, but having procedures in place for assembly of IPEDS variables into the warehouse is good start. Expect these procedures to change some as we meet novel data conundrums.
1. gen_whPop.xq
Uses the population of warehouse institutions (ie, the 5602 institutions in pop_distinct_institutions.xml) to populate an initial whPop.xml with each of the 25 years of IPEDS data. This is a basically a shell XML that only includes the institution unitid variable. As warehouse variables get constructed and verified, they will be rolled into whPop.xml.
2. rControl.xq
Creates one XML for the rControl warehouse variable that includes all survey years. This XML is a staging file for assembly into the warehouse. example_rControl.xml is an example of the output of this XQuery.
3. rControl_to_whPop.xq
Creates an updated whPop.xml that includes the rControl variable. whPop.xml is the warehouse data file. example_whPop.xml is an example of the output from this program. The XQuery is dynamic in the sense that it changes based on the survey year. This is accomplished by using an XML that contains the year specific logic. See example_recodes_xml.txt.
4. rControl_freq.xq
Frequency distribution by year for the variable rControl in the warehouse XML whPop.xml. example_rControl_freq.xml is an example of the output of this XQuery. This frequency distribution is one component used to verify that the rControl warehouse variable was created correctly. See item #5 for the second component.
5. verify_whPop_freq_1var_1year.xq
Frequency distribution for a specified IPEDS variable in a specified survey year, filtered to include only those institutions in the warehouse population. The output of this program can be used to compare with frequency distributions from whPop.xml (see item #4) to help ensure that variables in whPop.xml are being built correctly. example_verify_whPop_freq_1var_1year.xml is an example of the output from this XQuery. Note that the results for 2007 agree with those in item #4.
