rwebdb 12-May-2010
In the last few days, I downloaded the remaining IPEDS data files that contain directory data on colleges and universities in the United States between 1980 and the present. Thus begins the long but absolutely essential task of living in the data long enough to recognize both its warts and its beauty. You cannot, at least in my experience, build a data warehouse or conduct meaningful data analysis without passing through this stage. It can take months.
With this dip into the data, I found more corrupt data that needed scrubbing. Nothing substantial, just some control characters (i.e., non-printing characters such as escape sequences). I revised the Perl program (now in its third revision) to clean the data as it converts the CSV to XML.
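As a rough illustration of that kind of scrubbing, here is a minimal sketch in Python. It is not the code in genXML.pl, which is written in Perl. The function names, the `institutions`/`institution` element names, and the sample values are all made up for the example, and it assumes the CSV header names are valid XML element names and that every row has the same number of fields as the header.

```python
import csv
import io
import re
from xml.sax.saxutils import escape

# Characters below 0x20 other than tab, line feed, and carriage return are
# not allowed in XML 1.0. DEL (0x7f) is legal but dropped here as well.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def scrub(text):
    """Remove non-printing control characters, keeping tabs and line breaks."""
    return CONTROL_CHARS.sub("", text)


def csv_to_xml(csv_text, row_tag="institution"):
    """Convert CSV text (header row first) to a simple XML string.

    The raw text is scrubbed before parsing, so stray control characters
    never reach the CSV reader or the XML output.
    """
    reader = csv.DictReader(io.StringIO(scrub(csv_text)))
    lines = ["<institutions>"]
    for row in reader:
        lines.append(f"  <{row_tag}>")
        for name, value in row.items():
            # escape() handles &, <, and > so the field text is valid XML.
            lines.append(f"    <{name}>{escape(value or '')}</{name}>")
        lines.append(f"  </{row_tag}>")
    lines.append("</institutions>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample row containing an escape character (0x1b).
    sample = "unitid,instnm\n1001,Exam\x1bple College\n"
    print(csv_to_xml(sample))
```

Scrubbing the whole text before parsing, rather than field by field, is a design choice for the sketch: it keeps the cleaning step in one place and avoids handing control characters to the CSV parser at all.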
I then wrote two XQuery programs. The first compares the population of institutions in each data file with the counts in the IPEDS data dictionaries (i.e., the metadata). Verifying that the XML data was generated correctly will require much more work, but getting the initial population count right is a step forward. No data files seem to exist for 1999, and the 1986 data file showed irregularities in the form of duplicates, so I wrote a second XQuery program to identify them. It turned out that one institution ID appeared more than 1200 times in the file.
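The logic of both checks can be sketched in a few lines. This is a Python stand-in, not the XQuery in check_unitid_counts.xq or find_duplicate_unitids.xq. It assumes each record carries its institution ID in a `unitid` element, and the function names, the expected count, and the sample records are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def unitid_counts(xml_text, id_tag="unitid"):
    """Count how many times each institution ID appears in one XML data file."""
    root = ET.fromstring(xml_text)
    return Counter(el.text.strip() for el in root.iter(id_tag) if el.text)


def check_population(counts, expected):
    """Compare the file's record count with the count from the data dictionary.

    Returns (total_records, distinct_ids, matches_expected).
    """
    total = sum(counts.values())
    return total, len(counts), total == expected


def find_duplicates(counts):
    """Return {unitid: occurrences} for every ID that appears more than once."""
    return {uid: n for uid, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up file with three records, two of which share an ID.
    sample = """
    <institutions>
      <institution><unitid>1001</unitid></institution>
      <institution><unitid>1002</unitid></institution>
      <institution><unitid>1001</unitid></institution>
    </institutions>
    """
    counts = unitid_counts(sample)
    print(check_population(counts, expected=3))
    print(find_duplicates(counts))
```

Reporting both the total number of records and the number of distinct IDs is useful here because a file can match the expected population count and still contain duplicates.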
To document these quirks, I started a list of questions about IPEDS data that require answers. It’s just part of the process of getting acquainted with the data. It’ll be a while yet before we’re friends.
Links to items mentioned:
1. Revised Perl program that converts CSV to XML: genXML.pl
2. List of questions about IPEDS data: ipedsq.html [now here 09-June-2010]
3. XQuery program to check count of institutions in IPEDS data files: check_unitid_counts.xq
4. XQuery program to identify duplicate institution ID numbers: find_duplicate_unitids.xq
