rwebdb 19-May-2010
Unlike database systems intended for transaction processing (eg, recording the specifics of a sale), a data warehouse is used for reporting, analysis, research, and planning. The designer deliberately removes the complexity of the transactional data structures so that people, technical and non-technical alike, can ask questions of the data and get answers. Goodbye table joins; hello flattened easiness.
I like to begin design by considering the Who question (see this PDF chapter from SQL Primer). Who exactly is the population that will be included in the warehouse? The answer to this will take quite some time to get right. But in the present situation we do know that IPEDS consists of colleges and universities. And we also know that each institution is uniquely identified by an ID number called a unitid. So that’s a place to start.
How many distinct institutions appear in the 25 years of IPEDS? Are they all pretty much the same institutions year-to-year or do new ones appear while others disappear (eg, cease operation or merge)? How many institutions appear in all 25 years of data? How many appear in only a few years? Of the entire set of distinct institutions, how many do we really want to keep when considering costs and pricing of higher education in the United States? For example, is it important to include all the schools of cosmetology?
These are all good questions that provide an opportunity to play with the data, become familiar with it, and allow the answers to unfold naturally.
For an initial effort, I set out to determine the maximum possible population size for the time series by identifying the number of distinct institutions across all years of data. The word “distinct” will be determined by unitid as a first approximation. We can deal at a later time with situations where the unitid for an institution changes for some reason (hopefully there aren’t many).
I started by writing an XQuery program to return an XML file listing the distinct unitids in a single year. I could have run this program 25 times (ie, once for each year of data). Instead I used the metadata discussed in the last post to write a second XQuery that produced a shell script (ie, command file) that will run the 25 years for me in one operation. This is a nice example of using metadata to help reduce the complexity of managing 595 data files spread across 25 years.
I then wrote a third XQuery that took the 25 individual files with distinct unitids and combined them to determine the number of distinct unitids across all years. The answer turns out to be 19,887. There are basically about 20,000 colleges and universities represented in the 25 years of IPEDS data. That puts an upper bound on the warehouse population.
Even though the metadata used in this stage was helpful for automating tiresome tasks, it is not yet robust. For example, I don’t yet have an unambiguous way to determine which of the 595 IPEDS files contain directory data. I relied on the IPEDS survey title because nearly all include the word “Directory.” But 4 years did not. While the current metadata is sufficient for a demonstration of its benefits, it’s only a start. More metadata scaffolding lies ahead.
Links:
1. First XQuery produces an XML file of distinct unitids (institutions) for a specific year: distinct_unitids_by_year.xq
2. Second XQuery uses metadata to produce a shell script of the commands needed to run the “first XQuery” for each of the 25 years: gen_xq_distinct_unitids.xq. Running this shell script from the operating system produces 25 XML files with distinct unitids for each year.
3. Third XQuery combines the 25 XML files from step 2 and determines the total number of distinct unitids across all the years of data: distinct_unitids_all_years.xq
