rwebdb 20-May-2010
Dynamic query programs are one quite powerful technique used when constructing data warehouses. Dynamic programs are simply programs that write other programs. They use metadata (eg, data dictionaries) as input and produce a program (ie, text file) that runs against the actual data. For even moderately complex datasets, dynamic programs can save a huge amount of work and produce more error-resistant programs than ones done by hand.
In the last rwebdb post, I illustrated a simple dynamic XQuery program that produced a shell script. Since then I updated the metadata files with a flag to indicate if an IPEDS data file contained directory information (ie, unitid, name of institution, etc). This allowed me to then write a dynamic XQuery program that produces another XQuery program that determines the number of distinct institutions in the 25 years of IPEDS data (about 20,000).
Dynamic programs tend to be compact and obtuse, because you’re working at a level abstracted from the actual data you eventually want to query. So you write the program while thinking about what the final query should look like. This takes some practice, but gets easier with time. And the benefits can really be substantial.
The new dynamic XQuery is still only a simple example of the power possible with this technique. For example, the program is still pretty situationally specific. Meaning that it’s written with directory data files in mind. Even so, it can provide a nice template for handling other types of data files; for example, those with cost or revenue data that we’ll get to later.
As we get deeper into the project and improve the metadata available, I suspect that the generality of the dynamic programs will increase.
Next up is to play with different data structures to use in the warehouse design. The goal is simplicity of use. So I’ll use the 20,000 institutions, structure the data in several different ways, and write queries (eg, a frequency distribution showing a count of institutions by the number of years for which IPEDS exists for them). Writing queries is a great way to determine ease of use. Some data structures will work better than others. And this will provide guidance as the warehouse design unfolds.
New
1. gen_xq_distinct_unitids_all_years.xq
Creates an XQuery program called count_distinct_unitids_all_years.xq that when run provides a distinct count of institutions in the 25 years of IPEDS directory data files.
2. count_distinct_unitids_all_years.xq
Program created by gen_xq_distinct_unitids_all_years.xq.
3. list_directory_files.xq
Used to list the IPEDS data files tagged in ipedsFiles.xml as containing directory data.
Updated
1. scrape_ipedsFiles_metadata.xq
Creates an XML file of metadata on IPEDS data files.
2. gen_xq_distinct_unitids.xq
Creates a shell script to run distinct_unitids_by_year.xq for each year of IPEDS data.
3. distinct_unitids_by_year.xq
Creates an XML file that contains the distinct unitids in a single data file.
