rwebdb 14-May-2010

With even a small data warehouse like the one planned here, you can quickly reach a level of complexity where coping tools are needed and must be built.

The IPEDS data covers 25 of the years between 1980 and 2008. When I started thinking about the design of an integrated warehouse covering some portion of this data, I thought I’d begin by seeing how many distinct institutions are represented in the various data files. [More about this in the next post.] But, first, I needed to gather information about the data files themselves. Things like the survey year, the data filename, the name of the specific survey instrument used, and a description of the survey. Data about the data (metadata) is essential for managing a data project of any complexity.

I gathered the metadata using the IPEDS website and then wrote an XQuery program to scrape the IPEDS HTML page and create an XML metadata file. One benefit of doing this is that I found the 25 years included 595 distinct data files. Very cool. It’s an unmanageable number if done manually. Hence the need for metadata and tools like Perl and XQuery that can do much of the work for you.

In production data warehouses, you can frequently use extensive metadata and query tools to write dynamic programs that are created at run-time. The present warehouse is an experiment and not a production system, so it’s not clear at this point how much of the data extraction, transformation, and load (ETL) will require automation. It’ll be interesting to see.

New:
1. XQuery program to scrape IPEDS metadata from HTML and to create XML: scrape_ipedsFiles_metadata.xq

Revised:
1. Perl program to generate XML from the IPEDS CSV data files: genXML.pl