rwebdb 08-June-2010

The last rwebdb post left unfinished the task of ensuring that our downloaded IPEDS data matched aggregate summaries in published NCES reports. In this case I was trying to duplicate the count of institutions that appeared in Table 265 of the Digest of Education Statistics 2009. Matching published results is a great way to ensure you’re using the same definitional logic used by data stewards. It gets warehouse designers, developers, and analysts familiar with the data, and it builds confidence in the dataset being constructed.

There were no problems duplicating counts in Table 265 over the time period 2000-01 to 2008-09. But prior years failed, consistent with significant changes that occurred in IPEDS in 2000 (review the timeline if needed).

Last week I contacted NCES and got a very helpful series of emails in return from Samuel Barbett, who is an IPEDS staff member. He provided code fragments that allowed me to duplicate counts in Table 265 back to 1987-88. The XQuery programs that I wrote are listed in the links below.

There are several important points to note:

  • Data stewards provide an invaluable resource when building a data warehouse.
  • Validating counts across the 20+ year time series from 1987-88 to 2008-09 required four separate programs, indicating again the fragility of definitions over time. New variables get added, old ones dropped. That kind of thing.
  • We now have about 10 data variables that appear in the XQuery that will be important to represent in the warehouse. This gives us a foundation for deciding what data to include.
  • It may seem a major pain in the rear-end to worry about all the data changes managed by the four XQuery. And it is a pain. But that is the nature of data, especially time series data. It is messy. It takes time to deal with the mess. Unless that occurs, however, the integrity of the data cannot be maintained.

As I worked the time series back to 1987 it became clear that I needed more metadata about the data variables. That’s on tap next. Dealing with variable names, descriptions, and code values across time. More minutiae, all of it essential.


New
1. verify_nces_des2009_table265_2000_and_later.xq
2. verify_nces_des2009_table265_1998.xq
3. verify_nces_des2009_table265_1996-1997.xq
4. verify_nces_des2009_table265_1987-1995.xq

Obsoleted
1. verify_nces_doe2009_table265_recent.xq