rwebdb 29-May-2010
Based on my last rwebdb post, our chances of actually constructing a time series of IPEDS data may appear bleak. That is not at all the case. To be sure, the numerous historical changes in the HEGIS and IPEDS data will provide many “gotcha” moments. But these are typical for even modestly complex data projects and are not unexpected.
We must remain vigilant, however. To borrow a term from social science research and journalism, we must triangulate, over and over again, to ensure we’re not off on a sidetrack to nowhere. And once we get to a point where we actually feel good about some portion of the data, we should start to worry. At that point we’ll invite experts in the use of this data to shoot holes in what we’ve done. It’s all part of the process needed to create a robust warehouse.
In this post, I’ll illustrate one of the ways in which we can help ourselves construct a meaningful time series.
NCES publishes a treasure trove of reports based on IPEDS data. We can, and should, use these published reports to verify that our time series data produces the same results that NCES gets.
Here’s a specific example. Table 265 in the NCES Digest of Education Statistics, 2009 provides data on degree-granting institutions by control and type of institution. The table includes selected years from 1949-50 to 1985-86 and annual results from 1986-87 to the present. Coinciding with the transition from HEGIS to IPEDS data, this break at 1986-87 should come as no surprise given the critical dates in the last post.
I wrote an XQuery program to verify data in our XML data files against the NCES values shown in Table 265. This worked fine for all years from 2008-09 back to 2000-01. Everything matched. As mentioned previously, data for 1999-2000 is not available for download so I could not verify that year. Verification for 1998-99 failed, and I did not attempt any year prior to that.
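The shape of this kind of check can be sketched in a few lines. What follows is not the XQuery program itself but a minimal Python stand-in, and everything specific in it is an assumption made for illustration: the element names (institution, control, degree_granting), the coding of “1” as degree-granting, and the toy counts, which are placeholders rather than values from Table 265.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical toy data. The element names and codes are assumptions,
# not the layout of the actual IPEDS XML files.
SAMPLE_XML = """<institutions>
  <institution><unitid>1</unitid><control>1</control><degree_granting>1</degree_granting></institution>
  <institution><unitid>2</unitid><control>1</control><degree_granting>1</degree_granting></institution>
  <institution><unitid>3</unitid><control>2</control><degree_granting>1</degree_granting></institution>
  <institution><unitid>4</unitid><control>3</control><degree_granting>2</degree_granting></institution>
  <institution><unitid>5</unitid><control>3</control><degree_granting>1</degree_granting></institution>
</institutions>"""


def count_by_control(xml_text):
    """Count degree-granting institutions for each control code."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for inst in root.iter("institution"):
        # Assumed coding: "1" means degree-granting.
        if inst.findtext("degree_granting") != "1":
            continue
        counts[inst.findtext("control")] += 1
    return dict(counts)


def compare_counts(computed, published):
    """Return {code: (computed, published)} for every code that disagrees."""
    mismatches = {}
    for code in sorted(set(computed) | set(published)):
        ours = computed.get(code, 0)
        theirs = published.get(code, 0)
        if ours != theirs:
            mismatches[code] = (ours, theirs)
    return mismatches


# Placeholder "published" counts, NOT real Table 265 values.
published = {"1": 2, "2": 1, "3": 2}
print(compare_counts(count_by_control(SAMPLE_XML), published))
```

An empty result means every count matched for that year. Anything else lists the control codes that disagree, with our count first and the published count second.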
What to make of this? Fall 2000 was the first time IPEDS was administered as three separate surveys. Any number of things could have caused the XQuery program to fail for years prior to 2000. The meaning of the data variables might have changed (e.g., the meaning of a response coded as “1”). Or the names of the data variables may have changed (e.g., maybe the control variable was called cntl). Or one or more of the data variables used in the XQuery program may not appear in the directory files prior to 2000. Hopefully they appear in other files, however, at least back to 1986 when the transition between HEGIS and IPEDS occurred. We’ll see.
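Two of these possibilities, renamed or missing variables and changed codes, can be probed mechanically before reading any documentation. The sketch below is again Python rather than XQuery, and it assumes a hypothetical file layout: one institution element per record with one child element per variable. The names unitid, control and cntl are used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical older-style file in which the control variable is named "cntl".
OLD_XML = """<institutions>
  <institution><unitid>1</unitid><cntl>1</cntl></institution>
  <institution><unitid>2</unitid><cntl>2</cntl></institution>
</institutions>"""


def variables_present(xml_text, record_tag="institution"):
    """Return the set of variable (child element) names used in any record."""
    root = ET.fromstring(xml_text)
    names = set()
    for rec in root.iter(record_tag):
        names.update(child.tag for child in rec)
    return names


def missing_variables(xml_text, required):
    """List the required variables that never appear in the file."""
    return sorted(set(required) - variables_present(xml_text))


def distinct_values(xml_text, variable, record_tag="institution"):
    """List the distinct coded values of one variable, to spot changed codings."""
    root = ET.fromstring(xml_text)
    return sorted({rec.findtext(variable)
                   for rec in root.iter(record_tag)
                   if rec.find(variable) is not None})


print(missing_variables(OLD_XML, ["unitid", "control"]))  # variables the query needs but the file lacks
print(sorted(variables_present(OLD_XML)))                 # candidates for a renamed variable
print(distinct_values(OLD_XML, "cntl"))                   # codes actually in use
```

A variable that shows up as missing while a similar name is present suggests a rename. A variable that is present but uses unfamiliar codes suggests a change in meaning. Either way, the finding belongs in the metadata.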
It will be easy enough to determine the cause. But the point is this. By probing the data this way, we learn more about it. And that knowledge will get reflected in the metadata that we’ll assemble around the data. It’s a learning process. And it’s an absolutely essential step that cannot be avoided or finessed. It’s a way of paying our dues.
New
1. verify_nces_doe2009_table265_recent.xq
Used to verify the counts of degree-granting institutions for recent years in Table 265 of the NCES Digest of Education Statistics, 2009. Verifying published table values is an important way to ensure that we’re using the same definitions that NCES used.
