rwebdb 27-May-2010
One of the fun, and sometimes frustrating, features of assembling a time series from disparate data sources is history itself. Things change over time. Funny how that happens. But it has big consequences for retrospective construction of datasets.
It’s easy to demonstrate this. Consider any data variable used in the most recent IPEDS survey. Let’s use the UNITID as an example. This is the identification number unique to each institution. How far back in time can we track this variable without changes in its meaning?
In the case of UNITID the answer is encouraging, though not straightforward. A bit of research shows that UNITID was not always used to identify institutions. The Higher Education General Information Survey (HEGIS), the precursor to IPEDS, used an identification code called FICE, named for the Federal Interagency Committee on Education.
But, luckily for us, at some point the NCES constructed a crosswalk table between FICE and UNITID. The older CSV data files, all bearing creation dates of 2004, include UNITID values. So, for example, the 1980 data files include UNITID even though FICE codes identified institutions at the time the 1980 survey was conducted. Thank you, NCES!
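To make the crosswalk idea concrete, here is a minimal sketch of how a FICE-to-UNITID lookup could be applied to HEGIS-era records. Everything in it is an assumption for illustration: the column names, the codes, and the enrollment figures are made up, and the real NCES files may be laid out quite differently.

```python
import csv
import io

# Hypothetical crosswalk and HEGIS-era records; every code and value is made up.
CROSSWALK_CSV = """fice,unitid
001002,100654
001005,100724
"""

HEGIS_CSV = """fice,year,enrollment
001002,1980,4200
001005,1980,3900
009999,1980,150
"""


def load_crosswalk(text):
    """Return a dict mapping FICE code -> UNITID."""
    return {row["fice"]: row["unitid"] for row in csv.DictReader(io.StringIO(text))}


def attach_unitid(records_text, crosswalk):
    """Add a unitid field to each record; collect FICE codes with no match."""
    matched, unmatched = [], []
    for row in csv.DictReader(io.StringIO(records_text)):
        unitid = crosswalk.get(row["fice"])
        if unitid is None:
            unmatched.append(row["fice"])  # keep these for manual review
        else:
            matched.append({**row, "unitid": unitid})
    return matched, unmatched


crosswalk = load_crosswalk(CROSSWALK_CSV)
matched, unmatched = attach_unitid(HEGIS_CSV, crosswalk)
```

Codes are kept as strings so leading zeros survive, and unmatched FICE codes are set aside rather than silently dropped, since those are exactly the institutions that need a closer look.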
However, UNITID is just one of thousands of variables in the roughly 600 data files spread across 25 years. Verifying the integrity of variables across time is an issue we’ll encounter repeatedly as the warehouse gets built.
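A first, rough pass at this kind of check might simply record which variable names show up in which survey years. The sketch below assumes the column headers of each year's file have already been read into a dictionary; the variable names and years are hypothetical placeholders, not taken from the actual files.

```python
# Hypothetical column headers per survey year; real headers would be read from the data files.
headers_by_year = {
    1980: ["UNITID", "FICE", "ENROLL"],
    1990: ["UNITID", "ENROLL", "SECTOR"],
    2000: ["UNITID", "ENROLL", "SECTOR"],
}


def variable_coverage(headers_by_year):
    """Map each variable name to the sorted list of years in which it appears."""
    coverage = {}
    for year, names in headers_by_year.items():
        for name in names:
            coverage.setdefault(name.upper(), set()).add(year)
    return {name: sorted(years) for name, years in coverage.items()}


def gaps(coverage, all_years):
    """Return variables missing from at least one year, with the missing years."""
    all_years = sorted(all_years)
    return {
        name: [y for y in all_years if y not in years]
        for name, years in coverage.items()
        if len(years) < len(all_years)
    }


coverage = variable_coverage(headers_by_year)
missing = gaps(coverage, headers_by_year)
```

A name that appears in every year is only a starting point. It says nothing about whether the definition behind the name stayed the same, which still has to be checked against the documentation.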
Here’s a historical outline with important dates when things changed in the HEGIS and IPEDS data. It’s meant to flag important time periods when we should expect changes in the surveys and the data. Basically it’s a “heads up, be careful.”
- 1949-50 to 1965-66: NCES Education Directory, Colleges and Universities. Paper only?
- 1966-67 to 1985-86: HEGIS surveys. The universe here is about “3,400 accredited institutions of higher education.”
- 1986-87 to present: IPEDS surveys. The universe here is “approximately 6,500 postsecondary institutions, including universities and colleges, as well as institutions offering technical and vocational education beyond the high school level.”
- 1993: Prior to this year, data from technical and vocational institutions were collected with a sample survey. After 1993, data were collected in a census of all such institutions.
- 1996: The classification of institutions offering college and university education changed.
Prior to 1996, institutions that had courses leading to an associate’s or higher degree or that had courses accepted for credit toward those degrees were considered higher education institutions. Higher education institutions were accredited by an agency or association that was recognized by the U.S. Department of Education or were recognized directly by the Secretary of Education. Tables, or portions of tables, that use only this standard are noted as “higher education” in the Digest. The newer standard includes institutions that award associate’s or higher degrees and that are eligible to participate in Title IV federal financial aid programs. Tables that contain any data according to this standard are titled “degree-granting” institutions. Time-series tables may contain data from both series, and they are noted accordingly. The impact of this change on data collected in 1996 was not large.
- 2000: Since 2000, data have been collected in separate surveys at different times during the academic year. Fall, winter, and spring surveys cover different types of information.
With so many changes over the years, we should expect many “gotcha” moments when constructing a time series with the HEGIS and IPEDS data. As an illustration, here’s a footnote to a table on degree-granting institutions that appears in the NCES Digest of Education Statistics, 2009.
Data through 1995-96 are for institutions of higher education, while later data are for degree-granting institutions. Degree-granting institutions grant associate’s or higher degrees and participate in Title IV federal financial aid programs. The degree-granting classification is very similar to the earlier higher education classification, but it includes more 2-year colleges and excludes a few higher education institutions that did not grant degrees.
This is the nature of real data. It’s messy. And unless great care is taken in constructing the time series, any subsequent analysis may produce gibberish.
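One way to build that care into the warehouse is to tag every academic year with the survey and classification in force, then flag the points where the tags change. The sketch below is a hypothetical helper, not anything NCES provides. The cutoffs come from the outline and footnote above (HEGIS through 1985-86, the higher education classification through 1995-96), and academic years are identified by their starting year, so 1995 stands for 1995-96.

```python
def era_tags(start_year):
    """Tag an academic year by its starting year (e.g. 1995 for 1995-96)."""
    if start_year < 1949:
        raise ValueError("years before 1949-50 are outside the outline")
    if start_year <= 1965:
        survey = "Education Directory"
    elif start_year <= 1985:
        survey = "HEGIS"
    else:
        survey = "IPEDS"
    # Data through 1995-96 use the "higher education" classification.
    classification = "higher education" if start_year <= 1995 else "degree-granting"
    return {"survey": survey, "classification": classification}


def series_breaks(start_years):
    """Return (previous_year, year, changed_fields) wherever the tags change."""
    years = sorted(start_years)
    breaks = []
    for prev, cur in zip(years, years[1:]):
        before, after = era_tags(prev), era_tags(cur)
        changed = [field for field in before if before[field] != after[field]]
        if changed:
            breaks.append((prev, cur, changed))
    return breaks


breaks = series_breaks(range(1980, 2000))
```

For the years 1980 through 1999 this flags two breaks: the move from HEGIS to IPEDS between 1985-86 and 1986-87, and the classification change between 1995-96 and 1996-97. Other changes in the outline, such as the 1993 shift from a sample to a census, could be added as further tags in the same way.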
Metadata and people (e.g., NCES data experts) are absolutely essential for the success of this project, as they are for any other data project of even modest complexity.
Very fun stuff. And hopefully, too, a plug for data curation.
