rwebdb 29-July-2010

As I’ve said before, getting the population correct is the single most important step when constructing a database query. If you don’t define the population accurately, then the query results are garbage.

The same is true when developing a data warehouse. Get the population wrong and you’ve shot yourself in the foot.

So it’s best to take time when identifying a warehouse population. It’s also helpful to think of the population as tentative until after you get users with subject matter knowledge to use it over sustained periods and to trust it. Then maybe the iterative population definition and development begin to settle down.

With that in mind, I offer an initial population definition for the warehouse being built here. It’s based on Table 265 in the Digest of Education Statistics 2009. This annual statistical report is prepared by the National Center for Education Statistics, who is also responsible for the IPEDS surveys being used to build this warehouse.

Table 265 provides summary counts of degree-granting institutions by control and type of institution for selected years from 1949-50 to 2008-09. In earlier posts (here, here, and here), I used Table 265 to verify that the programs I wrote were reading the IPEDS data correctly.

Here I extend the use of Table 265 to define the warehouse population. Before getting to the specifics, however, it’s important to understand a footnote to Table 265.

Data through 1995-96 are for institutions of higher education, while later data are for degree-granting institutions. Degree-granting institutions grant associate’s or higher degrees and participate in Title IV federal financial aid programs. The degree-granting classification is very similar to the earlier higher education classification, but it includes more 2-year colleges and excludes a few higher education institutions that did not grant degrees. … Changes in counts of institutions over time are partly affected by increasing or decreasing numbers of institutions submitting separate data for branch campuses.

There’s a lot to like in this footnote. It literally drips with the reality of dealing with complex survey instruments over a protracted period of time. Surveys often grow and change to reflect the times. That has certainly been the case with IPEDS and its precursor HEGIS.

But the folks at NCES did a very nice job in Table 265 of using their knowledge of IPEDS to provide a time series that nearly allows comparison of apples to apples over an extended period of time, at least at the institution level. Once we get down into the detail of financial data, we’ll need to again be very conscious of definitional changes that would create analysis problems.

But, for now, where the problem is one of identifying institutions to include in the warehouse, NCES and Table 265 provide lots of help.

Here then is the initial warehouse population. It consists of 5602 distinct (ie, different) institutions that appear in Table 265 for years from 1987-88 to 2008-09. This includes all public, private not-for-profit, and private for-profit institutions that qualify under the definitions of Table 265 as institutions of higher education or degree-granting institutions.

If you care to get into the gory details of how I got the population of 5602 institutions, I’ve included a list of links below. There are 7 new XQuery programs, one text file used to verify that the tentative warehouse population agrees with Table 265, a small snippet of the XML that identifies the population, and an XML that defines the filtering criteria needed in each survey year to conform to Table 265 specifications.

I know this is a heavy dose of new stuff, but I’ll come back to it many times in future posts while trying to see how far this initial population definition can go before it needs revision (as it surely will).

 


 
New
1. pop_2000_and_later_by_year.xq
Identifies institutions to include in the warehouse for years 2000 and later.

2. pop_1998_by_year.xq
Identifies institutions to include in the warehouse for the year 1998.

3. pop_1996_and_1997_by_year.xq
Identifies institutions to include in the warehouse for survey years 1996 and 1997.

4. pop_1987_to_1995_by_year.xq
Identifies institutions to include in the warehouse for survey years from 1987 to 1995.

5. pop_combined_all_years.xq
Combines all XML that identify institutions to include in the data warehouse.

6. verify_pop_all_years.xq
Verify counts from pop_all_years.xml versus Table 265 of the NCES Digest of Education Statistics 2009.

7. verify_pop_all_years.txt
Results from running verify_pop_all_years.xq. Used to compare with Table 265 of the Digest of Education Statistics 2009.

8. pop_distinct_institutions.xq
Combines institutions from pop_all_years.xml into an XML where each institution is listed only once.

9. example_pop_distinct_institutions.txt
Shows a very small portion of the 5602 distinct institutions in the XML created by running pop_distinct_institutions.xq.

10. warehouse_population_filters.xml
Contains the filters used to define warehouse populations from 1987 to 2008. These definitions yield counts that verify against Table 265 in the Digest of Education Statistics 2009.