rwebdb 13-July-2010

The star schema is a traditional design for a relational data warehouse. The key features are that it has a central fact table with several connected dimension tables. Shown below is a star schema design from Wikipedia.

Courtesy of Wikimedia Commons.


For example, in our case, the central fact structure might contain data on each institution’s revenues, expenses, and prices for each year in the time series. Dimensions include those classificatory variables commonly used for analysis. So, for example, institution control (ie, public; private) might be a useful dimension variable. The dimension and fact structures are then tied together with special keys that are indexed to improve query performance.

A star schema provides a rough starting point for our design. Organizationally it helps to think of two kinds of variables, fact and dimension, even though this is not a relational warehouse of tables. What we’re building is a small warehouse (well, a data mart actually) where the structures are XML rather than SQL tables.

Today I wrote two XQuery programs that help to identify candidates for dimension variables. Both programs use the metadata XML that we’ve assembled recently. One program rank orders variables by the number of years that they appear in the 25 years of IPEDS data. You can see a portion of the output in example_list_vars_by_numYears.txt.

Variables that appear in many survey years are good candidates for inclusion in the warehouse, with one caveat. The ultimate use of the warehouse is analysis rather than reporting. Some common variables like institution address are unlikely to ever be used in analysis, while the control variable or the level variable or a location variable (eg, region or state) might be very useful for sub-setting data and comparing sub-sets.

The second XQuery program provides another effort to narrow the selection of possible variables. Based on the first program we can come up with a handful of initial possibilities. The second program provides details about variable codes and labels used throughout the time series. The idea here is that the degree of variation over time represents an approximate “degree-of-difficulty” when trying to reconcile changes in meaning of a variable over time.

Based on the output of the two XQuery programs, I selected three variables as initial possibilities: control; level (eg, 4 years or more); and sector (kind of a combination of control and level). At this point it’s just a start. We’ll iterate over these decisions numerous times.

The next step is to examine the three selected dimension candidates, think about how they might be used in analysis, determine how robust they’ll be for time series analysis, and then decide if transformations are possible that would improve the time series.

Beyond that we have the task of locating fact variables and other dimension variables buried away in the many IPEDS data files that we’ve not yet touched.

When doing this, we’ll need to consider the strong possibility that NCES did not uniformly classify variables into data files over time. For example, so far we’ve only consider directory files. Some variables may exist in directory files for some survey years and in non-directory files in other years. So as we get deeper into the remaining data files, we’ll need to iterate over some of the tentative design decisions made previously.

This may seem like a mess. But basically it’s a very natural process of designing and re-designing until you get it right. Even then it won’t be right. Other iterations over design decisions will occur during development, during analysis testing, and during alpha and beta use by users.
 


 
New
1. list_vars_by_numYears.xq
Lists IPEDS directory variables by the number of years that the variables have appeared in IPEDS surveys. Includes name, number of years, first year, last year, and description of the variable in the last year. Used to help identify variables for possible inclusion in the data warehouse.

2. example_list_vars_by_numYears.txt
A snippet of the output from running list_vars_by_numYears.xq, showing those variables with number of years ranging between 12 and 25.

3. list_metadata_1var.xq
Lists all metadata for a single variable specified at run-time. Used to help identify variables for possible inclusion in the data warehouse.

4. list_metadata_1var.txt
An example of the output from running list_metadata_1var.xq for the variable named control.