rwebdb 25-May-2010

As I mentioned in the last rwebdb post, the next step in this project is to work out an initial design for the warehouse data. I started by playing with the 20,000 distinct institutions that appear in IPEDS data files from 1980 to 2008.

One simple design is to flatten everything so that there are 20,000 institutions with associated variables. As XML this design might work if we only include a few variables:

<inst unitid=”######” v1_1980=”x” … v1_2008=”y” v2_1980=”a” … v2_2008=”b” … />

Other alternatives add a bit more structure. Here’s one with a year element:

<inst unitid=”######”
<year yr=”1980″ v1=”x” v2=”a” … />

<year yr=”2008″ v1=”y” v2=”b” … />
</inst>

Or a similar structure with a variable element:

<inst unitid=”######”
<variable var=”v1″ 1980=”x” … 2008=”y” />
<variable var=”v2″ 1980=”a” … 2008=”b” />

</inst>

Or a fourth option with structure for both years and variables:

<inst unitid=”######”
<year yr=”1980″
<variable var=”v1″ value=”x” />
<variable var=”v2″ value=”a” />

</year>

<year yr=”2008″
<variable var=”v1″ value=”y” />
<variable var=”v2″ value=”b” />

</year>
</inst>

Design decisions affect ease-of-use. At this point in the project we have no basis for identifying which of the 4 designs above (or others) best suits our needs. All we can do is create XML data with various designs and then query them. That provides experience with the following questions:

  • which design is easiest to build?
  • which design is easiest to query?
  • which design produces the fastest query times (ie, cpu usage)
  • which design produces the most compact files (ie, disk space)?

As I suggested previously, one interesting query examines the frequency with which the 20,000 institutions appear in the 25 years of data. How many colleges and universities only appear in a few years … or in all years?

Before building the various designs and answering that question, I wanted an independent answer that I trusted. This can help determine the accuracy of the test queries. And it builds experience with the data.

I wrote an XQuery program that took the 25 files of distinct institutions and combined them into a single file. For example, an institution that appeared in 5 of the IPEDS data files would appear 5 times in this combined file.

I then used R to plot the distribution showing the number of institutions by the number of years they are represented in the IPEDS data. Results are below. Note: This is an SVG file, so it may not be visible in older browsers. But it is the first of many SVG to appear in this project. SVG represents a huge opportunity for data visualizations. In case WordPress balks at permanently embedding the graph, here is a link to the SVG.

The graph shows an interesting distribution. About 2,000 institutions only appear once in the 25 years, while approximately 3,000 institutions appear in all 25 years.
 


New
1. freq_inst_by_numYr.xq
Frequency distribution of number of institutions by the number of years that the institutions appear in the 25 IPEDS directory data files 1980 – 2008.