(:
  OBSOLETED!
  date:   23-September-2010
  author: Gary Lewis
  notes:
    1. Replaced by scrape_ipedsFiles.xq.
:)

(:
  file:    scrape_ipedsFiles_metadata.xq
  date:    13-May-2010
  author:  Gary Lewis
  purpose: Creates an XML file of metadata on IPEDS data files.
           Attributes include the CSV filename, the IPEDS filename,
           survey year, survey component, and title of the component.
  usage:
    1. Browse to http://nces.ed.gov/ipeds/datacenter/DataFiles.aspx
    2. Choose "all years" and "all surveys" from the drop-down menus,
       then "continue".
    3. Use the browser's "Save page as" to save the page as
       ipedsFiles_metadata.html.
    4. Run the tidy program to convert the HTML to XML. Use the
       following command (on one line):
         tidy --error-file ipedsFiles_metadata.err --output-file ipedsFiles_metadata.xml --output-xhtml yes --add-xml-decl yes --quote-nbsp no --char-encoding utf8 ipedsFiles_metadata.html
    5. Check ipedsFiles_metadata.err for errors. Warnings are fine,
       but errors should be fixed.
    6. Edit ipedsFiles_metadata.xml and remove the xmlns namespace
       attribute and value from the <html> tag.
    7. Run this XQuery program from the directory where the program
       is located. It will create ipedsFiles.xml.
  example:
    zorba -o ../scraped/ipedsFiles.xml -f -q scrape_ipedsFiles_metadata.xq -z indent=yes
  revision history:
    19-May-2010  Gary Lewis
      Revised to:
        1. Add a directory indicator (dir_ind) for each file.
           1=yes; 0=no. The XML output will need manual editing of
           dir_ind for those few years where "directory" does not
           appear in the title.
        2. Run the program from the location of the XQuery program
           rather than the XML data.
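:)

(: NOTE: the element constructors are missing from this copy of the file.
   The element names (ipedsFiles, file) and every attribute name except
   dir_ind are assumptions based on the header comment. :)
<ipedsFiles>{
  let $tmp1 :=
    for $i in fn:doc("../scraped/ipedsFiles_metadata.xml")/html/body//table/tbody/tr[@class = "idc_gridviewrow"]
    let $year     := $i/td[1],
        $survey   := $i/replace(td[2], "[\x00-\x1F]", " "),  (: strip non-printing characters from attribute values :)
        $title    := $i/replace(td[3], "[\x00-\x1F]", " "),
        $filename := $i/td[4],                               (: mixed case names used by IPEDS :)
        $csvname  := $i/lower-case(td[4]),                   (: CSV filenames are all lowercase :)
        $dir      := if (matches(lower-case($title), "directory")) then "1" else "0"
    return
      (: one element per IPEDS data file; attribute names are assumed :)
      <file csvname="{$csvname}" filename="{$filename}" year="{$year}"
            survey="{$survey}" title="{$title}" dir_ind="{$dir}"/>
  return $tmp1
}</ipedsFiles>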