Web Query Tools – Part 2
There’s huge interest in harnessing the web as a giant database, and the motivations for this are easily as diverse as the interest is large. I’m just looking for a precursor.
Two quick examples. Google researchers describe “a unified Web knowledge base” as the “holy grail of web information extraction” (PDF), and the company energetically applies its research to harvest additional structured data from natural language text, HTML-embedded tables, and the so-called deep web of databases sheltered behind form-only access.
Tim Berners-Lee, whose legacy is the web, provides another example. In a video from TED, he calls for a “new reframing” that will release the “unlocked potential” of the web through linked data. The idea is that people and computers could traverse data related to other data, from A to B to … wherever you want to go as long as the relationships, the tuple linkings, are available. Berners-Lee’s new reframing translates into a web-of-data that could be used by scientists, citizens, social reformers, businesses, entrepreneurs, and governments in innovative new ways.
I applaud these efforts even though I realize that the potential for good and for ill seems equally strong. But my focus is nothing so grand as a web-of-data. I’d be happy if I could just answer a profoundly simple question: “Please give me a list of all web tutorials on python programming.” More symbolically I want to answer questions of the form: “Give me a list of <a> about <b>.” From there, of course, the possible questions become far more interesting.
For certain my query would be trivial if a web-of-data existed. But it does not, even though progress is being made. So, in the interim I’m looking for a substitute.
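To make concrete why a question of the form “Give me a list of <a> about <b>” would be trivial over linked data, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Triple` type, the `list_of` function, the "type" and "about" predicates, and the example.org URLs are all made up, and a real web-of-data would use far richer vocabularies than this.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the resource being described (here, a URL)
    predicate: str  # the kind of relationship
    obj: str        # the value the relationship points to


# Hypothetical linked data; these URLs and values are invented for illustration.
TRIPLES = [
    Triple("http://example.org/t1", "type", "tutorial"),
    Triple("http://example.org/t1", "about", "python programming"),
    Triple("http://example.org/t2", "type", "tutorial"),
    Triple("http://example.org/t2", "about", "xquery"),
    Triple("http://example.org/r1", "type", "reference manual"),
    Triple("http://example.org/r1", "about", "python programming"),
]


def list_of(a: str, b: str, triples: list[Triple]) -> list[str]:
    """Answer 'Give me a list of <a> about <b>' over a set of triples."""
    # Resources whose type is <a>.
    is_a = {t.subject for t in triples if t.predicate == "type" and t.obj == a}
    # Resources whose topic is <b>.
    about_b = {t.subject for t in triples if t.predicate == "about" and t.obj == b}
    # The answer is whatever satisfies both relationships.
    return sorted(is_a & about_b)


print(list_of("tutorial", "python programming", TRIPLES))
```

With the relationships already spelled out, the question reduces to intersecting two small sets. The hard part, and the reason a substitute is needed, is that the web does not yet hand us those relationships.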
My latest excursions have taken me into the world of XML, and specifically into XQuery and, to a lesser extent, into the XQuery relatives XSLT and XPath. All of these are W3C recommendations that have been implemented in both open source and proprietary products.
Why XQuery? Well, because XML is part of the fabric of the web, and because data in XML format can be queried with XQuery, and because I stumbled upon Zorba via an O’Reilly xml.com article called “Something Tells Me You Need to Pay Attention to This.” How could I resist a title like that?
I started playing with Zorba’s XQuery about two months ago. Maybe I’ve written a couple hundred queries now. Like the learning curve for SQL, it’s clear that a couple hundred is at least an order of magnitude too small to become skilled. But it’s great fun and I’m encouraged by what I see.
One example. Tony Hirst at the Open University has done some cool things recently with The Guardian’s new data API. In one of these projects, Tony used data from The Guardian’s university guide to do a mashup on student satisfaction in architecture and planning programs at various UK universities. It featured a very nice use of the DabbleDB database.
Data integration is one of the strengths of XQuery, so I set about following Tony’s lead to see if I could duplicate his mashup using Zorba instead. It was great fun and I learned tons. You can see the results in this PDF.
I’m now ready to start work on another XQuery data integration, this time using the new FRED API from the Federal Reserve Bank of St. Louis. Not that I’m particularly interested in banking-related data, but FRED uses a REST web service architecture and will allow me to play more thoroughly with Zorba’s REST capability. And the volume of the data will allow me to stress test the performance of Zorba’s XQuery.
If you are interested at all in Zorba, I’d recommend you read some of the technical documents where you can catch glimpses of longer term development objectives and experience some of the chutzpah that must exist in the development team. A recent example is XQuery in the Browser, which was presented this week at the 18th International World Wide Web Conference in Madrid. The article basically takes aim at JavaScript. As another example, check out the plans of a three-year-old startup called 28msec and some of their technical papers. I particularly enjoyed the architecture discussion in Donald Kossmann’s slide presentation on Building Web Applications without a DBMS (PDF).
Ok, I better stop. It’s already beginning to sound like an infomercial. Hopefully some of my sincere enthusiasm comes through, however. It’s a hopeful time. And now it’s back to making a dent in that order of magnitude learning curve.

Matthew Theobald — April 25, 2009 @ 2:50 am
You may be interested in an emerging web standard for the “deep”, “hidden”, “invisible”, “structured” or “dark” web. The Internet Search Environment Number catalogs the form-based interfaces with robust metadata. See the animated “space” elevator speech video on http://blog.isen.org