Recent Bookmarks, 24-Sep-2009
Harvesting Relational Tables from Lists on the Web (pdf)
Hazem Elmeleegy, Jayant Madhavan, Alon Halevy. VLDB ’09, August 24-28, Lyon, France.
“This paper proposes a technique for extracting tables from [HTML] lists … [that] is designed to be completely domain-independent and hence apply to any list found on the Web. … [E]xperiments demonstrate the ability of our technique to extract tables with high accuracy. … [Additional analysis has] led us to believe that there are likely to be tens of millions of useful and query-able relational tables extractable from lists on the Web.”
gml: Two of the three authors are prominent Google researchers. It’s very impressive work that demonstrates yet again Google’s ability to find and open up new data sources at web scale.
Data Integration for the Relational Web (pdf)
Michael J. Cafarella, Alon Halevy, Nodira Khoussainova. VLDB ’09, August 24-28, Lyon, France.
“The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases; there is enormous potential in combining and re-purposing this data in creative ways. … This paper describes OCTOPUS, a system that combines search, extraction, data cleaning and integration, and enables users to create new data sets from those found on the Web.”
gml: Querying relational data extracted from HTML tables and lists, and doing it at web scale, presents some formidable problems. The key idea of the authors’ technique is “to offer the user a set of best-effort operators that automate the labor-intensive tasks.” This is very impressive research, conducted while all three authors worked at Google.
