YQL is a very cool tool for extracting data from HTML documents on the web. Let’s face facts: HTML is a terrible data format, as far too many documents out there are broken, use the wrong encoding or simply are not structured the way they should be. That makes it quite a mess to read an HTML document and then find what you were looking for using regular expressions or tools that expect XML-compatible HTML. Python fans will know Beautiful Soup, for example, which does quite a good job of working around most of these issues.
With YQL, however, you can use a simple web service to extract data from HTML documents. As an added bonus, the YQL engine will remove incorrectly encoded characters and run the retrieved data through HTML Tidy to give you valid HTML back. For example, to get the body content of CNN.com all you need is:
select * from html where url="http://cnn.com"
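To make the "simple web service" part concrete, here is a minimal Python sketch of how such a statement could be packed into a request URL. It assumes the public YQL endpoint at http://query.yahooapis.com/v1/public/yql with its q and format parameters, which this post does not spell out; the function name build_yql_url is made up for illustration, and the sketch only builds the URL without sending anything.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the public YQL endpoint; it is not stated in this post.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(statement, fmt="json"):
    """Return a GET URL that carries the YQL statement in the q parameter."""
    # urlencode percent-escapes the quotes, spaces and slashes in the statement.
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": fmt})


statement = 'select * from html where url="http://cnn.com"'
request_url = build_yql_url(statement)
print(request_url)

# Decoding the query string again gives back the original statement.
decoded = parse_qs(urlsplit(request_url).query)
print(decoded["q"][0] == statement)
```

Note that the statement uses plain straight quotes; typographic quotes pasted from a web page would end up inside the query and break it.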
The really cool thing about YQL is that it allows you to use XPath to filter down the data you want to extract. For example, to get all the links from cnn.com you can use:
select * from html where xpath="//a" and url="http://cnn.com"
If you only want to have the text content of the links you can do the following:
select content from html where xpath="//a" and url="http://cnn.com"
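If you want a feel for what these two queries select without going through YQL, the following Python sketch runs the same kind of XPath over a tiny local document. It is only an analogue under assumptions: the sample markup is invented, it has to be well-formed already (the clean-up step YQL does for you is skipped), and ElementTree supports just a subset of XPath, where "//a" is written ".//a".

```python
import xml.etree.ElementTree as ET

# A small, well-formed stand-in for a fetched page (made up for this example).
page = """<html><body>
  <p>News: <a href="/world">World</a> and <a href="/tech">Tech</a></p>
  <div><a href="/sport">Sport</a></div>
</body></html>"""

root = ET.fromstring(page)

# ".//a" is ElementTree's spelling of the "//a" XPath:
# every a element anywhere below the root, in document order.
links = root.findall(".//a")

# Roughly "select *": each link with its attribute and its text.
all_links = [(a.get("href"), a.text) for a in links]

# Roughly "select content": only the text inside each link.
link_texts = [a.text for a in links]

print(all_links)
print(link_texts)
```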
You could use this for example to translate links using the Google translation API:
select * from google.translate where q in (
select content from html where url="http://cnn.com" and xpath="//a"
) and target="fr"
Now, the other day my esteemed colleague Dirk Ginader came up with a bit of a brain teaser for me: what do you do when the HTML document you want to get needs POST data sent to it in order to render properly? You can append GET parameters to the URL, but not POST data, so the normal html table is not enough.
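The difference between the two request types is easy to see in a few lines of Python. This is just an illustrative sketch: the URL is a hypothetical placeholder, the field names are examples, and no request is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical page that only renders its result for POSTed form data.
url = "http://example.com/form.php"
fields = {"foo": "foo", "bar": "bar"}

# GET: the parameters travel in the URL itself.
get_request = Request(url + "?" + urlencode(fields))

# POST: the same parameters travel in the request body, as bytes.
post_request = Request(url, data=urlencode(fields).encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

A service that is only ever handed a URL can reproduce the first request but has nowhere to put the body of the second one.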
The first step was a simple HTML 4.01 test page, “Test for HTML POST table”, whose text reads: “Below this should be a ‘yay!’ when the right POST data was submitted.”
The next step was to write an open table for YQL that does the necessary request and transformations.
Using this, you can now send POST data to any HTML document (unless its robots.txt blocks the YQL server or it needs authentication) and get the HTML content back. To make it work, you define the table using the “use” command:
use "http://isithackday.com/hacks/htmlpost/htmlpost.xml" as htmlpost;
select * from htmlpost where
url="…"
and postdata="foo=foo&bar=bar" and xpath="//p"
You can try this example in the console.
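If you build this statement in code rather than typing it into the console, a small helper keeps the quoting straight. The following is a rough Python sketch, not part of the table itself: the function name is invented, the target page is a hypothetical placeholder, and it assumes the table takes the page address in a url key, the way the html table does.

```python
from urllib.parse import urlencode

# Table definition URL as given in this post.
TABLE_URL = "http://isithackday.com/hacks/htmlpost/htmlpost.xml"


def build_htmlpost_statement(page_url, fields, xpath):
    """Assemble the use/select statement for the htmlpost table."""
    # urlencode turns the dict into a form-encoded string like "a=1&b=2".
    postdata = urlencode(fields)
    # Assumption: the target page goes into a url key.
    return (
        f'use "{TABLE_URL}" as htmlpost; '
        f'select * from htmlpost where url="{page_url}" '
        f'and postdata="{postdata}" and xpath="{xpath}"'
    )


# Hypothetical target page; substitute the document you want to POST to.
statement = build_htmlpost_statement(
    "http://example.com/form.php", {"foo": "foo", "bar": "bar"}, "//p"
)
print(statement)
```

The helper does not escape double quotes inside its arguments, so values containing them would need extra care.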
I’ve also added the table to the open YQL tables repository on GitHub, so it should show up in the console sooner or later.
Here’s a quick explanation of what is going on: