Christian Heilmann

Posts Tagged ‘scraping’


Using YQL to read HTML from a document that requires POST data

Monday, November 16th, 2009

YQL is a very cool tool to extract data from HTML documents on the web. Let’s face facts: HTML is a terrible data format, as far too many documents out there are broken, have the wrong encoding or simply are not structured the way they should be. It can therefore be quite a mess to read an HTML document and then find what you were looking for using regular expressions or tools that expect XML-compatible HTML. Python fans will know Beautiful Soup, for example, which does quite a good job of working around most of these issues.

Using YQL, however, you can use a simple web service to extract data from HTML documents. As an added bonus, the YQL engine will remove falsely encoded characters and run the retrieved data through HTML Tidy to give you valid HTML back. For example, to get the body content of CNN.com all you need is:

select * from html where url="http://cnn.com"

The really cool thing about YQL is that it allows you to use XPath to filter down the data you want to extract. For example, to get all the links from cnn.com you can use:

select * from html where xpath="//a" and url="http://cnn.com"


If you only want to have the text content of the links you can do the following:

select content from html where xpath="//a" and url="http://cnn.com"


You could use this for example to translate links using the Google translation API:

select * from google.translate where q in (
  select content from html where url="http://cnn.com" and xpath="//a"
) and target="fr"

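To run any of these statements from your own code you need to wrap them in a REST URL. The following is only a minimal sketch of that step: the endpoint is the public YQL URL that shows up in the queries elsewhere on this page, while the helper name buildYqlUrl and its options object are made up for illustration, and the format and callback parameter names are assumptions on my part.

```javascript
// Public YQL endpoint as it appears in the query URLs on this page.
const YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: wrap a YQL statement in a REST URL.
// "format" and "callback" are assumed parameter names.
function buildYqlUrl(query, options = {}) {
  const params = ["q=" + encodeURIComponent(query)];
  if (options.format) {
    params.push("format=" + encodeURIComponent(options.format));
  }
  if (options.callback) {
    params.push("callback=" + encodeURIComponent(options.callback));
  }
  return YQL_ENDPOINT + "?" + params.join("&");
}

// Build the URL for the "text content of all links" query.
const linksUrl = buildYqlUrl(
  'select content from html where xpath="//a" and url="http://cnn.com"',
  { format: "json" }
);
console.log(linksUrl);
```

Encoding the whole statement with encodeURIComponent keeps quotes, slashes and equals signs inside the statement from being mistaken for parts of the URL.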

Now, the other day my esteemed colleague Dirk Ginader came up with a bit of a brain teaser for me. His question was what to do when the HTML document you are trying to get needs POST data sent to it in order to render properly. You can append GET parameters to the URL, but not POST data, so the normal html table is not enough.

The good news is that YQL allows you to extend it in many ways. One of them is using an execute block in an open table to convert data with JavaScript on the server. The JavaScript has full E4X support and allows you to make any HTTP request. So the first step to solve Dirk’s dilemma was to write a demo page (the form was added to test it out):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <title>Test for HTML POST table</title>
 
<body>
  <p>Below this should be a "yay!" when 
    the right POST data was submitted.</p>
<?php if(isset($_POST['foo']) && isset($_POST['bar'])){
  echo "<p>yay!</p>";
}?>
<form action="index.php" method="post" accept-charset="utf-8">
  <input type="text" name="foo" value="is">
  <input type="text" name="bar" value="set">
  <input type="submit" value="Continue &rarr;">
</form>
  </body>
</html>


The next step was to write an open table for YQL that does the necessary request and transformations.

<?xml version="1.0" encoding="UTF-8"?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>
  <author>Christian Heilmann</author>
  <description>HTML pages that need post data</description>
  <sampleQuery><![CDATA[
select * from {table} where
url='http://isithackday.com/hacks/htmlpost/index.php' 
and postdata="foo=foo&bar=bar" and xpath="//p"]]></sampleQuery>
  <documentationURL></documentationURL>
  </meta>
  <bindings>
    <select itemPath="" produces="XML">
    <urls>
      <url>{url}</url>
    </urls>
    <inputs>
      <key id="url" type="xs:string" required="true" paramType="variable"/>
      <key id="postdata" type="xs:string" required="true" paramType="variable"/>
      <key id="xpath" type="xs:string" required="true" paramType="variable"/>
    </inputs>
    <execute>
    <![CDATA[
      var myRequest = y.rest(url);  
      var data = myRequest.accept('text/html').
                 contentType("application/x-www-form-urlencoded").
                 post(postdata).response;
      var xdata = y.xpath(data,xpath);
      response.object = <postresult>{xdata}</postresult>;
    ]]>
    </execute>
  </select> 
  </bindings>
</table>


Using this, you can now send POST data to any HTML document (unless its robots.txt blocks the YQL server or it needs authentication) and get the HTML content back. To make it work, you define the table using the “use” command:

use "http://isithackday.com/hacks/htmlpost/htmlpost.xml" as htmlpost;
select * from htmlpost where
url='http://isithackday.com/hacks/htmlpost/index.php'
and postdata="foo=foo&bar=bar" and xpath="//p"

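If the POST fields live in a JavaScript object, the postdata string and the full statement can be assembled in code. This is just a sketch: toPostData and buildHtmlPostQuery are hypothetical helper names, and the values are dropped into the statement as they are, so they must not contain quote characters.

```javascript
// Hypothetical helper: encode an object of form fields as
// application/x-www-form-urlencoded data, e.g. "foo=foo&bar=bar".
function toPostData(fields) {
  return new URLSearchParams(fields).toString();
}

// Hypothetical helper: assemble the "use ...; select ..." statement
// for the htmlpost table. No escaping is done on the arguments.
function buildHtmlPostQuery(tableUrl, pageUrl, fields, xpath) {
  return 'use "' + tableUrl + '" as htmlpost; ' +
    "select * from htmlpost where url='" + pageUrl + "' " +
    'and postdata="' + toPostData(fields) + '" and xpath="' + xpath + '"';
}

const query = buildHtmlPostQuery(
  "http://isithackday.com/hacks/htmlpost/htmlpost.xml",
  "http://isithackday.com/hacks/htmlpost/index.php",
  { foo: "foo", bar: "bar" },
  "//p"
);
console.log(query);
```

URLSearchParams takes care of percent-encoding the field values, which matches the application/x-www-form-urlencoded content type the table sends.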

You can try this example in the console.

I’ve also added the table to the open YQL tables repository on github so it should show up sooner or later in the console.

Here’s a quick explanation of what is going on:

<?xml version="1.0" encoding="UTF-8"?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>
  <author>Christian Heilmann</author>
  <description>HTML pages that need post data</description>
  <sampleQuery><![CDATA[
select * from {table} where
url='http://isithackday.com/hacks/htmlpost/index.php' 
and postdata="foo=foo&bar=bar" and xpath="//p"]]></sampleQuery>
  <documentationURL></documentationURL>
  </meta>


You define the schema and add meta data like the author, a description and a sample query. The latter is really important as that will show up in the YQL console when people click the table. You should normally also provide a documentation URL, but this post wasn’t written when I wrote the table so I kept it empty.

  <bindings>
    <select itemPath="" produces="XML">
    <urls>
      <url>{url}</url>
    </urls>


The bindings of the table describe the real API data endpoints the table points to. You have select, insert, update and delete – much like any other database. You provide an itemPath to cut down on the data returned and tell YQL if the data returned is XML or JSON.

    <inputs>
      <key id="url" type="xs:string" required="true" paramType="variable"/>
      <key id="postdata" type="xs:string" required="true" paramType="variable"/>
      <key id="xpath" type="xs:string" required="true" paramType="variable"/>
    </inputs>


The inputs section defines what variables are expected, if they are required and what their IDs are. These IDs will be available for you as variables in the embedded JavaScript block and are normally defined by the API your table points to.

    <execute>
    <![CDATA[
      var myRequest = y.rest(url);  
      var data = myRequest.accept('text/html').
                 contentType("application/x-www-form-urlencoded").
                 post(postdata).response;
      var xdata = y.xpath(data,xpath);
      response.object = <postresult>{xdata}</postresult>;
    ]]>
    </execute>


Here comes the JavaScript magic inside the execute block. The y.rest(url) command sends a REST query to the URL. In its simplest form this would just get the data back, but in our case we need to define a few more things. We expect HTML back, so we set the request’s Accept header to text/html. This also ensures that the result is run through HTML Tidy before it is returned. The content type has to be that of a form submission, and we need to send the string postdata as a POST request. The response then contains whatever our request brings back.

As we want to have the handy functionality of the original HTML table, we also need to do an xpath transformation which is done with the method of the same name.

Any JavaScript in the execute block needs to define a response.object which will become the result of the YQL query. As you can see, the E4X support of YQL allows you to simply write XML blocks without any DOM pains and you can embed any JavaScript variables inside curly braces.

  </select> 
  </bindings>
</table>


And we’re done. Using YQL execute you can move a lot of JavaScript that does complex transformations to the Yahoo server farm without slowing down your end users’ computers. And you have a secure environment to boot, as there are no DOM vulnerabilities.

Tags: executetables, HTML, opentables, scraping, serverside javascript, yql
Posted in General | 11 Comments »

Converting a data table on the web to an autocomplete translator with YQL and YUI

Monday, August 31st, 2009

During the Summer of Widgets hack event last weekend, Tomas Caspers, Nina Wieland and Jens Grochdreis had the idea of creating a translation tool to translate from the local Cologne accent to German and back.

For this, they found a pretty impressive data source on the web, namely this web site by Reinhard Kaaden. The task was now to turn this into a fancy interface that makes it easy for people to enter a “Kölsch” term and get the German equivalent and vice versa. For this, I proposed YQL and YUI, and here is a step-by-step explanation of how you can do it.

You can see the final outcome here: Deutsch-Kölsch Übersetzer

Step 1: Retrieve and convert the data

A very easy way to get data from the web is using YQL. In order to get the whole HTML of the source page all we had to do is select * from html where url='http://www.magicvillage.de/~reinhard_kaaden/d-k.html'. That gave us the whole data though and we only wanted to get the content of the tables.

Using Firebug and looking up some XPATH we came up with the following statement that would give us the language pairs as German-Koelsch inside paragraphs: //table[1]/tr/td/p[not(a)]. The not(a) statement is needed to filter out the A-Z navigation table cells. We chose JSON as the output format in YQL and dktrans as the callback function name.

All in all this gave us a URL that would load the data we wanted and send it to the function dktrans once it has been pulled:



All that had to go in there to create the Autocomplete controls was more or less 100% copied from the simple Autocomplete example on the YUI site.
First thing is to get some handlers to the input fields I want to populate with the translation data:

var di = YAHOO.util.Dom.get('deutschinput');
var ci = YAHOO.util.Dom.get('koelschinput');

Then you need to instantiate the data source for the autocomplete and give it the language array. As a responseSchema you can define a field called term:

dktransdata.cologneDS = new YAHOO.util.LocalDataSource(
  dktransdata.koelsch
);
dktransdata.cologneDS.responseSchema = {fields:['term']};

Next you need to instantiate the AutoComplete widget. This one gets three parameters: the input element, the output container and the data source. You can set useShadow to get a small dropshadow on the container:

dktransdata.cologneAC = new YAHOO.widget.AutoComplete(
  'koelschinput','koelschoutput',dktransdata.cologneDS
);
dktransdata.cologneAC.useShadow = true;

This turns the input of the Cologne language into an Autocomplete, but it doesn’t yet populate the other field. For this we need to subscribe to the itemSelectEvent of the AutoComplete widget. The event handler of that event gets a few parameters, the text content of the chosen element is the first element of the third element in the second parameter (this is explained in detail on the YUI site). All you need to do is set the value of the other field to the corresponding element of the translation maps we defined:

dktransdata.cologneAC.itemSelectEvent.subscribe(cologneHandler);
function cologneHandler(s,a){
di.value = dktransdata.dk[a[2][0]];
}

All that is left is to do the same for the German to Cologne field:


dktransdata.germanDS = new YAHOO.util.LocalDataSource(
  dktransdata.deutsch
);
dktransdata.germanDS.responseSchema = {fields:['term']};
dktransdata.germanAC = new YAHOO.widget.AutoComplete(
  'deutschinput','deutschoutput',dktransdata.germanDS
);
dktransdata.germanAC.useShadow = true;
dktransdata.germanAC.itemSelectEvent.subscribe(germanHandler);
function germanHandler(s,a){
  ci.value = dktransdata.kd[a[2][0]];
}
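The code above relies on four pieces of data: the arrays dktransdata.koelsch and dktransdata.deutsch for the two autocompletes, and the lookup maps dktransdata.dk and dktransdata.kd used by the handlers. The following is a rough sketch of how they could be built from the paragraph texts YQL returns. The function name buildTranslationData is made up, and so is the " - " separator between the German and the Kölsch term; the real page may delimit the pairs differently.

```javascript
// Hypothetical helper: turn "German - Kölsch" paragraph texts into the
// arrays and lookup maps used by the autocomplete code.
// The separator is an assumption about the source data.
function buildTranslationData(paragraphs, separator = " - ") {
  const data = { deutsch: [], koelsch: [], dk: {}, kd: {} };
  for (const text of paragraphs) {
    const index = text.indexOf(separator);
    if (index === -1) continue; // not a language pair, skip it
    const german = text.slice(0, index).trim();
    const cologne = text.slice(index + separator.length).trim();
    if (!german || !cologne) continue;
    data.deutsch.push(german);
    data.koelsch.push(cologne);
    data.kd[german] = cologne; // read by germanHandler
    data.dk[cologne] = german; // read by cologneHandler
  }
  return data;
}

// Sample input, not taken from the original data source.
const sample = buildTranslationData(["Kartoffel - Äädappel", "A-Z"]);
console.log(sample.kd["Kartoffel"]);
```

Keeping one map per direction means each handler is a plain property lookup on the selected term.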

Step 5: Putting it all together

You can see the full source of the translation tool on GitHub and can download it there, too.
Of course we are not really finished here, as this only works in JavaScript environments. As the translator was meant to be a widget, though, this was not an issue. That the autocomplete does not seem to work on mobiles is one, though :).

Making this work without JavaScript would be pretty easy, too. As the data is returned as JSON we can also use it in PHP and write a simple form script. If wanted, I can do that later.

Tags: autocomplete, conversion, hack, scraping, yql, YUI
Posted in General | 4 Comments »

Tutorial: scraping and turning a web site into a widget with YQL

Tuesday, August 25th, 2009

During the mentoring sessions at last weekend’s Young Rewired State one of the most asked questions was how you can easily re-use content on the web. The answer I gave was by using YQL, and I promised a short introduction to the topic, so here it is. What we are going to do here and now is turn a web site into a widget with YQL and a few lines of JavaScript:

Screenshot: Turning a web page into a widget with YQL

Say you have a web site with a list of content and you want to turn it into a widget to include in other web sites. For example this list of funny TV facts (which is really a Usenet classic). The first thing you need to do with this is to find out its structure, either by looking at the source code of the page or by using Firebug:

Note: The original joke site is dead, so I fixed the widget to use another one. The concept still works though.

Screenshot: Finding out the HTML structure by using Firebug

If you right-click on the item in Firebug you can get the XPATH to the element you want to reach – we’ll need this later. In this case the xpath is /html/body/ul/li[92] which gets us that single element. If we want all TV facts, then we need to shorten this to //ul/li.

Screenshot: Copying the XPATH in Firebug

The next step is to go to the YQL console and enter the following statement.

select * from html where url='http://www.dcs.gla.ac.uk/~joy/fun/jokes/TV.html' and xpath='//ul/li'

This follows the syntax select * from html where url='{url}' and xpath='{xpath}'. This will result in YQL pulling the page content and giving it back to us as XML:

Screenshot: Yahoo! Query Language - YDN

Notice that YQL has inserted P elements in the results. This is because YQL runs the HTML through HTML Tidy to remove invalid markup. This means that we need to alter our XPATH to //ul/li/p to get to the texts.

The next step is to define the output format as JSON, define a callback function with the name funfacts, hit the test button, wait for the results and copy and paste the REST query.

Screenshot: Steps to get the HTML in the right format

That’s all you need to do. You will now have the HTML as a JavaScript-readable object and all you need to do is to define a function called funfacts that gets the data from YQL and add another SCRIPT node with the REST URL you copied from YQL as the src attribute:


The function will get the data from YQL as you were able to see in the console. Therefore getting to the TV facts is as easy as accessing o.query.results.p.
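As a rough sketch of what such a callback could look like: the helper extractFacts and the element ID funfacts are made up for illustration, and the handling of a single result or of paragraphs that come back as objects with a content property is an assumption about how YQL shapes its JSON output.

```javascript
// Hypothetical helper: pull the paragraph texts out of a YQL response.
// Assumes results.p is an array, a single item, or missing, and that
// each item is a string or an object with a "content" property.
function extractFacts(o) {
  const results = o && o.query && o.query.results;
  if (!results || !results.p) return [];
  const paragraphs = Array.isArray(results.p) ? results.p : [results.p];
  return paragraphs
    .map((p) => (typeof p === "string" ? p : p.content))
    .filter((text) => typeof text === "string" && text.trim() !== "")
    .map((text) => text.trim());
}

// Callback named in the REST URL. Outside a browser it only returns
// the facts; inside one it also renders them as a list.
function funfacts(o) {
  const facts = extractFacts(o);
  if (typeof document === "undefined") return facts;
  const target = document.getElementById("funfacts"); // hypothetical ID
  if (!target) return facts;
  const list = document.createElement("ul");
  facts.forEach((fact) => {
    const item = document.createElement("li");
    item.appendChild(document.createTextNode(fact));
    list.appendChild(item);
  });
  target.appendChild(list);
  return facts;
}
```

Using text nodes instead of innerHTML means that whatever the scraped page contains ends up in the widget as plain text.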

The rest of the functionality is plain and simple DOM Scripting. Check the comments for explanations:

Funny TV facts

Some funny TV facts



Add a bit of styling and you’ll end up with quite a cool little widget powered by the data on the jokes site. Check the source of the demo to see all the CSS needed.

That is all there is to it – get scraping!

Tags: development, javascript, scraping, widget, yql
Posted in General | 39 Comments »

Screencast: Building an online profile of distributed data with YQL

Wednesday, April 15th, 2009

Distributing your information all over the web has become a common practice over the last few years and it makes a lot of sense. By covering lots of distribution channels you can reach various audiences and get comments and feedback from them.

You also make yourself independent of a single online resource – if your server is unavailable your data is still around. I could go on with the benefits of distribution (after all I’ve written a book on the subject) but let’s take a look on the flipside: by spreading your data all over the web you also spread yourself thin and you want a single resource to act as your main URL.

People have been telling me for a while that they don’t have time to find all the things I leave across the web and that they are wondering if there’s a single entry point. One solution is FriendFeed, but you want to be able to style your “online profile” more than that.

This is where YQL comes into the equation. Using YQL, a YUI CSS grid, a few dozen lines of PHP and a bit of CSS I managed to pull together my online portfolio at http://icant.co.uk, and you can do this just as easily. The following screencast shows you how it is done:

You can also download a readable version of the screencast for iPods.

Since I put together the screencast (which was a bit hurried as I needed to catch a flight) I’ve updated the idea with yet another script that scrapes the resulting HTML document to create an RSS feed of all my data on the web.

  • Check out the source of index.php
  • Check out the source of feed.php

Using YQL has a few more benefits over reading all the different sources yourself and mixing them up: the results are cached for you; YQL’s connection to the web is very likely faster than yours, which speeds up the fetching; and you have full control over what’s happening, as the YQL output gives you diagnostics information.

I’ll talk more about YQL in various talks in the near future, and there are even more interesting changes to the system itself around the corner. Stay alert for awesome updates.

Tags: api, data distribution, scrapi, scraping, screencast, webapi, ydn, yql
Posted in General | 4 Comments »

Building a hack using YQL, Flickr and the web – step by step

Wednesday, March 11th, 2009

As you probably know, I am spending a lot of time speaking and mentoring at hack days for Yahoo. I go to open hack days, university hack days and even organized my own hackday revolving around accessibility last year.

One of the main questions I get is about technologies to use. People are happy to find content on the web, but getting it and mixing it with other sources is still a bit of an enigma.

Below I will go through a hack I prepared at the Georgia Tech University hack day. I am using PHP to retrieve information from the web, YQL to filter it down to what I need and YUI to do the CSS layout and add extra functionality.

The main ingredient of a good hack – the idea

I give a lot of presentations and every time I do people ask me where I get the pictures I use from. The answer is Flickr and some other resources on the internet. The next question is how much time I spend finding them and that made me think about building a small tool to make this easier for me.

This is how Slidefodder started, and below is a screenshot of the hack in action. If you want to play with it, you can download the Slidefodder source code.

Slide Fodder - find CC licensed photos and funpics for your slides

Step 1: retrieving the data

The next thing I could have done is deep-dive into the Flickr API to get photos that I am allowed to use. Instead I am happy to say that using YQL gives you a wonderful shortcut to do this without brooding over documentation for hours on end.

Using YQL I can get photos from flickr with the right license and easily display them. The YQL statement to search photos with the correct license is the following:


select id from flickr.photos.search(10) where text='donkey' and license=4

Retrieving CC licensed photos from flickr in YQL

You can try the flickr YQL query here and you’ll see that the result (once you’ve chosen JSON as the output format) is a JSON object with photo results:


{
  "query": {
    "count": "10",
    "created": "2009-03-11T01:23:00Z",
    "lang": "en-US",
    "updated": "2009-03-11T01:23:00Z",
    "uri": "http://query.yahooapis.com/v1/yql?q=select+*+from+flickr.photos.search%2810%29+where+text%3D%27donkey%27+and+license%3D4",
    "diagnostics": {
      "publiclyCallable": "true",
      "url": {
        "execution-time": "375",
        "content": "http://api.flickr.com/services/rest/?method=flickr.photos.search&text=donkey&license=4&page=1&per_page=10"
      },
      "user-time": "376",
      "service-time": "375",
      "build-version": "911"
    },
    "results": {
      "photo": [
        {
          "farm": "4",
          "id": "3324618478",
          "isfamily": "0",
          "isfriend": "0",
          "ispublic": "1",
          "owner": "25596604@N04",
          "secret": "20babbca36",
          "server": "3601",
          "title": "donkey image"
        }
        [...]
      ]
    }
  }
}

The problem with this is that the user name is not provided anywhere, just their Flickr ID. As I wanted to get the user name, too, I needed to nest a YQL query for that:

select farm,id,secret,server,owner.username,owner.nsid from flickr.photos.info where photo_id in (select id from flickr.photos.search(10) where text='donkey' and license=4)

This gives me only the information I really need (try the nested flickr query here):


{
  "query": {
    "count": "10",
    "created": "2009-03-11T01:24:45Z",
    "lang": "en-US",
    "updated": "2009-03-11T01:24:45Z",
    "uri": "http://query.yahooapis.com/v1/yql?q=select+farm%2Cid%2Csecret%2Cserver%2Cowner.username%2Cowner.nsid+from+flickr.photos.info+where+photo_id+in+%28select+id+from+flickr.photos.search%2810%29+where+text%3D%27donkey%27+and+license%3D4%29",
    "diagnostics": {
      "publiclyCallable": "true",
      "url": [
        {
          "execution-time": "394",
          "content": "http://api.flickr.com/services/rest/?method=flickr.photos.search&text=donkey&license=4&page=1&per_page=10"
        },
        [...]
      ],
      "user-time": "1245",
      "service-time": "4072",
      "build-version": "911"
    },
    "results": {
      "photo": [
        {
          "farm": "4",
          "id": "3344117208",
          "secret": "a583f1bb04",
          "server": "3355",
          "owner": {
            "nsid": "64749744@N00",
            "username": "babasteve"
          }
        }
        [...]
      ]
    }
  }
}

The next step was getting the data from the other resources I am normally tapping into: Fail blog and I can has cheezburger. As neither of them has an API I need to scrape the HTML data of the page. Luckily this is also possible with YQL: all you need to do is select the data from html and give it an XPATH. I found the XPATH by analysing the page source in Firebug:

Using Firebug to find the right xpath to an image

This gave me the following YQL statement to get images from both blogs. You can list several sources as an array inside the in() statement:


select src from html where url in ('http://icanhascheezburger.com/?s=donkey','http://failblog.org/?s=donkey') and xpath="//div[@class='entry']/div/div/p/img"

Retrieving blog images using YQL

The result of this query is again a JSON object with the src values of photos matching this search:


{
  "query": {
    "count": "4",
    "created": "2009-03-11T01:28:35Z",
    "lang": "en-US",
    "updated": "2009-03-11T01:28:35Z",
    "uri": "http://query.yahooapis.com/v1/yql?q=select+src+from+html+where+url+in+%28%27http%3A%2F%2Ficanhascheezburger.com%2F%3Fs%3Ddonkey%27%2C%27http%3A%2F%2Ffailblog.org%2F%3Fs%3Ddonkey%27%29+and+xpath%3D%22%2F%2Fdiv%5B%40class%3D%27entry%27%5D%2Fdiv%2Fdiv%2Fp%2Fimg%22",
    "diagnostics": {
      "publiclyCallable": "true",
      "url": [
        {
          "execution-time": "1188",
          "content": "http://failblog.org/?s=donkey"
        },
        {
          "execution-time": "1933",
          "content": "http://icanhascheezburger.com/?s=donkey"
        }
      ],
      "user-time": "1939",
      "service-time": "3121",
      "build-version": "911"
    },
    "results": {
      "img": [
        {
          "src": "http://icanhascheezburger.files.wordpress.com/2008/09/funny-pictures-you-are-making-a-care-package-very-correctly.jpg"
        },
        {
          "src": "http://icanhascheezburger.files.wordpress.com/2008/01/funny-pictures-zebra-donkey-family.jpg"
        },
        {
          "src": "http://failblog.files.wordpress.com/2008/11/fail-owned-donkey-head-intimidation-fail.jpg"
        },
        {
          "src": "http://failblog.files.wordpress.com/2008/03/donkey.jpg"
        }
      ]
    }
  }
}
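The diagnostics part of these responses is handy when you want to see which source slows a query down. Here is a small sketch that lists the time per fetched URL. The function name executionTimes is made up, and the code assumes what the responses above suggest: diagnostics.url is a single object when one URL was fetched and an array when there were several. The unit of the numbers is not stated in the output, so the sketch passes them on as plain numbers.

```javascript
// Hypothetical helper: list the execution time YQL reports per URL.
// Assumes diagnostics.url is either one object or an array of objects.
function executionTimes(response) {
  const diagnostics = response && response.query && response.query.diagnostics;
  if (!diagnostics || !diagnostics.url) return [];
  const urls = Array.isArray(diagnostics.url)
    ? diagnostics.url
    : [diagnostics.url];
  return urls.map((entry) => ({
    url: entry.content,
    time: Number(entry["execution-time"]),
  }));
}

// Trimmed-down version of the blog image response shown above.
const blogResponse = {
  query: {
    diagnostics: {
      url: [
        { "execution-time": "1188", content: "http://failblog.org/?s=donkey" },
        { "execution-time": "1933", content: "http://icanhascheezburger.com/?s=donkey" },
      ],
    },
  },
};
console.log(executionTimes(blogResponse));
```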

Writing the data retrieval API

The next thing I wanted to do was writing a small script to get the data and give it back to me as HTML. I could have used the JSON output in JavaScript directly but wanted to be independent of scripting. The script (or API if you will) takes a search term, filters it and executes both of the YQL statements above before returning a list of HTML items with photos in them. You can try it out for yourself: search for the term donkey or search for the term donkey and give it back as a JavaScript call

I use cURL to get the data as my server has external pulling of data via PHP disabled. This should work for most servers, actually.

Here’s the full “API” code:


[...]
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $flickurl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
$flickrphotos = json_decode($output);
foreach($flickrphotos->query->results->photo as $a){
  $o = $a->owner;
  $out.= '<li>'. '';
  $href = 'http://www.flickr.com/photos/'.$o->nsid.'/'.$a->id;
  $out.= ''.$href.' - '.$o->username.'</li>';
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $failurl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
$failphotos = json_decode($output);
foreach($failphotos->query->results->img as $a){
  $out.= '<li>';
  // strpos() returns 7 when "failblog" directly follows "http://"
  if(strpos($a->src,'failblog') === 7){
    $out.= '';
  } else {
    $out.= '';
  }
  $out.= ''.$a->alt.'</li>';
}
$out.= '</ul>';
if($_GET['js']==='yes'){
  $out.= '\'})';
}
echo $out;
} else {
  echo ($_GET['js']!=='yes') ?
    'Invalid search term.' :
    'seed({html:"Invalid search Term!"})';
}
}
?>

    Let’s go through it step by step:


    if($_GET['js']==='yes'){
      header('Content-type:text/javascript');
      $out = 'seed({html:\'';
    }

    I test if the js parameter is set and if it is I send a JavaScript header and start the JS object output.


    if(isset($_GET['s'])){
      $s = $_GET['s'];
      if(preg_match("/^[0-9|a-z|A-Z|-| |+|.|_]+$/",$s)){

    I get the search term and filter out invalid terms.
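The same whitelist idea also works on the client before the request is even sent. The following is a sketch in JavaScript, not a translation of the PHP: the name isValidTerm is made up, and the character class is written without the | separators, so unlike the pattern above it does not let the pipe character through.

```javascript
// Whitelist: digits, ASCII letters, space, dot, plus, underscore, hyphen.
// The hyphen comes last in the class so it is read literally.
const VALID_TERM = /^[0-9a-zA-Z .+_-]+$/;

// Hypothetical helper: true when the search term only contains
// whitelisted characters and is not empty.
function isValidTerm(term) {
  return typeof term === "string" && VALID_TERM.test(term);
}

console.log(isValidTerm("donkey"));   // true
console.log(isValidTerm("<script>")); // false
```

A whitelist like this rejects anything unexpected instead of trying to enumerate every dangerous character, which matters here because the term ends up inside a YQL statement.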

    
    $flickurl = 'http://query.yahooapis.com/v1/public/yql?q=select'.
    '%20farm%2Cid%2Csecret%2Cserver%2Cowner.username'.
    '%2Cowner.nsid%20from%20flickr.photos.info%20where%20'.
    'photo_id%20in%20(select%20id%20from%20'.
    'flickr.photos.search(10)%20where%20text%3D%27'.
    $s.'%27%20and%20license%3D4)&format=json';
    $failurl = 'http://query.yahooapis.com/v1/public/yql?q=select'.
    '%20*%20from%20html%20where%20url%20in'.
    '%20(%27http%3A%2F%2Ficanhascheezburger.com'.
    '%2F%3Fs%3D'.$s.'%27%2C%27http%3A%2F%2Ffailblog.org'.
    '%2F%3Fs%3D'.$s.'%27)%20and%20xpath%3D%22%2F%2Fdiv'.
    '%5B%40class%3D%27entry%27%5D%2Fdiv%2Fdiv%2Fp%2Fimg%22%0A&'.
    'format=json';
    

    These are the YQL queries, you get them by clicking the “copy url” button in YQL.


    $out.= '<ul>';

    I then start the output list of the results.


    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $flickurl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $output = curl_exec($ch);
    curl_close($ch);
    $flickrphotos = json_decode($output);
    foreach($flickrphotos->query->results->photo as $a){
      $o = $a->owner;
      $out.= '<li>'. '';
      $href = 'http://www.flickr.com/photos/'.$o->nsid.'/'.$a->id;
      $out.= ''.$href.' - '.$o->username.'</li>';
    }

    I call cURL to retrieve the data from the flickr yql query, do a json_decode and loop over the results. Notice the rather annoying way of having to assemble the flickr url and image source. I found this by clicking around flickr and checking the src attribute of images rendered on the page. The images with the “ico” class should tell me where the photo was from.
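To make the URL assembly a bit less mysterious, here it is as a sketch in JavaScript. The photo page URL follows the $href line of the full listing. The image URL pattern (farm, server, id and secret glued together on static.flickr.com) is my recollection of Flickr's URL scheme at the time and is not taken from the code above, so treat it as an assumption. Both function names are made up.

```javascript
// Link to the photo page, built like $href in the PHP listing.
function flickrPhotoPage(photo) {
  return "http://www.flickr.com/photos/" + photo.owner.nsid + "/" + photo.id;
}

// Image source, assuming the pattern
// http://farm{farm}.static.flickr.com/{server}/{id}_{secret}.jpg
// The optional size suffix (for example "s" or "m") is also an assumption.
function flickrImageSrc(photo, size) {
  const suffix = size ? "_" + size : "";
  return "http://farm" + photo.farm + ".static.flickr.com/" +
    photo.server + "/" + photo.id + "_" + photo.secret + suffix + ".jpg";
}

// One photo from the nested query result shown earlier.
const photo = {
  farm: "4",
  id: "3344117208",
  secret: "a583f1bb04",
  server: "3355",
  owner: { nsid: "64749744@N00", username: "babasteve" },
};
console.log(flickrPhotoPage(photo));
console.log(flickrImageSrc(photo));
```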


    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $failurl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $output = curl_exec($ch);
    curl_close($ch);
    $failphotos = json_decode($output);
    foreach($failphotos->query->results->img as $a){
      $out.= '<li>';
      // strpos() returns 7 when "failblog" directly follows "http://"
      if(strpos($a->src,'failblog') === 7){
        $out.= '';
      } else {
        $out.= '';
      }
      $out.= ''.$a->alt.'</li>';
    }

    Retrieving the blog data works the same way, all I had to do extra was check for which blog the resulting image came from.
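Checking for the string position works, but it depends on the URL starting with exactly "http://". A slightly more robust variant looks at the host name instead. This is a sketch in JavaScript with a made-up function name; the two host prefixes are taken from the image URLs in the result above.

```javascript
// Hypothetical helper: work out which blog an image came from by
// looking at the host name of its src instead of a string position.
function blogSource(src) {
  let host;
  try {
    host = new URL(src).hostname;
  } catch (e) {
    return "unknown"; // not a parseable absolute URL
  }
  if (host.startsWith("failblog.")) return "failblog";
  if (host.startsWith("icanhascheezburger.")) return "icanhascheezburger";
  return "unknown";
}

console.log(blogSource("http://failblog.files.wordpress.com/2008/03/donkey.jpg"));
```

Because only the host is inspected, a cheezburger picture with "fail" somewhere in its file name is still attributed to the right blog.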


    $out.= '</ul>';
    if($_GET['js']==='yes'){
      $out.= '\'})';
    }
    echo $out;

    I close the list and – if JavaScript was desired – the JavaScript object and function call.


      } else {
        echo ($_GET['js']!=='yes') ?
          'Invalid search term.' :
          'seed({html:"Invalid search Term!"})';
      }
    }
    ?>

    If there was an invalid term entered I return an error message.

    Setting up the display

    Next I went to the YUI grids builder and created a shell for my hack. Using the generated code, I added a form, my yql api, an extra stylesheet for some colouring and two IDs for easy access for my JavaScript:


    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
    [...]
    Slide Fodder – find CC licensed photos and funpics for your slides
    [...]
    Slide Fodder
    [...]
    Slide Fodder by Christian Heilmann, hacked live at Georgia Tech University Hack day using YUI and YQL.
    Photo sources: Flickr, Failblog and I can has cheezburger.
    [...]

    Rounding up the hack with a basket

    The last thing I wanted to add was a “basket” functionality which would allow me to do several searches and then copy and paste all the photos in one go once I am happy with the result. For this I’d either have to do a persistent storage somewhere (DB or cookies) or use JavaScript. I opted for the latter.

    The JavaScript uses YUI and is no rocket science whatsoever:


    function seed(o){
      YAHOO.util.Dom.get('content').innerHTML = o.html;
    }

    YAHOO.util.Event.on('f','submit',function(e){
      var s = document.createElement('script');
      s.src = 'yql.php?js=yes&s='+ YAHOO.util.Dom.get('s').value;
      document.getElementsByTagName('head')[0].appendChild(s);
      YAHOO.util.Dom.get('content').innerHTML = '<img '+
        'src="http://tweeteffect.com/ajax-loader.gif" '+
        'style="display:block;margin:2em auto">';

      YAHOO.util.Event.preventDefault(e);
    });

    YAHOO.util.Event.on(‘content’,’click’,function(e){
    var t = YAHOO.util.Event.getTarget(e);
    if(t.nodeName.toLowerCase()===’img’){
    var str = ‘

    ‘;
    if(t.src.indexOf(‘flickr’)!==-1){
    str+= ‘

    ‘+t.parentNode.getElementsByTagName(‘a’)[0].innerHTML+’

    ‘;
    }

    str+=’x

    ‘;
    YAHOO.util.Dom.get(‘basket’).innerHTML+=str;
    }

    YAHOO.util.Event.preventDefault(e);
    });
    YAHOO.util.Event.on('basket','click',function(e){
      var t = YAHOO.util.Event.getTarget(e);
      if(t.nodeName.toLowerCase()==='a'){
        t.parentNode.parentNode.removeChild(t.parentNode);
      }

      YAHOO.util.Event.preventDefault(e);
    });

    Again, let’s check it bit by bit:


    function seed(o){
      YAHOO.util.Dom.get('content').innerHTML = o.html;
    }

    This is the method called by the “API” when JavaScript was desired as the output format. All it does is change the HTML content of the DIV with the id “content” to the one returned by the “API”.


    YAHOO.util.Event.on('f','submit',function(e){
      var s = document.createElement('script');
      s.src = 'yql.php?js=yes&s='+ YAHOO.util.Dom.get('s').value;
      document.getElementsByTagName('head')[0].appendChild(s);
      YAHOO.util.Dom.get('content').innerHTML = '<img '+
        'src="http://tweeteffect.com/ajax-loader.gif" '+
        'style="display:block;margin:2em auto">';
      YAHOO.util.Event.preventDefault(e);
    });

    When the form (the element with the ID “f”) is submitted, I create a new script element, give it a src attribute that points to the API and carries the search term, and append it to the head of the document. I add a loading image to the content section and stop the browser from submitting the form.
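One thing the quick version skips is encoding the search term. A slightly more defensive sketch of the script-node request could look like the following. `apiUrl` and `requestResults` are hypothetical helper names, and the fake document object only exists so the example also runs outside a browser; in the page you would pass in the real `document`.

```javascript
// Sketch: build the API URL with an encoded search term
function apiUrl(term) {
  return 'yql.php?js=yes&s=' + encodeURIComponent(term);
}

// Sketch: create the script node and append it to the head of the given document
function requestResults(doc, term) {
  var s = doc.createElement('script');
  s.src = apiUrl(term);
  doc.getElementsByTagName('head')[0].appendChild(s);
  return s;
}

// Minimal stand-in for document, for illustration only
var fakeHead = {
  children: [],
  appendChild: function (node) { this.children.push(node); }
};
var fakeDoc = {
  createElement: function (name) { return { nodeName: name.toUpperCase() }; },
  getElementsByTagName: function () { return [fakeHead]; }
};

var scriptNode = requestResults(fakeDoc, 'cats & dogs');
console.log(scriptNode.src);
```

Without the encoding, a term containing "&" or "#" would cut the query string short before it reaches the PHP script.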


    YAHOO.util.Event.on(‘content’,’click’,function(e){
    var t = YAHOO.util.Event.getTarget(e);
    if(t.nodeName.toLowerCase()===’img’){
    var str = ‘
    ‘;
    if(t.src.indexOf(‘flickr’)!==-1){
    str+= ‘

    ‘+t.parentNode.getElementsByTagName(‘a’)[0].innerHTML+’

    ‘;
    }

    str+=’x

    ‘;
    YAHOO.util.Dom.get(‘basket’).innerHTML+=str;
    }

    YAHOO.util.Event.preventDefault(e);
    });

    I am using Event Delegation to check when a user has clicked on an image in the content section and create a new DIV with the image to add to the basket. When the image is from Flickr (I am checking the src attribute) I also add the URL of the image source and the user name to use in my slides later on. I add an “x” link to remove the image from the basket and again stop the browser from doing its default behaviour.
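The string assembly for one basket entry can be sketched as a small pure function. The tag names used here (a DIV wrapper, a paragraph for the credit, a link containing "x") are assumptions based on the description above, and `basketEntry` is a hypothetical name, so treat the markup as illustrative rather than as the exact output of the hack.

```javascript
// Sketch: build the HTML for one basket entry.
// src is the clicked image's source, credit is the text of the link next to it.
function basketEntry(src, credit) {
  var str = '<div><img src="' + src + '" alt="">';
  // Only Flickr photos need an attribution line
  if (src.indexOf('flickr') !== -1 && credit) {
    str += '<p>' + credit + '</p>';
  }
  // The "x" link is a direct child of the DIV, so a remove handler
  // can delete its parentNode to drop the whole entry
  str += '<a href="#">x</a></div>';
  return str;
}

// Made-up sample values, for illustration only
var flickrEntry = basketEntry('http://farm3.static.flickr.com/45/123_abc.jpg',
                              'http://www.flickr.com/photos/someone/123 - someone');
var blogEntry = basketEntry('http://example.com/funny.jpg', '');
console.log(flickrEntry);
console.log(blogEntry);
```

Inside the click handler the call would then be something like `basketEntry(t.src, t.parentNode.getElementsByTagName('a')[0].innerHTML)` for Flickr images.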


    YAHOO.util.Event.on('basket','click',function(e){
      var t = YAHOO.util.Event.getTarget(e);
      if(t.nodeName.toLowerCase()==='a'){
        t.parentNode.parentNode.removeChild(t.parentNode);
      }

      YAHOO.util.Event.preventDefault(e);
    });

    In the basket I remove the DIV when the user clicks on the “x” link.

    That’s it

    This concludes the hack. It works, it helps me get photo material faster and it took me about half an hour to build all in all. Yes, it could be improved in terms of accessibility, but this is enough for me and my idea was to show how to quickly use YQL and YUI with a few lines of PHP to deliver something that does a job :)

    Tags: flickr, hack, HTML, javascript, php, scraping, yql
    Christian Heilmann is the blog of Christian Heilmann chris@christianheilmann.com (Please do not contact me about guest posts, I don't do those!) a Principal Program Manager living and working in Berlin, Germany.
