Christian Heilmann

Author Archive

Going a little crazy – one HTTP request RSS reader in JavaScript

Monday, December 21st, 2009

(Photo: “the joker and two face” by NiJoKeR)

OK, using YQL and playing around with the console can make you go a bit too far.

A few days ago, in response to my 24 ways article on YQL, my friend Jens Grochtdreis asked me how to get the thumbnails and some other data from the Slideshare site in one YQL request. He tried several XPath filters until I pointed out that there is a perfectly valid RSS feed with thumbnails.

That made me wonder why we should have to find the feed ourselves at all – why not let the computer do the detection for us and use the feed when it is there? What I wanted to do was to turn the following HTML automatically into a list with the feed data as embedded lists:

The ungodly YQL request I came up with was the following:

select
title,link,content.thumbnail,thumbnail,description
from feed where url in (
select href from html where url in (
"http://wait-till-i.com",
"http://flickr.com/photos/codepo8",
"http://slideshare.com/cheilmann",
"http://youtube.com/chrisheilmann"
) and
xpath="//link[contains(@type,'rss')][1]")
|unique(field="link")

What is going on here? I am using the html table to read in each of the resources I want to analyse:

select * from html where url in (
"http://wait-till-i.com",
"http://flickr.com/photos/codepo8",
"http://slideshare.com/cheilmann",
"http://youtube.com/chrisheilmann"
)

Then I use XPath to return the first link element whose type attribute contains 'rss', and in YQL I only select its href attribute:

select href from html where url in (
"http://wait-till-i.com",
"http://flickr.com/photos/codepo8",
"http://slideshare.com/cheilmann",
"http://youtube.com/chrisheilmann"
) and
xpath="//link[contains(@type,'rss')][1]"

Notice the joy that is XPath syntax… there [1] means the first element – even though every developer knows the first index is 0! We then use the feed table to get the feed information from each of these hrefs as urls:

select
title,link,content.thumbnail,thumbnail,description
from feed where url in (
select href from html where url in (
"http://wait-till-i.com",
"http://flickr.com/photos/codepo8",
"http://slideshare.com/cheilmann",
"http://youtube.com/chrisheilmann"
) and
xpath="//link[contains(@type,'rss')][1]")

The last problem was that Flickr returns each photo item several times this way, as its page points to one feed for the URL of the photo and another for the link to the photo's license. Therefore we needed to use unique() to keep only the first of each:

select
title,link,content.thumbnail,thumbnail,description
from feed where url in (
select href from html where url in (
"http://wait-till-i.com",
"http://flickr.com/photos/codepo8",
"http://slideshare.com/cheilmann",
"http://youtube.com/chrisheilmann"
) and
xpath="//link[contains(@type,'rss')][1]")
|unique(field="link")
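
What unique() does here is keep only the first item for each distinct value of a field. As a hedged sketch (the function name is mine, not part of YQL), the same filtering idea looks like this in plain JavaScript:

```javascript
// Sketch of what YQL's unique(field="link") does: keep the first
// item for each distinct value of the given field, drop the rest.
function uniqueByField(items, field) {
  var seen = {};
  return items.filter(function (item) {
    var key = item[field];
    if (seen[key]) { return false; }
    seen[key] = true;
    return true;
  });
}
```

Applied to the feed results with 'link' as the field, the duplicate Flickr items disappear.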

So, this actually does what we want – we get all the different feeds in one HTTP request and then only need some JavaScript to display them. The data coming back is a mess, though, as it is just one flat array of items – so we need to loop and check the link of each item to know when to move on to the next list.

This is very quick and dirty:

var x = document.getElementById('feeds');
var containers = [];
if(x){
  var links = x.getElementsByTagName('a');
  var urls = [];
  for(var i=0,j=links.length;i<j;i++){
    urls.push('"'+links[i].href+'"');
    containers.push(links[i].parentNode);
  }
  // assemble the YQL statement from the collected urls and load
  // the API via a generated script element with seed() as callback
  var yql = 'select title,link,content.thumbnail,thumbnail,description '+
            'from feed where url in (select href from html where url in ('+
            urls.join(',')+') and '+
            'xpath="//link[contains(@type,\'rss\')][1]")|unique(field="link")';
  var s = document.createElement('script');
  s.src = 'http://query.yahooapis.com/v1/public/yql?q='+
          encodeURIComponent(yql)+'&format=json&callback=seed';
  document.getElementsByTagName('head')[0].appendChild(s);
}
function seed(o){
  var items = o.query.results.item;
  var out = '';
  var c = 0;
  for(var i=0,j=items.length;i<j;i++){
    out += '<li><h4><a href="'+items[i].link+'">'+items[i].title+'</a></h4>';
    if(items[i].thumbnail || items[i].content){
      var thumb = items[i].thumbnail || items[i].content.thumbnail;
      out += '<img src="'+thumb+'" alt="">';
    } else {
      if(items[i].description.indexOf('src')!=-1){
        var thumb = items[i].description.split('src="')[1];
        thumb = thumb.split('"')[0];
        out += '<img src="'+thumb+'" alt="">';
      }
    }
    out += '</li>';
    if(items[i+1] && items[i+1].link.substr(0,20) !=
       items[i].link.substr(0,20)){
      containers[c].innerHTML += '<ul>'+out+'</ul>';
      c++;
      out = '';
    }
  }
  containers[c].innerHTML += '<ul>'+out+'</ul>';
}

However, the bad news about this is that it is pretty pointless as the performance is terrible. Not really surprising if you see what the YQL servers have to do and how much data gets loaded and analysed.

(Screenshot: pointless performance)

You could of course cache the result locally and thus get the loading time down to almost nothing. However, if you go this way you might as well go fully server-side.
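
As a hedged sketch of that caching idea (the helper name and the store shape are my choices, not from the original setup): wrap the expensive call in a function that only re-fetches once the stored copy is too old. In a browser the store could be localStorage; here it is a plain object so the idea stays visible.

```javascript
// Cache wrapper sketch: re-use a stored response until it is older
// than maxAgeMs, then call the (expensive) fetcher function again.
function cachedFetch(key, maxAgeMs, store, fetcher) {
  var entry = store[key];
  var now = Date.now();
  if (entry && now - entry.time < maxAgeMs) {
    return entry.data; // still fresh - no HTTP request needed
  }
  var data = fetcher();
  store[key] = { time: now, data: data };
  return data;
}
```

With the YQL call as the fetcher and, say, an hour as maxAgeMs, only the first visitor per hour pays the full price.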

I am currently working on making icant.co.uk perform much faster, so watch this space for a generic RSS displayer :)

cURL – your “view source” of the web

Friday, December 18th, 2009

What follows here is a quick introduction to the magic of cURL. This was inspired by the comment of Bruce Lawson on my 24 ways article:

Seems very cool and will help me with a small Xmas project. Unfortunately, you lost me at “Do the curl call”. Care to explain what’s happening there?

What is cURL?

OK, here goes. cURL is your “view source” tool for the web. In essence it is a program that allows you to make HTTP requests from the command line or from implementations in different languages.

The cURL homepage has all the information about it but here is where it gets interesting.

If you are on a Mac or on Linux, you are in luck – for you already have cURL. If you are operating system challenged, you can download cURL in different packages.

On the aforementioned systems you can simply go to the terminal and do your first cURL thing: load a web site and see its source. To do this, simply enter

curl "http://icant.co.uk"

And hit enter – you will get the source of icant.co.uk (that is the rendered source, like a browser would get it – not the PHP source code of course):

(Screenshot: showing the source with cURL)

If you want the code in a file you can add a > filename.html at the end:

curl "http://icant.co.uk" > myicantcouk.html

(Screenshot: downloading with cURL)

(The speed will vary of course – this is the Yahoo UK pipe :))

That is basically what cURL does – it allows you to make any HTTP request from the command line. This includes simple things like loading a document, but also allows for clever stuff like submitting forms, setting cookies, authenticating over HTTP, uploading files, faking the referer and user agent, setting the content type and following redirects. In short, anything you can do with a browser.

I could explain all of that here, but this is tedious as it is well explained (if not nicely presented) on the cURL homepage.

How is that useful for me?

Now, where this becomes really cool is when you use it inside another language that you use to build web sites. PHP is my weapon of choice for a few reasons:

  • It is easy to learn for anybody who knows HTML and JavaScript
  • It comes with almost every web hosting package

The latter is also where the problem is. As a lot of people write terribly shoddy PHP, the web is full of insecure web sites. This is why a lot of hosters disallow some of the useful things PHP comes with. For example, you can load and display a file from the web with readfile():

<?php
  readfile('http://project64.c64.org/misc/assembler.txt');
?>

Actually, as this is a text file, it needs the right header:

<?php
  header('content-type: text/plain');
  readfile('http://project64.c64.org/misc/assembler.txt');
?>

You will find, however, that a lot of hosters will not allow you to read files from other servers with readfile(), fopen() or include(). Mine for example:

(Screenshot: readfile not allowed)

And this is where cURL comes in:

<?php
header('content-type: text/plain');
// define the URL to load
$url = 'http://project64.c64.org/misc/assembler.txt';
// start cURL
$ch = curl_init(); 
// tell cURL what the URL is
curl_setopt($ch, CURLOPT_URL, $url); 
// tell cURL that you want the data back from that URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
// run cURL
$output = curl_exec($ch); 
// end the cURL call (this also cleans up memory so it is 
// important)
curl_close($ch);
// display the output
echo $output;
?>

As you can see, the options are where things get interesting, and the ones you can set are legion.

So, instead of just including or loading a file, you can now alter the output in any way you want. Say, for example, you want to get some Twitter data without using the API. This will get the profile badge from my Twitter homepage:

<?php
$url = 'http://twitter.com/codepo8';
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
$output = curl_exec($ch); 
curl_close($ch);
$output = preg_replace('/.*(<div id="profile"[^>]+>)/msi','$1',$output);
$output = preg_replace('/<hr.>.*/msi','',$output);
echo $output;
?>

Notice that Twitter's HTML uses a table for the stats, where a list would have done the trick. Let’s rectify that:

<?php
$url = 'http://twitter.com/codepo8';
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
$output = curl_exec($ch); 
curl_close($ch);
$output = preg_replace('/.*(<div id="profile"[^>]+>)/msi','$1',$output);
$output = preg_replace('/<hr.>.*/msi','',$output);
$output = preg_replace('/<\/?table>/','',$output);
$output = preg_replace('/<(\/?)tr>/','<$1ul>',$output);
$output = preg_replace('/<(\/?)td>/','<$1li>',$output);
echo $output;
?>
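
The table-to-list rewrite can be sketched in JavaScript as well, to see what the three replacements do to a minimal sample (the sample markup is mine):

```javascript
// Turn a stats table into a list: drop the table tags, rewrite
// tr to ul and td to li (the (\/?) group keeps closing tags intact).
function tableToList(html) {
  return html
    .replace(/<\/?table>/g, '')
    .replace(/<(\/?)tr>/g, '<$1ul>')
    .replace(/<(\/?)td>/g, '<$1li>');
}
tableToList('<table><tr><td>following</td></tr></table>');
// -> '<ul><li>following</li></ul>'
```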

Scraping stuff off the web is but one thing you can do with cURL. Most of the time what you will be doing is calling web services.

Say you want to search the web for donkeys; you can do that with Yahoo BOSS:

<?php
$search = 'donkeys';
$appid = 'appid=TX6b4XHV34EnPXW0sYEr51hP1pn5O8KAGs'.
         '.LQSXer1Z7RmmVrZouz5SvyXkWsVk-';
$url = 'http://boss.yahooapis.com/ysearch/web/v1/'.
       $search.'?format=xml&'.$appid;
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
$output = curl_exec($ch); 
curl_close($ch);
$data = simplexml_load_string($output);
foreach($data->resultset_web->result as $r){
  echo "<h3><a href=\"{$r->clickurl}\">{$r->title}</a></h3>";
  echo "<p>{$r->abstract} <span>({$r->url})</span></p>";
}
?>

You can also do that for APIs that need POST or other authentication. Say, for example, you want to use Placemaker to find locations in a text:

<?php
$content = 'Hey, I live in London, England and on Monday '.
           'I fly to Nuremberg via Zurich,Switzerland (sadly enough).';
$key = 'C8meDB7V34EYPVngbIRigCC5caaIMO2scfS2t'.
       '.HVsLK56BQfuQOopavckAaIjJ8-';
define('POSTURL',  'http://wherein.yahooapis.com/v1/document');
define('POSTVARS', 'appid='.$key.'&documentContent='.
                    urlencode($content).
                   '&documentType=text/plain&outputType=xml');
$ch = curl_init(POSTURL);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, POSTVARS);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  
$x = curl_exec($ch);
$places = simplexml_load_string($x, 'SimpleXMLElement',
                                LIBXML_NOCDATA);    
echo "<p>$content</p>";
echo "<ul>";
foreach($places->document->placeDetails as $p){
  $now = $p->place;
  echo "<li>{$now->name}, {$now->type} ";
  echo "({$now->centroid->latitude},{$now->centroid->longitude})</li>";
}
echo "</ul>";
?>

Why is all that necessary? I can do that with jQuery and Ajax!

Yes, you can, but can your users? Also, can you afford to have a page that is not indexed by search engines? Can you be sure that none of the other JavaScript on the page will cause an error that wipes out all of your functionality?

By sticking to your server to do the hard work, you can rely on things working. If you use web resources in JavaScript, you are first of all hoping that the user’s computer and browser understand what you want, and you also open yourself up to all kinds of dangerous injections. JavaScript is not secure – every script executed in your page has the same rights. If you load third party content with JavaScript and you don’t filter it very cleverly, the maintainers of the third party code can inject malicious code that will allow them to steal information from your server and log in as your users or as you.

And why the C64 thing?

Well, the lads behind cURL actually used to do demos on C64 (as did I). Just look at the difference:

(Screenshot: Horizon demo, 1990)

(Screenshot: haxx.se, 2000)

Developer Evangelism book update – new chapter on writing slides, new print version

Tuesday, December 15th, 2009

Yesterday I spent my evening updating the Developer Evangelism Handbook:

(Screenshot: the new cover)

The updates include the new chapter on writing slides and the new print version.

The rest remains the same:

Developer Evangelism is a new kind of role in IT companies. This is the handbook on how to be successful in it.

A developer evangelist is a spokesperson, mediator and translator between a company and both its technical staff and outside developers. If you think this would be a good role for you, here’s the developer evangelist handbook which gives you some great tips on how to do this job.

Using the handbook you’ll learn how to:

  • Find great web content and promote it.
  • Write for the web and create engaging code examples.
  • Use the web and the social web to your advantage to reach, research and promote.
  • Prepare and deliver great presentations.

My five favourite apps to get my job done

Monday, December 14th, 2009

I was asked by .net magazine to list my five most used applications at the moment – and why not publish my results here, too? So here is what I use to deliver my day-to-day job as a developer evangelist:

(Screenshot: ForkLift)

Krusader on Linux, Total Commander on Windows or ForkLift on Mac – these are the tools I use the most. I do a lot of work with files on various servers and S3 accounts, and I am mostly a keyboard user. I love being able to use any web resource like an external hard drive, use ZIPs like folders, and not getting asked if I really want to do this when deleting a file. The other thing all of these do well is give me quick access to the command line in the folder I am in for heavy duty file editing, and being able to see two resources side by side makes it easy to trace duplicates and diff two folders.

(Screenshot: TextMate)

TextMate is my weapon of choice for any editing – including writing. It is very fast, extremely extensible and doesn’t distract me with lots of menus and panels. Keyboard shortcuts and tab completion make it very fast to edit in, and the ability to run scripts from your editor makes it very powerful – for example to create documentation automatically from a folder of files.

(Screenshot: Skitch)

Skitch kept me from waiting for Photoshop to open many a time. For quick screenshots, simple editing and annotation it does a great job. Its integration with Flickr also makes it very useful to quickly annotate something and show it to the world.

(Screenshot: iShowU)

(Screenshot: MPEG Streamclip)

(Screenshot: VLC)

iShowU is probably the easiest (and cheapest) screencasting tool I’ve ever used. When I still used Windows I used Camtasia, and it feels very clunky in comparison. Together with MPEG Streamclip for simple editing of the final video and VLC for filtering and encoding tasks, this makes your life much easier when your job is to quickly show some functionality to the world.

(Screenshot: Audacity)

As I want to record my talks, and as screencasts turn out better when you don’t act and talk at the same time but instead dub the movie afterwards, Audacity is my tool for any audio editing and mixing. It is amazing just how much you can do with an open source tool these days – tasks that in my days as an audio producer were only possible with ProTools.

That’s the five. Of course I use more and for other specific tasks (Keynote and Adium and DropBox come to mind), but I was asked for five, so here you go.

Building a (re)search interface for Yahoo, Bing and Google with YQL

Wednesday, December 9th, 2009

If you do a lot of research, using web searches can be frustrating. Different search engines have different results, you need to open things in tabs, and in general it can be pretty time-consuming to find what you need.

To make this a bit easier I thought it’d be cool to have an interface that searches Yahoo, Google and Bing at the same time and thus I built GooHooBi:

As explained in the screencast, the thing under the hood of GooHooBi is YQL. Instead of fussing about with all the different search APIs, all I did was play with the YQL console and put together the following YQL statement:

select * from query.multi where queries='
select Title,Description,Url,DisplayUrl
from microsoft.bing.web(20) where query="cat";
select title,clickurl,abstract,dispurl
from search.web(20) where query="cat";
select titleNoFormatting,url,content,visibleUrl
from google.search(20) where q="cat"
'

The query.multi table in YQL allows you to list a few queries which will be executed one after the other on the YQL server. The queries themselves I put together by using the different tables in the console and only selecting what I really need from each of them.

You can try this query in the YQL console and you can see the JSON output.

The rest is pretty easy. Cut this up into a parameterized string and do a cURL call:

$query = filter_input(INPUT_GET, 'search', FILTER_SANITIZE_SPECIAL_CHARS);

$queries[] = 'select Title,Description,Url,DisplayUrl '.
'from microsoft.bing.web(20) where query="'.$query.'"';
$queries[] = 'select title,clickurl,abstract,dispurl '.
'from search.web(20) where query = "'.$query.'"';
$queries[] = 'select titleNoFormatting,url,content,visibleUrl '.
'from google.search(20) where q="'.$query.'"';
$url = "select * from query.multi where queries='".join(';', $queries)."'";
$api = 'http://query.yahooapis.com/v1/public/yql?q='.
urlencode($url).'&format=json&env=store'.
'%3A%2F%2Fdatatables.org%2Falltableswithkeys&diagnostics=false';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $api);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
$data = json_decode($output);

Then loop over the results and assemble the HTML output.
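
That last step can be sketched in JavaScript (the exact nesting of a query.multi response is an assumption here; the field names come straight from the YQL statement above): each engine labels its results differently, so it pays to normalise them into one shape before building the HTML.

```javascript
// Normalise Bing, Google and Yahoo result objects (their field
// names differ) into one common {title, url, abstract} shape.
function normalise(results) {
  return results.map(function (r) {
    if ('Title' in r) {                // Bing field names
      return { title: r.Title, url: r.Url, abstract: r.Description };
    }
    if ('titleNoFormatting' in r) {    // Google field names
      return { title: r.titleNoFormatting, url: r.url, abstract: r.content };
    }
    // Yahoo field names
    return { title: r.title, url: r.clickurl, abstract: r.abstract };
  });
}
```

With that in place, one loop over the normalised array can emit the same markup for all three engines.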

You can check the source code of GooHooBi.

In addition to this, here’s a half-hour live coding screencast on how to build something similar:

Building a search mashup with YQL using Google, Yahoo and Bing – live :) from Christian Heilmann on Vimeo.

The source of the code built in this screencast is also on GitHub.