Christian Heilmann

Posts Tagged ‘proxy’

Loading external content with Ajax using jQuery and YQL

Sunday, January 10th, 2010

Let’s solve the problem of loading external content (on other domains) with Ajax in jQuery. All the code you see here is available on GitHub and can be seen on this demo page so no need to copy and paste!

OK, Ajax with jQuery is very easy to do – like most solutions it is a few lines:

$(document).ready(function(){
  $('.ajaxtrigger').click(function(){
    $('#target').load('ajaxcontent.html');
  });
});

Check out this simple and obtrusive Ajax demo to see what it does.

This will turn all elements with the class of ajaxtrigger into triggers to load “ajaxcontent.html” and display its contents in the element with the ID target.

This is terrible, as it most of the time means that people will use pointless links like “click me” with # as the href, but this is not the problem for today. I am working on a larger article with all the goodies about Ajax usability and accessibility.

However, to make this more re-usable we could do the following:

$(document).ready(function(){
  $('.ajaxtrigger').click(function(){
    $('#target').load($(this).attr('href'));
    return false;
  });
});

You can then use load some content to load the content and you make the whole thing re-usable.

Check out this more reusable Ajax demo to see what it does.

The issue I wanted to find a nice solution for is the one that happens when you click on the second link in the demo: loading external files fails as Ajax doesn’t allow for cross-domain loading of content. This means that see my portfolio will fail to load the Ajax content and fail silently at that. You can click the link until you are blue in the face but nothing happens. A dirty hack to avoid this is just allowing the browser to load the document if somebody really tries to load an external link.

Check out this allowing external links to be followed to see what it does.

$(document).ready(function(){
  $('.ajaxtrigger').click(function(){
    var url = $(this).attr('href');
    if(url.match('^http')){
      return true;
    } else {
      $('#target').load(url);
      return false;
    }
  });
});

Proxying with PHP

If you look around the web you will find the solution in most of the cases to be PHP proxy scripts (or any other language). Something using cURL could be for example proxy.php:

<?php
$url = $_GET['url'];
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
$output = curl_exec($ch); 
curl_close($ch);
echo $content;
?>

People then could use this with a slightly changed script (using a proxy):

$(document).ready(function(){
  $('.ajaxtrigger').click(function(){
    var url = $(this).attr('href');
    if(url.match('^http')){
      url = 'proxy.php?url=' + url;
    }
    $('#target').load(url);
    return false;
  });
});

It is also a spectacularly stupid idea to have a proxy script like that. The reason is that without filtering people can use this to load any document of your server and display it in the page (simply use firebug to rename the link to show anything on your server), they can use it to inject a mass-mailer script into your document or simply use this to redirect to any other web resource and make it look like your server was the one that sent it. It is spammer’s heaven.

Use a white-listing and filtering proxy!

So if you want to use a proxy, make sure to white-list the allowed URIs. Furthermore it is a good plan to get rid of everything but the body of the other HTML document. Another good idea is to filter out scripts. This prevents display glitches and scripts you don’t want executed on your site to get executed.

Something like this:

<?php
$url = $_GET['url'];
$allowedurls = array(
  'http://developer.yahoo.com',
  'http://icant.co.uk'
);
if(in_array($url,$allowedurls)){
  $ch = curl_init(); 
  curl_setopt($ch, CURLOPT_URL, $url); 
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
  $output = curl_exec($ch); 
  curl_close($ch);
  $content = preg_replace('/.*<body[^>]*>/msi','',$output);
  $content = preg_replace('/<\/body>.*/msi','',$content);
  $content = preg_replace('/<?\/body[^>]*>/msi','',$content);
  $content = preg_replace('/[\r|\n]+/msi','',$content);
  $content = preg_replace('/<--[\S\s]*?-->/msi','',$content);
  $content = preg_replace('/<noscript[^>]*>[\S\s]*?'.
                          '<\/noscript>/msi',
                          '',$content);
  $content = preg_replace('/<script[^>]*>[\S\s]*?<\/script>/msi',
                          '',$content);
  $content = preg_replace('/<script.*\/>/msi','',$content);
  echo $content;
} else {
  echo 'Error: URL not allowed to load here.';
}
?>

Pure JavaScript solution using YQL

But what if you have no server access or you want to stay in JavaScript? Not to worry – it can be done. YQL allows you to load any HTML document and get it back in JSON. As jQuery has a nice interface to load JSON, this can be used together to achieve what we want to.

Getting HTML from YQL is as easy as using:

select * from html where url="http://icant.co.uk"

YQL does a few things extra for us:

  • It loads the HTML document and sanitizes it
  • It runs the HTML document through HTML Tidy to remove things .NETnasty frameworks considered markup.
  • It caches the HTML for a while
  • It only returns the body content of the HTML - so no styling (other than inline styles) will get through.

As output formats you can choose XML or JSON. If you define a callback parameter for JSON you get JSON-P with all the HTML as a JavaScript Object – not fun to re-assemble:

foo({
  "query":{
  "count":"1",
  "created":"2010-01-10T07:51:43Z",
  "lang":"en-US",
  "updated":"2010-01-10T07:51:43Z",
  "uri":"http://query.yahoo[...whatever...]k%22",
  "results":{
    "body":{
      "div":{
        "id":"doc2",
        "div":[{"id":"hd",
          "h1":"icant.co.uk - everything Christian Heilmann"
        },
        {"id":"bd",
        "div":[
        {"div":[{"h2":"About this and me","
        [... and so on...]
}}}}}}}});

When you define a callback with the XML output you get a function call with the HTML data as string in an Array – much easier:

foo({
  "query":{
  "count":"1",
  "created":"2010-01-10T07:47:40Z",
  "lang":"en-US",
  "updated":"2010-01-10T07:47:40Z",
  "uri":"http://query.y[...who cares...]%22"},
  "results":[
    "<body>\n    <div id=\"doc2\">\n<div id=\"hd\">\n 
     <h1>icant.co.uk - \n
     everything Christian Heilmann<\/h1>\n 
      ... and so on ..."
  ]
});

Using jQuery’s getJSON() method and accessing the YQL endpoint this is easy to implement:

$.getJSON("http://query.yahooapis.com/v1/public/yql?"+
          "q=select%20*%20from%20html%20where%20url%3D%22"+
          encodeURIComponent(url)+
          "%22&format=xml'&callback=?",
  function(data){
    if(data.results[0]){
      var data = filterData(data.results[0]);
      container.html(data);
    } else {
      var errormsg = '<p>Error: can't load the page.</p>';
      container.html(errormsg);
    }
  }
);

Putting it all together you have a cross-domain Ajax solution with jQuery and YQL:

$(document).ready(function(){
  var container = $('#target');
  $('.ajaxtrigger').click(function(){
    doAjax($(this).attr('href'));
    return false;
  });
  function doAjax(url){
    // if it is an external URI
    if(url.match('^http')){
      // call YQL
      $.getJSON("http://query.yahooapis.com/v1/public/yql?"+
                "q=select%20*%20from%20html%20where%20url%3D%22"+
                encodeURIComponent(url)+
                "%22&format=xml'&callback=?",
        // this function gets the data from the successful 
        // JSON-P call
        function(data){
          // if there is data, filter it and render it out
          if(data.results[0]){
            var data = filterData(data.results[0]);
            container.html(data);
          // otherwise tell the world that something went wrong
          } else {
            var errormsg = '<p>Error: can't load the page.</p>';
            container.html(errormsg);
          }
        }
      );
    // if it is not an external URI, use Ajax load()
    } else {
      $('#target').load(url);
    }
  }
  // filter out some nasties
  function filterData(data){
    data = data.replace(/<?\/body[^>]*>/g,'');
    data = data.replace(/[\r|\n]+/g,'');
    data = data.replace(/<--[\S\s]*?-->/g,'');
    data = data.replace(/<noscript[^>]*>[\S\s]*?<\/noscript>/g,'');
    data = data.replace(/<script[^>]*>[\S\s]*?<\/script>/g,'');
    data = data.replace(/<script.*\/>/,'');
    return data;
  }
});

This is rough and ready of course. A real Ajax solution should also consider timeout and not found scenarios. Check out the full version with loading indicators, error handling and yellow fade for inspiration.