Christian Heilmann

Posts Tagged ‘web’

Witnessing the death of the web as a news medium

Monday, June 3rd, 2024

As some of you may know, I started out as a radio journalist. And when I discovered the web in around 1996, I knew that, to me, radio and TV were not the dominant news media any longer. Nowhere but on the web was it possible to research and cross-reference from dozens of resources with various origins. You could directly access the press agencies for news without having to read the politically tainted or sensationalist derivatives in various outlets.

The amazing thing was the humble link. And as the cool kids said back then, “Cool URIs don’t change”. The powers of the web were:

  • being able to link to other resources,
  • remixing, and
  • bookmarking for later use.

In other words, the web was about retention and accumulation of content. An ever growing library that by its very nature was self-indexing and cross-referencing. And this is what is being actively killed these days. But let’s go back a bit before I start focusing on that problem. Let’s take a peek at the slow decline of the web as a news medium.

Coffin Nail 1: Publishing for “free” comes at a price

The great thing about the web was that everyone could become a publisher and let their voice be heard. Finding places to write and create web pages was easy. But many of them were also short-lived, and we learned the hard way – when, for example, Geocities shut down – that “free” didn’t mean “yours forever online”.

Coffin Nail 2: Moving to tagging and commenting

When “web 2.0” became a thing, the publishing model got turned on its head. Instead of writing in a publication of your own, the idea was to comment and do smaller posts on a topic, linking to resources, or adding a funny image without alternative text. Cumulatively adding to threads, so to say. A bit reminiscent of Bulletin Boards or Forums, but with less focus.

At that time I worked on various social media ideas at Yahoo, then one of the main sources for people’s daily news, replacing daily papers. The model of Yahoo and others back then was simple: buy news content, spruce it up a bit and show ads around it.

Even then some dark patterns evolved, like splitting longer content into carousels and pagination, not for the sake of the user, but to record yet another click. Clicks and interactions mean ad displays; reading was kind of a necessary evil from a monetisation point of view. This is also when the first ideas of creating sticky, viral and – let’s call it by its real name – addictive and lock-in content came up. Something we have perfected now, but still wanted to avoid back then.

Pulling Nails: approaching web 2.0 ethically

Back then “web 2.0”, or user generated content, was something we didn’t quite trust, and the biggest no-no was to create a product for a community for the sake of having one. This anti-pattern was called the Potemkin Village, after the fake villages people supposedly built for the empress to see when driving past, so she’d see growth where there wasn’t any.

In other words, the anti-pattern was building an empty product instead of growing a community. Without seeding it with some content, it was a non-starter. People are happy to comment on and add to something that already exists. Only a few are real content creators, and those were more likely to have a blog of their own.

Our ideas for creating social media products were simple:

  • We wanted to encourage human created answers and not machines spurting out data.
  • We wanted to encourage people to write high quality content and reward them for it.
  • We wanted to allow for human questions and dabbled with natural language processing.

And we found two important facts. People are much more likely to create content when you either:

  • start with an existing community and give them a space online or
  • when you put something at the centre of the social platform that people care about on an emotional level.

Facebook expanded on already existing university groups. LinkedIn and its European equivalent Xing were about finding a job and telling people where you work, so it was convenience rather than an emotional bond.

The “new” factor was also a big one. Delicious, for example, was thriving, with people bookmarking, describing and tagging resources and sharing them with friends. Yahoo Bookmarks did a similar thing, but without a focus on the social aspect. It also sported the already dusty Yahoo brand, which didn’t attract the cool, new content creators.

Flickr was about photos, something people care about deeply. Upcoming was about events, Ravelry was about knitting, Dopplr was about sharing your travel plans with friends. Things were interesting, and the goodwill and effort the community put into tagging, cleaning up and categorising content for others was fun to see. It was all a validation of our assumption about the emotional core of a good social network.

One big thing that was also part of this was that every product had a data API that allowed you to create mashups with the information and empowered techies to find new use cases for it. I even wrote a book about this with a colleague that took off like a lead balloon – but that’s another story.

We lost spectacularly with that approach, as it was about the people, not about the quick success and the money it made.

Then came the times of micro blogging, with Tumblr leading the charge, but also MySpace, Bandcamp and many more. Still, there was a semblance of something emotional at their core and people used these systems as their virtual homes and identity. But, there was already a “fire and forget” mentality that came with it. People didn’t expect these things to have an edit history, and they kept getting changed. Maybe people were burned by Geocities’ demise, but one thing these places on the web were not – lasting.

Coffin Nail 3: Time-sensitive content

Another thing that soon became apparent is that a lot of content became time sensitive – or, well – created with a defined expiry date in mind.

This has always been the case in the creative arts. When web design started to be a thing, getting to design a web site for a movie or a festival was a carte blanche to go wild and push the limits of the platform. You knew that the product had a fixed life span and that nobody would give a monkey’s in a month’s time.

It got trickier when news outlets did the same. I remember when the Guardian and the BBC offered full access to their archives. I even remember when other newspapers’ and news aggregators’ content was available to remix. But soon any news content older than 30 days was deleted from the web and you had to rely on Google Cache or The Internet Archive’s Wayback Machine to quote content published a month ago.

Publishers started realising that throwing out more, short-lived, dramatic content is how you get the clicks. And this is what it was all about.

Coffin Nail 4: Search Engine Optimisation

The next deep cut to the web as a publication medium was search engine optimisation. Sites stopped linking to other sites, and instead started to link to their own, search engine optimised archive and overview pages to keep the users in the system.

I’ve always hated that. “Politician X did this which is related to Y” with Y being a keyword linked to an older publication on the same platform. This is not a citation or verification – it is a waste of my time.

As a content creator with old, well indexed content you keep getting offers to add links to boost, frankly, content-less pages that are ads or products. I get about 20 of them a week, some praising my “great content” and quoting an archive page. It’s insulting and a waste of time.

Real black hat SEO went further, generating link farms, SEO-optimised blogs and fake sites all linking to one another. It was the first indication of the journey towards content being created for bots and crawlers rather than for humans. All of this happened because the only monetisation model of the web that really worked and brought big money was ads. And this led to the next problem.

Coffin Nail 5: The Ad blocker arms race

Meanwhile, content sites do cost money, so you need to get it from somewhere. Subscription models are tricky and don’t really translate from printed newspapers to online. So, publishers did what they knew – they displayed ads. First subtly, then almost unbearably so.

Having a blaring, auto-playing video advertising an SUV is not as sleazy as a popup on more, uhm, exotic content sites about male enhancement products, but it’s technically the same thing. Other ads and platforms like Facebook were even more intrusive and followed you around the web. People adding an official “Share to Facebook” button to their sites meant you were being spied on.

This led privacy and security advocates to build browser add-ons that would remove intrusive ads and third party includes.

The practical upshot of that for everyone was that ads were removed from the pages and all the annoyances were gone. And you could even claim that you used ad blockers to protect your privacy, and not because you want to have everything for free. The users won.

But the market has a penchant for fighting back.

Ads became even more intrusive, embedded into media like images and videos. Many sites tried the adorable approach of detecting ad blockers and telling people to please not use them. Others made their products dependent on JavaScript delivered from the same CDNs as ads, thus breaking the experience, adding overhead, and making things less resilient for every user.

Things cost money, and instead of trying to find better ways to make people pay for online content they think is worth it, the market did something much, much worse.

Coffin Nail 6: The death of web search

Search is big money, browsers are expensive and loss-leaders. Chrome exists to advertise Google content and send you to Google search. Microsoft Edge exists to give you MSN and Bing content.

When these services stopped making money because people used ad blockers, every commercial search engine started showing you lots of ads disguised as search results, or products similar to the thing you might have looked for.

I wrote about that some time ago, in The web starts on page four essay (yes, I am linking to my blog. No, it is not ironic, and I will throw 10,000 spoons at anyone who claims so).

This got even worse. I like to research things I want to buy. So, if I add a product name with a size and a code like “Fred Perry Polo M3600 Black L”, I do not want a Lacoste Polo, no matter how much money they pay your search engine.

Web search has become a shopping mall rather than returning links from the web. There is a URL hack you can use in Google to only get web results, but I wouldn’t be surprised if that went away soon.

The fact is that indexing has become less important. 38% of webpages that existed in 2013 are no longer accessible. Longevity isn’t a goal anymore, it seems.

Coffin Nail 7: Social media optimisation

Then came the big era of social media. A misnomer, as there isn’t much social about it. Twitter, Facebook, and others had a social aspect, for sure, but once they became mainstream media contenders, they soon became weaponised. First, to sell lots of stuff (remember that Facebook is also Instagram, WhatsApp…), and second, to change people’s opinions. The Cambridge Analytica kerfuffle showed that by having an addiction machine and keeping people in their bubble you can do much more than the Nazis ever could by giving people affordable radio sets.

The immediacy and ephemeral nature of social media these days is the equivalent of virtual cocaine. It’s fast, it promises glamour, and people get dependent on it without realising it, as it all makes so much sense to them. Many studies show that the more outrageous and borderline illegal content gets, the more people interact with it. Even when they are disgusted by it.

For an interesting example, take a look at the viewing numbers of pimple squeezing videos on YouTube, which do exceed fetishists’ consumption by far. And these are the least distasteful things that make up short lived successes. We had a period like that on the earlier Internet, too, with sites like Stileproject and Ogrish leading the charge.

Once search engines started returning Twitter posts ahead of web pages or news content, we went down a slippery slope. And lately this has moved into utterly manipulative territory.

Almost every social platform now ranks posts with links lower than posts that are just statements. Posts that put that statement in an image (often without any alt text) rank even higher. It’s a middle finger to the web we thought about creating. The global read and write library.

Personal opinion and shock factor trump statements with links to verify them. Welcome to a pub full of drunk folk spouting half-arsed knowledge and getting their mates to gang up on you when you try to point out obvious flaws in a statement.

Moving to disposable content

The problem with immediacy and going for more atomic content creation is that there is no track record. My blog has been indexed and spread far and wide since I started it in 2005. Whatever I put on Twitter over the years is either hard to find or lost. And this is not something the market laments. Instead, it is seen as a thing a new generation of users crave and want. Is that demand manufactured? Are we controlling a new generation of people by shoving them into a perfect addiction machine like TikTok for the sake of keeping them occupied? Or is this really where media goes?

Machine generated, optimised, boring and immediately forgotten

ChatGPT was a roaring success, and people are so scared of missing the boat that everything gets “AI” shoved into it right now. Google messed up badly by indexing content from Reddit, a platform with a history of fun “Wrong answers only” posts, creating summaries that sound excellent and invite us to keep chatting with a bot full of nonsense. They now say they fixed it by favouring less funny and viral content. Facebook started doing the same, and so does Bing.

When did you ever get a single answer from a human that made you happy without further questions? AI powered summaries are like hitting “I’m Feeling Lucky” back in the day on Google. Even back then we overestimated the quality of the algorithm. And soon SEO players took that on to get their results in first place – no matter their validity.

We face a web right now that is machine generated content for bots to consume and throw us humans tidbits that sound solid, but are based on decades of random content added to the web to answer a quick “how”, but not the “why” behind it. It pains me to see the opportunity that was the web squandered like this.

There are counter movements, and nobody can stop you from publishing long form, great content. And maybe that’s a reward in itself. I feel better for having this written down. And I don’t care if it goes viral, gets quoted by people cleverer than me, or gets media fame.

But I had the power to throw it out. A power the web gives everyone. For now. So let’s think how we can make that remain an option.

Stumbling on the escalator

Thursday, February 16th, 2012

I am always amazed at the lack of support for progressive enhancement on the web. Whenever you mention it, you face a lot of “yeah, but…” and you feel you have to defend something that should be ingrained in the DNA of anyone who works on the web.


When explaining progressive enhancement in the past, Aaron Gustafson and I quoted the American stand-up comedian Mitch Hedberg and his escalator insight:

An escalator can never break – it can only become stairs. You would never see an “Escalator Temporarily Out Of Order” sign, just “Escalator Temporarily Stairs. Sorry for the convenience. We apologize for the fact that you can still get up there.”

This is really what it is about. Our technical solutions should be like escalators – they still work when the technology fails or there is a power outage (if you see CSS animations and transformations and transitions and JavaScript as power) – but they might be less convenient to use. Unlike real world escalators we never have to block them off to repair them.

We could even learn from real-world escalators that shut down when nobody uses them for a while and start once people step on them. On the web, we call this script loading or conditional application of functionality. Why load a lot of images up front when they can’t be seen as they are far away from the viewport?
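The image example can be sketched in a few lines of JavaScript. This is a hypothetical sketch, not code from any product – `lazyLoadImages` and the `data-src` convention are made-up names – and the fallback path keeps the escalator principle: when IntersectionObserver isn’t supported, everything simply loads up front.

```javascript
// Hypothetical sketch of conditional functionality: images only load
// when they come near the viewport. When IntersectionObserver is not
// supported we fall back to loading everything - stairs, not a broken
// escalator.
function lazyLoadImages(images, ObserverCtor) {
  if (typeof ObserverCtor !== 'function') {
    // No support: load all images right away.
    images.forEach(function (img) { img.src = img.dataset.src; });
    return;
  }
  var observer = new ObserverCtor(function (entries) {
    entries.forEach(function (entry) {
      if (entry.isIntersecting) {
        entry.target.src = entry.target.dataset.src;
        observer.unobserve(entry.target);
      }
    });
  });
  images.forEach(function (img) { observer.observe(img); });
}

// In a browser you would call it along these lines:
// lazyLoadImages(
//   [].slice.call(document.querySelectorAll('img[data-src]')),
//   window.IntersectionObserver
// );
```

The important part is that the function degrades rather than breaks: the page works either way, it is just less optimised on older browsers.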

An interesting thing you can see in the real world is that when an escalator has broken down and become stairs, people stumble when they step onto it. Our bodies have been conditioned to expect movement and our motor memory does a “HUH?” when there isn’t any.

This happens on the web as well. People who never were without a fast connection or new and shiny computer or phone with the latest browsers have a hard time thinking about these situations – it just feels weird.


Another interesting thing is the horizontal walkways you find in airports. These are meant to accelerate your walking, not replace it. Still, you find people standing on them, complaining about their speed.

On the web these are the people who constantly complain about new technology being cool and all but they’d never be able to use it in their current client/development environment. Well, you don’t have to. You can walk in between the walkways and still reach the other side – it just takes a bit longer.

So next time someone praises flexible development and design practices and you have the knee-jerk reaction to condemn them for not using the newest and coolest (“everybody has a xyz phone and browser abc”), or you just don’t see the point in starting from HTML and getting to your goal by re-using what you structured and explained in HTML (“GMail and Facebook don’t do it either”), think about the escalator and how handy it is in the real world.

Think about it when you are tired (accessibility), or you carry a lot of luggage (performance) or when you just want to have a quick chat whilst being transported up without getting out of breath. Your own body has different needs at different times. Progressively enhancing our products allows us to cater for lots of different needs and environments. Specialising and optimising for one will have a much more impressive result, but for example a lift is pointless when it doesn’t work – no matter how shiny and impressive it looks.

Our job is to make sure people can do the things they went online for – get from their start to their desired goal. This could be convenient and fast or require a bit of work. Our job is to make sure that people do not get the promise of a faster and more convenient way that fails as soon as they try taking it.

You can comment on Google Plus if you want to.

Seven things I want to see on the web in 2011

Sunday, January 2nd, 2011

As I’ve been particularly nice all year, I think I deserve to be allowed to have a wish list of things that should change on the web in 2011. So here is what I want to see:

  1. HTML5 everywhere
  2. Death of the password (antipattern)
  3. Backup APIs
  4. Focus on Security
  5. Governments embracing the web instead of fighting it
  6. Cloud based apps with sharing facilities
  7. More hardware-independent interface innovation

1. HTML5 everywhere

Those who know about my new job have heard that I am putting my full energy this year into making HTML5 the replacement for the hacky efforts we go through right now to write web applications. I want 2011 to be the year HTML5 turned mainstream:

  • I want amazingly beautiful and useful software to be built and put in front of the luddites of the web who force their users to have IE6 and not support any other browser.
  • I want to use native form controls like date pickers in travel web sites and finance sites.
  • I want every video on the web to be open and I want to be able to save it with a link and manipulate it without having to re-encode it.
  • I want collaborative software to use web sockets (once the protocol has been fixed) and I want to see web workers used to avoid interfaces grinding to a halt when some calculation needs to be done.
  • I want online converters that use the cloud to make video conversion into open formats dead easy – I also want to have a subtitling format for that.
  • I want interfaces to natively be progressively enhanced by using the same widgets server and client side.
  • I want systems to use Geolocation and Local Storage to be responsive and clever in getting and storing my information rather than having to enter the same data over and over again.

And I want a donkey and a happy puppy to play with – but that’s a different story.

2. Death of the password (antipattern)

I hate memorising passwords. Everybody does. The recent hack of Gawker Media, for example, showed that people use amazingly clever passwords like “1234567” or “lifehacker” (on no less) instead of choosing one that is safer but harder to remember.

As I don’t use the same password everywhere and I don’t like staying logged in at sites I don’t frequently use, a huge chunk of my online life right now is resetting my password. Not fun – but neither is typing in very complex passwords on my mobile.

There are alternatives. Using Facebook, Twitter, Google and Yahoo and OAuth you can allow people to sign in to your site – without having to remember another password or do the dance of going from your site to email and back. Using OpenID you can allow people to use their homepage as their login. These systems also have the benefit that you can tap into the social identity of the users on these systems rather than asking for the same data over and over again. I would love more people to use them in 2011 rather than slavishly sticking to the old idea of having to collect user data on your own system. This is the web – use it.

3. Backup APIs

The recent involuntary announcement that Yahoo is under the hammer (or halfway in the blender) makes it obvious that nothing is safe to use in the long term (I will write a longer article about this, as the Yahoo bashers also live in a dream world, IMHO).

Therefore I would love to see startups and API providers always offer a backup API in addition to the normal read/write/update APIs. If I don’t like a system any longer, it should be easy for me to take all the data I spent a lot of time and effort on over the years with me. Dopplr was a great example of doing this right. In the current rush for more and more realtime web apps we forget that backups are important and simply the decent thing to offer our users.

4. Focus on Security

Yeah, I get it – we need to innovate. We need to innovate hard, cause only the ones with the cool new features every week are the ones who win. Rah Rah Rah.

I disagree though that innovation means sacrificing security and this is what happens all over the place now. Hell, I’ve even heard speakers at startup conferences say that security can come later and privacy is not an issue really. That is bullshit, and anyone with half a technical mind should know it.

The web is a mess right now, and it doesn’t have to be. Storing data unencrypted, transporting identity in clear text over HTTP, XSS vulnerabilities, backdoors and SQL injection are not misdemeanours – they are just sloppy development and will bite you in the arse sooner or later. Sure, Facebook can pay for a lawsuit from people getting their identity stolen. Can your startup?

I dread the day when stealing online identities becomes as profitable as credit card fraud and when the organised crime institutions of this world start targeting it. If we want the web to be awesome, we have to make it secure. Otherwise other people will try to solve the security issues for us – and boy are they clueless, which brings me to the next wish.

5. Governments embracing the web instead of fighting it

Wikileaks was a very necessary incident this year. There is information out there that is kept from us. True, a lot of times knowledge can be dangerous and some information should be kept away from people who don’t know how to read or handle it properly. The same piece of information can be displayed in one way or another to cause one emotion or another – this is what TV is for.

However, if there is one thing that Wikileaks showed, it is that the people who should have all the knowledge are not necessarily the governments. They’ve proven before that a lot of classified information gets lost by leaving laptops and printouts on trains.

One thing that is less mentioned is that Wikileaks showed the web is an incredibly efficient medium to distribute information and get people to defend your cause. LOIC and the attacks on Visa and Mastercard showed that you can leverage the power of every user out there and make their computer part of a cause – even without them knowing much about computers. Right now only the baddies do that – zombie botnets and viruses.

How about a government programme that allows every citizen to download some data and crunch through it for the state? How about making the job of creating a more efficient state the job of every citizen? If you censor people, you have them against you. If you are open in your communication and share the challenges and ask for help you make people your allies.

Instead of seeing this obvious opportunity governments right now are afraid of the web and try to control it – in essence turn the read+write media that is the web into a lame consumption channel much like TV.

Recently the UK proposed removing pornography from the internet unless you contact your ISP beforehand to say that you want to consume it. I am hard pushed to find a lamer excuse for monitoring people’s online behaviour. I am also hard pushed to even fathom how that would work. Are Rubens’ paintings of huge naked ladies pornography? What is that file called qweaasdwewweq.part2.rar on Rapidshare or Hotfile? Sure, pornography sites that rated themselves with a meta tag are simple to block, but surely if you want to remove porn from the web you also have to block Blogger and any other simple publication platform people use to store naughty pictures or links to rar-ed full movies. Or maybe that is actually the end goal?

6. Cloud based apps with sharing facilities

I have not gotten my Google laptop yet (I asked for one though, let’s see if that works out) but I love the idea of not having to install anything on my computer any longer. When I joined Mozilla I was amazed that the company laptop came completely empty (I was also amazed just how much information Apple wants to know about you when you install OSX). The reason is that everything the company does is online.

We use Zimbra for our mail, BaseCamp, Google Docs, Etherpad and some others. This rocks, and it would rock even more if cloud based systems talked more to each other:

  • Instead of sending a URL to someone to open a Google Doc, why not have it as a virtual attachment that allows me to save it as a PDF for on-the-go reading directly from a mail client?
  • Why can’t I just upload a movie to S3 and it automatically creates embeddable WebM versions for me?
  • There are some very cool image editing tools on the web now, but where are the video editors? (yes, there was Jumpcut, but it got the old yeller treatment by Yahoo).
  • We need some cool SVG editors online, which could convert other path-based formats on the go.
  • We need better editors for HTML5 content and put them in the cloud rather than install them locally.
  • We need a good web-standards-based slide system which allows us to sync video and audio easily.
  • We need a web based version control system that handles textual and binary data without requiring you to know your way around a CLI. The HTML5 File API could be used for that.
  • Why are all expense and travel systems in the style of the 90s? Why can’t I just link my online bank transaction PDF to an invoice system and tick the ones I spent for the company to get the money back?
  • Why don’t systems use the new technologies we have right now to allow for storing data locally and offline?

In other words, we use web based systems but we forget that they could talk to each other and have much more to play with in browsers than we had in the 90s. Many a time I had to create a PDF and attach it to an email so someone in the expenses department could copy and paste from it into another system. That is just wasted time and duplicated effort. Once things are digital they can be re-used.

A lot of cool online systems are in place already, now it would be great to build some collaboration frameworks that allow me to sync them and connect them. There are some very cool things in the making right now – let’s hope this year will be the one where they become industrial strength and get a lot of use.

7. More hardware-independent interface innovation

2010 was the year of hardware innovation. Apple’s iPad, iPhone and Android systems leapfrogged the old grey huge boxes, and netbooks and sub-notebooks made us much more mobile than ever before. Small screens and touch interfaces bring up new and exciting challenges and mean that we should question some of the “standards” we use right now (the best example being lightboxes, which are simply awful to use on a mobile).

However, instead of taking these learnings and simplifying all interfaces, we build hardware specific solutions. A lot of the CSS innovation done by Apple is very much targeted at iPad solutions, and it will take other browsers some time to take these on – especially when nobody requests that browser vendors do so.

When the iPad came out, people asked me whether I would now change all my sites to work for it. No, I won’t. I will tweak them to work with it alongside all the other systems out there, but I fail to see why I would want to leave out hundreds of millions of users of the web who do not have an iPad.

So instead of tweaking our designs and interfaces to cater for one single solution, I would love to see original patterns being enhanced and changed according to new use cases. Hardware is fleeting and changing. Patterns stay.


That’s it! I have a few more requests (like free wireless connectivity at public spaces instead of charging 10 Euro for half an hour like this friggin airport does) but for the web, this would make an awesome 2011. Let’s get to it!

Diving into the web of data – the YQL talk at boagworld live 200

Friday, February 12th, 2010

I just finished a quick podcast demo for the 200th podcast of Boagworld, streamed live on ustream. I thought I had an hour but it turned out to be half an hour. My topic was YQL and I wanted to actually do something like shown in the video that was just released (click through to watch or download the video in English or German):

YQL and YUI video

The story of this is:

  • We spend a lot of time thinking about building the interface and using the right semantic markup and trying to make browsers work (or expect certain browsers in certain settings as we gave up on that idea)
  • What we should be concentrating on more is the data that drives our web sites – it is boring to have to copy and paste texts from Word documents or get a CMS to generate something that is almost but not quite like useful HTML.
  • You could say that once published as HTML the data is available, but for starters HTML4 is bad as a data format to store information. Furthermore, too many pieces of software can access web sites, and the cleanest HTML you release somewhere can be messed up by somebody else with a CMS or any other means of access further down the line. Sadly enough, most content editing software still produces HTML that is tied to its presentation rather than to what it should structure and define.
  • Having worked on the datasets provided by the UK government, I’ve realised that we are nowhere near ready as a market to provide re-usable and easily convertible data to each other. XML was meant to be that but got lost in complexities of dictionaries, taxonomies and other things that you can spend days on to define English content but have to re-think anyway once you go multilingual. Most content – let’s face it – is maintained in Excel sheets and Word documents – which is OK, because people should not be forced to use a system they don’t like.
  • If you really think about the web as a platform and as a media then we should have simple ways to provide textual data or binary information (for videos and images) instead of getting bogged down in how to please the current generation of browsers.
  • If you really want to be accessible to any web user – and that is anyone who can get content over HTTP - you should think about making your content available as an API. This allows people to build interfaces necessary for edge cases that you didn’t even think existed.
  • YQL is a simple way to use the web as a database and mix and match data and also a very simple way to provide data in easy to digest formats – give it a go.
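
As a rough sketch of what that means in practice: a YQL statement is SQL-like text, URL-encoded into the `q` parameter of a REST call. The `query.yahooapis.com` endpoint shown here is the public one from back then (the service has since been retired), and the feed URL is a placeholder:

```shell
# Build a YQL REST call: the statement is SQL-like text, URL-encoded
# into the q parameter of the public endpoint of the time.
query='select title from rss where url="http://example.com/feed.xml"'
# minimal encoder covering only the characters used in this statement
encoded=$(printf '%s' "$query" | sed 's/ /%20/g; s/"/%22/g; s/=/%3D/g')
url="http://query.yahooapis.com/v1/public/yql?q=${encoded}&format=json"
echo "$url"
# a curl call to $url would then return the result as JSON
```

The point is that the whole query travels in one URL – which is exactly why it was so easy to demo live and to hand to people on flaky connections.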

In any case, after the 2 o’clock podcast – where most of my questions were eloquently answered by Jeremy Keith and the Skype connection died in the last five minutes – I spent the afternoon putting together some demos for this YQL talk, as YQL is easiest to explain with examples, and to have something for people on flaky connections to play with. So if you go to:

You can see what I talked about during the podcast. People in the chat asked if this will be open source. Yes it is, the passcode is pressing cmd+u in Firefox or whatever other way you choose to “view source” in your browser of choice.

Normally I would not do any of these calls purely in JavaScript (as explained in the video), but this was the quickest solution and it can give you an insight into just how easy it is to use information you have requested, filtered and converted with YQL.

cURL – your “view source” of the web

Friday, December 18th, 2009

What follows here is a quick introduction to the magic of cURL. This was inspired by the comment of Bruce Lawson on my 24 ways article:

Seems very cool and will help me with a small Xmas project. Unfortunately, you lost me at “Do the curl call”. Care to explain what’s happening there?

What is cURL?

OK, here goes. cURL is your “view source” tool for the web. In essence it is a program that allows you to make HTTP requests from the command line, and it also has implementations in many programming languages.

The cURL homepage has all the information about it but here is where it gets interesting.

If you are on a Mac or on Linux, you are in luck – you already have cURL. If you are operating system challenged, you can download cURL in different packages.

On the aforementioned systems you can simply go to the terminal and do your first cURL thing: load a web site and see its source. To do this, simply enter

curl ""

And hit enter – you will get the source of (that is the rendered source, like a browser would get it – not the PHP source code of course):

showing with curl

If you want the code in a file you can add a > filename.html at the end:

curl "" > myicantcouk.html

Downloading with curl

( The speed will vary of course – this is the Yahoo UK pipe :) )

That is basically what cURL does – it allows you to do any HTTP request from the command line. This includes simple things like loading a document, but it also allows for clever stuff like submitting forms, setting cookies, authenticating over HTTP, uploading files, faking the referer and user agent, setting the content type and following redirects. In short, anything you can do with a browser.
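
To give a rough idea of the flags behind those capabilities – the example.com URLs here are placeholders, and the one live call at the end uses a file:// URL so the sketch also works without a network connection:

```shell
# Common cURL flags (placeholder URLs, shown as comments):
# curl -d "name=value"          http://example.com/form   # submit a POST form
# curl -b "session=abc123"      http://example.com/       # send a cookie
# curl -u "user:password"       http://example.com/       # HTTP authentication
# curl -A "MyBrowser/1.0"       http://example.com/       # fake the user agent
# curl -e "http://example.com/" http://example.com/page   # fake the referer
# curl -L                       http://example.com/moved  # follow redirects

# A request that runs offline: cURL speaks file:// as well as http://
printf 'hello from curl' > /tmp/curl-demo.txt
curl -s "file:///tmp/curl-demo.txt"
```

The `-s` flag just silences the progress meter so you only see the response body.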

I could explain all of that here, but this is tedious as it is well explained (if not nicely presented) on the cURL homepage.

How is that useful for me?

Now, where this becomes really cool is when you use it inside another language that you use to build web sites. PHP is my weapon of choice for a few reasons:

  • It is easy to learn for anybody who knows HTML and JavaScript
  • It comes with almost every web hosting package

The latter is also where the problem is. As a lot of people write terribly shoddy PHP, the web is full of insecure web sites. This is why a lot of hosters disallow some of the useful things PHP comes with. For example, you can load and display a file from the web with readfile():

  // read a file from the web and echo its contents
  // (hypothetical example URL)
  readfile('http://example.com/somefile.txt');
Actually, as this is a text file, it needs the right header:

  header('content-type: text/plain');

You will find, however, that a lot of hosters will not allow you to read files from other servers with readfile(), fopen() or include(). Mine, for example:

readfile not allowed

And this is where cURL comes in:

// define the URL to load
$url = '';
// start cURL
$ch = curl_init(); 
// tell cURL what the URL is
curl_setopt($ch, CURLOPT_URL, $url); 
// tell cURL that you want the data back from that URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
// run cURL
$output = curl_exec($ch); 
// end the cURL call (this also cleans up memory, so it is 
// important)
curl_close($ch);
// display the output
echo $output;

As you can see, the options are where things get interesting – and the ones you can set are legion.

So, instead of just including or loading a file, you can now alter the output in any way you want. Say, for example, you want to get some Twitter stuff without using the API. This will get the profile badge from my Twitter homepage:

$url = '';
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
$output = curl_exec($ch); 
curl_close($ch);
$output = preg_replace('/.*(<div id="profile"[^>]+>)/msi','$1',$output);
$output = preg_replace('/<hr.>.*/msi','',$output);
echo $output;

Notice that Twitter’s HTML uses a table for the stats, where a list would have done the trick. Let’s rectify that:

$url = '';
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
$output = curl_exec($ch); 
curl_close($ch);
$output = preg_replace('/.*(<div id="profile"[^>]+>)/msi','$1',$output);
$output = preg_replace('/<hr.>.*/msi','',$output);
$output = preg_replace('/<\/?table>/','',$output);
$output = preg_replace('/<(\/?)tr>/','<$1ul>',$output);
$output = preg_replace('/<(\/?)td>/','<$1li>',$output);
echo $output;

Scraping stuff off the web is but one thing you can do with cURL. Most of the time, what you will be doing is calling web services.

Say you want to search the web for donkeys, you can do that with Yahoo BOSS:

$search = 'donkeys';
$appid = 'appid=TX6b4XHV34EnPXW0sYEr51hP1pn5O8KAGs';
$url = ''.$search.'?'.$appid;
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
$output = curl_exec($ch); 
curl_close($ch);
$data = simplexml_load_string($output);
foreach($data->resultset_web->result as $r){
  echo "<h3><a href=\"{$r->clickurl}\">{$r->title}</a></h3>";
  echo "<p>{$r->abstract} <span>({$r->url})</span></p>";
}
You can also do that for APIs that need POST or other authentication. Say for example to use Placemaker to find locations in a text:

$content = 'Hey, I live in London, England and on Monday '.
           'I fly to Nuremberg via Zurich, Switzerland (sadly enough).';
$key = 'C8meDB7V34EYPVngbIRigCC5caaIMO2scfS2t';
define('POSTURL',  '');
define('POSTVARS', 'appid='.$key.'&documentContent='.
                   urlencode($content));
$ch = curl_init(POSTURL);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, POSTVARS);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  
$x = curl_exec($ch);
curl_close($ch);
$places = simplexml_load_string($x, 'SimpleXMLElement',
                                LIBXML_NOCDATA);
echo "<p>$content</p>";
echo "<ul>";
foreach($places->document->placeDetails as $p){
  $now = $p->place;
  echo "<li>{$now->name}, {$now->type} ";
  echo "({$now->centroid->latitude},{$now->centroid->longitude})</li>";
}
echo "</ul>";

Why is all that necessary? I can do that with jQuery and Ajax!

Yes, you can, but can your users? Also, can you afford to have a page that is not indexed by search engines? Can you be sure that none of the other JavaScript on the page will cause an error and wipe out all of your functionality?

By sticking to your server to do the hard work, you can rely on things working. If you use web resources in JavaScript, you are first of all hoping that the user’s computer and browser understand what you want, and you also open yourself up to all kinds of dangerous injections. JavaScript is not secure – every script executed in your page has the same rights. If you load third-party content with JavaScript and don’t filter it very cleverly, the maintainers of the third-party code can inject malicious code that allows them to steal information from your server and log in as your users – or as you.

And why the C64 thing?

Well, the lads behind cURL actually used to do demos on the C64 (as did I). Just look at the difference:

Horizon demos, 1990 and 2000