Christian Heilmann

The Table of Contents script – my old nemesis

One thing I like about – let me rephrase that – one of the amazingly few things that I like about Microsoft Word is that you can generate a Table of Contents from a document. Word goes through the headings and creates a nested TOC from them for you.

Now, I always like to have that for documents I write in HTML, too, but maintaining a TOC by hand is a pain. I normally write my document outline first:


Cute things on the Interwebs


Rabbits


Puppies


Labradors


Alsatians


Corgies


Retrievers


Kittens


Gerbils


Ducklings


I then collect those, copy and paste them and use search and replace to turn all the hn elements into links and the IDs into fragment identifiers:


  • Cute things on the Interwebs

  • Rabbits

  • Puppies

  • Labradors

  • Alsatians

  • Corgies

  • Retrievers

  • Kittens

  • Gerbils

  • Ducklings
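The search-and-replace step can also be scripted. Here is a minimal sketch, assuming every heading already carries an `id` attribute in double quotes. The function name `headingsToLinks` and the sample IDs are made up for illustration.

```javascript
// Hypothetical helper: turns each <hN id="x">text</hN> in a string
// into a list item that links to the fragment identifier #x.
function headingsToLinks(html) {
  return html.replace(
    /<h([1-6])[^>]*\sid="([^"]+)"[^>]*>([\s\S]*?)<\/h\1>/gi,
    '<li><a href="#$2">$3</a></li>'
  );
}

// Sample outline; the IDs are assumptions, not taken from a real page.
const outline = [
  '<h1 id="cute">Cute things on the Interwebs</h1>',
  '<h2 id="rabbits">Rabbits</h2>'
].join('\n');

console.log(headingsToLinks(outline));
```

Headings without an ID are left untouched by this sketch, so they would still need one added by hand.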

    Then I need to look at the weight and order of the headings and add the nesting of the TOC list accordingly.


      • Cute things on the Interwebs
          • Rabbits
          • Puppies
              • Labradors
              • Alsatians
              • Corgies
              • Retrievers
          • Kittens
          • Gerbils
          • Ducklings
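Working out the nesting from the weight and order of the headings is the fiddly part. One possible way to think about it, sketched here on plain objects rather than DOM nodes, is to keep a stack of "currently open" headings. The names `nestHeadings` and `renderToc` and the `{ level, id, text }` shape are assumptions made for this example.

```javascript
// Turns a flat, ordered list of headings into a nested tree.
// Each input item is { level, id, text }; level is 1 for h1, 2 for h2, ...
function nestHeadings(headings) {
  const root = { level: 0, children: [] };
  const stack = [root]; // path from the root to the most recent heading
  for (const h of headings) {
    const node = { level: h.level, id: h.id, text: h.text, children: [] };
    // Step back up until the top of the stack has a lower level number.
    while (stack[stack.length - 1].level >= node.level) {
      stack.pop();
    }
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

// Renders the tree as nested lists of links.
function renderToc(nodes) {
  if (nodes.length === 0) return '';
  const items = nodes.map(
    (n) => '<li><a href="#' + n.id + '">' + n.text + '</a>' + renderToc(n.children) + '</li>'
  );
  return '<ul>' + items.join('') + '</ul>';
}
```

Building a tree first keeps the nesting logic separate from the string assembly, so every list that gets opened is also closed, even when the document ends on a deeply nested heading.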


    Now, wouldn’t it be nice to have that done automatically for me? Doing that with JavaScript and the DOM is actually a much trickier problem than it looks at first sight (I always love to ask this as an interview question or in DOM scripting workshops).

    Here are some of the issues to consider:

    So here are some solutions to that problem:

    Using the DOM:


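    (function(){
      var headings = [];
      var herxp = /^h[1-6]$/i; // matches the node names H1 to H6
      var count = 0;
      var elms = document.getElementsByTagName('*');
      for(var i=0,j=elms.length;i<j;i++){
        var cur = elms[i];
        if(herxp.test(cur.nodeName)){
          if(cur.id===''){ // no ID yet: generate one to link to
            cur.id = 'head'+count;
            count++;
          }
          headings.push(cur);
        }
      }
      var out = '<ul>';
      var open = 0; // nested lists that still need closing
      for(i=0,j=headings.length;i<j;i++){
        var weight = +headings[i].nodeName.charAt(1);
        var next = headings[i+1] ? +headings[i+1].nodeName.charAt(1) : 0;
        out += '<li><a href="#'+headings[i].id+'">'+headings[i].innerHTML+'</a>';
        if(next > weight){ // next heading is a sub-heading: open a nested list
          out += '<ul>';
          open++;
        } else {
          out += '</li>';
          // next heading is higher up: close one nested list per level
          for(;next && next < weight && open > 0;weight--){
            out += '</ul></li>';
            open--;
          }
        }
      }
      while(open-- > 0){
        out += '</ul></li>';
      }
      out += '</ul>';
      document.getElementById('toc').innerHTML = out;
    })();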

    You can see the DOM solution in action here. The problem with it is that it can become very slow on large documents and in MSIE6.

    The regular expressions solution

    To work around the need to traverse the whole DOM, I thought it might be a good idea to use regular expressions on the innerHTML of the DOM and write it back once I added the IDs and assembled the TOC:


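    (function(){
      var bd = document.body, x = bd.innerHTML,
          headings = x.match(/<h\d[^>]*>[\S\s]*?<\/h\d>/ig) || [],
          r1 = />/, r2 = /<(\/?)h\d/ig,
          toc = document.createElement('div'),
          out = '<ul>', open = 0, i, cur, weight, next;
      for(i=0;i<headings.length;i++){
        cur = headings[i];
        if(cur.indexOf(' id=') === -1){ // no ID yet: add one in the source
          cur = cur.replace(r1,' id="head'+i+'">');
          x = x.replace(headings[i],cur);
        }
        weight = +cur.charAt(2);
        next = headings[i+1] ? +headings[i+1].charAt(2) : 0;
        // <h2 id="x">…</h2> becomes <a href="#x">…</a>
        out += '<li>'+cur.replace(r2,'<$1a').replace(' id="',' href="#');
        if(next > weight){ out += '<ul>'; open++; } else { out += '</li>'; }
        for(;next && next < weight && open > 0;weight--){ out += '</ul></li>'; open--; }
      }
      while(open-- > 0){ out += '</ul></li>'; }
      bd.innerHTML = x; // write the added IDs back
      toc.innerHTML = out+'</ul>';
      (document.getElementById('toc') || bd).appendChild(toc);
    })();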

    You can see the regular expressions solution in action here. The problem with it is that reading innerHTML and then writing it back might be expensive (this needs testing), and if you have event handlers attached to elements it might leak memory, as my colleague Matt Jones pointed out (again, this needs testing). Ara Pehlivanian also mentioned that a mix of both approaches might be better – match the headings but don’t write back the innerHTML, and instead use the DOM to add the IDs.
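A rough sketch of what such a mixed approach could look like follows. It is not code from Ara or from the project; `findHeadings` is a made-up name. The idea is to match the headings in the innerHTML string without changing it, remember the position of each match among the headings of the same level, and then use that position to find the node with `getElementsByTagName`. This assumes the serialised HTML contains nothing that merely looks like a heading, for example inside a comment or an inline script.

```javascript
// Finds headings in an HTML string in source order, without changing it.
// "index" counts headings of the same level, so the n-th <h2> match
// lines up with document.getElementsByTagName('h2')[n].
function findHeadings(html) {
  const re = /<h([1-6])([^>]*)>([\s\S]*?)<\/h\1>/gi;
  const seen = {}; // how many headings of each level were matched so far
  const found = [];
  let m;
  while ((m = re.exec(html)) !== null) {
    const level = Number(m[1]);
    const idMatch = /\sid="([^"]*)"/i.exec(m[2]);
    const index = seen[level] || 0;
    seen[level] = index + 1;
    found.push({ level, index, id: idMatch ? idMatch[1] : '', html: m[3] });
  }
  return found;
}

// Browser-only part, skipped where there is no document (for example in Node).
if (typeof document !== 'undefined') {
  findHeadings(document.body.innerHTML).forEach((h, i) => {
    const node = document.getElementsByTagName('h' + h.level)[h.index];
    if (node && !node.id) node.id = 'head' + i; // set the ID through the DOM
  });
}
```

Because the innerHTML is only read and never written, event handlers attached to elements stay in place.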

    Libraries to the rescue – a YUI3 example

    When I talked to another colleague – Dav Glass – about the TOC problem, he pointed out that the YUI3 selector engine happily takes a list of elements and returns them in the right order. This makes things very easy:




    There is probably a cleaner way to assemble the TOC list.
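Browsers that implement the native Selectors API offer the same convenience without a library: `querySelectorAll` with a comma-separated selector list returns the matches in document order. The following is not the YUI3 code, just a sketch of the same idea. The function name `collectHeadings` and the shape of the returned objects are assumptions for this example.

```javascript
// Collects headings in document order from any object that offers
// querySelectorAll (document, an element, or a test stub).
function collectHeadings(root) {
  const nodes = root.querySelectorAll('h1, h2, h3, h4, h5, h6');
  return Array.from(nodes, (node, i) => {
    if (!node.id) node.id = 'head' + i; // add an ID where one is missing
    return {
      level: Number(node.nodeName.charAt(1)),
      id: node.id,
      text: node.textContent
    };
  });
}

// In a browser: collectHeadings(document)
```

The result is a plain array of levels, IDs and texts, which can then be turned into a nested list in whatever way suits the page.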

    Performance considerations

    There is more to life than simply increasing its speed. – Gandhi

    Some of the code above can be very slow. That said, whenever we talk about performance and JavaScript, it is important to consider the context of the implementation: a table of contents script would normally be used on a text-heavy, but simple, document. There is no point in measuring and judging these scripts by running them over Gmail or the Yahoo homepage. Faster and less memory-hungry is always better, but I am always a bit sceptical about performance tests that consider edge cases rather than the case the solution was meant for.

    Moving server side

    The other thing I am getting more and more sceptical about is client-side solutions for things that also make sense on the server. Therefore I thought I could take the regular expressions approach above and move it server side.

    The first version is a PHP script you can loop an HTML document through. You can see the outcome of tocit.php here:


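    <?php
    $file = isset($_GET['file']) ? $_GET['file'] : '';
    // only allow plain file names such as text.html (no slashes)
    if(preg_match('/^[a-z0-9_.-]+$/i',$file)){
      $content = file_get_contents($file);
      // $headlines[0] holds the full headings, $headlines[1] their levels
      preg_match_all('/<h([1-6])[^>]*>.*<\/h\1>/Uis',$content,$headlines);
      $out = '<ul>';
      $open = 0; // nested lists that still need closing
      foreach($headlines[0] as $k=>$h){
        if(stripos($h,' id=')===false){
          // no ID yet: add one to the heading in the document
          $x = preg_replace('/>/',' id="head'.$k.'">',$h,1);
          $content = str_replace($h,$x,$content);
          $h = $x;
        }
        // turn <h2 id="x">…</h2> into <a href="#x">…</a>
        $link = preg_replace('/<(\/?)h\d/i','<$1a',$h);
        $link = str_replace(' id="',' href="#',$link);
        $level = (int)$headlines[1][$k];
        $next = isset($headlines[1][$k+1]) ? (int)$headlines[1][$k+1] : 0;
        $out .= '<li>'.$link;
        if($next > $level){
          $out .= '<ul>';
          $open++;
        }else{
          $out .= '</li>';
          // close one nested list per level we step back up
          for(;$next && $next < $level && $open > 0;$level--){
            $out .= '</ul></li>';
            $open--;
          }
        }
      }
      $out .= str_repeat('</ul></li>',$open).'</ul>';
      // the placeholder is assumed to be an empty element with the ID toc
      echo str_replace('<div id="toc"></div>',$out,$content);
    }else{
      die('only files like text.html please!');
    }
    ?>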

    This is nice, but instead of having another file to loop through, we can also use the output buffer of PHP:


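    <?php
    function tocit($content){
      // $headlines[0] holds the full headings, $headlines[1] their levels
      preg_match_all('/<h([1-6])[^>]*>.*<\/h\1>/Uis',$content,$headlines);
      $out = '<ul>';
      $open = 0; // nested lists that still need closing
      foreach($headlines[0] as $k=>$h){
        if(stripos($h,' id=')===false){
          // no ID yet: add one to the heading in the document
          $x = preg_replace('/>/',' id="head'.$k.'">',$h,1);
          $content = str_replace($h,$x,$content);
          $h = $x;
        }
        // turn <h2 id="x">…</h2> into <a href="#x">…</a>
        $link = preg_replace('/<(\/?)h\d/i','<$1a',$h);
        $link = str_replace(' id="',' href="#',$link);
        $level = (int)$headlines[1][$k];
        $next = isset($headlines[1][$k+1]) ? (int)$headlines[1][$k+1] : 0;
        $out .= '<li>'.$link;
        if($next > $level){
          $out .= '<ul>';
          $open++;
        }else{
          $out .= '</li>';
          // close one nested list per level we step back up
          for(;$next && $next < $level && $open > 0;$level--){
            $out .= '</ul></li>';
            $open--;
          }
        }
      }
      $out .= str_repeat('</ul></li>',$open).'</ul>';
      // the placeholder is assumed to be an empty element with the ID toc
      return str_replace('<div id="toc"></div>',$out,$content);
    }

    // run the whole page output through tocit() before it is sent
    ob_start("tocit");
    ?>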
    [... the document …]

    The server-side solutions have a few benefits: they always work, even without JavaScript, and you can cache the result for a while if needed. I am sure the PHP can be sped up, though.
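Caching "for a while" can be as simple as remembering the generated page together with a timestamp. The sketch below is written in JavaScript to match the other examples, but the same idea carries over to PHP. `createCache`, its parameters and the key used in the usage comment are all made up for illustration.

```javascript
// Hypothetical in-memory cache: keeps each generated page for ttlMs
// milliseconds so the TOC is not rebuilt on every request.
function createCache(ttlMs, now = Date.now) {
  const store = new Map();
  return function cached(key, produce) {
    const hit = store.get(key);
    if (hit && now() - hit.time < ttlMs) {
      return hit.value; // still fresh: reuse the stored result
    }
    const value = produce();
    store.set(key, { value, time: now() });
    return value;
  };
}

// Usage sketch: rebuild the page for "text.html" at most once a minute.
// const cachedPage = createCache(60 * 1000);
// const html = cachedPage('text.html', () => buildPageWithToc('text.html'));
```

`buildPageWithToc` in the usage comment stands for whatever function produces the finished document. It is not defined here.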

    See all the solutions and get the source code

    I showed you mine, now show me yours!

    All of these solutions are pretty much rough and ready. How do you think they can be improved? How about doing a version for different libraries? Go ahead, fork the project on GitHub and show me what you can do.