
Search Engine Optimization

Sometimes it looks as if SEO were a kind of rocket science. But the robots are pretty simple things after all. They are happy when they get two things: the robots.txt instruction file and a proper site map.

The Site Map for robots

From the sitemaps.org site:

"Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site."

A minimalistic site map XML file is pretty simple and we will not make it any more complicated. Here is an example:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?desc=vacation_hawaii</loc>
      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?desc=vacation_new_zealand</loc>
      <lastmod>2004-12-23</lastmod>
      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?desc=vacation_newfoundland</loc>
      <lastmod>2004-12-23T18:00:15+00:00</lastmod>
      <priority>0.3</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?desc=vacation_usa</loc>
      <lastmod>2004-11-23</lastmod>
   </url>
</urlset>

As you can see, the file starts with the XML declaration, followed by the urlset element, which in turn contains url entries. Inside each url we have a varying number of tags. The important thing is that there is just one mandatory tag: loc. The rest of the tags can be added if necessary.

What we will do is create a simple script that fetches all our internal pages (i.e. the ones created with the help of the page editor) and creates the structure from them. 

Remember that the database table already contains something we can use now: the last_modified timestamp!
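
The sitemap protocol expects lastmod in W3C Datetime format, either a plain date (YYYY-MM-DD) or a full timestamp with a time zone offset. Here is a minimal sketch of doing the conversion in PHP. It assumes the timestamp arrives as a MySQL-style "YYYY-MM-DD HH:MM:SS" string stored in UTC; the helper name format_lastmod is hypothetical and not part of the CMS.

```php
<?php
// Hypothetical helper: turn a MySQL DATETIME string into a sitemap <lastmod> value.
// Assumption: the stored value is in UTC; change the time zone if your data differs.
function format_lastmod($mysqlTimestamp, $withTime = false)
{
    $utc  = new DateTimeZone('UTC');
    $date = DateTime::createFromFormat('Y-m-d H:i:s', $mysqlTimestamp, $utc);
    if ($date === false) {
        // Empty or unparseable value: return "" so the tag can be left out.
        return "";
    }
    // DATE_W3C produces e.g. 2004-12-23T18:00:15+00:00
    return $withTime ? $date->format(DATE_W3C) : $date->format('Y-m-d');
}

echo format_lastmod('2004-12-23 18:00:15'), "\n";
echo format_lastmod('2004-12-23 18:00:15', true), "\n";
```

Returning an empty string for bad input fits a template that only prints tags with real data.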

The template

We could have produced the sitemap without a template, simple as it is, and you can write it any way you wish. If you prefer a template-based approach, the code is below. Note that we have used conditional structures to make sure that only fields with actual data get into the map.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{foreach from=$pages item=page}
  <url>
      <loc>{$page.loc|escape}</loc>
      {if $page.lastmod ne ""}
      <lastmod>{$page.lastmod}</lastmod>
      {/if}
      {if $page.changefreq ne ""}
      <changefreq>{$page.changefreq}</changefreq>
      {/if}
      {if $page.priority ne ""}
      <priority>{$page.priority}</priority>
      {/if}
  </url>
{/foreach}
</urlset>

The module code

The module code is pretty simple as well. If you include data from the news system, a possible event table or any other part, the code gets a bit longer, of course. In that case it is wise to have several subroutines that each perform one task.

<?php
class sitemap_xml_module extends CmsModule{
    // normal internal pages
    var $pages_table = "cms_pages";
    // reserved for adding news items
    var $news_table    = "cms_news";
    // site root relative to main document root
    var $page_root = "";
    // server and protocol info
    var $server = "myserver.com";
    var $protocol = "http://";
    // template
    var $template = "sitemap_xml.tpl";
 
    function init(){
        parent::init();
        // if server defined in params, use it
        if ($this->vars['server'] != ""){
            $this->server = $this->vars['server'];
        } else {
            // otherwise use the name of the current server
            $this->server = $_SERVER['SERVER_NAME'];
        }
        if ($this->vars['page_root'] != ""){
            $this->page_root = $this->vars['page_root'];
        }
        if ($this->vars['pages_table'] != ""){
            $this->pages_table = $this->vars['pages_table'];
        }
        if ($this->vars['template'] != ""){
            $this->template = $this->vars['template'];
        }
        // add the rest if necessary
    }
    function fetch(){
        $query= "SELECT id,DATE_FORMAT(last_modified,'%Y-%m-%d') AS last_modified " .
                "FROM $this->pages_table " .
                "WHERE visible=1 ORDER BY idx ";
        $result = mysql_query($query);
        $pages = array();
        // build the base URL once; avoid a double slash when page_root is empty
        $base = $this->protocol . $this->server . "/";
        if (trim($this->page_root, "/") != ""){
            $base .= trim($this->page_root, "/") . "/";
        }
        // $result is false if the query failed; the map then stays empty
        while ($result && ($obj = mysql_fetch_object($result))) {
            // only the fields used by the template are collected
            $pages[] = array(
                'loc'     => $base . "index.php?pageid=" . $obj->id,
                'lastmod' => $obj->last_modified
            );
        }
        // if you want to do the same to the news items
        // replicate the above
 
        $mySmarty = new Smarty();
        $mySmarty->assign("pages",$pages);
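        // clear the output buffer to get a clean page
        ob_clean();
        // tell the robot that XML follows (possible because nothing has been sent yet)
        header("Content-Type: application/xml; charset=UTF-8");
        echo $mySmarty->fetch($this->template);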
        // and flush (output), it should now be valid XML
        ob_flush();
        // finally exit the whole script
        exit;
    }
} ?>
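
If news items are added later, one subroutine per data source keeps the code readable. The following is only a sketch under stated assumptions: the row arrays stand in for database results, and the function names and the newsid URL parameter are hypothetical, not part of the CMS.

```php
<?php
// Build sitemap entries for normal pages.
// $rows stands in for the rows fetched from the pages table.
function page_entries(array $rows, $baseUrl)
{
    $entries = array();
    foreach ($rows as $row) {
        $entries[] = array(
            'loc'     => $baseUrl . "index.php?pageid=" . $row['id'],
            'lastmod' => $row['last_modified'],
        );
    }
    return $entries;
}

// Same task for news items; the "newsid" parameter is an assumption.
function news_entries(array $rows, $baseUrl)
{
    $entries = array();
    foreach ($rows as $row) {
        $entries[] = array(
            'loc'     => $baseUrl . "index.php?newsid=" . $row['id'],
            'lastmod' => $row['last_modified'],
        );
    }
    return $entries;
}

// Combine all sources into the single list the template iterates over.
function sitemap_entries(array $pageRows, array $newsRows, $baseUrl)
{
    return array_merge(
        page_entries($pageRows, $baseUrl),
        news_entries($newsRows, $baseUrl)
    );
}

$entries = sitemap_entries(
    array(array('id' => 1, 'last_modified' => '2005-01-01')),
    array(array('id' => 7, 'last_modified' => '2005-02-10')),
    "http://www.example.com/"
);
print_r($entries);
```

The combined array can be assigned to the template as "pages" in the same way as before.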

Output buffering

One potential problem of a poorly designed site is that the output to the browser (or robot) starts too early. This may result in the robot getting a malformed XML file starting with garbage (from the robot's point of view). That is why we use output buffering, which lets us control the output and clean the buffer when necessary. Please note the ob_clean() and ob_flush() statements in the above code. The first one empties the output buffer and the second one sends the buffer contents to the browser. And - lo and behold - we have our first echo statement so far!

To make the application work we must incorporate output buffering. If your index.php does not yet have the following lines, please add them now!

    ob_start();
    $smarty->display("main.tpl");
    ob_end_flush(); 

The ob_start() statement must come before anything is output. If you want to play it safe, put it at the very top of the index.php script.
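
The effect of the buffer can be tried out in a small standalone script. This is just an illustration, not CMS code: the stray markup simulates output that started too early, and ob_get_clean() is used here instead of ob_flush() only so that the result can be inspected in a variable.

```php
<?php
// Start buffering before anything is output.
ob_start();

// Simulate output produced too early by some other part of the page.
echo "<html><body>stray markup";

// Discard everything buffered so far; the buffer itself stays active.
ob_clean();

// Now write the content we actually want to deliver.
echo '<?xml version="1.0" encoding="UTF-8"?>';

// Fetch the buffer contents and end buffering.
$output = ob_get_clean();
echo $output, "\n";
```

Only the XML declaration survives; the stray markup never reaches the client.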

Adding the module to the main template

The obvious position for the call is at the top of the main template, because the sitemap does not use anything from the main template, let alone output from any other module. Generating a whole page and then discarding it with ob_clean() sounds like a waste of CPU resources. On the other hand, robots visit a site so infrequently that they have little effect on the load, so you are free to place it anywhere you want. This is what the statements might look like in the middle of the conditional statements of the maincontent div:

    {elseif $command eq "sitemap"}
        {cmsmodule name="sitemap_html" path="modules/sitemap"}
    {elseif $command eq "sitemap_xml"}
        {cmsmodule name="sitemap_xml" path="modules/nonvisual"} 

Final instructions

Now you can create a robots.txt file in your site root. The example below tells all robots that they are welcome visitors and where to get the sitemap. Please test the XML sitemap URL and check the output for validity before publishing it.

User-agent: *
Disallow:
Sitemap: http://cms.tonipex.net?act=sitemap_xml
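
A rough validity check of the sitemap output can be scripted as well. The sketch below uses PHP's SimpleXML extension and only tests the basics: well-formed XML, a urlset root and a loc in every url entry. The function name check_sitemap is hypothetical, and the sample string stands in for the output you would fetch from your own sitemap URL.

```php
<?php
// Hypothetical checker: returns a list of problems found in a sitemap XML string.
// An empty list means the basic structure looks fine.
function check_sitemap($xml)
{
    $problems = array();
    // Collect parse errors internally instead of printing warnings.
    libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        libxml_clear_errors();
        $problems[] = "not well-formed XML";
        return $problems;
    }
    if ($doc->getName() !== "urlset") {
        $problems[] = "root element is not urlset";
        return $problems;
    }
    $count = 0;
    foreach ($doc->url as $url) {
        $count++;
        // loc is the only mandatory tag inside url
        if (trim((string) $url->loc) === "") {
            $problems[] = "url entry $count has no loc";
        }
    }
    if ($count === 0) {
        $problems[] = "no url entries";
    }
    return $problems;
}

$sample = '<?xml version="1.0" encoding="UTF-8"?>'
        . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        . '<url><loc>http://www.example.com/</loc></url>'
        . '</urlset>';
print_r(check_sitemap($sample));
```

This does not replace validation against the official schema, but it catches the most common mistake: garbage in front of the XML declaration.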

After this, the only thing you need to do is let the search engines know that your site exists. Various search providers have tools that let you submit your site for inclusion in their index. Remember, though, that the more popular your site gets, the more links to it people will create. And the more links point to your site, the higher it tends to be ranked.

Now that you have your visitor log running you may soon see crawlers hitting the site. You could actually improve the logger a bit by intelligently leaving out the crawlers from the log because in many cases the crawler visits are not very interesting.
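
One simple way to leave crawlers out of the log is to look at the User-Agent header. The sketch below is an assumption-laden illustration: the helper name is_crawler and the list of markers are hypothetical, and a crawler that does not identify itself will not be caught.

```php
<?php
// Hypothetical helper: guess from the User-Agent string whether a request
// comes from a crawler. The marker list is an assumption; extend it as needed.
function is_crawler($userAgent)
{
    $markers = array('bot', 'crawl', 'spider', 'slurp');
    foreach ($markers as $marker) {
        // stripos() is case-insensitive, so "Googlebot" matches "bot"
        if (stripos($userAgent, $marker) !== false) {
            return true;
        }
    }
    return false;
}

// The header may be missing, so fall back to an empty string.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!is_crawler($agent)) {
    // call your visitor logger here
}
```

Matching on substrings keeps the check short, at the cost of an occasional false positive for a browser whose User-Agent happens to contain one of the markers.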