Search Engine Optimization
Sometimes SEO is made to look like rocket science. Yet the robots are pretty simple things. They are happy when they get two things: the robots.txt instruction file and a proper site map.
The Site Map for robots
From the sitemaps.org site:
"Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site."
A minimalistic site map XML file is pretty simple and we will not make it any more complicated. Here is an example:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?desc=vacation_hawaii</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?desc=vacation_new_zealand</loc>
    <lastmod>2004-12-23</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?desc=vacation_newfoundland</loc>
    <lastmod>2004-12-23T18:00:15+00:00</lastmod>
    <priority>0.3</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?desc=vacation_usa</loc>
    <lastmod>2004-11-23</lastmod>
  </url>
</urlset>
As you can see, the XML declaration comes first, followed by the urlset element, which in turn contains url entries. Inside each url there is a varying number of tags. The important thing is that there is just one mandatory tag: loc. The rest of the tags can be added if necessary.
What we will do is create a simple script that fetches all our internal pages (i.e. the ones created with the help of the page editor) and creates the structure from them.
Remember that the database table already contains something we can use now: the last_modified timestamp!
The template
The sitemap is simple enough that we could have generated it without a template, and you can write it whichever way you wish. If you prefer a template-based approach, the code is below. Note that conditional structures make sure that only real data gets into the map.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{foreach from=$pages item=page}
  <url>
    <loc>{$page.loc}</loc>
    {if $page.lastmod ne ""}
    <lastmod>{$page.lastmod}</lastmod>
    {/if}
    {if $page.changefreq ne ""}
    <changefreq>{$page.changefreq}</changefreq>
    {/if}
    {if $page.priority ne ""}
    <priority>{$page.priority}</priority>
    {/if}
  </url>
{/foreach}
</urlset>
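For comparison, here is a rough sketch of the template-free route using PHP's XMLWriter extension. The function name build_sitemap_xml and the sample page rows are made up for illustration; they are not part of the CMS. The sketch writes only the tags that have a value, like the conditionals in the template do.

```php
<?php
// Hypothetical helper (not part of the CMS): builds a sitemap string
// from an array of page rows with the same keys the template uses.
function build_sitemap_xml(array $pages)
{
    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->setIndent(true);
    $writer->startDocument('1.0', 'UTF-8');
    $writer->startElement('urlset');
    $writer->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

    foreach ($pages as $page) {
        $writer->startElement('url');
        // loc is the only mandatory tag; XMLWriter escapes & and < in the text
        $writer->writeElement('loc', $page['loc']);
        // optional tags are written only when they carry real data
        foreach (array('lastmod', 'changefreq', 'priority') as $tag) {
            if (isset($page[$tag]) && $page[$tag] !== '') {
                $writer->writeElement($tag, $page[$tag]);
            }
        }
        $writer->endElement(); // closes url
    }

    $writer->endElement(); // closes urlset
    $writer->endDocument();
    return $writer->outputMemory();
}

// Sample rows, invented for this example
$pages = array(
    array('loc' => 'http://www.example.com/index.php?pageid=1', 'lastmod' => '2005-01-01'),
    array('loc' => 'http://www.example.com/index.php?pageid=2&lang=en', 'lastmod' => ''),
);
$xml = build_sitemap_xml($pages);
echo $xml;
```

One difference from the template is that XMLWriter escapes special characters such as & in the URLs, which the sitemap protocol requires.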
The module code
The module code is pretty simple as well. If you include data from the news system, a possible event table or any other part, the code gets a bit longer, of course. In that case it is wise to have several subroutines that each do one task.
<?php
class sitemap_xml_module extends CmsModule {
    // normal internal pages
    var $pages_table = "cms_pages";
    // reservation for adding news items
    var $news_table = "cms_news";
    // site root relative to main document root
    var $page_root = "";
    // server and protocol info
    var $server = "myserver.com";
    var $protocol = "http://";
    // template
    var $template = "sitemap_xml.tpl";

    function init() {
        parent::init();
        // if server defined in params, use it
        if (!empty($this->vars['server'])) {
            $this->server = $this->vars['server'];
        } else {
            // otherwise use the name reported by the web server
            $this->server = $_SERVER['SERVER_NAME'];
        }
        if (!empty($this->vars['page_root'])) {
            $this->page_root = $this->vars['page_root'];
        }
        if (!empty($this->vars['pages_table'])) {
            $this->pages_table = $this->vars['pages_table'];
        }
        if (!empty($this->vars['template'])) {
            $this->template = $this->vars['template'];
        }
        // add the rest if necessary
    }

    function fetch() {
        // note: the mysql_* functions were removed in PHP 7;
        // on newer PHP versions use mysqli or PDO instead
        $query = "SELECT id,DATE_FORMAT(last_modified,'%Y-%m-%d') AS last_modified "
               . "FROM $this->pages_table "
               . "WHERE visible=1 ORDER BY idx ";
        $result = mysql_query($query);

        // avoid a double slash in the URL when page_root is empty
        $root = trim($this->page_root, "/");
        $base = $this->protocol . $this->server . "/" . ($root != "" ? $root . "/" : "");

        $pages = array();
        // collect only the fields the template uses
        while ($result && ($obj = mysql_fetch_object($result))) {
            $pages[] = array(
                'loc'     => $base . "index.php?pageid=" . $obj->id,
                'lastmod' => $obj->last_modified,
            );
        }
        // if you want to do the same to the news items
        // replicate the above

        $mySmarty = new Smarty();
        $mySmarty->assign("pages", $pages);
        // clear the output buffer to get a clean page
        ob_clean();
        // tell the client that XML follows
        header("Content-Type: application/xml; charset=UTF-8");
        echo $mySmarty->fetch($this->template);
        // and flush (output), it should now be valid XML
        ob_flush();
        // finally exit the whole script
        exit;
    }
}
?>
Output buffering
One potential problem of a poorly designed site is that the output to the browser (or robot) starts too early. This may result in the robot getting a malformed XML file starting with garbage (from the robot's point of view). That is why we will use output buffering, which lets us control the output and clean the buffer when necessary. Please note the ob_clean() and ob_flush() statements in the above code. The first one empties the output buffer and the second flushes it to the browser. And - lo and behold - we have our first echo statement so far!
In order to make the application work we must incorporate the output buffering. If your index.php does not yet have the following lines, please add them there now!
ob_start();
$smarty->display("main.tpl");
ob_end_flush();
The ob_start() statement must come before anything is output. If you want to play it safe, write it at the top of the index.php script.
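To see the effect in isolation, here is a small stand-alone sketch. It is not part of the CMS; the stray page output and the $clean variable are only there for the demonstration. Output echoed before ob_clean() is discarded, so the client receives only what was echoed afterwards.

```php
<?php
// start buffering before anything is output
ob_start();

// stands in for page output that was generated too early
echo "<html>stray page output";

// discard everything buffered so far
ob_clean();

// now write the real response
echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>';

// keep a copy of what the client is about to receive
$clean = ob_get_contents();

// send the buffer and stop buffering
ob_end_flush();
```

Without the ob_start() call at the top, the stray output would already have been sent and ob_clean() would have nothing to work on.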
Adding the module to the main template
The obvious position for the module call is at the top of the template, because the output uses nothing from the main template, let alone output from any other module. Generating a whole page and then discarding it with ob_clean() sounds like a waste of CPU resources. On the other hand, robots visit a site so infrequently that they have hardly any effect on the load. So you are free to place it anywhere you want. This is what the statements might look like in the middle of the maincontent div conditional statements:
{elseif $command eq "sitemap"}
{cmsmodule name="sitemap_html" path="modules/sitemap"}
{elseif $command eq "sitemap_xml"}
{cmsmodule name="sitemap_xml" path="modules/nonvisual"}
Final instructions
Now you can create a robots.txt file in your site root. The example below tells all robots that they are welcome visitors and that they can get the sitemap from the URL provided. Please test the XML sitemap URL and check the output for validity before publishing it.
User-agent: *
Disallow:
Sitemap: http://cms.tonipex.net?act=sitemap_xml
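One rough way to check the sitemap output before publishing is sketched below. The helper name check_sitemap and the sample string are assumptions made for this example. The check only confirms that the XML is well formed, that the root element is urlset and that every url entry has a loc. It is not a full validation against the sitemap schema.

```php
<?php
// Hypothetical helper: returns a list of problems found in a sitemap string.
function check_sitemap($xml)
{
    $problems = array();
    // collect parse errors internally instead of printing warnings
    libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return array('not well-formed XML');
    }
    if ($doc->getName() !== 'urlset') {
        $problems[] = 'root element is not urlset';
    }
    $n = 0;
    foreach ($doc->url as $url) {
        $n++;
        // loc is the only mandatory tag inside url
        if (trim((string) $url->loc) === '') {
            $problems[] = "url entry $n has no loc";
        }
    }
    return $problems;
}

// Sample input, invented for this example: the second entry lacks loc
$sample = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<url><loc>http://www.example.com/</loc></url>'
    . '<url><lastmod>2005-01-01</lastmod></url>'
    . '</urlset>';
$problems = check_sitemap($sample);
print_r($problems);
```

In practice you would feed the function the text fetched from your own sitemap URL instead of the sample string.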
After this, the only thing you need to do is to let the search engines know that your site exists. Various search providers have tools that let you propose your site to be added to the search repository. Remember, though, that the more popular your site gets, the more links to it people will create. And the more links there are to your site, the higher it tends to be ranked.
Now that you have your visitor log running you may soon see crawlers hitting the site. You could actually improve the logger a bit by intelligently leaving out the crawlers from the log because in many cases the crawler visits are not very interesting.
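A minimal sketch of such a filter follows. It guesses from the User-Agent header whether a visit comes from a crawler. The function name is_crawler and the list of substrings are assumptions for illustration only. Crawlers can send any header they like, so treat the result as a hint.

```php
<?php
// Hypothetical helper: guesses from the User-Agent header
// whether the visitor is a crawler.
function is_crawler($userAgent)
{
    // illustrative substrings only, not a complete list
    $markers = array('bot', 'crawl', 'spider', 'slurp');
    foreach ($markers as $marker) {
        // case-insensitive substring match
        if (stripos($userAgent, $marker) !== false) {
            return true;
        }
    }
    return false;
}

$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!is_crawler($agent)) {
    // call your existing visitor logger here
}
```

A simple substring list like this is easy to extend when the log shows a crawler that slipped through.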