I was under the impression that the latest Panda update was supposed to refine Google’s results to more valuable and relevant content. It appears I was mistaken.
I have run into some problems recently in controlling which version of a URL Google uses as the canonical, regardless of which canonical is specified in the rel=”canonical” HTML element. Before I go into details regarding the problem, I think it is important to outline my understanding of how Google ultimately chooses a canonical URL.
Google Canonical URL Selection
It is my understanding from both research and experience that Google uses the following sources for defining canonical URLs, listed in order of priority.
1. Internal links to pages within the website.
2. The rel=”canonical” HTML element within the HEAD of the page code.
3. The URL listed within the sitemap.xml file
4. External links to pages within the website
When it comes to the final canonical choice, it seems that majority rules: whenever two higher-priority signals match, that seems to be Google’s canonical output. While I am aware that this is a massively simplified view of Google’s decision-making process when it comes to the canonical URL, it still seems to apply in most circumstances.
The Canonical URL Goal
Get Google to display the all-lowercase version of each URL and, in turn, attract more external links to my preferred URL version.
The Canonical URL Problem
The problem I ran into mostly revolves around URL case. The way the site code is structured is not optimal, for several reasons.
1. The URLs output for internal links are a mix of lower and upper case, with no way to convert them to all lowercase and no efficient way to redirect the old URLs to the new ones
2. The system has no way of outputting a URL list for sitemap generation
3. Most back links use the URL shown in the browser, which sets the canonical signal to reflect the mixed-case URL version
Originally, I was using a link crawler to create sitemap files, which caused three of the four canonical signals to reflect the mixed-case URLs. I needed a better solution, since listing the appropriate canonical version only within the rel=”canonical” element was not enough to have Google display my canonical choice.
The Canonical URL Solution
My thought was that if I could at least get the rel=”canonical” URLs and the sitemap URLs to match then I would have a chance at providing the right signals for Google to show the URL of my choosing in the results. I was right. Within 45 days of matching the URLs within my sitemap file to the URLs within the rel=”canonical” elements on my pages, Google was displaying all lowercase URLs and had also stripped off all parameters appended to the URLs. How did I do it?
First I retrieved all site URLs using a regular crawler. This was just to get an original list that I could use to plug into the rel=”canonical” retrieval tool that I created using a simple PHP script.
My second step was to paste the full URL list into the canonical tool and wait for the list to come back clean (assuming the rel=”canonical” element was set correctly within the code).
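For anyone who wants to try this step without my tool, here is a minimal sketch of the canonical lookup in Python. It is not the PHP script I used; the function names (extract_canonical, fetch_canonical), the example.com URL and the 30-second timeout are all assumptions made for illustration, and it relies only on the standard library.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.request import urlopen


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonical = attr["href"].strip()


def extract_canonical(html: str) -> Optional[str]:
    """Return the rel="canonical" URL declared in the page, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def fetch_canonical(url: str, timeout: float = 30.0) -> Optional[str]:
    """Download one page and return its declared canonical URL (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_canonical(html)
```

Calling extract_canonical on a page whose HEAD contains a link element with rel set to canonical and href set to http://www.example.com/my-page returns that href; a page with no such element returns None, which is a quick way to spot pages where the element is missing.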
Finally I just used my Excel sitemap template to remove the duplicates, which there were a ton of, and create the XML URL line items.
That’s it. For a total of 5k URLs, it all took less than an hour, including scrubbing the URL list of bad URLs, retrieving the canonicals and generating clean sitemaps that match the rel=”canonical” element in my code. The direct impact after a 45-day wait was worth every second I spent.
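If you do not have an Excel template handy, the dedupe-and-build step can also be scripted. The following is a rough Python sketch, not my actual template: build_sitemap is a made-up name, and it assumes you already have the list of canonical URLs returned from the previous step (with None or empty entries for pages that had no canonical).

```python
from xml.sax.saxutils import escape


def build_sitemap(canonical_urls):
    """Return sitemap XML for the given URLs, dropping duplicates but keeping order."""
    # dict.fromkeys preserves first-seen order while removing exact duplicates;
    # empty or None entries (pages without a canonical) are skipped.
    unique = list(dict.fromkeys(u.strip() for u in canonical_urls if u and u.strip()))
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in unique:
        # escape() converts &, < and > so URLs with query strings stay valid XML.
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines)
```

Because the sitemap is built only from the canonical values, every URL in it matches the rel=”canonical” element on its page by construction, which is the whole point of the exercise.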
Canonical URL Solution Benefits
• All URLs displayed in the Google SERPs are my preferred canonical URLs
• Any back links gained from new visitors via Google results are pointed to my preferred canonical
• Consolidated link value
• 95% of the URLs within my sitemap file are now indexed since all duplicates have been removed
• Clean URL list to work with when using any other data scraping utilities to retrieve site information
Canonical URL Tool Notes
• The server timeout defaults to two minutes
• URLs must be a full path and separated by line breaks
• Requests for additional functionality and code base are welcome
• This tool is not meant as an enterprise solution…yet
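To illustrate the input format noted above, here is a small Python sketch of how a pasted list could be checked before it is processed. It is only an assumption about what a reasonable check looks like, not the tool's code: parse_url_list is a hypothetical name, and "full path" is interpreted here as an absolute http or https URL.

```python
from urllib.parse import urlsplit


def parse_url_list(text):
    """Split pasted text on line breaks and separate absolute URLs from the rest."""
    accepted, rejected = [], []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # ignore blank lines
        try:
            parts = urlsplit(candidate)
        except ValueError:
            # urlsplit raises on malformed input such as a broken IPv6 host.
            rejected.append(candidate)
            continue
        # Assumption: a "full path" means a scheme plus a host name.
        if parts.scheme in ("http", "https") and parts.netloc:
            accepted.append(candidate)
        else:
            rejected.append(candidate)
    return accepted, rejected
```

Relative paths such as /relative/path, or entries missing the scheme such as www.example.com/c, end up in the rejected list so they can be fixed before the lookup runs.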