Recently I wrote about making sure your new site can be indexed by a search engine, in our case, the Google Search Appliance (GSA).
What a lot of our colleagues don’t realize is that just having your site’s content indexed does not mean it will be found, or perhaps even should be found. No, this isn’t about SEO but rather about creating a clean index.
To give a bit of background, we have several divisions and units, each with its own website, plus some specialty sites and the main site, which loosely groups it all together. Our goal is to drive users to what they want, but we know that because of this architecture, which is not consistent across all the sites, the best we can do is point users in the right direction.
When we first purchased the GSA, we bought a license for 1M documents. Once the crawling started, we realized we probably had more content than that, but we also knew that our sites contained a lot of ROT (redundant, outdated, trivial content), so the plan was to exclude it. After all, if there is garbage in the index it will show up in search results, and that is not desirable.
I started excluding content. First it was easy things, such as pages that were specifically for printing, then anything with session IDs, and the list grew. Now our index is down to 160K documents and nothing of use has been excluded from indexing. How did we do it? Ruthless exclusion!
Here is a list of everything I exclude from our index:
- Individual people pages in directories. The directories are indexed, so the people can be found, but not the individual pages that users click through to from the directory;
- Individual pages in books. Again, the main page for the publication is indexed, and thanks to some programming the book's full content is in the index, but only that single main page appears in search results;
- Podcasts and videocasts. Lack of metadata means they don't index well to begin with, so they are excluded. We have pages listing the files with descriptions;
- Past events. They can still be reached through the calendar pages, but the event pages themselves are not indexed;
- Permutations of faceted navigation;
- Pages formatted for printing;
- Anything with a session ID;
- And any personal content, since some areas of our organization allow employees to have personal sites.
Besides having a super clean yet very useful index, the upshot is that I have become very adept with regexps!
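To give a flavor of what those regexps look like, here is a minimal sketch in Python. Every pattern below is hypothetical: the parameter names (`jsessionid`, `print`, `facet_*`) and paths (`/directory/people/`, `/~user/`) are assumptions made for illustration, not our actual rules, and the GSA has its own URL pattern syntax that this does not try to reproduce. The idea is simply to test a candidate URL against a list of exclusion patterns.

```python
import re

# Hypothetical exclusion patterns; real rules depend on each site's URL scheme.
EXCLUDE_PATTERNS = [
    r"[?&;](?:jsessionid|sessionid|phpsessid|sid)=",  # anything with a session ID
    r"/print/|[?&]print=1\b|[?&]format=print\b",      # pages formatted for printing
    r"/directory/people/[^/?]+",                      # individual people pages
    r"[?&](?:facet|filter|sort)[^=]*=",               # faceted navigation permutations
    r"/~[^/]+/",                                      # personal sites
]

# Combine into a single case-insensitive alternation.
EXCLUDE_RE = re.compile(
    "|".join(f"(?:{p})" for p in EXCLUDE_PATTERNS), re.IGNORECASE
)


def is_excluded(url: str) -> bool:
    """Return True if the URL matches any exclusion pattern."""
    return EXCLUDE_RE.search(url) is not None
```

Note that the people pattern requires something after `/directory/people/`, so the directory listing itself stays in the index while the individual pages drop out. Running a sample of crawled URLs through a check like this before changing the crawler configuration is a cheap way to confirm that nothing useful gets excluded by accident.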