I had the opportunity to participate in Microsoft Ottawa’s ECM Days event yesterday. I gave a short presentation on non-linear creations’ approach to tuning Microsoft search technologies: Microsoft Search Server 2008 Express, Microsoft Search Server 2008 and Microsoft Office SharePoint Server 2007 (MOSS).
I won’t repeat the entire presentation here, but a handful of observations are worth sharing. The graph below may look familiar: we found a similar shape when we examined the search data revealed by AOL’s ill-considered data release.

Short-tail (blue): fewer than 100 search terms account for 36% of search volume
Long-tail (green): 60% of search terms only occur once in the three month period
Mid-tail (red): the range of searches between most and least common
Not surprisingly, the shape of search inside the firewall does not differ dramatically from that outside the firewall. This graph looks at search behaviour on an intranet over a three-month period: the vertical axis is the number of times a given term was sought, and the horizontal axis is an ordinal ranking of terms from most popular to least popular. So the most popular term was sought about 1000 times; the least popular terms (the green line to the right) were sought only once.
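If you want to check whether your own search logs have this shape, a rough sketch like the one below will do. It assumes you can export the raw queries as a list of strings, one entry per search performed; the function name `tail_summary` and the `head_size` parameter are made up for illustration and are not part of any Microsoft tool.

```python
from collections import Counter


def tail_summary(queries, head_size=100):
    """Summarise a query log as short-tail and long-tail shares.

    queries   -- iterable of raw search strings, one per search performed
    head_size -- how many of the most popular terms count as the short tail
    """
    # Normalise case and whitespace so "Vacation " and "vacation" are one term.
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    total = sum(counts.values())
    ranked = counts.most_common()  # (term, count) pairs, most popular first

    head_volume = sum(count for _, count in ranked[:head_size])
    singletons = sum(1 for _, count in ranked if count == 1)

    return {
        "distinct_terms": len(ranked),
        "total_searches": total,
        # Share of all searches accounted for by the most popular terms.
        "head_share_of_volume": head_volume / total if total else 0.0,
        # Share of distinct terms that were searched exactly once.
        "singleton_share_of_terms": singletons / len(ranked) if ranked else 0.0,
    }
```

With `head_size=100`, the two shares correspond to the short-tail and long-tail figures quoted above.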
Why Does this Matter to Microsoft Search Tuning?
Good question. It matters because Microsoft provides search-tuning tools that allow you to deliver better results for each segment of the curve: the short tail, the mid tail and the long tail.
Addressing the Short Tail: Using Best Bets
Microsoft search technologies allow you to define “Best Bets.” These are comparable to the KeyMatch feature in the Google Search Appliance. In essence, they are a way to manually supplant the top search results returned with a recommended link. This is powerful. By grouping the most common search terms and defining Best Bet links for each group, you can very quickly, and dramatically, improve user satisfaction with search results.
In the example above, one of 15 Best Bet links appears when any of the 100 most common searches is performed. Those terms account for 36% of search volume, so more than a third of searches are likely to lead people straight to what they need.
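To estimate how much of your own search volume a set of Best Bets would cover, something along these lines can help. This is only a sketch: it assumes the Best Bets are available as a dictionary mapping each keyword to its recommended link, and the keywords and URLs shown in the usage note are hypothetical.

```python
from collections import Counter


def best_bet_coverage(queries, best_bets):
    """Return the fraction of searches whose term has a Best Bet link.

    queries   -- iterable of raw search strings, one per search performed
    best_bets -- dict mapping keyword -> recommended URL; several keywords
                 may share one URL, which is how terms get grouped
    """
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    total = sum(counts.values())
    keywords = {keyword.strip().lower() for keyword in best_bets}

    # Add up the volume of every search term that matches a keyword exactly.
    covered = sum(count for term, count in counts.items() if term in keywords)
    return covered / total if total else 0.0
```

For example, calling `best_bet_coverage(log, {"vacation": "http://intranet/hr/leave", "holidays": "http://intranet/hr/leave"})` reports the share of searches in `log` that would have surfaced the leave policy page.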
Addressing the Mid Tail: Leveraging Authority
Microsoft allows you to identify pages that you feel are particularly authoritative, and to rank them as primary, secondary or tertiary. (You can also demote pages that you believe should not be considered authoritative.) There appears to be a halo effect associated with this definition of authority. Documents or pages “close” to authoritative pages seem to be granted higher relevance in search results than more distantly associated documents or pages. By adjusting authority assignment you can change the broad landscape of search results and, with experimentation, significantly improve the relevance of search results for the broad mid-tail of searchers. (Subsequent posts will describe ideal sources of authority and ways of proving that you’re making search progress.)
Addressing the Long Tail: Zero-Result Pages and Synonyms
The rare searches that make up the long tail tend to fall into one of two categories:
- Deeply detailed searches with four or more terms entered by people who know specifically what they are seeking
- What might charitably be described as idiosyncratic spellings of more common search terms
You can safely ignore the first group: these searchers know what they want and will probably find it if it exists. But you should certainly help out the spelling-challenged in your company.
The Microsoft thesaurus is both powerful and a little intimidating. Casual users should probably keep hands well away from the keyboard while viewing it. But it does lend itself to programmatic updating (the topic of a future post).
To add synonyms, you need to edit an XML file, usually named tsenu.xml (for the English thesaurus). The following is a snippet showing misspellings of Thibideau mapped to the correct spelling. If a user enters any of these terms, a search is run against all of the terms.
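As a minimal sketch of generating such an entry programmatically, the code below builds one expansion set with Python's standard `xml.etree.ElementTree`. Two assumptions to flag: the element names `expansion` and `sub` are my recollection of the thesaurus file layout, so check them against the tsenu.xml on your own server, and the misspellings passed in are invented for illustration. The sketch produces only the fragment; it does not read or rewrite the thesaurus file itself.

```python
import xml.etree.ElementTree as ET


def expansion_set(terms):
    """Build an <expansion> element with one <sub> child per term.

    The element names are an assumption about the thesaurus schema;
    verify them against your own tsenu.xml before using the output.
    """
    expansion = ET.Element("expansion")
    for term in terms:
        ET.SubElement(expansion, "sub").text = term
    return expansion


# Hypothetical misspellings, for illustration only.
entry = expansion_set(["Thibideau", "Thibodeau", "Tibideau"])
print(ET.tostring(entry, encoding="unicode"))
```

The printed fragment would then be pasted (or merged by a script) into the thesaurus file alongside any existing entries.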

Our advice? Start with the standard report that shows the most common terms for which zero results are returned. Take the misspellings or mixed-up acronyms and begin adding them to the thesaurus. This should drive a significantly improved search experience for the long tail of search.
Questions or comments? Or real-world experience wrestling with enterprise search? That’s what the comment fields are for.
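One way to speed up that triage is to compare each zero-result term against a list of terms you know return good results, and flag the near misses as thesaurus candidates. The sketch below uses `difflib.get_close_matches` from the Python standard library. The function name `suggest_synonyms`, the 0.8 similarity cutoff and the sample terms are assumptions for illustration, and the output is meant to be reviewed by a person before anything goes into the thesaurus.

```python
import difflib


def suggest_synonyms(zero_result_terms, known_terms, cutoff=0.8):
    """Map each zero-result term to its closest known term, if any.

    zero_result_terms -- terms taken from the zero-results report
    known_terms       -- terms that are known to return good results
    cutoff            -- minimum similarity (0 to 1) to count as a match
    """
    known = [term.lower() for term in known_terms]
    suggestions = {}
    for term in zero_result_terms:
        matches = difflib.get_close_matches(term.lower(), known, n=1, cutoff=cutoff)
        if matches:
            suggestions[term] = matches[0]
    return suggestions


# Hypothetical report entries and known-good terms.
print(suggest_synonyms(["benifits", "qwzx"], ["benefits", "payroll"]))
```

Terms with no close match, such as the second sample entry, are left out and still need a human look.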