How to measure search result quality (in eZ Publish)

by Ivo Lukač - February 2, 2014

Solr is an excellent search engine. We have been using it for years through eZ Find (an eZ Publish extension for search), and it is a really powerful and flexible solution. Our use cases are not that complex, so we have never actually hit a wall with Solr. But there is one problem in building search solutions in general that we have encountered a few times and for which Solr has no feature. It arises when the indexed content is in a language that we as integrators don't understand.

The problem

So the main problem we had on a few projects was that we couldn't verify that search was performing well. Those use cases were not simple, so it was not just a matter of plugging in the eZ Find/Solr extension and enabling it. We implemented custom advanced search forms with lots of options, tuned configuration parameters, boosted some attributes and finally updated the Solr schema to index the content better.

If it were English, Croatian or some other language that we understand, it would be no problem. But with Arabic, Vietnamese or Armenian it's another story :).

In some cases users or editors would tell us that they were not happy with the search, but that was too vague to work with. So the solution was to measure the search result quality in several rounds, tuning the search in between. As we usually have a native-speaking editorial team, we asked them to rate the results for a defined set of search terms. The set can be produced manually or based on existing search usage statistics. We used the 20 most searched terms.

The tool

To make the rating as easy as possible for editors, we implemented a small extension for eZ Publish that plugs into the existing search statistics module and provides:

creation of testing periods
a simple REST function for rating each search result (which can be plugged into the site)
calculation of a few quality measures for each search term
calculation of a total quality index for the period

On the search result page, rating can be implemented in several ways. The simplest was to have a "Thumbs Up" link if the result is OK for the search term (rated with a value of 1) and a "Thumbs Down" link if the result is bad for the search term (rated with a value of -1). We implemented these two actions as icons next to each result on the first page (the first 20 items).
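To make the rating step a bit more concrete, here is a minimal Python sketch of what the storage behind such a rating function could look like. It is not the code of our extension: the `RatingStore` class, its method names, keying votes by result position, and averaging the votes of several editors are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

THUMBS_UP = 1
THUMBS_DOWN = -1
PAGE_SIZE = 20  # only results on the first page are rated


class RatingStore:
    """Hypothetical in-memory stand-in for the storage behind a rating endpoint."""

    def __init__(self):
        # (period, term) -> position -> list of votes from different editors
        self._votes = defaultdict(lambda: defaultdict(list))

    def rate(self, period, term, position, value):
        """Record one thumbs up/down vote for the result at a 1-based position."""
        if value not in (THUMBS_UP, THUMBS_DOWN):
            raise ValueError("rating must be 1 or -1")
        if not 1 <= position <= PAGE_SIZE:
            raise ValueError("only first-page results can be rated")
        self._votes[(period, term)][position].append(value)

    def ratings(self, period, term):
        """Return the mean vote per rated position, ordered by position."""
        by_position = self._votes.get((period, term), {})
        return {pos: mean(votes) for pos, votes in sorted(by_position.items())}
```

A REST handler would only need to call `rate()` with the period, the search term, the position of the result and the value of the clicked icon.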

Model

Earlier I mentioned the quality measures that the tool calculates. Here is a more detailed explanation:

Discounted Cumulative Gain (DCG) - the sum of ratings, discounted by position in the search result. As the position of each result is very important (the best result should be first), we sum all the ratings, but each rating is discounted by a logarithmic reduction factor that depends on its position. The measure is a good base but can't be used for comparison, as the number of results in a result set can vary.
Normalised Discounted Cumulative Gain (NDCG) - the discounted rating sum normalised against the best possible outcome. To overcome the problem with DCG we normalise it against the best outcome, which is the DCG we would get if all results on the first page were rated with the highest score. The good thing here is that with NDCG we get a percentage as the unit of measurement.
Popularity based NDCG - takes into account the popularity of the search term. This measure was not that important in our use case, as the number of tests for each search term was almost equal (equal to the number of editors), but when measuring quality with ratings from site visitors it would be interesting to give more weight to the more popular terms.
Total NDCG - sums the NDCG of all search terms to get one total number for the test period. This one is useful for quickly comparing different testing periods.
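The four measures can be sketched in a few lines of Python. This is a sketch under assumptions, not the extension's code: I assume the common log2(position + 1) discount, and I assume the popularity based variant is an average of NDCG values weighted by how often each term was searched. All function and parameter names are made up for the example.

```python
import math

MAX_RATING = 1  # "Thumbs Up"; "Thumbs Down" is -1


def dcg(ratings):
    """Sum the ratings, each divided by log2(position + 1), positions from 1."""
    return sum(r / math.log2(pos + 1) for pos, r in enumerate(ratings, start=1))


def ndcg(ratings):
    """DCG divided by the DCG of a page where every result has the top rating."""
    if not ratings:
        return 0.0  # no results means nothing could be rated
    best = dcg([MAX_RATING] * len(ratings))
    return dcg(ratings) / best


def total_ndcg(ratings_by_term):
    """Sum of the NDCG values of all search terms in a test period."""
    return sum(ndcg(ratings) for ratings in ratings_by_term.values())


def popularity_ndcg(ratings_by_term, search_counts):
    """NDCG averaged over terms, weighted by search count (assumed weighting)."""
    total_searches = sum(search_counts[term] for term in ratings_by_term)
    if total_searches == 0:
        return 0.0
    weighted = sum(
        ndcg(ratings) * search_counts[term]
        for term, ratings in ratings_by_term.items()
    )
    return weighted / total_searches
```

Here `ratings` is the list of ratings for one search term, ordered by result position. Because the ratings range from -1 to 1, a page full of bad results gives a negative NDCG, and the same thumbs up counts for more at position 1 than at position 20.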

We used NDCG to see which search terms were performing badly and the Total NDCG to verify that we were making improvements with each new period.
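Both uses boil down to very little code. The sketch below assumes the NDCG values are already calculated. The helper names, the `limit` parameter and the choice of a relative change between two totals are mine, made up for illustration.

```python
def worst_terms(ndcg_by_term, limit=5):
    """Return the search terms with the lowest NDCG, worst first."""
    return sorted(ndcg_by_term, key=ndcg_by_term.get)[:limit]


def relative_change(previous_total, current_total):
    """Change of the Total NDCG between two periods, as a fraction of the old value."""
    if previous_total == 0:
        raise ValueError("previous total is 0, relative change is undefined")
    return (current_total - previous_total) / abs(previous_total)
```

With made-up totals of 10.0 for one period and 12.5 for the next, `relative_change(10.0, 12.5)` gives 0.25, i.e. a 25% better total measure.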

After 3 test periods we were able to discover and solve one big bug and also tune the engine to squeeze out a 40% better total measure. Yippee!

Keep in mind

There are a few things to keep in mind when doing such an experiment:

What if good results are not showing? The tool does not measure this directly, but it could be indicated by a very low NDCG value for a specific term.
What if there is no good result at all? If there are no results, no rating can be performed and all measures will be 0.
What about new content added after the testing? The tests can be run regularly by the editorial team.

At the end of the day, the measurements are good for comparing test periods and are not meaningful by themselves.

Improvement ideas

Opening rating to site visitors - use the crowd to rate the results, which would be useful on a high-traffic site but could open up some unforeseen problems
Using clicks as ratings - to make the rating transparent, we could use the event of opening a result as an implicit rating. This would actually be good when collecting input from site visitors, but opening a certain result page is not a trustworthy sign of a good result
Implementing a “did you find what you were looking for?” feature - to make the experience even easier for site visitors, a feature could be added to the detailed result page to get an explicit quality rating
Integrating with site analytics
Using rating data to boost particular items in search! - Building a self-adjusting system? Why not :)

By the way, this use case and the tool were demonstrated at the J. Boye Aarhus 2013 conference, and the slides can be found here.

The eZ Publish extension developed here is still in an early alpha stage and not ready to go public, but it has already proven useful. If you are interested in this solution, please contact us. We might push it to stable then :)

Last but not least: this method was applied on top of the eZ Publish CMS, but it could be applied to any search solution.

solr ezpublish ezfind
