Andrew Wilkinson

Random Ramblings on Programming

Searching Stemmed Fields With Whoosh

Whoosh is quite a nice pure-Python full-text search engine. It is still being actively developed and is suitable for production use, but there are still some rough edges. One problem that stumped me for a while was searching stemmed fields.

Stemming is where you take the endings off words, such as the ‘ings’ on the word ‘endings’. This reduces the precision of searches but greatly increases the chances of users finding something related to what they were looking for.
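To make the idea concrete, here is a minimal sketch of a suffix stripper. The `toy_stem` function and its suffix list are my own illustration, not part of Whoosh, and they are far cruder than the Porter-style stemming a real analyzer applies. The point is only that several forms of a word collapse to one term.

```python
# Toy suffix stripper for illustration only; this is NOT the algorithm
# that Whoosh's StemmingAnalyzer uses.
SUFFIXES = ("ings", "ing", "ed", "s")


def toy_stem(word):
    """Lower-case the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


words = ["endings", "ending", "ended", "ends", "end"]
stems = {w: toy_stem(w) for w in words}
print(stems)
```

Every word in the list maps to the same stem, which is why a search for one form can find documents containing another.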

To create a stemmed field you need to tell Whoosh to use the StemmingAnalyzer, as shown in the schema definition below.

from whoosh.analysis import StemmingAnalyzer
from whoosh.fields import Schema, TEXT, ID

schema = Schema(id=ID(stored=True, unique=True),
                text=TEXT(analyzer=StemmingAnalyzer()))

Using the StemmingAnalyzer will cause Whoosh to stem every word before it is added to the index. If you use the shortcut search function to search for a word that should be stemmed, it will return no results: the unstemmed word does not exist in the index, even though it appeared in the data that was indexed.
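The mismatch is easier to see with a toy inverted index. This is a sketch in plain Python rather than Whoosh itself; `toy_stem` and `build_index` are hypothetical helpers I made up for the illustration, and the stemmer is deliberately simplistic.

```python
from collections import defaultdict


def toy_stem(word):
    # Stand-in for a real stemmer: strips a trailing "ings", "ing" or "s".
    word = word.lower()
    for suffix in ("ings", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


docs = {"1": "happy endings", "2": "searching stemmed fields"}
index = build_index(docs)

# Query term used exactly as typed: only the stem was stored, so no match.
raw_hits = index.get("endings", set())
# Query term stemmed the same way as the indexed text: now it matches.
stemmed_hits = index.get(toy_stem("endings"), set())

print(raw_hits)      # set()
print(stemmed_hits)  # {'1'}
```

The index only ever stores the stemmed term, so the query has to go through the same processing as the documents before the lookup can succeed.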

To correctly search a stemmed index you must parse the query and tell the parser to use the Variations term class. This causes the words in the query to also be stemmed, so they correctly match words in the stemmed index.

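from whoosh.qparser import QueryParser
from whoosh.query import Variations

# ix is assumed to be an index built with the schema above,
# and query the user's search string.
searcher = ix.searcher()
qp = QueryParser("text", schema=schema, termclass=Variations)
parsed = qp.parse(query)
docs = searcher.search(parsed)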

Photo of words by feuilllu.

Written by Andrew Wilkinson

January 21, 2010 at 1:28 pm

Posted in whoosh
