Searching Stemmed Fields With Whoosh

Whoosh is quite a nice pure-Python full-text search engine. While it is still being actively developed and is suitable for production use, there are still some rough edges. One problem that stumped me for a while was searching stemmed fields.

Stemming is where you take the endings off words, such as the ‘ings’ in ‘endings’, so that related forms of a word share a single root. This reduces the precision of searches but greatly increases the chances of users finding something related to what they were looking for.
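To make the idea concrete, here is a deliberately naive sketch of suffix stripping. It is not the algorithm Whoosh uses (real stemmers such as Porter's apply far richer rules); the `SUFFIXES` list and the `naive_stem` helper are made up purely for illustration.

```python
# Hypothetical suffix list for illustration only; longest suffixes are tried first.
SUFFIXES = ("ings", "ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["end", "ends", "ended", "ending", "endings"]
# Every form collapses to the same root, so a search for one can match them all.
stems = {naive_stem(w) for w in words}
print(stems)
```

Because all five forms share one root, a query for any of them can match documents containing the others, which is where the extra recall (and the lost precision) comes from.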

To create a stemmed field you need to tell Whoosh to use the StemmingAnalyzer, as shown in the schema definition below.

from whoosh.analysis import StemmingAnalyzer
from whoosh.fields import Schema, TEXT, ID

schema = Schema(id=ID(stored=True, unique=True),
                text=TEXT(analyzer=StemmingAnalyzer()))

Using the StemmingAnalyzer will cause Whoosh to stem every word before it is added to the index. If you then use the shortcut search function to search with a word that should be stemmed, it will return no results: the unstemmed word does not exist in the index, even though it appeared in the data that was indexed.

To correctly search a stemmed index you must parse the query and tell the parser to use the Variations term class. This causes the words in the query to also be stemmed, so they correctly match words in the stemmed index.

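from whoosh.qparser import QueryParser
from whoosh.query import Variations

# ix is the open index; query is the user's search string
searcher = ix.searcher()
qp = QueryParser("text", schema=schema, termclass=Variations)
parsed = qp.parse(query)
docs = searcher.search(parsed)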

Photo of words by feuilllu.


Author: Andrew Wilkinson

I'm a computer programmer and team leader working at the UK grocer and tech company, Ocado Technology. I mostly write multithreaded real time systems in Java, but in the past I've worked with C#, C++ and Python.

One thought on “Searching Stemmed Fields With Whoosh”

  1. Hi,

    This post is pretty old now, so what I say here may not have been true at the time, but as of now, once you have set up a stemmed field and passed the right schema to the QueryParser, you do not need to set the termclass to whoosh.query.Variations.

    In fact, if you use Variations of the user’s query, you most likely wouldn’t use stemming at indexing time. You either stem at indexing time or use the morphological variations of the query at query time.

    The results are the same, but using variations takes more computing power at query time, since your Python application needs to compare every variation of each query term to the terms in the indexed field.

    In addition, if you want to use the Variations termclass for the terms in the user’s query instead of stemming the fields while indexing, you need to import the Variations class from whoosh.query before using it, like this: `from whoosh.query import Variations` 😉

    Anyway, thanks for this post; it pointed my investigation in the right direction now that I’m discovering Whoosh.

    Cheers
