Andrew Wilkinson

Random Ramblings on Programming

Posts Tagged ‘whoosh’

Beating Google With CouchDB, Celery and Whoosh (Part 6)

with one comment

We’re nearing the end of our plot to create a Google-beating search engine (in my dreams at least) and in this post we’ll build the interface to query the index we’ve built up. Like Google, the interface is very simple: just a text box on one page and a list of results on another.

To begin with we just need a page with a query box. To make the page slightly more interesting we’ll also include the number of pages in the index, and a list of the top documents as ordered by our ranking algorithm.

The templates in this post extend base.html, which provides the boilerplate needed to make an HTML page.

{% extends "base.html" %}

{% block body %}
    <form action="/search" method="get">
        <input name="q" type="text">
        <input type="submit">
    </form>

    <hr>

    <p>{{ doc_count }} pages in index.</p>

    <hr>

    <h2>Top Pages</h2>

    <ol>
    {% for page in top_docs %}
        <li><a href="{{ page.url }}">{{ page.url }}</a> - {{ page.rank }}</li>
    {% endfor %}
    </ol>
{% endblock %}

To show the number of pages in the index we need to count them. We’ve already created a view to list Pages by their url, and CouchDB can return the number of documents in a view without actually returning any of them, so we can just get the count from that. We’ll add the following function to the Page model class.

    @staticmethod
    def count():
        r = settings.db.view("page/by_url", limit=0)
        return r.total_rows

We also need to be able to get a list of the top pages, by rank. We just need to create a view that has the rank as the key and CouchDB will sort it for us automatically.
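Neither that view nor the wrapper around it appears in this post. Assuming a page/by_rank view whose map function simply emits doc.rank as the key for each page document, a minimal sketch of the Page.get_top_by_rank method used below might look like this:

    @staticmethod
    def get_top_by_rank(limit=20):
        # Query the assumed page/by_rank view from the highest rank downwards
        # and load each matching Page document.
        r = settings.db.view("page/by_rank", descending=True, limit=limit)
        return [Page.load(settings.db, row.value) for row in r.rows]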

With all the background pieces in place the Django view function to render the index is really very straightforward.

def index(req):
    return render_to_response("index.html", {
        "doc_count": Page.count(),
        "top_docs": Page.get_top_by_rank(limit=20),
    })

Now we get to the meat of the experiment, the search results page. First we need to query the index.

def search(req):
    q = QueryParser("content", schema=schema).parse(req.GET["q"])

This parses the user submitted query and prepares the query ready to be used by Whoosh. Next we need to pass the parsed query to the index.

    results = get_searcher().search(q, limit=100)
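The get_searcher helper isn’t defined in these snippets. A minimal sketch, assuming the index directory is named by a WHOOSH_INDEX value in the Django settings, could be:

from django.conf import settings
from whoosh import index

def get_searcher():
    # Open the existing on-disk Whoosh index (settings.WHOOSH_INDEX is an
    # assumed setting pointing at the index directory) and return a searcher.
    return index.open_dir(settings.WHOOSH_INDEX).searcher()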

Hurrah! Now we have a list of results that match our search query. All that remains is to decide what order to display them in. To do this we normalize the score returned by Whoosh and the rank that we calculated, and add them together.

    if len(results) > 0:
        max_score = max([r.score for r in results])
        max_rank = max([r.fields()["rank"] for r in results])

To calculate our combined rank we normalize the score and the rank by setting the largest value of each to one and scaling the rest appropriately.

        combined = []
        for r in results:
            fields = r.fields()
            r.score = r.score/max_score
            r.rank = fields["rank"]/max_rank
            r.combined = r.score + r.rank
            combined.append(r)

The final stage is to sort our list by the combined score and render the results page.

        combined.sort(key=lambda x: x.combined, reverse=True)
    else:
        combined = []

    return render_to_response("results.html", { "q": req.GET["q"], "results": combined })

The template for the results page is below.

{% extends "base.html" %}

{% block body %}
    <form action="/search" method="get">
        <input name="q" type="text" value="{{ q }}">
        <input type="submit">
    </form>

    {% for result in results|slice:":20" %}
        <p>
            <b><a href="{{ result.url }}">{{ result.title|safe }}</a></b> ({{ result.score }}, {{ result.rank }}, {{ result.combined }})<br>
            {{ result.desc|safe }}
        </p>
    {% endfor %}
{% endblock %}

So, there we have it. A complete web crawler, indexer and query website. In the next post I’ll discuss how to scale the search engine.

Read part 7.


Photo of Query by amortize.


Written by Andrew Wilkinson

October 13, 2011 at 12:00 pm

Beating Google With CouchDB, Celery and Whoosh (Part 5)

leave a comment »

In this post we’ll continue building the backend for our search engine by implementing the algorithm we designed in the last post for ranking pages. We’ll also build an index of our pages with Whoosh, a pure-Python full-text indexer and query engine.

To calculate the rank of a page we need to know what other pages link to a given url, and how many links that page has. The code below is a CouchDB map called page/links_to_url. For each page this will output a row for each link on the page with the url linked to as the key and the page’s rank and number of links as the value.

function (doc) {
    if(doc.type == "page") {
        for(i = 0; i < doc.links.length; i++) {
            emit(doc.links[i], [doc.rank, doc.links.length]);
        }
    }
}

As before we’re using a Celery task to allow us to distribute our calculations. When we wrote the find_links task we called calculate_rank with the document id for our page as the parameter.

@task
def calculate_rank(doc_id):
    page = Page.load(settings.db, doc_id)

Next we get a list of ranks for the pages that link to this page. This static method is a thin wrapper around the page/links_to_url map function given above.

    links = Page.get_links_to_url(page.url)
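get_links_to_url isn’t reproduced in the post; given the page/links_to_url map above, a minimal sketch of the wrapper might be:

    @staticmethod
    def get_links_to_url(url):
        # Each row's value is a [rank, number_of_links] pair for a page that
        # links to the given url.
        r = settings.db.view("page/links_to_url", key=url)
        return [row.value for row in r.rows]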

Now we have the list of ranks we can calculate the rank of this page by dividing each linking page’s rank by its number of links and summing across all the linking pages.

    rank = 0
    for link in links:
        rank += link[0] / link[1]

To prevent cycles (where A links to B and B links to A) from causing an infinite loop in our calculation we apply a damping factor. This scales each link’s contribution to 85% of its value and, combined with the change threshold later in the function, forces any loops to settle on a value.

    old_rank = page.rank
    page.rank = rank * 0.85

If we didn’t find any links to this page then we give it a default rank of 1/number_of_pages.

    if page.rank == 0:
        page.rank = 1.0/settings.db.view("page/by_url", limit=0).total_rows

Finally we compare the new rank to the previous rank in our system. If it has changed by more than 0.0001 then we save the new rank and cause all the pages linked to from our page to recalculate their rank.

    if abs(old_rank - page.rank) > 0.0001:
        page.store(settings.db)

        for link in page.links:
            p = Page.get_id_by_url(link, update=False)
            if p is not None:
                calculate_rank.delay(p)
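get_id_by_url comes from an earlier part of this series and isn’t shown here. For this call it just needs to look the url up in the page/by_url view and return the matching document id, or None if we’ve never crawled it; a rough sketch (which ignores the update flag, presumably used elsewhere to queue a re-crawl) might be:

    @staticmethod
    def get_id_by_url(url, update=True):
        # Look the url up in the page/by_url view and return its document id,
        # or None if the page has never been stored.
        r = settings.db.view("page/by_url", key=url)
        if len(r) > 0:
            return r.rows[0].value
        return None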

This is a very simplistic implementation of a page rank algorithm. It does generate a useful ranking of pages, but the number of queued calculate_rank tasks explodes. In a later post I’ll discuss how this could be rewritten to be more efficient.

Whoosh is a pure-Python full text search engine. In the next post we’ll look at querying it, but first we need to index the pages we’ve crawled.

The first step with Whoosh is to specify your schema. To speed up the display of results we store the information we need to render the results page directly in the schema. For this we need the page title, url and description. We also store the score given to the page by our pagerank-like algorithm. Finally we add the page text to the index so we can query it. If you want more details, the Whoosh documentation is pretty good.

from whoosh.fields import *

schema = Schema(title=TEXT(stored=True),
                url=ID(stored=True, unique=True),
                desc=ID(stored=True),
                rank=NUMERIC(stored=True, type=float),
                content=TEXT)

CouchDB provides an interface for being informed whenever a document in the database changes. This is perfect for building an index.

Our full-text indexing daemon is implemented as a Django management command so there is some boilerplate code required to make this work.

class Command(BaseCommand):
    def handle(self, **options):
        since = get_last_change()
        writer = get_writer()

CouchDB allows you to get all the changes that have occurred since a specific point in time (identified by a sequence number). We store this number inside the Whoosh index directory and access it using the get_last_change and set_last_change functions. Our access to the Whoosh index is through an IndexWriter object, again obtained through an abstraction function.
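None of these helpers appear in the post. A minimal sketch, assuming the index directory is named by a WHOOSH_INDEX setting and that schema is the Whoosh Schema defined above, might look like this:

import os

from django.conf import settings
from whoosh import index

# Assumed location for the sequence number, stored alongside the index.
LAST_CHANGE_FILE = os.path.join(settings.WHOOSH_INDEX, "last_change")

def get_writer():
    # Open the on-disk index, creating it on the first run, and return a writer.
    if not os.path.exists(settings.WHOOSH_INDEX):
        os.makedirs(settings.WHOOSH_INDEX)
    if index.exists_in(settings.WHOOSH_INDEX):
        ix = index.open_dir(settings.WHOOSH_INDEX)
    else:
        ix = index.create_in(settings.WHOOSH_INDEX, schema)
    return ix.writer()

def get_last_change():
    # The CouchDB sequence number we have indexed up to, defaulting to 0.
    try:
        return int(open(LAST_CHANGE_FILE).read().strip())
    except (IOError, ValueError):
        return 0

def set_last_change(seq):
    open(LAST_CHANGE_FILE, "w").write(str(seq))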

Now we enter an infinite loop and call the changes function on our CouchDB database object to get the changes.

        try:
            while True:
                changes = settings.db.changes(since=since)
                since = changes["last_seq"]
                for changeset in changes["results"]:
                    try:
                        doc = settings.db[changeset["id"]]
                    except couchdb.http.ResourceNotFound:
                        continue

In our database we store robots.txt files as well as pages, so we need to ignore them. We also need to parse the document so we can pull out the text from the page. We do this with the BeautifulSoup library.

                    if "type" in doc and doc["type"] == "page":
                        soup = BeautifulSoup(doc["content"])
                        if soup.body is None:
                            continue

On the results page we try to use the meta description if we can find it.

                        desc = soup.findAll('meta', attrs={ "name": desc_re })
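desc_re isn’t defined in these snippets; it’s presumably a regular expression that matches the name attribute of description meta tags regardless of case, something like this assumed definition:

import re

# Assumed definition: match meta tags whose name attribute is "description",
# ignoring case (so "Description" is also picked up).
desc_re = re.compile("description", re.IGNORECASE)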

Once we’ve got the parsed document we update our Whoosh index. The code is a little complicated because we need to handle the case where the page doesn’t have a title or description, and because we want the title to be searchable as well as the body text of the page. The key element here is text=True, which pulls out just the text from a node and strips out all of the tags.

                        writer.update_document(
                                title=unicode(soup.title(text=True)[0]) if soup.title is not None and len(soup.title(text=True)) > 0 else doc["url"],
                                url=unicode(doc["url"]),
                                desc=unicode(desc[0]["content"]) if len(desc) > 0 and desc[0]["content"] is not None else u"",
                                rank=doc["rank"],
                                content=unicode(soup.title(text=True)[0] + "\n" + doc["url"] + "\n" + "".join(soup.body(text=True)))
                            )

Finally we update the index and save the last change number so next time the script is run we continue from where we left off.

                    writer.commit()
                    writer = get_writer()

                set_last_change(since)
        finally:
            set_last_change(since)

In the next post I’ll discuss how to query the index, sort the documents by our two rankings and build a simple web interface.

Read part 6.


Photo of order by Steve Mishos.

Written by Andrew Wilkinson

October 11, 2011 at 12:00 pm

Beating Google With CouchDB, Celery and Whoosh (Part 2)

with 2 comments

In this series I’ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In this post we’ll begin by creating the data structure for storing the pages in the database, and write the first parts of the webcrawler.

CouchDB’s Python library has a simple ORM system that makes it easy to convert between the JSON objects stored in the database and a Python object.

To create the class you just need to specify the names of the fields and their types. So, what does a search engine need to store? The url is an obvious one, as is the content of the page. We also need to know when we last accessed the page. To make things easier we’ll also have a list of the urls that the page links to. One of the great advantages of a database like CouchDB is that we don’t need to create a separate table to hold the links, we can just include them directly in the main document. To help return the best pages we’ll use a PageRank-like algorithm to rank the page, so we also need to store that rank. Finally, as is good practice with CouchDB, we’ll give the document a type field so we can write views that only target this document type.

class Page(Document):
    type = TextField(default="page")

    url = TextField()

    content = TextField()

    links = ListField(TextField())

    rank = FloatField(default=0)

    last_checked = DateTimeField(default=datetime.now)

That’s a lot of description for not a lot of code! Just add that class to your models.py file. It’s not a normal Django model, but we’re not using Django models in this project so it’s the right place to put it.

We also need to keep track of the urls that we are and aren’t allowed to access. Fortunately for us Python comes with a class, RobotFileParser, which handles parsing the file for us. So this becomes a much simpler model. We just need the domain name and a pickled RobotFileParser instance. We also need to know whether we’re accessing the site over http or https, and we’ll give it a type field to distinguish it from the Page model.

class RobotsTxt(Document):
    type = TextField(default="robotstxt")

    domain = TextField()
    protocol = TextField()

    robot_parser_pickle = TextField()

We want to make the pickle/unpickle process transparent so we’ll create a property that hides the underlying pickle representation. CouchDB can’t store the binary pickle value, so we also base64 encode it and store that instead. If the object hasn’t been set yet then we create a new one on the first access.

    def _get_robot_parser(self):
        if self.robot_parser_pickle is not None:
            return pickle.loads(base64.b64decode(self.robot_parser_pickle))
        else:
            parser = RobotFileParser()
            parser.set_url(self.protocol + "://" + self.domain + "/robots.txt")
            self.robot_parser = parser

            return parser

    def _set_robot_parser(self, parser):
        self.robot_parser_pickle = base64.b64encode(pickle.dumps(parser))

    robot_parser = property(_get_robot_parser, _set_robot_parser)

For both pages and robots.txt files we need to know whether we should re-fetch them. We’ll do this by testing whether we accessed the file in the last seven days or not. For Page models we do this by adding the following function, which implements this check.

    def is_valid(self):
        return (datetime.now() - self.last_checked).days < 7

For the RobotsTxt we can take advantage of the last modified value stored in the RobotFileParser that we’re wrapping. This is a unix timestamp so the is_valid function needs to be a little bit different, but follows the same pattern.

    def is_valid(self):
        return (time.time() - self.robot_parser.mtime()) < 7*24*60*60

To update the stored copy of a robots.txt we need to get the currently stored version, read a new one, set the last modified timestamp and then write it back to the database. To avoid hitting the same server too often we can use Django’s cache to store a value for ten seconds, and sleep if that value already exists.

    def update(self):
        while cache.get(self.domain) is not None:
            time.sleep(1)
        cache.set(self.domain, True, 10)

        parser = self.robot_parser
        parser.read()
        parser.modified()
        self.robot_parser = parser

        self.store(settings.db)

Once we’ve updated the stored file we need to be able to query it. This function just passes the URL being tested through to the underlying parser along with our user agent string.

    def is_allowed(self, url):
        return self.robot_parser.can_fetch(settings.USER_AGENT, url)

The final piece in our robots.txt puzzle is a function to pull the right object out of the database. We’ll need a view that has the protocol and domain for each file as the key.

    @staticmethod
    def get_by_domain(protocol, domain):
        r = settings.db.view("robotstxt/by_domain", key=[protocol, domain])

We query that mapping and if it returns a value then we load the object. If it’s still valid then we can return right away, otherwise we need to update it.

        if len(r) > 0:
            doc = RobotsTxt.load(settings.db, r.rows[0].value)
            if doc.is_valid():
                return doc

If we’ve never loaded this domain’s robots.txt file before then we need to create a blank object. The final step is to read the file and store it in the database.

        else:
            doc = RobotsTxt(protocol=protocol, domain=domain)

        doc.update()

        return doc

For completeness, here is the map file required for this function.

function (doc) {
    if(doc.type == "robotstxt") {
        emit([doc.protocol, doc.domain], doc._id);
    }
}

In this post we’ve discussed how to represent a webpage in our database as well as keep track of what paths we are and aren’t allowed to access. We’ve also seen how to retrieve the robots.txt files and update them if they’re too old.

Now that we can test whether we’re allowed to access a URL, in the next post in this series I’ll show you how to begin crawling the web and populating our database.

Read part 3.


Photo of Celery, Carrots & Sweet Onion for Chicken Feet Stock by I Believe I Can Fry.

Written by Andrew Wilkinson

September 29, 2011 at 12:00 pm

Beating Google With CouchDB, Celery and Whoosh (Part 1)

with 7 comments

Ok, let’s get this out of the way right at the start – the title is a huge overstatement. This series of posts will show you how to create a search engine using standard Python tools like Django, Celery and Whoosh with CouchDB as the backend.

Celery is a message passing library that makes it really easy to run background tasks and to spread them across a number of nodes. The most recent release added the NoSQL database CouchDB as a possible backend. I’m a huge fan of CouchDB, and the idea of running both my database and message passing backend on the same software really appealed to me. Unfortunately the documentation doesn’t make it clear what you need to do to get CouchDB working, and what the downsides are. I decided to write this series partly to explain how Celery and CouchDB work, but also to experiment with using them together.

In this series I’m going to talk about setting up Celery to work with Django, using CouchDB as a backend. I’m also going to show you how to use Celery to create a web-crawler. We’ll then index the crawled pages using Whoosh and use a PageRank-like algorithm to help rank the results. Finally, we’ll attach a simple Django frontend to the search engine for querying it.

Let’s consider what we need to implement for our webcrawler to work, and be a good citizen of the internet. First and foremost, we must read and respect robots.txt. This is a file that specifies what areas of a site crawlers are banned from. We must also not hit a site too hard, or too often. It is very easy to write a crawler that repeatedly hits a site, and requests the same document over and over again. These are very big no-nos. Lastly we must make sure that we use a custom User Agent so our bot is identifiable.

We’ll divide the algorithm for our webcrawler into three parts. Firstly we’ll need a set of urls. The crawler picks a url, retrieves the page, then stores it in the database. The second stage takes the page content, parses it for links, and adds the links to the set of urls to be crawled. The final stage is to index the retrieved text. This is done by watching for pages that are retrieved by the first stage, and adding them to the full text index.

Celery allows you to create ‘tasks’. These are units of work that are triggered by a piece of code and then executed, after a period of time, on any node in your system. For the crawler we’ll need two separate tasks. The first retrieves and stores a given url. When it completes it triggers a second task, one that parses the links from the page. To begin the process we’ll need to use an external command to feed some initial urls into the system, but after that it will continuously crawl until it runs out of links. A real search engine would want to monitor its index for stale pages and reload those, but I won’t implement that in this example.

I’m going to assume that you have a decent level of knowledge about Python and Django, so you might want to read some tutorials on those first. If you’re following along at home, create yourself a blank Django project with a single app inside. You’ll also need to install django-celery, the CouchDB Python library, and have a working install of CouchDB available.

Read part 2.


Photo of celery by Judy **.

Written by Andrew Wilkinson

September 27, 2011 at 12:00 pm

Searching Stemmed Fields With Whoosh

with one comment

Whoosh is quite a nice pure-Python full-text search engine. While it is still being actively developed and is suitable for production use, there are still some rough edges. One problem that stumped me for a while was searching stemmed fields.

Stemming is where you take the endings off words, such as the ‘ings’ on the word ‘endings’. This reduces the accuracy of searches but greatly increases the chance of users finding something related to what they were looking for.

To create a stemmed field you need to tell Whoosh to use the StemmingAnalyzer, as shown in the schema definition below.

from whoosh.analysis import StemmingAnalyzer
from whoosh.fields import Schema, TEXT, ID

schema = Schema(id=ID(stored=True, unique=True),
                text=TEXT(analyzer=StemmingAnalyzer()))

Using the StemmingAnalyzer will cause Whoosh to stem every word before it is added to the index. If you use the shortcut search function to query with a word that would have been stemmed, it will return no results: the unstemmed word does not exist in the index, even though it appeared in the data that was indexed.

To correctly search a stemmed index you must parse the query and tell the parser to use the Variations term class. This causes the words in the query to also be stemmed, so they correctly match words in the stemmed index.

from whoosh.qparser import QueryParser
from whoosh.query import Variations

searcher = ix.searcher()
qp = QueryParser("text", schema=schema, termclass=Variations)
parsed = qp.parse(query)
docs = searcher.search(parsed)
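To see the whole thing working end to end, here is a self-contained sketch that indexes a single document in an in-memory index (just for illustration) and then finds it with a stemmed query:

from whoosh.analysis import StemmingAnalyzer
from whoosh.fields import Schema, TEXT, ID
from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh.query import Variations

schema = Schema(id=ID(stored=True, unique=True),
                text=TEXT(analyzer=StemmingAnalyzer()))

# Build a throwaway in-memory index containing one document.
ix = RamStorage().create_index(schema)
writer = ix.writer()
writer.add_document(id=u"1", text=u"a post about word endings")
writer.commit()

# Searching for "ending" matches because Variations stems the query term to
# the same root that the StemmingAnalyzer stored in the index.
searcher = ix.searcher()
qp = QueryParser("text", schema=schema, termclass=Variations)
results = searcher.search(qp.parse(u"ending"))
print [r["id"] for r in results]   # expect [u'1']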

Photo of words by feuilllu.

Written by Andrew Wilkinson

January 21, 2010 at 1:28 pm

Posted in whoosh
