Scalable Collaborative Filtering With MongoDB

Book AddictionMany websites have some form of recommendation system. While it’s simple to create a recommendation system for small amounts of data, how do you create a system that scales to huge amounts of data?

How to actually calculate the similarity of two items is a complicated topic with many possible solutions. Which one if appropriate depends on your particularly application. If you want to find out more I suggest reading the excellent Programming Collective Intelligence (Amazon affiliate link) by Toby Segaran.

We’ll take the simplest method for calculating similarity and just calculate the percentage of users who have visited both pages compared to the total number who have visited either. If we have Page 1 that was visited by user A, B and C and Page 2 that was visited by A, C and D then the A and C visited both, but A, B, C and D visited either one so the similarity is 50%.

With thousands or millions of items and millions or billions of views calculating the similarity between items becomes a difficult problem. Fortunately MongoDB’s sharding and replication allow us to scale the calculations to cope with these large datasets.

First let’s create a set of views across a number of items. A view is stored as a single document in MongoDB. You would probably want to include extra information such as the time of the view, but for our purposes this is all that is required.

views = [
        { "user": "0", "item": "0" },
        { "user": "1", "item": "0" },
        { "user": "1", "item": "0" },
        { "user": "1", "item": "1" },
        { "user": "2", "item": "0" },
        { "user": "2", "item": "1" },
        { "user": "2", "item": "1" },
        { "user": "3", "item": "1" },
        { "user": "3", "item": "2" },
        { "user": "4", "item": "2" },

for view in views:

The first step is to process this list of view of events so we can take a single item and get a list of all the users that have viewed it. To make sure this scales over a large number of views we’ll use MongoDB’s map/reduce functionality.

def article_user_view_count():
    map_func = """
function () {
    var view = {}
    view[this.user] = 1
    emit(this.item, view);

We’ll build a javascript Object where the keys are the user id and the value is the number of time that user has viewed this item. In the map function we we build an object that represents a single view and emit it using the item id as the key. MongoDB will group all the objects emitted with the same key and run the reduce function, shown below.

    reduce_func = """
function (key, values) {
    var view = values[0];

    for (var i = 1; i < values.length; i++) {
        for(var item in values[i]) {
            if(!view.hasOwnProperty(item)) { view[item] = 0; }

            view[item] = view[item] + values[i][item];
    return view;

A reduce function takes two parameters, the key and a list of values. The values that are passed in can either be those emitted by the map function, or values returned from the reduce function. To help it scale not all of the original values will be processed at once, and the reduce function must be able to handle input from the map function or its own output. Here we output a value in the same format as the input so we don’t need to do anything special.

    db.views.map_reduce(Code(map_func), Code(reduce_func), out="item_user_view_count")

The final step is to run the functions we’ve just created and output the data into a new collection. Here we’re recalculating all the data each time this function is run. To scale properly you should filter the input based on the date the view occurred and merge it with the output collection, rather than replacing it as we are doing here.

Now we need calculate a matrix of similarity values, linking each item with every other item. First lets see how we can calculate the similarity of all items to one single item. Again we’ll use map/reduce to help spread the load of running this calculation. Here we’ll just use the map part of map/reduce because each input document will be represented by a single output document.

def similarity(item):
    map_func = """
function () {
    if(this._id == "%s") { return; }

    var viewed_both = {};
    var viewed_any = %s;

    for (var user in this.views) {
        if(this.value.hasOwnProperty(user)) {
            viewed_both[user] = 1;

        viewed_any[user] = 1;
     emit("%s"+"_"+this._id, viewed_both.length / viewed_any.length );
""" % (int(item["_id"]), json.dumps(item["value"]), json.dumps(item["value"]) int(item["_id"]), )

The input to our Python function is a document that was outputted by our previous map/reduce call. We build a new Javascript by interpolating some data from this document into a template function. We loop through all the users who viewed the document we’re comparing against and work out whether they have viewed both. At the end of the function we emit the percentage of users who viewed both.

    reduce_func = """
function (key, values) {
    return results[0];

Because we output unique ids in the map function this reduce function will only be called with a single value so we just return that.

    db.item_user_view_count.map_reduce(Code(map_func), Code(reduce_func), out=SON([("merge", "item_similarity")]))

The last step in this function is to run the map reduce. Here as we’re running the map/reduce multiple times we need to merge the output rather than replacing it as we did before.

The final step is to loop through the output from our first map/reduce and call our second function for each item.

for doc in db.item_user_view_count.find():

A key thing to realise is that you don’t need to calculate live similarity data. Once you have even a few hundred views per item then the similarity will remain fairly consistent. In this example we step through each item in turn and calculate the similarity for it with every other item. For a million item database where each iteration of this loop takes one second the similarity data will be updated once every 11 days.

I’m not claiming that you can take the code provided here and immediately have a massively scalable system. MongoDB provides an easy to use replication and sharding system, which are plugged in to its Map/Reduce framework. What you should take away is that by using map/reduce with sharding and replication to calculate the similarity between two items we can quickly get a system that scales well with an increasing number of items and of views.

Photo of Book Addiction by Emily Carlin.


Django ImportError Hiding

Hidden CatA little while ago I was asked what my biggest gripe with Django was. At the time I couldn’t think of a good answer because since I started using Django in the pre-1.0 days most of the rough edges have been smoothed. Yesterday though, I encountered an error that made me wish I thought of it at the time.

The code that produced the error looked like this:

from django.db import models

class MyModel(model.Model):

    def save(self):



The error that was raised was AttributeError: 'NoneType' object has no attribute 'Model'. This means that rather than containing a module object, models was None. Clearly this is impossible as the class could not have been created if that was the case. Impossible or not, it was clearly happening.

Adding a print statement to the module showed that when it was imported the models variable did contain the expected module object. What that also showed was that module was being imported more than once, something that should also be impossible.

After a wild goose chase investigating reasons why the module might be imported twice I tracked it down to the load_app method in django/db/models/ The code there looks something like this:

    def load_app(self, app_name, can_postpone=False):
            models = import_module('.models', app_name)
        except ImportError:
            # Ignore exception

Now I’m being a harsh here, and the exception handler does contain a comment about working out if it should reraise the exception. The issue here is that it wasn’t raising the exception, and it’s really not clear why. It turns out that I had a misspelt module name in an import statement in a different module. This raised an ImportError which was caught, hidden and then Django repeatedly attempted to import the models as they were referenced in the models of other apps. The strange exception that was originally encountered is probably an artefact of Python’s garbage collection, although how exactly it occurred is still not clear to me.

There are a number of tickets (#6379, #14130 and probably others) on this topic. A common refrain in Python is that it’s easier to ask for forgiveness than to ask for permission, and I certainly agree with Django and follow that most of the time.

I always follow the rule that try/except clauses should cover as little code as possible. Consider the following piece of code.


except AttributeError:
    # handle error

Which of the three attribute accesses are we actually trying to catch here? Handling exceptions like this are a useful way of implementing Duck Typing while following the easier to ask forgiveness principle. What this code doesn’t make clear is which member or method is actually optional. A better way to write this would be:


    member = var.member
except AttributeError:
    # handle error

Now the code is very clear that the var variable may or may not have a member member variable. If method1 or method2 do not exist then the exception is not masked and is passed on. Now lets consider that we want to allow the method1 attribute to be optional.

except AttributeError:
    # handle error

At first glance it’s obvious that method1 is optional, but actually we’re catching too much here. If there is a bug in method1 that causes an AttributeError to raised then this will be masked and the code will treat it as if method1 didn’t exist. A better piece of code would be:

    method = var.method1
except AttributeError:
    # handle error

ImportErrors are similar because code can be executed, but then when an error occurs you can’t tell whether the original import failed or whether an import inside that failed. Unlike with an AttributeError there is a no easy way to rewrite the code to only catch the error you’re interested in. Python does provide some tools to divide the import process into steps, so you can tell whether the module exists before attempting to import it. In particular the imp.find_module function would be useful.

Changing Django to avoid catching the wrong ImportErrors will greatly complicate the code. It would also introduce the danger that the algorithm used would not match the one used by Python. So, what’s the moral of this story? Never catch more exceptions than you intended to, and if you get some really odd errors in your Django site watch out for ImportErrors.

Photo of Hidden Cat by Craig Grahford.

Beating Google With CouchDB, Celery and Whoosh (Part 8)

github 章魚貼紙In the previous seven posts I’ve gone through all the stages in building a search engine. If you want to try and run it for yourself and tweak it to make it even better then you can. I’ve put the code up on GitHub. All I ask is that if you beat Google, you give me a credit somewhere.

When you’ve downloaded the code it should prove to be quite simple to get running. First you’ll need to edit It should work out of the box, but you should change the USER_AGENT setting to something unique. You may also want to adjust some of the other settings, such as the database connection or CouchDB urls.

To set up the CouchDB views type python update_couchdb.

Next, to run the celery daemon you’ll need to type the following two commands:

python celeryd -Q retrieve
python celeryd -Q process

This sets up the daemons to monitor the two queues and process the tasks. As mentioned in a previous post two queues are needed to prevent one set of tasks from swamping the other.

Next you’ll need to run the full text indexer, which can be done with python index_update and then you’ll want to run the server using python runserver.

At this point you should have several process running not doing anything. To kick things off we need to inject one or more urls into the system. You can do this with another management command, python start_crawl http://url. You can run this command as many times as you like to seed your crawler with different pages. It has been my experience that the average page has around 100 links on it so it shouldn’t take long before your crawler is scampering off to crawl many more pages that you initially seeded it with.

So, how well does Celery work with CouchDB as a backend? The answer is that it’s a bit mixed. Certainly it makes it very easy to get started as you can just point it at the server and it just works. However, the drawback, and it’s a real show stopper, is that the Celery daemon will poll the database looking for new tasks. This polling, as you scale up the number of daemons will quickly bring your server to its knees and prevent it from doing any useful work.

The disappointing fact is that Celery could watch the _changes feed rather than polling. Hopefully this will get fixed in a future version. For now though, for anything other experimental scale installations RabbitMQ is a much better bet.

Hopefully this series has been useful to you, and please do download the code and experiment with it!

Photo of github 章魚貼紙 by othree.

Beating Google With CouchDB, Celery and Whoosh (Part 7)

The Planet Data CenterThe key ingredients of our search engine are now in place, but we face a problem. We can download webpages and store them in CouchDB. We can rank them in order of importance and query them using Whoosh but the internet is big, really big! A single server doesn’t even come close to being able to hold all the information that you would want it to – Google has an estimated 900,000 servers. So how do we scale this the software we’ve written so far effectively?

The reason I started writing this series was to investigate how well Celery’s integration with CouchDB works. This gives us an immediate win in terms of scaling as we don’t need to worry about a different backend, such as RabbitMQ. Celery itself is designed to scale so we can run celeryd daemons as many boxes as we like and the jobs will be divided amongst them. This means that our indexing and ranking processes will scale easily.

CouchDB is not designed to scale across multiple machines, but there is some mature software, CouchDB-lounge that does just that. I won’t go into how to get set this up but fundamentally you set up a proxy that sits in front of your CouchDB cluster and shards the data across the nodes. It deals with the job of merging view results and managing where the data is actually stored so you don’t have to. O’Reilly’s CouchDB: The Definitive Guide has a chapter on clustering that is well worth a read.

Unfortunately while Woosh is easy to work with it’s not designed to be used on a large scale. Indeed if someone was crazy enough to try to run the software we’ve developed in this series they might be advised to replace Whoosh with Solr. Solr is a lucene-based search server which provides an HTTP interface to the full-text index. Solr comes with a sharding system to enable you to query an index that is too large for a single machine.

So, with our two data storage tools providing HTTP interface and both having replication and sharding either built in or as available as a proxy the chances of being able to scale effectively are good. Celery should allow the background tasks that are needed to run a search engine can be scaled, but the challenges of building and running a large scale infrastructure are many and I would not claim that these tools mean success is guarenteed!

In the final post of this series I will discuss what I’ve learnt about running Celery with CouchDB, and with CouchDB in general. I’ll also describe how to download and run the complete code so you can try these techniques for yourself.

Read part 8.

Photo of The Planet Data Center by The Planet.

Beating Google With CouchDB, Celery and Whoosh (Part 6)

QueryWe’re nearing the end of our plot to create a Google-beating search engine (in my dreams at least) and in this post we’ll build the interface to query the index we’ve built up. Like Google the interface is very simple, just a text box on one page and a list of results on another.

To begin with we just need a page with a query box. To make the page slightly more interesting we’ll also include the number of pages in the index, and a list of the top documents as ordered by our ranking algorithm.

In the templates on this page we reference base.html which provides the boiler plate code needed to make an HTML page.

{% extends "base.html" %}

{% block body %}
    <form action="/search" method="get">
        <input name="q" type="text">
        <input type="submit">


    <p>{{ doc_count }} pages in index.</p>


    <h2>Top Pages</h2>

    {% for page in top_docs %}
        <li><a href="{{ page.url }}">{{ page.url }}</a> - {{ page.rank }}</li>
    {% endfor %}
{% endblock %}

To show the number of pages in the index we need to count them. We’ve already created an view to list Pages by their url and CouchDB can return the number of documents in a view without actually returning any of them, so we can just get the count from that. We’ll add the following function to the Page model class.

    def count():
        r = settings.db.view("page/by_url", limit=0)
        return r.total_rows

We also need to be able to get a list of the top pages, by rank. We just need to create view that has the rank as the key and CouchDB will sort it for us automatically.

With all the background pieces in place the Django view function to render the index is really very straightforward.

def index(req):
    return render_to_response("index.html", { "doc_count": Page.count(), "top_docs": Page.get_top_by_rank(limit=20) })

Now we get to the meat of the experiment, the search results page. First we need to query the index.

def search(req):
    q = QueryParser("content", schema=schema).parse(req.GET["q"])

This parses the user submitted query and prepares the query ready to be used by Whoosh. Next we need to pass the parsed query to the index.

    results = get_searcher().search(q, limit=100)

Hurrah! Now we have list of results that match our search query. All that remains is to decide what order to display them in. To do this we normalize the score returned by Whoosh and the rank that we calculated, and add them together.

    if len(results) > 0:
        max_score = max([r.score for r in results])
        max_rank = max([r.fields()["rank"] for r in results])

To calculate our combined rank we normalize the score and the rank by setting the largest value of each to one and scaling the rest appropriately.

        combined = []
        for r in results:
            fields = r.fields()
            r.score = r.score/max_score
            r.rank = fields["rank"]/max_rank
            r.combined = r.score + r.rank

The final stage is to sort our list by the combined score and render the results page.

        combined.sort(key=lambda x: x.combined, reverse=True)
        combined = []

    return render_to_response("results.html", { "q": req.GET["q"], "results": combined })

The template for the results page is below.

{% extends "base.html" %}

{% block body %}
    <form action="/search" method="get">
        <input name="q" type="text" value="{{ q }}">
        <input type="submit">

    {% for result in results|slice:":20" %}
            <b><a href="{{ result.url }}">{{ result.title|safe }}</a></b> ({{ result.score }}, {{ result.rank }}, {{ result.combined }})<br>
            {{ result.desc|safe }}
    {% endfor %}
{% endblock %}

So, there we have it. A complete web crawler, indexer and query website. In the next post I’ll discuss how to scale the search engine.

Read part 7.

Photo of Query by amortize.

Beating Google With CouchDB, Celery and Whoosh (Part 5)

orderIn this post we’ll continue building the backend for our search engine by implementing the algorithm we designed in the last post for ranking pages. We’ll also build a index of our pages with Whoosh, a pure-Python full-text indexer and query engine.

To calculate the rank of a page we need to know what other pages link to a given url, and how many links that page has. The code below is a CouchDB map called page/links_to_url. For each page this will output a row for each link on the page with the url linked to as the key and the page’s rank and number of links as the value.

function (doc) {
    if(doc.type == "page") {
        for(i = 0; i < doc.links.length; i++) {
            emit(doc.links[i], [doc.rank, doc.links.length]);

As before we’re using a Celery task to allow us to distribute our calculations. When we wrote the find_links task we called calculate_rank with the document id for our page as the parameter.

def calculate_rank(doc_id):
    page = Page.load(settings.db, doc_id)

Next we get a list of ranks for the page’s that link to this page. This static method is a thin wrapper around the page/links_to_url map function given above.

    links = Page.get_links_to_url(page.url)

Now we have the list of ranks we can calculate the rank of this page by dividing the rank of the linking page by the number of links and summing this across all the linking pages.

    rank = 0
    for link in links:
        rank += link[0] / link[1]

To prevent cycles (where A links to B and B links to A) from causing an infinite loop in our calculation we apply a damping factor. This causes the value of the link to decline by 0.85 and combined with the limit later in the function will force any loops to settle on a value.

    old_rank = page.rank
    page.rank = rank * 0.85

If we didn’t find any links to this page then we give it a default rank of 1/number_of_pages.

    if page.rank == 0:
        page.rank = 1.0/settings.db.view("page/by_url", limit=0).total_rows

Finally we compare the new rank to the previous rank in our system. If it has changed by more than 0.0001 then we save the new rank and cause all the pages linked to from our page to recalculate their rank.

    if abs(old_rank - page.rank) > 0.0001:

        for link in page.links:
            p = Page.get_id_by_url(link, update=False)
            if p is not None:

This is a very simplistic implementation of a page rank algorithm. It does generate a useful ranking of pages, but the number of queued calculate_rank tasks explodes. In a later post I’ll discuss how this could be made rewritten to be more efficient.

Whoosh is a pure-Python full text search engine. In the next post we’ll look at querying it, but first we need to index the pages we’ve crawled.

The first step with Whoosh is to specify your schema. To speed up the display of results we store the information we need to render the results page directly in the schema. For this we need the page title, url and description. We also store the score given to the page by our pagerank-like algorithm. Finally we add the page text to the index so we can query it. If you want more details, the Whoosh documentation is pretty good.

from whoosh.fields import *

schema = Schema(title=TEXT(stored=True), url=ID(stored=True, unique=True), desc=ID(stored=True), rank=NUMERIC(stored=True, type=float), content=TEXT)

CouchDB provides an interface for being informed whenever a document in the database changes. This is perfect for building an index.

Our full-text indexing daemon is implemented as a Django management command so there is some boilerplate code required to make this work.

class Command(BaseCommand):
    def handle(self, **options):
        since = get_last_change()
        writer = get_writer()

CouchDB allows you to get all the changes that have occurred since a specific point in time (using a revision number). We store this number inside the Whoosh index directory, and accessing it using the get_last_change and set_last_change functions. Our access to the Whoosh index is through a IndexWriter object, again accessed through an abstraction function.

Now we enter an infinite loop and call the changes function on our CouchDB database object to get the changes.

            while True:
                changes = settings.db.changes(since=since)
                since = changes["last_seq"]
                for changeset in changes["results"]:
                        doc = settings.db[changeset["id"]]
                    except couchdb.http.ResourceNotFound:

In our database we store robots.txt files as well as pages, so we need to ignore them. We also need to parse the document so we can pull out the text from the page. We do this with the BeautifulSoup library.

                    if "type" in doc and doc["type"] == "page":
                        soup = BeautifulSoup(doc["content"])
                        if soup.body is None:

On the results page we try to use the meta description if we can find it.

                        desc = soup.findAll('meta', attrs={ "name": desc_re })

Once we’ve got the parsed document we update our Whoosh index. The code is a little complicated because we need to handle the case where the page doesn’t have a title or description, and that we search for the title as well as the body text of the page. The key element here is text=True which pulls out just the text from a node and strips out all of the tags.

                                title=unicode(soup.title(text=True)[0]) if soup.title is not None and len(soup.title(text=True)) > 0 else doc["url"],
                                desc=unicode(desc[0]["content"]) if len(desc) > 0 and desc[0]["content"] is not None else u"",
                                content=unicode(soup.title(text=True)[0] + "\n" + doc["url"] + "\n" + "".join(soup.body(text=True)))

Finally we update the index and save the last change number so next time the script is run we continue from where we left off.

                    writer = get_writer()


In the next post I’ll discuss how to query the index, sort the documents by our two rankings and build a simple web interface.

Read part 6.

Photo of order by Steve Mishos.

Beating Google With CouchDB, Celery and Whoosh (Part 4)

Red Sofa encounter iIn this series I’m showing you how to build a webcrawler and search engine using standard Python based tools like Django, Celery and Whoosh with a CouchDB backend. In previous posts we created a data structure, parsed and stored robots.txt and stored a single webpage in our document. In this post I’ll show you how to parse out the links from our stored HTML document so we can complete the crawler, and we’ll start calculating the rank for the pages in our database.

There are several different ways of parsing out the links in a given HTML document. You can just use a regular expression to pull the urls out, or you can use a more complete but also more complicated (and slower) method of parsing the HTML using the standard Python htmlparser library, or the wonderful Beautiful Soup. The point of this series isn’t to build a complete webcrawler, but to show you the basic building blocks. So, for simplicity’s sake I’ll use a regular expression.

link_single_re = re.compile(r"<a[^>]+href='([^']+)'")
link_double_re = re.compile(r'<a[^>]+href="([^"]+)"')

All we need to look for an href attribute in an a tag. We’ll use two regular expressions to handle single and double quotes, and then build a list containing all the links in the document.

def find_links(doc_id):
    doc = Page.load(settings.db, doc_id)

    raw_links = []
    for match in link_single_re.finditer(doc.content):

    for match in link_double_re.finditer(doc.content):

Once we’ve got a list of the raw links we need to process them into absolute urls that we can send back to the retrieve_page task we wrote earlier. I’m cutting some corners with processing these urls, in particular I’m not dealing with base tags.

    doc.links = []
    for link in raw_links:
        if link.startswith("#"):
        elif link.startswith("http://") or link.startswith("https://"):
        elif link.startswith("/"):
            parse = urlparse(doc["url"])
            link = parse.scheme + "://" + parse.netloc + link
            link = "/".join(doc["url"].split("/")[:-1]) + "/" + link


Once we’ve got our list of links and saved the modified document we then need to trigger the next series of steps to occur. We need to calculate the rank of this page, so we trigger that task and then we step through each page that we linked to. If we’ve already got a copy of the page then we want to recalculate its rank to take into account the rank of this page (more on this later) and if we don’t have a copy then we queue it up to be retrieved.


    for link in doc.links:
        p = Page.get_id_by_url(link, update=False)
        if p is not None:

We’ve now got a complete webcrawler. We can store webpages and robots.txt files. Given a starting URL our crawler will set about parsing pages to find out what they link to and retrieve those pages as well. Given enough time you’ll end up with most of the internet on your harddisk!

When we come to write the website to query the information we’ve collected we’ll use two numbers to rank pages. First we’ll use the a value that ranks pages base on the query used, but we’ll also use a value that ranks pages based on their importance. This is the same method used by Google, known as Page Rank.

Pank Rank is a measure of how likely you are to end up on a given page by clicking on a random link anywhere on the internet. The Wikipedia article goes into some detail on a number of ways to calculate it, but we’ll use a very simple iterative algorithm.

When created, a page is given a rank equal to 1/number of pages. Each link that is found on a newly crawled page then causes the rank of the destination page to be calculated. In this case the rank of a page is the sum of the ranks of the pages that link to it, divided by the number of links on those pages, multiplied by a dampening factor (I use 0.85, but this could be adjusted.) If a page has a rank of 0.25 and has five links then each page linked to gains 0.05*0.85 rank for that link. If the change in rank of the page when recalculated is significant then the rank of all the pages it links to are recalculated.

In this post we’ve completed the web crawler part of our search engine and discussed how to rank pages in importance. In the next post we’ll implement this ranking and also create a full text index of the pages we have crawled.

Read part 5.

Photo of Red Sofa encounter i by Ricard Gil.