Back Garden Weather in CouchDB (Part 5)

After a two week gap the recent snow in the UK has inspired me to get back to my series of posts on my weather station website, WelwynWeather.co.uk. In this post I’ll discuss the records page, which shows details such as the highest and lowest temperatures, and the heaviest periods of rain.

From a previous post in this series you’ll remember that the website is implemented as a CouchApp: a set of Javascript functions that run inside the CouchDB database. While they provide quite a lot of flexibility, you do need to tailor your code to them.

On previous pages we have used CouchDB’s map/reduce framework to summarise data and then used a list function to display the results. The records page could take a similar approach, but there are some drawbacks to that. Unlike the rest of the pages, the information on the records page consists of a number of unrelated numbers. We could create a single map/reduce function to process all of them at once, but that function would quickly grow and become unmanageable, so instead we’ll calculate the statistics individually and use AJAX to load them into the page dynamically.

To calculate the minimum indoor temperature we first need to create a simple view. As with all CouchDB views this starts with a map function that outputs the parts of each document we are interested in.

function(doc) {
    emit(doc._id, { "temp_in": doc.temp_in, "timestamp": doc.timestamp });
}

Next we create a reduce function to find the lowest temperature. To do this we simply loop through all the values and select the smallest temperature, recording the timestamp at which it occurred.

function(keys, values, rereduce) {
    // Because the object we return has the same shape as the values emitted by
    // the map function, this also works unchanged when CouchDB rereduces our output.
    var min = values[0].temp_in;
    var min_on = values[0].timestamp;

    for(var i=0; i<values.length; i++) {
        if(values[i].temp_in < min) {
            min = values[i].temp_in;
            min_on = values[i].timestamp;
        }
    }

    return { "temp_in": min, "timestamp": min_on };
}

The website welwynweather.co.uk actually points at the design document’s _rewrite handler, so to make the view available we add a rewrite rule to expose it to the world. As we want to reduce all the documents to a single value we just need to pass reduce=true in the query.

{
    "from": "/records/temperature/in/min",
    "to": "/_view/records_temp_in_min",
    "query": { "reduce": "true" }
},

Lastly we can use jQuery to load the data and place the values into the DOM at the appropriate places. As CouchDB sends the correct MIME type, jQuery will automatically decode the JSON data, making this function very straightforward.

$.getJSON("records/temperature/in/min", function (data, textStatus, jqXHR) {
    var row = data.rows[0].value;
    var date = new Date(row.timestamp*1000);
    $("#min_temp_in").html(row.temp_in);
    $("#min_temp_in_date").html(date.toUTCString());
});

This approach works well for most of the records that I want to calculate. Where it falls down is in calculating the wettest days and heaviest rain, as the data needs to be aggregated before being reduced to a single value. Unfortunately CouchDB does not support this. The issue is that you cannot guarantee that the original documents will be passed to your view in order; in fact it is more likely than not that they won’t be. So, to calculate the heaviest periods of rain you would need to build a data structure containing each hour or day and the amount of rain in that period. As the documents are processed the structure would need to be updated and the period with the highest rain found.

Calculating a complicated structure as the result of your reduce function is disallowed by CouchDB, for good reason. An alternative way to find the heaviest periods of rain would be to put the output of the aggregation function into a new database and run another map/reduce function over that to find the heaviest period. Unfortunately CouchDB doesn’t support the chaining of views, so this is impossible without using an external program.

To solve this problem I do the aggregation in CouchDB, then transfer the whole result to the web browser and calculate the heaviest period in Javascript. The browser-side code is given below. It’s very similar to that given above, but includes a loop to cycle over the results and pick the largest value.
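
The aggregation view behind records/rain/wettest isn’t shown in this post. A minimal sketch of what it might look like, assuming each document carries the unix timestamp of the reading and the rain that fell in that five-minute period (both fields are used elsewhere in this series), is:

function(doc) {
    // Key each record by the start of its (UTC) day, so that grouping by key
    // produces one row per day; the value is the rain recorded in this period.
    emit(doc.timestamp - (doc.timestamp % 86400), doc.rain);
}

function(keys, values, rereduce) {
    // Total the rain for each day; sum() is built into CouchDB's view server.
    return sum(values);
}

The matching rewrite would query this view with group=true, so the browser receives one row per day with the day’s timestamp as the key and the total rain as the value, which is what the loop below expects.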

$.getJSON("records/rain/wettest", function (data, textStatus, jqXHR) {
        var max_on = data.rows[0].key;
        var max_rain = data.rows[0].value;
        for(var i=0; i<data.rows.length; i++) {
            if(data.rows[i].value > max_rain) {
                max_on = data.rows[i].key;
                max_rain = data.rows[i].value;
            }
        }
        var date = new Date(max_on*1000);
    $("#wettest_day").html(max_rain);
        $("#wettest_day_date").html(date.toDateString());
    });

This solution works OK, but as time goes on the dataset gets bigger and bigger and the amount of data transferred to the browser will grow with it. Hopefully in future I’ll be able to write another post about changing this to use chained views.

CouchDB is a great document store that is at home on the web. The ability to run simple sites right from your database is extremely useful and makes deployment a snap. As with all technology you need to be aware of the limitations of CouchDB and allow for them in your designs. In my case the inability to chain views together is really the only wart in the code. Don’t forget you can replicate the database to get the data and use the couchapp command to clone a copy of the site. See the first post in this series for instructions on how to do this. Please let me know in the comment section below if you find the site useful or have any questions or comments on the code.


Photo of Snow falling by Dave Gunn.

Back Garden Weather in CouchDB (Part 4)

In this series of posts I’m describing how I created a CouchDB CouchApp to display the weather data collected by the weather station in my back garden. In the previous post I showed you how to display a single day’s weather data. In this post we will look at processing the data to display it by month.

The data my weather station collects consists of a record every five minutes. This means that a 31 day month will consist of 8,928 records. Unless you have space to draw a graph almost nine thousand pixels wide, there is no point in wasting valuable rendering time processing that much data. Reducing the data to one point per hour gives us a much more manageable 744 data points for a month. A full year’s worth of weather data consists of 105,120 records, and even reduced to one point per hour that is still 8,760 points. When rendering a year’s worth of data it is clearly worth reducing the data even further, this time to one point per day.

How do we use CouchDB to reduce the data to one point per hour? Fortunately CouchDB’s map/reduce architecture is perfect for this type of processing. CouchDB will also cache the results of the processing automatically so it only needs to be run once rather than requiring an expensive denormalisation process each time some new data is uploaded.

First we need to group the five minute weather records together into groups for each hour. We could do this by taking the unix timestamp of each record and rounding it to the nearest hour. The problem with this approach is that the keys are included in the urls, and if you can calculate unix timestamps in your head then your maths is better than mine! To make the urls friendlier we’ll use a Javascript implementation of sprintf to build a human-readable representation of the date and hour, excluding the minute component.

function(doc) {
    // !code vendor/sprintf-0.6.js

    emit(sprintf("%4d-%02d-%02d %02d", doc.year, doc.month, doc.day, doc.hour), doc);
}

CouchDB will helpfully group documents with the same key, so all the records from the same hour will be passed to the reduce function. What you cannot guarantee, though, is that all the records will be passed in one go; instead you must ensure that your reduce function can operate on its own output. You can tell whether you are ‘rereducing’ the output of the reduce function by checking the third parameter to the function.

function(keys, values, rereduce) {
    var count = 0;

    var timestamp = values[0].timestamp;
    var temp_in = 0;
    var temp_out = 0;
    var abs_pressure = 0;
    var rain = 0;
    var wind_ave = 0;
    var wind_gust = 0;

    var wind_dir = [];
    for(var i=0; i<8; i++) { wind_dir.push({ value: 0}); }

To combine the multiple records it makes sense to average most of the values. The exceptions are the amount of rain, which should be summed; the wind direction, which should be a count of the gusts in each direction; and the wind gust speed, which should be the maximum value. Because your reduce function may be called more than once, calculating the average is not straightforward. If you simply average the values passed in then you will be calculating an average of averages, which is not the same as the average of the full original data. To work around this we store each average together with the number of records it represents. Then, when we rereduce, we multiply each average by its count, sum the results and divide by the total count.
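
Put another way, if each earlier pass produced a partial average \(\bar{x}_i\) over \(n_i\) records, the rereduce recombines them as a weighted mean:

\[ \bar{x} = \frac{\sum_i n_i \bar{x}_i}{\sum_i n_i} \]

which is exactly what multiplying each value by vcount and dividing by the final count does in the snippets below.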

In the previous, simplified, code snippet we set up the variables that will hold the averages.

    for(var i=0; i<values.length; i++) {
        var vcount;
        if(rereduce) { vcount = values[i].count } else { vcount = 1 }

We now loop through each of the values and work out how many weather records the value we’re processing represents. The initial pass will just represent a single record, but in the rereduce step it will be more.

        temp_in = temp_in + values[i].temp_in * vcount;
        temp_out = temp_out + values[i].temp_out * vcount;
        abs_pressure = abs_pressure + values[i].abs_pressure * vcount;

Here we build up the total values for temperature and pressure. Later we’ll divide these by the number of records to get the average. The next section adds up the rainfall, accumulates the average wind speed and selects the maximum wind gust.

        rain = rain + values[i].rain;

        wind_ave = wind_ave + values[i].wind_ave * vcount;
        if(values[i].wind_gust > wind_gust) { wind_gust = values[i].wind_gust; }

So far we’ve not really had to worry about the possibility of a rereduce, but for wind direction we need to take it into account. An individual record has a single wind direction, but for an hourly record we want to store a count of the number of times each direction was recorded. If we’re rereducing we need to loop through all the directions and combine them.

        if(rereduce) {
            for(var j=0; j<8; j++) {
                wind_dir[j]["value"] += values[i].wind_dir[j]["value"];
            }
        } else if(values[i].wind_ave > 0 && values[i].wind_dir >= 0 && values[i].wind_dir < 16) {
            wind_dir[Math.floor(values[i].wind_dir/2)]["value"] += 1;
        }

        if(values[i].timestamp < timestamp) { timestamp = values[i].timestamp; }
        count = count + vcount;
    }

The final stage is to build the object that we’re going to return. This stage is very straightforward: we just need to divide the numbers we calculated before by the count of the number of records. This gives us the correct average for these values.

    return {
            "count": count,
            "status": status,
            "timestamp": timestamp,
            "temp_in": temp_in / count,
            "temp_out": temp_out / count,
            "abs_pressure": abs_pressure / count,
            "rain": rain,
            "wind_ave": wind_ave / count,
            "wind_gust": wind_gust,
            "wind_dir": wind_dir,
        };
}

Now that we have averaged the weather data into hourly chunks we can use a list function, like the one described in the previous post, to display the data.
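
The post doesn’t show how the monthly page queries this view, but a sketch of the rewrite rule might look like the following (the list and view names, month and weather_by_hour, are hypothetical, a fixed month is used for clarity, and the exact quoting of the keys may need adjusting):

{
    "from": "/2012/02",
    "to": "/_list/month/weather_by_hour",
    "query": { "group": "true", "startkey": "\"2012-02\"", "endkey": "\"2012-03\"" }
},

Passing group=true makes CouchDB run the reduce once per distinct key, giving one averaged row per hour, while startkey and endkey restrict the rows to the month in question (the keys being the "YYYY-MM-DD HH" strings emitted by the map function above).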

In the next and final post in this series I’ll discuss the records page on the weather site.


Photo of Weather front by Paul Wordingham.

Back Garden Weather in CouchDB (Part 3)

In this series I’m describing how I used a CouchDB CouchApp to display the weather data collected by a weather station in my back garden. In the first post I described CouchApps and how to get a copy of the site. In the second post we looked at how to import the data collected by PyWWS and how to render a basic page in a CouchApp. In this post we’ll extend the basic page to display real weather data.

Each document in the database is a record of the weather data at a particular point in time. As we want to display the data over a whole day we need to use a list function. List functions work similarly to the show function we saw in the previous post. Unlike show functions, list functions don’t have a document passed in; instead they can call a getRow function, which returns the next row to process and returns null when there are no rows left.

Show functions process an individual document and return a single object containing the processed data and any HTTP headers. Because a list function can process a potentially huge number of rows, it returns data in a different way. Rather than returning a single object containing the whole response, list functions must return their response in chunks. First you need to call the start function, passing in any headers that you want to return. Then you call send one or more times to return parts of your response. A typical list function will look like the code below.

function (head, req) {
    start({ "headers": { "Content-Type": "text/html" }});

    send(header);
    while(row = getRow()) {
        data = /* process row */;
        send(data);
    }
    send(footer);
}

To process the weather data we can’t follow this simple format because we need to split each document up and display the different measurements separately. Let’s look at the code for creating the day page. The complete code is a bit too long to include in a blog post, so check out the first post in this series to find out how to get a full copy.

To start the function we load the templates and code that we need using the CouchApp macros. Next we return the appropriate Content-Type header, and then we create the object that we’ll pass to Mustache when we’ve processed everything.

function(head, req) {
    // !json templates.day
    // !json templates.head
    // !json templates.foot
    // !code vendor/couchapp/lib/mustache.js
    // !code vendor/sprintf-0.6.js
    // !code vendor/date_utils.js

    start({ "headers": { "Content-Type": "text/html" }});

    var stash = {
        head: templates.head,
        foot: templates.foot,
        date: req.query.startkey,
    };

Next we build a list of the documents that we’re processing so we can loop over the documents multiple times.

    var rows = [];
    while (row = getRow()) {
        rows.push(row.doc);
    }

To calculate maximum and minimum values we need to choose a starting value and then run through each piece of data to see whether it is higher or lower than the current record. As the weather station’s data collector is separate from the outside sensors, occasionally they lose their connection. This means that we can’t just pick the value in the first document as our starting value; instead we must choose the first document where the connection with the outside sensors was made.

    if(rows.length > 0) {
        for(var i=0; i<rows.length; i++) {
            if((rows[i].status & 64) == 0) {
                max_temp_out = rows[i].temp_out;
                min_temp_out = rows[i].temp_out;
                max_hum_out = rows[i].hum_out;
                min_hum_out = rows[i].hum_out;

                break;
            }
        }

Now we come to the meat of the function. We loop through all of the documents and process them into a series of arrays, one for each graph that we’ll draw on the final page.

        for(var i=0; i<rows.length; i++) {
            var temp_out = null;
            var hum_out = null;
            if((rows[i].status & 64) == 0) {
                temp_out = rows[i].temp_out;
                hum_out = rows[i].hum_out;

                total_rain = total_rain + rows[i].rain;
                rainfall.push({ "time": time_text, "rain": rows[i].rain });

                wind.push({ "time": time_text, "wind_ave": rows[i].wind_ave, "wind_gust": rows[i].wind_gust });

            }

            pressure.push({ "time": time_text, "pressure": rows[i].abs_pressure });

            temps.push({ "time": time_text, "temp_out": temp_out, "temp_in": rows[i].temp_in });

            humidity.push({ "time": time_text, "hum_in": rows[i].hum_in, "hum_out": hum_out });
        }
    }

Lastly we take the stash, which has the data arrays added to it in a bit of code I’ve not included here, and use it to render the day template.

    send(Mustache.to_html(templates.day, stash));

    return "";
}

Let’s look at a part of the day template. The page is a fairly standard use of the Google Chart Tools library. In this first snippet we render the maximum and minimum temperature values, and a blank div that we’ll fill with the chart.

<h3>Temperature</h3>

<p>Outside: <b>Maximum:</b> {{ max_temp_out }}<sup>o</sup>C <b>Minimum:</b> {{ min_temp_out }}<sup>o</sup>C</p>
<p>Inside: <b>Maximum:</b> {{ max_temp_in }}<sup>o</sup>C <b>Minimum:</b> {{ min_temp_in }}<sup>o</sup>C</p>

<div id="tempchart_div"></div>

In the following Javascript function we build a DataTable object that we pass to the library to draw a line chart. The {{#temps}} and {{/temps}} construction is the Mustache way of looping through the temps array. We use it to dynamically write out Javascript code containing the data we want to render.

function drawTempChart() {
    var data = new google.visualization.DataTable();
    data.addColumn('string', 'Time');
    data.addColumn('number', 'Outside');
    data.addColumn('number', 'Inside');

    data.addRows([
    {{#temps}}
        ['{{ time }}', {{ temp_out }}, {{ temp_in }}],
    {{/temps}}
        null]);

    var chart = new google.visualization.LineChart(document.getElementById('tempchart_div'));
    chart.draw(data, {width: 950, height: 240, title: 'Temperature'});
}
google.setOnLoadCallback(drawTempChart);

We now have a page that displays all the collected weather data for a single day. In the next post in this series we’ll look at how to use CouchDB’s map/reduce functions to process the data so we can display it by month and by year.


Photo of almost may by paul bica.

Back Garden Weather in CouchDB (Part 2)

In my last post I described the new CouchDB-based website I have built to display the weather data collected from the weather station in my back garden. In this post I’ll describe how to import the data into CouchDB and the basics of rendering a page with a CouchApp.

PyWWS writes the raw data it collects into a series of CSV files, one per day. These are stored in two levels of nested directories, the first named for the year and the second for the year-month. To collect the data I use PyWWS’s live logging mode, which consists of a constantly running process that talks to the data collector. Every five minutes it writes a new row into today’s CSV file. Another process then runs every five minutes to read the new row and import it into the database.

Because CouchDB stores its data in an append-only format you should aim to avoid unnecessary updates. The simplest way to write the import script would be to import each day’s data every five minutes, but this would cause the database to balloon in size. Instead we query the database to find the last update time and import everything after that. Each update is stored as a separate document in the database, with the timestamp attribute containing the unix timestamp of the update.

The map code to get the most recent update is quite simple: we just need to emit the timestamp for each update. The timestamp is emitted as the key so we can filter the range of updates, and as the value so we can use it in the reduce function.

function(doc) {
    emit(doc.timestamp, doc.timestamp);
}

The reduce function simply calculates the maximum of the values. I’ve mostly included it here for completeness.

function(keys, values, rereduce) {
    if(values.length == 0) {
        return 0;
    }

    var m = values[0];

    for(var i=0; i<values.length; i++) {
        if(values[i] > m) { m = values[i]; }
    }

    return m;
}

You’ll find the import script that I use in the directory you cloned in the previous post, when you got a copy of the website.

So, we’ve got some data in our database. How do we display it on a webpage? First, let’s consider the basics of rendering a webpage.

CouchDB has two ways to display formatted data: show and list functions. Show functions allow you to format a single document, for example a blog post. List functions allow you to format a group of documents, such as the comments on a post. Because viewing a single piece of weather data is not interesting, the weather site only uses list functions, but to get started let’s create a show function, as these are simpler.

CouchApp doesn’t come with a templating library, but a common one to use is Mustache. The syntax is superficially like Django templates, but in reality it is far less powerful. For a simple website like this, Mustache is perfect.

In the show directory of your CouchApp create a new file, test.js. As with the map/reduce functions this file contains an anonymous function. In this case the function takes two parameters, the document and the request object, and returns an object containing the response body and any headers.

function (doc, req) {
    // !json templates.records
    // !json templates.head
    // !json templates.foot
    // !code vendor/couchapp/lib/mustache.js

The function begins with some magic comments. These are commands to the couchapp tool, which includes the referenced code or data in the function when the app is pushed. This allows you to keep shared data separate from the functions that use it.

The first !json command causes the compiler to load the file templates/records.* and add it to a templates object, under the records attribute.

The !code command works similarly, but it loads the specified file and includes the code in your function. Here we load the Mustache library, but I have also used it to load a Javascript implementation of sprintf. You might want to load some of your own common code using this method.

    var stash = {
        head: templates.head,
        foot: templates.foot
    };

    return { body: Mustache.to_html(templates.records, stash), headers: { "Content-Type": "text/html" } };
}

Firstly we build an object containing the data we want to use in our template. As Mustache doesn’t allow you to extend templates we need to pass the header and footer HTML code in as data.

As mentioned, the return type of a show function is an object containing the HTML and any HTTP headers. We only want to include the content type of the page, but you could return any HTTP header in a similar fashion. To generate the HTML we call the to_html function provided by Mustache, passing the template and the data object we prepared earlier.

Now that we have data in our database and can create simple pages using a CouchApp, we can move on to showing real data. In the next post I will describe the list functions used to show summarized day and month weather information.


Photo of its raining..its pouring by samantha celera.

Back Garden Weather in CouchDB (Part 1)

When she was younger my wife wanted to be a meteorologist. That didn’t pan out, but our recent move found us with a garden, which we’ve not had before. This gave me the opportunity to buy her a weather station. I didn’t just choose any old station though, I wanted one that did wind and rain as well as the usual temperature, pressure and humidity. And, the deciding factor, a USB interface with Linux support. Fortunately the excellent PyWWS supports a range of weather stations, including the one I bought.

I’m not going to go into how I mounted the system, or configured PyWWS. That’s all covered in the documentation. PyWWS can produce a static website, but as someone who earns his living building websites I wanted something a bit better. Continuing my experiments with CouchDB I decided to build the website as a CouchApp.

As well as allowing you to query your data with Javascript, CouchDB lets you display webpages directly out of your database. If you visit welwynweather.co.uk you’ll notice that you’re redirected to a url containing arguments that look a lot like those used to query a view. That’s because that’s exactly what’s going on. Things become clearer when you discover that http://www.welwynweather.co.uk is an alias for http://db.welwynweather.co.uk/_design/weather/_rewrite/. Now you can see a more complete CouchDB URL, albeit without the database name. db.welwynweather.co.uk points to an Apache reverse proxy that routes requests through to CouchDB.

Over the next few posts I’ll detail how the CouchApp works, but to get started you can clone my app and poke it yourself. Once you’ve installed the couchapp command line client simply run couchapp clone http://db.welwynweather.co.uk/_design/weather. This will give you a directory, weather, that contains a number of subdirectories including templates and views which contain the complete website.

To deploy the site to your own server you need to create a database and then run couchapp push weather http://localhost:5984/welwynweather. Visiting http://localhost:5984/welwynweather/_design/weather/_rewrite/ should show you the site. You’ll need some data though, and you can use CouchDB replication to pull my data to your server. Using Futon simply set http://db.welwynweather.co.uk/ as the replication source and your database as the destination and you’ll quickly get a complete copy of the database.

When replicating my data you currently cannot use continuous replication. When replication completes, CouchDB calls POST /_ensure_full_commit, but obviously I’ve disabled POST, PUT and DELETE on my server. This causes replication to fail and restart from the beginning. The data will already have been copied, but CouchDB will copy it again. If you have any ideas on how to avoid this, please answer my StackOverflow question.

The website consists of four main pages. When you visit you are redirected to a page that shows the weather for the current day. Clicking on the date at the top of the page lets you also view the weather by month and by year. The daily weather pages show as much detail as is recorded by the station, in my case this is an update every five minutes. The monthly page is much the same except that the values are averaged across an hour. The yearly page is a bit different as it shows a single point for each day. An average temperature for each day is not that useful so we calculate the high and low for each day and display that.

The final page is the records page. This displays information like the highest and lowest temperature ever recorded and the heaviest rain by hour and by day. The previous three pages are all fully generated by the server. The records page is a bit different though: calculating all the records in one step would be complicated, so instead we use AJAX to load each record individually. This means we can focus on each record in turn, keeping the code simple.

In the next post I’ll discuss how I import data into CouchDB and the basics of rendering a page in a CouchApp.


If you visit the site you may find that there is no recent weather data. This is because I run PyWWS on my MythTV box. Rather than run the PC all the time, I only update the weather data when a programme is being recorded or I’m watching TV.


Photo of Rain by Moyan Brenn.

Beating Google With CouchDB, Celery and Whoosh (Part 8)

In the previous seven posts I’ve gone through all the stages in building a search engine. If you want to try and run it for yourself and tweak it to make it even better then you can. I’ve put the code up on GitHub. All I ask is that if you beat Google, you give me a credit somewhere.

When you’ve downloaded the code it should prove to be quite simple to get running. First you’ll need to edit settings.py. It should work out of the box, but you should change the USER_AGENT setting to something unique. You may also want to adjust some of the other settings, such as the database connection or CouchDB urls.

To set up the CouchDB views type python manage.py update_couchdb.

Next, to run the celery daemon you’ll need to type the following two commands:

python manage.py celeryd -Q retrieve
python manage.py celeryd -Q process

This sets up the daemons to monitor the two queues and process the tasks. As mentioned in a previous post two queues are needed to prevent one set of tasks from swamping the other.

Next you’ll need to run the full text indexer, which can be done with python manage.py index_update and then you’ll want to run the server using python manage.py runserver.

At this point you should have several processes running but not doing anything. To kick things off we need to inject one or more urls into the system. You can do this with another management command, python manage.py start_crawl http://url. You can run this command as many times as you like to seed your crawler with different pages. It has been my experience that the average page has around 100 links on it, so it shouldn’t take long before your crawler is scampering off to crawl many more pages than you initially seeded it with.

So, how well does Celery work with CouchDB as a backend? The answer is that it’s a bit mixed. Certainly it makes it very easy to get started, as you can just point it at the server and it just works. However, the drawback, and it’s a real show stopper, is that the Celery daemon will poll the database looking for new tasks. This polling, as you scale up the number of daemons, will quickly bring your server to its knees and prevent it from doing any useful work.

The disappointing fact is that Celery could watch the _changes feed rather than polling. Hopefully this will get fixed in a future version. For now though, for anything other than experimental-scale installations, RabbitMQ is a much better bet.

Hopefully this series has been useful to you, and please do download the code and experiment with it!


Photo of github 章魚貼紙 by othree.

Beating Google With CouchDB, Celery and Whoosh (Part 7)

The key ingredients of our search engine are now in place, but we face a problem. We can download webpages and store them in CouchDB. We can rank them in order of importance and query them using Whoosh. But the internet is big, really big! A single server doesn’t even come close to being able to hold all the information that you would want it to – Google has an estimated 900,000 servers. So how do we scale the software we’ve written so far effectively?

The reason I started writing this series was to investigate how well Celery’s integration with CouchDB works. This gives us an immediate win in terms of scaling as we don’t need to worry about a different backend, such as RabbitMQ. Celery itself is designed to scale, so we can run celeryd daemons on as many boxes as we like and the jobs will be divided amongst them. This means that our indexing and ranking processes will scale easily.

CouchDB is not designed to scale across multiple machines, but there is some mature software, CouchDB-lounge, that does just that. I won’t go into how to set this up, but fundamentally you put a proxy in front of your CouchDB cluster that shards the data across the nodes. It deals with the job of merging view results and managing where the data is actually stored so you don’t have to. O’Reilly’s CouchDB: The Definitive Guide has a chapter on clustering that is well worth a read.

Unfortunately, while Whoosh is easy to work with, it’s not designed to be used on a large scale. Indeed, if someone was crazy enough to try to run the software we’ve developed in this series they might be advised to replace Whoosh with Solr. Solr is a Lucene-based search server which provides an HTTP interface to the full-text index. Solr comes with a sharding system to enable you to query an index that is too large for a single machine.

So, with both of our data storage tools providing an HTTP interface, and both having replication and sharding either built in or available via a proxy, the chances of being able to scale effectively are good. Celery should allow the background tasks needed to run a search engine to be scaled, but the challenges of building and running a large-scale infrastructure are many and I would not claim that these tools guarantee success!

In the final post of this series I will discuss what I’ve learnt about running Celery with CouchDB, and with CouchDB in general. I’ll also describe how to download and run the complete code so you can try these techniques for yourself.

Read part 8.


Photo of The Planet Data Center by The Planet.

Beating Google With CouchDB, Celery and Whoosh (Part 6)

We’re nearing the end of our plot to create a Google-beating search engine (in my dreams at least) and in this post we’ll build the interface to query the index we’ve built up. Like Google, the interface is very simple: just a text box on one page and a list of results on another.

To begin with we just need a page with a query box. To make the page slightly more interesting we’ll also include the number of pages in the index, and a list of the top documents as ordered by our ranking algorithm.

In the templates for this page we reference base.html, which provides the boilerplate code needed to make an HTML page.

{% extends "base.html" %}

{% block body %}
    <form action="/search" method="get">
        <input name="q" type="text">
        <input type="submit">
    </form>

    <hr>

    <p>{{ doc_count }} pages in index.</p>

    <hr>

    <h2>Top Pages</h2>

    <ol>
    {% for page in top_docs %}
        <li><a href="{{ page.url }}">{{ page.url }}</a> - {{ page.rank }}</li>
    {% endfor %}
    </ol>
{% endblock %}

To show the number of pages in the index we need to count them. We’ve already created a view to list Pages by their url, and CouchDB can return the number of documents in a view without actually returning any of them, so we can just get the count from that. We’ll add the following function to the Page model class.

    @staticmethod
    def count():
        r = settings.db.view("page/by_url", limit=0)
        return r.total_rows

We also need to be able to get a list of the top pages by rank. We just need to create a view that has the rank as the key and CouchDB will sort it for us automatically.
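
That view isn’t shown in the post, but a minimal sketch of it might be (the name page/by_rank and the emitted value are assumptions):

function (doc) {
    // Key each page by its rank; CouchDB keeps view rows sorted by key, so
    // querying with descending=true and a limit returns the top pages.
    if(doc.type == "page") {
        emit(doc.rank, doc.url);
    }
}

Page.get_top_by_rank, used below, would then be a thin wrapper around a query on this view with descending=true and the requested limit.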

With all the background pieces in place the Django view function to render the index is really very straightforward.

def index(req):
    return render_to_response("index.html", { "doc_count": Page.count(), "top_docs": Page.get_top_by_rank(limit=20) })

Now we get to the meat of the experiment, the search results page. First we need to query the index.

def search(req):
    q = QueryParser("content", schema=schema).parse(req.GET["q"])

This parses the user-submitted query and prepares it to be used by Whoosh. Next we need to pass the parsed query to the index.

    results = get_searcher().search(q, limit=100)

Hurrah! Now we have a list of results that match our search query. All that remains is to decide what order to display them in. To do this we normalize the score returned by Whoosh and the rank that we calculated, and add them together.

    if len(results) > 0:
        max_score = max([r.score for r in results])
        max_rank = max([r.fields()["rank"] for r in results])

To calculate our combined rank we normalize the score and the rank by setting the largest value of each to one and scaling the rest appropriately.

        combined = []
        for r in results:
            fields = r.fields()
            r.score = r.score/max_score
            r.rank = fields["rank"]/max_rank
            r.combined = r.score + r.rank
            combined.append(r)

The final stage is to sort our list by the combined score and render the results page.

        combined.sort(key=lambda x: x.combined, reverse=True)
    else:
        combined = []

    return render_to_response("results.html", { "q": req.GET["q"], "results": combined })

The template for the results page is below.

{% extends "base.html" %}

{% block body %}
    <form action="/search" method="get">
        <input name="q" type="text" value="{{ q }}">
        <input type="submit">
    </form>

    {% for result in results|slice:":20" %}
        <p>
            <b><a href="{{ result.url }}">{{ result.title|safe }}</a></b> ({{ result.score }}, {{ result.rank }}, {{ result.combined }})<br>
            {{ result.desc|safe }}
        </p>
    {% endfor %}
{% endblock %}

So, there we have it. A complete web crawler, indexer and query website. In the next post I’ll discuss how to scale the search engine.

Read part 7.


Photo of Query by amortize.

Beating Google With CouchDB, Celery and Whoosh (Part 5)

In this post we’ll continue building the backend for our search engine by implementing the algorithm we designed in the last post for ranking pages. We’ll also build an index of our pages with Whoosh, a pure-Python full-text indexer and query engine.

To calculate the rank of a page we need to know which other pages link to a given url, and how many links each of those pages has. The code below is a CouchDB map called page/links_to_url. For each page this will output a row for each link on the page, with the url linked to as the key and the page’s rank and number of links as the value.

function (doc) {
    if(doc.type == "page") {
        for(i = 0; i < doc.links.length; i++) {
            emit(doc.links[i], [doc.rank, doc.links.length]);
        }
    }
}

As before we’re using a Celery task to allow us to distribute our calculations. When we wrote the find_links task we called calculate_rank with the document id for our page as the parameter.

@task
def calculate_rank(doc_id):
    page = Page.load(settings.db, doc_id)

Next we get a list of ranks for the pages that link to this page. This static method is a thin wrapper around the page/links_to_url map function given above.

    links = Page.get_links_to_url(page.url)

Now that we have the list of ranks we can calculate the rank of this page by dividing each linking page’s rank by the number of links on it and summing this across all the linking pages.

    rank = 0
    for link in links:
        rank += link[0] / link[1]

To prevent cycles (where A links to B and B links to A) from causing an infinite loop in our calculation we apply a damping factor. This scales the value of each link by 0.85 and, combined with the threshold later in the function, forces any loops to settle on a value.

    old_rank = page.rank
    page.rank = rank * 0.85

If we didn’t find any links to this page then we give it a default rank of 1/number_of_pages.

    if page.rank == 0:
        page.rank = 1.0/settings.db.view("page/by_url", limit=0).total_rows

Finally we compare the new rank to the previous rank in our system. If it has changed by more than 0.0001 then we save the new rank and cause all the pages linked to from our page to recalculate their rank.

    if abs(old_rank - page.rank) > 0.0001:
        page.store(settings.db)

        for link in page.links:
            p = Page.get_id_by_url(link, update=False)
            if p is not None:
                calculate_rank.delay(p)

This is a very simplistic implementation of a page rank algorithm. It does generate a useful ranking of pages, but the number of queued calculate_rank tasks explodes. In a later post I’ll discuss how this could be rewritten to be more efficient.

Whoosh is a pure-Python full text search engine. In the next post we’ll look at querying it, but first we need to index the pages we’ve crawled.

The first step with Whoosh is to specify your schema. To speed up the display of results we store the information we need to render the results page directly in the index. For this we need the page title, url and description. We also store the score given to the page by our pagerank-like algorithm. Finally we add the page text to the index so we can query it. If you want more details, the Whoosh documentation is pretty good.

from whoosh.fields import *

schema = Schema(title=TEXT(stored=True), url=ID(stored=True, unique=True), desc=ID(stored=True), rank=NUMERIC(stored=True, type=float), content=TEXT)

CouchDB provides an interface for being informed whenever a document in the database changes. This is perfect for building an index.

Our full-text indexing daemon is implemented as a Django management command so there is some boilerplate code required to make this work.

class Command(BaseCommand):
    def handle(self, **options):
        since = get_last_change()
        writer = get_writer()

CouchDB allows you to get all the changes that have occurred since a specific point in time (identified by a sequence number). We store this number inside the Whoosh index directory and access it using the get_last_change and set_last_change functions. Our access to the Whoosh index is through an IndexWriter object, again accessed through an abstraction function.

Now we enter an infinite loop and call the changes function on our CouchDB database object to get the changes.

        try:
            while True:
                changes = settings.db.changes(since=since)
                since = changes["last_seq"]
                for changeset in changes["results"]:
                    try:
                        doc = settings.db[changeset["id"]]
                    except couchdb.http.ResourceNotFound:
                        continue

In our database we store robots.txt files as well as pages, so we need to ignore them. We also need to parse the document so we can pull out the text from the page. We do this with the BeautifulSoup library.

                    if "type" in doc and doc["type"] == "page":
                        soup = BeautifulSoup(doc["content"])
                        if soup.body is None:
                            continue

On the results page we try to use the meta description if we can find it.

                        desc = soup.findAll('meta', attrs={ "name": desc_re })

Once we’ve got the parsed document we update our Whoosh index. The code is a little complicated because we need to handle the case where the page doesn’t have a title or description, and because we index the title as well as the body text of the page. The key element here is text=True, which pulls out just the text from a node and strips out all of the tags.

                        writer.update_document(
                                title=unicode(soup.title(text=True)[0]) if soup.title is not None and len(soup.title(text=True)) > 0 else doc["url"],
                                url=unicode(doc["url"]),
                                desc=unicode(desc[0]["content"]) if len(desc) > 0 and desc[0]["content"] is not None else u"",
                                rank=doc["rank"],
                                content=unicode(soup.title(text=True)[0] + "\n" + doc["url"] + "\n" + "".join(soup.body(text=True)))
                            )

Finally we update the index and save the last change number so next time the script is run we continue from where we left off.

                    writer.commit()
                    writer = get_writer()

                set_last_change(since)
        finally:
            set_last_change(since)

In the next post I’ll discuss how to query the index, sort the documents by our two rankings and build a simple web interface.

Read part 6.


Photo of order by Steve Mishos.

Beating Google With CouchDB, Celery and Whoosh (Part 4)

In this series I’m showing you how to build a webcrawler and search engine using standard Python-based tools like Django, Celery and Whoosh with a CouchDB backend. In previous posts we created a data structure, parsed and stored robots.txt files and stored a single webpage in our database. In this post I’ll show you how to parse out the links from our stored HTML documents so we can complete the crawler, and we’ll start calculating the rank of the pages in our database.

There are several different ways of parsing out the links in a given HTML document. You can just use a regular expression to pull the urls out, or you can use a more complete, but also more complicated (and slower), method of parsing the HTML using the standard Python HTMLParser library or the wonderful Beautiful Soup. The point of this series isn’t to build a complete webcrawler, but to show you the basic building blocks, so for simplicity’s sake I’ll use a regular expression.

link_single_re = re.compile(r"<a[^>]+href='([^']+)'")
link_double_re = re.compile(r'<a[^>]+href="([^"]+)"')

All we need to do is look for an href attribute in an a tag. We’ll use two regular expressions to handle single and double quotes, and then build a list containing all the links in the document.

@task
def find_links(doc_id):
    doc = Page.load(settings.db, doc_id)

    raw_links = []
    for match in link_single_re.finditer(doc.content):
        raw_links.append(match.group(1))

    for match in link_double_re.finditer(doc.content):
        raw_links.append(match.group(1))

Once we’ve got a list of the raw links we need to process them into absolute urls that we can send back to the retrieve_page task we wrote earlier. I’m cutting some corners with processing these urls; in particular I’m not dealing with base tags.

    doc.links = []
    for link in raw_links:
        if link.startswith("#"):
            continue
        elif link.startswith("http://") or link.startswith("https://"):
            pass
        elif link.startswith("/"):
            parse = urlparse(doc["url"])
            link = parse.scheme + "://" + parse.netloc + link
        else:
            link = "/".join(doc["url"].split("/")[:-1]) + "/" + link

        doc.links.append(unescape(link.split("#")[0]))

    doc.store(settings.db)

Once we’ve got our list of links and saved the modified document we need to trigger the next series of steps. We need to calculate the rank of this page, so we trigger that task and then step through each page that we linked to. If we’ve already got a copy of the page then we want to recalculate its rank to take into account the rank of this page (more on this later), and if we don’t have a copy then we queue it up to be retrieved.

    calculate_rank.delay(doc.id)

    for link in doc.links:
        p = Page.get_id_by_url(link, update=False)
        if p is not None:
            calculate_rank.delay(p)
        else:
            retrieve_page.delay(link)

We’ve now got a complete webcrawler. We can store webpages and robots.txt files. Given a starting URL our crawler will set about parsing pages to find out what they link to and retrieving those pages as well. Given enough time you’ll end up with most of the internet on your hard disk!

When we come to write the website to query the information we’ve collected we’ll use two numbers to rank pages. First we’ll use a value that ranks pages based on the query used, but we’ll also use a value that ranks pages based on their importance. This is the same method used by Google, known as Page Rank.

Page Rank is a measure of how likely you are to end up on a given page by clicking on a random link anywhere on the internet. The Wikipedia article goes into some detail on a number of ways to calculate it, but we’ll use a very simple iterative algorithm.

When created, a page is given a rank equal to 1/number of pages. Each link that is found on a newly crawled page then causes the rank of the destination page to be recalculated. In this case the rank of a page is the sum, over the pages that link to it, of each linking page’s rank divided by the number of links on that page, multiplied by a damping factor (I use 0.85, but this could be adjusted). If a page has a rank of 0.25 and has five links then each page linked to gains 0.05*0.85 rank for that link. If the change in rank of the page when recalculated is significant then the rank of all the pages it links to is recalculated.
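
Written out as a formula, the update described above (and implemented by the calculate_rank task in the next post) is a simplified PageRank with no teleport term:

\[ \operatorname{rank}(p) = d \sum_{q \to p} \frac{\operatorname{rank}(q)}{L(q)}, \qquad d = 0.85 \]

where the sum runs over the pages q that link to p and L(q) is the number of links on page q. A linking page with rank 0.25 and five links therefore contributes 0.85 * 0.25 / 5 = 0.0425 to each page it links to, matching the figure above.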

In this post we’ve completed the web crawler part of our search engine and discussed how to rank pages by importance. In the next post we’ll implement this ranking and also create a full-text index of the pages we have crawled.

Read part 5.


Photo of Red Sofa encounter i by Ricard Gil.