CouchQL development progressing

As I mentioned in a previous post, I have been working on a library to ease the creation of map/reduce views in CouchDB.

The code is hosted on Google Code and can be checked out and used now. Development is at a very early stage, but the fundamentals are sound.

Code such as that given below will work. This example returns all the documents with a member ‘x’ whose value is greater than one.

c = db.cursor()
c.execute("SELECT * FROM _ WHERE x > %s", (1, ))
for doc in c.fetchall():
    pass  # process doc

The query is currently executed as a temporary view, but very high on my list is to use permanent views for much higher performance. This will be added before a first release, as will the ability to have multiple expressions ANDed together in the WHERE clause.

Introducing CouchQL

CouchDB is a very exciting development in the world of databases and I’m greatly enjoying building a website which uses it. One problem is that most of the views that I have created are extremely simple and could easily be represented using SQL. Although I wrote some code to help make life easier, creating a view such as that below is never going to be as simple as including SELECT * FROM table WHERE (status="open" OR status="accepted") AND latest AND key="xyz" directly in your code.

function (doc) {
    if((doc["status"] == "open" || doc["status"] == "accepted") && doc["latest"]) {
        emit(doc["key"], null);
    }
}

The SQL above and the JavaScript view function are directly equivalent, which is why I’ve started working on an extension to the Python CouchDB library, which I’ve decided to call CouchQL.
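To make the equivalence concrete, here is a minimal sketch of turning a single condition into a map function. The helper name map_function is hypothetical, and emitting the document id as the key is an assumption made for illustration. It is not how CouchQL itself generates views.

```python
import json

ALLOWED_OPS = ("==", "!=", "<", "<=", ">", ">=")


def map_function(field, op, value):
    """Build JavaScript map source for one `field op value` condition.

    Hypothetical helper: a real translator would parse the whole
    WHERE clause rather than take the pieces as arguments.
    """
    if op not in ALLOWED_OPS:
        raise ValueError("unsupported operator: %r" % op)
    # json.dumps produces a correctly quoted JavaScript literal
    # for both the field name and the comparison value.
    return ('function (doc) { if (doc[%s] %s %s) { emit(doc._id, null); } }'
            % (json.dumps(field), op, json.dumps(value)))


print(map_function("status", "==", "open"))
```

Building the literals with json.dumps avoids hand-quoting strings, which matters as soon as a value contains a quote character.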

The basic strategy is this. The library adds a method to the Database object, cursor, which returns an object compatible with the standard Python database API. When a CouchQL query is executed, a hash is taken of the textual query and a call is made to the view couchql_hash. If the view is not found, the query is turned into JavaScript and added to the server, and the call is repeated.
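The lookup-then-create strategy can be sketched without a running server. In the sketch below, FakeViewStore is a hypothetical in-memory stand-in for the views held by CouchDB, translate is a placeholder for the real query-to-JavaScript step, and the choice of SHA-1 as the hash is an assumption.

```python
import hashlib


def view_name(query):
    """Derive a stable view name from the query text (SHA-1 is an assumption)."""
    digest = hashlib.sha1(query.encode("utf-8")).hexdigest()
    return "couchql_" + digest


class FakeViewStore:
    """Hypothetical in-memory stand-in for the views stored on the server."""

    def __init__(self):
        self.views = {}
        self.compilations = 0

    def call(self, name):
        # Raises KeyError for a missing view, standing in for "not found".
        # A real call would return rows; here we return the view source.
        return self.views[name]


def translate(query):
    """Placeholder for the real CouchQL-to-JavaScript translation."""
    return "function (doc) { /* generated from: %s */ }" % query


def execute(store, query):
    """Call the view for this query, creating it first if it is missing."""
    name = view_name(query)
    try:
        return store.call(name)
    except KeyError:
        store.views[name] = translate(query)
        store.compilations += 1
        return store.call(name)  # repeat the call now that the view exists
```

Because the view name depends only on the query text, running the same query twice translates it once and reuses the stored view afterwards.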

One of the common mistakes with CouchDB is to treat it as if it were a traditional RDBMS. CouchQL risks confusing people even more by allowing users to query CouchDB as if it were an RDBMS. CouchQL is not SQL, even if it does pretend to be SQL-like. I’ve not yet decided how much processing should be done in the library to make the query language more SQL-like. The query SELECT * FROM table WHERE x > 5 OR x < 3 cannot be directly represented as a single call to a CouchDB view. It can be represented as two separate calls to the same view with the results merged. Is this a good idea? I’m not sure.
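Here is a rough sketch of the "two calls, merged" idea. It assumes a view keyed on x whose rows come back sorted by key. The function query_range is a hypothetical stand-in for one view call with inclusive start and end keys, so the boundary values are filtered out afterwards to get strict inequalities.

```python
import bisect


def query_range(rows, startkey=None, endkey=None):
    """Stand-in for one view call: rows with startkey <= key <= endkey.

    rows is a list of (key, value) pairs already sorted by key.
    """
    keys = [key for key, _ in rows]
    lo = 0 if startkey is None else bisect.bisect_left(keys, startkey)
    hi = len(keys) if endkey is None else bisect.bisect_right(keys, endkey)
    return rows[lo:hi]


def x_gt_5_or_lt_3(rows):
    """Answer WHERE x > 5 OR x < 3 with two range calls on the same view."""
    low = [row for row in query_range(rows, endkey=3) if row[0] != 3]
    high = [row for row in query_range(rows, startkey=5) if row[0] != 5]
    # The two ranges cannot overlap, so concatenation needs no de-duplication.
    return low + high


rows = [(x, "doc%d" % x) for x in range(1, 9)]
print([key for key, _ in x_gt_5_or_lt_3(rows)])
```

A general OR of overlapping ranges would also need de-duplication by document id, which is part of what makes the merging question a judgment call.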

Development work has only just started on this library, but I’m actively working on it and hope to be able to announce something useful to the CouchDB mailing list soon.

Updating CouchDB Views In Django

CouchDB views are a bit like stored procedures in a traditional database system. As with stored procedures, it’s difficult to keep them in sync with your code and to keep them in your version control system. In this article I’ll show you how you can use a Django management command to update your views from files in your code base.

CouchDB uses a map/reduce system where each view is made of a filter program (the map) and an optional post-processor that runs over the output of the map (the reduce). These pairs are grouped into design documents, each of which is stored as a single unit in the CouchDB database.
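As a rough illustration of that grouping, the sketch below assembles one design document as a plain Python dictionary. The helper build_design_doc and the JavaScript bodies are made up for the example. The names design1, mapreduce1 and mapreduce2 are only placeholders.

```python
import json


def build_design_doc(design, views):
    """Assemble a design document from {view name: (map source, reduce source or None)}."""
    doc = {"_id": "_design/" + design, "views": {}}
    for name, (map_src, reduce_src) in views.items():
        view = {"map": map_src}
        if reduce_src is not None:
            # The reduce step is optional, so only include it when present.
            view["reduce"] = reduce_src
        doc["views"][name] = view
    return doc


doc = build_design_doc("design1", {
    # Illustrative function bodies, not taken from a real application.
    "mapreduce1": ('function (doc) { emit(doc["key"], 1); }',
                   "function (keys, values) { return sum(values); }"),
    "mapreduce2": ('function (doc) { emit(doc["key"], null); }', None),
})
print(json.dumps(doc, indent=2, sort_keys=True))
```

Each view keeps its map and reduce sources as strings, and the whole dictionary is saved under the single id _design/design1.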

This command assumes that you store your map and reduce functions in the directory structure set out below.

project/
    app/
        couchviews/
            database1/
                design1/
                    mapreduce1/
                        map.js reduce.js
                    mapreduce2/
                        map.js
                design2/
                    mapreduce3/
                        map.js reduce.js
            database3/
                design3/
                    mapreduce4/
                        map.js reduce.js

Inside your app directory create a folder called couchviews. Inside that create one folder for each of your CouchDB databases. Finally, create two layers of directories to represent the design documents and the views stored within them. Each JavaScript file should contain a single anonymous function.

For this management command to work, your settings file needs to contain a variable for each database, holding the Python CouchDB database object. In this example three variables need to be added to settings.py: database1, database2 and database3.

Add the code below to the file project/app/management/commands/updatecouchviews.py and when you type manage.py updatecouchviews it’ll walk your directory structure and update all your design documents in one fell swoop. Easy!

import couchdb
import glob
import os

from django.core.management.base import NoArgsCommand

class Command(NoArgsCommand):
    help = "Update couchdb views"

    can_import_settings = True

    def handle_noargs(self, **options):
        import settings

        # This file lives in app/management/commands/, so app/couchviews is two levels up.
        couchdir = os.path.realpath(os.path.join(os.path.dirname(__file__), "..", "..", "couchviews"))

        databases = glob.glob(couchdir+"/*")
        for d in databases:
            if not os.path.isdir(d):
                continue

            db = getattr(settings, d.split("/")[-1])

            for design in glob.glob(d + "/*"):
                design = design.split("/")[-1]
                try:
                    doc = db["_design/" + design]
                except couchdb.client.ResourceNotFound:
                    doc = {"_id": "_design/" + design}

                doc["views"] = {}
                for mapreduce in glob.glob(d+"/"+design+"/*"):
                    mapreduce = mapreduce.split("/")[-1]
                    viewdir = d+"/"+design+"/"+mapreduce
                    mr = {}
                    with open(viewdir+"/map.js") as f:
                        mr["map"] = f.read()
                    try:
                        # reduce.js is optional; skip it when the file is missing
                        with open(viewdir+"/reduce.js") as f:
                            mr["reduce"] = f.read()
                    except IOError:
                        pass

                    doc["views"][mapreduce] = mr

                db["_design/" + design] = doc

CouchDB Performance

I’ve been toying with CouchDB for a short while, and I’m definitely impressed by what I’ve seen. Once I’d upgraded to Erlang R12B and trunk CouchDB, the bugs I had been seeing disappeared and importing all 1 million documents was straightforward.

With 1 million documents the map/reduce takes a long time, as you would expect. What would be nice is if the maps could be spread across different nodes to speed things up dramatically. Once the map has been calculated and cached, retrieving it is relatively fast. Parsing it in Python does seem to be quite slow, taking a few seconds for a few tens of thousands of results. This is far too slow for a webpage response.

Is there any way to speed up CouchDB? Well, aggressive use of memcache will probably help, but to me it seems that CouchDB is not suited to large datasets. I do hope I’m wrong though, and I’m going to investigate further because I really want to find a use for CouchDB in my work.