<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Andrew Wilkinson &#187; django</title>
	<atom:link href="http://andrewwilkinson.wordpress.com/tag/django/feed/" rel="self" type="application/rss+xml" />
	<link>http://andrewwilkinson.wordpress.com</link>
	<description>Random Ramblings on Programming</description>
	<lastBuildDate>Thu, 23 May 2013 21:29:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='andrewwilkinson.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Andrew Wilkinson &#187; django</title>
		<link>http://andrewwilkinson.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://andrewwilkinson.wordpress.com/osd.xml" title="Andrew Wilkinson" />
	<atom:link rel='hub' href='http://andrewwilkinson.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Django ImportError Hiding</title>
		<link>http://andrewwilkinson.wordpress.com/2012/03/07/django-importerror-hiding/</link>
		<comments>http://andrewwilkinson.wordpress.com/2012/03/07/django-importerror-hiding/#comments</comments>
		<pubDate>Wed, 07 Mar 2012 13:59:29 +0000</pubDate>
		<dc:creator>Andrew Wilkinson</dc:creator>
				<category><![CDATA[web development]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[error handling]]></category>
		<category><![CDATA[exceptions]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://andrewwilkinson.wordpress.com/?p=699</guid>
		<description><![CDATA[A little while ago I was asked what my biggest gripe with Django was. At the time I couldn&#8217;t think of a good answer because since I started using Django in the pre-1.0 days most of the rough edges have been smoothed. Yesterday though, I encountered an error that made me wish I thought of [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=andrewwilkinson.wordpress.com&#038;blog=4895947&#038;post=699&#038;subd=andrewwilkinson&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/grahford/458564891/"><img style="float:right;border:0;" src="http://farm1.staticflickr.com/183/458564891_5e943e5794_m.jpg" alt="Hidden Cat" /></a>A little while ago I was asked what my biggest gripe with Django was. At the time I couldn&#8217;t think of a good answer because, since I started using Django in the pre-1.0 days, most of the rough edges have been smoothed off. Yesterday, though, I encountered an error that made me wish I had thought of it at the time.</p>
<p>The code that produced the error looked like this:</p>
<pre class="brush: python; title: ; notranslate">
from django.db import models

class MyModel(models.Model):
    ...

    def save(self):
        models.Model.save(self)

        ...

    ...
</pre>
<p>The error that was raised was <tt>AttributeError: 'NoneType' object has no attribute 'Model'</tt>. This means that rather than containing a module object, <tt>models</tt> was None. Clearly this is impossible as the class could not have been created if that was the case. Impossible or not, it was clearly happening.</p>
<p>Adding a print statement to the module showed that when it was imported the <tt>models</tt> variable did contain the expected module object. It also showed that the module was being imported more than once, something that should also be impossible.</p>
<p>After a wild goose chase investigating reasons why the module might be imported twice I tracked it down to the <tt>load_app</tt> method in <tt>django/db/models/loading.py</tt>. The code there looks something like this:</p>
<pre class="brush: python; title: ; notranslate">
    def load_app(self, app_name, can_postpone=False):
        try:
            models = import_module('.models', app_name)
        except ImportError:
            # Ignore exception
            pass
</pre>
<p>Now I&#8217;m being a little harsh here, and the exception handler does contain a comment about working out whether it should re-raise the exception. The issue is that it wasn&#8217;t re-raising the exception, and it&#8217;s really not clear why. It turns out that I had a misspelt module name in an import statement in a different module. This raised an <tt>ImportError</tt> which was caught and hidden, and Django then repeatedly attempted to import the models as they were referenced in the models of other apps. The strange exception that was originally encountered is probably an artefact of Python&#8217;s garbage collection, although how exactly it occurred is still not clear to me.</p>
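<p>The hiding behaviour is easy to reproduce outside Django. The following is a minimal sketch rather than Django&#8217;s actual code: <tt>load_models</tt>, <tt>load_models_strict</tt>, the <tt>demoapp</tt> package and the misspelt <tt>datetmie</tt> import are all made up for illustration, and it assumes Python 3.3 or later, where <tt>ImportError</tt> carries a <tt>name</tt> attribute identifying the module that could not be found.</p>
<pre class="brush: python; title: ; notranslate">
import importlib
import os
import sys
import tempfile


def load_models(app_name):
    # Mimics the pattern described above: every ImportError is swallowed.
    try:
        return importlib.import_module(app_name + '.models')
    except ImportError:
        return None


def load_models_strict(app_name):
    # Only swallow the error when the models module itself is the one missing.
    target = app_name + '.models'
    try:
        return importlib.import_module(target)
    except ImportError as exc:
        if exc.name == target:
            return None
        raise


# Build a throwaway app whose models.py contains a misspelt import.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'demoapp'))
open(os.path.join(root, 'demoapp', '__init__.py'), 'w').close()
with open(os.path.join(root, 'demoapp', 'models.py'), 'w') as f:
    f.write('import datetmie  # misspelt module name\n')
sys.path.insert(0, root)
importlib.invalidate_caches()

hidden = load_models('demoapp')

try:
    load_models_strict('demoapp')
    surfaced = None
except ImportError as exc:
    surfaced = exc.name
</pre>
<p>Here <tt>hidden</tt> is <tt>None</tt>, exactly as if <tt>demoapp</tt> had no models module at all, while the strict variant re-raises and <tt>surfaced</tt> holds the name of the module that was really missing.</p>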
<p>There are a number of tickets (<a href="https://code.djangoproject.com/ticket/6379">#6379</a>, <a href="https://code.djangoproject.com/ticket/14130">#14130</a> and probably others) on this topic. A common refrain in Python is that it&#8217;s easier to ask for forgiveness than to ask for permission, and I certainly agree with Django on that and follow it most of the time.</p>
<p>I always follow the rule that try/except clauses should cover as little code as possible. Consider the following piece of code.</p>
<pre class="brush: python; title: ; notranslate">
try:
    var.method1()

    var.member.method2()
except AttributeError:
    pass  # handle error
</pre>
<p>Which of the three attribute accesses are we actually trying to guard here? Handling exceptions like this is a useful way of implementing duck typing while following the easier-to-ask-forgiveness principle. What this code doesn&#8217;t make clear is which member or method is actually optional. A better way to write this would be:</p>
<pre class="brush: python; title: ; notranslate">
var.method1()

try:
    member = var.member
except AttributeError:
    pass  # handle error
else:
    member.method2()
</pre>
<p>Now the code is very clear that the <tt>var</tt> variable may or may not have a <tt>member</tt> attribute. If <tt>method1</tt> or <tt>method2</tt> does not exist then the exception is not masked and is passed on. Now let&#8217;s consider that we want to allow the <tt>method1</tt> attribute to be optional.</p>
<pre class="brush: python; title: ; notranslate">
try:
    var.method1()
except AttributeError:
    pass  # handle error
</pre>
<p>At first glance it&#8217;s obvious that <tt>method1</tt> is optional, but actually we&#8217;re catching too much here. If there is a bug in <tt>method1</tt> that causes an <tt>AttributeError</tt> to be raised then this will be masked and the code will treat it as if <tt>method1</tt> didn&#8217;t exist. A better piece of code would be:</p>
<pre class="brush: python; title: ; notranslate">
try:
    method = var.method1
except AttributeError:
    pass  # handle error
else:
    method()
</pre>
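<p>To see the difference in practice, here is a small self-contained sketch. The <tt>Buggy</tt> class and the <tt>call_broad</tt> and <tt>call_narrow</tt> helpers are hypothetical names invented for this example; <tt>method1</tt> exists but contains a typo that raises an <tt>AttributeError</tt> when it runs.</p>
<pre class="brush: python; title: ; notranslate">
class Buggy(object):
    def method1(self):
        # Bug: the misspelt attribute raises AttributeError inside the method.
        return self.valeu


def call_broad(var):
    # Covers both the attribute lookup and the call itself.
    try:
        return var.method1()
    except AttributeError:
        return 'no method1'


def call_narrow(var):
    # Covers only the attribute lookup; the call happens outside the try.
    try:
        method = var.method1
    except AttributeError:
        return 'no method1'
    else:
        return method()


broad_result = call_broad(Buggy())

try:
    call_narrow(Buggy())
    narrow_result = 'no error'
except AttributeError:
    narrow_result = 'bug surfaced'

missing_result = call_narrow(object())
</pre>
<p>The broad version reports <tt>'no method1'</tt> even though the method is there, whereas the narrow version lets the real bug propagate and still copes with an object that genuinely lacks <tt>method1</tt>.</p>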
<p><tt>ImportError</tt>s are similar because importing a module executes its code, so when an error occurs you can&#8217;t tell whether the original import failed or whether an import inside that module failed. Unlike with an <tt>AttributeError</tt> there is no easy way to rewrite the code to only catch the error you&#8217;re interested in. Python does provide some tools to divide the import process into steps, so you can tell whether the module exists before attempting to import it. In particular the <tt><a href="http://docs.python.org/library/imp.html#imp.find_module">imp.find_module</a></tt> function would be useful.</p>
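<p>As a rough illustration of splitting the import into steps, the sketch below uses <tt>importlib.util.find_spec</tt>, the documented replacement for <tt>imp.find_module</tt> in current Python 3 (the <tt>imp</tt> module was removed in Python 3.12). The <tt>import_if_present</tt> helper and the module names are assumptions made for this example, not anything Django provides.</p>
<pre class="brush: python; title: ; notranslate">
import importlib
import importlib.util


def import_if_present(name):
    # Step 1: locate the module without executing it.
    if importlib.util.find_spec(name) is None:
        return None
    # Step 2: import it. An ImportError raised here comes from inside the
    # module, so it is deliberately not caught.
    return importlib.import_module(name)


json_module = import_if_present('json')
missing = import_if_present('no_such_module_for_this_demo')
</pre>
<p>One caveat: for a dotted name such as <tt>package.models</tt>, <tt>find_spec</tt> has to import the parent package first, so a missing or broken parent still raises rather than returning <tt>None</tt>.</p>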
<p>Changing Django to avoid catching the wrong <tt>ImportError</tt>s would greatly complicate the code. It would also introduce the danger that the algorithm used would not match the one used by Python. So, what&#8217;s the moral of this story? Never catch more exceptions than you intended to, and if you get some really odd errors in your Django site watch out for <tt>ImportError</tt>s.</p>
<hr />
<p>Photo of <a href="http://www.flickr.com/photos/grahford/458564891/">Hidden Cat</a> by <a href="http://www.flickr.com/photos/grahford/">Craig Grahford</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewwilkinson.wordpress.com/2012/03/07/django-importerror-hiding/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0d5abb071bb1ab8518c3e9b0f4e718eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andrew</media:title>
		</media:content>

		<media:content url="http://farm1.staticflickr.com/183/458564891_5e943e5794_m.jpg" medium="image">
			<media:title type="html">Hidden Cat</media:title>
		</media:content>
	</item>
		<item>
		<title>Beating Google With CouchDB, Celery and Whoosh (Part 8)</title>
		<link>http://andrewwilkinson.wordpress.com/2011/10/21/beating-google-with-couchdb-celery-and-whoosh-part-8/</link>
		<comments>http://andrewwilkinson.wordpress.com/2011/10/21/beating-google-with-couchdb-celery-and-whoosh-part-8/#comments</comments>
		<pubDate>Fri, 21 Oct 2011 11:00:18 +0000</pubDate>
		<dc:creator>Andrew Wilkinson</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[celerycrawler]]></category>
		<category><![CDATA[django]]></category>

		<guid isPermaLink="false">http://andrewwilkinson.wordpress.com/?p=489</guid>
		<description><![CDATA[In the previous seven posts I&#8217;ve gone through all the stages in building a search engine. If you want to try and run it for yourself and tweak it to make it even better then you can. I&#8217;ve put the code up on GitHub. All I ask is that if you beat Google, you give [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=andrewwilkinson.wordpress.com&#038;blog=4895947&#038;post=489&#038;subd=andrewwilkinson&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/othree/5228608281/"><img style="float:right;border:0;" src="http://farm6.static.flickr.com/5245/5228608281_2d50d3855c_m.jpg" alt="github 章魚貼紙" /></a>In the previous seven posts I&#8217;ve gone through all the stages in building a search engine. If you want to try and run it for yourself and tweak it to make it even better then you can. I&#8217;ve put the <a href="https://github.com/andrewjw/celery-crawler">code up on GitHub</a>. All I ask is that if you beat Google, you give me a credit somewhere.</p>
<p>When you&#8217;ve downloaded the code it should prove to be quite simple to get running. First you&#8217;ll need to edit <tt>settings.py</tt>. It should work out of the box, but you should change the <tt>USER_AGENT</tt> setting to something unique. You may also want to adjust some of the other settings, such as the database connection or CouchDB URLs.</p>
<p>To set up the CouchDB views type <tt>python manage.py update_couchdb</tt>.</p>
<p>Next, to run the celery daemon you&#8217;ll need to type the following two commands:</p>
<pre class="brush: plain; title: ; notranslate">
python manage.py celeryd -Q retrieve
python manage.py celeryd -Q process
</pre>
<p>This sets up the daemons to monitor the two queues and process the tasks. As mentioned in a previous post, two queues are needed to prevent one set of tasks from swamping the other.</p>
<p>Next you&#8217;ll need to run the full text indexer, which can be done with <tt>python manage.py index_update</tt> and then you&#8217;ll want to run the server using <tt>python manage.py runserver</tt>.</p>
<p>At this point you should have several processes running but not doing anything. To kick things off we need to inject one or more URLs into the system. You can do this with another management command, <tt>python manage.py start_crawl <a href="http://url" rel="nofollow">http://url</a></tt>. You can run this command as many times as you like to seed your crawler with different pages. It has been my experience that the average page has around 100 links on it, so it shouldn&#8217;t take long before your crawler is scampering off to crawl many more pages than you initially seeded it with.</p>
<p>So, how well does Celery work with CouchDB as a backend? The answer is that it&#8217;s a bit mixed. Certainly it makes it very easy to get started as you can just point it at the server and it just works. However, the drawback, and it&#8217;s a real show stopper, is that the Celery daemon will poll the database looking for new tasks. This polling, as you scale up the number of daemons, will quickly bring your server to its knees and prevent it from doing any useful work.</p>
<p>The disappointing fact is that Celery could watch the <tt>_changes</tt> feed rather than polling. Hopefully this will get fixed in a future version. For now though, for anything other than experimental-scale installations RabbitMQ is a much better bet.</p>
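<p>For anyone curious what watching the feed looks like, here is a rough sketch using only the standard library. It is not Celery code: the <tt>changes_url</tt> and <tt>wait_for_changes</tt> helpers and the database URL are assumptions for illustration. It relies on CouchDB&#8217;s <tt>feed=longpoll</tt> and <tt>since</tt> query parameters, where the request blocks until something changes and the response carries <tt>results</tt> and <tt>last_seq</tt>.</p>
<pre class="brush: python; title: ; notranslate">
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def changes_url(db_url, since):
    # Build the long-poll URL for the _changes feed of a database.
    query = urlencode({'feed': 'longpoll', 'since': since})
    return db_url.rstrip('/') + '/_changes?' + query


def wait_for_changes(db_url, since):
    # Blocks until CouchDB reports a change after the given sequence.
    # Needs a running CouchDB server, so it is not called in this sketch.
    with urlopen(changes_url(db_url, since)) as response:
        body = json.load(response)
    return body['results'], body['last_seq']


url = changes_url('http://localhost:5984/crawler/', 0)
</pre>
<p>A worker built this way would call <tt>wait_for_changes</tt> in a loop, passing the returned <tt>last_seq</tt> back in as <tt>since</tt>, so an idle queue costs one open connection rather than a stream of queries.</p>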
<p>Hopefully this series has been useful to you, and please do download the code and experiment with it!</p>
<hr />
<p>Photo of <a href="http://www.flickr.com/photos/othree/5228608281/">github 章魚貼紙</a> by <a href="http://www.flickr.com/photos/othree/">othree</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewwilkinson.wordpress.com/2011/10/21/beating-google-with-couchdb-celery-and-whoosh-part-8/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0d5abb071bb1ab8518c3e9b0f4e718eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andrew</media:title>
		</media:content>

		<media:content url="http://farm6.static.flickr.com/5245/5228608281_2d50d3855c_m.jpg" medium="image">
			<media:title type="html">github 章魚貼紙</media:title>
		</media:content>
	</item>
		<item>
		<title>Beating Google With CouchDB, Celery and Whoosh (Part 7)</title>
		<link>http://andrewwilkinson.wordpress.com/2011/10/19/beating-google-with-couchdb-celery-and-whoosh-part-7/</link>
		<comments>http://andrewwilkinson.wordpress.com/2011/10/19/beating-google-with-couchdb-celery-and-whoosh-part-7/#comments</comments>
		<pubDate>Wed, 19 Oct 2011 11:00:16 +0000</pubDate>
		<dc:creator>Andrew Wilkinson</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[celerycrawler]]></category>
		<category><![CDATA[django]]></category>

		<guid isPermaLink="false">http://andrewwilkinson.wordpress.com/?p=474</guid>
		<description><![CDATA[The key ingredients of our search engine are now in place, but we face a problem. We can download webpages and store them in CouchDB. We can rank them in order of importance and query them using Whoosh but the internet is big, really big! A single server doesn&#8217;t even come close to being able [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=andrewwilkinson.wordpress.com&#038;blog=4895947&#038;post=474&#038;subd=andrewwilkinson&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/theplanetdotcom/4878813385/"><img style="float:right;border:0;" src="http://farm5.static.flickr.com/4079/4878813385_3229fe1be4_m.jpg" alt="The Planet Data Center" /></a>The key ingredients of our search engine are now in place, but we face a problem. We can download webpages and store them in <a href="http://couchdb.apache.org/">CouchDB</a>. We can rank them in order of importance and query them using <a href="https://bitbucket.org/mchaput/whoosh/wiki/Home">Whoosh</a> but the internet is big, <a href="http://thenextweb.com/shareables/2011/01/11/infographic-how-big-is-the-internet/">really big!</a> A single server doesn&#8217;t even come close to being able to hold all the information that you would want it to &#8211; Google has an estimated <a href="http://www.datacenterknowledge.com/archives/2009/05/14/whos-got-the-most-web-servers/">900,000 servers</a>. So how do we scale the software we&#8217;ve written so far effectively?</p>
<p>The reason I started writing this series was to investigate how well Celery&#8217;s integration with CouchDB works. This gives us an immediate win in terms of scaling as we don&#8217;t need to worry about a different backend, such as <a href="http://www.rabbitmq.com/">RabbitMQ</a>. Celery itself is designed to scale so we can run <tt>celeryd</tt> daemons on as many boxes as we like and the jobs will be divided amongst them. This means that our indexing and ranking processes will scale easily.</p>
<p>CouchDB is not designed to scale across multiple machines, but there is some mature software, <a href="http://tilgovi.github.com/couchdb-lounge/">CouchDB-lounge</a>, that does just that. I won&#8217;t go into how to set this up but fundamentally you set up a proxy that sits in front of your CouchDB cluster and shards the data across the nodes. It deals with the job of merging view results and managing where the data is actually stored so you don&#8217;t have to. O&#8217;Reilly&#8217;s CouchDB: The Definitive Guide has a chapter <a href="http://guide.couchdb.org/draft/clustering.html">on clustering</a> that is well worth a read.</p>
<p>Unfortunately while Whoosh is easy to work with it&#8217;s not designed to be used on a large scale. Indeed if someone was crazy enough to try to run the software we&#8217;ve developed in this series they might be advised to replace Whoosh with <a href="http://lucene.apache.org/solr/">Solr</a>. Solr is a Lucene-based search server which provides an HTTP interface to the full-text index. Solr comes with a <a href="http://wiki.apache.org/solr/DistributedSearch">sharding system</a> to enable you to query an index that is too large for a single machine.</p>
<p>So, with our two data storage tools providing an HTTP interface and both having replication and sharding either built in or available as a proxy, the chances of being able to scale effectively are good. Celery should allow the background tasks that are needed to run a search engine to be scaled, but the challenges of building and running a large scale infrastructure are many and I would not claim that these tools mean success is guaranteed!</p>
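<p>To give a feel for what sharding by a proxy means, here is a deliberately simplified sketch. It is not how CouchDB-lounge is implemented: the <tt>shard_for</tt> function, the node URLs and the choice of a CRC32 hash with a plain modulo are all assumptions made for illustration.</p>
<pre class="brush: python; title: ; notranslate">
import zlib


def shard_for(doc_id, nodes):
    # Hash the document id and map it onto one of the nodes, so the same
    # id is always routed to the same node.
    digest = zlib.crc32(doc_id.encode('utf-8'))
    return nodes[digest % len(nodes)]


nodes = ['http://couch1:5984', 'http://couch2:5984', 'http://couch3:5984']
node = shard_for('page-http://example.com/', nodes)
</pre>
<p>With a plain modulo like this, adding a node changes where most documents live, which is one reason real sharding proxies use more careful partitioning schemes.</p>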
<p>In the final post of this series I will discuss what I&#8217;ve learnt about running Celery with CouchDB, and with CouchDB in general. I&#8217;ll also describe how to download and run the complete code so you can try these techniques for yourself.</p>
<p>Read <a href="http://wp.me/pkxET-7T">part 8</a>.</p>
<hr />
<p>Photo of <a href="http://www.flickr.com/photos/theplanetdotcom/4878813385/">The Planet Data Center</a> by <a href="http://www.flickr.com/photos/theplanetdotcom/">The Planet</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewwilkinson.wordpress.com/2011/10/19/beating-google-with-couchdb-celery-and-whoosh-part-7/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0d5abb071bb1ab8518c3e9b0f4e718eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andrew</media:title>
		</media:content>

		<media:content url="http://farm5.static.flickr.com/4079/4878813385_3229fe1be4_m.jpg" medium="image">
			<media:title type="html">The Planet Data Center</media:title>
		</media:content>
	</item>
		<item>
		<title>Beating Google With CouchDB, Celery and Whoosh (Part 6)</title>
		<link>http://andrewwilkinson.wordpress.com/2011/10/13/beating-google-with-couchdb-celery-and-whoosh-part-6/</link>
		<comments>http://andrewwilkinson.wordpress.com/2011/10/13/beating-google-with-couchdb-celery-and-whoosh-part-6/#comments</comments>
		<pubDate>Thu, 13 Oct 2011 11:00:30 +0000</pubDate>
		<dc:creator>Andrew Wilkinson</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[celerycrawler]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[whoosh]]></category>

		<guid isPermaLink="false">http://andrewwilkinson.wordpress.com/?p=471</guid>
		<description><![CDATA[We&#8217;re nearing the end of our plot to create a Google-beating search engine (in my dreams at least) and in this post we&#8217;ll build the interface to query the index we&#8217;ve built up. Like Google the interface is very simple, just a text box on one page and a list of results on another. To [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=andrewwilkinson.wordpress.com&#038;blog=4895947&#038;post=471&#038;subd=andrewwilkinson&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/amortize/527435776/"><img style="float:right;border:0;" src="http://farm2.static.flickr.com/1179/527435776_a929bf88af_m.jpg" alt="Query" /></a>We&#8217;re nearing the end of our plot to create a Google-beating search engine (in my dreams at least) and in this post we&#8217;ll build the interface to query the index we&#8217;ve built up. Like Google the interface is very simple, just a text box on one page and a list of results on another.</p>
<p>To begin with we just need a page with a query box. To make the page slightly more interesting we&#8217;ll also include the number of pages in the index, and a list of the top documents as ordered by our ranking algorithm.</p>
<p>In the templates on this page we reference <tt>base.html</tt> which provides the boilerplate code needed to make an HTML page.</p>
<pre class="brush: xml; title: ; notranslate">
{% extends &quot;base.html&quot; %}

{% block body %}
    &lt;form action=&quot;/search&quot; method=&quot;get&quot;&gt;
        &lt;input name=&quot;q&quot; type=&quot;text&quot;&gt;
        &lt;input type=&quot;submit&quot;&gt;
    &lt;/form&gt;

    &lt;hr&gt;

    &lt;p&gt;{{ doc_count }} pages in index.&lt;/p&gt;

    &lt;hr&gt;

    &lt;h2&gt;Top Pages&lt;/h2&gt;

    &lt;ol&gt;
    {% for page in top_docs %}
        &lt;li&gt;&lt;a href=&quot;{{ page.url }}&quot;&gt;{{ page.url }}&lt;/a&gt; - {{ page.rank }}&lt;/li&gt;
    {% endfor %}
    &lt;/ol&gt;
{% endblock %}
</pre>
<p>To show the number of pages in the index we need to count them. We&#8217;ve already created a view to list <tt>Page</tt>s by their URL and CouchDB can return the number of documents in a view without actually returning any of them, so we can just get the count from that. We&#8217;ll add the following function to the <tt>Page</tt> model class.</p>
<pre class="brush: python; title: ; notranslate">
    @staticmethod
    def count():
        r = settings.db.view(&quot;page/by_url&quot;, limit=0)
        return r.total_rows
</pre>
<p>We also need to be able to get a list of the top pages, by rank. We just need to create a view that has the rank as the key and CouchDB will sort it for us automatically.</p>
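<p>The lookup itself might look something like the sketch below. This is a guess at the shape of the code rather than the version in the repository: the view name <tt>page/by_rank</tt> is hypothetical and is assumed to emit the rank as the key and the page URL as the value, and the function takes the database as an argument instead of living on the <tt>Page</tt> class.</p>
<pre class="brush: python; title: ; notranslate">
def get_top_by_rank(db, limit=20):
    # 'page/by_rank' is an assumed view name. Asking for descending order
    # puts the highest ranked pages first.
    rows = db.view('page/by_rank', descending=True, limit=limit)
    return [{'url': row.value, 'rank': row.key} for row in rows]
</pre>
<p>Returning dictionaries keeps the template lookups <tt>page.url</tt> and <tt>page.rank</tt> working, since Django templates resolve dotted names against dictionary keys.</p>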
<p>With all the background pieces in place the Django view function to render the index is really very straightforward.</p>
<pre class="brush: python; title: ; notranslate">
def index(req):
    return render_to_response(&quot;index.html&quot;, { &quot;doc_count&quot;: Page.count(), &quot;top_docs&quot;: Page.get_top_by_rank(limit=20) })
</pre>
<p>Now we get to the meat of the experiment, the search results page. First we need to query the index.</p>
<pre class="brush: python; title: ; notranslate">
def search(req):
    q = QueryParser(&quot;content&quot;, schema=schema).parse(req.GET[&quot;q&quot;])
</pre>
<p>This parses the user-submitted query and prepares it for use by Whoosh. Next we need to pass the parsed query to the index.</p>
<pre class="brush: python; title: ; notranslate">
    results = get_searcher().search(q, limit=100)
</pre>
<p>Hurrah! Now we have a list of results that match our search query. All that remains is to decide what order to display them in. To do this we normalize the score returned by Whoosh and the rank that we calculated, and add them together.</p>
<pre class="brush: python; title: ; notranslate">
    if len(results) &gt; 0:
        max_score = max([r.score for r in results])
        max_rank = max([r.fields()[&quot;rank&quot;] for r in results])
</pre>
<p>To calculate our combined rank we normalize the score and the rank by setting the largest value of each to one and scaling the rest appropriately.</p>
<pre class="brush: python; title: ; notranslate">
        combined = []
        for r in results:
            fields = r.fields()
            r.score = r.score/max_score
            r.rank = fields[&quot;rank&quot;]/max_rank
            r.combined = r.score + r.rank
            combined.append(r)
</pre>
<p>The final stage is to sort our list by the combined score and render the results page.</p>
<pre class="brush: python; title: ; notranslate">
        combined.sort(key=lambda x: x.combined, reverse=True)
    else:
        combined = []

    return render_to_response(&quot;results.html&quot;, { &quot;q&quot;: req.GET[&quot;q&quot;], &quot;results&quot;: combined })
</pre>
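<p>The normalize-and-add step is easier to follow with some numbers. The sketch below repeats the same arithmetic on plain tuples, without Whoosh or Django; the <tt>combine</tt> function and the sample URLs, scores and ranks are invented for the example, and it assumes every score and rank is positive.</p>
<pre class="brush: python; title: ; notranslate">
def combine(results):
    # results is a list of (url, score, rank) tuples.
    if not results:
        return []
    max_score = max(score for _, score, _ in results)
    max_rank = max(rank for _, _, rank in results)
    # Scale each value so the largest becomes one, then add them together.
    combined = [
        (url, score / max_score + rank / max_rank)
        for url, score, rank in results
    ]
    combined.sort(key=lambda item: item[1], reverse=True)
    return combined


ordered = combine([
    ('http://a/', 4.0, 0.1),
    ('http://b/', 2.0, 0.4),
    ('http://c/', 1.0, 0.2),
])
</pre>
<p>Page <tt>a</tt> is the best text match but has a low rank, so it ends up with 1.0 + 0.25 = 1.25. Page <tt>b</tt> scores 0.5 + 1.0 = 1.5 and comes first, and page <tt>c</tt> trails with 0.25 + 0.5 = 0.75.</p>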
<p>The template for the results page is below.</p>
<pre class="brush: xml; title: ; notranslate">
{% extends &quot;base.html&quot; %}

{% block body %}
    &lt;form action=&quot;/search&quot; method=&quot;get&quot;&gt;
        &lt;input name=&quot;q&quot; type=&quot;text&quot; value=&quot;{{ q }}&quot;&gt;
        &lt;input type=&quot;submit&quot;&gt;
    &lt;/form&gt;

    {% for result in results|slice:&quot;:20&quot; %}
        &lt;p&gt;
            &lt;b&gt;&lt;a href=&quot;{{ result.url }}&quot;&gt;{{ result.title|safe }}&lt;/a&gt;&lt;/b&gt; ({{ result.score }}, {{ result.rank }}, {{ result.combined }})&lt;br&gt;
            {{ result.desc|safe }}
        &lt;/p&gt;
    {% endfor %}
{% endblock %}
</pre>
<p>So, there we have it. A complete web crawler, indexer and query website. In the next post I&#8217;ll discuss how to scale the search engine.</p>
<p>Read <a href="http://wp.me/pkxET-7E">part 7</a>.</p>
<hr />
<p>Photo of <a href="http://www.flickr.com/photos/amortize/527435776/">Query</a> by <a href="http://www.flickr.com/photos/amortize/">amortize</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewwilkinson.wordpress.com/2011/10/13/beating-google-with-couchdb-celery-and-whoosh-part-6/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0d5abb071bb1ab8518c3e9b0f4e718eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andrew</media:title>
		</media:content>

		<media:content url="http://farm2.static.flickr.com/1179/527435776_a929bf88af_m.jpg" medium="image">
			<media:title type="html">Query</media:title>
		</media:content>
	</item>
		<item>
		<title>Beating Google With CouchDB, Celery and Whoosh (Part 5)</title>
		<link>http://andrewwilkinson.wordpress.com/2011/10/11/beating-google-with-couchdb-celery-and-whoosh-part-5/</link>
		<comments>http://andrewwilkinson.wordpress.com/2011/10/11/beating-google-with-couchdb-celery-and-whoosh-part-5/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 11:00:16 +0000</pubDate>
		<dc:creator>Andrew Wilkinson</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[celerycrawler]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[whoosh]]></category>

		<guid isPermaLink="false">http://andrewwilkinson.wordpress.com/?p=462</guid>
		<description><![CDATA[In this post we&#8217;ll continue building the backend for our search engine by implementing the algorithm we designed in the last post for ranking pages. We&#8217;ll also build a index of our pages with Whoosh, a pure-Python full-text indexer and query engine. To calculate the rank of a page we need to know what other [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=andrewwilkinson.wordpress.com&#038;blog=4895947&#038;post=462&#038;subd=andrewwilkinson&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/flyzipper/61475775/"><img style="float:right;border:0;" src="http://farm1.static.flickr.com/29/61475775_6b823a6db7_m.jpg" alt="order" /></a>In this post we&#8217;ll continue building the backend for our search engine by implementing the algorithm we designed in the last post for ranking pages. We&#8217;ll also build an index of our pages with <a href="https://bitbucket.org/mchaput/whoosh/wiki/Home">Whoosh</a>, a pure-Python full-text indexer and query engine.</p>
<p>To calculate the rank of a page we need to know what other pages link to a given url, and how many links that page has. The code below is a CouchDB map called <tt>page/links_to_url</tt>. For each page this will output a row for each link on the page with the url linked to as the key and the page&#8217;s rank and number of links as the value.</p>
<pre class="brush: jscript; title: ; notranslate">
function (doc) {
    if(doc.type == &quot;page&quot;) {
        for(var i = 0; i &lt; doc.links.length; i++) {
            emit(doc.links[i], [doc.rank, doc.links.length]);
        }
    }
}
</pre>
<p>As before we&#8217;re using a Celery task to allow us to distribute our calculations. When we wrote the <tt>find_links</tt> task we called <tt>calculate_rank</tt> with the document id for our page as the parameter.</p>
<pre class="brush: python; title: ; notranslate">
@task
def calculate_rank(doc_id):
    page = Page.load(settings.db, doc_id)
</pre>
<p>Next we get a list of ranks for the pages that link to this page. This static method is a thin wrapper around the <tt>page/links_to_url</tt> map function given above.</p>
<pre class="brush: python; title: ; notranslate">
    links = Page.get_links_to_url(page.url)
</pre>
<p>Now we have the list of ranks we can calculate the rank of this page by dividing the rank of the linking page by the number of links and summing this across all the linking pages.</p>
<pre class="brush: python; title: ; notranslate">
    rank = 0.0
    for link in links:
        # link is [rank, number_of_links]; float() avoids integer division on Python 2
        rank += float(link[0]) / link[1]
</pre>
<p>To prevent cycles (where <tt>A</tt> links to <tt>B</tt> and <tt>B</tt> links to <tt>A</tt>) from causing an infinite loop in our calculation we apply a damping factor. This multiplies the value of each link by 0.85 and, combined with the limit later in the function, forces any loops to settle on a value.</p>
<pre class="brush: python; title: ; notranslate">
    old_rank = page.rank
    page.rank = rank * 0.85
</pre>
<p>If we didn&#8217;t find any links to this page then we give it a default rank of <tt>1/number_of_pages</tt>.</p>
<pre class="brush: python; title: ; notranslate">
    if page.rank == 0:
        page.rank = 1.0/settings.db.view(&quot;page/by_url&quot;, limit=0).total_rows
</pre>
<p>Finally we compare the new rank to the previous rank in our system. If it has changed by more than 0.0001 then we save the new rank and cause all the pages linked to from our page to recalculate their rank.</p>
<pre class="brush: python; title: ; notranslate">
    if abs(old_rank - page.rank) &gt; 0.0001:
        page.store(settings.db)

        for link in page.links:
            p = Page.get_id_by_url(link, update=False)
            if p is not None:
                calculate_rank.delay(p)
</pre>
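<p>To see how the pieces fit together without CouchDB or Celery, here is a rough in-memory sketch of the same calculation, written for Python 3. A <tt>deque</tt> stands in for the task queue and a dictionary for the database; <tt>rank_pages</tt>, the <tt>max_steps</tt> guard and the tiny sample graph are assumptions made for the example, not part of the crawler.</p>
<pre class="brush: python; title: ; notranslate">
import math
from collections import deque

DAMPING = 0.85
TOLERANCE = 0.0001

def rank_pages(links, max_steps=10000):
    '''links maps each url to the list of urls it links to.'''
    ranks = {url: 0.0 for url in links}
    default = 1.0 / len(links)
    queue = deque(links)  # stands in for the Celery task queue
    for _ in range(max_steps):
        if not queue:
            break
        url = queue.popleft()
        # Sum rank / number_of_links over every page linking to url
        total = sum(ranks[src] / len(out)
                    for src, out in links.items() if url in out)
        new_rank = total * DAMPING
        if new_rank == 0:
            new_rank = default
        # Only a significant change is saved and passed on
        if not math.isclose(new_rank, ranks[url], abs_tol=TOLERANCE):
            ranks[url] = new_rank
            queue.extend(links[url])  # ask linked pages to recalculate
    return ranks

# Hypothetical three-page site with a cycle
graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
ranks = rank_pages(graph)
</pre>
<p>Each saved change queues one recalculation per outgoing link, which is the queue growth described below.</p>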
<p>This is a very simplistic implementation of a page rank algorithm. It does generate a useful ranking of pages, but the number of queued <tt>calculate_rank</tt> tasks explodes. In a later post I&#8217;ll discuss how this could be rewritten to be more efficient.</p>
<p><a href="https://bitbucket.org/mchaput/whoosh/wiki/Home">Whoosh</a> is a pure-Python full text search engine. In the next post we&#8217;ll look at querying it, but first we need to index the pages we&#8217;ve crawled.</p>
<p>The first step with Whoosh is to specify your schema. To speed up the display of results we store the information we need to render the results page directly in the schema. For this we need the page title, url and description. We also store the score given to the page by our pagerank-like algorithm. Finally we add the page text to the index so we can query it. If you want more details, the <a href="http://packages.python.org/Whoosh/">Whoosh documentation</a> is pretty good.</p>
<pre class="brush: python; title: ; notranslate">
from whoosh.fields import *

schema = Schema(title=TEXT(stored=True), url=ID(stored=True, unique=True), desc=ID(stored=True), rank=NUMERIC(stored=True, type=float), content=TEXT)
</pre>
<p>CouchDB provides an interface for being informed whenever a document in the database <a href="http://guide.couchdb.org/draft/notifications.html">changes</a>. This is perfect for building an index.</p>
<p>Our full-text indexing daemon is implemented as a Django management command so there is some boilerplate code required to make this work.</p>
<pre class="brush: python; title: ; notranslate">
class Command(BaseCommand):
    def handle(self, **options):
        since = get_last_change()
        writer = get_writer()
</pre>
<p>CouchDB allows you to get all the changes that have occurred since a specific point in time (identified by a sequence number). We store this number inside the Whoosh index directory, and access it using the <tt>get_last_change</tt> and <tt>set_last_change</tt> functions. Our access to the Whoosh index is through an <a href="http://packages.python.org/Whoosh/quickstart.html#the-indexwriter-object">IndexWriter</a> object, again accessed through an abstraction function.</p>
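<p>The two helper functions are not shown in this post. A minimal sketch of what they might look like follows, assuming the number is kept in a small text file and that sequence numbers are integers (true of the CouchDB releases current when this was written; newer releases may return opaque strings). The directory and file names are hypothetical.</p>
<pre class="brush: python; title: ; notranslate">
import os

# Hypothetical location; the post only says the number lives in the index directory
INDEX_DIR = 'whoosh_index'
LAST_CHANGE_FILE = os.path.join(INDEX_DIR, 'last_change')

def get_last_change():
    '''Return the stored sequence number, or 0 if none has been saved yet.'''
    if not os.path.exists(LAST_CHANGE_FILE):
        return 0
    with open(LAST_CHANGE_FILE) as f:
        return int(f.read().strip())

def set_last_change(seq):
    '''Record the sequence number of the last change that was indexed.'''
    os.makedirs(INDEX_DIR, exist_ok=True)
    with open(LAST_CHANGE_FILE, 'w') as f:
        f.write(str(seq))
</pre>
<p>Starting from 0 on the first run means the whole database is indexed once, after which only new changes are processed.</p>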
<p>Now we enter an infinite loop and call the <tt>changes</tt> function on our CouchDB database object to get the changes.</p>
<pre class="brush: python; title: ; notranslate">
        try:
            while True:
                changes = settings.db.changes(since=since)
                since = changes[&quot;last_seq&quot;]
                for changeset in changes[&quot;results&quot;]:
                    try:
                        doc = settings.db[changeset[&quot;id&quot;]]
                    except couchdb.http.ResourceNotFound:
                        continue
</pre>
<p>In our database we store <tt>robots.txt</tt> files as well as pages, so we need to ignore them. We also need to parse the document so we can pull out the text from the page. We do this with the <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> library.</p>
<pre class="brush: python; title: ; notranslate">
                    if &quot;type&quot; in doc and doc[&quot;type&quot;] == &quot;page&quot;:
                        soup = BeautifulSoup(doc[&quot;content&quot;])
                        if soup.body is None:
                            continue
</pre>
<p>On the results page we try to use the meta description if we can find it.</p>
<pre class="brush: python; title: ; notranslate">
                        desc = soup.findAll('meta', attrs={ &quot;name&quot;: desc_re })
</pre>
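<p>The <tt>desc_re</tt> pattern is not shown in the post. One plausible definition, offered here purely as an assumption, is a case-insensitive match on the word <tt>description</tt>, since pages are not consistent about capitalising the attribute value.</p>
<pre class="brush: python; title: ; notranslate">
import re

# Assumed definition; the original pattern is not shown in the post
desc_re = re.compile(r'^description$', re.IGNORECASE)

# Attribute values as they might appear in the name attribute of a meta tag
names = ['description', 'Description', 'DESCRIPTION', 'og:description', 'keywords']
matching = [name for name in names if desc_re.match(name)]
</pre>
<p>Anchoring the pattern at both ends keeps values such as <tt>og:description</tt> from matching.</p>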
<p>Once we&#8217;ve got the parsed document we update our Whoosh index. The code is a little complicated because we need to handle the case where the page doesn&#8217;t have a title or description, and because we want searches to match the title as well as the body text of the page. The key element here is <tt>text=True</tt> which pulls out just the text from a node and strips out all of the tags.</p>
<pre class="brush: python; title: ; notranslate">
                        # Fall back to the url when the page has no usable title
                        title_text = soup.title(text=True) if soup.title is not None else []
                        title = unicode(title_text[0]) if len(title_text) &gt; 0 else unicode(doc[&quot;url&quot;])
                        writer.update_document(
                                title=title,
                                url=unicode(doc[&quot;url&quot;]),
                                desc=unicode(desc[0][&quot;content&quot;]) if len(desc) &gt; 0 and desc[0][&quot;content&quot;] is not None else u&quot;&quot;,
                                rank=doc[&quot;rank&quot;],
                                content=title + u&quot;\n&quot; + unicode(doc[&quot;url&quot;]) + u&quot;\n&quot; + u&quot;&quot;.join(soup.body(text=True))
                            )
</pre>
<p>Finally we update the index and save the last change number so next time the script is run we continue from where we left off.</p>
<pre class="brush: python; title: ; notranslate">
                    writer.commit()
                    writer = get_writer()

                set_last_change(since)
        finally:
            set_last_change(since)
</pre>
<p>In the next post I&#8217;ll discuss how to query the index, sort the documents by our two rankings and build a simple web interface.</p>
<p>Read <a href="http://wp.me/pkxET-7B">part 6</a>.</p>
<hr />
<p>Photo of <a href="http://www.flickr.com/photos/flyzipper/61475775/">order</a> by <a href="http://www.flickr.com/photos/flyzipper/">Steve Mishos</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewwilkinson.wordpress.com/2011/10/11/beating-google-with-couchdb-celery-and-whoosh-part-5/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0d5abb071bb1ab8518c3e9b0f4e718eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andrew</media:title>
		</media:content>

		<media:content url="http://farm1.static.flickr.com/29/61475775_6b823a6db7_m.jpg" medium="image">
			<media:title type="html">order</media:title>
		</media:content>
	</item>
		<item>
		<title>Beating Google With CouchDB, Celery and Whoosh (Part 4)</title>
		<link>http://andrewwilkinson.wordpress.com/2011/10/06/beating-google-with-couchdb-celery-and-whoosh-part-4/</link>
		<comments>http://andrewwilkinson.wordpress.com/2011/10/06/beating-google-with-couchdb-celery-and-whoosh-part-4/#comments</comments>
		<pubDate>Thu, 06 Oct 2011 11:00:26 +0000</pubDate>
		<dc:creator>Andrew Wilkinson</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[celerycrawler]]></category>
		<category><![CDATA[django]]></category>

		<guid isPermaLink="false">http://andrewwilkinson.wordpress.com/?p=449</guid>
		<description><![CDATA[In this series I&#8217;m showing you how to build a webcrawler and search engine using standard Python-based tools like Django, Celery and Whoosh with a CouchDB backend. In previous posts we created a data structure, parsed and stored robots.txt and stored a single webpage in our database. In this post I&#8217;ll show you how [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/grismarengo/2516495079/"><img style="float:right;border:0;" src="http://farm3.static.flickr.com/2039/2516495079_a4c363f960_m.jpg" alt="Red Sofa encounter i" /></a>In this series I&#8217;m showing you how to build a webcrawler and search engine using standard Python-based tools like Django, Celery and Whoosh with a CouchDB backend. In previous posts we created a data structure, parsed and stored <tt>robots.txt</tt> and stored a single webpage in our database. In this post I&#8217;ll show you how to parse out the links from our stored HTML document so we can complete the crawler, and we&#8217;ll start calculating the rank for the pages in our database.</p>
<p>There are several different ways of parsing out the links in a given HTML document. You can just use a regular expression to pull the urls out, or you can use a more complete but also more complicated (and slower) method of parsing the HTML using the standard Python <a href="http://docs.python.org/library/htmlparser.html">htmlparser</a> library, or the wonderful <a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>. The point of this series isn&#8217;t to build a complete webcrawler, but to show you the basic building blocks. So, for simplicity&#8217;s sake I&#8217;ll use a regular expression. </p>
<pre class="brush: python; title: ; notranslate">
link_single_re = re.compile(r&quot;&lt;a[^&gt;]+href='([^']+)'&quot;)
link_double_re = re.compile(r'&lt;a[^&gt;]+href=&quot;([^&quot;]+)&quot;')
</pre>
<p>All we need to look for is an <tt>href</tt> attribute in an <tt>a</tt> tag. We&#8217;ll use two regular expressions to handle single and double quotes, and then build a list containing all the links in the document.</p>
<pre class="brush: python; title: ; notranslate">
@task
def find_links(doc_id):
    doc = Page.load(settings.db, doc_id)

    raw_links = []
    for match in link_single_re.finditer(doc.content):
        raw_links.append(match.group(1))

    for match in link_double_re.finditer(doc.content):
        raw_links.append(match.group(1))
</pre>
<p>Once we&#8217;ve got a list of the raw links we need to process them into absolute urls that we can send back to the <tt>retrieve_page</tt> task we wrote earlier. I&#8217;m cutting some corners with processing these urls, in particular I&#8217;m not dealing with <a href="http://www.w3.org/TR/html4/struct/links.html#h-12.4">base</a> tags.</p>
<pre class="brush: python; title: ; notranslate">
    doc.links = []
    for link in raw_links:
        if link.startswith(&quot;#&quot;):
            continue
        elif link.startswith(&quot;http://&quot;) or link.startswith(&quot;https://&quot;):
            pass
        elif link.startswith(&quot;/&quot;):
            parse = urlparse(doc[&quot;url&quot;])
            link = parse.scheme + &quot;://&quot; + parse.netloc + link
        else:
            link = &quot;/&quot;.join(doc[&quot;url&quot;].split(&quot;/&quot;)[:-1]) + &quot;/&quot; + link

        doc.links.append(unescape(link.split(&quot;#&quot;)[0]))

    doc.store(settings.db)
</pre>
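<p>The hand-rolled resolution above covers the common cases. As a point of comparison, the standard library&#8217;s <tt>urljoin</tt> handles the same cases and also copes with <tt>../</tt> segments. The sketch below is written for Python 3, where the function lives in <tt>urllib.parse</tt> (in Python 2 it is <tt>urlparse.urljoin</tt>). The <tt>resolve_links</tt> helper is hypothetical and leaves out the entity unescaping done above.</p>
<pre class="brush: python; title: ; notranslate">
from urllib.parse import urljoin, urldefrag

def resolve_links(page_url, raw_links):
    '''Turn raw href values into absolute urls without fragments.'''
    resolved = []
    for link in raw_links:
        if link.startswith('#'):
            continue  # a fragment-only link points back at the same page
        absolute = urljoin(page_url, link)
        resolved.append(urldefrag(absolute)[0])
    return resolved

links = resolve_links('http://example.com/blog/post.html',
                      ['#top', '/about', 'next.html', '../index.html#intro',
                       'https://other.example/page'])
</pre>
<p>This does not deal with <tt>base</tt> tags either; it only replaces the string handling.</p>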
<p>Once we&#8217;ve got our list of links and saved the modified document we then need to trigger the next series of steps to occur. We need to calculate the rank of this page, so we trigger that task and then we step through each page that we linked to. If we&#8217;ve already got a copy of the page then we want to recalculate its rank to take into account the rank of this page (more on this later) and if we don&#8217;t have a copy then we queue it up to be retrieved.</p>
<pre class="brush: python; title: ; notranslate">
    calculate_rank.delay(doc.id)

    for link in doc.links:
        p = Page.get_id_by_url(link, update=False)
        if p is not None:
            calculate_rank.delay(p)
        else:
            retrieve_page.delay(link)
</pre>
<p>We&#8217;ve now got a complete webcrawler. We can store webpages and <tt>robots.txt</tt> files. Given a starting URL our crawler will set about parsing pages to find out what they link to and retrieve those pages as well. Given enough time you&#8217;ll end up with most of the internet on your hard disk!</p>
<p>When we come to write the website to query the information we&#8217;ve collected we&#8217;ll use two numbers to rank pages. First we&#8217;ll use a value that ranks pages based on the query used, but we&#8217;ll also use a value that ranks pages based on their importance. This is the same method used by Google, known as <a href="http://en.wikipedia.org/wiki/Page_Rank">Page Rank</a>.</p>
<p>Page Rank is a measure of how likely you are to end up on a given page by clicking on a random link anywhere on the internet. The <a href="http://en.wikipedia.org/wiki/Page_Rank">Wikipedia article</a> goes into some detail on a number of ways to calculate it, but we&#8217;ll use a very simple iterative algorithm.</p>
<p>When created, a page is given a rank equal to <tt>1/number of pages</tt>. Each link that is found on a newly crawled page then causes the rank of the destination page to be calculated. The rank of a page is the sum, over the pages that link to it, of each linking page&#8217;s rank divided by the number of links on that page, multiplied by a damping factor (I use 0.85, but this could be adjusted). If a page has a rank of 0.25 and has five links then each page linked to gains 0.05*0.85 rank for that link. If the change in rank of the page when recalculated is significant then the ranks of all the pages it links to are recalculated.</p>
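<p>The arithmetic in that example can be written out as a couple of small functions. This is only an illustration of the formula; the function names are made up and the second linking page in the sample data is invented.</p>
<pre class="brush: python; title: ; notranslate">
DAMPING = 0.85

def link_value(rank, number_of_links, damping=DAMPING):
    '''Rank passed along a single link from a page.'''
    return rank / number_of_links * damping

def page_rank(inbound, damping=DAMPING):
    '''inbound is a list of (rank, number_of_links) pairs for linking pages.'''
    return sum(link_value(rank, n, damping) for rank, n in inbound)

# The example from the text: rank 0.25 spread over five links
per_link = link_value(0.25, 5)  # 0.05 * 0.85

# A page linked to by that page and by a second, invented page
total = page_rank([(0.25, 5), (0.1, 2)])
</pre>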
<p>In this post we&#8217;ve completed the web crawler part of our search engine and discussed how to rank pages in importance. In the next post we&#8217;ll implement this ranking and also create a full text index of the pages we have crawled.</p>
<p>Read <a href="http://wp.me/pkxET-7s">part 5</a>.</p>
<hr />
<p>Photo of <a href="http://www.flickr.com/photos/grismarengo/2516495079/">Red Sofa encounter i</a> by <a href="http://www.flickr.com/photos/grismarengo/">Ricard Gil</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewwilkinson.wordpress.com/2011/10/06/beating-google-with-couchdb-celery-and-whoosh-part-4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0d5abb071bb1ab8518c3e9b0f4e718eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andrew</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2039/2516495079_a4c363f960_m.jpg" medium="image">
			<media:title type="html">Red Sofa encounter i</media:title>
		</media:content>
	</item>
		<item>
		<title>Beating Google With CouchDB, Celery and Whoosh (Part 3)</title>
		<link>http://andrewwilkinson.wordpress.com/2011/10/04/beating-google-with-couchdb-celery-and-whoosh-part-3/</link>
		<comments>http://andrewwilkinson.wordpress.com/2011/10/04/beating-google-with-couchdb-celery-and-whoosh-part-3/#comments</comments>
		<pubDate>Tue, 04 Oct 2011 11:00:18 +0000</pubDate>
		<dc:creator>Andrew Wilkinson</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[celerycrawler]]></category>
		<category><![CDATA[django]]></category>

		<guid isPermaLink="false">http://andrewwilkinson.wordpress.com/?p=441</guid>
		<description><![CDATA[In this series I&#8217;ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In this post we&#8217;ll start crawling the web and filling our database with the contents of pages. One of the rules we set down was to not request a page too often. If, by [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/tim_ellis/5586571637/"><img style="float:right;border:0;" src="http://farm6.static.flickr.com/5269/5586571637_f106791f3b_m.jpg" alt="Celery" /></a>In this series I&#8217;ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In this post we&#8217;ll start crawling the web and filling our database with the contents of pages.</p>
<p>One of the rules we set down was to not request a page too often. If, by accident, we try to retrieve a page more than once a week then we don&#8217;t want that request to actually make it to the internet. To help prevent this we&#8217;ll extend the <tt>Page</tt> class we created in the last post with a function called <tt>get_by_url</tt>. This static method will take a url and return the Page object that represents it, retrieving the page if we don&#8217;t already have a copy. You could create this as an independent function, but I prefer to use static methods to keep things tidy.</p>
<p>We only actually want to retrieve the page from the internet in one of the three tasks that we&#8217;re going to create, so we&#8217;ll give <tt>get_by_url</tt> a parameter, <tt>update</tt>, that enables us to return <tt>None</tt> if we don&#8217;t have a copy of the page.</p>
<pre class="brush: python; title: ; notranslate">
    @staticmethod
    def get_by_url(url, update=True):
        r = settings.db.view(&quot;page/by_url&quot;, key=url)
        if len(r.rows) == 1:
            doc = Page.load(settings.db, r.rows[0].value)
            if doc.is_valid():
                return doc
        elif not update:
            return None
        else:
            doc = Page(url=url)

        doc.update()

        return doc
</pre>
<p>The key line in the static method is <tt>doc.update()</tt>. This calls the function that retrieves the page and makes sure we respect the <tt>robots.txt</tt> file. Let&#8217;s look at what happens in that function.</p>
<pre class="brush: python; title: ; notranslate">
    def update(self):
        parse = urlparse(self.url)
</pre>
<p>We need to split up the given URL so we know whether it&#8217;s a secure connection or not, and we need to limit our connections to each domain, so we need to get that as well. Python has a module, <a href="http://docs.python.org/library/urlparse.html">urlparse</a>, that does the hard work for us.</p>
<pre class="brush: python; title: ; notranslate">
        robotstxt = RobotsTxt.get_by_domain(parse.scheme, parse.netloc)
        if not robotstxt.is_allowed(self.url):
            return False
</pre>
<p>In the previous post we discussed parsing the <tt>robots.txt</tt> file and here we make sure that if we&#8217;re not allowed to index a page, then we don&#8217;t.</p>
<pre class="brush: python; title: ; notranslate">
        while cache.get(parse.netloc) is not None:
            time.sleep(1)
        cache.set(parse.netloc, True, 10)
</pre>
<p>As with the code to parse <tt>robots.txt</tt> files we need to make sure we don&#8217;t access the same domain too often.</p>
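<p>The throttling idea can be sketched on its own, outside Django. The version below keeps the time of the last request per domain in a dictionary, so it only works within a single process, whereas Django&#8217;s cache can be shared between workers. The function name and the injectable <tt>sleep</tt> and <tt>clock</tt> parameters are assumptions made to keep the example self-contained.</p>
<pre class="brush: python; title: ; notranslate">
import time

DELAY = 10.0        # seconds between requests to one domain, as in the post
_last_request = {}  # maps a domain to the clock reading of its last request

def wait_for_domain(domain, delay=DELAY, sleep=time.sleep, clock=time.monotonic):
    '''Block until at least delay seconds have passed since the last request.'''
    last = _last_request.get(domain)
    if last is not None:
        remaining = delay - (clock() - last)
        sleep(max(0.0, remaining))  # sleeps for zero seconds if enough time has passed
    _last_request[domain] = clock()
</pre>
<p>With a shared cache there is a small window between checking for the key and setting it in which two workers can both proceed. Django&#8217;s <tt>cache.add</tt>, which only stores a value when the key is absent, is one way to narrow that window.</p>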
<pre class="brush: python; title: ; notranslate">
        req = Request(self.url, None, { &quot;User-Agent&quot;: settings.USER_AGENT })

        resp = urlopen(req)
        if not resp.info()[&quot;Content-Type&quot;].startswith(&quot;text/html&quot;):
            return
        self.content = resp.read().decode(&quot;utf8&quot;)
        self.last_checked = datetime.now()

        self.store(settings.db)
</pre>
<p>Finally, once we&#8217;ve checked we&#8217;re allowed to access a page and haven&#8217;t accessed another page on the same domain recently we use the standard Python tools to download the content of the page and store it in our database.</p>
<p>Now we can retrieve a page we need to add it to the task processing system. To do this we&#8217;ll create a <a href="http://celeryproject.org/">Celery</a> task to retrieve the page. The task just needs to call the <tt>get_by_url</tt> static method we created earlier and then, if the page is downloaded, trigger a second task to parse out all of the links.</p>
<pre class="brush: python; title: ; notranslate">
@task
def retrieve_page(url):
    page = Page.get_by_url(url)
    if page is None:
        return

    find_links.delay(page.id)
</pre>
<p>You might be asking why the links aren&#8217;t parsed immediately after retrieving the page. They certainly could be, but a key goal was to enable the crawling process to scale as much as possible. Each page crawled has, based on the pages I&#8217;ve crawled so far, around 100 links on it. As part of the <tt>find_links</tt> task a new <tt>retrieve_page</tt> task is created for each link we don&#8217;t yet have a copy of. These quickly swamp other tasks, like calculating the rank of a page, and prevent them from being processed.</p>
<p>Celery provides a tool, called <tt>Queues</tt>, to ensure that a subset of messages is processed in a timely manner. Tasks can be assigned to different queues and daemons can be made to watch a specific set of queues. If you have a Celery daemon that only watches the queue used by your high priority tasks then those tasks will always be processed quickly.</p>
<p>We&#8217;ll use two queues, one for retrieving the pages and another for processing them. First we need to tell Celery about the queues (we also need to include the default <tt>celery</tt> queue here) and then we create a router class. The router looks at the task name and decides which queue to put it into. Your routing code could be very complicated, but ours is very straightforward.</p>
<pre class="brush: python; title: ; notranslate">
CELERY_QUEUES = {&quot;retrieve&quot;: {&quot;exchange&quot;: &quot;default&quot;, &quot;exchange_type&quot;: &quot;direct&quot;, &quot;routing_key&quot;: &quot;retrieve&quot;},
                 &quot;process&quot;: {&quot;exchange&quot;: &quot;default&quot;, &quot;exchange_type&quot;: &quot;direct&quot;, &quot;routing_key&quot;: &quot;process&quot;},
                 &quot;celery&quot;: {&quot;exchange&quot;: &quot;default&quot;, &quot;exchange_type&quot;: &quot;direct&quot;, &quot;routing_key&quot;: &quot;celery&quot;}}

class MyRouter(object):
    def route_for_task(self, task, args=None, kwargs=None):
        if task == &quot;crawler.tasks.retrieve_page&quot;:
            return { &quot;queue&quot;: &quot;retrieve&quot; }
        else:
            return { &quot;queue&quot;: &quot;process&quot; }

CELERY_ROUTES = (MyRouter(), )
</pre>
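<p>If the number of routing rules grows, a lookup table is one way to keep the router readable. The sketch below is a variation on the class above, not something the crawler uses; <tt>TableRouter</tt> and its arguments are made up for the example. It is plain Python, so it can be tried without Celery installed.</p>
<pre class="brush: python; title: ; notranslate">
class TableRouter(object):
    '''Route tasks by name using a lookup table, with a fallback queue.'''

    def __init__(self, table, default_queue):
        self.table = table
        self.default_queue = default_queue

    def route_for_task(self, task, args=None, kwargs=None):
        return {'queue': self.table.get(task, self.default_queue)}

# Hypothetical table that reproduces the rule used above
router = TableRouter({'crawler.tasks.retrieve_page': 'retrieve'}, 'process')
retrieve_route = router.route_for_task('crawler.tasks.retrieve_page')
rank_route = router.route_for_task('crawler.tasks.calculate_rank')
</pre>
<p>A worker can then be restricted to a single queue with the worker&#8217;s <tt>-Q</tt> option, for example <tt>-Q retrieve</tt>; the exact command name depends on your Celery version.</p>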
<p>The final step is to allow the crawler to be kicked off by seeding it with some URLs. I&#8217;ve previously posted about how to create a <a href="http://andrewwilkinson.wordpress.com/2009/03/06/creating-django-management-commands/">Django management command</a> and they&#8217;re a perfect fit here. The command takes one argument, the url, and creates a Celery task to retrieve it.</p>
<pre class="brush: python; title: ; notranslate">
class Command(BaseCommand):
    def handle(self, url, **options):
        retrieve_page.delay(url)
</pre>
<p>We&#8217;ve now got a web crawler that is almost complete. In the next post I&#8217;ll discuss parsing links out of the HTML, and we&#8217;ll look at calculating the rank of each page.</p>
<p>Read <a href="http://wp.me/pkxET-7f">part 4</a>.</p>
<hr />
<p>Photo of <a href="http://www.flickr.com/photos/tim_ellis/5586571637/">Celery</a> by <a href="http://www.flickr.com/photos/tim_ellis/">tim ellis</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewwilkinson.wordpress.com/2011/10/04/beating-google-with-couchdb-celery-and-whoosh-part-3/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0d5abb071bb1ab8518c3e9b0f4e718eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andrew</media:title>
		</media:content>

		<media:content url="http://farm6.static.flickr.com/5269/5586571637_f106791f3b_m.jpg" medium="image">
			<media:title type="html">Celery</media:title>
		</media:content>
	</item>
		<item>
		<title>Beating Google With CouchDB, Celery and Whoosh (Part 2)</title>
		<link>http://andrewwilkinson.wordpress.com/2011/09/29/beating-google-with-couchdb-celery-and-whoosh-part-2/</link>
		<comments>http://andrewwilkinson.wordpress.com/2011/09/29/beating-google-with-couchdb-celery-and-whoosh-part-2/#comments</comments>
		<pubDate>Thu, 29 Sep 2011 11:00:41 +0000</pubDate>
		<dc:creator>Andrew Wilkinson</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[celerycrawler]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[whoosh]]></category>

		<guid isPermaLink="false">http://andrewwilkinson.wordpress.com/?p=412</guid>
		<description><![CDATA[In this series I&#8217;ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In this post we&#8217;ll begin by creating the data structure for storing the pages in the database, and write the first parts of the webcrawler. CouchDB&#8217;s Python library has a simple ORM system that [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/johnnystiletto/5226474427/"><img style="float:right;border:0;" src="http://farm6.static.flickr.com/5050/5226474427_90d7388bed_m.jpg" alt="Celery, Carrots &amp; Sweet Onion for Chicken Feet Stock by I Believe I Can Fry" /></a>In this <a href="http://andrewwilkinson.wordpress.com/2011/09/27/beating-google-with-couchdb-celery-and-whoosh-part-1/">series</a> I&#8217;ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In this post we&#8217;ll begin by creating the data structure for storing the pages in the database, and write the first parts of the webcrawler.</p>
<p>CouchDB&#8217;s Python library has a simple <a href="http://packages.python.org/CouchDB/mapping.html">ORM system</a> that makes it easy to convert between the JSON objects stored in the database and a Python object.</p>
<p>To create the class you just need to specify the names of the fields, and their type. So, what does a search engine need to store? The url is an obvious one, as is the content of the page. We also need to know when we last accessed the page. To make things easier we&#8217;ll also have a list of the urls that the page links to. One of the great advantages of a database like CouchDB is that we don&#8217;t need to create a separate table to hold the links, we can just include them directly in the main document. To help return the best pages we&#8217;ll use a <a href="http://en.wikipedia.org/wiki/PageRank">page rank</a> like algorithm to rank the page, so we also need to store that rank. Finally, as is good practice on CouchDB, we&#8217;ll give the document a <tt>type</tt> field so we can write views that only target this document type.</p>
<pre class="brush: python; title: ; notranslate">
class Page(Document):
    type = TextField(default=&quot;page&quot;)

    url = TextField()

    content = TextField()

    links = ListField(TextField())

    rank = FloatField(default=0)

    last_checked = DateTimeField(default=datetime.now)
</pre>
<p>That&#8217;s a lot of description for not a lot of code! Just add that class to your <tt>models.py</tt> file. It&#8217;s not a normal Django model, but we&#8217;re not using Django models in this project so it&#8217;s the right place to put it. </p>
<p>We also need to keep track of the urls that we are and aren&#8217;t allowed to access. Fortunately for us Python comes with a class, <a href="http://docs.python.org/library/robotparser.html">RobotFileParser</a>, which handles the parsing of the file for us. So, this becomes a much simpler model. We just need the domain name, and a <a href="http://docs.python.org/library/pickle.html">pickled</a> RobotFileParser instance. We also need to know whether we&#8217;re accessing the site over http or https, and we&#8217;ll give it a <tt>type</tt> field to distinguish it from the <tt>Page</tt> model.</p>
<pre class="brush: python; title: ; notranslate">
class RobotsTxt(Document):
    type = TextField(default=&quot;robotstxt&quot;)

    domain = TextField()
    protocol = TextField()

    robot_parser_pickle = TextField()
</pre>
<p>We want to make the pickle/unpickle process transparent so we&#8217;ll create a property that hides the underlying pickle representation. CouchDB can&#8217;t store the binary pickle value, so we also base64 encode it and store that instead. If the object hasn&#8217;t been set yet then we create a new one on the first access.</p>
<pre class="brush: python; title: ; notranslate">
    def _get_robot_parser(self):
        if self.robot_parser_pickle is not None:
            return pickle.loads(base64.b64decode(self.robot_parser_pickle))
        else:
            parser = RobotFileParser()
            parser.set_url(self.protocol + &quot;://&quot; + self.domain + &quot;/robots.txt&quot;)
            self.robot_parser = parser

            return parser
    def _set_robot_parser(self, parser):
        self.robot_parser_pickle = base64.b64encode(pickle.dumps(parser))
    robot_parser = property(_get_robot_parser, _set_robot_parser)
</pre>
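<p>The round trip that the property hides can be tried in isolation. The sketch below is written for Python 3, where the class lives in <tt>urllib.robotparser</tt>, and feeds the parser two literal rules instead of fetching a real file. The bot name and urls are placeholders.</p>
<pre class="brush: python; title: ; notranslate">
import base64
import pickle
from urllib.robotparser import RobotFileParser

# Build a parser from literal rules instead of fetching a real robots.txt
parser = RobotFileParser()
parser.parse(['User-agent: *', 'Disallow: /private/'])
parser.modified()  # record that the rules have been loaded

# Encode: pickle gives bytes, base64 turns them into text that fits in JSON
encoded = base64.b64encode(pickle.dumps(parser)).decode('ascii')

# Decode: reverse both steps to get an equivalent parser back
restored = pickle.loads(base64.b64decode(encoded))

allowed = restored.can_fetch('examplebot', 'http://example.com/index.html')
blocked = restored.can_fetch('examplebot', 'http://example.com/private/data.html')
</pre>
<p>The restored parser answers the same questions as the original, which is all the property needs to guarantee.</p>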
<p>For both pages and <tt>robots.txt</tt> files we need to know whether we should reaccess the page. We&#8217;ll do this by testing whether we accessed the file in the last seven days or not. For Page models we do this by adding the following function which implements this check.</p>
<pre class="brush: python; title: ; notranslate">
    def is_valid(self):
        return (datetime.now() - self.last_checked).days &lt; 7
</pre>
<p>For the <tt>RobotsTxt</tt> we can take advantage of the last modified value stored in the <tt>RobotFileParser</tt> that we&#8217;re wrapping. This is a unix timestamp so the <tt>is_valid</tt> function needs to be a little bit different, but follows the same pattern. </p>
<pre class="brush: python; title: ; notranslate">
    def is_valid(self):
        return (time.time() - self.robot_parser.mtime()) &lt; 7*24*60*60
</pre>
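<p>To see the two checks side by side, here is a small sketch written as plain functions rather than model methods. It assumes Python 3, and the names <tt>page_is_valid</tt> and <tt>robots_is_valid</tt> are hypothetical. Note that a brand new <tt>RobotFileParser</tt> reports an <tt>mtime()</tt> of zero, so it counts as stale until <tt>modified()</tt> is called.</p>

```python
import time
from datetime import datetime
from urllib.robotparser import RobotFileParser

SEVEN_DAYS = 7 * 24 * 60 * 60  # seconds


def page_is_valid(last_checked, now=None):
    """Page-style check: last_checked is a datetime."""
    now = now or datetime.now()
    return (now - last_checked).days < 7


def robots_is_valid(parser, now=None):
    """RobotsTxt-style check: parser.mtime() is a Unix timestamp in seconds."""
    now = time.time() if now is None else now
    return (now - parser.mtime()) < SEVEN_DAYS


parser = RobotFileParser()
fresh_before = robots_is_valid(parser)   # mtime() is 0 until modified() is called
parser.modified()                        # record "checked just now"
fresh_after = robots_is_valid(parser)
```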
<p>To update the stored copy of a <tt>robots.txt</tt> we need to get the currently stored version, read a new one, set the last modified timestamp and then write it back to the database. To avoid hitting the same server too often we can use <a href="https://docs.djangoproject.com/en/dev/topics/cache/">Django&#8217;s cache</a> to store a value for ten seconds, and sleep if that value already exists.</p>
<pre class="brush: python; title: ; notranslate">
    def update(self):
        while cache.get(self.domain) is not None:
            time.sleep(1)
        cache.set(self.domain, True, 10)

        parser = self.robot_parser
        parser.read()
        parser.modified()
        self.robot_parser = parser

        self.store(settings.db)
</pre>
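<p>The throttling idea in <tt>update</tt> can be tried out without Django. The sketch below is only an approximation under stated assumptions: <tt>TinyCache</tt> is a hypothetical in-memory stand-in for Django&#8217;s cache, <tt>wait_for_turn</tt> is a made-up helper name, and the clock and sleep function are injected so the demo runs instantly instead of really waiting.</p>

```python
import time


class TinyCache:
    """In-memory stand-in for a cache whose entries expire after a timeout."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # key -> (value, time at which the entry expires)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or self._clock() >= entry[1]:
            return None
        return entry[0]

    def set(self, key, value, timeout):
        self._entries[key] = (value, self._clock() + timeout)


def wait_for_turn(cache, domain, delay=10, sleep=time.sleep):
    """Block until no recent request to this domain is recorded, then record ours."""
    while cache.get(domain) is not None:
        sleep(1)
    cache.set(domain, True, delay)


# Demo with a fake clock: "sleeping" just moves the clock forward.
now = [0.0]
naps = []


def fake_sleep(seconds):
    naps.append(seconds)
    now[0] += seconds


cache = TinyCache(clock=lambda: now[0])
wait_for_turn(cache, "example.com", sleep=fake_sleep)   # first hit: no waiting
wait_for_turn(cache, "example.com", sleep=fake_sleep)   # second hit: waits out the delay
wait_for_turn(cache, "example.org", sleep=fake_sleep)   # other domains are unaffected
print(len(naps))
```

<p>With several workers the get-then-set pair can race. Django&#8217;s <tt>cache.add</tt>, which only stores a key that is not already present, is one way to narrow that gap.</p>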
<p>Once we&#8217;ve updated the stored file we need to be able to query it. This function just passes the URL being tested through to the underlying parser along with our user agent string.</p>
<pre class="brush: python; title: ; notranslate">
    def is_allowed(self, url):
        return self.robot_parser.can_fetch(settings.USER_AGENT, url)
</pre>
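<p>To get a feel for what <tt>can_fetch</tt> returns, the following sketch feeds a parser some rules directly rather than downloading them. It assumes Python 3 (<tt>urllib.robotparser</tt>). The user agent string and the rules are invented for the example.</p>

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler/0.1"  # hypothetical user agent string

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules directly instead of fetching them

allowed = parser.can_fetch(USER_AGENT, "http://example.com/index.html")
blocked = parser.can_fetch(USER_AGENT, "http://example.com/private/notes.html")
print(allowed, blocked)
```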
<p>The final piece in our <tt>robots.txt</tt> puzzle is a function to pull the right object out of the database. We&#8217;ll need a view that has the protocol and domain for each file as the key.</p>
<pre class="brush: python; title: ; notranslate">
    @staticmethod
    def get_by_domain(protocol, domain):
        r = settings.db.view(&quot;robotstxt/by_domain&quot;, key=[protocol, domain])
</pre>
<p>We query that mapping and if it returns a value then we load the object. If it&#8217;s still valid then we can return right away, otherwise we need to update it.</p>
<pre class="brush: python; title: ; notranslate">
        if len(r) &gt; 0:
            doc = RobotsTxt.load(settings.db, r.rows[0].value)
            if doc.is_valid():
                return doc
</pre>
<p>If we&#8217;ve never loaded this domain&#8217;s <tt>robots.txt</tt> file before then we need to create a blank object. The final step is to read the file and store it in the database.</p>
<pre class="brush: python; title: ; notranslate">
        else:
            doc = RobotsTxt(protocol=protocol, domain=domain)

        doc.update()

        return doc
</pre>
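<p>The control flow of <tt>get_by_domain</tt> is easy to misread because it is split across three snippets, so here it is in one piece. This is a simplified sketch, not the real model: a plain dictionary keyed on protocol and domain stands in for the CouchDB view, and <tt>FakeDoc</tt> is a hypothetical class whose <tt>update</tt> only marks the document as fresh.</p>

```python
class FakeDoc:
    """Minimal stand-in for a stored RobotsTxt document."""

    def __init__(self, protocol, domain):
        self.protocol = protocol
        self.domain = domain
        self.fresh = False
        self.updates = 0

    def is_valid(self):
        return self.fresh

    def update(self, store):
        # The real method re-reads robots.txt; here we only mark the doc as fresh.
        self.fresh = True
        self.updates += 1
        store[(self.protocol, self.domain)] = self


def get_by_domain(store, protocol, domain):
    """Return a usable document, refreshing or creating it when necessary."""
    doc = store.get((protocol, domain))
    if doc is not None:
        if doc.is_valid():
            return doc                     # found and still fresh
    else:
        doc = FakeDoc(protocol, domain)    # never seen this domain before
    doc.update(store)                      # stale or brand new: fetch and save
    return doc


store = {}
first = get_by_domain(store, "http", "example.com")    # created and updated
second = get_by_domain(store, "http", "example.com")   # served as-is
second.fresh = False                                   # pretend seven days have passed
third = get_by_domain(store, "http", "example.com")    # updated again
```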
<p>For completeness, here is the map file required for this function.</p>
<pre class="brush: jscript; title: ; notranslate">
function (doc) {
    if(doc.type == &quot;robotstxt&quot;) {
        emit([doc.protocol, doc.domain], doc._id);
    }
}
</pre>
<p>In this post we&#8217;ve discussed how to represent a webpage in our database as well as keep track of what paths we are and aren&#8217;t allowed to access. We&#8217;ve also seen how to retrieve the <tt>robots.txt</tt> files and update them if they&#8217;re too old.</p>
<p>Now that we can test whether we&#8217;re allowed to access a URL, in the next post in this series I&#8217;ll show you how to begin crawling the web and populating our database.</p>
<p>Read <a href="http://wp.me/pkxET-77">part 3</a>.</p>
<hr />
<p>Photo of <a href="http://www.flickr.com/photos/johnnystiletto/5226474427/">Celery, Carrots &amp; Sweet Onion for Chicken Feet Stock</a> by <a href="http://www.flickr.com/photos/johnnystiletto/">I Believe I Can Fry</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewwilkinson.wordpress.com/2011/09/29/beating-google-with-couchdb-celery-and-whoosh-part-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0d5abb071bb1ab8518c3e9b0f4e718eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andrew</media:title>
		</media:content>

		<media:content url="http://farm6.static.flickr.com/5050/5226474427_90d7388bed_m.jpg" medium="image">
			<media:title type="html">Celery, Carrots &#38; Sweet Onion for Chicken Feet Stock by I Believe I Can Fry</media:title>
		</media:content>
	</item>
		<item>
		<title>Beating Google With CouchDB, Celery and Whoosh (Part 1)</title>
		<link>http://andrewwilkinson.wordpress.com/2011/09/27/beating-google-with-couchdb-celery-and-whoosh-part-1/</link>
		<comments>http://andrewwilkinson.wordpress.com/2011/09/27/beating-google-with-couchdb-celery-and-whoosh-part-1/#comments</comments>
		<pubDate>Tue, 27 Sep 2011 11:00:44 +0000</pubDate>
		<dc:creator>Andrew Wilkinson</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[celerycrawler]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[whoosh]]></category>

		<guid isPermaLink="false">http://andrewwilkinson.wordpress.com/?p=407</guid>
		<description><![CDATA[Ok, let&#8217;s get this out of the way right at the start &#8211; the title is a huge overstatement. This series of posts will show you how to create a search engine using standard Python tools like Django, Celery and Whoosh with CouchDB as the backend. Celery is a message passing library that makes it [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=andrewwilkinson.wordpress.com&#038;blog=4895947&#038;post=407&#038;subd=andrewwilkinson&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/judy-van-der-velden/5668310473/"><img src="http://farm6.static.flickr.com/5227/5668310473_1573cde550_m.jpg" alt="celery by Judy **" style="float:right;border:0;"></a>Ok, let&#8217;s get this out of the way right at the start &#8211; the title is a huge overstatement. This series of posts will show you how to create a search engine using standard Python tools like Django, Celery and Whoosh with CouchDB as the backend.</p>
<p><a href="http://celeryproject.org/">Celery</a> is a message passing library that makes it really easy to run background tasks and to spread them across a number of nodes. The most recent release added the NoSQL database <a href="http://couchdb.apache.org/">CouchDB</a> as a possible backend. I&#8217;m a huge fan of CouchDB, and the idea of running both my database and message passing backend on the same software really appealed to me. Unfortunately the documentation doesn&#8217;t make it clear what you need to do to get CouchDB working, and what the downsides are. I decided to write this series partly to explain how Celery and CouchDB work, but also to experiment with using them together.</p>
<p>In this series I&#8217;m going to talk about setting up Celery to work with Django, using CouchDB as a backend. I&#8217;m also going to show you how to use Celery to create a web-crawler. We&#8217;ll then index the crawled pages using <a href="https://bitbucket.org/mchaput/whoosh/wiki/Home">Whoosh</a> and use a <a href="http://en.wikipedia.org/wiki/PageRank">PageRank</a>-like algorithm to help rank the results. Finally, we&#8217;ll attach a simple Django frontend to the search engine for querying it.</p>
<p>Let&#8217;s consider what we need to implement for our webcrawler to work, and be a good citizen of the internet. First and foremost is that we must read and respect <a href="http://www.robotstxt.org/">robots.txt</a>. This is a file that specifies what areas of a site crawlers are banned from. We must also not hit a site too hard, or too often. It is very easy to write a crawler that repeatedly hits a site, and requests the same document over and over again. These are very big no-nos. Lastly we must make sure that we use a custom <a href="http://en.wikipedia.org/wiki/User_agent">User Agent</a> so our bot is identifiable.</p>
<p>We&#8217;ll divide the algorithm for our webcrawler into three parts. Firstly we&#8217;ll need a set of urls. The crawler picks a url, retrieves the page and then stores it in the database. The second stage takes the page content, parses it for links, and adds the links to the set of urls to be crawled. The final stage is to index the retrieved text. This is done by watching for pages that are retrieved by the first stage, and adding them to the full text index.</p>
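<p>To make the three stages concrete, here is a toy version that runs in a single process. It is only a sketch under heavy assumptions: it uses Python 3, <tt>FAKE_WEB</tt> is an invented dictionary standing in for real HTTP requests, the &#8220;index&#8221; is just a list of urls, and there is no Celery, no <tt>robots.txt</tt> check and no throttling.</p>

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page site standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B again</a>',
    "http://example.com/b": "<p>No links here.</p>",
}


class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)


def crawl(seed):
    database = {}            # stage 1 output: url -> page content
    index = []               # stage 3 output: urls handed to the indexer, in order
    queue = deque([seed])    # the set of urls still to be crawled
    while queue:
        url = queue.popleft()
        if url in database or url not in FAKE_WEB:
            continue
        database[url] = FAKE_WEB[url]          # stage 1: retrieve and store
        collector = LinkCollector()
        collector.feed(database[url])          # stage 2: parse for links
        queue.extend(urljoin(url, link) for link in collector.links)
        index.append(url)                      # stage 3: index the stored page
    return database, index


database, index = crawl("http://example.com/")
print(index)
```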
<p>Celery allows you to create &#8216;tasks&#8217;. These are units of work that are triggered by a piece of code and then executed, after a period of time, on any node in your system. For the crawler we&#8217;ll need two separate tasks. The first retrieves and stores a given url. When it completes it will trigger a second task, one that parses the links from the page. To begin the process we&#8217;ll need to use an external command to feed some initial urls into the system, but after that it will continuously crawl until it runs out of links. A real search engine would want to monitor its index for stale pages and reload those, but I won&#8217;t implement that in this example.</p>
<p>I&#8217;m going to assume that you have a decent level of knowledge about <a href="http://www.python.org">Python</a> and <a href="http://www.djangoproject.com/">Django</a>, so you might want to read some tutorials on those first. If you&#8217;re following along at home, create yourself a blank Django project with a single app inside. You&#8217;ll also need to install <tt>django-celery</tt>, the CouchDB Python library, and have a working install of CouchDB available.</p>
<p>Read <a href="http://wp.me/pkxET-6E">part 2</a>.</p>
<hr />
<p>Photo of <a href="http://www.flickr.com/photos/judy-van-der-velden/5668310473/">celery</a> by <a href="http://www.flickr.com/photos/judy-van-der-velden/">Judy **</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewwilkinson.wordpress.com/2011/09/27/beating-google-with-couchdb-celery-and-whoosh-part-1/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0d5abb071bb1ab8518c3e9b0f4e718eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andrew</media:title>
		</media:content>

		<media:content url="http://farm6.static.flickr.com/5227/5668310473_1573cde550_m.jpg" medium="image">
			<media:title type="html">celery by Judy **</media:title>
		</media:content>
	</item>
		<item>
		<title>Cleaning Your Django Project With PyLint And Buildbot</title>
		<link>http://andrewwilkinson.wordpress.com/2011/03/07/cleaning-your-django-project-with-pylint-and-buildbot/</link>
		<comments>http://andrewwilkinson.wordpress.com/2011/03/07/cleaning-your-django-project-with-pylint-and-buildbot/#comments</comments>
		<pubDate>Mon, 07 Mar 2011 13:39:23 +0000</pubDate>
		<dc:creator>Andrew Wilkinson</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[buildbot]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[pylint]]></category>

		<guid isPermaLink="false">http://andrewwilkinson.wordpress.com/?p=369</guid>
		<description><![CDATA[There are a number of tools for checking whether your Python code meets a coding standard. These include pep8.py, PyChecker and PyLint. Of these, PyLint is the most comprehensive and is the tool which I prefer to use as part of my buildbot checks that run on every commit. PyLint works by parsing the Python [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=andrewwilkinson.wordpress.com&#038;blog=4895947&#038;post=369&#038;subd=andrewwilkinson&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/inf3ktion/4477642894/"><img src="http://farm5.static.flickr.com/4048/4477642894_2cfbc8ea4f_m.jpg" alt="Cleaning by inf3ktion" style="float:right;border:0;"></a>There are a number of tools for checking whether your Python code meets a coding standard. These include <a href="http://pypi.python.org/pypi/pep8">pep8.py</a>, <a href="http://pychecker.sourceforge.net/">PyChecker</a> and <a href="http://www.logilab.org/857">PyLint</a>. Of these, PyLint is the most comprehensive and is the tool which I prefer to use as part of <a href="http://andrewwilkinson.wordpress.com/2010/06/30/continuous-integration-testing/">my buildbot checks</a> that run on every commit.</p>
<p>PyLint works by parsing the Python source code itself and checking for problems such as the use of undefined variables and missing doc strings, along with a large array of other checks. A downside of PyLint&#8217;s comprehensiveness is that it runs the risk of generating false positives. As it parses the source code itself it struggles with some of Python&#8217;s more dynamic features, in particular <a href="http://www.voidspace.org.uk/python/articles/metaclasses.shtml">metaclasses</a>, which, unfortunately, are a key part of Django. In this post I&#8217;ll go through the changes I make to the standard PyLint settings to make it more compatible with Django.</p>
<pre class="brush: plain; title: ; notranslate">
disable=W0403,W0232,E1101
</pre>
<p>This line completely disables a few of the checks. <tt>W0403</tt> stops relative imports from generating a warning; whether you want to disable this or not is really a matter of personal preference. Although I appreciate why there is a check for this, I think this is a bit too picky. <tt>W0232</tt> stops a warning appearing when a class has no <tt>__init__</tt> method. Django models will produce this warning, but because they&#8217;re built by a metaclass there is nothing wrong with them. Finally, <tt>E1101</tt> is generated if you access a member variable that doesn&#8217;t exist. Accessing members such as <tt>id</tt> or <tt>objects</tt> on a model will trigger this, so it&#8217;s simplest just to disable the check.</p>
<pre class="brush: plain; title: ; notranslate">
output-format=parseable
include-ids=yes
</pre>
<p>These make the output of PyLint easier for Buildbot to parse. If you&#8217;re not using it then you probably don&#8217;t need to include these lines.</p>
<pre class="brush: plain; title: ; notranslate">
good-names= ...,qs
</pre>
<p>Apart from a limited number of names PyLint tries to enforce a minimum size of three characters in a variable name. As <tt>qs</tt> is such a useful variable name for a QuerySet I force this to be allowed as a good name.</p>
<pre class="brush: plain; title: ; notranslate">
max-line-length=160
</pre>
<p>The last change I make is to allow much longer lines. By default PyLint only allows 80 character long lines, but how many people have screens that narrow anymore? Even the argument that it allows you to have two files side by side doesn&#8217;t hold water in this age where multiple monitors for developers are the norm.</p>
<p>PyLint uses the exit code to indicate what errors occurred during the run. This confuses Buildbot, which assumes that a non-zero return code means the program failed to run, even when using the <a href="http://buildbot.net/buildbot/docs/0.8.0/PyLint.html">PyLint buildstep</a>. To work around this I use a simple management command that duplicates the <tt>pylint</tt> program&#8217;s functionality but doesn&#8217;t let the return code propagate back to Buildbot.</p>
<pre class="brush: python; title: ; notranslate">
from django.core.management.base import BaseCommand

from pylint import lint

class Command(BaseCommand):
    def handle(self, *args, **options):
        # exit=False stops PyLint calling sys.exit(), so its status never reaches Buildbot
        lint.Run(list(args + (&quot;--rcfile=../pylint.cfg&quot;, )), exit=False)
</pre>
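<p>If you would rather inspect the exit code than throw it away, it helps to know that, in the PyLint releases I&#8217;m aware of, it is a set of bit flags, one per message category. The sketch below decodes such a status. The function name <tt>decode_exit_status</tt> is made up for this example, and it is worth checking the bit values against the documentation for the PyLint version you run.</p>

```python
# Meaning of each bit in the pylint exit status.
EXIT_BITS = {
    1: "fatal",
    2: "error",
    4: "warning",
    8: "refactor",
    16: "convention",
    32: "usage error",
}


def decode_exit_status(status):
    """Return the names of the message categories encoded in a pylint exit status."""
    return [name for bit, name in sorted(EXIT_BITS.items()) if status & bit]


print(decode_exit_status(20))
```

<p>A status of 20, for example, is 16 plus 4, meaning that both convention and warning messages were issued.</p>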
<hr />
<p>Photo of <a href="http://www.flickr.com/photos/inf3ktion/4477642894/">Cleaning</a> by <a href="http://www.flickr.com/photos/inf3ktion">inf3ktion</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewwilkinson.wordpress.com/2011/03/07/cleaning-your-django-project-with-pylint-and-buildbot/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0d5abb071bb1ab8518c3e9b0f4e718eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andrew</media:title>
		</media:content>

		<media:content url="http://farm5.static.flickr.com/4048/4477642894_2cfbc8ea4f_m.jpg" medium="image">
			<media:title type="html">Cleaning by inf3ktion</media:title>
		</media:content>
	</item>
	</channel>
</rss>