Rsync backups to Amazon S3

Having recently got married, I wanted to make sure all the photos taken at the event are safely stored for posterity. I thought I’d take the opportunity to make sure that the rest of my photos are safely backed up too, and that any new ones are backed up without me needing to do anything.

One of the simplest places to keep your backups is Amazon S3. There is essentially an unlimited amount of space available, and it’s pretty cheap. rsync is a great tool to use when backing up because it only copies what has changed, which keeps the amount of data transferred to a minimum. With S3 you pay not only for the data stored but also for the data transferred, so rsync is perfect. So, how do we use rsync to transfer data to S3?

I won’t go through setting up an Amazon S3 account or creating a bucket; the Amazon documentation does that just fine.

The first thing to do is download and install s3fs. This is a tool that uses FUSE to mount your S3 bucket as if it were an ordinary part of your filesystem. Once you’ve got it installed you need to configure it with your access key ID and secret access key. You’ll be given these when you set up your S3 account. Create a file .passwd-s3fs in your home directory containing the two keys, and chmod it so it has no group or other permissions.
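As a rough sketch, the credentials file could be created like this. The two key values are placeholders, and PASSWD_FILE is a variable introduced only for this example; the one-line ACCESS_KEY_ID:SECRET_ACCESS_KEY layout is the format s3fs documents for this file.

```shell
# Placeholder credentials: substitute the access key ID and secret access key
# from your own AWS account.
ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"

# PASSWD_FILE is a name made up for this example; it defaults to the
# .passwd-s3fs file in your home directory.
PASSWD_FILE="${PASSWD_FILE:-$HOME/.passwd-s3fs}"

# One line of the form ACCESS_KEY_ID:SECRET_ACCESS_KEY.
printf '%s:%s\n' "$ACCESS_KEY_ID" "$SECRET_ACCESS_KEY" > "$PASSWD_FILE"

# Read/write for the owner only: no group or other permissions.
chmod 600 "$PASSWD_FILE"
```

Running ls -l on the file afterwards should show -rw------- as its permissions.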

Mounting your S3 bucket is simple. Just run:

s3fs bucket_name /mount/point

Any file operations you conduct in /mount/point will now be mirrored to S3 automatically. Neat!

To copy the files across we need to run rsync.

rsync -av --delete /backup/directory /mount/point

This will copy all files from /backup/directory to /mount/point and so to S3. Because the source path has no trailing slash, the files end up in /mount/point/directory. The -a option means archive mode, which recurses into directories and preserves permissions, timestamps and symlinks. -v is verbose so you can see how far it gets, while --delete means that files will be deleted from the backup if they’ve been deleted from the directory you’re backing up.
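If you want to see what --delete does before pointing it at real data, here is a small sketch that runs the same rsync command against throwaway directories. The temporary paths and file names are made up for the example and stand in for /backup/directory and /mount/point; nothing here touches S3.

```shell
# Throwaway directories standing in for /backup/directory and /mount/point.
work=$(mktemp -d)
src="$work/backup/directory"
dest="$work/mount/point"
mkdir -p "$src" "$dest"

echo "wedding" > "$src/keep.jpg"
echo "blurry"  > "$src/old.jpg"

# First run: the source has no trailing slash, so rsync creates
# "$dest/directory" and copies both files into it.
rsync -a --delete "$src" "$dest"

# Delete a file locally, then run the same backup again.
rm "$src/old.jpg"
rsync -a --delete "$src" "$dest"

# Only keep.jpg is left in the backup copy.
ls "$dest/directory"
```

Adding -n (--dry-run) to a real backup command is another way to preview what would be copied or deleted without changing anything.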

On your initial backup you’ll likely be transferring multiple gigabytes of data, and that will saturate the upload on your Internet connection. This prevents you from doing pretty much anything else until it’s finished, so let’s look at limiting how fast the backup runs.

trickle is a very useful program that limits the bandwidth that a single program can consume. We don’t want to limit rsync, because it is only copying between local paths. It’s s3fs that does the uploading, so alter the mount command to be:

trickle -u 256 s3fs bucket_name /mount/point

This will only allow s3fs to consume a maximum of 256KB/s of upload bandwidth, allowing you to continue to browse Facebook while the backup is progressing. Simply change the upload number depending on how fast your Internet connection is, to get the right balance between a usable connection and the speed of the backup.
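To get a feel for that balance, a back-of-the-envelope estimate of how long the first backup will take at a given cap can help. The helper below is hypothetical, the 10 GB collection size is an assumed figure, and the sum ignores protocol overhead, so treat the result as a rough lower bound.

```shell
# estimate_upload_seconds SIZE_GIB RATE_KIB_PER_S
# Hypothetical helper: seconds needed to upload SIZE_GIB gigabytes at a
# sustained RATE_KIB_PER_S, ignoring any protocol overhead.
estimate_upload_seconds() {
    size_gib=$1
    rate_kib=$2
    echo $(( size_gib * 1024 * 1024 / rate_kib ))
}

# Example: an assumed 10 GB photo collection at the 256 KB/s cap.
seconds=$(estimate_upload_seconds 10 256)
echo "$(( seconds / 3600 ))h $(( seconds % 3600 / 60 ))m"   # prints: 11h 22m
```

Doubling the cap to 512 halves the time, at the cost of leaving less upload bandwidth for everything else.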

To automate the backup, just add the two commands to a script file and put it in your crontab like so:

@daily /home/username/bin/s3_backup
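A minimal sketch of what that script might contain, written out with a here-document so it can be created in one go. SCRIPT is a variable made up for this example, the bucket and paths are the placeholders used earlier, and the mountpoint check is my own addition (it assumes the Linux mountpoint utility is available) so that a daily run does not try to mount the bucket twice.

```shell
# SCRIPT is a name made up for this example; it defaults to a bin directory
# in your home, matching the crontab line.
SCRIPT="${SCRIPT:-$HOME/bin/s3_backup}"
mkdir -p "$(dirname "$SCRIPT")"

cat > "$SCRIPT" <<'EOF'
#!/bin/sh
# Placeholder values; replace them with your own bucket and paths.
BUCKET="bucket_name"
MOUNT_POINT="/mount/point"
SOURCE_DIR="/backup/directory"

# Mount the bucket (upload capped at 256 KB/s) unless it is already mounted.
if ! mountpoint -q "$MOUNT_POINT"; then
    trickle -u 256 s3fs "$BUCKET" "$MOUNT_POINT" || exit 1
fi

# Mirror the local directory to S3. -v is left out because cron runs unattended.
rsync -a --delete "$SOURCE_DIR" "$MOUNT_POINT"
EOF

# Make the script executable so cron can run it.
chmod +x "$SCRIPT"
```

Stopping when the mount fails matters here: otherwise rsync would quietly copy everything into the empty local mount point instead of S3.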

Naming Screen Sessions

I develop a number of Django-powered websites at work, and usually I want to leave them running when I’m not working on them so others can check out my progress and give me suggestions. The Django development server is incredibly useful when developing, but it’s not detached from the terminal so as soon as you log out the server gets switched off. One alternative is to run the website under Apache, as you would deploy it normally. This solves the problem of leaving the website running, but makes it much harder to develop with.

A third option is the GNU program Screen. When run without arguments, screen puts you into a new bash session. Pressing Ctrl+d drops you back out to where you were. The magic occurs when you press Ctrl+a d. This drops you back out, but the bash session is still running! By typing screen -r you’ll reattach to the session and can carry on working as before. You can leave it as long as you like between detaching and reattaching to a session, as long as the computer is still running.

It is possible to run multiple screen sessions at once, perhaps with a different Django development server running in each. Unfortunately screen will only reattach automatically when there is just one detached session. If you have more than one then you’ll be confronted by a cryptic series of numbers that uniquely identifies each session. To reattach to a specific session, type screen -r <pid>.

To make it easier to reattach to the session I’m working on, I give each session a name, so rather than a cryptic series of numbers I see a useful set of names. To do this, press Ctrl+a, then type :sessionname <name> and press Enter.
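A session can also be given its name when it is started, using screen’s -S option. The sketch below starts one detached; "blog" is a hypothetical session name, and sleep 300 is only a stand-in for a long-running command such as python manage.py runserver.

```shell
# Start a detached, named session (-d -m: start detached, -S: session name).
# "sleep 300" stands in for a long-running command such as
# "python manage.py runserver"; "blog" is a made-up session name.
screen -dmS blog sleep 300

# List sessions; the name appears after the numeric PID, as <pid>.blog.
screen -ls

# Reattach by name instead of by PID (interactive, so commented out here):
# screen -r blog
```

Once you are finished with it, screen -S blog -X quit shuts the session down from outside.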

There are plenty of other useful things that screen can do, but naming sessions is by far the feature I use most.