Random thoughts of a warped mind…

February 12, 2014

Sync S3 buckets in parallel mode via concurrent threads

Filed under: All,Amazon EC2,EC2,Git,Ruby,S3 — Srinivas @ 18:13

A week back I realized one of my core S3 buckets at work (which holds a bunch of app uploads that are always needed) was a us-west-2-only bucket and not US-Standard. (Don't like that. When S3 gives you 11 9s of durability either way, why not get a US-Standard bucket?) Considering that we had Varnish in multiple regions with this bucket as the backend, I wanted to do two things -

1. Migrate all data from this bucket to a US-Standard bucket

2. Migrate all data from this bucket to an EU/Ireland bucket as well (I have app servers etc. out there too which need the same data, and didn't want to come across the pond for every object we had to retrieve). Why? Reduced latency and reduced bandwidth costs (it costs nothing when an EC2 instance in the EU pulls an object from an EU bucket).

I tried the usual routes (standard s3sync, s3cmd in sync mode, etc.) but all of them seemed to sync one object at a time. I had s3cmd running for over 20 hours and it managed to sync only about 19,000 objects. That obviously wasn't going to work, and hence my own version of s3sync.


Objectives -

1. Run in parallel, with multiple threads each syncing a subset of my data.

2. Avoid having to fetch each S3 object to my instance and upload it to the new bucket. Instead, use S3's lesser-known copy_to option.

I wrote the initial code in Ruby and later moved to running it under JRuby to avoid Ruby MRI's GIL issue (the Global Interpreter Lock ensures only one thread executes Ruby code at a given time). This let me sync the whole us-west-2/Oregon bucket to a US-Standard one in under 4 hours and to an EU/Ireland bucket in under 6 hours. Not bad, huh?
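In plain Ruby terms, the fan-out boils down to splitting the key list into slices and giving each slice its own thread. This is a minimal sketch of that idea, not the script's actual code: `copy_one` here is a hypothetical stand-in for the per-object S3 copy call, and the thread count is arbitrary. Under MRI the GIL serializes Ruby code, but these workers mostly wait on S3 I/O; under JRuby they run truly in parallel.

```ruby
THREADS = 8

# Split the keys round-robin into nthreads slices and sync each slice
# in its own thread, joining them all at the end.
def sync_in_parallel(keys, nthreads = THREADS, &copy_one)
  slices = Array.new(nthreads) { [] }
  keys.each_with_index { |k, i| slices[i % nthreads] << k }
  slices.map { |slice|
    Thread.new { slice.each { |key| copy_one.call(key) } }
  }.each(&:join)
end

# Example run: "copy" each key by recording it in a thread-safe queue
# (real code would issue the S3 copy request here instead).
copied = Queue.new
sync_in_parallel(%w[a b c d e], 3) { |k| copied << k }
```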

The keys to this are:

1. Running multiple threads via JRuby (so I can use the Java concurrent/Callable libraries).

2. Using S3's "copy_to" functionality, which sends an API call to AWS to copy a bucket object to another bucket (or the same bucket) without having to download the whole object locally and write it back to the new bucket.
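As a rough sketch of point 2 (bucket names, the key, and the `server_side_copy` helper are my placeholders, not the original script's code), the v1 aws-sdk gem's `S3Object#copy_to` usage looks like this:

```ruby
begin
  require 'aws-sdk'   # the v1 gem, which defines the AWS::S3 class
rescue LoadError
  # gem not installed; the helper below is still defined for illustration
end

# Issue a single server-side COPY request: AWS moves the bytes from bucket
# to bucket internally, so the object never passes through this machine.
def server_side_copy(s3, src_bucket, dst_bucket, key)
  s3.buckets[src_bucket].objects[key].copy_to(key, :bucket_name => dst_bucket)
end

# Only attempt a live copy when the SDK and credentials are actually present.
if defined?(AWS) && ENV['AWS_ACCESS_KEY_ID']
  s3 = AWS::S3.new
  server_side_copy(s3, 'my-oregon-bucket', 'my-us-standard-bucket',
                   'uploads/file.bin')
end
```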

A couple of caveats: in my case object sizes were capped at 400MB, so I could avoid multi-part uploads (which tend to mess up the ETags that AWS generates for each object). Doing a sync based entirely on ETags will break for items uploaded as multi-parts, since their ETag is not an md5sum.
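The tell for a multipart upload is the "-&lt;part count&gt;" suffix on the ETag; for single-part uploads the ETag is just the MD5 of the object, so a local copy can be compared against the bucket without downloading anything. A sketch of that check (helper names are mine):

```ruby
require 'digest/md5'

# Multipart ETags look like "<md5-of-part-md5s>-<part count>",
# so a trailing "-N" means an MD5 comparison is meaningless.
def multipart_etag?(etag)
  etag.delete('"').include?('-')
end

# True only when the ETag is a plain MD5 and it matches the local data.
def etag_matches?(etag, data)
  return false if multipart_etag?(etag)
  etag.delete('"') == Digest::MD5.hexdigest(data)
end

data = 'hello world'
single_part_etag = Digest::MD5.hexdigest(data)  # what S3 reports for one part
```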

See the aws-sdk documentation for details on the 'copy_to' method.

The code could do with a lot of cleanup, but it has the option to sync only "new" objects from the source bucket to the target, or else look at all objects in the source bucket and also sync those that changed to the destination bucket. It works and works fast, which is all I care about for now.
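Assuming both buckets have already been listed into {key => etag} hashes, the two modes boil down to something like this (a sketch of the idea, not the script's actual code):

```ruby
# Pick which source keys need syncing.
#   :new     -> only objects missing from the destination
#   :changed -> missing objects plus those whose ETag differs
def keys_to_sync(src, dst, mode = :changed)
  case mode
  when :new
    src.keys - dst.keys
  when :changed
    src.reject { |k, etag| dst[k] == etag }.keys
  end
end

src = { 'a' => '111', 'b' => '222', 'c' => '333' }
dst = { 'a' => '111', 'b' => 'OLD' }  # 'b' is stale, 'c' is missing
```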

And the result is S3sync (https://github.com/srinivasmohan/s3sync). Use it and let me know if it helps. Even better, fork it and make it better!
