Sunday, June 24, 2012

Data-center Rolling Upgrades coordinated by ZooKeeper

Still playing around trying to improve the daily deploy work in the data-centers.

The idea is to replace a sequential/semi-manual process with something more automatic that doesn't need human intervention unless a failure happens.

Services and Deploy rules:
  • Services have dependencies (Service B depends on Service A), so deploy order matters!
  • You can't bring down all the machines at the same time!
  • One or more machines can be unreachable during the deploy (network problems, hw failures, ...).
  • Each machine needs to be self-sufficient!
Must Have (Monitoring)
  • Current service state of each machine (online/offline, service v1, v2)
  • Current "deploy" state (Ready to roll?)

The idea is quite simple: use ZooKeeper to keep track of each Service (A, B, ..., K) with the list of machines available (ephemeral znodes) and to keep track of the deploy state ("staging").
  • /dc/current: Contains a list of services with the list of online machines (and their service version).
  • /dc/staging: Contains a list of services with the list of machines ready to roll.
  • /dc/deploy: Deploy order queue; each node represents a service to upgrade.
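The layout above can be sketched as plain paths and payloads. This is only an illustration of the idea (the znode names, metadata fields, and the `online_machines` helper are mine, not an exact schema):

```python
import json

# Hypothetical znode tree for a rolling upgrade of "service-a" to v2.
# Ephemeral children under /dc/current/<service> are created by each
# machine at startup; /dc/staging is populated by the deploy tool.
tree = {
    "/dc/current/service-a/machine-01": json.dumps({"version": "v1"}),
    "/dc/current/service-a/machine-02": json.dumps({"version": "v1"}),
    "/dc/staging/service-a": json.dumps({
        "version": "v2",
        "download": "blobs-v2.torrent",   # illustrative download path
    }),
    "/dc/deploy/0000000000": "service-a",  # sequential queue node
}

def online_machines(tree, service):
    """Machines currently registered under /dc/current/<service>."""
    prefix = "/dc/current/%s/" % service
    return sorted(p[len(prefix):] for p in tree if p.startswith(prefix))

print(online_machines(tree, "service-a"))
```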
When you're ready to deploy something new you can create the new znodes:
  • Add services to "staging" with the useful metadata (version, download path, ...)
  • Define a deploy "order" queue
Each service is notified about the new staging version and starts downloading (see the "Data-center deploy using torrent and mlock()" post). Once the download is completed, the service registers itself in the "staging" queue.

Now the tricky part: when can I start switching to the new version? The idea is to specify a quorum for each service. The first machine in the "staging" queue for the first service in the "deploy" queue checks the quorum and, when it's time, shuts itself down and restarts with the new service. Once done, it adds itself to the "current" list and removes itself from the staging queue.
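The "when can I switch?" decision can be sketched as a pure function. The quorum semantics here are an assumption (keep at least `quorum` machines of the service online), and the function name is mine:

```python
def may_upgrade(machine, staging_queue, deploy_queue, online, quorum):
    """Return True if `machine` is allowed to restart on the new version.

    staging_queue: machines ready to roll for the head service of deploy_queue
    online:        machines currently serving that service
    quorum:        minimum number of machines that must stay online
    """
    if not deploy_queue or not staging_queue:
        return False
    # Only the head of the staging queue may act, one machine at a time.
    if staging_queue[0] != machine:
        return False
    # Taking this machine down must leave at least `quorum` machines serving.
    return len(online) - 1 >= quorum

# machine-01 is head of the staging queue and three machines are online,
# so it may go down while still leaving a quorum of two.
print(may_upgrade("machine-01", ["machine-01", "machine-02"],
                  ["service-a"], ["machine-01", "machine-02", "machine-03"], 2))
```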

And one by one each machine upgrades itself, until the deploy queue is empty. If a machine is down during the deploy, on restart the "current" node is checked to find which version is the most popular, and that version of the service will be started.
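The recovery rule for a machine that missed the deploy is a simple majority count over the "current" entries. A minimal sketch (the function name and data shape are mine):

```python
from collections import Counter

def most_popular_version(current):
    """Pick the version run by the majority of online machines.

    current: mapping of machine name -> version string,
             as read from the /dc/current znodes.
    """
    versions = Counter(current.values())
    # most_common(1) returns [(version, count)] for the top entry
    return versions.most_common(1)[0][0]

current = {"machine-01": "v2", "machine-02": "v2", "machine-03": "v1"}
print(most_popular_version(current))  # → 'v2'
```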

Saturday, June 2, 2012

Data-center deploy using torrent and mlock()

Every morning you come into the office hoping that the nightly job that produces your blobs has finished... and if everything is fine, you spend the rest of the day hoping that none of the machines fails during the transfer...
If you have a service that consumes static data and you have more than one datacenter, you probably face the problem of distributing data to all your service machines every day.

Remember: 60MiB/sec * 1hour = ~210GiB
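That rule of thumb is just bandwidth times time (with 1 GiB = 1024 MiB):

```python
mib_per_sec = 60
seconds = 60 * 60                      # one hour
gib = mib_per_sec * seconds / 1024.0   # 216000 MiB -> GiB
print(round(gib))  # → 211, i.e. roughly 210 GiB per hour per machine
```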

So what are the possible solutions to transfer these blobs?
The first solution is copying all the data to one machine in each datacenter; then each machine with the data is responsible for copying everything to all the other machines inside the datacenter.
Note: prefer rsync over scp, since if you lose the connection with scp you need to retransfer everything from byte zero.

But what happens if a machine is down?
One solution is making all the machines part of the distribution, removing identities. Every machine is equal; every machine needs to fetch these blobs. So, instead of using rsync from the "build" machine to the dist-host and from the dist-host to the service machines, the "build" machine sends a notification ("I've got new data") and each machine starts fetching the data in a collaborative way (using BitTorrent).

Each machine (build-machine/dist-hosts/services) needs to run a torrent client; you can implement your own torrent client in a few lines of Python using libtorrent. The idea is to fetch the latest blobs.torrent from a feed hosted on a build machine and start downloading. The build machine will be the initial seeder, but then every machine becomes part of the data distribution. By writing your own tracker you can also tune your peer selection, preferring machines inside your datacenter or inside your rack to avoid cross-site latency.
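A minimal fetch-and-seed loop with the libtorrent Python bindings might look like the sketch below. The paths are illustrative, and the real client also needs the feed-polling logic that discovers the latest blobs.torrent:

```python
import time
import libtorrent as lt

# Join the swarm for the latest blobs.torrent and keep seeding afterwards,
# so this machine becomes part of the distribution.
ses = lt.session()
info = lt.torrent_info("blobs.torrent")   # fetched from the build-machine feed
handle = ses.add_torrent({"ti": info, "save_path": "/data/blobs"})

while not handle.status().is_seeding:
    s = handle.status()
    print("%.1f%% done, %d peers" % (s.progress * 100, s.num_peers))
    time.sleep(5)
print("download complete, now seeding")
```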

Another important thing to remember, if your service relies on the buffer cache to keep data in memory, is to tell the system to avoid evicting your pages; otherwise you'll probably see your service slow down once you start copying data to that machine. So make sure to mlock() your heavily used pages, or if your blobs fit in memory, use vmtouch to do the trick (vmtouch -l -d -m 64G blob). Remember to add a memlock entry for your user in /etc/security/limits.d/, otherwise mlock() will fail.
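The memlock limit can be raised with a drop-in file; something like this (the filename and user name are placeholders):

```
# /etc/security/limits.d/90-memlock.conf  (hypothetical filename)
youruser  soft  memlock  unlimited
youruser  hard  memlock  unlimited
```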

You can find the source code of a simple command line bit-torrent client and a tracker at