London DevOps Meetup – 11th June 2015
I was at the London DevOps Meetup last night, graciously hosted by Facebook at their London HQ. There were three talks – one on how Facebook keeps its site running, one on Facebook's data centres, and one from Google on how they schedule containers across their server fleet.
First up, Mark Drayton talked about how he keeps Facebook running. He's a member of the 15-strong Web Foundation team, which is responsible for keeping Facebook up and running, for incident reporting, and for producing best practices. They've been using Cubism.js to show as much data in as compact a space as possible. They have only three alarms – one for HTTP 500 errors, one for latency, and one for when the amount of data egress plummets (a sign that people aren't receiving any data).
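The three-alarm idea can be sketched roughly like this – a minimal illustration only, with invented threshold values (the real alert conditions and numbers weren't given in the talk):

```python
# Hypothetical sketch of the three alert conditions described above.
# All thresholds are made up for illustration.

def check_alarms(http_500_rate, p99_latency_ms, egress_bytes_per_s, baseline_egress):
    """Return the list of alarms that would fire for the given measurements."""
    alarms = []
    if http_500_rate > 0.01:                        # more than 1% of responses are 500s
        alarms.append("error-rate")
    if p99_latency_ms > 500:                        # tail latency has blown out
        alarms.append("latency")
    if egress_bytes_per_s < 0.5 * baseline_egress:  # egress has plummeted vs baseline
        alarms.append("egress-drop")
    return alarms

print(check_alarms(0.02, 100, 100, 100))  # -> ['error-rate']
```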
Second, Niall McEntegart spoke about the Open Compute Project (OCP), and how it allows Facebook to run at a ratio of one technician per 25,000 servers. Apparently they even ran at one technician per 50,000 servers at some point last year, and they have a first-time-fix rate of 100%. As recently as 2011 they were still using colocation facilities, but now they have four data centres of their own, with more due to be announced. Using OCP has also improved their efficiency, giving a PUE of 1.07 against the industry standard of 1.9 (1.0 being the ideal).
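For anyone unfamiliar with the metric: PUE (Power Usage Effectiveness) is simply total facility energy divided by the energy delivered to the IT equipment, so 1.07 means only 7% overhead goes on cooling, power distribution and so on. The figures below are illustrative, not Facebook's actual loads:

```python
# PUE = total facility energy / IT equipment energy.
# A PUE of 1.0 would mean every watt entering the building reaches the servers.

def pue(total_facility_kw, it_equipment_kw):
    return total_facility_kw / it_equipment_kw

# A facility drawing 1070 kW to power 1000 kW of IT load:
print(pue(1070, 1000))  # -> 1.07, i.e. 7% overhead
# The industry-standard facility mentioned above wastes far more:
print(pue(1900, 1000))  # -> 1.9, i.e. 90% overhead
```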
Finally, Mandy Waite talked about scheduling containers inside Google, and gave a brief overview of Kubernetes. Google itself uses Borg, which is more feature-complete, but Mandy said they expect Kubernetes to catch up quickly. I think the key takeaway is that Kubernetes and containers abstract the servers away from developers – no more ssh/sftp'ing into servers to copy and run code. Instead, you upload the binaries and a manifest, and Borg (or Kubernetes) schedules your 10,000 copies onto the available nodes within minutes.
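To give a flavour of what "upload a manifest" means, here's a rough sketch of a Kubernetes replication controller manifest – the names, image and port are all invented, and this isn't anything Mandy actually showed:

```yaml
# Hypothetical manifest: ask Kubernetes for 10,000 replicas of a container
# and let the scheduler place them across the available nodes.
apiVersion: v1
kind: ReplicationController
metadata:
  name: myapp
spec:
  replicas: 10000
  selector:
    app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.0   # your uploaded binary, as an image
        ports:
        - containerPort: 8080
```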
An interesting contrast – Mark said that Facebook performs Root Cause Analysis on all failures, whereas Mandy implied that the scheduler simply starts a new instance if one dies. Maybe Google only do an RCA if the host server itself dies?