Essay


I recently attended an event where one of the speakers was the CEO of a company built on top of Amazon cloud services, the most critical of these being the Simple Storage Service known as Amazon S3.

The S3 service runs “out there” (in the cloud) and provides a scalable repository for applications to store and manage data files. The service can support files of any size, as well as any quantity. So you can put as much stuff up there as you want – and since it is a pay-as-you-go service, you pay for what you use. The S3 service is very popular. An example of a well-known customer, according to Wikipedia, is SmugMug:

Photo hosting service SmugMug has used S3 since April 2006. They experienced a number of initial outages and slowdowns, but after one year they described it as being “considerably more reliable than our own internal storage” and claimed to have saved almost $1 million in storage costs.

Good stuff.

Of course, Amazon isn’t the only cloud vendor with such an offering. Google offers Google Storage, and Microsoft offers Windows Azure Blob Storage; both offer features and capabilities very similar to those of S3. While Amazon was the first to market, all three services are now mature, and all three companies are experts at building internet-scale systems and high-volume data storage platforms.

As I mentioned above, S3 came up during a talk I attended. The speaker – CTO of a company built entirely on Amazon services – twice touted S3’s incredibly strong Service Level Agreement (SLA). He said this was both a competitive differentiator for his company, and also a competitive differentiator for Amazon versus other cloud vendors.

Pause and think for a moment – any idea? – What is the SLA for S3? How about Google Storage? How about Windows Azure Blob Storage?

Before I give away the answer, let me remind you that a Service Level Agreement (SLA) is a written policy offered by the service provider (Amazon, Google, and Microsoft in this case) that describes the level of service being offered, how it is measured, and consequences if it is not met. Usually, the “level of service” part relates to uptime and is measured in “nines” as in 99.9% (“three nines”) and so forth. More nines is better, in general – and wikipedia offers a handy chart translating the number of nines into aggregate downtime/unavailability. (More generally, an SLA also deals with other factors – like refunds to customers if expectations are not met, what speed to expect, limitations, and more. I will focus only on the “nines” here.)

So… back to the question… For S3 and equivalent services from other vendors, how many nines are in the Amazon, Google, and Microsoft SLAs? The speaker at the talk said that S3 had an uptime SLA with 11 9s. Let me say that again – eleven nines – or 99.999999999% uptime. half-an-eye-blinkIf you attempt to look this up in the chart mentioned above, you will find this number is literally “off the chart” – the chart doesn’t go past six nines! But my back-of-the-envelope calculation says it amounts to – on average – less than 32 milliseconds of downtime per year. This is about half what “a blink of your eye” would take – yes, a mere half of an eye-blink. (Which ends with your eyes closed. :-) )

This is an impressive number! If only it was true. It turns out the real SLA for Amazon S3 has exactly as many nines as the SLA for Windows Azure Blob Storage and the SLA for Google Storage: they are all 99.9%.

Storage SLAs for Amazon, Google, and Microsoft all have exactly the same number of nines: they are all 99.9%. That’s three nines.

I am not picking on the CTO I heard gushing about the (non-existant) eleven-nines SLA. (In fact, his or her identity is irrelevent to the overall discussion here.) The more interesting part to me is the impressive reality distortion field around Amazon and its platform’s capabilities. The CTO I heard speak got it wrong, but this is not the first time it was misinterpreted as an SLA, and it won’t be the last.

I tracked down the origin of the eleven nines. Amazon CTO Werners Vogels mentions in a blog post that the S3 service is “design[ed]” for “99.999999999% durability” – choosing his words carefully. Consistent with Vogels’ language is the following Amazon FAQ on the same topic:

Q: How durable is Amazon S3? Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities.

First of all, these mentions are a comment on a blog and an item in an FAQ page; neither is from a company SLA. And second, they both speak to durability of objects – not uptime or availability. And third, also critically, they say “designed” for all those nines – but guarantee nothing of the sort. Even still, it is a bold statement. And good marketing.

It is nice that Amazon can have so much confidence in their S3 design. I did not find a comparable statement about confidence in the design of their compute infrastructure… Reality is that [cloud] services are about more than design and architecture – also about implementation, operations, management, and more. To have any hope, architecture and design need to be solid, of course, but alone they cannot prevent a general service outage which could take your site down with it (and maybe even still lose data). Some others on the interwebs are skeptical as I am, not just of Amazon, but anyone claiming too many nines.

How about the actual 99.9% “three-nines” SLA? Be careful in your expectations. As a wise man once told me, there’s a reason they are called Service Level Agreements, rather than Service Level Guarantees. There are no guarantees here.

This isn’t to pick on Amazon – other vendors have had – and will have – interruptions in service. For most companies, the cloud will still be the most cost-effective and reliable way to host your applications; few companies can compete with the big platform cloud vendors for expertise, focus, reliability, security, economies-of-scale, and efficiency. It is only a matter of time before you are there. Today, your competitors (known and unknown) are moving there already. As a wise man once told me (citing Crossing the Chasm), the innovators and early adoptors are those companies willing to trade off risk for competitive advantage. You saw it here first: this Internet thing is going to stick around for a while. Yes, and cloud services will just make too much sense to ignore. You will be on the cloud; it is only a matter of where you’ll be on the curve.

Back to all those nines… Of course, Amazon has done nothing wrong here. I see nothing inaccurate or deceptive in their documentation. But those of us in the community need to pay closer attention to what is really being described.  So here’s a small favor I ask of this technology community I am part of: Let’s please do our homework so that when we discuss and compare the cloud platforms – on blogs, when giving talks, or chatting 1:1 - we can at least keep the discussions based on facts.

I saw on twitter this morning a long time ago [it was a long time ago when I wrote this post but didn't publish it] (from Jeff Atwood, of Coding Horror blog and Stack Overflow fame) the following elegantly and concisely stated counter-example that would – if true – disprove perhaps the most famous of mathematical theorems, Fermat’s Last Theorem (FLT):

1782^12 + 1841^12 = 1922^12

Wow! A counter-example for FLT. A theorem I’ve known about since I was a kid. One counter-example is all it takes to disprove the whole deal.

Fermat’s theorem states that the equation an + bn = cn has no solutions for integer n > 2, and integers a, b, and c not equal to zero. For n = 2 we have many solutions (Pythagorean triples), but none for n > 2. Nor should we, according to English mathematician Andrew Wiles, who proved FLT in 1995.

Until now. Or do we? The equation Jeff posted is a little awkward to validate since most calculators cannot handle numbers this size at full precision. They appear equal with a normal calculator – due to precision limits (round-off errors). Same problem with Excel.

So, since I’ve recently started playing with F#, I put together a trivial F# program (included below) to show the math at full precision, with the following results:

1782^12 + 1841^12 = 2541210258614589176288669958142428526657
and
1922^12 = 2541210259314801410819278649643651567616
which differ by
700212234530608691501223040959

So Fermat is safe. Saved by F#. :-) But don’t feel bad if you fell for it – just be glad you knew what it meant. Bonus if  you noticed it on The Simpsons or Futurama.

Just watched the The Fountainhead movie from 1946 (yes, from netflix).

Howard Rourke hacking Open Source code

Howard Rourke hacking Open Source

Here is the plot summary, brought up-to-date:

  • Open Source is represented by the protagonist, a brilliant architect named Howard Rourke. Rourke is idealistic, does his own thing, is uncompromising, and is not driven by money or recognition – and certainly not by Big Business.
  • Big Business is  represented by newspaper magnate Gail Wynand. Wynand wields substantial influence and is in perpetual pursuit of any means to incite the populace – an energized populace buys more product.
  • Consultants and Certified Vendor X Developers and Vendor Partners are represented by  architect Peter Keating. Keating goes with the flow, producing whatever the powers that be say is desirable. At one point, he mentions to Ms. Francon he’s polling folks on what they think of Rourke’s latest building to which she responds (with some disdain) “why, so you’ll know what you think of it?”

Lessons:

  • Talent != influence. Keating’s influence is limited to those who recognize his greatness. Most only recognize as great what they are told to recognize as great.
  • Passion can be directed constructively (Rourke pours his love into his life’s work) or destructively (Wynand devotes his career to controlling the masses through his newspaper).

The movie is based a book of the same title. The author, Ayn Rand, became well known for her Objectivism philosophy of life, exemplified in the movie by Gary Cooper who played the lead character, Howard Roark. [I wonder what Richard Stallman thinks of the book?]

I wonder how many professional software developers identify more with Howard Rourke or Peter Keating? And which is more desirable?

Any my clean analogies fall apart when one considers the combinations of Big  Business and Open Source. Microsoft just announced CodePlex.org and the CodePlex Foundation “to enable the exchange of code and understanding among software companies and open source communities.”

A dirty little secret of Eclipse, Linux, Apache and other high-profile projects is that they also have professional, full-time staff – sponsored by Big Business (like IBM) – since the success of these endeavors is strategic for their business.

Maybe Open Source isn’t as pure as the romantic notion of developers from around the world contributing since it was a nice thing to do. The world-wide altruistic contributions may still be there in some cases, just supplemented by Big Business. Which is okay with me, though might not be with Howard Rourke.

(more…)

Follow

Get every new post delivered to your Inbox.

Join 431 other followers