Today I spoke at VirtG Boston’s annual Deep Dive Day. The title of my talk, Meet Windows Azure, Your Next Data Center, is probably descriptive enough to get the gist of it.
My slide deck follows.
Examine your Windows Azure MANAGEMENT CERTIFICATES in the Windows Azure Portal (under “SETTINGS” in the left nav, then “MANAGEMENT CERTIFICATES” in the top nav). These are the certificates that control which people or which machines can programmatically manipulate your Windows Azure resources through the Service Management API.
Every time you initiate a Publish Profile file download (whether through the portal, with PowerShell, or through the CLI), a new certificate is generated and added to your list of management certificates. You cannot control these names – they are generated.
Upon examination, you may find that some certificates – like #1 shown below – have generated names. Also look at the several certificates immediately below #1 – they have similar, likewise generated, names. These are hard to distinguish from each other.
But this is okay some of the time – it is convenient to let tools create these certificates for you since it saves time. It may be perfectly adequate on low security accounts – perhaps a developer’s individual dev-test account from MSDN, or an account only used to give demos with. But for a team account running production, you probably don’t want it to have 17 untraceable, indistinguishable certificates hanging off it.
Now look at the names for #2 and #3 shown above. They are custom names.
While we can debate whether the custom names shown above are truly meaningful (this is a demo account), you can probably appreciate that seeing a certificate name like “BUILD SERVER” or “Person/Machine” (e.g., “Maura/DRAGNIPUR”) or “Foobar Contractor Agency” might be more useful than “Azdem123EIEIO” to a human.
The Windows Azure Management Portal has some heuristics for deciding what to display for a certificate’s name, but the first one it considers is the Common Name, and will display its value if present. So the short answer: take control of the Common Name.
Here we show creating a Service Management certificate manually in two steps – first the PEM (for use locally) and second deriving a CER (for uploading to the portal).
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem -subj "/CN=This Name Shows in the Portal"
openssl x509 -inform pem -in mycert.pem -outform der -out mycert.cer
Note the use of -subj "/CN=This Name Shows in the Portal" when generating the PEM in the first command. The specified text will appear as the description for this certificate within the Windows Azure Portal. OpenSSL is available on Linux and Mac systems by default. For Windows, you can install it directly, or – if you happen to use GitHub for Windows – it gets installed along with it.
For a pure Windows solution, use makecert to create a Management Certificate for Windows Azure.
Once you assume responsibility for naming your own certificates, you simultaneously take on generating them, deploying the certificates containing the private keys to the machines from which your Windows Azure resources will be managed via the Service Management API, and uploading the CER public keys to the portal. To make some parts of this easier – especially if you are distributing to a team – consider building your own publish settings file. Also, realize that the same certificate can be used by more than one client, and can also be applied to more than one subscription on Windows Azure; it’s a many-to-many relationship.
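The classic publish settings file is just a small XML document: a PublishProfile element carrying the base64-encoded management certificate (with private key) and one Subscription element per subscription it unlocks. Here is a minimal Python sketch of assembling one – the subscription values and certificate bytes are placeholders, and you should verify the attribute names against a file downloaded from the portal before relying on them:

```python
import base64
import xml.etree.ElementTree as ET

def build_publish_settings(subscription_id, subscription_name, pfx_bytes):
    """Assemble a classic .publishsettings document (schema as observed in
    2013-era Service Management API downloads; verify before relying on it)."""
    root = ET.Element("PublishData")
    profile = ET.SubElement(root, "PublishProfile", {
        "PublishMethod": "AzureServiceManagementAPI",
        "Url": "https://management.core.windows.net/",
        # The management certificate travels as base64-encoded PFX bytes.
        "ManagementCertificate": base64.b64encode(pfx_bytes).decode("ascii"),
    })
    ET.SubElement(profile, "Subscription", {
        "Id": subscription_id,
        "Name": subscription_name,
    })
    return ET.tostring(root, encoding="unicode")

# Hypothetical values for illustration only.
doc = build_publish_settings(
    "00000000-0000-0000-0000-000000000000", "Team Production", b"fake-pfx-bytes")
```

Distributing a file like this to a team means everyone shares one certificate; whether that is acceptable depends on how traceable you need individual access to be.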
This is one of a series of webinars hosted by XBOSoft (@xbosoft) as part of their 2014 Webinar series from the XBOSoft Software Quality Knowledge Center. They are a global company (San Francisco, Amsterdam, Oslo & Beijing) focused on Software Quality Improvement. Check them out here: www.xbosoft.com.
The free Webinar was held on March 6, 2014.
At a high level, these are the 7 things discussed:
The PowerPoint deck I used to walk through the topics is here:
Today I gave a talk at Better Software Conference East 2013 about how the cloud impacts your development team. The talk was called “Making the Cloud Less Cloudy: A Perspective for Software Development Teams” and was heavy with short demos on making your dev team more productive, then a slightly longer look into how you can evolve your application to fully go cloud-native with some interesting patterns. All the demos showed off the Windows Azure Cloud Platform, though, as I explained, most of the techniques are general and can be used with other platforms such as Amazon Web Services (AWS).
Tweet stream: twitter.com/#bsceadc
The deck doesn’t mention this explicitly, but all of my demos (and my slide presentation) were done from the cloud! Yes, I was in the room, but my laptop was remotely connected to a Windows Azure Virtual Machine running in Microsoft’s East US Windows Azure data center. It worked flawlessly. :-)
Here’s the PowerPoint Deck:
When deploying an application or service to Windows Azure, a public IP address is assigned, making it easy to host a web server, API, or other services. Here are some of the more frequently asked questions about these IP addresses.
Short answer: Yes. Longer answer: For Cloud Services and Virtual Machines (but not Azure Web Sites) the IP address – once assigned – is stable, provided you do not remove the deployment. If you delete the deployment, your IP address goes back into the pool. For most production cloud applications it would be very unusual to ever delete the deployment, so this is reasonable. Windows Azure supports in-place updates as well as the VIP Swap approach for Cloud Services, both of which preserve the IP address. Windows Azure Web Sites also has an IP Address-preserving swap feature.
Short answer: Yes. The formal name for a so-called “naked” domain is a zone apex. But regardless of what we call it, it is simply a domain without any subdomain prefix. The address “devpartners.com” is a “naked” or “apex” domain, whereas “www.devpartners.com” is not. And it is not just about counting periods in the domain: “amazon.co.jp” is also an apex domain. A DNS Address Record – or “A Record” for short – is used to configure an apex domain, and an A Record must be mapped to an IP Address. As noted in the question immediately above, you can have a stable IP address in Windows Azure, so therefore a stable A Record is possible, so therefore you can definitely map an apex record to your Windows Azure application or service. You can also use a DNS Canonical Name Record – or “CNAME” for short – to refer to a subdomain in your service. This is easy since, in addition to the stable IP address support mentioned above, Windows Azure provides a DNS name you can assign CNAMEs against. In Cloud Services (which includes Virtual Machines) this is of the form mycloudservice.cloudapp.net. [As opposed to Azure Web Sites which are of the form mywebsite.azurewebsites.net.]
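The “not just counting periods” point can be made concrete with a small sketch. The code below decides whether a domain is an apex domain using a tiny, illustrative subset of public suffixes; a real implementation would consult the full Public Suffix List rather than this hard-coded set:

```python
# A tiny, illustrative subset of public suffixes; a real implementation
# would consult the full Public Suffix List.
PUBLIC_SUFFIXES = {"com", "net", "org", "co.jp", "co.uk"}

def is_apex(domain):
    """Return True if `domain` is a zone apex (no subdomain prefix)."""
    labels = domain.lower().strip(".").split(".")
    # Try the longest candidate public suffix first.
    for n in (2, 1):
        if len(labels) > n and ".".join(labels[-n:]) in PUBLIC_SUFFIXES:
            # An apex domain has exactly one label left of the public suffix.
            return len(labels) == n + 1
    return False
```

With this sketch, "devpartners.com" and "amazon.co.jp" come out as apex domains, while "www.devpartners.com" does not – matching the examples above.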
Short answer: Yes. Longer answer: Microsoft publishes the IP address ranges used, organized by data center, so this list can be consulted to review the possible ranges. Specifically, the IP Address Ranges are documented here (http://msdn.microsoft.com/en-us/library/windowsazure/dn175718.aspx) and are expressed in Classless Inter-Domain Routing (CIDR) format. Be aware that as capacity increases and new data centers come online, these ranges will evolve (I assume mostly the number of addresses will grow).
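Python’s standard ipaddress module makes it straightforward to check whether a given address falls inside a set of published CIDR blocks. The ranges below are placeholders for illustration – consult the MSDN document above for the authoritative list:

```python
import ipaddress

# Hypothetical excerpt of published ranges for one data center; the
# authoritative list is the MSDN document linked above.
EAST_US_RANGES = [ipaddress.ip_network(c)
                  for c in ("23.96.0.0/17", "137.116.112.0/20")]

def in_ranges(ip, ranges):
    """True if `ip` falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)
```

A firewall administrator could feed the full published list into a check like this to decide whether traffic originates from a given Windows Azure data center.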
Today I spoke at the NYC Code Camp. My talk was Telemetry: Beyond Logging to Insight and focused on Event Tracing for Windows (ETW), ETW support in .NET 4.5, some .NET 4.5.1 additions, Semantic Logging Application Block (SLAB), Semantic Logging, and a number of other tools and ideas for using logging and other means to generate insight and answer questions. In order to allow this, “logging” needs to be structured, which ETW facilitates. In order for the structured data to make sense, developers need to be disciplined, which the Semantic Logging mindset supports.
The talk abstract and the slide deck used are both included below.
What is my application doing? This question can be difficult to answer in distributed environments such as the cloud. Parsing logs doesn’t cut it anymore. We need insight. In this talk we look at current logging approaches, contrast it with Telemetry, mix in the Semantic Logging mindset, and then use some new-fangled tools and techniques (enabled by .NET 4.5) alongside some old-school tools and techniques to see how to apply this goodness in our code. Event Tracing for Windows (ETW), the Semantic Logging Application Block, and several other tools and technologies will play a role.
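ETW and the Semantic Logging Application Block are .NET technologies, but the underlying mindset – emit strongly named events with structured payloads rather than free-form strings – applies in any language. As a rough illustration only (none of these names come from SLAB), here is a minimal Python sketch:

```python
import json
import logging

class SemanticLogger:
    """Emit named events with structured payloads instead of free-form text."""
    def __init__(self, name):
        self._log = logging.getLogger(name)

    def event(self, event_name, **fields):
        # Serialize as JSON so downstream tools can query fields
        # rather than parse prose.
        record = {"event": event_name, **fields}
        self._log.info(json.dumps(record, sort_keys=True))
        return record

log = SemanticLogger("orders")
evt = log.event("OrderSubmitted", order_id=42, region="EastUS", latency_ms=118)
```

The payoff is the same as with ETW: because every event has a name and typed fields, questions like “what is the 95th-percentile latency for OrderSubmitted in EastUS?” become queries, not regular expressions.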
[This post is part 21 of the 31 Days of Server (VMs) in the Cloud Series - I contributed the article below, but all others have been contributed by others - please find the index for the whole series by clicking here.]
As technology professionals we need to be careful about how we spend our time. Unless we want short careers, we find time to keep up with at least some new technologies, but there isn’t time in anyone’s day to keep up with every technology. We have to make choices.
For the IT Pro looking at cloud technologies, the IaaS capabilities are a far more obvious area on which to spend time than PaaS capabilities. In this post, we’ll take a peek into PaaS. The goal is to clarify the difference between IaaS and PaaS, understand what PaaS is uniquely good for, and offer some reasons why a busy IT Pro might want to invest some time learning about PaaS.
While the concepts in this post can apply generally to many platforms – including public and private clouds, Microsoft technologies, and competing solutions – this post focuses on IaaS and PaaS capabilities within the Windows Azure Cloud Platform. Virtual machines and SQL databases are highlighted since these are likely of greatest interest to the IT Pro.
The NIST Definition of Cloud Computing (SP800-145) defines some terms that are widely used in the industry for classifying cloud computing approaches. One set of definitions delineates Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). You can read the NIST definitions for more details, but the gist is this:
|Service Model|What You Provide|Target Audience|Control & Flexibility|Expertise Needed|Example|
|---|---|---|---|---|---|
|SaaS|Users|Business Users|Low|App usage|Office 365|
|PaaS|Applications|Developers|Medium|App design and mgmt|Windows Azure Cloud Services|
|IaaS|Virtual Machines|IT Pros|High|App design and mgmt + VM/OS mgmt|Windows Azure Virtual Machines (Windows Server, Linux)|
Generally speaking, as we move from SaaS through PaaS to IaaS, we gain more control and flexibility at the expense of more cost and expertise needed due to added complexity. There are always exceptions (perhaps a SaaS solution that requires complex integration with an on-premises solution), but this is good enough to set the stage. Now let’s look at the core differences between PaaS and IaaS as they relate to the IT Pro.
Even though Windows Azure has vastly more to offer (more on that later), the most obvious front-and-center offering is the humble VM. This is true both for PaaS and IaaS. So what distinguishes the two approaches?
The VMs for PaaS and IaaS behave very differently. The PaaS VM has a couple of behaviors that may surprise you, while the IaaS VM behavior is more familiar. Let’s start with the most far-reaching difference: On a PaaS VM, local storage is not durable.
This has significant implications. Suppose you install software (perhaps a database) on a PaaS VM and it stores some data locally. This will work fine… at least for a short while. At some point, Azure will migrate your application from one node to another… and it will not bring local data with it. Your locally-stored database data, not to mention any custom system tuning you did during installation, are gone. And this is by design. (For a list of scenarios where PaaS VM drive data is destroyed, see the bottom of this document.)
How can this possibly be useful: a VM that doesn’t hold on to its local data…
You might wonder how this can possibly be useful: a VM that doesn’t hold on to its data. The fact of the matter is that it is not very useful for many applications written with conventional (pre-cloud) assumptions (such as guarantees around the durability of data). [PaaS may not be good at running certain applications, but is great at running others. So please keep reading!]
The PaaS VM drives use conventional server hard drives. These can fail, of course, and they are not RAID or high-end drives; this is commodity hardware optimized for high value for the money. And even if drives don’t outright fail, there are scenarios where the Azure operating environment does not guarantee durability of locally stored data (as referenced earlier).
On the other hand, IaaS VMs do have persistent/durable local drives. This is what makes them so much more convenient to use – and why they have a more familiar feel to IT Pros (and developers). But these drives are not the local server hard drives (other than the D: drive, which is expected to be used only for temporary caching); they are backed by a high-capacity, highly scalable data storage service known as the Windows Azure Blob service (“blobs” for short). Each blob is roughly equivalent to a file, and each drive referenced by the VM is a VHD stored as one of these files. Data stored in blobs is safe from hardware failure: it is stored in triplicate by the blob service (each copy on a different physical node), and is then geo-replicated in the background to a data center in another region, resulting (after a few minutes of latency) in an additional three copies.
IaaS VMs have persistent/durable local storage backed by blobs… this makes them so much more convenient to use – and more familiar to IT Pros
Storing redundant copies of your data offers a RAID-like feel, though it is more cost-efficient at the scale of a data center.
Since the blob service transparently handles storage for IaaS VMs (the operating system drive plus one or more data drives) and is external to any particular VM instance, it is not only a familiar model but also extremely robust and convenient.
Summarizing Some Key Differences
| |PaaS VM|IaaS VM|
|---|---|---|
|Virtual Machine image|Choose from Win 2008 SP2, Win 2008 R2, and Win 2012. There are patch releases within each of these families.|There are many to choose from, including images you can create yourself. Can be Windows or Linux.|
|Hard Disk Persistence|Not durable. Could be lost due to hardware failure or when moved from one machine to another.|Durable. Backed by a blob (blobs are explained above).|
|Service Level Agreement (SLA)|99.95% for two or more instances (details). No SLA offered for single instance.|99.95% for two or more instances. 99.9% for single instance. (Preliminary details.)|
SLA details for the IaaS VM are preliminary since the service is still in preview as of this writing.
Windows Azure offers a PaaS database option, formerly called SQL Azure, and today known simply as SQL Database. This is really SQL Server behind the scenes, though it is not exactly the same as SQL Server 2012 (“Denali”).
SQL Database is offered as a service. This means with a few mouse clicks (or a few lines of PowerShell) you can have a database connection string that’s ready to go. Connecting to this database will actually connect you to a 3-node SQL Server cluster behind the scenes, but this is not visible to you; it appears to you to simply be a single-node instance. Three copies of your data are maintained by the cluster (each on different hardware).
The three copies of every byte are great for High Availability (HA), but they offer no defense against Human Error (HE). If someone drops the CUSTOMER table, that drop is immediately replicated to all three copies of your data. You still need a backup strategy.
One big benefit of the SQL Database service is that the server is completely managed by Windows Azure… with the flip side of that coin being that an IT Pro simply cannot make any adjustments to the configuration. Note that SQL tuning and database schema design skills have not gone anywhere; this is all just as demanding in the cloud as outside the cloud.
SQL Database has some limitations. The most obvious is that you cannot store more than 150 GB in a single instance. What happens when you have 151 GB? This brings to light another PaaS/IaaS divergence: the IaaS approach is to grow the database (“scale up” or “vertical scaling”), while the PaaS approach is to add additional databases (“scale out” or “horizontal scaling”). For the SQL Database service in Windows Azure, only horizontal scaling is supported – it is up to the application to distribute its data across more than one physical database, an approach commonly known as sharding, where each shard is one physical database. This can be a big change for an application to support, since the database schema needs to be compatible, which usually means it needs to have been designed with sharding in mind from the start. Further, the application needs to be built to find and connect to the correct shard.
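To make the sharding idea concrete, here is a minimal hash-based routing sketch in Python. The shard map and connection strings are hypothetical, and this simple modulo scheme is not how Federations works – Federations uses range-based partitioning precisely so data can be repartitioned without rewriting the routing logic:

```python
# Hypothetical shard map: each shard is one physical database,
# identified here by a made-up connection string.
SHARDS = [
    "Server=shard0.db.example.com;Database=app0",
    "Server=shard1.db.example.com;Database=app1",
    "Server=shard2.db.example.com;Database=app2",
]

def shard_for(customer_id: int) -> str:
    """Route a customer to its shard by taking the shard key modulo
    the shard count. Deterministic: the same customer always lands
    on the same shard."""
    return SHARDS[customer_id % len(SHARDS)]
```

The weakness of modulo routing is visible immediately: adding a fourth shard remaps most customers, which is why production sharding layers favor range- or lookup-based partitioning.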
For PaaS applications that wish to support sharding, the Federations in SQL Database feature provides robust support for handling most of the routine tasks. Without the kind of support offered by Federations, building a sharding layer can be far more daunting. Federations simplifies connection string management, has smart caching, and offers management features that allow you to repartition your data across SQL Database nodes without experiencing downtime.
The alternative to SQL Database is for you to simply use an IaaS VM to host your own copy of SQL Server. You have full control (you can configure, tune, and manage your own database, unlike with the SQL Database service where these functions are all handled and controlled by Windows Azure). You can grow it beyond 150 GB. It is all yours.
But realize that in the cloud, there are still limitations. All public cloud vendors offer a fixed menu of virtual machine sizes, so you will need to ensure that your self-managed IaaS SQL Server will have enough resources (e.g., RAM) for your largest database.
Any database can outgrow its hardware, whether on the cloud or not.
It is worth pointing out that any database can outgrow its hardware. And the higher end the hardware, the more expensive it becomes from a “capabilities for the money” point of view. At some point either (a) you can’t afford sufficiently large hardware, or (b) the needed hardware is so high end that it is not commercially available. This will drive you towards either a sharding architecture or some other approach to make your very large database smaller so that it fits in available hardware.
Another significant difference between the SQL Database service and a self-hosted IaaS SQL Server is that the SQL Database service is multitenant: your data sits alongside the data of other customers. This is secure – one customer cannot access another customer’s data – but it does present a challenge: when one customer’s queries are very heavy, another customer may experience variability in performance as a result. For this reason, the SQL Database service protects itself and other customers by not letting any one customer dominate resources. This is accomplished with “throttling” and can manifest in multiple ways, from a delay in execution to a dropped connection (which the calling application is responsible for reestablishing).
Don’t underestimate the importance of handling throttling properly. Applications need to be written to handle these scenarios in order to function correctly, and throttling can happen even if your application is doing nothing wrong.
Proper handling requires that application code catch certain types of transient failures and retry. Most existing application code does not do this. Blindly pointing an existing application at a SQL Database instance might seem to work, but it may occasionally hit odd errors that are hard to track down or diagnose if the application was written (and tested) in an environment where interactions with SQL Server always succeeded.
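The standard remedy is a retry policy with exponential backoff around every database operation. Here is a minimal sketch in Python – the transient-error type and the flaky operation are simulated (in .NET, Microsoft’s Transient Fault Handling Application Block plays this role):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a throttling failure (e.g., a dropped connection)."""

def with_retries(operation, max_attempts=5, base_delay=0.1):
    """Run `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Back off exponentially, with jitter so many clients
            # don't retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Simulated operation that is throttled twice before succeeding.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("throttled")
    return "rows"

result = with_retries(flaky_query, base_delay=0.001)
```

The key discipline is wrapping every database interaction this way, not just the ones that failed in testing – throttling strikes unpredictably.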
The self-managed IaaS database does not suffer this unpredictability since you presumably control which application can connect and can manage resources more directly.
The SQL Database service has some easy-to-enable features that may make your life easier. One example is the database sync service that can be enabled in the Windows Azure Portal. You can easily configure a SQL Database instance to be replicated with one or more other instances in the same or a different data center. This can help with an offsite-backup strategy or with mirroring globally to reduce latency, and is one area where PaaS shines.
Windows Azure today offers the SQL Database service based on SQL Server 2012. If your application (for some reason) needs an older version of SQL Server (perhaps it is a vendor product and you don’t control this), then your hands are tied.
Or perhaps you want another database besides SQL Server. Windows Azure has a partner offering MySQL, and other vendor products will likely be offered over time. NoSQL Databases are also becoming more popular. Windows Azure natively offers the NoSQL Windows Azure Table service, and a few examples of other third-party ones include MongoDB, Couchbase, RavenDB, and Riak. Unless (or until) these are offered as PaaS services through the Windows Azure Store, your only option will be to run them yourself in an IaaS VM.
The main thrust of PaaS is to make operations efficient for applications designed to align with the PaaS approach – for example, applications that can deal with throttling, or with a PaaS VM being migrated and losing all locally stored data. This is all doable – and without degrading user experience – it just so happens that most applications that exist today (and will still exist tomorrow) don’t work this way.
The PaaS approach can be used to horizontally scale an application very efficiently (whether computational resources running on VMs or database resources sharded with Federations for SQL Database), overcome disruptions due to commodity hardware failures, gracefully handle throttling (whether from SQL Database or other Azure services not discussed), and do so with minimal human interaction. But getting to this point is not automatic.
WazOps – DevOps, Windows Azure style! – is the role that will build out this reality. There are auto-scaling tools – both external services and some that we can run ourselves, like the awesome WASABi auto-scaling application block from Microsoft’s Patterns & Practices group – that can be configured to scale an application on a schedule or based on environmental signals (like the CPU spiking in a certain VM).
There is also the mundane. How to script a managed deployment so our application can be upgraded without downtime? Windows Azure PaaS services have features for this, such as the in-place update and the VIP Swap. But we still need to understand them and create a strategy to use them appropriately.
Further, there are at least some of the same-old-details. For example, it is easy to deploy an SSL certificate to my PaaS VM that is being deployed to IIS… but it still will expire in a year and someone still needs to know this – and know what to do about it before it results in someone being called at 2:00 AM on a Sunday.
Clearly there are some drawbacks to running PaaS since most existing applications will not run successfully without some non-trivial rework, but will work just fine if deployed to IaaS VMs.
However, that does not mean that PaaS is not useful. It turns out that some of the most reliable, scalable, cost-efficient applications in the world are architected for this sort of PaaS environment. The Bing services behind bing.com take this approach, as only one example. The key here is that these applications are architected assuming a PaaS environment. I don’t use the term “architected” lightly, since architecture dictates the most fundamental assumptions about how an application is put together. Most applications that exist today are not architected with PaaS-compatible assumptions. However, as we move forward, and developer skills catch up with the cloud offerings, we will see more and more applications designed from the outset to be cloud-native; these will be deployed using these PaaS facilities.
A stateless web tier (with no session affinity in the load balancer) is a good example today of an application tier that could run successfully in a PaaS environment – though I’ll be quick to note that other tiers of that application may not run so well in PaaS. Which brings up an obvious path going forward: hybrid applications that mix PaaS and IaaS. This will be a popular mix in the coming years.
Consider a 3-tier application with a web tier running in IIS, a service tier, and a SQL Server back-end database. If built with conventional approaches, not considering the PaaS cloud, none of these three tiers would be ready for a PaaS environment. So we could deploy all three tiers using IaaS VMs.
As a software maintenance step, it would be reasonable to upgrade the web site (perhaps written in PHP or ASP.NET) to be stateless and not need session affinity (Windows Azure PaaS Cloud Services do not support session affinity from the load balancer). These types of changes may be enough to allow the web tier to run more efficiently using PaaS VMs, while still interacting with a service tier and database running on IaaS VMs.
A future step could upgrade the service tier to handle SQL Database throttling correctly, allowing the SQL Server instance running on an IaaS VM to be migrated to the SQL Database service. This will reduce the number of Windows servers and SQL Servers being managed by the organization (shifting these to Windows Azure), and may also simplify some other tasks (like replicating that data using the Data Sync Service). Each service and VM also has its own direct costs (our monthly bill to Microsoft for the Windows Azure cloud services we consume), which are detailed in the pricing section of the Windows Azure Portal.
Still another future step could be to migrate the middle tier to be stateless – but maybe not. All of these decisions are business decisions; perhaps the cost-benefit is not there. It depends on your application, your business, and the skills and preferences of the IT Pros and developers in the organization.
I’ll summarize here with some of the key take-aways for the IT Pro who is new to PaaS services:
Feedback always welcome and appreciated. Good luck in your cloud journey!
The backstory is that Windows Azure uses certificates in a few different ways, and understanding the different types of certificate uses is key to understanding why these different ways of using and deploying certificates are the way they are.
The slide deck is here:
Disaster Recovery, or DR, refers to your approach for recovering from an event that results in failure of your software system. Some examples of such events: hurricanes, earthquakes, and fires. The common thread with these events is that they were not your fault and they happened suddenly, usually at the most inconvenient of times.
Damage from one of these events might be temporary: a prolonged power outage that is eventually restored. Damage might be permanent: servers immersed in water are unlikely to work after drying out.
Whether a one-person shop with all the customer data on a single laptop, or a large multi-national with its own data centers, any business that uses computers to manage data important to that business needs to consider DR.
The remainder of this article focuses on some useful DR approaches for avoiding loss of business data when engineering applications for the cloud. The detailed examples are specific to the Windows Azure Cloud Platform, but the concepts apply more broadly, such as to Amazon Web Services and other cloud platforms. Notably, this post does not discuss DR approaches as they apply to other parts of the infrastructure, such as web server nodes or DNS routing.
Your first line of defense is to minimize exposure. Consider a cloud application with business logic running on many compute nodes.
Terminology note: I will use the definition of node from page 2 of my Cloud Architecture Patterns book (and occasionally in other places in this post I will reference patterns and primers from the book where they add more information):
An application runs on multiple nodes, which have hardware resources. Application logic runs on compute nodes and data is stored on data nodes. There are other types of nodes, but these are the primary ones. A node might be part of a physical server (usually a virtual machine), a physical server, or even a cluster of servers, but the generic term node is useful when the underlying resource doesn’t matter. Usually it doesn’t matter.
In cloud-native Windows Azure applications, these compute nodes are Web Roles and Worker Roles. The thing to realize is that local storage on Web Roles and Worker Roles is not a safe place to keep important data long term. Well before getting to an event significant enough to be characterized as needing DR, small events such as a hard-disk failure can result in the loss of such data.
While not a DR issue per se due to the small scope, these applications should nevertheless apply the Node Failure Pattern (Chapter 10) to deal with this.
But the real solution is to not use local storage on compute nodes to store important business data. This is part of an overall strategy of using stateless nodes to enable your application to scale horizontally, which comes with many important benefits beyond just resilience to failure. Further details are described in the Horizontally Scaling Compute Pattern (Chapter 2).
In the United States, there are television commercials featuring “The Most Interesting Man in the World” who lives an amazing, fantastical life, and doesn’t always drink beer, but when he does he drinks DOS EQUIS.
In the cloud, our compute nodes do not always need to persist data long-term, but when they do, they use cloud platform services.
And the “DOS” in “DOS EQUIS” stands for neither Disk Operating System nor Denial of Service here, but rather is the number two in Spanish. But cloud platform services for data storage do better than dos, they have tres – as in three copies.
Windows Azure Storage and Windows Azure SQL Database both write three copies of each byte onto three independent servers with three independent disks. The hardware is commodity hardware – chosen for high value, not strictly for high availability – so it is expected to fail, and the failures are overcome by keeping multiple copies of every byte. If one of the three instances fails, a new third instance is created by making copies from the other two. The goal state is to continually have three copies of every byte.
Windows Azure Storage is always accessed through a REST interface, either directly or via an SDK that uses the REST interface under the hood. For any REST API call that modifies data, the API does not return until all three copies of the bytes are successfully stored.
Windows Azure SQL Database is always accessed through TDS, which is the same TCP protocol as SQL Server. While your application is provided a single connection string, and you create a single TDS connection, behind the scenes there is a three-node cluster. For any operation that modifies data, the operation does not return until at least two copies of the update have been successfully applied on two of the nodes in this cluster; the third node is updated asynchronously.
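The write path just described can be sketched as a quorum write: acknowledge the caller once two of the three replicas are durable, and let the third catch up afterwards. This Python simulation compresses the asynchronous catch-up into a synchronous loop purely for illustration:

```python
def quorum_write(replicas, key, value, quorum=2):
    """Simulate the SQL Database write path: count the write as
    acknowledged once `quorum` replicas are durable; the remaining
    replicas catch up afterwards (asynchronously in the real cluster,
    synchronously in this sketch)."""
    acked = 0
    pending = []
    for replica in replicas:
        if acked < quorum:
            replica[key] = value      # synchronous write, counted toward the ack
            acked += 1
        else:
            pending.append(replica)   # would be updated asynchronously
    for replica in pending:
        replica[key] = value          # the lagging third copy catches up
    return acked

# Three replicas behind one logical connection string.
replicas = [{}, {}, {}]
acked = quorum_write(replicas, "customer:42", "Maura")
```

The trade-off is the classic one: the caller gets durability across two nodes without waiting for the slowest of three, at the cost of one replica being briefly behind.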
So if you have a Web Role or Worker Role in Windows Azure, and that node has to save data, it should use one of the persistent storage mechanisms just mentioned.
What about Windows Azure Virtual Machines?
Windows Azure also has a Virtual Machine node that you can deploy (Windows or Linux flavored), and the hard disks attached to those nodes are persistent. How can that be? It turns out they are backed by Windows Azure Blob storage, so the model is not broken. These nodes also have some storage that is truly local, which can be used for caching sorts of functions, but any long-term data is persisted to blob storage – even though, from the point of view of any code running on the virtual machine, it is indistinguishable from a local disk drive.
In addition to this, Windows Azure Storage asynchronously geo-replicates blobs and tables to a sister data center. There are eight Azure data centers, and they are paired as follows: East US-West US, North Central US-South Central US, North Europe-West Europe, and East Asia-Southeast Asia. Note that the pairs are chosen to be in the same geo-political region to simplify regulatory compliance in many cases. So if you save data to a blob in East US, three copies will be synchronously written in East US, then three more copies will be asynchronously written to West US.
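The pairing just listed is a fixed mapping, which can be captured in a small lookup table – a sketch only, using the region labels as they appear in the portal at the time of this writing:

```python
# Sister data center pairs, as listed above.
GEO_PAIRS = {
    "East US": "West US",
    "North Central US": "South Central US",
    "North Europe": "West Europe",
    "East Asia": "Southeast Asia",
}

# Pairing is symmetric: each region's sister points back at it.
SISTER = {**GEO_PAIRS, **{v: k for k, v in GEO_PAIRS.items()}}
```

So a blob written to "East US" is asynchronously geo-replicated to `SISTER["East US"]`, which is "West US" – and note that every pair stays within one geo-political region.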
It is easy to overlook the immense value of having data stored in triplicate and transparently geo-replicated. While the feature comes across rather matter-of-factly, you get incredibly rich DR features without lifting a finger. Don’t let the ease of use mask the great value of this powerful feature.
All of the local and geo-replication mentioned so far happens for free: it is included as part of the listed at-rest storage costs, and no action is needed on your part to enable this capability (though you can turn geo-replication off).
All the replication listed above will help DR. If a hardware failure takes out one of your three local copies, the system self-heals – you will never even know most types of failures happen. If a natural disaster takes out a whole data center, Microsoft decides when to reroute DNS traffic for Windows Azure Storage away from the disabled data center and over to its sister data center which has the geo-replicated copies.
Note that geo-replication is out-of-the-box today only for Windows Azure Storage – and only for blobs and tables, not queues – and not for SQL Database. However, for SQL Database this can be enabled using the sync service available today – you decide how many copies, to which data centers, and at what frequency.
Note that there are additional costs associated with using the sync service for SQL Database, for the sync service itself and for data center egress bandwidth.
Regardless of the mechanism, there is always a time-lag in asynchronous geo-replication, so if a primary data center was lost suddenly, the last few minutes worth of updates may not have been fully replicated. Of course, you could choose to write synchronously to two data centers for super-extra safety, but please consult the Network Latency Primer (Chapter 11) before doing so.
This is all part of the overall Multisite Deployment Pattern (Chapter 15), though servicing a geo-distributed user base is another feature of this architecture pattern, beyond the DR features.
The title of this blog post is “Engineering for Disaster Recovery in the Cloud” but where did all the engineering happen?
Much of what you need for DR is handled for you by cloud platform services, but not all of it. From time to time we alluded to some design patterns that your applications need to adhere to in order for these platform services to make sense. As one example, if your application is written to assume it is safe to use local storage on your web server as a good long-term home for business data, well… the awesomeness built into cloud platform services isn’t going to help you.
There is an important assumption here if you want to leverage the full set of services available in the cloud: you need to build cloud-native applications. These are cloud applications that are architected to align with the architecture of the cloud.
I wrote an entire book explaining what it means to architect a cloud-native application and detailing specific cloud architecture patterns to enable that, so I won’t attempt to cover it in a blog post, except to point out that many of the architectural approaches of traditional software will not be optimal for applications deployed to the cloud.
Finally, we need to distinguish DR from HE – Disaster Recovery from Human Error.
Consider how the DR features built into the cloud will not help with many classes of HE. If you modify or delete data, your changes will dutifully be replicated throughout the system. There is no magic “undo” in the cloud. This is why you usually will still want to take control of making back-ups of certain data.
So backups are still desirable. There are cloud platform services to help you with backups, and some great third-party tools as well. Details on which to choose warrant an entire blog post of their own, but hopefully this post at least clarifies the different needs driven by DR vs. HE.
Maybe. It depends on your business needs. If your application is one of those rare applications that needs to be responsive 24×7 without exception, not even for a natural disaster, then no, this is not enough. If your application is a line-of-business application (even an important one), often it can withstand a rare outage under unusual circumstances, so this approach might be fine. Most applications are somewhere in between and you will need to exercise judgement in weighing the business value against the engineering investment and operational cost of a more resilient solution.
And while this post talked about how the combination of following some specific cloud architecture patterns to design cloud-native applications provides a great deal of out-of-the-box resilience in DR situations, it did not cover ongoing continuity, such as with computation, or immediate access to data from multiple data centers. If you rely entirely on the cloud platform to preserve your data, you may not have access to it for a while since (as mentioned earlier, and emphasized nicely in Neil’s comment) you don’t control all the failover mechanisms; you will need to wait until Microsoft decides to failover the DNS for Windows Azure Storage, for example. And remember that background geo-replication does not guarantee zero data loss: some changes may be lost due to the additional latency needed in moving data across data centers, and not all data is geo-replicated (such as queued messages and some other data not discussed).
The ITIL term for “how much data can I stand to lose” is the recovery point objective (RPO). The ITIL term for “how long can I be down” is the recovery time objective (RTO). The RPO and RTO are useful concepts for modeling DR.
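To make RPO and RTO concrete: if your geo-replication lags by at most five minutes and a failover takes at most two hours, your effective recovery point is five minutes of data and your effective recovery time is two hours. A small sketch – the figures here are illustrative only, to be replaced with your own measurements:

```python
from datetime import timedelta

# Illustrative figures only -- substitute your own measurements.
replication_lag = timedelta(minutes=5)  # worst-case async replication lag
failover_time = timedelta(hours=2)      # time to redirect traffic and recover

def meets_objectives(rpo: timedelta, rto: timedelta) -> bool:
    """True if measured lag and failover time satisfy the business's
    stated recovery point and recovery time objectives."""
    return replication_lag <= rpo and failover_time <= rto

# A business that can lose at most 15 minutes of data and be down
# at most 4 hours is satisfied by this configuration:
ok = meets_objectives(rpo=timedelta(minutes=15), rto=timedelta(hours=4))
```

The point of the exercise is that RPO and RTO come from the business, while the lag and failover figures come from the platform and your architecture; DR planning is the job of reconciling the two.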
So the DR capabilities built into cloud platform services are powerful, but somewhat short of all-encompassing. However, they do offer a toolbox providing you with unprecedented flexibility in making this happen.
The underlying need to understand RPO and RTO and use them to model for DR is not specific to the cloud. These are very real issues in on-premises systems as well. The approaches to addressing them may vary, however.
Generally speaking, while the cloud does not excuse you from thinking about these important characteristics, it does provide some handy capabilities that make it easier to overcome some of the more challenging data-loss threats. Hopefully this allows you to sleep better at night.
Bill Wilder is the author of the book Cloud Architecture Patterns – Develop Cloud-Native Applications from O’Reilly. This post complements the content in the book. Feel free to connect with Bill on twitter (@codingoutloud) or leave a comment on this post. (He’s also warming up to Google Plus.)
The S3 service runs “out there” (in the cloud) and provides a scalable repository for applications to store and manage data files. The service can support files of any size, as well as any quantity. So you can put as much stuff up there as you want – and since it is a pay-as-you-go service, you pay for what you use. The S3 service is very popular. An example of a well-known customer, according to Wikipedia, is SmugMug:
Photo hosting service SmugMug has used S3 since April 2006. They experienced a number of initial outages and slowdowns, but after one year they described it as being “considerably more reliable than our own internal storage” and claimed to have saved almost $1 million in storage costs.
Of course, Amazon isn’t the only cloud vendor with such an offering. Google offers Google Storage, and Microsoft offers Windows Azure Blob Storage; both offer features and capabilities very similar to those of S3. While Amazon was the first to market, all three services are now mature, and all three companies are experts at building internet-scale systems and high-volume data storage platforms.
As I mentioned above, S3 came up during a talk I attended. The speaker – CTO of a company built entirely on Amazon services – twice touted S3’s incredibly strong Service Level Agreement (SLA). He said this was both a competitive differentiator for his company, and also a competitive differentiator for Amazon versus other cloud vendors.
Pause and think for a moment – any idea? – What is the SLA for S3? How about Google Storage? How about Windows Azure Blob Storage?
Before I give away the answer, let me remind you that a Service Level Agreement (SLA) is a written policy offered by the service provider (Amazon, Google, and Microsoft in this case) that describes the level of service being offered, how it is measured, and consequences if it is not met. Usually, the “level of service” part relates to uptime and is measured in “nines” as in 99.9% (“three nines”) and so forth. More nines is better, in general – and wikipedia offers a handy chart translating the number of nines into aggregate downtime/unavailability. (More generally, an SLA also deals with other factors – like refunds to customers if expectations are not met, what speed to expect, limitations, and more. I will focus only on the “nines” here.)
So… back to the question… For S3 and equivalent services from other vendors, how many nines are in the Amazon, Google, and Microsoft SLAs? The speaker at the talk said that S3 had an uptime SLA with 11 9s. Let me say that again – eleven nines – or 99.999999999% uptime. If you attempt to look this up in the chart mentioned above, you will find this number is literally “off the chart” – the chart doesn’t go past six nines! My back-of-the-envelope calculation says it amounts to – on average – less than a third of a millisecond of downtime per year. A blink of your eye takes a few hundred milliseconds, so this is roughly a thousandth of an eye-blink.
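The back-of-the-envelope arithmetic is easy to reproduce: the allowed downtime is simply the fraction of the year not covered by the availability percentage. A quick sketch:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def downtime_per_year(availability_pct: float) -> float:
    """Seconds of allowed downtime per year at a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

three_nines = downtime_per_year(99.9)           # ~8.76 hours per year
eleven_nines = downtime_per_year(99.999999999)  # ~0.32 milliseconds per year
```

Three nines allows nearly nine hours of downtime per year; eleven nines would allow about a third of a millisecond – which gives a feel for just how extravagant the eleven-nines claim is.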
This is an impressive number! If only it were true. It turns out the real SLA for Amazon S3 has exactly as many nines as the SLA for Windows Azure Blob Storage and the SLA for Google Storage: they are all 99.9%.
Storage SLAs for Amazon, Google, and Microsoft all have exactly the same number of nines: they are all 99.9%. That’s three nines.
I am not picking on the CTO I heard gushing about the (non-existent) eleven-nines SLA. (In fact, his or her identity is irrelevant to the overall discussion here.) The more interesting part to me is the impressive reality distortion field around Amazon and its platform’s capabilities. The CTO I heard speak got it wrong, but this is not the first time it was misinterpreted as an SLA, and it won’t be the last.
I tracked down the origin of the eleven nines. Amazon CTO Werner Vogels mentions in a blog post that the S3 service is “design[ed]” for “99.999999999% durability” – choosing his words carefully. Consistent with Vogels’ language is the following Amazon FAQ on the same topic:
Q: How durable is Amazon S3? Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities.
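The FAQ’s arithmetic checks out: an annual loss rate of 0.000000001% applied to 10,000 objects works out to an expected loss of one object every ten million years. A quick sanity check (my own arithmetic, not an Amazon formula):

```python
# Figures from the Amazon FAQ quoted above.
annual_loss_rate = 0.000000001 / 100  # 0.000000001% as a fraction (1e-11)
objects_stored = 10_000

# Expected number of objects lost per year, and its reciprocal:
expected_losses_per_year = objects_stored * annual_loss_rate  # 1e-7
years_per_lost_object = 1 / expected_losses_per_year          # 10,000,000
```

That is, on average, one lost object per 10,000,000 years – matching the FAQ’s claim.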
First of all, these mentions are a comment on a blog and an item in an FAQ page; neither is from a company SLA. And second, they both speak to durability of objects – not uptime or availability. And third, also critically, they say “designed” for all those nines – but guarantee nothing of the sort. Even still, it is a bold statement. And good marketing.
It is nice that Amazon can have so much confidence in their S3 design. I did not find a comparable statement about confidence in the design of their compute infrastructure… Reality is that [cloud] services are about more than design and architecture – they are also about implementation, operations, management, and more. To have any hope, architecture and design need to be solid, of course, but alone they cannot prevent a general service outage which could take your site down with it (and even still lose data occasionally). Some others on the interwebs are as skeptical as I am – not just of Amazon, but of anyone claiming too many nines.
How about the actual 99.9% “three-nines” SLA? Be careful in your expectations. As a wise man once told me, there’s a reason they are called Service Level Agreements, rather than Service Level Guarantees. There are no guarantees here.
This isn’t to pick on Amazon – other vendors have had – and will have – interruptions in service. For most companies, the cloud will still be the most cost-effective and reliable way to host your applications; few companies can compete with the big platform cloud vendors for expertise, focus, reliability, security, economies-of-scale, and efficiency. It is only a matter of time before you are there. Today, your competitors (known and unknown) are moving there already. As a wise man once told me (citing Crossing the Chasm), the innovators and early adopters are those companies willing to trade off risk for competitive advantage. You saw it here first: this Internet thing is going to stick around for a while. Yes, and cloud services will just make too much sense to ignore. You will be on the cloud; it is only a matter of where you’ll be on the curve.
Back to all those nines… Of course, Amazon has done nothing wrong here. I see nothing inaccurate or deceptive in their documentation. But those of us in the community need to pay closer attention to what is really being described. So here’s a small favor I ask of this technology community I am part of: Let’s please do our homework so that when we discuss and compare the cloud platforms – on blogs, when giving talks, or chatting 1:1 – we can at least keep the discussions based on facts.