
Talk: How is Architecting for the Cloud Different?

On Thursday 07-February-2013 I spoke at DevBoston about “How is Architecting for the Cloud Different?”

Here is the abstract:

If my application runs on cloud infrastructure, am I done? Not if you wish to truly take advantage of the cloud. The architecture of a cloud-native application is different than the architecture of a traditional application and this talk will explain why. How to scale? How do I overcome failure? How do I build a system that I can manage? And how can I do all this without a huge monthly bill from my cloud vendor? We will examine key architectural patterns that truly unlock cloud benefits. By the end of the talk you should appreciate how cloud architecture differs from what most of us have become accustomed to with traditional applications. You should also understand how to approach building self-healing distributed applications that automatically overcome hardware failures without downtime (really!), scale like crazy, and allow for flexible cost-optimization.

Here are the slides:

How is Architecting for the Cloud Different — DevBoston — 06-Feb-2013 — Bill Wilder (blog.codingoutloud.com)

Here is the book we gave away copies of (and from which some of the material was drawn):

book-cover-medium.jpg

Ready to learn more about Windows Azure? Come join us at the Boston Azure Cloud User Group!

Boston Azure cloud user group logo


Engineering for Disaster Recovery in the Cloud (Avoiding Data Loss)

Disaster Recovery, or DR, refers to your approach for recovering from an event that results in failure of your software system. Some examples of such events: hurricanes, earthquakes, and fires. The common thread with these events is that they were not your fault and they happened suddenly, usually at the most inconvenient of times.

image of storm clouds

Clouds are not always inviting! Be prepared for storm clouds.

Damage from one of these events might be temporary: a prolonged power outage that is eventually restored. Damage might be permanent: servers immersed in water are unlikely to work after drying out.

Whether a one-person shop with all the customer data on a single laptop, or a large multi-national with its own data centers, any business that uses computers to manage data important to that business needs to consider DR.

The remainder of this article focuses on some useful DR approaches for avoiding loss of business data when engineering applications for the cloud. The detailed examples are specific to the Windows Azure Cloud Platform, but the concepts apply more broadly, such as with Amazon Web Services and other cloud platforms. Notably, this post does not discuss DR approaches as they apply to other parts of infrastructure, such as web server nodes or DNS routing.

Minimize Exposure

Your first line of defense is to minimize exposure. Consider a cloud application with business logic running on many compute nodes.

Terminology note: I will use the definition of node from page 2 of my Cloud Architecture Patterns book (and occasionally in other places in this post I will reference patterns and primers from the book where they add more information):

An application runs on multiple nodes, which have hardware resources. Application logic runs on compute nodes and data is stored on data nodes. There are other types of nodes, but these are the primary ones. A node might be part of a physical server (usually a virtual machine), a physical server, or even a cluster of servers, but the generic term node is useful when the underlying resource doesn’t matter. Usually it doesn’t matter.

In cloud-native Windows Azure applications, these compute nodes are Web Roles and Worker Roles. The thing to realize is that local storage on Web Roles and Worker Roles is not a safe place to keep important data long term. Well before getting to an event significant enough to be characterized as needing DR, small events such as a hard-disk failure can result in the loss of such data.

While not a DR issue per se due to the small scope, these applications should nevertheless apply the Node Failure Pattern (Chapter 10) to deal with this.

But the real solution is to not use local storage on compute nodes to store important business data. This is part of an overall strategy of using stateless nodes to enable your application to scale horizontally, which comes with many important benefits beyond just resilience to failure. Further details are described in the Horizontally Scaling Compute Pattern (Chapter 2).

Leverage Platform Services

In the United States, there are television commercials featuring “The Most Interesting Man in the World” who lives an amazing, fantastical life, and doesn’t always drink beer, but when he does he drinks DOS EQUIS.


In the cloud, our compute nodes do not always need to persist data long-term, but when they do, they use cloud platform services.

And the “DOS” in “DOS EQUIS” stands for neither Disk Operating System nor Denial of Service here, but rather is the number two in Spanish. But cloud platform services for data storage do better than dos, they have tres – as in three copies.

Windows Azure Storage and Windows Azure SQL Database both write three copies of each byte onto three independent servers on three independent disks. The hardware is commodity hardware – chosen for high value, not strictly for high availability – so it is expected to fail, and the failures are overcome by keeping multiple copies of every byte. If one of the three instances fails, a new third instance is created by making copies from the other two. The goal state is to continually have three copies of every byte.

Windows Azure Storage is always accessed through a REST interface, either directly, or via a specific SDK which uses the REST interface under the hood. For any REST API call that modifies data, the API does not return until all three copies of the bytes are successfully stored.

Windows Azure SQL Database is always accessed through TDS, which is the same TCP protocol as SQL Server. While your application is provided a single connection string, and you create a single TDS connection, behind the scenes there is a three-node cluster. For any operation that modifies data, the operation does not return until at least two copies of the update have been successfully applied on two of the nodes in this cluster; the third node is updated asynchronously.
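To make the difference between the two acknowledgement rules concrete, here is a toy sketch in Python. It is not how Windows Azure implements replication: the `Replica` class and `replicated_write` function are hypothetical names invented for illustration, the asynchronous update of the third database node is not modeled, and a write that falls short of its threshold is not rolled back. It only shows "wait for all three copies" versus "wait for two of three".

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One hypothetical storage node holding a copy of the data."""
    name: str
    data: dict = field(default_factory=dict)
    healthy: bool = True

    def apply(self, key: str, value: str) -> bool:
        # A failed node cannot accept the write.
        if not self.healthy:
            return False
        self.data[key] = value
        return True


def replicated_write(replicas, key, value, required_acks):
    """Return True only if at least `required_acks` replicas stored the write."""
    acks = sum(1 for replica in replicas if replica.apply(key, value))
    return acks >= required_acks


# Three copies, one of which is (hypothetically) down at the moment.
replicas = [Replica("a"), Replica("b"), Replica("c", healthy=False)]

# "All three copies" rule, as described for storage writes.
all_three_ok = replicated_write(replicas, "k", "v", required_acks=3)

# "Two of three" rule, as described for database writes.
two_of_three_ok = replicated_write(replicas, "k", "v", required_acks=2)
```

In this toy run the all-three rule reports failure while the two-of-three rule reports success. The real services deal with a failed node by creating a replacement copy, as described above, which this sketch deliberately leaves out.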

So if you have a Web Role or Worker Role in Windows Azure, and that node has to save data, it should use one of the persistent storage mechanisms just mentioned.

What about Windows Azure Virtual Machines?

Windows Azure also has a Virtual Machine node that you can deploy (Windows or Linux flavored), and the hard disks attached to those nodes are persistent. How can that be? It turns out they are backed by Windows Azure Blob storage, so that doesn’t break the model. These nodes also have some storage that is truly local and can use it for caching sorts of functions, but any long-term data is persisted to blob storage, even though it is indistinguishable from a local disk drive from the point of view of any code running on the virtual machine.

But wait, there’s more!

In addition to this, Windows Azure Storage asynchronously geo-replicates blobs and tables to a sister data center. There are eight Azure data centers, and they are paired as follows: East US-West US, North Central US-South Central US, North Europe-West Europe, and East Asia-Southeast Asia. Note that the pairs are chosen to be in the same geo-political region to simplify regulatory compliance in many cases. So if you save data to a blob in East US, three copies will be synchronously written in East US, then three more copies will be asynchronously written to West US.
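The pairing can be written down as a simple lookup table. The sketch below uses only the pairs listed in this post (which reflect the platform at the time of writing); `SISTER` and `replica_locations` are hypothetical names for illustration, not part of any Azure SDK.

```python
# Data center pairs exactly as listed in this post.
PAIRS = [
    ("East US", "West US"),
    ("North Central US", "South Central US"),
    ("North Europe", "West Europe"),
    ("East Asia", "Southeast Asia"),
]

# Build a two-way lookup so either member of a pair finds its sister.
SISTER = {}
for first, second in PAIRS:
    SISTER[first] = second
    SISTER[second] = first


def replica_locations(primary):
    """Describe where copies of a blob saved in `primary` end up."""
    return {
        "synchronous": (primary, 3),           # three local copies
        "asynchronous": (SISTER[primary], 3),  # three geo-replicated copies
    }


east_us_copies = replica_locations("East US")
```

For a blob saved in East US, `east_us_copies` describes three synchronous copies in East US and three asynchronous copies in West US, for six in total.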

It is easy to overlook the immense value of having data stored in triplicate and transparently geo-replicated. While the feature comes across rather matter-of-factly, you get incredibly rich DR features without lifting a finger. Don’t let the ease of use mask the great value of this powerful feature.

All of the local and geo-replication mentioned so far happens for free: it is included as part of the listed at-rest storage costs, and no action is needed on your part to enable this capability (though you can turn it off).

Enable More as Needed

All the replication listed above will help DR. If a hardware failure takes out one of your three local copies, the system self-heals – you will never even know most types of failures happen. If a natural disaster takes out a whole data center, Microsoft decides when to reroute DNS traffic for Windows Azure Storage away from the disabled data center and over to its sister data center which has the geo-replicated copies.

Note that the geo-replication is only out-of-the-box today for Windows Azure Storage (and not for queues – just for blobs and tables) and not for SQL Database. However, this can be enabled using the sync service available today – you decide how many copies and to which data centers and at what frequency.

Note that there are additional costs associated with using the sync service for SQL Database, for the sync service itself and for data center egress bandwidth.

Regardless of the mechanism, there is always a time-lag in asynchronous geo-replication, so if a primary data center was lost suddenly, the last few minutes worth of updates may not have been fully replicated. Of course, you could choose to write synchronously to two data centers for super-extra safety, but please consult the Network Latency Primer (Chapter 11) before doing so.
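A little arithmetic shows both sides of this trade-off. The sketch below is a back-of-the-envelope estimate only: the function names are hypothetical, and the write rate, lag, and latency figures are made-up inputs rather than measurements of any real data center.

```python
def updates_at_risk(writes_per_second, replication_lag_seconds):
    """Estimate how many updates have not yet reached the sister data center."""
    return writes_per_second * replication_lag_seconds


def synchronous_write_latency_ms(local_write_ms, cross_dc_round_trip_ms):
    """Estimate write latency when waiting for a second data center too.

    Assumes the local and remote writes are issued in parallel, so the
    caller waits for the slower one: the remote write, which costs a
    cross-data-center round trip plus the write itself.
    """
    return local_write_ms + cross_dc_round_trip_ms


# Hypothetical workload: 50 writes per second with a 2 minute replication lag.
at_risk = updates_at_risk(writes_per_second=50, replication_lag_seconds=120)

# Hypothetical latencies: 5 ms local write, 70 ms round trip between data centers.
sync_latency = synchronous_write_latency_ms(local_write_ms=5, cross_dc_round_trip_ms=70)
```

With these assumed numbers, asynchronous replication leaves 6000 updates exposed if the primary is lost suddenly, while writing synchronously to both data centers raises each write from 5 ms to 75 ms.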

This is all part of the overall Multisite Deployment Pattern (Chapter 15), though servicing a geo-distributed user base is another feature of this architecture pattern, beyond the DR features.

Where’s the Engineering?

The title of this blog post is “Engineering for Disaster Recovery in the Cloud” but where did all the engineering happen?

Much of what you need for DR is handled for you by cloud platform services, but not all of it. From time-to-time we alluded to some design patterns that your applications need to adhere to in order for these platform services to make sense. As one example, if your application is written to assume it is safe to use local storage on your web server as a good long-term home for business data, well… the awesomeness built into cloud platform services isn’t going to help you.

There is an important assumption here if you want to leverage the full set of services available in the cloud: you need to build cloud-native applications. These are cloud applications that are architected to align with the architecture of the cloud.

I wrote an entire book explaining what it means to architect a cloud-native application and detailing specific cloud architecture patterns to enable that, so I won’t attempt to cover it in a blog post, except to point out that many of the architectural approaches of traditional software will not be optimal for applications deployed to the cloud.

Distinguish HE from DR

Finally, we need to distinguish DR from HE – Disaster Recovery from Human Error.

Consider how the DR features built into the cloud will not help with many classes of HE. If you modify or delete data, your changes will dutifully be replicated throughout the system. There is no magic “undo” in the cloud. This is why you usually will still want to take control of making back-ups of certain data.

So backups are still desirable. There are cloud platform services to help you with backups, and some great third-party tools as well. Details on which to choose warrant an entire blog post of their own, but hopefully this post at least clarifies the different needs driven by DR vs. HE.
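The difference between replication and backup can be shown with a toy example. Everything here is hypothetical: plain dictionaries stand in for data stores, and `replicate` is an invented helper, not a platform API.

```python
import copy

# The primary store, and a point-in-time backup taken before the mistake.
primary = {"order-1": "paid", "order-2": "shipped"}
backup = copy.deepcopy(primary)


def replicate(source):
    """Replication faithfully mirrors whatever the source holds right now."""
    return copy.deepcopy(source)


# Human error: someone deletes a record that was still needed.
del primary["order-2"]

# The deletion is dutifully replicated along with everything else.
geo_replica = replicate(primary)

# Only the earlier backup still holds the deleted record.
recovered = backup.get("order-2")
```

After the mistaken delete, `geo_replica` no longer contains `order-2`, but the backup still returns `"shipped"`. Replication protects against losing hardware; a backup protects against losing history.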

Is This Enough?

Maybe. It depends on your business needs. If your application is one of those rare applications that needs to be responsive 24×7 without exception, not even for a natural disaster, then no, this is not enough. If your application is a line-of-business application (even an important one), often it can withstand a rare outage under unusual circumstances, so this approach might be fine. Most applications are somewhere in between and you will need to exercise judgement in weighing the business value against the engineering investment and operational cost of a more resilient solution.

And while this post talked about how following some specific cloud architecture patterns to design cloud-native applications provides a great deal of out-of-the-box resilience in DR situations, it did not cover ongoing continuity, such as with computation, or immediate access to data from multiple data centers. If you rely entirely on the cloud platform to preserve your data, you may not have access to it for a while since (as mentioned earlier, and emphasized nicely in Neil’s comment) you don’t control all the failover mechanisms; you will need to wait until Microsoft decides to fail over the DNS for Windows Azure Storage, for example. And remember that background geo-replication does not guarantee zero data loss: some changes may be lost due to the additional latency needed in moving data across data centers, and not all data is geo-replicated (such as queued messages and some other data not discussed).

The ITIL term for “how much data can I stand to lose” is known as the recovery point objective (RPO). The ITIL term for “how long can I be down” is known as the recovery time objective (RTO). The RPO and RTO are useful concepts for modeling DR.
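One way to put RPO and RTO to work is to write them down and compare them against what a given approach is expected to deliver. The sketch below assumes hypothetical names (`RecoveryObjectives`, `meets_objectives`) and made-up numbers; the real values have to come from your business needs and from testing.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    rpo_minutes: float  # how much data (measured in time) can I stand to lose
    rto_minutes: float  # how long can I be down


def meets_objectives(objectives, expected_data_loss_minutes, expected_downtime_minutes):
    """Check an approach's expected behavior against both objectives."""
    return (
        expected_data_loss_minutes <= objectives.rpo_minutes
        and expected_downtime_minutes <= objectives.rto_minutes
    )


# Hypothetical line-of-business application: can lose 15 minutes of data
# and tolerate 8 hours of downtime.
line_of_business = RecoveryObjectives(rpo_minutes=15, rto_minutes=8 * 60)

# Hypothetical estimate for relying on platform geo-replication alone:
# a few minutes of unreplicated updates and a few hours awaiting failover.
platform_only_ok = meets_objectives(line_of_business, 5, 4 * 60)

# A stricter application that cannot be down for more than 5 minutes.
always_on = RecoveryObjectives(rpo_minutes=1, rto_minutes=5)
always_on_ok = meets_objectives(always_on, 5, 4 * 60)
```

Under these assumed figures the line-of-business application is within its objectives and the always-on application is not, which is the distinction drawn above between applications that can withstand a rare outage and those that cannot.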

So the DR capabilities built into cloud platform services are powerful, but somewhat short of all-encompassing. However, they do offer a toolbox providing you with unprecedented flexibility in making this happen.

Is This Specific to the Cloud?

The underlying need to understand RPO and RTO and use them to model for DR is not specific to the cloud. These are very real issues in on-premises systems as well. The approaches to addressing them may vary, however.

Generally speaking, while the cloud does not excuse you from thinking about these important characteristics, it does provide some handy capabilities that make it easier to overcome some of the more challenging data-loss threats. Hopefully this allows you to sleep better at night.

—-

Bill Wilder is the author of the book Cloud Architecture Patterns – Develop Cloud-Native Applications from O’Reilly. This post complements the content in the book. Feel free to connect with Bill on twitter (@codingoutloud) or leave a comment on this post. (He’s also warming up to Google Plus.)

book-cover-medium

—-

Spoke at Boston Code Camp #18 – Cloud Architecture Patterns for Building Cloud-Native Applications: align your architecture with the cloud’s architecture

What a joy to be part of Boston’s 18th Code Camp! First, many thanks to the organizing team and helpers:

And of course nothing would happen without support from the Sponsors:

Here is the abstract for my talk:

Just because we get an application to run on cloud infrastructure does not ensure that it runs well. To truly take advantage of the cloud we need to build cloud-native applications. The architecture of a cloud-native application is different than the architecture of a traditional application. A cloud-native application is architected for cost-efficiency, availability, and scalability. We will examine several key architecture patterns that help unlock cloud-native benefits, spanning computation, database, and resource-focused patterns. By the end of the talk you should appreciate how cloud architecture is more demanding than you might be accustomed to in some areas, but with high payoff such as handling failure without downtime, scaling arbitrarily, and allowing aggressive cost-optimization.

All the concepts and patterns I spoke about are also discussed in my recently released book, Cloud Architecture Patterns:

Cloud Architecture Patterns book

More info on the book is here:

  • www.cloudarchitecturepatterns.com
  • If you do read the book and find it of value, I’d very much appreciate you considering a short review on Amazon, O’Reilly, or Barnes & Noble.

Got Azure or Cloud questions? Feedback on the book? Just want to stay in touch? Please feel free to reach out.

  • I can be reached via twitter (@codingoutloud)
  • I can also be reached via email (which is the same as my twitter handle at gmail).
  • You can also find me through my blog (blog.codingoutloud.com)

The slide deck I used is here:

Architecture Patterns for Building Cloud-Native Applications — Boston Code Camp 18 — 20-Oct-2012 — Bill Wilder (blog.codingoutloud.com) – WITH SOME SLIDES HIDDEN

Cloud-Native Architecture Patterns for Azure Florida Association

I just finished speaking to the Azure Florida Association about Cloud-Native Architecture Patterns…  The talk was 7:00-8:30 PM, so I hope not too many people were watching from bed… :-) (It is extra tough speaking to attendees you can’t see or hear.)

The slide deck from the talk is here:

florida–cloud-architecture-patterns-three-big-ideas–bill-wilder–28-march-2012.

The abstract for the talk reads as follows:

We can run pre-cloud software on cloud infrastructure, but to truly take advantage of the cloud we need to build cloud-native applications. The architecture of a cloud-native application is different than the architecture of a traditional pre-cloud application, and in this talk we will examine several big ideas in software architecture you need to ‘grok’ if you want to truly leverage the cloud for cost savings, higher availability, and better scalability. We will examine several key architecture patterns that help unlock those cloud-native benefits, spanning computation, database, and resource-focused patterns. By the end of the talk you should appreciate how cloud architecture is more of a partnership with your hardware than it was with pre-cloud applications (fail-retry anyone?), that the cloud may be infinite (but not all at once), and how the cloud enables cost optimization (and is “green”).

Feel free to reach out to me – my email address is in the slide deck, and my twitter handle is @codingoutloud. Also – if you are in the neighborhood – check out the Boston Azure User Group and our planned Boston Azure Bootcamp in June 2012! You may want to follow the adventures of Mr. SQL Azure Federations

Windows Azure DevCamp in Farmington, CT

Earlier this month I hung out with Jim O’Neil at the Farmington, CT offering of the Windows Azure DevCamp series. The format of the camp was a quick-ramp introduction to the Windows Azure Platform followed by some hands-on coding on the RockPaperAzure challenge.

Jim introduced cloud and presented specifics on Blob and Table storage services and SQL Azure. I had the opportunity to present one of the sections – mine was a combination of Windows Azure Compute services + the Windows Azure Queue service with some basics around using these services to assemble “cloud native” applications. The official slides for the Windows Azure DevCamp series appear to be here, though my slides were a little different and are also available (WindowsAzureDeveloperCamp-FarmingtonCT-07Dec2011-BillWilder). At the end, Jim also ran through the creation of a RockPaperAzure “bot” and it was (literally!) game on as attendees raced to create competitive entries.

I took a few photos at the event – some of Jim presenting, some showing participants at the end coming to claim their prizes from the RockPaperAzure challenge – and none from the middle!

Cloud Architecture Patterns on Azure with North Shore .NET User Group

Last week on Wednesday I went to hang out with a bunch of nice folks in Ipswich, MA at the 2nd meeting of the North Shore .NET User Group. It was an especially fun group with beer served before the talk! :-)

I spoke about Cloud Architecture Patterns like sharding, NoSQL, queue-based compute separation for scalability and reliability – with specific examples from the Windows Azure Platform such as SQL Azure Federations, Azure Table Storage, and Web Role + Queue + Worker Role patterns. The slides from my talk are here: nsnug-big-ideas-in-software-architecture-bill-wilder-14-dec-2011. (UPDATE: Note that I don’t seem to have the exact deck I used for the talk. As Ryan CrawCour pointed out, the deck I posted claims that SQL Azure is limited to 50 GB and Federations has not yet shipped, but at the talk I am certain I presented the 150 GB limit and a recently released Federations. I think I made the changes on the train en route to the event and somehow didn’t save them. Sorry! We’ll need to live with this small skew. Post a comment here if there are questions…)

There was definitely some good discussion and many questions. In fact, the following question came up, and I didn’t have a great response, but turns out there’s a timely response from Mr. SQL Azure Federations himself: http://blogs.msdn.com/b/cbiyikoglu/archive/2011/12/15/so-isn-t-the-root-database-a-bottleneck-for-federations-in-sql-azure.aspx

Also hope to see some nsnug folks at future Boston Azure User Group meetings and our planned Boston Azure Bootcamp in June 2012!

Visiting Harvard – CSCI E-175 – Cloud Computing and Software

Just got back from Harvard where I teamed up with Jim O’Neil to talk about the Windows Azure Cloud Platform to the class CSCI E-175 Cloud Computing and Software as a Service. This was at the invitation of Dr. Zoran B. Djordjevic – who also hosted us last year, and the year before that it was Jim and some guy named Chris.

Like last year, the class was engaged, asking tough and interesting questions… which is all the more impressive since this class meets on FRIDAY NIGHT. Must be a Harvard thing… Anyhow, we went from around 5:30 – 8:00… ON FRIDAY NIGHT. :-)

Below are the resources I mentioned at the end of my talk, and the slide deck I used is here: Harvard-WhyAzureIsAwesome-Bill-Wilder-04-Nov-2011

Also, hope to see all of you at Boston Azure user group meetings! Feel free to contact me with any follow-up questions.

The slide deck Jim O’Neil used is here, plus here are a few action shots of Jim doing his thang:
