Q. Some cloud environments support running MapReduce. Can I do this in Windows Azure?
A. Yes, you can run MapReduce in Windows Azure. Below are some pointers, followed by a few other options that might be even more useful or powerful, depending on what you are doing.
Summary of the most obvious Azure-oriented choices: (1) Apache Hadoop on Azure, (2) LINQ to HPC leveraging Azure, or (3) Daytona MapReduce on Azure.
The first approach is to use the open source Apache Hadoop project which implements MapReduce. Details on how to run Hadoop on Azure are available on the Distributed Development Blog.
[Update 14-Oct-2011: Check out this write-up by Ted Kummert about his keynote at PASS where he discussed deeper Hadoop support for Windows Azure: “Microsoft makes this possible through SQL Server 2012 and through new investments to help customers manage ‘big data’, including an Apache Hadoop-based distribution for Windows Server and Windows Azure and a strategic partnership with Hortonworks. Our announcements today highlight how we enable our customers to take advantage of the cloud to better manage the ‘currency’ of their data.” Also, Avkash Chauhan provides a nice summary of the announcement.]
The MapReduce tutorial on the Apache Hadoop project site explains the goal of the project, followed by detailed steps on how to use the software.
“Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.” – from Overview section of Hadoop MapReduce tutorial
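If you have not seen the MapReduce model before, the classic word-count example gives a feel for it. What follows is only a minimal, single-process sketch in plain Python, not Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real framework would spread these same phases across many nodes and handle failures for you.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine (illustrative only)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the cat", "the dog"]))
```

The map and reduce functions never share state, which is what lets a framework like Hadoop run them in parallel on separate machines.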
Another entrant in this Big Data Analytics space is LINQ to HPC. For more details on LINQ to HPC, check out David Chappell’s whitepaper called Introducing LINQ to HPC: Processing Big Data on Windows. Chappell explains the value proposition, and also talks about when you might use it versus using SQL Server Parallel Data Warehouse. LINQ to HPC beta 2 is available for download.
[Update 19-July-2011: Daytona enters the fray] “Microsoft has developed an iterative MapReduce runtime for Windows Azure, code-named Daytona.” It has been available for download since early July 2011, though it comes with a non-commercial-use-only license. (credit: saw it on the insideHPC blog)
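For anyone wondering what “iterative” MapReduce means in practice: some algorithms need to run the same map and reduce steps over and over, feeding each round's output into the next, until the result stops changing. Here is a rough sketch in plain Python using one-dimensional k-means as the example. To be clear, this is not Daytona's API (I am not showing any of its real calls); the names map_point, reduce_cluster, and iterative_kmeans are hypothetical, and the choice of k-means is my own assumption about a typical iterative workload.

```python
from collections import defaultdict


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step: the new centroid is the mean of the points assigned to it."""
    return sum(points) / len(points)


def iterative_kmeans(points, centroids, max_rounds=20, tolerance=1e-9):
    """Repeat map + reduce rounds until the centroids stop moving (illustrative only)."""
    for _ in range(max_rounds):
        groups = defaultdict(list)
        for point in points:
            key, value = map_point(point, centroids)
            groups[key].append(value)
        # Keep the old centroid if no point was assigned to it this round.
        updated = [
            reduce_cluster(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))
        ]
        if all(abs(new - old) <= tolerance for new, old in zip(updated, centroids)):
            return updated
        centroids = updated
    return centroids


if __name__ == "__main__":
    print(iterative_kmeans([1.0, 2.0, 9.0, 10.0], [0.0, 5.0]))
```

With plain MapReduce, each round would be a separate job that rereads its input; as I understand it, the appeal of an iterative runtime is that it is built with this loop in mind.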
[Update 19-July-2011: It is now clear that LINQ to HPC (available in beta 2!) is supplanting DryadLINQ.]
You may also be interested in checking out DryadLINQ from Microsoft Research. Though not identical to MapReduce, they describe it as “a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.” As of this writing it was not licensed for commercial use, but was available under an academic use license. (With the introduction of LINQ to HPC, I can’t tell whether these projects are related, or whether LINQ to HPC is the productized version of DryadLINQ.)
And, finally, I also just read an interesting post called Hadoop is the Answer! What is the Question? by Tim Negris. He raises some good points about the maturity of Hadoop, among other things. If you are thinking about MapReduce, Hadoop, DryadLINQ, or other approaches, give his article a read.
[05-June-2011 updates] Added info from David Chappell and Tim Negris.
Is this useful? Did I leave out something interesting or get something wrong? Please let me know in the comments! Think other people might be interested? Spread the word!