Quick Hadoop Overview at Feb 7 Boston Azure Meeting

Tonight’s (07-Feb-2012) Boston Azure cloud user group meeting (that we ran jointly with Microsoft DevBoston) went very well.

In the featured talk, Michael Stiefel gave an insightful, thought-provoking presentation on Architecting for Failure: Why Cloud Architecture is Different. Michael ~~will post~~ has posted his slides. They are listed under Cloud Computing Presentations as Architecting For Failure, Cloud Architecture is Different! (note this is not the same as Michael’s blog).

As a warmup, I gave a short talk describing the challenges in making sense of Big Data, what (in the computer science sense) Map and Reduce are, and how the Hadoop infrastructure makes building MapReduce processes so much easier. I ended with a quick peek at “Hadoop as a Service” (the Microsoft Windows Azure service that is currently in CTP) at www.hadooponazure.com. The talk focused on Hadoop, and a simple Hadoop example at that, only mentioning that there is a broader Hadoop ecosystem: the official Apache Hadoop project, some subprojects (HIVE, Pig (which has a Pig Latin language, not to be confused with the children’s word game), Mahout, ZooKeeper, HBASE, and others), some other related efforts (Cascading.org), and some commercial companies dedicated to Hadoop (Cloudera, Hortonworks, and others; they are roughly the Hadoop equivalents of Red Hat in the Linux world, and Microsoft is working with Hortonworks on their Hadoop on Azure and Hadoop on Windows Server effort).
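For anyone who missed the warmup, here is a minimal sketch of the Map and Reduce ideas in plain Python. It is not the example from my slides and it does not use Hadoop at all: everything runs in one process, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. The `shuffle` step stands in, very roughly, for the grouping by key that a MapReduce framework does between the two phases.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key (a single-process stand-in for a framework's shuffle)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: fold all the counts for one word into a single total."""
    return key, reduce(lambda a, b: a + b, values, 0)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

The appeal of the model is that `map_phase` looks at one line at a time and `reduce_phase` looks at one key at a time, so in principle both can be spread across many machines. Distributing that work, moving the data, and coping with failures is the part a framework like Hadoop takes on.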

My Hadoop slides are attached here: Hadoop-BostonAzure-07-Feb-2012.

I also discussed some upcoming Azurey events of interest to the Boston Azure community. That deck is here: Upcoming Events of Interest to Boston Azure Community.

O’Reilly Radar has a concise roundup of some of these technologies (which I noticed in a tweet). And this excerpt from the official Apache Hadoop project lists some related technologies:

The project includes these subprojects:

Hadoop Common: The common utilities that support the other Hadoop subprojects.

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:

Avro™: A data serialization system.

Cassandra™: A scalable multi-master database with no single points of failure.

Chukwa™: A data collection system for managing large distributed systems.

HBase™: A scalable, distributed database that supports structured data storage for large tables.

Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout™: A Scalable machine learning and data mining library.

Pig™: A high-level data-flow language and execution framework for parallel computation.

ZooKeeper™: A high-performance coordination service for distributed applications.