How Twitter is doing its part to democratize big data
By: GigaOM
Twitter has been on a tear lately when it comes to open sourcing big-data tools. The latest two are Cassie, a client for managing Cassandra clusters, and Scalding, a MapReduce framework for simplifying the creation of Hadoop jobs. Big data won't be black magic forever.

Twitter has been on a tear lately when it comes to open sourcing big-data tools. The latest two are Cassie — a Scala client for managing Twitter’s 1,000-plus-node Cassandra cluster — and Scalding — a MapReduce framework for simplifying the creation of Hadoop jobs. If you think big data will be black magic forever, think again.

Twitter has been fairly active on the open source front for the past few years, and because it works with so much data, many of the tools it has released are built for handling data at scale. Among its various open source contributions are Gizzard, a middleware framework for distributed databases; FlockDB, a graph database of sorts for managing the Twitter social graph; and Storm, a stream-processing engine to handle data in real time.

Of the two latest releases, Scalding is probably the more interesting, given the general fervor over Hadoop across the IT world. In a recent Twitter Engineering blog post, Twitter data scientist Edwin Chen described Scalding this way:

Scalding is an in-house MapReduce framework that Twitter recently open-sourced. Like [Apache] Pig, it provides an abstraction on top of MapReduce that makes it easy to write big data jobs in a syntax that’s simple and concise. Unlike Pig, Scalding is written in pure Scala — which means all the power of Scala and the JVM is already built-in. No more UDFs, folks! …

In 140: Instead of forcing you to write raw map and reduce functions, Scalding allows you to write natural code like:
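
To give a flavor of that style, here is a minimal sketch written against plain Scala collections rather than Scalding itself, so it runs without Hadoop or any extra libraries. It is not Chen's original snippet; the object name `TweetLengthHistogram` and the sample tweets are hypothetical. It builds a histogram of tweet lengths by mapping each tweet to its length and then grouping and counting, which is the same map-then-group shape a MapReduce job follows.

```scala
// Illustrative only: plain Scala collections, not the Scalding API.
object TweetLengthHistogram {
  // Map each tweet to its character count, then group equal counts
  // together and record how many tweets fall into each group.
  def histogram(tweets: Seq[String]): Map[Int, Int] =
    tweets
      .map(tweet => tweet.length)                            // "map" step
      .groupBy(identity)                                     // shuffle by key
      .map { case (length, group) => length -> group.size }  // "reduce" step

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input.
    val sample = Seq("hello", "world", "big data", "hi")
    histogram(sample).toSeq.sorted.foreach { case (length, count) =>
      println(s"$length\t$count")
    }
  }
}
```

A framework such as Scalding aims to let the same kind of pipeline run over data stored in Hadoop; the collection calls above only mirror the general shape.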

Chen also illustrates some simple use cases for Scalding, such as finding similarities among people's movie interests or their Foursquare check-ins. In the movie example, Chen shows the code needed to collect and parse the data, along with a simple command that runs the job on Hadoop.
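
As a rough idea of what a movie-similarity job computes, the sketch below scores pairs of movies by how much their audiences overlap. It is an assumption-laden illustration, not Chen's code: the source does not say which similarity measure he uses, so Jaccard similarity is assumed here, and the names `MovieSimilarity`, `jaccard`, and `similarities` as well as the `(user, movie)` input shape are hypothetical. It uses plain Scala collections so it can run on its own.

```scala
// Illustrative only: plain Scala collections, not the Scalding API.
object MovieSimilarity {
  // Jaccard similarity: users who rated both movies divided by
  // users who rated at least one of them.
  def jaccard(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else (a intersect b).size.toDouble / (a union b).size

  // ratings: hypothetical (user, movie) pairs.
  // Returns a score for every unordered pair of distinct movies.
  def similarities(ratings: Seq[(String, String)]): Map[(String, String), Double] = {
    // Group ratings by movie and collect the set of users for each one.
    val ratersByMovie: Map[String, Set[String]] =
      ratings.groupBy(_._2).map { case (movie, rs) => movie -> rs.map(_._1).toSet }

    val movies = ratersByMovie.keys.toSeq.sorted
    (for {
      (m1, i) <- movies.zipWithIndex
      m2 <- movies.drop(i + 1)
    } yield (m1, m2) -> jaccard(ratersByMovie(m1), ratersByMovie(m2))).toMap
  }
}
```

On a real cluster the pairing step is the expensive part, which is where a MapReduce framework earns its keep; this in-memory version only shows the logic.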

 

The moral of this story, of course, isn’t so much what Twitter is doing as it is the democratization of big data technologies. From startups to large software vendors to web companies like Twitter, tools are emerging that should make analytics on large data sets doable by individuals who don’t bear the job title “data scientist.”

When we plan conferences such as Structure:Data, which takes place later this month in New York, we’re always looking toward the future. The big data space is advancing so fast that it’s difficult to tell where the cutting edge will be a few years from now. What’s next when skills such as building recommendation engines and ad-targeting systems become commonplace or, better yet, services, and when managing distributed systems becomes child’s play?

Image courtesy of StoreEnvy.com
