By Bob Picciano
Over the weekend, a room full of top developers competed in a hackathon in San Francisco, vying for bragging rights in building applications on top of the Spark data-processing engine. The winners will be announced later, but, based on the results of an internal IBM hackathon a few weeks ago, I can give you the bottom line: these competitions show that Spark could shake up data analytics just as the Linux operating system blew the lid off the Internet a decade ago.
Today, large-scale data processing is available mainly to corporations, government agencies and universities. Spark, an open source software project under the Apache Software Foundation umbrella, has the potential to place these capabilities at the fingertips of all types of people and organizations all over the world. The goal: deeper and faster insights.
IBM is announcing today that we’re backing Spark by committing more than 3,500 researchers and developers to work on Spark-related innovations and to collaborate with the Spark open-source community to enhance the technology and push it in new directions. We’re going to embed Spark into our analytics and commerce platforms. And we’re contributing our SystemML machine learning technology to the Spark community.
When IBM put its muscle behind Linux in 1999, that move marked the beginning of Linux’s ascendancy in corporations and Internet-class data centers. The same sort of thing could happen now with Spark.
Already, Spark is helping enterprises transform the way they do business. For instance, Independence Blue Cross, a health insurer serving 7 million people nationwide, uses Spark to accelerate collaboration between its own researchers and academic partners with the goal of getting new claims and benefits apps built and available to customers much faster.
I’m guessing that many people reading this post have never heard of Spark, so let me tell you a little about it. The technology was invented in 2009 by researchers at the University of California at Berkeley led by a Romanian computer genius, Matei Zaharia. They were searching for ways to speed up the processing of unstructured data–information that’s not organized in the columns and rows of a traditional database.
In the past, it was very difficult to analyze large quantities of such data. Then along came a technology called Hadoop, which made it easier to process the data using clusters of computers. Spark is a younger cousin to Hadoop. The technology is particularly good at analyzing data when it’s stored in computer memory rather than on disks–improving performance by 100X in some cases. It’s especially useful for handling machine learning algorithms.
Spark doesn’t just make it possible to crunch huge amounts of data really fast; it also enables developers to innovate rapidly. That quality was amply demonstrated at our internal Spark hackathon a few weeks ago. Thousands of IBM programmers came to an internal Web site to learn about Spark. We gave them three weeks to form teams and develop “moon shot” projects. And they responded energetically, producing 100 impressive applications, software that could really matter in the world.
We didn’t give our programmers any training in using Spark before they plunged into the hackathon, and that points to another of the technology’s winning attributes: ease of use. It’s easy to learn, easy to program with and easy to bring existing algorithms into.
Most people call Spark a data analytics engine or a programming framework, but I see things a little differently. To me it’s really an analytics operating system. Like Linux, it’s a foundation upon which developers of all types, from startups to giant corporations, can build applications. We’re making it even easier for developers to build applications using Spark by hosting it on our Bluemix cloud-development platform. We’re also committed to helping train at least 1 million data scientists and data engineers on the Spark technology.
Spark is already one of the most dynamic open source communities, and I believe it could become the most important open source project globally over the next decade. This technology has great potential to accelerate the pace of innovation in data analytics. IBM wants to help our clients and partners make the most of it.