Big data can mean a number of things, and it can guide business decisions in countless ways. It can involve analyzing sheer volumes of data, from hundreds of terabytes to petabytes. Or it can mean drawing data and information from unlikely sources: sensors on an assembly-line machine, Twitter and Facebook, or a company’s web visitor logs. Big data also covers analyzing streaming data, such as stock market tickers and the external factors that make the market go up or down. In all instances, at the core of big data is an effort to understand behavior, and to use that understanding to make predictions and guide smart next steps.
That said, questions abound about how to make the most of big data—and use it strategically to inform key decisions in your business or organization. While there’s no easy answer, and many companies don’t have the time or expertise to craft and implement a plan, the first step is understanding the tools and technologies behind big data—and their potential to deliver deep insights to your team.
When the conversation turns to big data, Apache Hadoop comes up time and again. But if you ask most people how Hadoop actually works, they likely won’t know. Keep reading, and we’ll do our best to explain.
In a nutshell, Hadoop is a large-scale data processing and storage system. It uses the MapReduce programming model (first developed by Google) to process data and the Hadoop Distributed File System (HDFS) to store it. Here’s an abridged version of how it works: a MapReduce job rolls up the original input data into aggregates grouped by the job’s “keys,” which can be anything the code dictates. The resulting per-key aggregates are then stored in HDFS, which spreads the data across many low-cost commodity servers.
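To make the roll-up by key more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names, the sample log lines, and the choice of "page" as the key are all assumptions made for illustration. It only mimics the three steps a MapReduce job goes through: emit key/value pairs, group them by key, and aggregate each group.

```python
from collections import defaultdict

# Hypothetical input: one web log line per visit, "<ip> <page>".
log_lines = [
    "10.0.0.1 /home",
    "10.0.0.2 /pricing",
    "10.0.0.1 /home",
    "10.0.0.3 /home",
]

def map_phase(records):
    """Emit one (key, value) pair per record; the key here is the page."""
    for record in records:
        page = record.split()[1]
        yield page, 1

def shuffle(pairs):
    """Group all values that share a key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Roll each key's values up into a single aggregate (a hit count)."""
    return {key: sum(values) for key, values in grouped.items()}

page_hits = reduce_phase(shuffle(map_phase(log_lines)))
print(page_hits)  # {'/home': 3, '/pricing': 1}
```

Changing what `map_phase` emits as the key (the visitor's IP address instead of the page, for example) changes what the aggregates describe, which is what "keys can be anything the code dictates" means in practice.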
What makes Hadoop so popular and powerful? Its strength lies in how easily it scales out: a cluster can grow to thousands of computers, and each one added improves job performance and provides more data storage. The tasks that break down the data run in parallel across the many servers in the Hadoop cluster.
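The parallelism described here can be imitated on a single machine. The sketch below is only an analogy, under the assumption that each worker thread stands in for one server in a cluster: the input is split into chunks, each worker aggregates its own chunk independently, and the partial results are merged at the end. The names `count_chunk`, `split_into_chunks`, and `parallel_count`, along with the sample data, are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(chunk):
    """Work done by one 'server': count page hits in its own chunk only."""
    return Counter(line.split()[1] for line in chunk)

def split_into_chunks(lines, num_workers):
    """Deal the input out round-robin, one chunk per worker."""
    return [lines[i::num_workers] for i in range(num_workers)]

def parallel_count(lines, num_workers=3):
    """Count chunks concurrently, then merge the partial counts."""
    chunks = split_into_chunks(lines, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical input: one web log line per visit, "<ip> <page>".
visits = [
    "10.0.0.1 /home",
    "10.0.0.2 /pricing",
    "10.0.0.1 /home",
    "10.0.0.3 /home",
    "10.0.0.4 /pricing",
    "10.0.0.5 /contact",
]
totals = parallel_count(visits, num_workers=3)
print(dict(totals))
```

Because each chunk is counted independently, the merged totals do not depend on how many workers are used. That independence is the property that lets a cluster add machines to finish the same job faster.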
Source: SmartCollectiveData, “A technical look at BigData” authored by Chuck Rivel