Apache Hadoop

Apache Hadoop is an open-source solution for distributed computing on big data.

Big data is a marketing term for data mined from sources such as search engines and grocery-store buying patterns tracked through points cards. The internet now has so many sources of data that, more often than not, the scale makes the data unusable without processing, and that processing would take an incredible amount of time on any one server. Enter Apache Hadoop.

PROS

  • Excellent way to use powerful MapReduce and distributed file system functions to process extremely large collections of data
  • Is open source to use on your own hardware clusters
  • Can be utilized through popular cloud platforms like Microsoft Azure and Amazon EC2

CONS

  • Not for the layman: managing and using it requires some technical expertise
  • Runs primarily on Linux, so it is not for all users

Less time for data processing

By using the Hadoop architecture to distribute processing tasks across multiple machines on a network, processing times drop dramatically and answers can be determined in a reasonable amount of time. Apache Hadoop is split into two components: a storage component (the Hadoop Distributed File System, HDFS) and a processing component (MapReduce). In the simplest terms, Hadoop makes one virtual server out of multiple physical machines. In actuality, Hadoop manages the communication between the machines so that they work together closely enough to appear as a single machine working on the computations. The data is distributed across multiple machines for storage, and processing tasks are allocated and coordinated by the Hadoop architecture.

This type of system is a requirement for converting raw data into useful information at the scale of big data inputs. Consider the amount of data Google receives every second from users entering search requests. Faced with that data as one lump, you wouldn't know where to start, but Hadoop automatically divides the data set into smaller, organized subsets and assigns these manageable subsets to specific resources. All results are then reported back and assembled into usable information.
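The split, process, and assemble flow described above can be illustrated with a small sketch. This is not Hadoop code and does not use any Hadoop API; it is a plain Python imitation of the idea, run on a single machine. The function names (`split_into_chunks`, `map_chunk`, `reduce_counts`, `word_count`) and the sample search queries are made up for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, num_chunks):
    """Divide the input into subsets, one per (hypothetical) worker machine."""
    return [records[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk):
    """Map step: count the words in one subset.

    On a real cluster each subset would be handled by a different machine.
    """
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(records, num_chunks=3):
    chunks = split_into_chunks(records, num_chunks)
    # The subsets are processed one after another here; a cluster would
    # process them in parallel and report the partial results back.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(reduce_counts, partials, Counter())


# Made-up sample input standing in for a stream of search requests.
queries = [
    "hadoop tutorial",
    "big data hadoop",
    "cheap flights",
    "hadoop cluster setup",
]

totals = word_count(queries)
print(totals.most_common(1))  # [('hadoop', 3)]
```

The point of the sketch is only the shape of the computation: each subset is small enough to handle on its own, and the final answer is assembled by merging the partial results.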

A server that is easy to set up

Although the system sounds complex, most of the moving parts are hidden behind abstraction. Setting up the Hadoop server is fairly simple: install the server components on hardware that meets the system requirements. The harder part is planning the network of computers that the Hadoop server will use to distribute the storage and processing roles. This can involve setting up a local area network or connecting multiple networks across the Internet. You can also use existing cloud services and pay for a Hadoop cluster on popular cloud platforms like Microsoft Azure and Amazon EC2. These are even easier to configure, as you can spin them up ad hoc and then decommission the clusters when you no longer need them. Clusters of this type are ideal for testing because you only pay for the time the Hadoop cluster is active.

Process your data to get the information you need

Big data is an extremely powerful resource, but data is useless unless it can be properly categorized and turned into information. At present, Hadoop clusters offer an extremely cost-effective method for processing these collections of data into information.
