Wednesday, January 18, 2017

Hadoop and Big Data

Update! At LDS BC, we cover the basics of storage, from early devices to object-based storage on large arrays of disks. As an exercise in the class, I have the students set up Hadoop standalone, then as a cluster within their team. This way they can see the Hadoop Distributed File System (HDFS) in action.

HDFS is designed to break a large file into many blocks and distribute those blocks across multiple nodes. Storage is spread across the cluster, and because HDFS keeps copies of each block on more than one node (three by default), the data is still available elsewhere even if you lose a node. If each node is a PC with server-level computing power and RAID hardware, it is extremely unlikely that any data will ever be lost. The blocks can also be processed in parallel, so information can be gathered from those files very quickly. This means that one huge file, terabytes or even petabytes in size, can be spread across many nodes and processed as quickly as 1,000 smaller files handled all at once. Hadoop grew out of work at search engine companies, which needed to process the large quantity of data available on the Internet. The design of the cluster defines roles that each node can play. The basic set includes NameNodes, which keep track of where the blocks live, and DataNodes, which store them. EMC has a cluster-based array that can provide DataNode services: EMC's Isilon cluster system can act as a Hadoop DataNode.
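To make the splitting and spreading idea concrete, here is a toy sketch in Python. It is not how HDFS is actually implemented: the function names, the tiny block size, the node names, and the round-robin placement are all assumptions made for illustration (real HDFS uses much larger blocks and a rack-aware placement policy). It only shows why keeping several copies of each block lets the cluster survive the loss of one node.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement rule for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def blocks_still_readable(placement: Dict[int, List[str]], failed_node: str) -> bool:
    """True if every block has at least one copy on a surviving node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


# A pretend 1000-byte "large file" cut into 128-byte blocks.
data = b"x" * 1000
blocks = split_into_blocks(data, block_size=128)

# Four made-up node names, three copies of every block.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(f"block {block_id}: {holders}")

print("readable after losing node2:", blocks_still_readable(placement, "node2"))
```

Because every block lives on three different nodes in this sketch, taking any single node away still leaves two copies of each block to read from.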

You have heard of data mining... Hadoop takes it to Big Data levels. Files stored in HDFS can each be parsed for useful information, and they do not even have to be in the same format. Originally this was done with MapReduce, which meant writing the jobs in Java. Since then, many interfaces have been developed that translate into MapReduce jobs. Pig is one of them: it lets you use a simple language to build jobs that extract data from files on HDFS. This means that a business can get business intelligence by combining dissimilar, unstructured data files: sales, customer visit and loyalty card data, credit card data, supply chain purchasing, truck delivery schedules, customer satisfaction surveys, and so on. Often new insights are found in seemingly unrelated data sets. That is why Big Data can be of such great benefit to businesses of all levels and sizes. Interestingly enough, setting up a system does not take much, and a regular IT team should be able to set one up for testing. Once the cluster is in place, the tools for building methods to search and analyze the data are becoming simpler and simpler. You no longer have to be a Java programmer. There are now GUI options as well.
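For anyone who has not seen the MapReduce idea before, here is a minimal sketch of the pattern in plain Python. It does not use Hadoop or its Java API at all; the function names, the "store,amount" record format, and the sample records are made up for illustration. The point is only the three steps: map each record to key/value pairs, group the pairs by key, then reduce each group to one result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, float]]:
    """Turn one 'store,amount' line into a (store, amount) pair."""
    store, amount = record.split(",")
    yield store.strip(), float(amount)


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[float]) -> Tuple[str, float]:
    """Collapse one key's values into a single total."""
    return key, sum(values)


def run_job(records: Iterable[str]) -> Dict[str, float]:
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sales records, one per line.
records = ["store_a, 10.0", "store_b, 5.5", "store_a, 2.5"]
totals = run_job(records)
print(totals)
```

On a real cluster the map step runs on many nodes at once, each working on the blocks it holds locally, which is where the speed comes from. Tools like Pig generate jobs of this shape for you.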

My next class will dive deeper into the Hadoop system.
As I learn more I'll post.

-Calvin
