3. Big Data Analysis, Hadoop and MapReduce



TOPICS: Big Data Analysis, Hadoop and MapReduce: Introduction, Clustering Big Data, Classification of Big Data, Hadoop MapReduce Job Execution, Hadoop scheduling, Hadoop cluster setup, configuration of Hadoop, starting and stopping Hadoop cluster.


3.1 INTRODUCTION (BIG DATA ANALYSIS)

Big Data is unstructured data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the strictures of conventional database architectures. This information comes from multiple, distinct, independent sources with complex and evolving relationships, and it keeps growing day by day. There are three main challenges in Big Data: data accessing and arithmetic computing procedures; semantics and domain knowledge for different Big Data applications; and the difficulties raised by Big Data volumes, distributed data distribution, and complex and dynamic data characteristics.

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on." Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data sets in areas including Internet search, finance and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology and environmental research.


To handle the above challenges, the Big Data framework is divided into three tiers, as shown in figure 1.
Tier I, data accessing and computing, focuses on data accessing and arithmetic computing procedures. Because large amounts of information are stored at different locations and are growing rapidly day by day, an effective computing platform such as Hadoop must be considered for computing distributed, large-scale information.

Tier II, data privacy and domain knowledge, focuses on the semantics and domain knowledge of different Big Data applications. In a social network, for example, users are linked with one another and share their knowledge; these relationships are represented by user communities, leaders within each group, social influence modelling and so on. Understanding such semantics and application knowledge is therefore important both for low-level data access and for high-level mining algorithm design.
Tier III, Big Data mining algorithms, focuses on the difficulties raised by Big Data volumes, distributed data distribution, and complex and dynamic data characteristics. The figure shows the three tiers of Big Data analysis.


Big Data is now part of every sector and function of the global economy. Big Data is a collection of datasets so large and complex that it is beyond the ability of typical database software tools to capture, store, manage and process the data within a tolerable elapsed time. A typical domain such as the stock market constantly generates a large quantity of information, such as bids, buys and puts, every single second.

3.2 CLASSIFICATION

Classification is a data mining technique that organizes unstructured data into structured classes and groups; it helps the user with knowledge discovery and future planning, and it supports intelligent decision making. There are two phases in classification. The first is the learning phase, in which a large training data set is supplied and analysed, and rules and patterns are created. The second phase is evaluation, or testing, of data sets, which establishes the accuracy of the classification patterns. This section briefly describes supervised classification methods such as the Decision Tree and the Support Vector Machine.

1. Decision Tree (DT)
A Decision Tree is well suited as a filter for handling large amounts of data. As a basic classification method, a DT can achieve satisfactory efficiency and accuracy on such datasets. The Decision Tree algorithm offers a good balance between precision and speed: it can be trained very quickly and provides sound results on classification data.
Big Data is now rapidly expanding in all domains with the fast development of networking and the increase in data storage and collection capacity.
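As a minimal sketch of the two classification phases described above, assuming Python with scikit-learn and using the small built-in Iris dataset as a stand-in for real Big Data (the parameters are illustrative, not prescribed by the text):

    # Learning phase: train a Decision Tree on a training set;
    # evaluation phase: test the learned patterns and measure accuracy.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    clf = DecisionTreeClassifier(max_depth=3, random_state=42)   # learning phase
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)                                 # evaluation phase
    print("Test accuracy:", accuracy_score(y_test, y_pred))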

2. Support Vector Machine (SVM)
Machine Learning enables a computer to learn by using algorithms and techniques that perform different tasks and activities to provide efficient learning. The main problems are how to represent complex data and how to exclude bogus data. The Support Vector Machine is a Machine Learning tool for classification based on Supervised Learning; it assigns points to one of two disjoint half-spaces. It uses a nonlinear mapping to convert the original data into a higher-dimensional space. Its objective is to construct a function that correctly predicts the class to which a new point belongs, as well as the classes of the existing points.
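A minimal SVM sketch, assuming Python with scikit-learn; the RBF kernel plays the role of the nonlinear mapping, and the synthetic two-class data is purely illustrative:

    # Two classes that are not linearly separable in the original space;
    # the RBF kernel maps them implicitly into a higher-dimensional space.
    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    svm = SVC(kernel="rbf", C=1.0, gamma="scale")  # nonlinear mapping via the kernel
    svm.fit(X_train, y_train)                      # construct the separating function

    print("Predicted classes for five new points:", svm.predict(X_test[:5]))
    print("Test accuracy:", svm.score(X_test, y_test))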



3.3  CLUSTERING

Clustering organizes a collection of n objects into a partition or a hierarchy (a nested set of partitions). Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis.

However, clustering is a combinatorially difficult problem, and differences in assumptions and contexts among different communities have made the transfer of useful generic concepts and methodologies slow to occur.

Components of a Clustering Task
Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]; a short code sketch illustrating them follows the list:
(1) pattern representation (optionally including feature extraction and/or selection),
(2) definition of a pattern proximity measure appropriate to the data domain,
(3) clustering or grouping,
(4) data abstraction (if needed), and
(5) assessment of output (if needed).
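The steps can be sketched in Python with scikit-learn, using k-means as the grouping step (the scaling, the Euclidean proximity measure implicit in k-means, and the silhouette assessment are illustrative choices, not prescribed by Jain and Dubes):

    # Illustrative walk-through of the five clustering steps with k-means.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

    # (1) pattern representation: here, simple feature scaling
    X_scaled = StandardScaler().fit_transform(X)

    # (2) proximity measure: k-means implicitly uses Euclidean distance
    # (3) clustering / grouping
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

    # (4) data abstraction: represent each cluster by its centroid
    print("Cluster centroids:\n", km.cluster_centers_)

    # (5) assessment of output: silhouette score as a simple validity index
    print("Silhouette score:", silhouette_score(X_scaled, km.labels_))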



Clustering Algorithms
Hundreds of clustering algorithms are available; many are “admissible”, but no algorithm is “optimal”.
• K-means
• Gaussian mixture models
• Kernel K-means
• Spectral Clustering
• Nearest neighbor
• Latent Dirichlet Allocation

Why clustering is required
• Not feasible to “label” a large collection of objects
• No prior knowledge of the number and nature of groups (clusters) in data
• Clusters may evolve over time
• Clustering provides efficient browsing, search, recommendation and organization of data


3.4 HADOOP MAPREDUCE JOB EXECUTION

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. Conceptually similar approaches have been very well known since 1995 with the Message Passing Interface standard having reduce and scatter operations.

A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
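As an illustration, the classic word-count job can be expressed as a map step that emits (word, 1) pairs and a reduce step that sums the counts for each word. The sketch below is written in Python in the style of Hadoop Streaming, reading from standard input and writing tab-separated key/value pairs; the script names mapper.py and reducer.py are assumed for illustration only:

    # --- mapper.py: emit "word<TAB>1" for every word read from standard input ---
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    # --- reducer.py: sum the counts per word; the input arrives sorted by key,
    # --- as the MapReduce shuffle/sort stage (or a local `sort`) guarantees ---
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        if not line.strip():
            continue
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

The pipeline can be tried locally with "cat input.txt | python mapper.py | sort | python reducer.py", which mimics the map, shuffle/sort and reduce stages.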

The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. The key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine once. As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations. The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop.

How Hadoop executes MapReduce (MapReduce v1) Jobs
  • To begin, a user runs a MapReduce program on the client node which instantiates a Job client object.
  • Next, the Job client submits the job to the JobTracker (a sample command-line submission is sketched after this list).
  • Then the JobTracker creates a set of map and reduce tasks, which get sent to the appropriate TaskTrackers.
  • The TaskTracker launches a child process, which in turn runs the map or reduce task.
  • Finally, the task continuously updates the TaskTracker with status and counters and writes its output to its context.
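For example, with Hadoop Streaming the word-count scripts sketched in the previous section could be submitted from the client node roughly as follows (the streaming jar path and the HDFS input/output directories are placeholders that vary by installation):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /user/hadoop/input \
        -output /user/hadoop/wordcount-out \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py -file reducer.py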

When TaskTrackers send heartbeats to the JobTracker, they include other information such as task status, task counters, and (since TaskTrackers are usually configured as file servers) data read/write status.
Heartbeats tell the JobTracker which TaskTrackers are alive. When the JobTracker stops receiving heartbeats from a TaskTracker:
• The JobTracker reschedules the tasks that were on the failed TaskTracker to other TaskTrackers.
• The JobTracker marks the TaskTracker as down and will not schedule subsequent tasks there.


3.5 HADOOP SCHEDULING

  • Two schedulers are available in Hadoop, the Fair Scheduler and the Capacity Scheduler.
  • The Fair scheduler is the default. Resources are shared evenly across pools and each user has its own pool by default. You can configure custom pools and guaranteed minimum access to pools to prevent starvation. This scheduler supports pre-emption.
  • Hadoop also ships with the Capacity scheduler; here resources are shared across queues. You may configure hierarchical queues to reflect organizations and their weighted access to resources. You can also configure soft and hard capacity limits to users within a queue. Queues have ACLs to prevent rogues from accessing the queue. This scheduler supports resource-based scheduling and job priorities.
  • In MRv1, the effective scheduler is defined in $HADOOP_HOME/conf/mapred-site.xml. In MRv2 (YARN), the scheduler is defined in $HADOOP_HOME/etc/hadoop/yarn-site.xml. A configuration sketch is shown after this list.
  • The goal of the Fair Scheduler is to give all users equitable access to cluster resources across pools. Each pool has a configurable guaranteed capacity in slots, and within a pool resources are divided among its jobs. Jobs are placed in flat pools; the default is one pool per user.
  • The Fair Scheduler algorithm works by dividing each pool’s minimum map and reduce slots among its jobs. When a slot is free, the algorithm allocates it to a job that is below its minimum share or, otherwise, to the most starved job. Jobs from "over-using" users can be preempted.
  • You can examine the configuration and status of the Hadoop fair scheduler in the MCS, or by using the Web UI running on the job tracker.
  • The goal of the capacity scheduler is to give all queues access to cluster resources. Shares are assigned to queues as percentages of total cluster resources.

  • Each queue has configurable guaranteed capacity in slots.
  • Jobs are placed in hierarchical queues. The default is 1 queue per cluster.
  • Jobs within a queue are FIFO. You can configure capacity-scheduler.xml per queue or per user.
  • This scheduler was developed at Yahoo.
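For example, in MRv1 the Fair Scheduler can be selected in mapred-site.xml roughly as follows (the property names are those of stock Hadoop 1.x; the allocation-file path is only an illustrative placeholder, and the entries go inside the file's <configuration> element):

    <!-- conf/mapred-site.xml: use the Fair Scheduler for the JobTracker -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <!-- optional: pools and their minimum shares live in a separate allocation file -->
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/opt/hadoop/conf/fair-scheduler.xml</value>
    </property>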


3.6 HADOOP CLUSTER SETUP

1 Purpose
This document describes how to install, configure and manage non-trivial Hadoop clusters
ranging from a few nodes to extremely large clusters with thousands of nodes.
To play with Hadoop, you may first want to install Hadoop on a single machine (see Single
Node Setup).

2 Prerequisites
1. Make sure all required software is installed on all nodes in your cluster.
2. Download the Hadoop software.

3 Installation
Installing a Hadoop cluster typically involves unpacking the software on all the machines in
the cluster.
Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves.
The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster
usually have the same HADOOP_HOME path.



3.7 CONFIGURATION OF HADOOP (CLUSTER)

The following sections describe how to configure a Hadoop cluster.

Configuration Files
Hadoop configuration is driven by two types of important configuration files:
1. Read-only default configuration - src/core/core-default.xml, src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml.
2. Site-specific configuration - conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml.
To learn more about how the Hadoop framework is controlled by these configuration files, see the Hadoop documentation.
Additionally, you can control the Hadoop scripts found in the bin/ directory of the distribution by setting site-specific values via the conf/hadoop-env.sh script.
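As a minimal illustration, the site-specific files listed above might contain entries such as the following (the hostnames, ports and replication factor are placeholders, not prescribed values; each snippet goes inside that file's <configuration> element):

    <!-- conf/core-site.xml -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode.example.com:9000</value>
    </property>

    <!-- conf/hdfs-site.xml -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    <!-- conf/mapred-site.xml -->
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker.example.com:9001</value>
    </property>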

Site Configuration
To configure the Hadoop cluster you will need to configure the environment in which the
Hadoop daemons execute as well as the configuration parameters for the Hadoop daemons.
The Hadoop daemons are NameNode/DataNode and JobTracker/TaskTracker.

 Configuring the Environment of the Hadoop Daemons
Administrators should use the conf/hadoop-env.sh script to do site-specific
customization of the Hadoop daemons' process environment.
At the very least you should specify the JAVA_HOME so that it is correctly defined on each
remote node.
In most cases you should also specify HADOOP_PID_DIR to point to a directory that can only be written to by the users that are going to run the Hadoop daemons. Otherwise there is the potential for a symlink attack.
Administrators can configure individual daemons using the configuration options
HADOOP_*_OPTS.
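A minimal conf/hadoop-env.sh customization might look like the following sketch (the paths and values are placeholders; only JAVA_HOME is strictly required):

    # conf/hadoop-env.sh - site-specific environment for the Hadoop daemons
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle        # must be valid on every node
    export HADOOP_PID_DIR=/var/hadoop/pids             # writable only by the hadoop user
    export HADOOP_HEAPSIZE=2000                        # daemon heap size in MB (optional)
    export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC"   # per-daemon options via HADOOP_*_OPTS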


3.8 STARTING AND STOPPING HADOOP CLUSTER

You can stop a currently running Hadoop cluster and start a stopped Hadoop cluster from the Data Director for Hadoop Web console.
Prerequisites
  • Verify that the cluster is provisioned.
  • Verify that enough resources, especially CPU and memory resources, are available to start the virtual machines in the Hadoop cluster.

Procedure
1. Log in to the Data Director for Hadoop Web console.
2. Click Clusters in the left pane.
3. Select the cluster that you want to stop or start from the Cluster Name column.
4. Click the Actions icon.
5. Select Stop to stop a running cluster, or select Start to start a cluster.
If the cluster does not start, try to start it from the CLI, which provides error messages and a log path for troubleshooting.
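Outside the Data Director Web console, a stock Hadoop 1.x cluster is usually started and stopped with the scripts shipped in the distribution's bin/ directory, roughly as follows (run from the master nodes; the commands assume HADOOP_HOME is set and passwordless SSH to the slaves is configured):

    # start HDFS (NameNode/DataNodes) and MapReduce (JobTracker/TaskTrackers)
    $HADOOP_HOME/bin/start-dfs.sh
    $HADOOP_HOME/bin/start-mapred.sh

    # stop them again
    $HADOOP_HOME/bin/stop-mapred.sh
    $HADOOP_HOME/bin/stop-dfs.sh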






