TOPICS: Big Data Analysis, Hadoop and MapReduce: Introduction, Clustering Big Data, Classification of Big Data, Hadoop MapReduce Job Execution, Hadoop Scheduling, Hadoop Cluster Setup, Configuration of Hadoop, Starting and Stopping a Hadoop Cluster.
3.1 INTRODUCTION (BIG DATA ANALYSIS)
Big Data is unstructured data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of our database architectures. This information comes from multiple, distinct, independent sources with complex and evolving relationships, and it keeps growing day by day. Big Data poses three main challenges: data accessing and arithmetic computing procedures; semantics and domain knowledge for different Big Data applications; and the difficulties raised by Big Data volumes, distributed data, and complex, dynamic characteristics.
Big data is a broad term for data sets so large or
complex that traditional data processing applications are inadequate.
Challenges include analysis, capture, data curation, search, sharing, storage,
transfer, visualization, querying and information privacy. The term often
refers simply to the use of predictive analytics or certain other advanced
methods to extract value from data, and seldom to a particular size of data
set. Accuracy in big data may lead to more confident decision making, and
better decisions can result in greater operational efficiency, cost reduction
and reduced risk.
Analysis of data sets can find new correlations to
"spot business trends, prevent diseases, combat crime and so on."
Scientists, business executives, practitioners of medicine, advertising and
governments alike regularly meet difficulties with large data sets in areas
including Internet search, finance and business informatics. Scientists
encounter limitations in e-Science work, including meteorology, genomics,
connectomics, complex physics simulations, biology and environmental research.
To handle these challenges, the Big Data framework is divided into three tiers, as shown in Figure 1.
Tier I, data accessing and computing, focuses on data accessing and arithmetic computing procedures. Because large amounts of information are stored at different locations and are growing rapidly day by day, computing over such distributed, large-scale information requires an effective computing platform such as Hadoop.
Tier II, data privacy and domain knowledge, focuses on the semantics and domain knowledge of different Big Data applications. In a social network, for example, users are linked to each other and share their knowledge; these relationships are captured by user communities, group leaders, social-influence models and so on. Understanding this semantics and application knowledge is important both for low-level data access and for high-level mining algorithm design.
Tier III, Big Data mining algorithms, focuses on the difficulties raised by Big Data volumes, distributed data, and complex, dynamic data characteristics.
Big Data is now part of every sector and function of the global economy. Big Data is a collection of datasets so large and complex that it is beyond the ability of typical database software tools to capture, store, manage and process within a tolerable elapsed time. A typical domain such as the stock market constantly generates a large quantity of information, such as bids, buys and puts, every single second.
3.2 CLASSIFICATION
Classification is a data mining technique that assigns unstructured data to structured classes and groups, and it helps the user with knowledge discovery and future planning. Classification supports intelligent decision making. There are two phases in classification: the first is the learning phase, in which a large training data set is supplied and analysed, and rules and patterns are created; the second is the evaluation or test phase, in which the patterns are applied to test data sets and the accuracy of the classification patterns is assessed. This section briefly describes supervised classification methods such as Decision Tree and Support Vector Machine.
1. Decision Tree (DT)
A Decision Tree is well suited as a filter for handling large amounts of data. DT is a basic classification method that can achieve satisfactory efficiency and accuracy on such datasets. The Decision Tree algorithm offers a good trade-off between precision and speed: it can be trained very fast and provides sound results on classification data.
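To make the prediction phase concrete, the following is a minimal illustrative sketch (not taken from the source or any particular library) of how an already-trained binary decision tree classifies a record by walking from the root to a leaf; the class and field names are hypothetical.

// Minimal sketch (illustrative only): classifying a record with an
// already-trained binary decision tree. Class and field names are hypothetical.
public class TreeNode {
    int featureIndex;      // which feature this node tests
    double threshold;      // split threshold for that feature
    TreeNode left, right;  // children; both null at a leaf
    String label;          // class label, used only at leaf nodes

    // Walk from this node down to a leaf and return its class label.
    public String classify(double[] features) {
        if (left == null && right == null) {
            return label;                       // reached a leaf: predict its class
        }
        return (features[featureIndex] <= threshold)
                ? left.classify(features)       // test passes: go left
                : right.classify(features);     // otherwise: go right
    }
}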
Big Data is now rapidly expanding in all domains with the fast development of networking and the increase in data storage and collection capacity.
2. Support Vector Machine (SVM)
Machine Learning enables a computer to learn, using algorithms and techniques that perform different tasks and activities to provide efficient learning. The main problems are how to represent complex data and how to exclude bogus data. A Support Vector Machine is a Machine Learning tool for classification, based on Supervised Learning, that assigns points to one of two disjoint half-spaces. It uses a nonlinear mapping to convert the original data into a higher-dimensional space. Its objective is to construct a function that correctly predicts the class of new points as well as of the points it was trained on.
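As a minimal sketch of the "two disjoint half-spaces" idea, the hypothetical Java class below applies a linear SVM decision function sign(w·x + b) to a new point; the weight vector and bias are assumed to have been learned already, and kernel (nonlinear) mappings are omitted.

// Minimal sketch (illustrative only): a linear SVM decision function.
// The weight vector w and bias b are assumed to have been learned already;
// prediction just checks which side of the separating hyperplane x falls on.
public class LinearSvm {
    private final double[] w;  // learned weight vector
    private final double b;    // learned bias term

    public LinearSvm(double[] w, double b) {
        this.w = w;
        this.b = b;
    }

    // Returns +1 or -1: the half-space (class) the point x belongs to.
    public int predict(double[] x) {
        double score = b;
        for (int i = 0; i < w.length; i++) {
            score += w[i] * x[i];   // dot product w . x
        }
        return score >= 0 ? 1 : -1;
    }
}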
3.3 CLUSTERING
Clustering organizes a collection of n objects into a partition or a hierarchy (a nested set of partitions). Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis.
However, clustering is a combinatorially difficult problem, and differences in assumptions and contexts across communities have made the transfer of useful generic concepts and methodologies slow to occur.
Components of a Clustering Task
Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]:
(1) pattern representation (optionally including feature extraction and/or selection),
(2) definition of a pattern proximity measure appropriate to the data domain,
(3) clustering or grouping,
(4) data abstraction (if needed), and
(5) assessment of output (if needed).
Clustering Algorithms
Hundreds of clustering algorithms are available; many are “admissible”, but no algorithm is “optimal”.
• K-means
• Gaussian mixture models
• Kernel K-means
• Spectral Clustering
• Nearest neighbor
• Latent Dirichlet Allocation
Why clustering is required
• Not feasible to “label” a large collection of objects
• No prior knowledge of the number and nature of groups (clusters) in the data
• Clusters may evolve over time
• Clustering provides efficient browsing, search, recommendation and organization of data
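Of the algorithms listed above, K-means is the most widely used. The following is a minimal, self-contained sketch (illustrative only, not an optimized implementation) that alternates the assignment and centroid-update steps for a fixed number of iterations, using squared Euclidean distance.

import java.util.Arrays;
import java.util.Random;

// Minimal K-means sketch: partitions n points into k clusters by
// alternating assignment and centroid-update steps.
public class KMeans {

    public static int[] cluster(double[][] points, int k, int iterations) {
        int n = points.length, d = points[0].length;
        double[][] centroids = new double[k][];
        Random rnd = new Random(42);
        for (int j = 0; j < k; j++) {               // initialise centroids from random points
            centroids[j] = points[rnd.nextInt(n)].clone();
        }
        int[] assignment = new int[n];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < n; i++) {
                double best = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double dist = 0;
                    for (int f = 0; f < d; f++) {
                        double diff = points[i][f] - centroids[j][f];
                        dist += diff * diff;
                    }
                    if (dist < best) { best = dist; assignment[i] = j; }
                }
            }
            // Update step: move each centroid to the mean of its assigned points.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int f = 0; f < d; f++) sums[assignment[i]][f] += points[i][f];
            }
            for (int j = 0; j < k; j++) {
                if (counts[j] > 0) {
                    for (int f = 0; f < d; f++) centroids[j][f] = sums[j][f] / counts[j];
                }
            }
        }
        return assignment;  // cluster index for every input point
    }

    public static void main(String[] args) {
        double[][] pts = { {1, 1}, {1.2, 0.9}, {8, 8}, {8.1, 7.9} };
        System.out.println(Arrays.toString(cluster(pts, 2, 10)));  // e.g. [0, 0, 1, 1]
    }
}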
3.4 HADOOP MAPREDUCE JOB EXECUTION
MapReduce is a programming model and an associated
implementation for processing and generating large data sets with a parallel,
distributed algorithm on a cluster. Conceptually similar approaches have been
very well known since 1995 with the Message Passing Interface standard having
reduce and scatter operations.
A MapReduce program is composed of a Map() procedure
(method) that performs filtering and sorting (such as sorting students by first
name into queues, one queue for each name) and a Reduce() method that performs
a summary operation (such as counting the number of students in each queue,
yielding name frequencies). The "MapReduce System" (also called
"infrastructure" or "framework") orchestrates the
processing by marshalling the distributed servers, running the various tasks in
parallel, managing all communications and data transfers between the various
parts of the system, and providing for redundancy and fault tolerance.
The model is inspired by the map and reduce
functions commonly used in functional programming, although their purpose in
the MapReduce framework is not the same as in their original forms. The key
contributions of the MapReduce framework are not the actual map and reduce
functions, but the scalability and fault-tolerance achieved for a variety of
applications by optimizing the execution engine once. As such, a
single-threaded implementation of MapReduce will usually not be faster than a
traditional (non-MapReduce) implementation; any gains are usually only seen
with multi-threaded implementations. The use of this model is beneficial only
when the optimized distributed shuffle operation (which reduces network
communication cost) and fault tolerance features of the MapReduce framework
come into play. Optimizing the communication cost is essential to a good
MapReduce algorithm.
MapReduce libraries have been
written in many programming languages, with different levels of optimization. A
popular open-source implementation that has support for distributed shuffles is
part of Apache Hadoop.
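The canonical Hadoop MapReduce example is word count: the mapper emits (word, 1) pairs and the reducer sums the counts for each word. The sketch below follows the standard WordCount example shipped with Hadoop; exact driver calls (for example Job.getInstance) vary slightly between Hadoop releases.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configure the job and submit it to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}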
How Hadoop executes MapReduce (MapReduce v1) Jobs
- To begin, a user runs a MapReduce program on the client node, which instantiates a Job client object.
- Next, the Job client submits the job to the JobTracker (a minimal sketch of this client-side submission step follows the list).
- Then the job tracker creates a set of map and reduce tasks, which get sent to the appropriate task trackers.
- The task tracker launches a child process which in turn runs the map or reduce task.
- Finally, the task continuously updates the task tracker with status and counters and writes its output to its context.
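The client-side submission step in the list above might look like the following sketch using the classic org.apache.hadoop.mapred (MRv1) API; for brevity it runs a pass-through job with the built-in identity mapper and reducer rather than user-written classes.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Sketch: client-side job submission with the classic MRv1 API.
public class SubmitJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJob.class);   // the Job client configuration
        conf.setJobName("identity-job");
        conf.setMapperClass(IdentityMapper.class);     // pass input records through unchanged
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);    // TextInputFormat keys are byte offsets
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);   // submits the job to the JobTracker and waits for completion
    }
}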
When task trackers send heartbeats to the job
tracker, they include other information such as task status, task counters, and
(since task trackers are usually configured as file servers) data read/write
status.
Heartbeats tell the Job Tracker which Task Trackers are alive. When the Job Tracker stops receiving heartbeats from a Task Tracker:
- The Job Tracker reschedules the tasks that were running on the failed Task Tracker to other Task Trackers.
- The Job Tracker marks the Task Tracker as down and won't schedule subsequent tasks there.
3.5 HADOOP SCHEDULING
- Two schedulers are available in Hadoop: the Fair scheduler and the Capacity scheduler.
- The Fair scheduler is the default. Resources are shared evenly across pools and each user has its own pool by default. You can configure custom pools and guaranteed minimum access to pools to prevent starvation. This scheduler supports pre-emption.
- Hadoop also ships with the Capacity scheduler; here resources are shared across queues. You may configure hierarchical queues to reflect organizations and their weighted access to resources. You can also configure soft and hard capacity limits to users within a queue. Queues have ACLs to prevent rogues from accessing the queue. This scheduler supports resource-based scheduling and job priorities.
- In MRv1, the effective scheduler is defined in $HADOOP_HOME/conf/mapred-site.xml. In MRv2 (YARN), the configuration lives under $HADOOP_HOME/etc/hadoop/, where the ResourceManager's scheduler is set in yarn-site.xml.
- The goal of the fair scheduler is to give all users equitable access to cluster resources across pools. Each pool has a configurable guaranteed capacity in slots, and each pool's capacity is divided among its jobs. Jobs are placed in flat pools; the default is 1 pool per user.
- The Fair Scheduler algorithm works by dividing each pool's minimum map and reduce slots among its jobs. When a slot is free, the algorithm allocates it to a job that is below its minimum share, or to the most starved job. Jobs from "over-using" users can be preempted.
- You can examine the configuration and status of the Hadoop fair scheduler in the MCS, or by using the Web UI running on the job tracker.
- The goal of the capacity scheduler is to give all queues access to cluster resources. Shares are assigned to queues as percentages of total cluster resources.
- Each queue has configurable guaranteed capacity in slots.
- Jobs are placed in hierarchical queues. The default is 1 queue per cluster.
- Jobs within a queue are FIFO. You can configure capacity-scheduler.xml per-queue or per-user.
- This scheduler was developed at Yahoo.
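As an illustration of where these scheduler settings live, the sketch below sets MRv1-era Fair Scheduler properties programmatically; in practice they are placed in $HADOOP_HOME/conf/mapred-site.xml. The property and class names come from the Hadoop 1.x fair-scheduler contrib module and may differ across versions and distributions; the allocation-file path and pool name are placeholders.

import org.apache.hadoop.mapred.JobConf;

// Sketch (assumptions noted above): selecting the Fair Scheduler and assigning
// a job to a pool using MRv1-era properties.
public class SchedulerConfig {
    public static JobConf fairSchedulerConf() {
        JobConf conf = new JobConf();
        // Tell the JobTracker which task scheduler to load (a cluster-wide setting).
        conf.set("mapred.jobtracker.taskScheduler",
                 "org.apache.hadoop.mapred.FairScheduler");
        // Point the Fair Scheduler at its pool definitions (minimum shares, weights).
        // The path below is a placeholder.
        conf.set("mapred.fairscheduler.allocation.file",
                 "/path/to/fair-scheduler.xml");
        // A job can request a specific pool; the default is one pool per user.
        // "analytics" is a placeholder pool name.
        conf.set("mapred.fairscheduler.pool", "analytics");
        return conf;
    }
}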
3.6 HADOOP CLUSTER SETUP
1 Purpose
This section describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes.
To play with Hadoop, you may first want to install it on a single machine (see Single Node Setup).
2 Prerequisites
1. Make sure all required software is installed on all nodes in your cluster.
2. Download the Hadoop software.
3 Installation
Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster.
Typically, one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves.
The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster usually have the same HADOOP_HOME path.
3.7 CONFIGURATION OF HADOOP (CLUSTER)
The following sections describe how to configure a Hadoop cluster.
Configuration Files
Hadoop configuration is driven by two types of important configuration files:
1. Read-only default configuration - src/core/core-default.xml, src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml.
2. Site-specific configuration - conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml.
To learn more about how the Hadoop framework is controlled by these configuration files, refer to the Hadoop documentation.
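As a small illustration of how these files are consumed, the sketch below uses Hadoop's Configuration class, which layers the default files and any site files it is given; the property keys shown are MRv1-era names and the paths are assumed to be relative to HADOOP_HOME.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Sketch: how default and site-specific files are layered by Hadoop's
// Configuration class. core-default.xml and core-site.xml are picked up from
// the classpath automatically; other site files can be added explicitly.
public class ShowConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();           // loads core-default.xml + core-site.xml
        conf.addResource(new Path("conf/hdfs-site.xml"));    // layer site-specific overrides on top
        conf.addResource(new Path("conf/mapred-site.xml"));
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
    }
}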
Additionally, you can control the Hadoop scripts found in the bin/ directory of the distribution by setting site-specific values via conf/hadoop-env.sh.
Site Configuration
To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute, as well as the configuration parameters for the Hadoop daemons. The Hadoop daemons are NameNode/DataNode and JobTracker/TaskTracker.
Configuring the Environment of the Hadoop Daemons
Administrators should use the conf/hadoop-env.sh script to do site-specific customization of the Hadoop daemons' process environment.
At the very least you should specify JAVA_HOME so that it is correctly defined on each remote node.
In most cases you should also specify HADOOP_PID_DIR to point to a directory that can only be written to by the users that are going to run the hadoop daemons. Otherwise there is the potential for a symlink attack.
Administrators can configure individual daemons using the configuration options HADOOP_*_OPTS.
3.8 STARTING AND STOPPING A HADOOP CLUSTER
You can stop a currently running Hadoop cluster and
start a stopped Hadoop cluster from the Data Director for Hadoop Web console.
Prerequisites
- Verify that the cluster is provisioned.
- Verify that enough resources, especially CPU and memory resources, are available to start the virtual machines in the Hadoop cluster.
Procedure
1. Log in to the Data Director for Hadoop Web console.
2. Click Clusters in the left pane.
3. Select the cluster that you want to stop or start from the Cluster Name column.
4. Click the Actions icon.
5. Select Stop to stop a running cluster, or select Start to start a cluster.
If the cluster does not start, try to start it from the CLI, which provides error messages and a log path for troubleshooting.