Big Data is generally defined as high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
i) Volume - Vast volumes of data are generated daily from many sources, such as business processes, machines, social media platforms, networks, and human interactions.
ii) Variety - Big Data is collected from many different sources and can be structured, semi-structured, or unstructured.
iii) Velocity - Velocity refers to the speed at which data is generated, often in real time; a Big Data system must be able to ingest and process this fast-arriving data quickly.
iv) Veracity - Veracity refers to how trustworthy and accurate the data is; noisy, incomplete, or inconsistent data must be filtered and cleaned before reliable insight can be drawn from it.
Traditional Data:
● Traditional data is structured data, maintained by businesses of every size, from small firms to large organizations.
● In a traditional database system, a centralized database architecture is used to store and maintain the data in a fixed format, or as fixed fields in a file.
Big Data:
● Big data can be seen as an evolution beyond traditional data.
● Big data refers to data sets that are too large or complex to be managed by traditional data-processing application software.
● It covers large volumes of structured, semi-structured, and unstructured data.
Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models.
Hadoop is used for Big Data because it lets companies manage huge volumes of data easily. It breaks big problems down into smaller elements so that analysis can be done quickly and cost-effectively.
The Hadoop framework has two major layers and two supporting modules, namely −
● Processing/Computation layer (MapReduce)
● Storage layer (Hadoop Distributed File System)
● Hadoop Common
● Hadoop YARN
MapReduce
MapReduce is a parallel programming model, devised at Google, for writing distributed applications that efficiently process large amounts of data.
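For illustration, below is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output paths are assumed to be supplied as command-line arguments; this is a teaching example, not production code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the per-word counts produced by the mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}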
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware.
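Below is a minimal sketch of reading a file from HDFS with the Java FileSystem API. The NameNode URI hdfs://namenode:9000 and the path /data/sample.txt are placeholder values; on a real cluster, fs.defaultFS is normally picked up from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder URI for illustration; usually supplied by core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    // Open a text file stored in HDFS and print it line by line.
    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/sample.txt")),
                                   StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}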
Hadoop Common
These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN (Yet Another Resource Negotiator)
This is a framework for job scheduling and cluster resource management.
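As a small illustration, below is a sketch of querying the ResourceManager with the YarnClient API to list applications running on the cluster. It assumes yarn-site.xml is on the classpath so the client can locate the ResourceManager.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    // Cluster settings (ResourceManager address, etc.) come from yarn-site.xml.
    Configuration conf = new Configuration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    try {
      // Ask the ResourceManager for a report on each application.
      List<ApplicationReport> apps = yarnClient.getApplications();
      for (ApplicationReport app : apps) {
        System.out.printf("%s %s %s%n",
            app.getApplicationId(), app.getName(), app.getYarnApplicationState());
      }
    } finally {
      yarnClient.stop();
    }
  }
}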
The Hadoop Ecosystem has the following stages:
i) Data Management
ii) Data Access
iii) Data Processing
iv) Data Storage
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming-based data processing
● Spark: In-memory data processing
● Pig, Hive: Query-based processing of data services
● HBase: NoSQL database (see the sketch after this list)
● Mahout, Spark MLlib: Machine-learning algorithm libraries
● Solr, Lucene: Searching and indexing
● ZooKeeper: Cluster management and coordination
● Oozie: Job scheduling
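As referenced in the HBase entry above, below is a minimal sketch of writing one row with the HBase Java client. The table name "users", the column family "info", and the values written are hypothetical; the table and column family must already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutExample {
  public static void main(String[] args) throws Exception {
    // Connection settings (ZooKeeper quorum, etc.) come from hbase-site.xml.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {
      // Write one cell: row key "user1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
      table.put(put);
    }
  }
}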