In a training course I found this conceptual example of how MapReduce works.
The sequence of steps is clear: input, splitting, mapping, shuffling, and reducing.
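The steps above can be sketched in a few lines of plain Python. This is a single-process illustration of the classic word-count example, not real Hadoop (which distributes the map and reduce tasks across many machines); the input text and split boundaries are made up for the demo.

```python
from itertools import groupby

# Splitting: the input is divided into chunks (here, one per "line").
splits = ["deer bear river", "car car river", "deer car bear"]

# Mapping: each split emits (word, 1) pairs.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffling: pairs are sorted and grouped by key.
shuffled = groupby(sorted(mapped), key=lambda pair: pair[0])

# Reducing: the counts for each key are summed.
reduced = {word: sum(count for _, count in pairs) for word, pairs in shuffled}

print(reduced)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

Each phase here is an explicit step so the data flow stays visible; in Hadoop the shuffle happens automatically between the map and reduce tasks.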
Some basic concepts for my poor memory:
1.- Support vector machine (SVM)
SVM is the concept behind many of the machine learning solutions the market offers.
Whether you want to identify patterns, classify unstructured data, discover root causes, or optimize forecasts, much of the code written for those tasks implements this model.
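To make the pattern-classification idea concrete, here is a minimal sketch using scikit-learn's SVC (this assumes scikit-learn is installed; the one-feature data set and the two classes are invented for the example).

```python
from sklearn.svm import SVC

# Made-up training data: one feature, two clearly separated classes.
X = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 1, 1, 1]

# A linear-kernel SVM finds the boundary with the widest margin
# between the two classes.
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[1.5], [8.5]]))  # expected: [0 1]
```

New points are classified by which side of the learned margin they fall on, which is the core intuition behind the SVM model.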
The main goal of Digital Twins is the creation of a fully functional digitized product for testing, prototyping, and collecting data for future product revisions and maintenance.
The Digital Twin is the culmination of all things IoT. It employs sensors, connectivity, and analytics, overlaying analytics and operational data over time on the static design model.
Another event in NYC, where two topics were discussed:
I continue learning the basic concepts of Hadoop.
Pig has two basic elements:
Pig works along with other Hadoop elements such as HDFS, the MapReduce framework, YARN…
You can create macros in Pig Latin, and you can also access the Piggybank to reuse standard code.
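The kind of dataflow a Pig Latin script expresses (LOAD, then GROUP BY a field, then COUNT per group) can be mimicked in plain Python. This is only a rough analogue to show the shape of the computation; the records and field names are made up.

```python
from collections import Counter

# Stand-in for records LOADed from HDFS: (user, state) tuples.
records = [("alice", "NY"), ("bob", "SF"), ("carol", "NY")]

# Pig equivalent (roughly):
#   grouped = GROUP records BY state;
#   counts  = FOREACH grouped GENERATE group, COUNT(records);
counts = Counter(state for _, state in records)

print(dict(counts))  # {'NY': 2, 'SF': 1}
```

In Pig the same logic compiles down to MapReduce jobs that run across the cluster, which is the point of the language: a few high-level dataflow statements instead of hand-written map and reduce code.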
The main difference between MapReduce V1 and V2 is the existence of YARN.
Pig vs. SQL
In the process of understanding the basics of Hadoop, I found a training course in AWS that is helping me to understand the concepts.
The Hadoop platform is composed of different components and tools.
Hadoop HDFS: A distributed file system that partitions large files across multiple machines for high-throughput access to data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A programming framework for distributed batch processing of large data sets across multiple servers.
Hive: A data warehouse system for Hadoop that facilitates data summarization, ad-hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL programs are converted into MapReduce programs. Hive was initially developed by Facebook.
HBase: An open-source, distributed, column-oriented store modeled after Google's Bigtable (which is proprietary to Google). HBase is written in Java.
Pig: A high-level data-flow language (commonly called "Pig Latin") for expressing MapReduce programs; it's used for analyzing large data sets distributed across HDFS. Pig was originally developed at Yahoo! Research around 2006.
Mahout: A scalable machine learning and data mining library.
Oozie: A workflow scheduler system to manage Hadoop jobs (MapReduce and Pig jobs). Oozie is implemented as a Java web application that runs in a Java servlet container.
Spark: A cluster computing framework whose purpose is to manage large-scale data in memory. Spark's in-memory primitives provide performance up to 100 times faster for certain applications.
ZooKeeper: A distributed configuration service, synchronization service, and naming registry for large distributed systems.
All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. The framework addresses these failure scenarios, among other challenges.
Big Data can be defined as high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. In essence, big data is about a new set of algorithms, topologies, and uses of resources that enable us to gather, analyze, and present more data.
Every day we create enormous amounts of data, and the volume grows exponentially.
In the past, all data was stored in relational databases, so handling it was a matter of applying simple rules to select data from a database. Today 80% of data is unstructured, and this percentage is growing: videos, pictures, comments, GPS, transactions, locations…
In this situation, individuals and companies face scenarios where the data is too big, moves too fast, or exceeds current processing capacity. This limit in handling the data comes from:
Think about Facebook, in the terms mentioned above:
Well, Facebook, Yahoo, Twitter, eBay… they all use Hadoop.
I'm reading about Content Delivery Networks (CDN) and I found the Apache Hadoop project. I was struck by the nature of the project, where it comes from, and the whole toolkit generated around it. It's a massive amount of information, but fascinating to me.
The Hadoop project comes from the need for more resources for a given goal. The solution has been to distribute the data and the processing of that data. If you need to process a huge amount of data and a single computer offers limited processing cycles, you use a combined group of computers to run these processes in less time.
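The divide-and-combine idea can be illustrated with a toy sketch: split a large dataset into chunks, process each chunk in parallel, and merge the partial results. This runs in one process with threads, purely to show the shape of the approach; in Hadoop the chunks would live on different machines.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for the real work one worker node would do on its chunk.
    return sum(chunk)

data = list(range(1_000_000))

# Split the data into four chunks of equal size.
chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]

# Process the chunks in parallel and combine the partial results.
with ThreadPoolExecutor(max_workers=4) as executor:
    partials = list(executor.map(process_chunk, chunks))

total = sum(partials)
print(total)  # same result as sum(data)
```

The key property is that combining the partial results gives the same answer as processing everything on one machine, which is what makes the work distributable in the first place.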
The major resources considered in a distributed processing system are: processor time, memory, hard drive space, and network bandwidth. For instance, virtual server software is sophisticated enough to detect idle CPU capacity on a rack of physical servers and parcel out virtual environments to utilize it.
There are many challenges in distributed processing when it's applied at large scale, and Hadoop faces them. It's important to mention these challenges to understand (or admire) what the Apache Hadoop project does.
In each of the mentioned cases, the distributed system should be able to recover from the component failure or transient error condition and continue to make progress.