Big data concepts

Some basic concepts for my poor memory:

1.- Support vector machine (SVM)

SVM is the concept behind so many of the machine learning solutions the market offers.

You want to identify patterns, classify unstructured data, discover root causes, optimize forecasts… so many of the code written to do that implements this model.

2.-Digital Twins

The main goal of Digital Twins  is the creation of a fully functional digitized product for testing, prototyping and collecting data for future product revolutions and maintenance.

Digital Twin is the culmination of all things IoT. It employs sensors, connectivity and analytics. The Digital Twin overlays analytics and operational data over time on the static design model.

Hadoop workshop

Other event in NYC!, where two topics were discussed:

First, Michael Hausenblas from Mesosphere, talked about YARN on Mesos, explaining how to build the environment and all the benefits you get. A quick demo about how to it works was the nice part.

The second, Steven Camina was introducing MemSQL, all features and all you can do. Some interesting comments:

  • Oracle solution is the best in class sql on memory of the market. If you can pay it, go for it.
  • SAP Hana and Oracle takes 90% of the market.
  • They are having discussions with hardware companies to see how they can partner.
  • There are snapshots done for ensuring the consistency of the data. For fast e-trading companies, there could be some inconsistencies that still are not solved.
  • He did a demo of the product with a sequence of inserts and selects. In the demo you can see how the AWS resources are maximized and how the database size grow and grow.


Hadoop, Pig

My learning on the basic concepts of Hadoop continues.

Pig has 2 basic elements:

  • Pig Latin, it’s a data flow language used by programmers to write pig programs
  • Pig Latin compiler: converts pig latin code into executable code. Executable code is in form of MapReduce jobs or it can spawn a process where virtual Hadoop instance is created to run pig code on single node.

Pig works along with other Hadoop elements as HDSF, MapReduce Framework, YARN…

You can create Macros in Pig Language, you can also access to the piggybank to use standard code.

The main difference between MapReduce V1 and V2 is the existence of YARN

Pig vs. SQL

  • Pig Latin is procedural, SQL is declarative.
  • In pig you can have bag of tuples and the can be duplicated; In SQL on a set of tuples, every tuple is unique.
  • In Pig you can have different number of columns.
  • Pig uses ETL natively; SQL requires a separate ETL tool.
  • Pig uses lazy evaluation. In RDBMS you only have instant invocation of commands.
  • In Pig there is not control statements as “if” and “else”.
  • Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline and you can store data at any point during a pipeline. Most RDBMS systems have limited or no pipeline support. SQL is oriented around queries that produce a single result.Apache_Hadoop_Ecosystem


Hadoop architecture, data processing

In the process of understanding the basis of Hadoop I found a training course in AWS that is helping me to understand the concepts.

  • Map: it’s a process to map all distributed blocks of data and send the requested job to each one of them. The job sent to the block is a task with a reference to that block.
  • Reduce: it’s an iterative process where all data output from a task is sent back to summarize or join all the information into an unique place.
  • #Replication factor = 3. This means that every data block is replicated 3 times. With it, if a data block fails, another is available.
  • Jobs are divided in tasks that are sent to different blocks.
  • The block size by default is 64Mb.


Hadoop components

Hadoop platform is composed by different components and tools.

Hadoop HDFS: A distributed file system that partitions large files across multiple machines for high throughput access to data.

Hadoop YARN; A framework for job scheduling and cluster resource management.

Hadoop map reduce; A programming framework for distributed batch processing of large data sets distributed across multiple servers.

Hive; A data warehouse system for Hadoop that facilitates data summation, ad-hoc queries, and the analysis of large data sets stored in Hadoop – compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL programs are converted into MapReduce programs. Hive was initially developed by Facebook.

HBase; An open-source, distributed, column oriented store modeler created after google’s big table (that is property of Google). HBase is written in Java.

Pig; A high-level data-flow language (commonly called “Pig Latin”) for expressing MapReduce programs; it’s used for analyzing large HDFS distributed data sets. Pig was originally developed at Yahoo Research around 2006.

Mahout; A scalable machine learning and data mining library.

Oozie; A workflow scheduler system to manage Hadoop jobs (MapReduce and Pig jobs). Oozie is implemented as a Java Web-Application that runs in a Java Servlet-Container.

Spark; It’s a cluster computing framework which purpose is to manage large scale of data in memory. Spark’s in-memory primitives provide performance up to 100 times faster for certain applications.

Zookeeper; It’s a distributed configuration service, synchronization service, and naming registry for large distributed systems.


Apache Hadoop

hadoop-logoApache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. They cover all these challenges and other ones.

  • Hadoop is scalable, it linearly adds more nodes to the cluster to handle larger data.
  • Hadoop is accessible, it runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2). If you are rich and can purchase an IBM Big Blue, then good for you, but the rest of the mortals, we have to utilize inexpensive servers that both store and process the data.
  • Hadoop is robust, : It is architected with the assumption of frequent hardware failure, so it has been implemented to handle these type of failures.
  • Hadoop is simple, it allows users to quickly write efficient parallel code. Hadoop’s accessibility and simplicity give it an edge over writing and running large distributed programs. Please do not mix with “easy”, it requires intelligence and knowledge.
  • Hadopp is versatile, it understands the “big data” challenge where today we do not know the data we will be required to analyze by tomorrow. Hadoop’s breakthrough, businesses, organizations, data and is able to analyze data that was recently considered useless.

Before Hadoop, big data

Big Data can be defined as high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. Big data is basically about a new set of algorithms, topology and use of resources that enable us to gather, analyze and show more data.

Every day, we create thousand of bytes of data, the number grows in an exponential way.

In the past all data was stored in a relational database, so to handle it, it was a question of simple rules to select data from a database. Today 80% of data is unstructured, and this percentage is growing: videos, pictures, comments, GPS, transactions, locations…

Under this situation, individuals and companies face scenarios where the data is too big or it moves too fast or it exceeds current processing capacity. This limit to handle the data comes from:

  • Volume: to overcome the excessive size of data requires scalable technologies and distributed approaches to querying or finding a given data.
  • Velocity: we need to organize the resources (CPUs, memories, networks…) to enable real-time data that enables us to make decisions at the right time.
  • Variety: the unstructured data makes that the known paradigms of data search are not valid anymore. Since this moment you can differentiate between “Hadoop people” and “RDBMS people”, they are a different thing.

Think about Facebook, in the terms mentioned above:

  • Volume: the amount of data introduced by all people, companies distributed around the world. Imagine how they should organize all information.
  • Velocity: As user I want all as fast as possible, imagine a facebook that takes 2 seconds to refresh a screen: nobody would use it.
  • Variety: Facebook is tracking data from different nature, and the number of natures is growing up.

Well, Facebook, Yahoo, Twitter, EBay… they all use Hadoop.

Before Hadoop, distributed processing

I’m reading about Content Delivery Network (CDN) and I found the Apache Hadoop project. I have been shocked about the nature of the project, where this comes from and all the toolkit generated around it. It’s massive amount of information, but fascinating for me.

Hadoop project comes from the need of requiring more resources for a give goal. The solution has been to distribute the data and the processing of data. You need to process a huge amount of data with a simple computer that offers limited processing cycles, then you use combined group of computers to run these processes in less time.

The major resources considered while distributed processing system are: Processor time, memory, hard drive space, network bandwidth. For instance virtual servers is a  sophisticated software that detects idle CPU capacity on a rack of physical server and parcels out the virtual environments to utilize it.

There are so many challenges on distributed processing when it’s applied at large scale, and Hadoop faces them. It’s important to mention these challenges to understand (or admire) what the Apache Hadoop project does.

  • One individual compute node may overheat, crash, experience hard drive failures, or run out of memory or disk space.
  • The networks can experience partial or total failure if switches and routers break down. The network congestion which causes data transfer.
  • Multiple implementations or versions of client software may speak slightly different protocols from one another.
  • If the input data set is several terabytes, then this would require a thousand or more machines to hold it in RAM.
  • Intermediate data sets generated while performing a large-scale computation can take several times more space than what the original input data.
  • Synchronization between multiple machines.

In each of the mentioned cases, the distributed system should be able to recover from the component failure or transient error condition and continue to make progress.