MapReduce example

In a training course I found this conceptual example of how MapReduce works. The sequence of steps is clear: input, splitting, mapping, shuffling and reducing. I continue to build basic knowledge of Hadoop, which at the moment is very interesting.
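The five steps above can be sketched in plain Python. This is a minimal, single-machine word-count illustration (the input text is made up for the example), not real Hadoop code:

```python
from itertools import groupby
from operator import itemgetter

# Input: the raw data to process (made-up example text).
data = "deer bear river\ncar car river\ndeer car bear"

# Splitting: the input is divided into independent splits (here, lines).
splits = data.split("\n")

# Mapping: each split is turned into (key, value) pairs.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffling: pairs are sorted and grouped by key.
mapped.sort(key=itemgetter(0))
shuffled = {key: [v for _, v in group]
            for key, group in groupby(mapped, key=itemgetter(0))}

# Reducing: the values for each key are aggregated into the final result.
reduced = {word: sum(counts) for word, counts in shuffled.items()}

print(reduced)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

In real Hadoop the splits live on different machines and the shuffle moves data across the network, but the logical stages are exactly these.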

Big data concepts

Some basic concepts for my poor memory:
1.- Support Vector Machine (SVM): the concept behind many of the machine learning solutions the market offers. You want to identify patterns, classify unstructured data, discover root causes, optimize forecasts… much of the code written to do that implements this model.
2.- Digital Twins: The … Read more
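As a reminder of what an SVM actually computes, here is a minimal linear SVM trained by hinge-loss subgradient descent on a tiny made-up 2-D dataset. A sketch for intuition only, not a production implementation:

```python
# Toy linearly separable data: class +1 in the upper-right, -1 in the lower-left.
X = [(2.0, 2.0), (3.0, 1.5), (-2.0, -2.0), (-3.0, -1.5)]
y = [1, 1, -1, -1]

def train_linear_svm(X, y, epochs=200, lr=0.01, lam=0.01):
    """Minimize hinge loss plus an L2 penalty with subgradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (w[0] * xi[0] + w[1] * xi[1] + b)
            if margin < 1:  # point inside the margin: push the boundary away
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:           # correctly classified: only shrink w (regularization)
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Classify by which side of the separating hyperplane x falls on."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])  # [1, 1, -1, -1]
```

The learned hyperplane w·x + b = 0 is the "pattern" the trained model stores; classifying a new point is just checking its sign.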

Hadoop workshop

Another event in NYC, where two topics were discussed. First, Michael Hausenblas from Mesosphere talked about YARN on Mesos, explaining how to build the environment and all the benefits you get; a quick demo of how it works was the nicest part. Second, Steven Camina introduced MemSQL, its features and all you can do. … Read more

Hadoop, Pig

My learning of the basic concepts of Hadoop continues. Pig has two basic elements:
Pig Latin: a data flow language used by programmers to write Pig programs.
Pig Latin compiler: converts Pig Latin code into executable code. The executable code is in the form of MapReduce jobs, or it can spawn a process where virtual Hadoop … Read more
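To show what the data flow style looks like, here is the classic Pig Latin word-count sketch (the file name 'input.txt' is hypothetical). The compiler turns a script like this into one or more MapReduce jobs:

```
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
DUMP counts;
```

Each statement names a relation and feeds the next one, which is what makes it a data flow language rather than a query language.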

Hadoop architecture, data processing

In the process of understanding the basics of Hadoop, I found a training course in AWS that is helping me understand the concepts. Map: a process that maps all distributed blocks of data and sends the requested job to each one of them. The job sent to the block is a task with … Read more
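The idea of sending the same task to every block can be simulated on one machine. A sketch, assuming the "blocks" are just in-memory chunks and a thread pool stands in for the cluster nodes that hold them:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# The file's data blocks, as HDFS would distribute them (made-up content).
blocks = ["deer bear river", "car car river", "deer car bear"]

def map_task(block):
    """The task sent to each block: count the words it contains, locally."""
    return Counter(block.split())

# The same map task is dispatched to every block in parallel;
# the pool plays the role of the cluster nodes holding the blocks.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_task, blocks))

# A reduce step merges the per-block partial results into the answer.
totals = sum(partials, Counter())
print(dict(totals))  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

The key point the course makes is visible here: the task travels to where each block of data lives, and only small partial results travel back.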

Hadoop components

The Hadoop platform is composed of different components and tools.
Hadoop HDFS: a distributed file system that partitions large files across multiple machines for high-throughput access to data.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop MapReduce: a programming framework for distributed batch processing of large data sets distributed across … Read more

Apache Hadoop

Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. They … Read more
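That "failures are handled in software" assumption can be sketched as a simple retry loop. All names here are hypothetical and nothing below is Hadoop-specific; it only shows the idea that a scheduler reruns a failed task on another worker instead of treating the failure as fatal:

```python
import random

random.seed(1)  # make the simulated failures deterministic for the example

def run_on_worker(task, worker):
    """A stand-in for executing a task on one machine; it may fail."""
    if random.random() < 0.5:          # simulate a hardware failure
        raise RuntimeError(f"worker {worker} died")
    return f"{task} done on worker {worker}"

def run_with_retries(task, workers, max_attempts=4):
    """Reschedule the task on another worker until it succeeds."""
    for attempt in range(max_attempts):
        worker = workers[attempt % len(workers)]
        try:
            return run_on_worker(task, worker)
        except RuntimeError:
            continue                   # failure is expected: just retry elsewhere
    raise RuntimeError("task failed on all attempts")

result = run_with_retries("map-task-07", workers=["w1", "w2", "w3"])
print(result)
```

Because any single commodity machine is unreliable, the framework gets its reliability from this kind of rescheduling rather than from the hardware.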

Before Hadoop, big data

Big Data can be defined as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. Big data is basically about a new set of algorithms, topologies and uses of resources that enable us to gather, analyze and present more data. Every day, we … Read more

Before Hadoop, distributed processing

I’m reading about Content Delivery Networks (CDN) and I found the Apache Hadoop project. I have been amazed by the nature of the project, where it comes from, and the whole toolkit generated around it. It’s a massive amount of information, but fascinating to me. The Hadoop project comes from the need for more resources for … Read more