Big Data can be defined as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. Big Data is basically about a new set of algorithms, topologies and ways of using resources that enable us to gather, analyze and present more data.
Every day we create an enormous amount of data, and that amount grows exponentially.
In the past, most data was stored in relational databases, so handling it was a matter of writing simple queries to select it. Today around 80% of data is unstructured, and that percentage keeps growing: videos, pictures, comments, GPS readings, transactions, locations…
In this situation, individuals and companies face scenarios where the data is too big, moves too fast, or exceeds their current processing capacity. These limits come from:
- Volume: coping with the sheer size of the data requires scalable technologies and distributed approaches to querying and locating a given piece of data.
- Velocity: we need to organize resources (CPUs, memory, networks…) to process data in real time, so that decisions can be made at the right moment.
- Variety: unstructured data means the familiar paradigms of data search no longer apply. This is the point where you can start to distinguish between “Hadoop people” and “RDBMS people”: they deal with different things.
Think about Facebook in the terms mentioned above:
- Volume: the amount of data contributed by people and companies distributed all around the world. Imagine how all that information has to be organized.
- Velocity: as a user I want everything as fast as possible; imagine a Facebook that took two seconds to refresh the screen: nobody would use it.
- Variety: Facebook tracks data of many different kinds, and the number of kinds keeps growing.
Well, Facebook, Yahoo, Twitter, eBay… they all use Hadoop.
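To give a flavour of what the “Hadoop people” side looks like, here is a minimal sketch of the classic word-count job written against Hadoop's org.apache.hadoop.mapreduce API (an illustrative example, not taken from the text above): the mapper emits a (word, 1) pair for every token it sees, and the reducer sums those counts per word across the cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into tokens and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, it would typically be launched with something like `hadoop jar wordcount.jar WordCount /input /output` (the paths here are hypothetical); the point is that the same code runs unchanged whether the input is a few megabytes on one machine or terabytes spread across a cluster.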