I was reading about Content Delivery Networks (CDNs) when I came across the Apache Hadoop project. I was amazed by the nature of the project, where it comes from, and the whole toolkit that has grown around it. It’s a massive amount of information, but fascinating to me.
The Hadoop project comes from the need for more resources than a single machine can offer for a given goal. The solution has been to distribute both the data and the processing of that data. If you need to process a huge amount of data and a single computer offers only limited processing cycles, you use a combined group of computers to run those processes in less time.
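As a minimal sketch of that divide-the-work idea (my own toy example, not Hadoop code), the snippet below splits an input into chunks, hands each chunk to a separate worker process, and merges the partial results. The function names are hypothetical; Hadoop applies the same split/compute/merge pattern at cluster scale rather than on one machine.

```python
from multiprocessing import Pool


def count_words(chunk):
    # Each worker counts the words in its own slice of the data.
    return sum(len(line.split()) for line in chunk)


def parallel_word_count(lines, workers=4):
    # Split the input into roughly equal chunks, process them in
    # parallel, then combine the partial counts into one result.
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with Pool(workers) as pool:
        return sum(pool.map(count_words, chunks))


if __name__ == "__main__":
    data = ["hello world", "foo bar baz"] * 100
    print(parallel_word_count(data))  # 500
```

Here the "cluster" is just local processes, but the shape of the computation is the same: independent partial results that can be computed anywhere and combined at the end.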
The major resources to consider in a distributed processing system are processor time, memory, hard drive space, and network bandwidth. For instance, virtual server software is sophisticated enough to detect idle CPU capacity on a rack of physical servers and parcel out virtual environments to utilize it.
There are many challenges in distributed processing when it’s applied at large scale, and Hadoop faces them. It’s important to mention these challenges to understand (or admire) what the Apache Hadoop project does.
- One individual compute node may overheat, crash, experience hard drive failures, or run out of memory or disk space.
- The network can experience partial or total failure if switches and routers break down, and network congestion can slow down data transfer.
- Multiple implementations or versions of client software may speak slightly different protocols from one another.
- If the input data set is several terabytes, then this would require a thousand or more machines to hold it in RAM.
- Intermediate data sets generated while performing a large-scale computation can take several times more space than the original input data.
- Synchronization between multiple machines is hard to maintain.
In each of these cases, the distributed system should be able to recover from the failed component or transient error condition and continue to make progress.
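One simple recovery technique is to re-run a failed task, ideally on a different machine. The sketch below (my own illustration, under the assumption that a task is a retryable callable; the names are hypothetical, not a Hadoop API) shows the basic retry loop a scheduler might use to keep making progress despite a flaky node.

```python
def run_with_retries(task, attempts=3):
    # A distributed scheduler cannot assume a task succeeds on the
    # first try: the node running it may crash or time out. Re-running
    # the task lets the overall job make progress anyway.
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except RuntimeError as err:
            last_error = err
            print(f"attempt {attempt} failed: {err}")
    raise RuntimeError(f"task failed after {attempts} attempts") from last_error


class FlakyTask:
    """Simulates a task that fails a fixed number of times, then succeeds."""

    def __init__(self, fail_times):
        self.remaining = fail_times

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise RuntimeError("simulated node failure")
        return "done"


if __name__ == "__main__":
    print(run_with_retries(FlakyTask(fail_times=2)))  # done
```

Hadoop’s real machinery is far more involved (heartbeats, task re-scheduling, speculative execution), but the underlying principle is this one: treat individual failures as expected events and design the system to route around them.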