Assume that you have 200 TB of data to store and process with Hadoop. The configuration of each available DataNode is as follows:
- 8 GB RAM
- 20 TB HDD
- 200 MB/s read-write speed
In this case, the number of DataNodes required to store would be:
- Number of DataNodes = (Total amount of data × Replication factor) / Disk space available on each DataNode
- = (200 TB × 3) / 20 TB
- = 30 DataNodes
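The storage sizing above can be sketched in a few lines of Python (the variable names are illustrative, not from any Hadoop API):

```python
# Storage sizing sketch for the example above; all names are illustrative.
total_data_tb = 200        # total data to store
replication_factor = 3     # HDFS default replication
disk_per_node_tb = 20      # HDD capacity of each DataNode

# Raw storage needed grows with the replication factor,
# so divide it by the capacity of a single DataNode.
datanodes_needed = (total_data_tb * replication_factor) / disk_per_node_tb
print(int(datanodes_needed))  # 30
```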
Now, let's assume you need to process this 200 TB of data using MapReduce.
Reading 200 TB of data at a speed of 200 MB/s using only one node would take:
- Time = Total data / Read-write speed
- = (200 × 1024 × 1024 MB) / (200 MB/s)
- = 1,048,576 seconds
- ≈ 291.27 hours
- Since all 30 DataNodes read their blocks in parallel: 291.27 / 30
- ≈ 9.71 hours
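The throughput arithmetic can be verified the same way (again a sketch, using the 200 MB/s per-node speed from the calculation above):

```python
# Processing-time sketch for the example above; names are illustrative.
total_data_mb = 200 * 1024 * 1024  # 200 TB expressed in MB
read_speed_mb_s = 200              # per-node read-write speed used above
num_datanodes = 30

single_node_hours = total_data_mb / read_speed_mb_s / 3600
parallel_hours = single_node_hours / num_datanodes  # nodes read in parallel

print(round(single_node_hours, 2))  # 291.27
print(round(parallel_hours, 2))     # 9.71
```

Scaling out to 30 DataNodes cuts the read time by a factor of 30 only because HDFS spreads the blocks across nodes, so each node reads its own share concurrently.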