Assume that you have 200 TB of data to store and process with Hadoop. The configuration of each available DataNode is as follows:
- 8 GB RAM
- 20 TB HDD
- 200 MB/s read-write speed
In this case, the number of DataNodes required to store would be:
- Number of DataNodes = (Total amount of data × Replication factor) / Disk space available on each DataNode
- = (200 TB × 3) / 20 TB
- = 30 DataNodes
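The storage sizing above can be sketched in a few lines of Python (the variable names are illustrative, not from any Hadoop API):

```python
# Storage sizing sketch for the example above; all names are illustrative.
total_data_tb = 200        # total data to store
replication_factor = 3     # HDFS default replication
disk_per_node_tb = 20      # HDD capacity of each DataNode

# Raw storage needed grows with the replication factor,
# so divide it by the capacity of a single DataNode.
datanodes_needed = (total_data_tb * replication_factor) / disk_per_node_tb
print(int(datanodes_needed))  # 30
```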
Now, let's assume you need to process this 200 TB of data using MapReduce.
Reading 200 TB of data at a speed of 200 MB/s using only one node would take:
- Time = Total data / Read-write speed
- = (200 × 1024 × 1024 MB) / (200 MB/s)
- = 1,048,576 seconds
- ≈ 291.27 hours
- Since all 30 DataNodes read their blocks in parallel: 291.27 / 30
- ≈ 9.71 hours
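The throughput arithmetic can be verified the same way (again a sketch, using the 200 MB/s per-node speed from the calculation above):

```python
# Processing-time sketch for the example above; names are illustrative.
total_data_mb = 200 * 1024 * 1024  # 200 TB expressed in MB
read_speed_mb_s = 200              # per-node read-write speed used above
num_datanodes = 30

single_node_hours = total_data_mb / read_speed_mb_s / 3600
parallel_hours = single_node_hours / num_datanodes  # nodes read in parallel

print(round(single_node_hours, 2))  # 291.27
print(round(parallel_hours, 2))     # 9.71
```

Scaling out to 30 DataNodes cuts the read time by a factor of 30 only because HDFS spreads the blocks across nodes, so each node reads its own share concurrently.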