MapReduce is a programming model for processing and generating large data sets that can be executed in parallel across a large cluster of machines. MapReduce was first designed and implemented at Google to process large amounts of web-related raw data, such as user search logs and crawled web pages.
MapReduce has had a huge impact on the computing world by making it possible to process terabytes of data in parallel across clusters of commodity machines. MapReduce led to the popularity of Apache Hadoop, an open-source implementation of MapReduce, and a host of other Big Data technologies.
Due to its core significance in Hadoop and Big Data, questions on MapReduce come up frequently in Big Data interviews. It is advisable to clearly understand and review the concepts of MapReduce before tackling Hadoop-specific questions.
MapReduce interview questions broadly fall into the following categories - basic MapReduce concepts, performance tuning of MapReduce applications, fault tolerance of MapReduce applications, and real-world applications of MapReduce.
Below are some frequently asked interview questions on MapReduce.
MapReduce is a programming model for processing large datasets in parallel using multiple machines. MapReduce enables large-scale data analysis and data mining on clusters of machines, taking far less time than processing the same data on a single machine.
MapReduce processes datasets in two phases - the Map phase and the Reduce phase. Each phase has key-value pairs as input and output. Depending on the data, the programmer can specify the types for the input and output. The programmer has to implement two functions - the map function and the reduce function - which define how the data is processed in these two phases.
Map phase - The Map phase is a data preparation phase in which the data is filtered and sent to the Reduce phase for further processing. The map function takes an input key-value pair and produces a set of intermediate key-value pairs. The MapReduce framework sorts and groups the data by the intermediate key before sending it as input to the Reduce phase.
Reduce phase - The Reduce phase takes the output key-value pairs from the Map phase as input and processes them further to compute and generate the final required output. The input key is the intermediate key output by the map function, and the input value is the list of all values associated with that key.
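The two phases can be illustrated with the classic word-count example. The sketch below is framework-independent Python that simulates the framework's group-by-key step - it is not the Hadoop API, just a minimal model of the programming model itself:

```python
from collections import defaultdict

# Word count as a MapReduce program: map_fn emits (word, 1) pairs,
# the framework groups pairs by intermediate key, and reduce_fn
# sums the counts for each word.

def map_fn(key, value):
    # key: document name (unused here), value: document text
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: all counts emitted for that word
    yield (key, sum(values))

# Simulate the framework: run map, group by intermediate key, run reduce.
docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
groups = defaultdict(list)
for name, text in docs.items():
    for k, v in map_fn(name, text):
        groups[k].append(v)

result = dict(kv for k in sorted(groups) for kv in reduce_fn(k, groups[k]))
# result: {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Note that reduce_fn receives each word exactly once, together with every count emitted for it - that grouping is the framework's job, not the programmer's.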
The MapReduce framework splits input files into multiple pieces of a fixed size, typically 16 MB to 64 MB each. The splits are then sent as input to multiple map tasks, which run in parallel on a cluster of machines.
The MapReduce framework splits the input data and starts copies of the MapReduce user program on a cluster of machines. One of the copies acts as the master program and the rest are worker programs. The master is responsible for assigning tasks to the workers. The master keeps track of the state of workers, and assigns each worker a map task or a reduce task.
The intermediate key-value pairs generated by the map task are buffered in memory and periodically written to the local disk of the machine running the map task. The partitioning function divides this data into multiple regions on disk, based on the hash of the key and the number of reducers.
The locations of the buffered data on the local disks are sent to the reduce tasks, which use remote procedure calls to pull the data and process it.
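The partitioning step can be sketched in a few lines of Python. This is an illustrative stand-in for a default hash partitioner (the function name and the choice of crc32 are assumptions, not any framework's actual implementation); the point is that every pair with the same key lands in the same region, so a single reducer sees all values for that key:

```python
import zlib

def partition(key, num_reducers):
    # A stable hash (zlib.crc32) keeps the assignment reproducible
    # across runs and machines; the idea is simply hash(key) mod R.
    return zlib.crc32(key.encode()) % num_reducers

R = 3  # number of reducers
pairs = [("apple", 1), ("banana", 1), ("apple", 1)]

# One region on local disk per reducer.
regions = {r: [] for r in range(R)}
for k, v in pairs:
    regions[partition(k, R)].append((k, v))

# Both ("apple", 1) pairs end up in the same region, destined
# for the same reduce task.
```

Using a stable hash matters here: Python's built-in hash() is randomized per process, which would scatter the same key across different partitions on different machines.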
The map tasks generate intermediate key-value pairs and store them in partitions on their local disks. Shuffling is the process of transferring this intermediate data, generated on multiple machines, to the reducers. The data fetched by a reducer is sorted and grouped by the intermediate key before being sent as input to the reduce task.
The master is responsible for assigning map tasks and reduce tasks to workers. For each map task and reduce task, the master stores the state of the task - idle, in-progress or completed; and the identity of the worker machine that is running the task.
The master assigns reduce tasks to workers and has to send the locations of the intermediate key-value data that the map task generates. Hence for every completed map task, the master stores the locations and sizes of the intermediate key-value data produced by the map task.
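The master's bookkeeping can be pictured as a small per-task record. The sketch below is hypothetical - the field names and path format are invented for illustration - but it captures the two pieces of state the text describes: the task lifecycle and, for completed map tasks, the intermediate-output locations and sizes:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskInfo:
    state: str = "idle"            # idle | in-progress | completed
    worker: Optional[str] = None   # machine currently running the task
    # For completed map tasks: partition id -> (location, size in bytes)
    output_regions: dict = field(default_factory=dict)

tasks = {"map-0": TaskInfo(), "reduce-0": TaskInfo()}

# The master assigns map-0 to worker-A, then records the output metadata
# when the worker reports completion.
tasks["map-0"].state, tasks["map-0"].worker = "in-progress", "worker-A"
tasks["map-0"].state = "completed"
tasks["map-0"].output_regions[0] = ("worker-A:/local/map-0.part0", 4096)

# Reduce task 0 can now be told where to fetch partition 0 from.
locations = [info.output_regions[0][0]
             for info in tasks.values() if 0 in info.output_regions]
```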
MapReduce is resilient to machine failures. The master pings the workers periodically, and marks a worker as failed if it does not receive a response within a specific time period.
A map task or reduce task that was in progress on the failed worker is reset to the idle state, so that the task can be processed again by another worker.
A map task completed by the failed worker is also reset to the idle state, so that it is processed again by another worker. The completed map task has to be re-executed because its intermediate output is stored on the failed machine's local disk and is no longer accessible. A reduce task completed by the failed worker need not be re-executed, since its output is stored in the global file system.
A straggler is a machine in the MapReduce cluster that takes an unusually long time to complete a map or reduce task, while the overall MapReduce operation waits for just this last task to finish. Stragglers can arise from various issues on the machine, such as low network bandwidth, an overloaded CPU, or insufficient memory.
MapReduce provides a mechanism to alleviate the problem of stragglers. When a MapReduce operation is close to completion, the master can be set to schedule backup executions of the remaining in-progress tasks. A task is then marked as complete when either the primary or the backup execution finishes.
A counter is a facility provided by the MapReduce framework to count various events, aggregated across the cluster machines. A counter is created in the MapReduce user program and can be incremented in the map or reduce task. The counter values from all worker machines are propagated to the master, which aggregates them and returns the totals to the user program after the MapReduce operation completes.
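The counter mechanism can be simulated with per-worker tallies that the master sums after the job finishes. The sketch below is illustrative Python, not the Hadoop Counters API; the counter name is invented for the example:

```python
from collections import Counter

def map_task(records):
    # Each worker keeps local counters alongside its map output.
    local = Counter()
    out = []
    for rec in records:
        if not rec.strip():
            local["skipped_blank_records"] += 1  # increment a counter
            continue
        out.append((rec, 1))
    return out, local

# Two workers process their input splits independently...
_, counters_1 = map_task(["a", "", "b"])
_, counters_2 = map_task(["", "", "c"])

# ...and the master aggregates the counters when the job completes.
totals = counters_1 + counters_2
# totals["skipped_blank_records"] == 3
```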
In a multi-node clustered environment, the efficiency of MapReduce jobs is limited by the network bandwidth available on the cluster. In such environments, combiner functions minimize the data transferred between the map and reduce tasks.
The combiner function acts on the output of the map task and reduces the volume of data before it is sent to the reduce task.
In Hadoop, the combiner is set on the job by calling the setCombinerClass() method on the Job instance.
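For word count, the combiner can simply be the reduce logic applied locally on each map task's output, since addition is associative and commutative. The Python sketch below shows the data reduction a combiner achieves (it simulates the idea, rather than using the Hadoop API):

```python
from collections import defaultdict

def combine(pairs):
    # Locally sum counts per word on the map side, before the shuffle.
    acc = defaultdict(int)
    for word, n in pairs:
        acc[word] += n
    return list(acc.items())

map_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
combined = combine(map_output)
# 4 pairs shrink to 2 before crossing the network:
# sorted(combined) == [('cat', 1), ('the', 3)]
```

A combiner is only safe when applying the reduce logic locally does not change the final result - sums and maxima qualify, but an operation like computing a mean of means does not.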