Friday, April 20, 2018

FWD: Top 50 Interview Quiz for MapReduce

Ref: http://www.bigdatatrunk.com/top-50-interview-quiz-mapreduce/

 

Q1 What is MapReduce?

Answer: MapReduce is a parallel programming model used to process large data sets across hundreds or thousands of servers in a Hadoop cluster. MapReduce brings the computation to the data location, in contrast to traditional parallelism, which brings the data to the compute location. The term MapReduce is composed of a Map phase and a Reduce phase. The map job takes a set of data and converts it into another set of data, where individual elements are broken down into key/value pairs (tuples). The reduce job takes the output from a map as input and combines those tuples into a smaller set of tuples. As the name MapReduce implies, the reduce job is always performed after the map job. The native programming language for MapReduce is Java, and all data emitted in the flow of a MapReduce program is in the form of key/value pairs.

Q2 Explain a MapReduce program.

Answer: A MapReduce program consists of three parts: the Driver, the Mapper, and the Reducer.

The Driver code runs on the client machine and is responsible for building the configuration of the job and submitting it to the Hadoop Cluster. The Driver code will contain the main() method that accepts arguments from the command line.

The Mapper code reads the input files as <Key,Value> pairs and emits key value pairs. The Mapper class extends MapReduceBase and implements the Mapper interface. The Mapper interface expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the input key and value types, the second two define the output key and value types.

The Reducer code reads the outputs generated by the different mappers as <Key,Value> pairs and emits key value pairs. The Reducer class extends MapReduceBase and implements the Reducer interface. The Reducer interface expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the intermediate key and value types, the second two define the final output key and value types.
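As a concrete illustration, here is a minimal word-count style Mapper and Reducer sketch. It uses the newer org.apache.hadoop.mapreduce API (the answer above describes the older mapred API, in which the classes extend MapReduceBase); the class names are hypothetical, but the four generic parameters play exactly the roles described above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Input key/value: byte offset and line of text; output key/value: word and count.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);              // emit intermediate (word, 1) pairs
            }
        }
    }
}

// Intermediate key/value: word and counts; final output key/value: word and total.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));     // emit the final (word, total) pair
    }
}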

Q3 What are the main configuration parameters that a user needs to specify to run a MapReduce job?

Answer: The user of the MapReduce framework needs to specify the following (a minimal driver wiring these together is sketched after the list):

·         Job’s input locations in the distributed file system

·         Job’s output location in the distributed file system

·         Input format

·         Output format

·         Class containing the map function

·         Class containing the reduce function

·         JAR file containing the mapper, reducer and driver classes
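A minimal driver sketch that sets these parameters, assuming the hypothetical TokenizerMapper and SumReducer classes from Q2 and input/output paths passed on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);               // JAR containing mapper, reducer and driver

        job.setMapperClass(TokenizerMapper.class);              // class containing the map function
        job.setReducerClass(SumReducer.class);                  // class containing the reduce function
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);         // input format
        job.setOutputFormatClass(TextOutputFormat.class);       // output format

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output location in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}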

Q4 What does the Mapper do?

Answer: The Mapper is the first phase of a MapReduce job and processes the map tasks. A Mapper reads key/value pairs and emits key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records, and a given input pair may map to zero or many output pairs.

Q5 Is there an easy way to see the status and health of a cluster?

Answer: There are web-based interfaces to both the JobTracker (MapReduce master) and the NameNode (HDFS master) which display status pages about the state of the entire system. The JobTracker status page will display the state of all nodes, as well as the job queue and the status of all currently running jobs and tasks. The NameNode status page will display the state of all nodes and the amount of free space, and provides the ability to browse the DFS via the web.

Q6 Which interfaces need to be implemented to create a Mapper and a Reducer for Hadoop?

Answer:

·         org.apache.hadoop.mapreduce.Mapper

·         org.apache.hadoop.mapreduce.Reducer

Q7 What is SequenceFileInputFormat?

Answer: SequenceFileInputFormat is an input format for reading sequence files: a specific compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
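A short sketch of the typical use between two chained jobs, assuming job1 and job2 are org.apache.hadoop.mapreduce.Job instances and the intermediate path is hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainedJobFormats {
    // Wire two jobs so that job1's binary output feeds job2 directly.
    static void configureFormats(Job job1, Job job2) throws Exception {
        Path intermediate = new Path("/tmp/intermediate");          // hypothetical HDFS path

        // Job 1 writes its output as a binary sequence file ...
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job1, intermediate);

        // ... and Job 2 reads those key/value pairs back without re-parsing text.
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, intermediate);
    }
}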

Q8 What are ‘maps’ and ‘reduces’?

Answer: ‘Map’ and ‘Reduce’ are the two phases of solving a query in HDFS. The map is responsible for reading data from the input location and, based on the input type, generating key/value pairs, that is, intermediate output, on the local machine. The reducer is responsible for processing the intermediate output received from the mappers and generating the final output.

Q9 What does conf.setMapperClass() do?

Answer: conf.setMapperClass() sets the mapper class for the job and everything related to the map task, such as reading the data and generating key/value pairs out of the mapper.
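A brief sketch with the old org.apache.hadoop.mapred API that the question refers to; MyDriver and MyMapper are hypothetical classes:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class MapSideSettings {
    static JobConf buildConf() {
        JobConf conf = new JobConf(MyDriver.class);
        conf.setJobName("my-job");
        conf.setMapperClass(MyMapper.class);              // class whose map() will be invoked
        conf.setMapOutputKeyClass(Text.class);            // intermediate (map output) key type
        conf.setMapOutputValueClass(IntWritable.class);   // intermediate (map output) value type
        return conf;
    }
}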

Q10 What are the methods in the Reducer class and the order of their invocation?

Answer: The Reducer class contains a run() method, which calls its setup() method only once, then calls the reduce() method once for each key, and finally calls its cleanup() method.
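A simplified sketch of that invocation order, written as the run() method inside a new-API Reducer subclass; the real Hadoop implementation is close to this, but treat it as illustrative rather than the exact source:

// Simplified view of how the framework drives a Reducer.
public void run(Context context) throws IOException, InterruptedException {
    setup(context);                                       // called once, before any keys
    while (context.nextKey()) {                           // iterate over each distinct key
        reduce(context.getCurrentKey(), context.getValues(), context);
    }
    cleanup(context);                                     // called once, after the last key
}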

Q11 What is the purpose of the RecordReader in Hadoop?

Answer: In Hadoop, the RecordReader loads the data from its source and converts it into key/value pairs suitable for reading by the Mapper.

Q12 Explain MapReduce and why it is needed when programming with Apache Pig

Answer: Programs in Apache Pig are written in a query language, nowadays called Pig Latin, which has some similarity to the SQL query language. To get a query executed, an execution engine is required: the Pig engine converts the queries into MapReduce jobs, so MapReduce acts as the execution engine needed to run the programs.

Q13 What are some typical functions of the JobTracker?

Answer: The following are some typical tasks of the JobTracker:

·         Client applications submit MapReduce jobs to the JobTracker

·         The JobTracker talks to the NameNode to determine the location of the data

·         The JobTracker locates TaskTracker nodes with available slots at or near the data

·         The JobTracker submits the work to the chosen TaskTracker nodes

·         The TaskTracker nodes are monitored. If they do not submit heartbeat signals, they are deemed to have failed and the work is scheduled on a different TaskTracker

·         When the work is completed, the JobTracker updates its status

·         Client applications can poll the JobTracker for information

Q14 What are the four basic parameters of a mapper?

Answer: The four basic parameters of a mapper are LongWritable, Text, Text, and IntWritable. The first two represent the input key and value types and the second two represent the intermediate output key and value types.

Q15 How can we change the split size if our commodity hardware has less storage space?

Answer: If our commodity hardware has less storage space, we can change the split size by writing a custom splitter. Hadoop provides this customization feature, which can be invoked from the main (driver) method.

Q16 What is a TaskInstance?

Answer: The actual Hadoop MapReduce tasks that run on each slave node are referred to as task instances. By default, every task instance runs in its own JVM process, which is spawned for that task.

Q17 What do the master class and the output class do?

Answer: The master class is defined to update the master (that is, the JobTracker) with the job's status, and the output class is defined to write data to the output location.

Q18 What is the input type/format in MapReduce by default?

Answer: By default, the input type in MapReduce is ‘text’ (TextInputFormat).

Q19 Is it mandatory to set input and output type/format in MapReduce?

Answer: No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input and the output type as ‘text’.

Q20 How is Hadoop different from other data processing tools?

Answer: In Hadoop, based upon your requirements, you can increase or decrease the number of mappers without bothering about the volume of data to be processed. This is the beauty of parallel processing in contrast to the other data processing tools available.

Q21 What does the JobConf class do?

Answer: MapReduce needs to logically separate the different jobs running on the same cluster. The JobConf class provides job-level settings, such as declaring a job in the real environment. It is recommended that the job name be descriptive and represent the type of job being executed.

Q22 Is it important for Hadoop MapReduce jobs to be written in Java?

Answer: It is not necessary to write Hadoop MapReduce jobs in Java; users can write MapReduce jobs in any desired programming language, such as Ruby, Perl, Python, R, or Awk, through the Hadoop Streaming API.

Q23 What is a Combiner?

Answer: A ‘Combiner’ is a mini reducer that performs the local reduce task. It receives the input from the mappers on a particular node and sends its output on to the reducers. Combiners enhance the efficiency of MapReduce by reducing the amount of data that has to be sent to the reducers.
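A driver-side sketch: when the reduce operation is associative and commutative (as in word count), the reducer class itself can often be reused as the combiner. Here job is an org.apache.hadoop.mapreduce.Job and SumReducer is the hypothetical reducer from Q2.

// Local pre-aggregation on each mapper node before the shuffle.
job.setCombinerClass(SumReducer.class);
job.setReducerClass(SumReducer.class);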

Q24 What do sorting and shuffling do?

Answer: Sorting and shuffling are responsible for presenting each unique key together with its list of values to the reducer. Gathering equal keys into one location is known as sorting, and the process by which the intermediate output of the mappers is sorted and sent across to the reducers is known as shuffling.

Q25 What are the four basic parameters of a reducer?

Answer: The four basic parameters of a reducer are Text, IntWritable, Text, and IntWritable. The first two represent the intermediate output key and value types and the second two represent the final output key and value types.

Q26 What are the key differences between Pig vs MapReduce?

Answer: Pig is a data flow language; its key focus is managing the flow of data from the input source to the output store. As part of managing this data flow, it moves data along, feeding it to process 1, taking the output, and feeding it to process 2. Its core features are preventing execution of subsequent stages if a previous stage fails, managing temporary storage of data, and, most importantly, compressing and rearranging processing steps for faster processing. While this could be done for any kind of processing task, Pig is written specifically for managing the data flow of MapReduce-type jobs. Most, if not all, jobs in Pig are MapReduce or data movement jobs. Pig also allows custom functions to be added for processing, alongside default ones such as ordering, grouping, distinct, and count.

MapReduce, on the other hand, is a data processing paradigm: it is a framework for application developers to write code in so that it scales easily to petabytes of data, which creates a separation between the developer who writes the application and the developer who scales it. Not all applications can be migrated to MapReduce, but a good few can, ranging from complex ones like k-means to simple ones like counting unique values in a dataset.

Q27 Why can we not do aggregation or addition in a mapper? Why do we require a reducer for that?

Answer: We cannot do aggregation or addition across records in a mapper because sorting is not done in the mapper; sorting happens only on the reducer side. The mapper is initialized per input split and the map method is called afresh for each record, so while aggregating we would lose the value of the previous record and have no track of previous row values. The reducer, which receives all values for a key together after sorting, is the place to aggregate.

Q28 What does a split do?

Answer: Before the data is transferred from its location on disk to the map method, there is a phase called the split. The split pulls a block of data from HDFS into the framework. The Split class does not write anything; it reads data from the block and passes it to the mapper. By default, splitting is taken care of by the framework, the split size is equal to the block size, and it is used to divide a block into a bunch of splits.

Q29 What does the text input format do?

Answer: In the text input format, each line produces a line offset. The key is the line offset (the byte offset of the line within the file) and the value is the whole line of text. This is how the data gets processed by the mapper: the mapper receives the key as a LongWritable parameter and the value as a Text parameter.

Q30 What does a MapReduce partitioner do?

Answer: A MapReduce partitioner makes sure that all values for a single key go to the same reducer, thus allowing an even distribution of the map output over the reducers. It redirects mapper output to a reducer by determining which reducer is responsible for a particular key.
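A minimal custom partitioner sketch (the class name is hypothetical); it mimics what the default HashPartitioner does, sending every occurrence of a key to the same reduce task:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;                                         // map-only job edge case
        }
        // The same key always hashes to the same reducer index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// Registered in the driver with: job.setPartitionerClass(HashKeyPartitioner.class);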

Q31 Can we rename the output file?

Answer: Yes, we can rename the output file by implementing a multiple-output format class (such as MultipleOutputs).

Q32 What is Streaming?

Answer: Streaming is a feature of the Hadoop framework that allows us to write MapReduce programs in any programming language that can accept standard input and produce standard output. It could be Perl, Python, or Ruby, and not necessarily Java. However, deeper customization of MapReduce can only be done using Java and not any other programming language.

Q33 What is Speculative Execution?

Answer: In Hadoop, during speculative execution, a certain number of duplicate tasks are launched: multiple copies of the same map or reduce task can be executed on different slave nodes. In simple words, if a particular node is taking a long time to complete a task, Hadoop will create a duplicate task on another node. The copy that finishes first is retained, and the duplicates that do not finish first are killed.

Q34 Is it possible to start reducers while some mappers still run? Why?

Answer: No. A reducer's input is grouped by key, and the last mapper could theoretically produce a key already consumed by a running reducer. (Reducers may begin copying map output early, but the reduce method itself waits for all mappers; see Q44.)

Q35 Describe a reduce-side join between tables with a one-to-one relationship.

Answer: The mapper produces key/value pairs with the join IDs as keys and the rows as values. Corresponding rows from both tables are grouped together by the framework during the shuffle and sort phase. The reduce method then obtains a join ID and two values, each representing a row from one table, and joins the data.
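A hedged reducer sketch for the join described above, assuming each mapper tagged its rows with an "A|" or "B|" prefix so the reducer can tell which table a value came from (the class name and tags are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text joinId, Iterable<Text> taggedRows, Context context)
            throws IOException, InterruptedException {
        String rowA = null;
        String rowB = null;
        // With a one-to-one relationship, each join id carries at most one row per table.
        for (Text tagged : taggedRows) {
            String v = tagged.toString();
            if (v.startsWith("A|")) {
                rowA = v.substring(2);
            } else if (v.startsWith("B|")) {
                rowB = v.substring(2);
            }
        }
        if (rowA != null && rowB != null) {
            context.write(joinId, new Text(rowA + "\t" + rowB));   // the joined record
        }
    }
}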

Q36 Can you run MapReduce jobs directly on Avro data?

Answer: Yes, Avro was specifically designed for data processing via MapReduce.

Q37 Can reducers communicate with each other?

Answer: Reducers always run in isolation and they can never communicate with each other as per the Hadoop MapReduce programming paradigm.

Q38 How can you set an arbitrary number of Reducers to be created for a job in Hadoop?

Answer: You can do it programmatically by using the setNumReduceTasks() method on the JobConf (or Job) class, or set it as a configuration setting.
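A brief sketch of both routes, assuming job is an org.apache.hadoop.mapreduce.Job and conf is its Configuration; the property name differs slightly between Hadoop releases:

// Programmatic: request ten reduce tasks for this job.
job.setNumReduceTasks(10);

// Configuration setting: equivalent effect if set before the job is built.
conf.setInt("mapreduce.job.reduces", 10);     // "mapred.reduce.tasks" on older releases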

Q39 What is TaskTracker?

Answer: A TaskTracker is a node in the cluster that accepts tasks, such as map, reduce, and shuffle operations, from a JobTracker. Each TaskTracker is responsible for executing and managing the individual tasks assigned by the JobTracker, and it also handles the data motion between the map and reduce phases. One prime responsibility of the TaskTracker is to constantly communicate the status of its tasks to the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

Q40 How do you set the number of mappers and reducers for Hadoop jobs?

Answer: Users can configure the JobConf variable to set the number of mappers and reducers, using job.setNumMapTasks() and job.setNumReduceTasks().

Q41 What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a slave node?

Answer: A single instance of a TaskTracker runs on each slave node, as a separate JVM process. A single instance of the DataNode daemon also runs on each slave node, again as a separate JVM process. One or multiple task instances run on each slave node, each as a separate JVM process. The number of task instances can be controlled by configuration; typically, a high-end machine is configured to run more task instances.

Q42 What do you know about NLineInputFormat?

Answer: NLineInputFormat splits ‘n’ lines of input as one split.

Q43 True or false: Each reducer must generate the same number of key/value pairs as its input had.

Answer: False. A reducer may generate any number of key/value pairs, including zero.

Q44 When are the reducers started in a MapReduce job?

Answer: In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key/value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.

Q45 Name the job control options specified by MapReduce.

Answer: Since this framework supports chained operations, wherein the output of one map job serves as the input to another, there is a need for job controls to govern these complex operations. The various job control options are (see the sketch after this list):

·         submit(): submits the job to the cluster and returns immediately

·         waitForCompletion(boolean): submits the job to the cluster and waits for its completion
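A minimal sketch of the two options on an org.apache.hadoop.mapreduce.Job instance:

// Fire-and-forget: submit() returns immediately; progress can be polled later,
// for example with job.isComplete() or job.isSuccessful().
job.submit();

// Blocking: submits the job and waits for it to finish; the boolean argument
// controls whether progress is printed to the console.
boolean success = job.waitForCompletion(true);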

Q46 Decide if the statement is true or false: Each combiner runs exactly once.

Answer: False. The framework decides whether the combiner runs zero, one, or multiple times.

Q47 Define a straggler.

Answer: A straggler is a map or reduce task that takes an unusually long time to complete.

Q48 What is the Distributed Cache in the MapReduce framework?

Answer: The Distributed Cache is an important feature provided by the MapReduce framework. When you want to share some files across all nodes in a Hadoop cluster, the DistributedCache is used. The files could be executable JAR files or simple properties files.
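A hedged sketch using the newer Job/Context methods (older code calls the org.apache.hadoop.filecache.DistributedCache class directly); the HDFS path is hypothetical and error handling is omitted:

// Driver side: register a file to be copied to every node before the tasks start.
job.addCacheFile(new java.net.URI("/apps/lookup/countries.properties"));   // hypothetical path

// Task side (inside a Mapper or Reducer): the cached files are available locally.
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    java.net.URI[] cached = context.getCacheFiles();
    // Open cached[0] with ordinary java.io and load it into memory for use in map()/reduce().
}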

Q49 How does the JobTracker schedule a task?

Answer: The TaskTrackers send out heartbeat messages to the JobTracker at regular intervals to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

Q50 What is a ChainMapper?

Answer: The ChainMapper class is a special implementation of the Mapper class through which a set of mapper classes can be run in a chained fashion within a single map task. In this chained execution pattern, the first mapper's output becomes the input of the second mapper, the second mapper's output goes to the third, and so on until the last mapper.
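A hedged driver sketch using the org.apache.hadoop.mapreduce.lib.chain.ChainMapper API, assuming job is an org.apache.hadoop.mapreduce.Job; SplitMapper and FilterMapper are hypothetical mapper classes, and each mapper's output types must match the next mapper's input types:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

// First mapper in the chain: (LongWritable, Text) -> (Text, Text).
ChainMapper.addMapper(job, SplitMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        new Configuration(false));

// Second mapper consumes the first mapper's output: (Text, Text) -> (Text, Text).
ChainMapper.addMapper(job, FilterMapper.class,
        Text.class, Text.class, Text.class, Text.class,
        new Configuration(false));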

 
