By Umesh Singh
Last update: 02 Jun 2020, 17 Questions
Introduction to Hadoop
Hadoop Interview Questions

If you wish to learn more about Hadoop and want to pursue it as a career, we have prepared a list of the most frequently asked Hadoop Interview Questions. This will help you in gaining more knowledge on the subject and cracking a job interview requiring Hadoop as a significant skill.

Hadoop is an open-source framework that allows users to store and process large amounts of data across a set of distributed nodes. In addition, Hadoop is a multi-tasking system capable of handling multiple data sets for numerous jobs and users at the same time.

Most Frequently Asked Hadoop Interview Questions

1. What is Hadoop Streaming?

Hadoop Streaming is a utility that is included with the Hadoop distribution. It allows users to create and run MapReduce jobs using any executable or script as the mapper and/or the reducer.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc
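
The same mechanism works with scripts in place of /bin/cat and /bin/wc. Below is a minimal sketch, assuming a word-count job: a streaming mapper reads input lines and emits tab-separated key/value lines, and the reducer receives those lines sorted by key. The names `mapper` and `reducer` are illustrative and not part of Hadoop, and the `sorted` call only stands in for the sort that Hadoop performs between the two steps.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of the job on two input lines.
lines = ["to be or", "not to be"]
mapped = sorted(mapper(lines))  # stands in for Hadoop's sort between map and reduce
result = list(reducer(mapped))
print(result)
```

In a real job, each function would live in its own script that reads from standard input and prints to standard output, and the scripts would be passed through the -mapper and -reducer options shown above.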


2. What are the features of Hadoop?

Here are some features of Hadoop which make it a popular choice among the software community:

  • It is open-source.
  • The Hadoop Cluster is highly scalable.
  • Hadoop provides users with a fault-tolerance mechanism.
  • It offers high availability of data even in unfavorable conditions.
  • It is cost-effective.
  • It is known for swift data processing.
  • It is based on the data locality concept.
  • Hadoop provides flexibility by processing unstructured data.
  • Hadoop ensures data reliability through the replication of data in clusters.

3. What are the configuration files in Hadoop?

Here is a list of Hadoop configuration files with their descriptions:

File | Description
hadoop-env.sh | It contains environment variables used in the scripts that run Hadoop.
core-site.xml | It contains configuration settings for Hadoop Core, such as I/O settings common to HDFS and MapReduce.
hdfs-site.xml | It contains configuration settings for the HDFS daemons: the NameNode, the secondary NameNode, and the DataNodes.
mapred-site.xml | It contains configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers.
masters | It is a list of machines that run a secondary NameNode.
slaves | It is a list of machines that run DataNodes and TaskTrackers.
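
The site files are XML documents made up of name/value properties. The sketch below shows that layout and reads it back into a dictionary. `fs.defaultFS` is a real core-site.xml property, but the host and port are placeholder values, and `read_properties` is a hypothetical helper written for this example rather than a Hadoop tool.

```python
import xml.etree.ElementTree as ET

# Sample core-site.xml content; the host and port are placeholder values.
CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return the name/value pairs of a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


settings = read_properties(CORE_SITE)
print(settings["fs.defaultFS"])
```
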

4. What is data serialization in Hadoop?

Data serialization is the process of converting structured data into a stream of bytes in such a way that it can later be converted back to its original form. The resulting byte stream can be transferred over the network or stored in a file or database, regardless of the system architecture.
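
The round trip can be illustrated with a short sketch. It uses Python's `struct` module purely as an illustration; Hadoop itself is written in Java and relies on its own Writable types, which are not shown here. The record layout (an integer id, a floating-point score, and a length-prefixed name) is an assumption made up for this example.

```python
import struct

# Hypothetical record layout: 4-byte int id, 8-byte float score,
# then a 2-byte length prefix followed by the UTF-8 encoded name.
# ">" selects big-endian byte order, so the bytes do not depend on the machine.
HEADER = struct.Struct(">idH")


def serialize(record_id, score, name):
    """Turn a record into a byte string."""
    encoded = name.encode("utf-8")
    return HEADER.pack(record_id, score, len(encoded)) + encoded


def deserialize(payload):
    """Rebuild the original record from its byte string."""
    record_id, score, length = HEADER.unpack_from(payload)
    name = payload[HEADER.size:HEADER.size + length].decode("utf-8")
    return record_id, score, name


original = (42, 3.5, "hadoop")
payload = serialize(*original)   # bytes that could be sent or stored
restored = deserialize(payload)  # same values as `original`
print(restored == original)
```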

5. What is Hadoop MapReduce used for?

In Hadoop, MapReduce is a programming framework that allows users to process extensive data sets in parallel across the nodes of a distributed cluster.
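
The programming model can be sketched in a single process. The example below assumes input records of the form "year,temperature" and finds the highest temperature per year; the function names are hypothetical. On a real cluster the map and reduce calls would run as parallel tasks on different nodes, and the framework would perform the grouping step.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one 'year,temperature' record into a (key, value) pair."""
    year, temperature = record.split(",")
    yield year, int(temperature)


def reduce_phase(year, temperatures):
    """Reduce: collapse all values for one key into a single result."""
    return year, max(temperatures)


def run_job(records):
    # Shuffle: group the mapped values by key between the two phases.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in sorted(groups.items()))


readings = ["2019,31", "2020,28", "2019,35", "2020,33"]
result = run_job(readings)
print(result)
```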

6. What is the difference between the distributed file system and the Hadoop distributed file system?

Distributed File System | Hadoop Distributed File System (HDFS)
It is primarily designed to hold a large amount of data while providing access to multiple clients over a network. | It is designed to hold vast amounts of data (terabytes to petabytes) and also supports very large individual files.
Here, the files are stored on a single machine. | Here, the files are stored across multiple machines.
It does not provide data reliability. | It provides data reliability.
If multiple clients access the data at the same time, it can cause a server overload. | HDFS handles simultaneous access smoothly; multiple clients accessing the data at once do not cause a server overload.

7. What is active and passive NameNode in Hadoop?

Active NameNode: It is the NameNode that works and runs inside the cluster and serves client requests.
Passive NameNode: It is a standby NameNode that holds the same data as the active NameNode, so it can take over if the active NameNode fails.

8. What is the difference between HDFS and NFS?

Network File System (NFS) | HDFS
This is a protocol developed so that clients can access files over a standard network. | This is a file system that is distributed among multiple systems or nodes.
It allows users to access files as if they were local, even though the files reside on a network. | It is fault-tolerant, i.e., it stores multiple replicas of files over different systems.

9. What happens if two clients try writing into the same HDFS file?

In HDFS, when the first client contacts the NameNode to write to a file, the NameNode grants that client a lease to create the file. When a second client then tries to open the same file for writing, the NameNode sees that the lease is already held by another client and rejects the second client's open request.
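
The single-writer rule can be modelled with a toy class. This is only a sketch of the behaviour described above, not the real HDFS API: `ToyNameNode`, its methods, the file path, and the client names are all hypothetical, and real HDFS tracks write access through time-limited leases with more bookkeeping than shown here.

```python
class ToyNameNode:
    """Toy model of single-writer access; not the real HDFS API."""

    def __init__(self):
        self._writers = {}  # file path -> client currently allowed to write

    def open_for_write(self, path, client):
        holder = self._writers.get(path)
        if holder is not None and holder != client:
            # Another client already has write access to this file.
            raise PermissionError(f"{path} is already being written by {holder}")
        self._writers[path] = client
        return True

    def close(self, path, client):
        # Giving up write access lets another client write to the file later.
        if self._writers.get(path) == client:
            del self._writers[path]


namenode = ToyNameNode()
namenode.open_for_write("/data/log.txt", "client-1")
try:
    namenode.open_for_write("/data/log.txt", "client-2")
    second_accepted = True
except PermissionError:
    second_accepted = False
print(second_accepted)
```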

10. What are the different schedulers available in Hadoop?

Here are the different types of schedulers available in Hadoop:

  • The FIFO Scheduler
  • The Fair Scheduler
  • The Capacity Scheduler