Spark Interview Questions and Answers

Last updated on Feb 06, 2023
Spark Interview Questions

Below, we have created a list of the most frequently asked Spark Interview Questions and Answers. Reading these can help you gain more knowledge and insight into this computing system. If you are looking for a job change or are starting your career in Spark, this list of Spark Interview Questions can help you gain more confidence and, eventually, a job of your choice in this field.

Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in various programming languages such as Java, Python, and Scala, and its prime purpose is to provide an optimized engine capable of supporting general execution graphs.

Apache Spark
What is Apache Spark? It is an open-source, general-purpose cluster-computing framework. It provides over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Python, Scala, and SQL shells.
Latest Version: 2.4.4, released on 1st September 2019
Created By: Matei Zaharia
Written in: Python, Scala, Java, SQL
Official Website: https://spark.apache.org
Operating System: Linux, Windows, macOS
License: Apache License 2.0

Most Frequently Asked Spark Interview Questions

Here in this article, we will be listing frequently asked Spark Interview Questions and Answers with the belief that they will help you perform better in your interviews. Also, note that this article has been written under the guidance of industry professionals and covers all the current competencies.

Q11. What happens if an RDD partition is lost due to a worker node failure?
Answer

In Spark, if any partition of an RDD is lost due to the failure of a worker node, that partition can be re-computed using the lineage of operations from the original fault-tolerant dataset.
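As a quick illustration, the chain of transformations that Spark records for recovery can be inspected with toDebugString. This is a minimal sketch assuming a SparkSession named spark and a hypothetical file path; if a partition of wordCounts is lost, Spark replays only that partition through this lineage.

      val lines = spark.sparkContext.textFile("/path/textFile.txt")   // hypothetical input file
      val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
      // toDebugString prints the lineage (the chain of transformations) that Spark
      // would replay to rebuild any lost partition of wordCounts.
      println(wordCounts.toDebugString)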

Q12. What is Spark GraphX used for?
Answer

Here are the uses of GraphX in Spark:

  • It can be used for unifying ETL, exploratory analysis, and computation of iterative graphs within a single system.
  • It can be used to present data in the form of graphs and collections while transforming and joining graphs with RDDs.
  • It can be used for writing custom iterative graph algorithms with the help of the Pregel API (see the sketch below).
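For illustration, here is a minimal sketch of building a small property graph and running PageRank, which GraphX implements on top of the Pregel API. It assumes a SparkSession named spark, and the vertex and edge values are made up.

      import org.apache.spark.graphx.{Edge, Graph}

      val sc = spark.sparkContext
      // Vertices are (id, property) pairs; edges carry their own property ("follows").
      val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
      val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
      val graph = Graph(vertices, edges)

      // PageRank is one of the built-in algorithms implemented on top of the Pregel API.
      graph.pageRank(0.001).vertices.collect().foreach(println)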
Q13. What is the difference between RDD and DataFrame?
Answer
RDD vs. DataFrame
  • RDD: It is an immutable collection of objects representing a set of records in distributed computing. DataFrame: It is used for storing data and is basically the equivalent of a table in a relational database, with richer optimization.
  • RDD: It is an array of references to partitioned objects representing a large set of data. DataFrame: It is a distributed collection of data organized into named rows and columns.
  • RDD: All the datasets are logically partitioned across servers so they can be computed on different nodes of a cluster. DataFrame: It has a matrix-like structure with columns of different types, such as numeric, logical, and so on.
  • RDD: It supports compile-time type safety, having been built on object-oriented programming. DataFrame: If the user tries to access a non-existent column, the error surfaces only at runtime; there is no compile-time type safety.
  • RDD: Almost all data sources are supported. DataFrame: It requires data sources in formats such as JSON, CSV, or Avro, or storage systems such as Hive, HDFS, or MySQL tables.
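The differences above are easiest to see side by side. The following minimal sketch builds both an RDD and a DataFrame from the same data; it assumes a SparkSession named spark, and the column names are illustrative.

      import spark.implicits._

      val data = Seq(("Java", 10000), ("Python", 200000), ("Scala", 4000))

      val rdd = spark.sparkContext.parallelize(data)    // RDD of tuples; field types are checked at compile time
      val df  = data.toDF("language", "users")          // DataFrame with named columns and Catalyst optimization

      rdd.map(_._1).collect().foreach(println)          // type-checked access to the first tuple field
      df.select("language").show()                      // a misspelled column name would fail only at runtime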
Q14. What is coalesce in Spark?
Answer

In Spark, coalesce() is a method used to reduce the number of partitions of a DataFrame (or RDD). It is most commonly used when the user wants to decrease the number of partitions without triggering a full shuffle, because it merges data into the existing partitions.
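A minimal sketch of coalesce on a DataFrame, assuming a SparkSession named spark; the partition counts are arbitrary.

      val df = spark.range(0, 1000000, 1, 8).toDF("id")   // start with 8 partitions
      println(df.rdd.getNumPartitions)                    // 8

      val fewer = df.coalesce(2)                          // merges existing partitions; avoids a full shuffle
      println(fewer.rdd.getNumPartitions)                 // 2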

Q15. What are the benefits of Spark over MapReduce?
Answer

Here are some of the advantages of using Spark rather than Hadoop’s MapReduce:

  • Spark is relatively easier to program and requires far less code than MapReduce.
  • Spark has a built-in interactive mode, whereas MapReduce, by default, supports only batch processing and has no built-in interactive mode.
  • Spark uses a data abstraction, the RDD, to make its features more productive, whereas MapReduce has no comparable abstraction.
  • Spark executes batch processing jobs about 10x to 100x faster than MapReduce.
  • Spark is considered a general-purpose cluster computing engine because it supports several methods of data processing, such as streaming, batch processing, and machine learning, whereas MapReduce has only a batch engine.
  • Spark achieves lower latency through partial or complete caching of results in memory across various nodes, whereas MapReduce is disk-based and has far higher latency.
Q16. What is the difference between coalesce and repartition in Spark?
Answer
Coalesce vs. Repartition
  • Coalesce: It can only decrease the number of partitions of a DataFrame. Repartition: It can either decrease or increase the number of partitions of a DataFrame.
  • Coalesce: It uses the existing partitions to minimize the amount of data that is shuffled. Repartition: It creates new partitions by performing a full shuffle.
  • Coalesce: The resulting partitions can be of varying sizes. Repartition: The resulting partitions are of roughly equal size.
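A short sketch contrasting the two methods on an RDD, assuming a SparkSession named spark; the partition counts are arbitrary.

      val rdd = spark.sparkContext.parallelize(1 to 100, 10)   // 10 partitions

      val narrowed   = rdd.coalesce(4)        // only decreases the count; merges partitions without a full shuffle
      val reshuffled = rdd.repartition(20)    // can increase the count, but performs a full shuffle

      println(narrowed.getNumPartitions)      // 4
      println(reshuffled.getNumPartitions)    // 20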
Q17. Why does Spark use lazy evaluation?
Answer

Spark uses Lazy Evaluation because of the following reasons:

  • It increases the manageability of the program by dividing it into smaller operations, thereby reducing the number of passes on the data by grouping operations.
  • It increases speed and saves computational overhead by computing only the values that are actually needed, as the sketch after this list shows.
  • It reduces complexities in any program by allowing users to work with an infinite data structure while drastically reducing time and space overheads.
  • It optimizes the program by reducing the number of queries being run.
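For example, in the minimal sketch below (hypothetical file path, SparkSession named spark), the two transformations only build a plan; nothing is read or computed until the count() action runs, and then Spark makes a single optimized pass over the data.

      val lines  = spark.sparkContext.textFile("/path/textFile.txt")   // transformation: nothing is read yet
      val errors = lines.filter(_.contains("ERROR"))                   // transformation: still nothing computed

      println(errors.count())   // action: only now does Spark plan and execute a single pass over the data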
Q18. What is the difference between cache and persist in Spark?
Answer
cache() vs. persist()
  • cache(): When using this, the default storage level is MEMORY_ONLY for an RDD and MEMORY_AND_DISK for a Dataset. persist(): When using this, the user can choose from the various available storage levels for both RDDs and Datasets.
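A brief sketch of the difference, assuming a SparkSession named spark and a hypothetical file path: cache() takes no arguments, while persist() accepts a storage level.

      import org.apache.spark.storage.StorageLevel

      val rdd = spark.sparkContext.textFile("/path/textFile.txt")
      rdd.cache()                                    // same as persist(StorageLevel.MEMORY_ONLY) for an RDD

      val ds = spark.range(1000)
      ds.persist(StorageLevel.MEMORY_AND_DISK_SER)   // persist() lets you pick any storage level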
Q19. How can an RDD be created in Spark?
Answer

RDDs or Resilient Distributed Datasets are the fundamental data structure present in Spark. They are immutable and fault-tolerant in nature. There are multiple ways to create RDDs in Spark. They are:

  • Creating an RDD from a Seq or List using parallelize()

           RDDs can be created by taking an existing collection from the driver program and passing it to SparkContext's parallelize() method. Here's an example:
      val rdd=spark.sparkContext.parallelize(Seq(("Java", 10000),
      ("Python", 200000), ("Scala", 4000)))
      rdd.foreach(println)

       Output
      (Python,200000)
      (Scala,4000)
      (Java,10000)

  • Creating an RDD using a text file

     Mostly, in production systems, users generate RDDs from files by simply reading the data in them. Let us see how:
     val rdd = spark.sparkContext.textFile("/path/textFile.txt")
     The above line of code creates an RDD in which each record represents one line of the file.

  • Creating RDDs from Dataframes and DataSets

     You can easily convert any DataFrame or Dataset into an RDD by using the rdd method. Here's how:
     val myRdd2 = spark.range(20).toDF().rdd
     In the above line of code, spark.range(20) creates a Dataset, toDF() converts it into a DataFrame, and calling rdd on it returns the underlying RDD.

Q20. How many types of RDD operations are there in Spark?
Answer

There are two types of RDD Operations in Spark. They are:

  • Transformation: A type of function in which a new RDD is created from an existing RDD, for example map() or filter().
  • Action: A type of function that triggers the actual computation on the dataset and returns a result to the driver program, for example count() or collect(). Both kinds are shown in the sketch below.
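A minimal sketch of both kinds of operations, assuming a SparkSession named spark:

      val nums = spark.sparkContext.parallelize(1 to 10)

      val squares = nums.map(n => n * n)   // transformation: lazily defines a new RDD from an existing one

      println(squares.reduce(_ + _))       // action: triggers execution and returns a value to the driver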