RDDs, or Resilient Distributed Datasets, are the fundamental data structure in Spark. They are immutable and fault-tolerant by design. There are multiple ways to create RDDs in Spark:

  • Creating an RDD from a Seq or List using parallelize

           RDDs can be created by passing an existing collection in the driver program to SparkContext's parallelize() method. Here's an example:
      val rdd = spark.sparkContext.parallelize(Seq(("Java", 10000),
        ("Python", 200000), ("Scala", 4000)))
      rdd.foreach(println)

       Output
      (Python,200000)
      (Scala,4000)
      (Java,10000)
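
      Note that foreach runs on the executors, so the print order is not guaranteed across runs. A minimal sketch (building on the rdd above) that brings the data back to the driver and sorts it for deterministic output:

      // Sort by key and collect to the driver so the output order is stable.
      // collect() is only safe for RDDs small enough to fit in driver memory.
      rdd.sortByKey().collect().foreach(println)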

  • Creating an RDD using a text file

     In production systems, RDDs are most often created by reading data directly from files. Let us see how:
     val rdd = spark.sparkContext.textFile("/path/textFile.txt")
     The above line of code creates an RDD in which each record represents one line of the file.
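
     For example, here is a simple word count over the file RDD (a sketch; it builds on the rdd created from the placeholder path above):

      // Split each line into words, pair each word with 1, and sum the counts per word.
      val counts = rdd.flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.take(5).foreach(println) // show a few (word, count) pairs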

  • Creating RDDs from DataFrames and Datasets

     You can easily convert any DataFrame or Dataset into an RDD by calling its rdd method. Here's how:
     val myRdd2 = spark.range(20).toDF().rdd
     In the above line of code, spark.range(20) creates a Dataset of 20 rows, toDF() converts it to a DataFrame, and rdd returns the underlying RDD.
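
     Since a DataFrame is a Dataset[Row], myRdd2 is an RDD[Row], so you typically extract typed values from each Row before further processing. A short sketch building on myRdd2 above:

      // Pull the Long "id" column out of each Row, then aggregate on the RDD.
      val ids = myRdd2.map(row => row.getLong(0))
      println(ids.sum()) // 0 + 1 + ... + 19 = 190.0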
