What is the difference between RDD and DataFrame?
|RDD||DataFrame|
|It represents a set of records as an immutable collection of objects in a distributed computing environment.||It stores data in a structure equivalent to a table in a relational database, but with richer optimizations.|
|It represents a large dataset as a collection of references to partitioned objects.||It is a distributed collection of data organized into rows and named columns.|
|Each dataset is logically partitioned across servers so that it can be computed on different nodes of a cluster.||It has a matrix-like structure whose columns can hold different types, such as numeric, logical, and so on.|
|It supports compile-time type safety, since it is based on object-oriented programming with typed objects.||If the user tries to access a non-existent column, an attribute error is raised at runtime; there is no compile-time type safety.|
|Almost all data sources are supported by RDDs.||DataFrames read data in formats such as JSON, CSV, or Avro, and from storage systems such as HDFS, Hive tables, or MySQL.|
BY Best Interview Question ON 10 Jun 2020