What is the difference between RDD and DataFrame?
RDD | DataFrame |
---|---|
It represents a set of records as an immutable, distributed collection of objects. | It stores data in a form equivalent to a table in a relational database, but with richer optimizations. |
It holds references to partitioned objects that together represent a large dataset. | It is a distributed collection of data organized into rows and named columns. |
Its data is logically partitioned so that it can be computed in parallel on different nodes of a cluster. | It has a matrix-like structure whose columns can have different types, such as numeric, logical, and so on. |
It supports compile-time type safety in Scala and Java, because its records are typed objects. | Accessing a non-existent column fails only at runtime (for example, with an attribute error in PySpark); there is no compile-time type safety. |
RDDs can be built from almost any data source. | DataFrames read structured formats such as JSON, CSV, and Avro, as well as tables in storage systems such as Hive, HDFS, and MySQL. |
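
The contrast in the table can be sketched in plain Python. This is only a toy model under stated assumptions, not Spark's API: `Person`, `partition`, and `ToyFrame` are hypothetical names invented for illustration. The first half treats the data as partitioned objects (RDD-like), and the second half as rows under named columns (DataFrame-like), where a missing column is detected only when the code runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """An immutable record, standing in for the objects an RDD holds."""
    name: str
    age: int


def partition(records, num_partitions):
    """Split records round-robin into lists, loosely like RDD partitions."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts


class ToyFrame:
    """Rows stored under named columns, loosely like a DataFrame."""

    def __init__(self, schema, rows):
        self.schema = dict(schema)  # column name -> column type
        self.rows = [dict(zip(self.schema, row)) for row in rows]

    def select(self, column):
        # The column name is checked at runtime, not before the program runs.
        if column not in self.schema:
            raise KeyError(f"no such column: {column}")
        return [row[column] for row in self.rows]


# RDD-like: a collection of objects, split across two partitions.
people = [Person("Ann", 31), Person("Bob", 45), Person("Cy", 28)]
parts = partition(people, 2)
ages_rdd = [p.age for part in parts for p in part]

# DataFrame-like: the same data as rows with named, typed columns.
frame = ToyFrame({"name": str, "age": int},
                 [("Ann", 31), ("Bob", 45), ("Cy", 28)])
ages_df = frame.select("age")
```

Calling `frame.select("salary")` raises an error only when that line executes, which mirrors the runtime failure described for DataFrames. The compile-time checking that the table attributes to RDDs applies to Scala and Java and cannot be shown in Python.
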
BY Best Interview Question ON 10 Jun 2020