Spark API
If you are working with Spark, you will come across its three main APIs: RDDs, DataFrames, and Datasets.
What are Resilient Distributed Datasets?
RDD, or Resilient Distributed Dataset, is a fault-tolerant, immutable collection of records that is partitioned across the nodes of a cluster so it can be processed in parallel. RDDs are manipulated through a low-level API, and because they are evaluated lazily, Spark can plan the work before executing it, which improves performance. RDDs support two types of operations:
- Transformations - lazy operations that return another RDD; the result is not computed until an action is performed on it. Examples of transformations are map(), flatMap(), and filter().
- Actions - operations that trigger computation and return a value to the driver or write data out. Examples of actions are count(), top(), and saveAsTextFile(). Both kinds of operations are illustrated in the sketch after this list.
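The following is a minimal sketch in Scala of the distinction, assuming a local SparkSession and a hypothetical input path (data/input.txt):

```scala
import org.apache.spark.sql.SparkSession

object RddExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-example").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Transformations: nothing runs yet, Spark only records the lineage of RDDs
    val lines  = sc.textFile("data/input.txt")      // hypothetical input path
    val words  = lines.flatMap(_.split("\\s+"))
    val longer = words.filter(_.length > 3)

    // Actions: these trigger the actual computation
    println(longer.count())                         // returns a value to the driver
    longer.saveAsTextFile("data/output")            // hypothetical output directory

    spark.stop()
  }
}
```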
Disadvantages of RDDs
If you choose to work with RDDs, you have to optimize each job by hand, since RDDs do not benefit from Spark's built-in query optimizer. In addition, unlike Datasets and DataFrames, RDDs do not infer the schema of the ingested data, so you have to specify it yourself.
What are DataFrames?
A DataFrame is a distributed collection of rows organized under named columns. In simple terms, it looks like an Excel sheet with column headers, or you can think of it as the equivalent of a table in a relational database or a data frame in R or Python. It shares three main characteristics with RDDs:
- Immutable in nature: you can create a DataFrame, but you cannot change it. Just like an RDD, a DataFrame can only be transformed into a new DataFrame.
- Lazy evaluation: a transformation is not executed until an action is performed.
- Distributed: DataFrames, just like RDDs, are distributed across the cluster. These characteristics are illustrated in the sketch after this list.
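A minimal sketch of these characteristics, using a small in-memory dataset made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object DataFrameBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-basics").master("local[*]").getOrCreate()
    import spark.implicits._

    // A small DataFrame built from an in-memory sequence (illustrative data)
    val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)).toDF("name", "age")

    // Transformations return a *new* DataFrame; 'people' itself is never modified
    val adults = people.filter($"age" > 30).select($"name")

    // Nothing has been computed so far; the action below triggers the execution
    adults.show()

    spark.stop()
  }
}
```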
Ways to Create a DataFrame
In Spark, DataFrames can be created in several ways:
- Loading data from different formats and sources, such as JSON, CSV, XML, Parquet, or a relational database (RDBMS).
- Converting an already existing RDD.
- Programmatically specifying a schema. These approaches are sketched after this list.
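Here is a minimal sketch of these approaches, with hypothetical file paths and a tiny in-memory RDD standing in for real data:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object DataFrameCreation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-creation").master("local[*]").getOrCreate()

    // 1. From different data formats (hypothetical file paths)
    val fromJson = spark.read.json("data/people.json")
    val fromCsv  = spark.read.option("header", "true").csv("data/people.csv")

    // 2. From an already existing RDD, with a programmatically specified schema
    val rowRdd = spark.sparkContext.parallelize(Seq(Row("Alice", 34), Row("Bob", 45)))
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age",  IntegerType, nullable = false)
    ))
    val fromRdd = spark.createDataFrame(rowRdd, schema)

    fromRdd.printSchema()
    spark.stop()
  }
}
```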
Disadvantages of DataFrames
The main drawback of the DataFrame API is that it does not provide compile-time type safety: column names and types are checked only at runtime, so the user is limited when the structure of the data is not known in advance, and mistakes surface only once the job runs.
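For example, the following sketch (with made-up column names) compiles without complaint but fails with an AnalysisException at runtime, because the misspelled column name is just a string as far as the compiler is concerned:

```scala
import org.apache.spark.sql.SparkSession

object NoCompileTimeSafety {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-runtime-error").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

    // Compiles fine, but there is no "agee" column, so Spark throws an
    // AnalysisException only when the query is analyzed at runtime
    people.select("agee").show()

    spark.stop()
  }
}
```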
What are Datasets?
A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema. A Dataset can be created using JVM objects and manipulated using complex functional transformations. Datasets can be created in two ways:
- Dynamically, from existing JVM objects such as case-class instances.
- Reading from a JSON file using SparkSession. Both approaches are sketched after this list.
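A minimal sketch of both approaches, assuming a simple Person case class and a hypothetical JSON file path:

```scala
import org.apache.spark.sql.SparkSession

// Case class defining the schema; the field names are just an example
case class Person(name: String, age: Long)

object DatasetCreation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ds-creation").master("local[*]").getOrCreate()
    import spark.implicits._

    // 1. Dynamically, from JVM objects
    val fromObjects = Seq(Person("Alice", 34L), Person("Bob", 45L)).toDS()

    // 2. Reading a JSON file (hypothetical path) and mapping it to the case class
    val fromJson = spark.read.json("data/people.json").as[Person]

    // The Dataset is strongly typed: the compiler knows each record is a Person
    fromObjects.filter(_.age > 30).show()
    fromJson.printSchema()

    spark.stop()
  }
}
```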
Disadvantages of Datasets
The main disadvantage of Datasets is that querying them sometimes still requires referring to fields by name as plain strings, and the values returned then have to be cast back to the required data type.
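A brief sketch of what this looks like in practice, reusing the illustrative Person class from above: selecting a column by its string name drops the static type, and the result has to be cast back explicitly.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object DatasetStringCast {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ds-cast").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 34L), Person("Bob", 45L)).toDS()

    // The column is referred to by its name as a string, which the compiler
    // cannot check; the untyped result is then cast back to a typed Dataset
    val ages = people.select($"age").as[Long]
    println(ages.collect().mkString(", "))

    spark.stop()
  }
}
```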