Easy tutorial on Spark SQL and DataFrames
In this tutorial, you will learn how to load a DataFrame and perform basic operations on it, using both the DataFrame API and SQL.
I’m using Google Colab to run the code.
First, we need to download the required tools:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar -xvf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark
Then, tell your environment where to find Java and Spark by setting the JAVA_HOME and SPARK_HOME environment variables:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"
Now, start a Spark session:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
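If everything is configured correctly, the session object comes up without errors. As a quick sanity check (not part of the original setup, just a way to verify the session works):

# Print the Spark version and run a trivial job to confirm the session is alive
print(spark.version)   # should print 3.1.1 for this download
spark.range(5).show()  # a small built-in DataFrame with the numbers 0-4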
Manipulating DataFrames
First, we need to import the data file. There are different ways to import a file (from Google Drive, from a URL, or from your local hard disk). files.upload() uploads a file from your hard disk; this is not recommended for very large files. The example is based on the MovieLens ratings.csv file.
from google.colab import files
files.upload()
data = spark.read.format("csv")\
    .option("delimiter", ",")\
    .option("header", True)\
    .load("ratings.csv")
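Note that without schema inference Spark loads every CSV column as a string. If you prefer the rating column to be numeric right away, you can add the inferSchema option; this is an optional variation on the load above:

# Optional: let Spark infer column types so "rating" becomes a double
data = spark.read.format("csv")\
    .option("delimiter", ",")\
    .option("header", True)\
    .option("inferSchema", True)\
    .load("ratings.csv")
data.printSchema()  # verify the inferred types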
To take a glance at the data, we use the show() method; we can also print the DataFrame's dimensions (number of rows, number of columns):

data.show(5)
print((data.count(), len(data.columns)))
We can also select specific columns. For instance, let's display the first five rows of movieId and rating:
data.select("movieId", "rating").show(5)
You can also filter the DataFrame based on some condition. Say, we want to choose movies with ratings lower than 3.0. To do this, run the following:
data.filter(data['rating'] < 3.0).show(3)
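Conditions can also be combined with & (and) and | (or); each condition must be wrapped in parentheses. A small sketch (the 4.5 threshold is just an arbitrary example):

from pyspark.sql.functions import col

# Movies rated below 3.0 or above 4.5 (arbitrary thresholds for illustration)
data.filter((col('rating') < 3.0) | (col('rating') > 4.5)).show(3)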
Another useful operation to perform on a DataFrame is grouping by some field. Let’s group our DataFrame by rating, check counts for each rating, and finally order the resulting counts by rating.
data.groupBy(data['rating']).count().orderBy('rating').show()
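groupBy() also works with other aggregate functions via agg(). As a sketch, here is how you could compute the average rating and the number of ratings per movie (column names as in the MovieLens file):

from pyspark.sql.functions import avg, count

# Average rating and number of ratings per movie, highest average first
data.groupBy('movieId')\
    .agg(avg('rating').alias('avg_rating'), count('rating').alias('n_ratings'))\
    .orderBy('avg_rating', ascending=False)\
    .show(5)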
You can also calculate basic statistics for a DataFrame using the describe() method. It includes min, max, mean, standard deviation, and count for numeric columns. You may specify columns or calculate overall statistics for the DataFrame. In our case, only the rating column is suitable for statistics calculation.
data.describe("rating").show()
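If you also want quartiles, the summary() method (available since Spark 2.3) extends describe() with percentiles:

# summary() adds the 25%, 50%, and 75% percentiles to the describe() statistics
data.select('rating').summary().show()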
Using the SQL API
Now, we will use SQL to query the data. To begin with, we need to register the DataFrame as a temporary view with the following command:
data.createOrReplaceTempView("ratings")

Let's apply the same filter as before and select only the movies with ratings lower than 3, this time using SQL:
spark.sql("select * from ratings where rating < 3").show(3)
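Any of the earlier DataFrame operations can be expressed in SQL against the same view. For instance, the grouping-by-rating example from above becomes:

# Same result as the groupBy()/count()/orderBy() chain above
spark.sql("""
    SELECT rating, COUNT(*) AS count
    FROM ratings
    GROUP BY rating
    ORDER BY rating
""").show()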
You can download the code of this tutorial from this link.