Selecting Columns in Spark (Scala & Python)

Wafiq Syed
Apr 4, 2020

Apache Spark offers several ways to select a column. For this tutorial, assume a DataFrame has already been read into a variable named df.

Here are several ways to select a column called “ColumnName” from df.

Scala Spark

// Scala
import org.apache.spark.sql.functions.{expr, col, column}
// Needed for the 'ColumnName and $"ColumnName" forms; assumes a SparkSession named spark
import spark.implicits._

// 6 ways to select a column
df.select(df.col("ColumnName"))
df.select(col("ColumnName"))
df.select(column("ColumnName"))
df.select('ColumnName)
df.select($"ColumnName")
df.select(expr("ColumnName"))

PySpark


# Python
from pyspark.sql.functions import expr, col, column
# 4 ways to select a column
df.select(df.ColumnName)
df.select(col("ColumnName"))
df.select(column("ColumnName"))
df.select(expr("ColumnName"))

expr Allows for Manipulation

Unlike col and column, which take only a column name, expr parses a SQL expression string, so you can pass a column manipulation. For example, to list the column under a different heading, here’s how we’d do it.

// Scala and Python
df.select(expr("ColumnName AS customName"))

selectExpr

Spark offers a powerful shorthand for this: selectExpr. This method saves you from having to write “expr” every time you want to pass an expression.

// Scala and Python
df.selectExpr("*", "ColumnName AS customName")

Note that “*” means all columns.
