Selecting Columns in Spark (Scala & Python)
Apr 4, 2020
Apache Spark offers several methods to use when selecting a column. For this tutorial, assume a DataFrame has already been read as df.
Here are several ways to select a column called “ColumnName” from df.
Scala Spark
// Scala
import org.apache.spark.sql.functions.{expr, col, column}
import spark.implicits._ // enables the 'ColumnName and $"ColumnName" syntax

// 6 ways to select a column
df.select(df.col("ColumnName"))
df.select(col("ColumnName"))
df.select(column("ColumnName"))
df.select('ColumnName)
df.select($"ColumnName")
df.select(expr("ColumnName"))
PySpark
# Python
from pyspark.sql.functions import expr, col, column

# 4 ways to select a column
df.select(df.ColumnName)
df.select(col("ColumnName"))
df.select(column("ColumnName"))
df.select(expr("ColumnName"))
expr Allows for Manipulation
The function expr differs from col and column in that it parses a full SQL expression, not just a column name. For example, if we wanted to display the column under a different heading, here's how we'd do it.
// Scala and Python
df.select(expr("ColumnName AS customName"))
selectExpr
Spark offers a short form that brings great power — selectExpr. This method saves you from having to write “expr” every time you want to pass an expression.
// Scala and Python
df.selectExpr("*", "ColumnName AS customName")
Note, “*” means all columns.