How to Use Google Cloud Storage with PySpark
This is not a guide on how to set up Spark or create a bucket on Google Cloud Platform. The documentation for setting up the Cloud Storage connector is lacking, so I decided to create this quick guide to accessing your Google Storage files with PySpark.
Go to the Google Cloud Storage connector page and download the connector version that matches your Spark-Hadoop build. In my case, I was using spark-2.4.6-bin-hadoop2.7, so I downloaded the Cloud Storage connector for Hadoop 2.x.
Once the .jar file is downloaded, just put it into C:\%path/to/your/spark%\spark\spark-2.4.6-bin-hadoop2.7\jars.
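Alternatively, instead of copying the jar by hand, you can point Spark at it when building the session. A minimal sketch, assuming you saved the connector jar to a local path (the filename here is a placeholder; use the version you actually downloaded):

```python
from pyspark.sql import SparkSession

# Hypothetical jar location -- replace with the path where you saved
# the Cloud Storage connector you downloaded.
spark = (
    SparkSession.builder
    .appName("gcs-example")
    .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar")
    .getOrCreate()
)
```

Either approach works; copying into the jars folder makes the connector available to every Spark application, while spark.jars scopes it to this one session.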
Next, go to IAM & Admin > Service Accounts in the Google Cloud Console. Create a service account, then create and download a .p12 keyfile.
Finally, you need to set a few spark.conf properties so that authentication works:
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("google.cloud.auth.service.account.email", "Your_service_email")
spark.conf.set("google.cloud.auth.service.account.keyfile", "path/to/your/keyfile.p12")
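If these settings don't seem to take effect, a common alternative is to set the same properties on the underlying Hadoop configuration instead. A sketch, assuming `spark` is an existing SparkSession (note that `_jsc` is an internal handle, widely used in guides but not a public API):

```python
# Same property names as above, set on the Hadoop configuration
# that the Cloud Storage connector actually reads.
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.email", "Your_service_email")
conf.set("google.cloud.auth.service.account.keyfile", "path/to/your/keyfile.p12")
```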
In some cases, you may also need to set an environment variable, like this:
# Windows
set GOOGLE_APPLICATION_CREDENTIALS="path/to/your/keyfile.p12"

# Linux
export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/keyfile.p12"
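If you launch Spark from a Python script or notebook, you can set the same variable from Python before creating the session (the keyfile path here is a placeholder):

```python
import os

# Placeholder path -- point this at your real .p12 keyfile.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your/keyfile.p12"

print(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
```

Set this before the SparkSession is created, since the connector reads credentials at startup.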
Then you can access the file in your bucket using the read function.
df = spark.read.option("header", True).csv("gs://bucket_name/path_to_your_file.csv")
df.show()
Hope you enjoyed this quick guide. Have fun on your coding journey.
If you are interested in my newest data science projects, check out my SBTN’s Player Comparison Tool and SBTN’s NBA K-Mean Cluster Analysis.