How to Use Google Cloud Storage with PySpark
This is not a guide on how to set up Spark or create a bucket on Google Cloud Platform. The documentation for setting up the Cloud Storage connector is lacking, so I decided to create this quick guide to accessing your Google Storage files with PySpark.
Go to the Google Cloud Storage connector page and download the connector version that matches your Spark-Hadoop build. In my case, I was using spark-2.4.6-bin-hadoop2.7, so I downloaded the Cloud Storage connector for Hadoop 2.x.
Once the .jar file is downloaded, just put it into C:\%path/to/your/spark%\spark\spark-2.4.6-bin-hadoop2.7\jars.
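Alternatively, instead of copying the jar by hand, you can point Spark at it when building the session. A minimal sketch, assuming you saved the connector jar to a local path (the filename here is a placeholder; use the version you actually downloaded):

```python
from pyspark.sql import SparkSession

# Hypothetical jar location -- replace with the path where you saved
# the Cloud Storage connector you downloaded.
spark = (
    SparkSession.builder
    .appName("gcs-example")
    .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar")
    .getOrCreate()
)
```

Either approach works; copying into the jars folder makes the connector available to every Spark application, while spark.jars scopes it to this one session.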
Next, go to IAM & Admin > Service Accounts in the Google Cloud Console. Create a service account, then create and download a .p12 keyfile.
Finally, you need to set a few spark.conf properties so that authentication works:
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("google.cloud.auth.service.account.email", "Your_service_email")
spark.conf.set("google.cloud.auth.service.account.keyfile", "path/to/your/keyfile.p12")
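If these settings don't seem to take effect, a common alternative is to set the same properties on the underlying Hadoop configuration instead. A sketch, assuming `spark` is an existing SparkSession (note that `_jsc` is an internal handle, widely used in guides but not a public API):

```python
# Same property names as above, set on the Hadoop configuration
# that the Cloud Storage connector actually reads.
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.email", "Your_service_email")
conf.set("google.cloud.auth.service.account.keyfile", "path/to/your/keyfile.p12")
```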
In some cases, you may also need to set an environment variable, like this:
# Windows
set GOOGLE_APPLICATION_CREDENTIALS="path/to/your/keyfile.p12"

# Linux
export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/keyfile.p12"
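If you launch Spark from a Python script or notebook, you can set the same variable from Python before creating the session (the keyfile path here is a placeholder):

```python
import os

# Placeholder path -- point this at your real .p12 keyfile.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your/keyfile.p12"

print(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
```

Set this before the SparkSession is created, since the connector reads credentials at startup.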
Then you can access the file in your bucket using the read function.
df = spark.read.option("header", True).csv("gs://bucket_name/path_to_your_file.csv")
df.show()
Hope you enjoyed this quick guide. Have fun on your coding journey.
If you are interested in my newest data science projects, check out my SBTN’s Player Comparison Tool and SBTN’s NBA K-Mean Cluster Analysis.