This is not a guide on how to set up Spark or create a bucket on Google Cloud Platform. The documentation for setting up the Cloud Storage connector is lacking, so I decided to create this quick guide to accessing your Google Cloud Storage files with PySpark.
Go to the Google Cloud Storage connector link and download the connector version that matches your Spark-Hadoop version. In my case, I was using spark-2.4.6-bin-hadoop2.7, so I downloaded the Cloud Storage connector for Hadoop 2.x.
Once the .jar file is downloaded, just put the .jar file into …
Next thing, go to…
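For reference, here is a minimal sketch of what the Spark side of such a setup can look like. It is not necessarily the method used in the full guide: it points Spark at the connector through the spark.jars setting, and it assumes authentication with a service account JSON key file, which the text above does not mention. Both file paths and the helper names gcs_spark_conf and build_session are hypothetical placeholders. The property names are the ones I know from the Hadoop 2.x connector, so check them against the connector version you downloaded.

```python
# Minimal sketch: Spark settings for reading gs:// paths through the
# Cloud Storage connector. All paths are hypothetical placeholders.


def gcs_spark_conf(connector_jar, keyfile):
    """Return Spark config entries that enable the GCS connector.

    Assumes service account key file authentication (not from the original text).
    """
    return {
        # Location of the downloaded connector .jar (placeholder path).
        "spark.jars": connector_jar,
        # Register the connector as the handler for the gs:// scheme.
        "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        # Authenticate with a service account JSON key.
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile": keyfile,
    }


def build_session(conf, app_name="gcs-example"):
    """Create a SparkSession with the given settings (requires pyspark)."""
    # Imported lazily so gcs_spark_conf can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = gcs_spark_conf(
    connector_jar="/path/to/gcs-connector-hadoop2-latest.jar",
    keyfile="/path/to/service-account-key.json",
)
```

With PySpark installed and real paths filled in, calling build_session(conf) should return a session that can read a path such as gs://your-bucket/your-file.csv with spark.read.csv, where the bucket and file names are again placeholders.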
In the previous tutorial, we created an advanced data extraction pipeline in Airflow and discussed the different types of data engineering frameworks. If you are interested in Google Cloud Platform with Airflow, you can check out the first and second posts of this blog series.
In this tutorial, we will build a visualization with the Twitter data we harvested in the previous blog post. I will show you how to create an app using the Python visualization libraries Plotly and Dash, all within the Flask framework. Visuals will include frequency bar charts and word cloud plots. …
In Part I, we learned how to set up Airflow with Google Cloud Platform (GCP) using Docker. We then applied the standard operator and sensor concepts to Google Cloud Storage (GCS), followed by a file clean-up procedure.
In Part II of this 4-part blog series, we will go over how to set up a Twitter scraper in Airflow and store the data in GCS, then automatically load it into BigQuery for further analysis. I will apply what is known as the “data engineering framework” to our Airflow tweet pipeline, which will dynamically generate different instantiations of Twitter Airflow DAGs…
Data engineering is the foundation of every data scientist’s toolbox. After all, before we can produce any meaningful analysis that adds business value, data must be obtained, cleaned, and shaped. Thus, a good data scientist should know enough about data engineering to understand their role and evaluate their contribution to the needs of the company.
Despite the critical nature of data engineering, only a fraction of the educational programs appropriately emphasize this topic on an enterprise level. The lack of emphasis on data engineering in online education leaves students at a disadvantage when they begin their quest for a…
Data Engineer at Disney, who previously worked at ByteDance.