This blog is co-posted on Medium.
Google Data Engineering in Berlin
It was on a rainy Monday morning (29/01) at 5.00 am that my colleague Simon Picard and I left Belgium for a four-day training given by Google in Berlin (a course entitled “Data Engineering on Google Cloud Platform”). Our end objective is to become, in the coming weeks, “Google Certified Professional Data Engineers”, and this training is 100% oriented towards achieving that goal.
We were welcomed at the Google offices in Berlin. As expected, the offices are in the best neighborhood of Berlin and very well designed, with every facility you can imagine. Training started at 9.00 am sharp, and our instructor was a Spanish woman from Google London. She is the real Spanish cliché, and her English accent couldn’t be more Latin. Most importantly, she proved to be an excellent instructor over these four days.
The concepts taught on this first day were mainly an overview of the main modules used for data engineering: Dataproc, Dataflow, BigQuery, Cloud Storage, Pub/Sub and the Machine Learning Engine (MLE). After this short overview, we started to dive deeper into Dataproc by launching a cluster with one master and two workers. Pretty basic stuff, but it showed us how easy it is to do parallel computing on GCP. After this we moved on to BigQuery, where the basics were introduced to us.
We started the second day by going more in depth into BigQuery. We queried a huge dataset (flight departures from US airports, comprising billions of rows) and could see the power of this data warehouse. Querying a big dataset is extremely fast, and mastering BigQuery is mainly a question of mastering SQL.
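To give an idea of the kind of query we ran, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for BigQuery (which likewise speaks standard SQL); the table name, columns and data below are made up for illustration, not the actual flight dataset's schema.

```python
import sqlite3

# Toy stand-in for the US flight-departures dataset. In BigQuery this same
# aggregate runs over billions of rows in seconds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, departure_delay REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("JFK", 12.0), ("JFK", 3.0), ("LAX", 25.0), ("LAX", 5.0), ("ORD", 0.0)],
)

# Average departure delay per origin airport, worst first.
rows = conn.execute(
    """
    SELECT origin, AVG(departure_delay) AS avg_delay
    FROM flights
    GROUP BY origin
    ORDER BY avg_delay DESC
    """
).fetchall()
print(rows)  # [('LAX', 15.0), ('JFK', 7.5), ('ORD', 0.0)]
```

The point is that once your data sits in the warehouse, "analytics" is mostly a matter of writing the right SQL.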
After lunch, we started the heart of the training: Dataflow. It is in some ways an improved Dataproc, where your cluster is managed automatically and efficiently by Google. Dataflow is a processing module that takes your data from a source and sends it to a sink. The source is most of the time Pub/Sub, which receives raw data from many places (databases, IoT sensors, etc.). There can be multiple sinks, but an optimal configuration would have BigQuery as the sink, in order to do analytics on the data.
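The source → processing → sink idea can be sketched in plain Python. To be clear, this is not the actual Dataflow (Apache Beam) API — just a conceptual illustration with made-up stand-ins for Pub/Sub, the transform, and BigQuery.

```python
import json

def source(messages):
    """Stand-in for Pub/Sub: yields raw JSON messages one by one."""
    for msg in messages:
        yield json.loads(msg)

def process(events):
    """Stand-in for a Dataflow transform: filter and enrich each event."""
    for event in events:
        if event.get("temperature") is not None:  # drop incomplete readings
            event["temperature_f"] = event["temperature"] * 9 / 5 + 32
            yield event

def sink(events):
    """Stand-in for BigQuery: collect rows ready for analytics."""
    return list(events)

raw = ['{"sensor": "a", "temperature": 20}',
       '{"sensor": "b", "temperature": null}']
rows = sink(process(source(raw)))
print(rows)  # [{'sensor': 'a', 'temperature': 20, 'temperature_f': 68.0}]
```

In the real thing, Dataflow runs transforms like `process` in parallel across a managed cluster, so the same mental model scales from two messages to a firehose of sensor data.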
We finished the day by running a Dataflow job in both Java and Python. Indeed, Dataflow can be programmed in both languages, and after a discussion with the instructor, it seems Google wants to push more towards Python.
The third day was a continuation of Dataflow, and we then dived into machine learning. We started by connecting our Dataflow model to a Pub/Sub module, with data from Pub/Sub fed into Dataflow as a stream. The data would then be processed and transferred to the Machine Learning Engine (MLE) module, where different models would be run.
The dataset used contains taxi journey information for New York (covering a period of two years), and the goal is to predict the average fare on a given day. The dataset contains tens of millions of rows, and we were introduced to the concept of feature engineering to enrich it (part of the preprocessing done through Dataflow). By the end of the day we were launching neural network models (many of them, as we were testing different hyper-parameters) on the MLE module. These models would run for twelve hours, and by the next morning we would get the results.
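To make "feature engineering" concrete, here is a small sketch of the kind of enrichment we applied to each trip record. The real preprocessing ran inside Dataflow; the field names and the helper below are assumptions for illustration, not the actual dataset schema.

```python
from datetime import datetime
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between pickup and dropoff, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def engineer_features(trip):
    """Turn one raw taxi-trip record into features a model can use."""
    pickup = datetime.fromisoformat(trip["pickup_datetime"])
    return {
        "hour_of_day": pickup.hour,       # fares vary by time of day...
        "day_of_week": pickup.weekday(),  # ...and weekday vs weekend
        "distance_km": haversine_km(
            trip["pickup_lat"], trip["pickup_lon"],
            trip["dropoff_lat"], trip["dropoff_lon"],
        ),
        "passenger_count": trip["passenger_count"],
    }

trip = {  # hypothetical record: Midtown Manhattan to JFK
    "pickup_datetime": "2016-05-03T19:30:00",
    "pickup_lat": 40.7614, "pickup_lon": -73.9776,
    "dropoff_lat": 40.6413, "dropoff_lon": -73.7781,
    "passenger_count": 2,
}
features = engineer_features(trip)
print(features)
```

None of these derived columns exist in the raw data, yet they are exactly what lets a model relate a trip to its likely fare.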
Our brains were starting to heat up a bit too much after three days of intense training, and we therefore decided that some outdoor activity would be a good idea. We had heard about the “Berlin Midnight Runners” and joined their 10 km run at 7.00 pm. It was freezing and pouring, but this physical suffering was exactly what we needed. We ended up tired but refreshed.
Then came our last day. We woke up with some lumbago but were really motivated to tackle it.
Our models had finished running, and we were able to select the one that had performed best on the sample (indeed, only a sample of the whole dataset was used for model selection). With the hyper-parameters retrieved from this best model, we started training it on the whole dataset. The results were pretty good and a significant improvement over our baseline, a regression model. TensorFlow was a clear winner, and we could only admire the neat integration of Dataflow with Cloud MLE. It enables anyone with basic data science knowledge to tackle advanced machine learning projects.
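The selection step itself is simple once the trials have run. Here is a minimal sketch: the trial results and hyper-parameter names below are invented for illustration — on Cloud MLE each entry would correspond to a full overnight training job.

```python
# Hypothetical results of hyper-parameter trials run on the data sample.
# Each trial trains the same architecture with different settings and
# reports its error on held-out validation data.
trials = [
    {"hidden_units": [64, 32],  "learning_rate": 0.1,  "val_rmse": 9.8},
    {"hidden_units": [128, 64], "learning_rate": 0.01, "val_rmse": 7.2},
    {"hidden_units": [64, 32],  "learning_rate": 0.01, "val_rmse": 8.1},
]

# Pick the trial with the lowest validation error, then (in the real
# workflow) retrain on the full dataset with those hyper-parameters.
best = min(trials, key=lambda t: t["val_rmse"])
print(best)  # {'hidden_units': [128, 64], 'learning_rate': 0.01, 'val_rmse': 7.2}
```

Running the search on a sample keeps the twelve-hour jobs affordable; only the winning configuration earns a pass over the whole dataset.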
After checking and interpreting our results, the rest of the day was spent on classes about streaming in Dataflow and Pub/Sub. We had now covered all the Google modules needed to manage a data science project from end to end:
- Ingesting phase
- Processing phase
- Storing phase
- Analyzing phase
In conclusion, this was a great training from a professional point of view, but also from a personal one. Meeting other international people who share your passion, and spending a few days in a great European city with a colleague, made this training a great memory. So a big thank you to my employer “Fourcast”, and I will bring you this certificate in a few weeks ;)…
– Charles Verleyen