
Data Engineering on Google Cloud Platform

This four-day course provides participants with a hands-on introduction to designing and building data processing systems on Google Cloud Platform.

Through a combination of presentations, demonstrations and hands-on work, participants will learn how to design data processing systems, build end-to-end data pipelines and extract information from data analysis. The course covers structured, unstructured and streaming data.

  • Duration: 4 days
  • Format: Face-to-face or distance learning
  • Prerequisites: Completion of the course Google Cloud Fundamentals: Big Data & Machine Learning, or equivalent experience. Basic skills with a common query language such as SQL; experience with data modelling and ETL; application development in a common programming language such as Python; knowledge of machine learning and/or statistics.
  • Audience: Cloud architects, data architects, data engineers
  • Price: Please contact us
  • More information in our training catalogue

The course in detail

 

Module 1: Introduction to data engineering

  • Explore the role of a data engineer.
  • Analyse data engineering challenges.
  • Get an introduction to BigQuery.
  • Explore data lakes and data warehouses.
  • Demo: Learn more about federated queries with BigQuery.
  • Examine transactional databases vs. data warehouses.
  • Demo: Search for personal data in your dataset with the DLP API.
  • Work effectively with other data teams.
  • Manage data access and governance.
  • Build production-ready pipelines.
  • Examine a case study of a GCP client.
  • Lab: Explore data analysis with BigQuery.

 

Module 2: Building a data lake

  • Get an introduction to data lakes.
  • Learn about data storage and ETL options on GCP.
  • Build a data lake using Cloud Storage.
  • Demo: Optimise costs with Google Cloud Storage classes and functions.
  • Secure Cloud Storage.
  • Store all data types.
  • Demo: Run federated queries on Parquet and ORC files in BigQuery.
  • Learn about Cloud SQL as a relational data lake.

 

Module 3: Building a data warehouse

  • Explore the concept of the modern data warehouse.
  • Get an introduction to BigQuery.
  • Demo: Query TB+ data in seconds.
  • Start loading data.
  • Demo: Query Cloud SQL from BigQuery.
  • Lab: Load data with the console and CLI.
  • Explore the schemas.
  • Get to grips with schema design.
  • Demo: Explore BigQuery public datasets with SQL using INFORMATION_SCHEMA.

 

Module 3: Building a data warehouse (continued)

  • Focus on nested and repeated fields in BigQuery.
  • Lab: Explore tables and structures.
  • Optimise with partitioning and clustering.
  • Demo: Partition and cluster tables in BigQuery.
  • Overview: Explore batch and continuous data transformation.
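
Nested and repeated fields are easiest to picture outside SQL first. The sketch below (illustrative data, stdlib only) models a BigQuery row whose "address" column is a STRUCT and whose "orders" column is an ARRAY of STRUCTs, then flattens the repeated field the way `UNNEST` does:

```python
import json

# A denormalised BigQuery row as a Python dict: "address" plays the role
# of a nested STRUCT, "orders" a repeated ARRAY of STRUCTs. Keeping the
# orders inside the customer row avoids a join at query time.
customer = {
    "customer_id": 42,
    "address": {"city": "London", "postcode": "EC1A"},   # nested field
    "orders": [                                          # repeated field
        {"order_id": "a1", "amount": 9.99},
        {"order_id": "a2", "amount": 24.50},
    ],
}

# UNNEST(orders) in BigQuery flattens the repeated field; the Python
# equivalent is one output row per element of the array.
flat = [
    {"customer_id": customer["customer_id"], **order}
    for order in customer["orders"]
]

total = sum(row["amount"] for row in flat)
print(json.dumps(flat, indent=2))
print(f"total spent: {total}")
```

The two flattened rows are what a `SELECT customer_id, o.* FROM t, UNNEST(orders) AS o` query would return for this record.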

 

Module 4: Introduction to building batch data pipelines

  • Define EL, ELT and ETL.
  • Examine quality considerations.
  • Learn how to perform operations in BigQuery.
  • Demo: Use ELT to improve data quality in BigQuery.
  • Identify the shortcomings of ELT.
  • Use ETL to solve data quality problems.
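
The ETL/ELT distinction in this module comes down to where the transform runs. This stdlib sketch (assumed example data) expresses the same cleanup both ways: transform-then-load (ETL) versus load-raw-then-derive (ELT, as you would with SQL inside BigQuery):

```python
raw = [" Alice ", "BOB", "", " carol"]   # messy source records

def clean(name: str) -> str:
    """Normalise a name: trim whitespace, title-case."""
    return name.strip().title()

# ETL: transform before loading, so only clean rows reach the warehouse.
etl_table = [clean(n) for n in raw if n.strip()]

# ELT: load everything as-is, then derive a clean view in the warehouse
# (in BigQuery this second step would be a SQL view or query).
elt_raw_table = list(raw)
elt_clean_view = [clean(n) for n in elt_raw_table if n.strip()]

print(etl_table)          # the two approaches yield the same clean rows
print(elt_clean_view)
```

ELT keeps the raw rows available for later re-processing; ETL never stores them, which matters when the transform itself has bugs to fix.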

 

Module 5: Running Spark on Cloud Dataproc

  • Explore the Hadoop ecosystem.
  • Run Hadoop on Cloud Dataproc, using Cloud Storage (GCS) instead of HDFS.
  • Optimise Dataproc.
  • Lab: Run Apache Spark jobs on Cloud Dataproc.

 

Module 6: Serverless data processing with Cloud Dataflow

  • Get an introduction to Cloud Dataflow.
  • Explore why customers like Dataflow.
  • Look at dataflow pipelines.
  • Lab: Explore a simple Dataflow pipeline (Python/Java).
  • Lab: Explore MapReduce in a Dataflow (Python/Java).
  • Lab: Explore side inputs (Python/Java).
  • Discover Dataflow templates.
  • Learn more about Dataflow SQL.
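
The MapReduce lab in this module is easier to follow once the pattern is clear. The real labs use Apache Beam; this stdlib sketch (illustrative data and names) shows the same map, shuffle, and reduce stages, with a small in-memory "side input" available to every mapper:

```python
from collections import defaultdict

lines = ["the cat sat", "the cat ran", "a dog sat"]
stopwords = {"the", "a"}   # side input: small static data shared by all mappers

# Map: emit a (word, 1) pair for every non-stopword token.
mapped = [(w, 1) for line in lines for w in line.split() if w not in stopwords]

# Shuffle: group emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)   # e.g. {'cat': 2, 'sat': 2, 'ran': 1, 'dog': 1}
```

In Beam the same three stages appear as a `Map`/`FlatMap`, a `GroupByKey`, and a combining transform, with side inputs passed to the mapping step.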

 

Module 7: Managing data pipelines with Cloud Data Fusion and Cloud Composer

  • Learn about visual creation of batch data pipelines with Cloud Data Fusion: components, user interface overview, building a pipeline and data exploration using Wrangler.
  • Lab: Build and run a pipeline graph in Cloud Data Fusion.
  • Orchestrate work between GCP services with Cloud Composer.
  • Explore the Apache Airflow Environment: DAG and operators, workflow scheduling.
  • Demo: Load event-driven data with Cloud Composer, Cloud Functions, Cloud Storage and BigQuery.
  • Lab: Get an introduction to Cloud Composer.
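
An Airflow DAG, at its core, is a set of tasks plus dependencies, and Composer's job is to run them in a valid order on a schedule. This stdlib sketch (hypothetical task names, not Airflow code) shows that core idea by resolving a dependency graph and executing tasks topologically:

```python
from graphlib import TopologicalSorter

# Each key is a task; its set lists the tasks it depends on.
# In Airflow this is what `task_a >> task_b` edges express.
dag = {
    "extract_gcs": set(),
    "load_to_bq":  {"extract_gcs"},
    "transform":   {"load_to_bq"},
    "notify":      {"transform"},
}

# Resolve a valid execution order, then "run" each task in turn.
order = list(TopologicalSorter(dag).static_order())
for task in order:
    print(f"running {task}")
```

Airflow operators (BashOperator, BigQueryOperator, and so on) are the task bodies; the scheduling logic sketched here is what the DAG structure buys you.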

 

Module 8: Introduction to streaming data processing

  • Explore streaming data processing.

 

Module 9: Serverless messaging with Cloud Pub/Sub

  • Explore Cloud Pub/Sub.
  • Lab: Publish continuous data in Pub/Sub.
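
The key Pub/Sub semantics the lab relies on is fan-out: every subscription receives its own copy of each published message, so publishers never need to know who is listening. A stdlib sketch of that behaviour (topic and subscription names are illustrative, not the real client API):

```python
from collections import defaultdict, deque

class Topic:
    """In-memory sketch of a Pub/Sub topic with per-subscription queues."""

    def __init__(self):
        self.subscriptions = defaultdict(deque)

    def subscribe(self, name):
        self.subscriptions[name]          # create the subscription's queue

    def publish(self, message):
        for queue in self.subscriptions.values():
            queue.append(message)         # fan-out: one copy per subscription

    def pull(self, name):
        return self.subscriptions[name].popleft()

taxi_rides = Topic()
taxi_rides.subscribe("billing")
taxi_rides.subscribe("dashboard")
taxi_rides.publish({"ride_id": 1, "fare": 12.5})

billing_msg = taxi_rides.pull("billing")
dashboard_msg = taxi_rides.pull("dashboard")
print(billing_msg, dashboard_msg)   # both subscriptions saw the message
```

The real service adds durability, at-least-once delivery, and acknowledgements on top of this decoupling.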

 

Module 10: Cloud Dataflow streaming features

  • Learn more about Cloud Dataflow streaming features.
  • Lab: Explore continuous data pipelines.
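
Streaming aggregation only makes sense over windows, and the simplest kind Dataflow offers is the fixed (tumbling) window. This stdlib sketch (toy timestamps) assigns each event to a 30-second window by its timestamp and counts per window:

```python
from collections import defaultdict

# (event_timestamp_seconds, count) pairs from an unbounded source.
events = [(2, 1), (15, 1), (27, 1), (33, 1)]
window_size = 30   # fixed windows of 30 seconds

windows = defaultdict(int)
for ts, count in events:
    window_start = (ts // window_size) * window_size   # window assignment
    windows[window_start] += count

print(dict(windows))   # {0: 3, 30: 1}
```

Dataflow's windowing adds the hard parts this sketch omits: watermarks to decide when a window is complete, and triggers for late data.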

 

Module 11: BigQuery and Bigtable high-speed streaming features

  • Identify BigQuery streaming features.
  • Lab: Learn about continuous analysis and dashboards.
  • Explore Cloud Bigtable.
  • Lab: Explore continuous data pipelines to Bigtable.

 

Module 12: Advanced BigQuery features and performance

  • Identify analytic window functions.
  • Use WITH clauses.
  • Explore GIS functions.
  • Demo: Map the fastest growing postal codes with BigQuery GeoViz.
  • Identify performance considerations.
  • Lab: Optimise your BigQuery queries for performance.
  • Lab: Create date-partitioned tables in BigQuery.
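
WITH clauses and analytic window functions are part of standard SQL, so their shape can be sketched locally with Python's built-in sqlite3 (SQLite 3.25+ supports window functions; the table and data here are illustrative, and BigQuery's dialect differs in details):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rides (driver TEXT, fare REAL);
    INSERT INTO rides VALUES ('a', 10), ('a', 20), ('b', 5);
""")

# A WITH clause names an intermediate result; RANK() OVER (...) is an
# analytic window function computed across rows without collapsing them.
rows = conn.execute("""
    WITH totals AS (
        SELECT driver, SUM(fare) AS total FROM rides GROUP BY driver
    )
    SELECT driver, total,
           RANK() OVER (ORDER BY total DESC) AS rnk
    FROM totals
    ORDER BY rnk
""").fetchall()

print(rows)   # [('a', 30.0, 1), ('b', 5.0, 2)]
```

In BigQuery the same pattern applies unchanged, with the performance module's advice (partitioning, clustering, avoiding SELECT *) determining how much data each stage scans.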

 

Module 13: Introduction to Analytics and AI

  • Explore the question ‘what is AI?’
  • Move from ad hoc data analysis to data-driven decisions.
  • Explore options for ML models on GCP.

 

Module 14: Predefined ML API model for unstructured data

  • Identify the difficulties of using unstructured data.
  • Explore ML API for data enrichment.
  • Lab: Use the Natural Language API to classify unstructured text.

 

Module 15: Big data analytics with Cloud AI Platform notebooks

  • Answer the question: ‘what is a notebook?’
  • Learn about BigQuery Magic and links to Pandas.
  • Lab: Explore BigQuery in JupyterLab on AI Platform.

 

Module 16: ML production pipelines with Kubeflow

  • Explore ways of doing ML on GCP.
  • Explore Kubeflow and AI Hub.
  • Lab: Use AI models on Kubeflow.

 

Module 17: Building custom models with SQL in BigQuery ML

  • Utilise BigQuery ML for fast model building.
  • Demo: Train a model with BigQuery ML to predict taxi fares in New York.
  • Explore supported models.
  • Lab: Predict the duration of a bike ride with a regression model in BigQuery ML.
  • Lab: Get movie recommendations in BigQuery ML.
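
BigQuery ML's `CREATE MODEL ... OPTIONS(model_type='linear_reg')` fits a linear model directly in SQL. To make clear what that statement computes, here is a stdlib sketch (toy data, one feature) of ordinary least squares by hand, the closed form a one-feature linear regression reduces to:

```python
xs = [1.0, 2.0, 3.0, 4.0]   # feature, e.g. ride distance
ys = [2.1, 3.9, 6.1, 8.0]   # label, e.g. ride duration

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept for y = intercept + slope * x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    return intercept + slope * x

print(f"duration = {intercept:.2f} + {slope:.2f} * distance")
```

BigQuery ML generalises this to many features and other model types, with training, evaluation (`ML.EVALUATE`) and prediction (`ML.PREDICT`) all expressed as SQL over tables.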

 

Module 18: Building custom models with Cloud AutoML

  • Answer the question: ‘Why AutoML?’
  • Explore AutoML Vision, AutoML NLP and AutoML Tables.

Contact us

Any questions? Or are you interested in our other Google Cloud services? Our experts would be happy to help!