- Duration: 4 days
- Format: Face-to-face or distance learning
- Prerequisites: Completion of the course Google Cloud Fundamentals: Big Data & Machine Learning or equivalent experience; basic skills with a common query language such as SQL; experience with data modelling and ETL; application development using a common programming language such as Python; and knowledge of machine learning and/or statistics.
- Audience: Cloud architects, data architects, data engineers
- Price: Please contact us
- More information in our training catalogue
The course in detail
Module 1: Introduction to data engineering
- Explore the role of a data engineer.
- Analyse data engineering challenges.
- Get an introduction to BigQuery.
- Explore data lakes and data warehouses.
- Demo: Learn more about federated queries with BigQuery.
- Examine transactional databases vs. data warehouses.
- Demo: Search for personal data in your dataset with the DLP API.
- Work effectively with other data teams.
- Manage data access and governance.
- Build production-ready pipelines.
- Examine a case study of a GCP client.
- Lab: Explore data analysis with BigQuery.
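As a flavour of the data-analysis lab above, here is a minimal sketch that runs an analytical query from Python with the google-cloud-bigquery client. The public dataset shown is only an illustration, and the client picks up the project from your default credentials.

```python
# Minimal sketch: run an analytical query against a BigQuery public dataset.
# Assumes google-cloud-bigquery is installed and default credentials are set
# (e.g. `gcloud auth application-default login`).
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project from your credentials

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.name, row.total)
```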
Module 2: Building a Data Lake
- Get an introduction to data lakes.
- Learn about data storage and ETL options on GCP.
- Build a data lake using Cloud Storage.
- Demo: Optimise costs with Google Cloud Storage classes and Cloud Functions.
- Secure Cloud Storage.
- Store all data types.
- Demo: Run federated queries on Parquet and ORC files in BigQuery.
- Learn about Cloud SQL as a relational data lake.
Module 3: Building a data warehouse
- Explore the concept of the modern data warehouse.
- Get an introduction to BigQuery.
- Demo: Query TB+ data in seconds.
- Start loading data.
- Demo: Query Cloud SQL from BigQuery.
- Lab: Load data with the console and CLI.
- Explore the schemas.
- Get to grips with schema design.
- Demo: Explore BigQuery public datasets with SQL using INFORMATION_SCHEMA.
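A minimal sketch of the INFORMATION_SCHEMA demo above, assuming the google-cloud-bigquery client; the public dataset is just an example (it lives in the US multi-region).

```python
# Minimal sketch: explore a public dataset's metadata via INFORMATION_SCHEMA.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT table_name, table_type
    FROM `bigquery-public-data.samples.INFORMATION_SCHEMA.TABLES`
    ORDER BY table_name
"""

for row in client.query(query).result():
    print(row.table_name, row.table_type)
```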
Module 3: Building a data warehouse (continued)
- Focus on nested and repeated fields in BigQuery.
- Lab: Explore tables and structures.
- Optimise with partitioning and clustering.
- Demo: Partition and cluster tables in BigQuery; a sketch of the DDL follows this module's list.
- Overview: Explore batch and streaming data transformation.
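As referenced in the partitioning and clustering demo, a minimal sketch of the DDL involved; the dataset, table and column names are hypothetical.

```python
# Minimal sketch: create a date-partitioned, clustered table with BigQuery DDL.
# `my_dataset`, `orders_partitioned` and the column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
    CREATE OR REPLACE TABLE my_dataset.orders_partitioned
    PARTITION BY DATE(order_ts)     -- prune scans by date
    CLUSTER BY customer_id          -- co-locate rows for common filters
    AS
    SELECT * FROM my_dataset.orders_raw
"""

client.query(ddl).result()  # wait for the DDL job to finish
```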
Module 4: Introduction to building batch data pipelines
- Define EL, ELT and ETL.
- Examine quality considerations.
- Learn how to perform operations in BigQuery.
- Demo: Use ELT to improve data quality in BigQuery; a sketch of this pattern follows this module's list.
- Examine the shortcomings of ELT.
- Use ETL to solve data quality problems.
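A minimal sketch of the ELT pattern referenced above: quality fixes expressed as SQL that runs directly in BigQuery. The table and column names are hypothetical.

```python
# Minimal sketch: ELT-style data quality clean-up done directly in BigQuery.
# Table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

transform = """
    CREATE OR REPLACE TABLE my_dataset.sales_clean AS
    SELECT DISTINCT                                  -- remove exact duplicates
        order_id,
        SAFE_CAST(amount AS NUMERIC) AS amount,      -- bad values become NULL instead of failing
        DATE(order_ts) AS order_date
    FROM my_dataset.sales_raw
    WHERE order_id IS NOT NULL                       -- drop rows missing the key
"""

client.query(transform).result()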
Module 5: Running Spark on Cloud Dataproc
- Explore the Hadoop ecosystem.
- Run Hadoop on Cloud Dataproc.
- Use Cloud Storage (GCS) instead of HDFS.
- Optimise Dataproc.
- Lab: Run Apache Spark jobs on Cloud Dataproc.
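For the Spark-on-Dataproc lab, a minimal PySpark sketch of the kind of job you would submit to a Dataproc cluster, reading from Cloud Storage rather than HDFS; the bucket path is hypothetical.

```python
# Minimal PySpark sketch: a word count that reads from Cloud Storage (gs://)
# instead of HDFS. Submit it with `gcloud dataproc jobs submit pyspark ...`.
# The bucket and path are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("gs://my-bucket/input/*.txt")

counts = (lines
          .select(explode(split(col("value"), r"\s+")).alias("word"))
          .groupBy("word")
          .count()
          .orderBy(col("count").desc()))

counts.show(20)
spark.stop()
```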
Module 6: Serverless data processing with Cloud Dataflow
- Get an introduction to Cloud Dataflow.
- Explore why customers like Dataflow.
- Look at Dataflow pipelines.
- Lab: Explore a simple Dataflow pipeline (Python/Java); a Python sketch follows this module's list.
- Lab: Explore MapReduce in a Dataflow (Python/Java).
- Lab: Explore side inputs (Python/Java).
- Discover Dataflow templates.
- Learn more about Dataflow SQL.
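The Python sketch referenced in the pipeline labs above: a minimal Apache Beam word count that runs locally on the DirectRunner, and that can be handed to the Dataflow runner unchanged (given project, region and staging options).

```python
# Minimal Apache Beam sketch: a word count that runs locally on the DirectRunner.
# Point the pipeline at the DataflowRunner (plus project/region/temp options)
# to execute the same code as a managed Dataflow job.
import apache_beam as beam

with beam.Pipeline() as pipeline:  # DirectRunner by default
    (pipeline
     | "Create"  >> beam.Create(["to be or not to be", "that is the question"])
     | "Split"   >> beam.FlatMap(lambda line: line.split())
     | "PairOne" >> beam.Map(lambda word: (word, 1))
     | "Count"   >> beam.CombinePerKey(sum)
     | "Print"   >> beam.Map(print))
```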
Module 7: Managing data pipelines with Cloud Data Fusion and Cloud Composer
- Learn about visual creation of batch data pipelines with Cloud Data Fusion: components, user interface overview, building a pipeline and data exploration using Wrangler.
- Lab: Build and run a pipeline graph in Cloud Data Fusion.
- Orchestrate work between GCP services with Cloud Composer.
- Explore the Apache Airflow environment: DAGs and operators, workflow scheduling; a minimal DAG sketch follows this list.
- Demo: Load event-driven data with Cloud Composer, Cloud Functions, Cloud Storage and BigQuery.
- Lab: Get an introduction to Cloud Composer.
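The minimal DAG sketch referenced above, assuming Airflow 2.x import paths as used by recent Cloud Composer environments; the DAG id and command are placeholders.

```python
# Minimal Airflow sketch: a one-task DAG of the kind Cloud Composer schedules.
# Uses Airflow 2.x import paths; the DAG id and command are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_daily_export",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'hello from Composer'",
    )
```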
Module 8: Introduction to streaming data processing
- Explore streaming data processing.
Module 9: Serverless messaging with Cloud Pub/Sub
- Explore Cloud Pub/Sub.
- Lab: Publish streaming data into Pub/Sub.
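A minimal sketch of publishing to Pub/Sub from Python with the google-cloud-pubsub client; the project and topic IDs are hypothetical.

```python
# Minimal sketch: publish messages to a Pub/Sub topic.
# The project and topic IDs are hypothetical.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

for i in range(3):
    data = f"sensor reading {i}".encode("utf-8")               # payload must be bytes
    future = publisher.publish(topic_path, data, origin="demo")  # extra kwargs become attributes
    print("published message id:", future.result())
```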
Module 10: Cloud Dataflow streaming features
- Learn more about Cloud Dataflow streaming features.
- Lab: Explore streaming data pipelines.
Module 11: BigQuery and Bigtable high-speed streaming features
- Identify BigQuery streaming features.
- Lab: Learn about streaming analytics and dashboards.
- Explore Cloud Bigtable.
- Lab: Explore streaming data pipelines into Bigtable.
Module 12: Advanced BigQuery features and performance
- Identify analytic window functions.
- Use WITH clauses; a sketch combining the two follows this module's list.
- Explore GIS functions.
- Demo: Map the fastest growing postal codes with BigQuery GeoViz.
- Identify performance considerations.
- Lab: Optimise your BigQuery queries for performance.
- Lab: Create date-partitioned tables in BigQuery.
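The sketch referenced above, combining a WITH clause with an analytic window function; the public dataset is used purely as an example.

```python
# Minimal sketch: a WITH clause feeding an analytic (window) function.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    WITH yearly AS (
        SELECT year, name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_current`
        GROUP BY year, name
    ),
    ranked AS (
        SELECT year, name, total,
               RANK() OVER (PARTITION BY year ORDER BY total DESC) AS rank_in_year
        FROM yearly
    )
    SELECT year, rank_in_year, name, total
    FROM ranked
    WHERE rank_in_year <= 3
    ORDER BY year, rank_in_year
"""

for row in client.query(query).result():
    print(row.year, row.rank_in_year, row.name, row.total)
```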
Module 13: Introduction to Analytics and AI
- Explore the question ‘what is AI?’
- Move from ad hoc data analysis to data-driven decisions.
- Explore options for ML models on GCP.
Module 14: Predefined ML API model for unstructured data
- Identify the difficulties of using unstructured data.
- Explore ML API for data enrichment.
- Lab: Use the Natural Language API to classify unstructured text.
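A minimal sketch of the text-classification lab, assuming the google-cloud-language (v2) client; the sample text is only an illustration.

```python
# Minimal sketch: classify a piece of unstructured text with the Natural Language API.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

text = ("Cloud data warehouses let analysts query terabytes of data "
        "with standard SQL and pay only for the bytes they scan.")

document = language_v1.Document(
    content=text,
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

response = client.classify_text(document=document)
for category in response.categories:
    print(category.name, round(category.confidence, 2))
```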
Module 15: Big data analytics with Cloud AI Platform notebooks
- Answer the question: ‘what is a notebook?’
- Learn about BigQuery Magic and links to Pandas.
- Lab: Explore BigQuery in JupyterLab on AI Platform.
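A minimal sketch of the BigQuery-to-pandas link covered in this module. Inside a notebook the %%bigquery cell magic gives the same result; the client API is used here so the example also works as a plain script.

```python
# Minimal sketch: pull BigQuery results straight into a pandas DataFrame.
# In a notebook, the `%%bigquery df` cell magic does the same thing.
from google.cloud import bigquery

client = bigquery.Client()

df = client.query("""
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
""").to_dataframe()   # requires pandas to be installed in the notebook environment

print(df.head())
```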
Module 16: ML production pipelines with Kubeflow
- Explore ways of doing ML on GCP.
- Explore Kubeflow and AI Hub.
- Lab: Use AI models on Kubeflow.
Module 17: Creating custom models with SQL in BigQuery ML
- Utilise BigQuery ML for fast model building; a sketch of this pattern follows this module's list.
- Demo: Train a model with BigQuery ML to predict taxi fares in New York.
- Explore supported models.
- Lab: Predict the duration of a bike ride with a regression model in BigQuery ML.
- Lab: Get movie recommendations in BigQuery ML.
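The BigQuery ML sketch referenced above: training and applying a linear regression with nothing but SQL. The dataset, table and column names are hypothetical.

```python
# Minimal sketch: train and use a regression model entirely in BigQuery ML.
# Dataset, table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

train = """
    CREATE OR REPLACE MODEL my_dataset.trip_duration_model
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['duration_minutes']) AS
    SELECT start_station, hour_of_day, is_weekend, duration_minutes
    FROM my_dataset.trips_training
"""
client.query(train).result()

predict = """
    SELECT *
    FROM ML.PREDICT(MODEL my_dataset.trip_duration_model,
                    (SELECT start_station, hour_of_day, is_weekend
                     FROM my_dataset.trips_to_score))
"""
for row in client.query(predict).result():
    print(dict(row.items()))
```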
Module 18: Creating custom models with Cloud AutoML
- Answer the question: ‘Why AutoML?’
- Explore AutoML Vision, AutoML NLP and AutoML Tables.