The goal of this section is to introduce all relevant services for doing Machine Learning on Google Cloud. How these services relate to the different parts of MLOps will be discussed in section 5.
3.1 Cloud Storage
Cloud Storage is a highly available blob storage service on GCP. It allows engineers to store any form of structured or unstructured data. This makes it the perfect candidate for storing any raw data a project might encounter.
Buckets can be configured to be multi-region, dual-region or single-region. The best configuration depends on the use case: compliance, pricing and durability requirements can all affect this decision. In general, it is a good idea to keep data stored near where it will be used.
Blobs are stored in a specific storage class. The storage class determines latency, availability (together with location) and cost. The following storage classes are available (a short bucket-creation sketch follows the list):
- Standard: for frequently accessed data or data that only needs to be stored for a short amount of time
- Nearline: for data that is accessed less than once a month, for example backups
- Coldline: for data accessed once every quarter or less. A common use case is disaster recovery
- Archive: for data accessed less than once a year
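As a minimal sketch using the google-cloud-storage Python client (bucket, project and location names are hypothetical), a bucket can be created with a specific location and storage class as follows:

```python
# Minimal sketch using the google-cloud-storage client.
# Bucket, project and location names are hypothetical placeholders.
from google.cloud import storage

client = storage.Client(project="my-project")

# Configure the bucket before creation: location and storage class
# together drive latency, availability and cost, as described above.
bucket = client.bucket("my-raw-data-bucket")
bucket.storage_class = "NEARLINE"  # e.g. for monthly-accessed backups
new_bucket = client.create_bucket(bucket, location="europe-west1")
print(new_bucket.storage_class, new_bucket.location)
```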
Cloud Storage can be managed either via the API or through its own CLI called gsutil. This allows for flexible and automated ingestion of data from a variety of sources. It is a common service for creating a “landing zone” on GCP, where data is first ingested into a data lake before being cleaned and processed.
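As a hedged sketch of such an ingestion step (bucket and object names are hypothetical, and authentication is assumed to be configured), a local file can be landed in a bucket with the Python client like this:

```python
# Hedged sketch: ingesting a local file into a "landing zone" bucket.
# Names are placeholders; credentials are assumed to be set up.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-landing-zone")

# Blobs are addressed by their object name; "folders" are just prefixes.
blob = bucket.blob("raw/2023-01-01/events.csv")
blob.upload_from_filename("events.csv")
```

The gsutil equivalent is a single command such as gsutil cp events.csv gs://my-landing-zone/raw/2023-01-01/.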
3.2 BigQuery
BigQuery is one of Google’s flagship services. It is a Dremel-based SQL database optimized for running analytical workloads, with support for semi-structured data. The word analytical is important here: BigQuery was designed as a columnar database with high query performance for aggregations and similar operations. It was NOT designed to be an operational database for an application and does not perform well when indexing or mutating specific rows.
However, BigQuery is a very powerful tool for ML Engineers, Data Engineers and Data Scientists. It can dramatically speed up data exploration on large volumes of data. It has native integrations with Cloud Storage and many other GCP services. This makes it relatively easy to transform raw data that landed on Cloud Storage into a more structured format, either as a data warehousing effort or as feature engineering.
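To illustrate these integrations, here is a hedged sketch using the google-cloud-bigquery client (project, dataset, table and bucket names are hypothetical) that loads raw CSV data from Cloud Storage into BigQuery and runs an analytical query over it:

```python
# Sketch using the google-cloud-bigquery client; all resource names
# are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Load raw CSV data that landed on Cloud Storage into a BigQuery table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://my-landing-zone/raw/2023-01-01/events.csv",
    "my-project.analytics.events",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

# Run an analytical (aggregation-heavy, columnar) query.
query = """
    SELECT user_id, COUNT(*) AS n_events
    FROM `my-project.analytics.events`
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.user_id, row.n_events)
```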
Another benefit is the performance for dashboarding workloads. BigQuery BI Engine can serve as a cache, lowering query latency. Google’s dashboarding tools (Looker and Looker Studio) natively integrate with BigQuery BI Engine, leading to a smooth end user experience.
BigQuery also has a built-in ML tool called BigQuery ML, which allows engineers to easily train and use machine learning models through SQL. This is a powerful tool for rapid prototyping that takes advantage of BigQuery’s distributed compute engine.
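As a sketch of what this looks like (dataset, model and column names are hypothetical; the SQL is submitted here through the Python client, but it could equally be run in the BigQuery console), a model can be trained and used for batch prediction in plain SQL:

```python
# Hedged sketch of BigQuery ML: training and using a model purely
# through SQL. Table, dataset, model and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model on data already in BigQuery.
client.query("""
    CREATE OR REPLACE MODEL `my-project.analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT age_group, n_events, churned
    FROM `my-project.analytics.training_data`
""").result()

# Batch prediction with the trained model, again in plain SQL.
rows = client.query("""
    SELECT user_id, predicted_churned
    FROM ML.PREDICT(
        MODEL `my-project.analytics.churn_model`,
        (SELECT user_id, age_group, n_events
         FROM `my-project.analytics.current_users`))
""").result()
```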
All these features make BigQuery a key service on any data project.
3.3 Vertex AI
Vertex AI is the service that encompasses all the different AI services in Google Cloud. These services can be divided into four main categories:
- Managed APIs: these are ML APIs that provide machine learning as a service. Examples include the Vision API and the Video Intelligence API. They can be used out of the box, without any need for your own data, to solve a variety of use cases (see the sketch after this list).
- AutoML: AutoML services allow users with little experience to build machine learning models by providing their own data and letting Google handle the training aspect. AutoML comes in various forms depending on the use case, for example AutoML Vision and AutoML Natural Language.
- ML Platform Services: these services allow users to create and manage their own machine learning models. They include Vertex AI Training, which allows you to run any training job on Google’s compute services.
- MLOps Services: these services aim to support tasks such as model lifecycle management, monitoring and automated deployment. They include Vertex AI Pipelines, the Model Registry, the Feature Store and more.
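As an example of the first category, the following sketch (with a hypothetical image URI) calls the Vision API to label an image out of the box, without any training data of your own:

```python
# Sketch of a managed API: the Vision API labels an image without any
# model training on your side. The image URI is a placeholder.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image(source=vision.ImageSource(image_uri="gs://my-bucket/cat.jpg"))

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, label.score)
```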
The aim of this section is not to provide an exhaustive list of all services, but rather to provide a brief explanation of all the services that are relevant in most implementations of ML projects.
3.3.1 Vertex AI Training
Vertex AI Training is the backbone of most ML projects on GCP. It allows engineers to submit training jobs, which are Docker containers that execute specific training code. Because this setup is so generic, it can be used to run almost any code (similar to Cloud Run, for example), but it is optimized for machine learning workloads.
Training jobs can use any container in the form of custom training. However, Google also provides pre-built containers for training TensorFlow, PyTorch, scikit-learn and XGBoost models. These containers have all the necessary dependencies installed for their specific workloads and are guaranteed to work with Vertex AI Training.
Vertex AI Training allows you to choose which compute resources you want to use for your training jobs including hardware accelerators (GPUs and TPUs). It will then communicate which resources are available by setting specific environment variables.
Virtual machines for training can also be configured to use private IP addresses for extra security. This allows tighter control over how the VMs can be accessed.
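A minimal sketch of submitting a custom training job with the Vertex AI Python SDK (project, bucket and container image names are hypothetical) could look as follows:

```python
# Hedged sketch of a custom training job on Vertex AI Training.
# Project, bucket, image and display names are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="europe-west1",
                staging_bucket="gs://my-staging-bucket")

# The container holds the training code; Vertex AI provisions the
# requested hardware and communicates it via environment variables.
job = aiplatform.CustomContainerTrainingJob(
    display_name="churn-training",
    container_uri="europe-west1-docker.pkg.dev/my-project/ml/trainer:latest",
)
job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    # network="projects/.../global/networks/my-vpc",  # for private IP setups
)
```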
3.3.2 Vertex AI Pipelines
Most of the time, it is complex to deploy models and handle every phase of your machine learning system, and lots of questions around the cost and scaling of your model can arise. Provisioning and maintaining your computing resources is an enormous task in itself. One of the best ways to handle this and minimize the time to deployment is to use serverless pipelines that connect each step of your solution.
Vertex AI Pipelines does exactly this by playing the role of an orchestrator which schedules the different pipeline steps. All pipeline steps (including the ones that perform tasks such as data processing) are run as training jobs. You can then easily automate your workflow and monitor your artifacts with the integrated Vertex ML Metadata Store. Vertex AI Pipelines serves as a runner for two open source MLOps tools, namely Kubeflow and TensorFlow Extended. As Kubeflow is the more generic and popular of the two, the terminology used in this document matches that of Kubeflow.
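As a minimal sketch in Kubeflow Pipelines (KFP v2) terminology, where the component logic and all names are hypothetical, a pipeline can be defined, compiled and submitted to Vertex AI Pipelines like this:

```python
# Hedged sketch of a tiny KFP v2 pipeline run on Vertex AI Pipelines.
# Component logic and all names are hypothetical.
from kfp import compiler, dsl
from google.cloud import aiplatform

@dsl.component
def prepare_data(message: str) -> str:
    # In a real pipeline this step would pull and clean data; on
    # Vertex AI it runs as a (training) job like any other step.
    return message.upper()

@dsl.component
def train_model(data: str) -> str:
    return f"model trained on {data}"

@dsl.pipeline(name="demo-pipeline")
def demo_pipeline():
    data = prepare_data(message="raw data")
    train_model(data=data.output)

# Compile the pipeline definition, then submit it as a pipeline job.
compiler.Compiler().compile(demo_pipeline, "demo_pipeline.json")

aiplatform.init(project="my-project", location="europe-west1")
aiplatform.PipelineJob(
    display_name="demo-pipeline",
    template_path="demo_pipeline.json",
).run()
```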
3.3.3 Vertex AI Metadata Store
The metadata store in Vertex AI stores information on artifacts that were generated throughout a pipeline. What information is stored depends on the type of artifact and on what the engineer decided to store. It allows users to trace back how specific artifacts were generated.
As an example, let’s say a model was created by a specific pipeline run. That pipeline run pulled in some data, modified it, did some basic quality analysis and then trained a model. The metadata store will enable engineers to trace back all the way from the trained model to the exact state of the source data at training time. This enables them to investigate why the model is behaving a specific way; there could have been outliers in the training data, for example.
An important clarification is that the metadata store only stores metadata about the artifacts. It does not store the artifacts themselves. The type of artifact determines where it is best stored: this can be a Vertex AI Dataset, a model in the Model Registry, data on Cloud Storage, etc.
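As a hedged sketch of how this works in practice with a KFP component (names and metadata keys are hypothetical), a pipeline step can write its artifact to Cloud Storage while recording metadata that ends up in the metadata store:

```python
# Hedged sketch: attaching metadata to a pipeline artifact. When run on
# Vertex AI Pipelines, this metadata is recorded in the metadata store,
# while the artifact itself lives on Cloud Storage.
from kfp import dsl
from kfp.dsl import Dataset, Output

@dsl.component
def profile_data(dataset: Output[Dataset]):
    # Write the actual artifact to its Cloud Storage-backed location ...
    with open(dataset.path, "w") as f:
        f.write("col_a,col_b\n1,2\n")
    # ... and record metadata about it for later lineage tracing.
    dataset.metadata["row_count"] = 1
    dataset.metadata["source"] = "gs://my-landing-zone/raw/"
```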
3.3.4 Vertex AI Model Registry
As mentioned in section 3.7, an important component is the model registry: this is where models are stored and versioned. Vertex AI has a dedicated service for this, which allows users to store different versions of specific models in the registry.
This allows engineers to easily roll back to previous versions and manage their rollout strategy.
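A sketch of registering a new model version with the Vertex AI SDK (assuming a recent SDK version; all resource names and URIs are hypothetical) could look like this:

```python
# Hedged sketch of registering a model version in the Model Registry.
# Uploading with the same parent model creates a new version of it
# instead of a new model. All names and URIs are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="europe-west1")

model = aiplatform.Model.upload(
    display_name="churn-model",
    artifact_uri="gs://my-bucket/models/churn/v2/",
    serving_container_image_uri=(
        "europe-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
    parent_model="projects/my-project/locations/europe-west1/models/123",
    is_default_version=True,
)
print(model.version_id)
```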
3.3.5 Vertex AI Feature Store
Vertex AI Feature Store is a database capable of serving features in two modes: offline and online. This allows it to serve data for both training and online inference use cases, as the online serving latency is very low. Similarly, data can be ingested into the Feature Store through two mechanisms. Batch ingestion allows users to ingest data from BigQuery or from Avro or CSV files on Cloud Storage. Streaming ingestion allows entities to be ingested one by one through an API.
Data in the Feature Store is stored according to the following data model:
- Entity: an entity maps to a business concept like a user, a product, etc.
- Feature: a feature is one attribute of the entity such as the age group of a user
- Feature Value: the value of a specific feature for a specific entity at a specific point in time
Here it is important to understand that the Feature Store has an innate understanding of time. The value of a feature of an entity can change over time. For example, users will age, so their age group can change. Because the Feature Store links feature values to specific timestamps, it is easy to get the value of a feature at a specific point in time.
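As a hedged sketch of a low-latency online read (using the Featurestore classes of the Vertex AI SDK; featurestore, entity type and feature IDs are hypothetical):

```python
# Hedged sketch of an online read from a Vertex AI Feature Store.
# Featurestore, entity type and feature IDs are hypothetical.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="europe-west1")

featurestore = aiplatform.Featurestore(featurestore_name="my_featurestore")
users = featurestore.get_entity_type(entity_type_id="user")

# Online serving: fetch the latest feature values for specific
# entities (the timestamp dimension is handled by the service).
df = users.read(entity_ids=["user_123"], feature_ids=["age_group"])
print(df)
```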
3.3.6 Vertex AI Workbench
Vertex AI Workbench offers ML Engineers a managed JupyterLab environment where they can experiment and do data exploration. The main benefits are that it integrates very well with other GCP services and can leverage more powerful compute than would be available locally. On top of this, it can sync to GitHub for version control.
Vertex AI Workbench is powered by Google Compute Engine, which means it has the same cost and security benefits. Notebooks can be hosted with private IPs within a specific VPC subnet and instances can be started and stopped on demand.