
What is Google Cloud Data Fusion?

Cloud Data Fusion is a new, fully managed data engineering product within Google Cloud Platform. It helps users efficiently build and manage ETL/ELT data pipelines. It aims to shift the focus from code (where data engineers can spend days or weeks building connectors from a source to a sink) to insights and action. Built on top of the open-source project CDAP, it offers a convenient user interface for building data pipelines in a drag-and-drop manner.

Google Cloud Data Fusion at Google Next '19

Data Fusion is one of Google’s major data analytics announcements from Google Cloud Next ’19. It comes at a time when companies struggle to deal with huge amounts of data spread across many data sources, and to fuse them into a central data warehouse. The key challenges of integrating all this data are as follows:

Data Fusion Google Cloud 1

Data Fusion addresses these challenges by making it extremely easy to move data around, with two main focuses:

  1. Build data pipelines without writing any code: as Data Fusion is built on top of the open-source CDAP project, it already comes with more than 100 connectors, and that number keeps growing. Building a pipeline between a source and a sink therefore requires only a few clicks.
  2. Do transformations without writing any code: Data Fusion comes with a set of built-in transformations that you can seamlessly apply to your data.

The following screenshot shows the interface with a simple pipeline. The first step is the connector to the raw database, followed by a wrangling step that transforms a set of columns. Finally, the data is sent to two sinks: BigQuery for analytics and Cloud Storage for backup.

Data Fusion Google Cloud 2
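
To make the shape of the pipeline above concrete, here is a minimal Python sketch that models it as a list of stages and connections and works out the order in which the stages run. The stage names and the dictionary layout are assumptions chosen for illustration; this is not the format Data Fusion itself stores or exports.

```python
from collections import defaultdict, deque

# Hypothetical stage names and layout, for illustration only.
pipeline = {
    "stages": [
        {"name": "RawDatabase", "type": "source"},
        {"name": "Wrangler", "type": "transform"},
        {"name": "BigQuery", "type": "sink"},
        {"name": "CloudStorage", "type": "sink"},
    ],
    "connections": [
        {"from": "RawDatabase", "to": "Wrangler"},
        {"from": "Wrangler", "to": "BigQuery"},
        {"from": "Wrangler", "to": "CloudStorage"},
    ],
}


def execution_order(pipeline):
    """Return stage names in an order that respects the connections (Kahn's algorithm)."""
    names = [stage["name"] for stage in pipeline["stages"]]
    indegree = {name: 0 for name in names}
    downstream = defaultdict(list)
    for conn in pipeline["connections"]:
        downstream[conn["from"]].append(conn["to"])
        indegree[conn["to"]] += 1

    # Stages with no incoming connection can run first.
    ready = deque(name for name in names if indegree[name] == 0)
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for nxt in downstream[stage]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(names):
        raise ValueError("pipeline contains a cycle")
    return order


def sinks(pipeline):
    """Return the stages with no outgoing connection, i.e. where the data ends up."""
    has_output = {conn["from"] for conn in pipeline["connections"]}
    return [s["name"] for s in pipeline["stages"] if s["name"] not in has_output]
```

Calling `execution_order(pipeline)` lists the source first, then the wrangling step, then the two sinks, which mirrors the left-to-right flow you would draw in the user interface.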

Other relevant features of Data Fusion, as described by one of the early adopters:

  • Open-source: as mentioned above, it’s built on top of CDAP and therefore benefits from a big community that keeps developing new connectors.
  • Accessible: thanks to the user interface, Data Fusion does not require you to have any kind of coding background.
  • Metadata: search integrated datasets by technical and business metadata. Track lineage for all integrated datasets at the dataset and field level.
  • Flexible: if you can’t do something through the UI, Data Fusion is extensible and you can add your own code to it.
  • GCP-native: fully managed, GCP-native architecture unlocks the scalability, reliability, security and privacy guarantees of Google Cloud.
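
As an illustration of the "Flexible" point, custom logic in a pipeline usually boils down to a function that receives one record and decides what to pass downstream. The sketch below is a hypothetical example in plain Python: the `ListEmitter` class, the `transform` signature and the `email` field are assumptions made for this example, not Data Fusion's actual plugin API.

```python
class ListEmitter:
    """Hypothetical stand-in for whatever collects a step's output records."""

    def __init__(self):
        self.records = []

    def emit(self, record):
        self.records.append(record)


def transform(record, emitter):
    """Drop records without an email address and normalise the rest."""
    email = (record.get("email") or "").strip().lower()
    if not email:
        return  # nothing emitted: the record is filtered out
    cleaned = dict(record)  # copy so the input record is left untouched
    cleaned["email"] = email
    emitter.emit(cleaned)


# Usage: feed two records through the custom step.
emitter = ListEmitter()
transform({"id": 1, "email": " Ann@Example.COM "}, emitter)
transform({"id": 2, "email": None}, emitter)
```

After these two calls, `emitter.records` holds only the first record, with its email trimmed and lower-cased.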

Below is a list of business challenges where Data Fusion will excel:

Data Fusion Google Cloud 3

Data Fusion provides a fabric that lets users fuse many of the different technologies and products available on Google Cloud Platform in a much easier, more accessible, secure and efficient manner, as shown in the following chart:

Data Fusion Google Cloud 4

Data Fusion is the new backbone for data analytics, and we expect it to become a major game-changer for data engineering in the coming months. Our engineers at Fourcast are already familiar with this new GCP product and will be glad to give you a demo!


Want to know more about Data Fusion or other Google Cloud Platform solutions?
Visit our GCP page or just drop us a line!