Cloud Data Fusion by Google Cloud is the brand new, fully managed data engineering product within Google Cloud Platform. It helps users efficiently build and manage ETL/ELT data pipelines. It intends to shift the focus from code (where data engineers can spend days or weeks building connectors from a source to a sink) to insights and action. Built on top of the open-source project CDAP, it offers a convenient user interface for building data pipelines in a drag-and-drop manner.
Data Fusion is one of Google’s major new data analytics offerings, announced at Google Cloud Next ’19. It comes at a time when companies struggle to deal with huge amounts of data spread across many data sources, and to fuse them into a central data warehouse. The key challenges of integrating all this data are as follows:
Data Fusion addresses these challenges by making it extremely easy to move data around, with two main focuses:
- Build data pipelines without writing any code: as Data Fusion is built on top of the open-source CDAP project, it already comes with more than 100 connectors, and that number keeps growing. Building a pipeline between a source and a sink therefore requires only a few clicks.
- Apply transformations without writing any code: Data Fusion comes with a set of built-in transformations that you can seamlessly apply to your data.
The following screenshot shows the interface with a simple pipeline. The first step is the connector to the raw database, then there is a wrangling step that applies some transformations to a set of columns, and finally the data is sent to two sinks: BigQuery for analytics purposes and Cloud Storage for backup of the data.
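To make the shape of that pipeline concrete, here is a minimal Python sketch that models it as a small graph of stages and connections. The layout (a list of stages plus `from`/`to` connections) loosely mirrors how a pipeline can be described as data, but every stage name, plugin name and helper function below is a hypothetical illustration, not a verified Data Fusion configuration.

```python
# Illustrative model of the pipeline described above:
# one source -> one wrangling step -> two sinks.
# All stage and plugin names are assumptions made for this example.

def make_stage(name, plugin_type, plugin_name):
    """Build one stage entry (hypothetical helper)."""
    return {"name": name, "plugin": {"name": plugin_name, "type": plugin_type}}

stages = [
    make_stage("RawDatabase", "batchsource", "Database"),
    make_stage("Wrangler", "transform", "Wrangler"),
    make_stage("BigQueryAnalytics", "batchsink", "BigQuery"),
    make_stage("CloudStorageBackup", "batchsink", "CloudStorage"),
]

connections = [
    {"from": "RawDatabase", "to": "Wrangler"},
    {"from": "Wrangler", "to": "BigQueryAnalytics"},
    {"from": "Wrangler", "to": "CloudStorageBackup"},
]

pipeline = {"name": "raw_to_bigquery_and_gcs",
            "stages": stages,
            "connections": connections}

def validate(pipeline):
    """Check that every connection refers to a declared stage."""
    names = {stage["name"] for stage in pipeline["stages"]}
    for conn in pipeline["connections"]:
        if conn["from"] not in names or conn["to"] not in names:
            return False
    return True

def sinks_of(pipeline):
    """Return the stages that have no outgoing connection, in declaration order."""
    has_outgoing = {conn["from"] for conn in pipeline["connections"]}
    return [s["name"] for s in pipeline["stages"] if s["name"] not in has_outgoing]

def sources_of(pipeline):
    """Return the stages that have no incoming connection, in declaration order."""
    has_incoming = {conn["to"] for conn in pipeline["connections"]}
    return [s["name"] for s in pipeline["stages"] if s["name"] not in has_incoming]
```

Calling `sinks_of(pipeline)` on this sketch returns the two sink stages, which matches the fan-out in the description: the wrangled data goes to both BigQuery and Cloud Storage. In Data Fusion itself, this graph is drawn in the user interface rather than written by hand.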
Other relevant features of Data Fusion, as described by one of the early adopters, include:
- Open-source: as mentioned above, it’s built on top of CDAP and therefore benefits from a large community that keeps developing new connectors.
- Accessible: thanks to the user interface, Data Fusion does not require you to have any kind of coding background.
- Metadata: search integrated datasets by technical and business metadata. Track lineage for all integrated datasets at the dataset and field level.
- Flexible: if you can’t do something through the UI, Data Fusion is extensible and you can add your own code to it.
- GCP-native: fully managed, GCP-native architecture unlocks the scalability, reliability, security and privacy guarantees of Google Cloud.
Below is a list of business challenges where Data Fusion will excel:
Data Fusion provides a fabric that allows users to fuse many of the different technologies and products available on Google Cloud Platform in a much easier, more accessible, more secure and more efficient manner, as shown in the following chart:
Data Fusion is the new backbone for data analytics and, in the months to come, will become a major game-changer for data engineering. Our engineers at Fourcast are already familiar with this new GCP product and will be glad to give you a demo!