Sharing BigQuery Data
Sharing dashboards and tables with a large group of people is always a tricky operation. On the one hand, you want to group all relevant information together in one table, so people can easily use it. On the other hand, maybe not all data attributes should be visible to everyone in the group.
Until now you either made multiple BigQuery tables for every subgroup with the same level of access or you used BigQuery Column-level access control. This work is sometimes all you need but a major drawback is that you can not use the same dashboards for all users. Why? The dashboard either points to the wrong table or it expects to see columns that certain users are not allowed to see and then the dashboard throws errors.
The result? Your data and dashboards are replicated all over the place and proper data governance becomes a big undertaking.
Dynamic Data Masking
Luckily Google listened to its users and recently announced a very exciting new BigQuery functionality called “Dynamic Data Masking” to solve this issue.
This feature allows you to:
- tag BigQuery columns with policy tags.
- to define who can access this tagged column and in which format. The supported formats are ‘the actual value, the hashed value, a default value and nullify the column’.
This means everybody can use the same table or dashboards. Some users will just see more data than others. For example, some users will see the actual value of the email address of a customer instead of a hashed value. Or they can see the actual amount a customer spent instead of a default value of 0.
This way you protect yourself from an (accidental) data leak without compromising on maintainability.
How it fits together with Row Level Security
We will not deep-dive into the technical setup of Dynamic Data Masking, simply because Google already provided great getting-started documentation. We will however focus on how this ties together with Row Level Security because this is one of the cases where 1+1=3.
A short recap about Row Level Security (RLS): the goal of RLS is to limit the number of rows a person can see to what he or she is allowed to see. Let’s say you have a table that contains salary information. In that case, it makes sense that only the direct manager(s) can see the salary of a certain employee.
You can achieve row-level security in multiple ways in BigQuery (see this blog) but for the rest of this blog, we will stick with the Authorized Views approach. These views will apply filtering logic at query-time and are very flexible to use so chances are high you can also leverage their power.
What is an authorized view? An Authorized View is a view that in itself is authorized to query certain tables without requiring that the user is authorized to view the table. This means you can put filtering logic inside the view that limits the amount of data a user can see without granting the user access to the table where all the data lives.
When combined with BigQuery’s function ‘SESSION_USER()’ you can make this filtering behaviour dynamic based on who’s asking for data. You can then show a lot of data for someone with high-security clearance and little data for someone with low-security clearance.
Let’s walk you through the example in the diagram below to demonstrate the potential. In this example, we have 2 tables that contain respectively visitor information and revenue information of clients. The data security team made a security policy that people can only see data in their level of clearance and the hashed or default values of the security level above. They also identified visitor addresses as data for security level 2 and revenue as data for security level 3. Furthermore, people can only see data from their own teams.
This sounds like a lot of requirements but it is actually quite easy to implement:
- You can simply tag the columns with their security level
- make a dynamic data policy rule that specifies how people can see the data
- create an authorized view that will filter the rows of the table that people can not see.
Because the original tables are not shared with people directly, only via the authorized views, no one can cheat and all security rules are enforced directly to all requests, regardless of the platform that people use to access the data.
With this way of working you can ensure your security team that their policies are propagated to all data consumers regardless of the place where they consume data from, Looker Studio, Power BI, Vertex AI .. you name it.
So if you ever found your engineering team and security team fighting over their data-sharing approach, give them some candy in the form of Dynamic Data Masking. You will make everyone’s day!