top of page

Actualités d'Argusa

  • Argusa

Latest from Snowflake: Unlocking Compliance and Usage Transparency through Column Lineage Analysis

Dernière mise à jour : 21 déc. 2023

Introduction

Earlier this year Snowflake added a new functionality called Access History. This feature will facilitate regulatory compliance and auditing, and will provide insights on popular and frequently accessed tables and columns.

Today we will explore the Column Lineage capabilities of this feature using some real words scenarios. but before, we will describe in detail what is made available under this feature.


What exactly is Snowflake’s Access History ?

Access History refers to when the user query reads data and when the SQL statement performs a data write operation, such as INSERT, UPDATE, and DELETE along with variations of the COPY command, from the source data object to the target data object. The user access history can be found by querying the Account Usage ACCESS_HISTORY view. The records in this view facilitate regulatory compliance auditing and provide insights on popular and frequently accessed tables and columns, because there is a direct link between the user (i.e. query operator), the query, the table or view, the column, and the data.

Each row in the ACCESS_HISTORY view contains a single record per SQL statement. The record contains

  • information about the source columns the query accessed directly and indirectly (i.e. the underlying tables that the data for the query comes from), and

  • the projected columns the user sees in the query result. If the query does not access or project a column in the query result, the view record does not record this column because that column data is not made available to the user.

Note that the Access History is an Enterprise Edition feature and Account Usage views can be only queried by the ACCOUNTADMIN role by default. In the last part of this article we will explain how to give access to this to others roles/users.

The ACCESS_HISTORY view contains the following columns:



COLUMN NAME

DATA TYPE

DESCRIPTION

QUERY_ID

TEXT

An internal, system-generated identifier for the SQL statement. This value is also mentioned in the QUERY_HISTORY View.

QUERY_START_TIME

TIMESTAMP_LTZ

The statement start time (UTC time zone).

USER_NAME

TEXT

The user who issued the query

DIRECT_OBJECTS_ACCEDDED

ARRAY

A JSON array of data objects such as user-defined functions (i.e. UDFs and UDTFs), stored procedures, tables, views, and columns directly named in the query explicitly or through shortcuts such as using an asterisk (i.e. *).

BASE_OBJECTS_ACCESSED

ARRAY

A JSON array of all base data objects used to execute a query, including columns, external functions, UDFs, and stored procedures.

OBJECTS_MODIFIED

ARRAY

A JSON array that specifies the objects that were associated with a write operation in the query.

OBJECT_MODIFIED_BY_DDL

OBJECT

Specifies the DDL operation on a database, schema, table, view, and column. These operations also include statements that specify a row access policy on a table or view, a masking policy on a column, and tag updates (e.g. set a tag, change a tag value) on the object or column.

POLICIES_REFERENCED

ARRAY

Specifies information about the enforced masking policy set on the column and the enforced row access policy set on the table, including policies set on intermediate objects or columns.


In particular this view maps OBJECTS_MODIFIED and the BASE/DIRECT_OBJECTS_ACCESSED by a query, therefore contains the column lineage for that query.

To demonstrate how to extract the column lineage of query we will use two examples using data from a fictional company selling gifts online in Switzerland.


Scenario 1:


You are the analytics manager and you want to know the impact of specific business changes, for example the name of a data element change. How will this impact any downstream reporting?

There is a ORDERS_W_PRICES table which depends on a view (which depends on ORDERS table) and the PRODUCT table in the database. Here is the query definition for this table.


CREATE OR REPLACE TABLE ORDERS_W_PRICES AS 
SELECT 
    O.ORDER_ID,
    O.SALE_DATE,
    O.CLIENT_ID,
    O.PRODUCT_ID,
    O.QUANTITY,
    P.SALE_PRICE * O.QUANTITY AS TOTAL_PRICE
FROM ORDERS_AGG O
JOIN PRODUCTS P ON O.PRODUCT_ID = P.ID
    AND O.SALE_DATE >= P.START_DATE
    AND O.SALE_DATE <= P.END_DATE;

You want to track how a data element change in the table ORDERS will affect the table ORDERS_W_PRICES and its downstream reporting.

It is now time to recover the column lineage of this query, but first we need the Query ID. The Query ID can be obtained from the Query Details,



the query history page, or from the following command :


SELECT LAST_QUERY_ID();

Once found we assign the Query ID to a variable


SET QID = 'your Query ID';

and we can explore the content of the ACCESSS_HISTORY view filtered for this Query ID.


SELECT*FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY WHERE QUERY_ID = $QID;


Please note that there is a few minutes of latency for the information to get logged in to the ACCESS_HISTORY view.


In particular, the column lineage can be found in the OBJECTS_MODIFIED column. When we click on the column to expand to result we get a JSON array of the lineage:


[
  {
    "columns": [
      {
        "baseSources": [
          {
            "columnName": "ORDER_ID",
            "objectDomain": "Table",
            "objectId": 166914,
            "objectName": "DEMO_LINEAGE.PUBLIC.ORDERS"
          }
        ],
        "columnId": 218019,
        "columnName": "ORDER_ID",
        "directSources": [
          {
            "columnName": "ORDER_ID",
            "objectDomain": "View",
            "objectId": 215599,
            "objectName": "DEMO_LINEAGE.PUBLIC.ORDERS_AGG"
          }
        ]
      },
      
      .... (other columns were removed to reduce the size of this JSON array.)
      
    ],
    "objectDomain": "Table",
    "objectId": 214387,
    "objectName": "DEMO_LINEAGE.PUBLIC.ORDERS_W_PRICES"
  }
]

In this JSON array we have 4 keys:

  • objectName : the name of the modified object

  • objectId: the ID of the object

  • objectDomain: the type of object

  • columns: columns of the modified object. For this field another JSON array is nested into it and contains the following informations:

    • baseSources : base objects used to compute the column

    • directSources: direct objects used to compute the column


The difference between a base and a direct object becomes clear when we look at the column ORDER_ID that comes from ORDERS_AGG, the direct object that is a view. This view is using the table ORDERS, hence ORDERS is the base object for ORDER_ID in ORDERS_W_PRICES.

This JSON array can be flattened into a table for an easier analysis of the column lineage with the following query:


SELECT 
    om.value:"objectName" as target_object_name,
    columns_modified.value: "columnName" as target_column_name,
    directSources.value: "objectName" as direct_Source_object_name,
    directSources.value: "columnName" as direct_Source_column_name,
    baseSources.value: "objectName" as base_Source_object_name,
    baseSources.value: "columnName" as base_Source_column_name
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY t,
LATERAL FLATTEN(input => t.OBJECTS_MODIFIED) om,
LATERAL FLATTEN(input => om.value: "columns", outer => True) columns_modified,
LATERAL FLATTEN(
    input => columns_modified.value: "directSources",
    outer => true
  ) directSources,
LATERAL FLATTEN(
    input => columns_modified.value: "baseSources",
    outer => true
  ) baseSources
WHERE t.QUERY_ID = $QID;


The second last column above gives us the base table names in the ORDERS table, and we can clearly tract how a data element change in the table ORDERS will affect the table

ORDERS_W_PRICES and its downstream reporting.


Scenario 2:


You are part of the compliance team and you want to audit a specific table to check if the data stored is in line with policies and regulations. For instance, whether or not it contains personal and sensitive information.

Let’s modify our previous example by adding a table CLIENTS that we will join to the ORDERS_AGG view using the CLIENT_ID column, and use the ADRESS_CANTON column of CLIENTS to keep clients from the Geneva canton.


CREATE OR REPLACE TABLE ORDERS_W_PRICES AS 
SELECT 
    O.ORDER_ID,
    O.SALE_DATE,
    O.CLIENT_ID,
    O.PRODUCT_ID,
    O.QUANTITY,
    P.SALE_PRICE * O.QUANTITY AS TOTAL_PRICE
FROM ORDERS_AGG O
JOIN PRODUCTS P ON O.PRODUCT_ID = P.ID
    AND O.SALE_DATE >= P.START_DATE
    AND O.SALE_DATE <= P.END_DATE
JOIN CLIENTS C ON O.CLIENT_ID = C.CLIENT_ID
WHERE ADRESS_CANTON = 'GE';

If we rerun the previous query that presented the lineage in a tabular way we would obtain the exact same result. This is because we only used the column OBJECTS_MODIFIED from the ACCESS_HISTORY view, and the column ADRESS_CANTON is not used to compute any of the columns in ORDER_W_PRICES. This also applies to columns in the join, ID, START_DATE and END_DATE from PRODUCTS are also missing here.

We have to use the columns BASE/DIRECT_OBJECTS_ACCESSED to obtain the other objects accessed in the query. An example of a query using only the direct objects would be:


WITH OBJECT_MODIFIED AS (
    SELECT 
        t.QUERY_ID,
        om.value:"objectName" as target_object_name,
        columns_modified.value: "columnName" as target_column_name,
        directSources.value: "objectId" as direct_Source_object_id,
        directSources.value: "columnName" as direct_Source_column_name
    FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY t,
    LATERAL FLATTEN(input => t.OBJECTS_MODIFIED) om,
    LATERAL FLATTEN(input => om.value: "columns", outer => True) columns_modified,
    LATERAL FLATTEN(
        input => columns_modified.value: "directSources",
        outer => true
      ) directSources
),
DIRECT_OBJECTS AS (
    SELECT 
        t.QUERY_ID,
        om.value:"objectName" as target_object_name,
        directObjects.value : "objectName" as direct_Object_Name,
        directObjects.value : "objectId" as direct_Object_id,
        directColumns.value : "columnName" as direct_Object_column_Name
    FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY t,
    LATERAL FLATTEN(input => t.OBJECTS_MODIFIED) om,
    LATERAL FLATTEN(input => t.DIRECT_OBJECTS_ACCESSED) directObjects,
    LATERAL FLATTEN(input => directObjects.value: "columns", outer => True) directColumns
)
SELECT
    DO.target_object_name,
    OM.target_column_name,
    DO.direct_Object_Column_Name,
    DO.direct_Object_Name
FROM DIRECT_OBJECTS DO
LEFT JOIN OBJECT_MODIFIED OM ON DO.QUERY_ID = OM.QUERY_ID
    AND DO.direct_Object_id = OM.direct_Source_object_id
    AND DO.direct_Object_column_Name = OM.direct_Source_column_name
WHERE DO.QUERY_ID = $QID
ORDER BY OM.target_column_name;



And now the other columns also appear in the lineage.


If you are a compliance member and were worried about the possible inclusion of personal identifying information in the table ORDERS_W_PRICES after a filter using the CLIENTS table, you can relax as none of the CLIENTS columns are present.



Give access to the ACCESS_HISTORY to users without the ACCOUNTADMIN role


As mentioned previously the ACCESS_HISTORY can only be queried by users with the ACCOUNTADMIN role. The SELECT permission on this view cannot be granted at all. However you might need to give access to this new feature to users without the ACCOUNTADMIN role and below we explain how to do so.

To give access to this view to another role, someone with the ACCOUNTADMIN role can create another view on top of ACCESS_HISTORY and grant SELECT permission to the role in question.

Here is an example of such view and how the SELECT privilege on that view is given:


CREATE OR REPLACE VIEW DEMO_LINEAGE.ACCOUNT_USAGE.ACCESS_HISTORY AS
SELECT *
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY
WHERE USER_NAME IN (
    SELECT GRANTEE_NAME FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_USERS
    WHERE ROLE = CURRENT_ROLE()
        AND DELETED_ON IS NULL
);

GRANT SELECT ON VIEW DEMO_LINEAGE.ACCOUNT_USAGE.ACCESS_HISTORY TO LINEAGE_DEMO_ROLE;
GRANT USAGE ON SCHEMA DEMO_LINEAGE.ACCOUNT_USAGE TO LINEAGE_DEMO_ROLE;


In addition to the “SELECT *” of the ACCESS_HISTORY we added a Row Security Filter to make sure the users can see only their queries. A Row Access Policy could have been used instead of writing the security filter inside the view.


Closing words

This article presented how to get the column lineage in a Snowflake database, we hope you can implement that in your own environment and master your impact analyses and compliance routines.









bottom of page