A short introduction to Data Vault and how to automate it with WhereScape
Dernière mise à jour : 25 janv.
DataVault is by now a well-known term in data-related environments, yet very few have a sense of what it really is and what are its use cases. This article will give you an introduction to the main components of DataVault and more importantly, an overview of the situations for which DataVault is useful.
What is Data Vault
Data Vault is a data structure and beyond that, it is a methodology for managing data projects. In this article, we will not treat the methodology aspects and focus on the data structure features. For more information I recommend the book “Building a scalable warehouse with DataValut 2.0” by D. Linstedt and M. Olschimke.
To understand why a structure like Data Vault was invented, let us start from the principles from which it was constructed.
· Flexibility : Business processes change over time, and data models have to be flexible enough to be able to quickly incorporate new data. This is ever more true in the era of Agile development where businesses expect new implementation to be done in a few weeks.
In DataVault, data is stored in its raw form as much as possible, with no corrections to make it fit a predefined data model. And therefore, business rules to clean the data and make it conform to the appropriate structure are pushed as late as possible in the data processing flow, with the result that most rules are virtualised: namely applied on the fly without the need of storing the results in memory. This allows a very flexible structure that can easily adapt to changes and evolutions of the underling business and therefore data structure.
· Auditability : Data and its usage is increasingly more regulated. This makes auditability very important. DataVault is built with auditability in mind. Storing data in the raw format allows to be able to always go back to the original data. In addition, DataVault requires the addition of a variety of audit fields to the data, and these fields allow to be able to know when the data were loaded and from which systems.
· Performance : With ever increasing amounts of data to manage, and the ever persistent demand of fast analytics, performance is key. DataVault includes a range of techniques to boost the performance of the data warehouse, most notably an extensive use of hashing. This also becomes necessary as the flexibility offered by Data Vault comes with the requirement of many joins.
Data Vault architecture
Before going into more depth, let us spend a few words about the architecture of a DataVault warehouse: it is composed of 3 layers.
· Load and stage : The loading and staging layer does not contain historized data. This is a stage where only new data is loaded or a full copy of the current state of the input sources.
· Core layer : The core layer contains the facts and dimensions that are historized. In the case of DataVault the data is store as much as possible in raw form in the core layer, which is important for auditability.
· Presentation layer : This latter layer it is a virtualised layer exposing views of the underling core layer. In DataVault, the core layer is not supposed to be directly exposed to end-users.
In more traditional Kimball’s star models, only the first 2 layers are used. In such models the core layer is directly exposed to end users and therefore serves as presentation layer.
One critical difference between DataVault and star models is that in a Kimball model, business rules are already implemented in the staging layer, while in a DataVault model data is stored as much as possible in its raw form, and business rules are virtualised and pushed to the final presentation layer.
Let us also introduce hashing, which is widely used in DataVault. Hashing is the transformation of a string of characters into a unique fixed-length string of characters that represents the original string.
For example, a meaningful business key, the name of a restaurant or the ID of an employee, can be transformed into a random unique list of characters which is used to index the tables.
Hashing is used in databases to index and retrieve items in a database because it is faster to find the item using the shorter and standardised hashed key than to find it using the original value.
Main entities of a DataVault model
In this section we describe the components of the core of a DataVault model. To simplify the understanding we will use the example of a restauration business running several restaurants.
The core of a DataVault model is composed of 3 entities:
· Hubs: A hub is a unique list of primary keys which represent business objects.
For example, a hub could represent restaurants, another dishes and a third one employees.
A hub only contains the business keys and audit information and no other property of the corresponding business object. You can therefore imagine a hub as a list of restaurant or employee IDs with attached information about when each key first appeared in the data warehouse and from what source system. DataVault also recommends to add a “Last seen date” field which could be useful for audit purposes.
Finally, a hash key is generated for each business key which serves the purpose of standardisation and performance just like the integer surrogate keys in a Kimball model.
A hub for restaurants would look like the following table
· Links: Links represent relationships between two or more hubs. For example a link could represent a work contract between a restaurant and its employees or it could represent an item of a menu, namely the fact that a particular dish is served at a specific restaurant.
The link therefore contains the primary key of all hubs that it connects, as well as audit information and the hash key of the combination of hub keys.
Links are the essence of the flexibility that DataVault offers as it is easy to add new links between existing hubs, creating new ways to look at and combine the data without having to make radical modifications to the existing data warehouse.
A link table representing contracts, namely linking employees and restaurants, would look like the following table
· Satellites: Satellites are the entities that contain the properties that describe the different keys present in hubs and links. For example a satellite of the “employee” hub, could store the address the employee and its social security number. Satellites can be also attached to links, we can for example imagine a satellite of the work contract link which contain the conditions of a work contract such as salary and agreed working hours.
Satellites contain the hash key of the elements it describes (this is the hash key previously defined into a hub or a link), the descriptive fields and audit information.
In a properly constructed DataVault model, different satellites should be created for each source system and for different change rates and security levels. For example, a separate satellite should be created to store location of a restaurant, that does not change very often, and the number of sitting places which could change depending of weather. Similarly, a separate satellite should be created for personal information that requires special security, and for work schedule information that is less sensitive.
For example a satellite containing information about restaurants would look like the following
In this case the structure defined up to here would correspond to the following:
In the diagram, each line represents a possible join performed on the hash keys that are also primary keys of each table. Of course there could be satellites linked to the contracts and employees as well.
In addition, there could be several satellites for each of the entities containing information relating to one part or all of the keys. For example Italian and French contracts could be very different and therefore containing different columns, hour worked in one case and days in the other, different form of benefits as 13th salary versus bonuses, etc. This information can be stores into 2 separate satellites tailored to the specific cases and then joined only to the keys they concern. Other reasons to use separate satellites would be for storing transient information that needs to be regularly purged by law and more permanent data.
Data Vault Automation : WhereScape
DataVault stores data in its raw form and the modelling can be described by a clear set of rules. These properties make DataVault an ideal candidate for automation. Several tools are available in the market that allow to build a logical model and then generate the SQL code necessary to create and maintain a DataVault data warehouse. The tool of our choice at Argusa is WhereScape.
WhereScape allows you to define the logical design of a DataVault model, namely the primary keys, how keys are linked, and which fields should go into which satellites. To give you an example, let us imagine a simple model of 2 tables, representing the restaurants and employees described earlier. We loaded these tables in WhereScape. As it can be seen in the image WhereScape automatically detected the structure of the data: columns names and data types. It also helps you infer primary keys (yellow key on the image) and relations between tables (blue keys on the image).
Finally and most importantly for building a DataVault design, WhereScape allows to flag fields as primary keys or as different satellites.
A predefined set or attribute types is given be the products (Fig.2) and more can be added at will.
The following example shows the DataVault design where fields are flagged.
· Primary keys are blue and will be transformed into hubs
· Foreign keys are also blue and will be drive the construction of links
· Two different types of satellites are required, for fast and slow changing attributes (or in the DataVault lexicon: high and low volatility
While these attributes can be flagged manually, WhereScape provides a tool to automatically flag fields based on naming and profiling rules.
After the fields are flagged, the program provides an handy one-click button that will convert the logical design to an actual DataVault model.
In the following image you can see the result. WhereScape created 2 hubs (blue tables), 1 link (red table) and 3 satellites (yellow tables). Naming conventions are respected everywhere and audit fields and hash keys have been automatically added.
Once the logical DataVault model is defined, WhereScape builds the SQL code required for running the data warehouse and allows for scheduling the tasks to load it. In this example, it took us only a few minutes to convert the logical model into code and load the database. You can see some of the results below, where tables are automatically build with the required hash keys and audit fields.
Satellite of restaurant
Our DataVault example is now ready to be delivered to its users. In addition, WhereScape can create, in a signle click of a button, full documentation of the data-model as a PDF or a website.
If you are interested in trying WhereScape or need coaching on a Data Warehouse implementation, please contact firstname.lastname@example.org.