Scaffolding with relative dates
Dernière mise à jour : 25 janv.
Very often it is needed to visualize data in an historical form. On the other hand most of the time storing data directly in historical form is inefficient and requires duplicating the same data for every historical point. For example it is often needed to count how many people were in a company over time but it is very inefficient to make a database where for every day the data of every employee is repeated… much better storing each employee data once with a begin and an end date. Scaffolding is a technique that helps you bridge the gap between an historical visualization and an efficient data storage.
Let’s take a concrete use case. Companies are very worried of their performance and often keep databases about project management to monitor the status of their projects. Some do it on Excel, some use PMO tools; in the end the result is the same: they have data about their projects. Start dates, target dates, end dates, milestones, project responsible. Since we want to monitor, we need to visualize this data in an historical form to obtain insights.
Here is a minimal dataset. It just contains the name of a project, start, target and end date.
And now let’s suppose that your boss assigns to you an apparently very simple task: monitor how many projects where late each month in the last 12 months.
Sounds easy enough but ehy, you quickly realize the date for your axis is not in the database...! In fact you have to compare the dates in the database with a reference date (and with each other). For this task you can use scaffolding, which essentially it is a technique that uses a join of your dataset with a reference dataset; in this case just a calendar!
First of all you have to define a calendar. The boss wants results by month and so we can make a calendar with all months between the minimum and maximum date of our data sample. This will be out reference data! Or in other words the quantity on our axis.
Now we have to join it with our data. In Tableau Desktop you can load both sources and then join them by creating 2 join calculations on both sides where the calculation is just “1”.
The join looks like this:
This join effectively creates a copy of all projects for each of the reference calendar lines. Now we can compare the target and end dates to the reference date.
To calculate if a project is late in a given month we can do the following calculation
Here the case where the end date is missing is considered as if the project has not finished yet.
To finish the visualization and make our boss happy let’s just put the reference date on the columns and the number of records on the lines and finally a filter on IsLate.
Ops… our company is getting some backlog… but we did our job at least.
Optimized scaffolding with a relative date
Scaffolding is nice but is can hit performance hard. Think about a large company with 10.000 projects that wants to monitor its project every day and the company has 60 years of data piled up. That would amount at 10.000 x 21900 ~ 200M lines. This will not run quickly unless you have a performant database…
On top, it is difficult to maintain. Because: what about the future? If I want my dashboard to still work in 10 years from now I have to prepare a calendar with all the future dates in it. And what about after that day?
But remember our boss is only interested in 12 months. So we don’t actually need to make copies of our data for every months, we just need 12 copies! And this will solve the maintenance problem too, you will see.
So now instead of making a real calendar let’s just scaffold using a relative month or in other words just integers: 0, -1, -2, -3, etc.
The join to the project data is exactly the same as before, with a join calculation 1=1.
Now we have to compare this integer to a date with a simple calculation
“Month” now contains the last 12 month starting from today.
(One other version to this would be to use a parameter to let the user choose the starting date for the graph.)
After this the calculations are exactly the same as before and we obtain the same plot
But next month the dates will move and the problem of future dates is solved. On top we just multiplied our dataset by 12 months. If this is still too much you can use Tableau Prep to preprocess the data and do the scaffolding join offline.
Pay attention that when you use scaffolding your data is multiplied. When you have the reference field (in most cases the calendar date) in the view everything works as normal but when you don’t have it you have to use “count distinct” to count elements. For example if you want to count how many projects you have in total you cannot count the lines in the database anymore because there are 12 copies of each project! You can do instead a count distinct of the project name.
Other uses of scaffolding
One different use-case of scaffolding could be to integrate missing data.
Let’s say you want to make a graph of how many product you sold but you did not sell every product yet. You will not see the new product appear at all…! But you want to see it, with a 0 next to it. Yes, indeed, bosses are very demanding...
In this case you can make a list of all products you have available and then join it with the sales data with a left join. Now if your sales are there they will appear, if they are not there a null will appear. Just convert nulls to 0s and the job is done!
This solution also uses a join of your data with a reference so it is scaffolding too!