What is a Dendrogram?
A dendrogram is essentially a tree diagram with branches. It is a powerful tool commonly used to represent hierarchical clustering, and it helps in understanding the connections between categorical variables in a complex dataset. The number of branches from the original node (the trunk) depends on the number of categories or subgroups that fall under the main group. The height of the branches usually depicts how similar or dissimilar the groups or clusters are.
How will it look?
A dendrogram can be laid out in three ways for easy visualization:
Vertically
Horizontally
Radially
Examples:
Vertical dendrogram: the connections between main variables and sub-variables are illustrated vertically.
Horizontal dendrogram: the connections between main variables and sub-variables are illustrated horizontally.
Radial dendrogram: the connections between main variables and sub-variables are illustrated in a circular fashion. The main advantage of a radial dendrogram is that it uses the graphic space more efficiently.
Dendrograms can also be classified by levels:
Single level
Multilevel
Single level:
As the word single suggests, there is just one level of node division.
In this example, the sepsis dataset is used to check how many patients fall under the sepsis, onset sepsis and non-sepsis categories. The dataset contains 40,336 patients in total: 37,404 non-sepsis patients, 2,506 onset sepsis patients and 426 sepsis patients.
Multi Level Dendrogram:
Multilevel dendrograms contain multiple branches or layers and are used specifically for hierarchical groupings. In the provided diagram, Acute Respiratory Distress Syndrome (ARDS) patients serve as subgroups, which makes it possible to explore how many sepsis patients also have ARDS.
Let's now delve into the step-by-step process of creating an interactive single and multilevel dendrogram in Tableau, using the sepsis dataset.
Dataset used: Sepsis dataset
Before getting into the steps to create a dendrogram, let's look at what sepsis is, as we will be using the term often.
Sepsis:
According to the Centers for Disease Control and Prevention (CDC), sepsis is a life-threatening medical emergency in which the body reacts to an infection in an extreme manner. It is important for everyone to know about sepsis, as most cases begin before the patient seeks medical attention, and about 1 in 3 people who die in a hospital had sepsis during that hospitalization.
Now let's take a look at the steps to follow to create a dendrogram:
Create a union in the dataset for data densification
Create necessary calculated fields
Design a single level dendrogram for sepsis category
#1. Union - Data densification
To create a dendrogram, we first have to densify the data by creating a union, as Tableau does not natively support sigmoid curve graphs.
The union can be done in 2 ways:
A. The first method is to create an Excel sheet with a Path column running from 0 to 200 and to combine it with the dataset. Here, we are adding several additional points between the start point 0 and the end point 200 so that, when these points are connected, a curved line is formed.
Once the Excel sheet is created, we combine it with the dataset using a join.
After creating the join, we create bins of size 1 on the Path variable that came from the path Excel sheet.
Find Path in the Data pane.
Right-click on Path, go to Create and select Bins. Then set the size of the bins to 1 and click OK.
B. Now, let's see the second way of densifying the data.
For this method, you don't need a separate Excel sheet. Instead, we union the dataset with itself.
First, duplicate the dataset, then drag and drop the duplicate onto the existing dataset to create a union.
After this step, create a calculated field named 'Path' using the drop-down menu near the search bar in Tableau's Data pane.
Formula used : IIF([Table Name] = 'Dataset.csv',0,200)
Then create a bin of size 1 for the Path variable following the same steps as above.
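To make the idea of densification concrete, here is a minimal sketch in Python rather than Tableau. It assumes that the union leaves every category with just the two end points 0 and 200, and that the size-1 bins with 'Show Missing Values' fill in everything in between. The function name `densify` and the category labels are only illustrative and are not part of the workbook.

```python
def densify(categories, start=0, end=200, step=1):
    """Return one (category, path) row for every bin from start to end."""
    rows = []
    for category in categories:
        # The union only provides the two end points for each category.
        first, last = start, end
        # Bins of size 1 with 'Show Missing Values' fill in the points between.
        for path in range(first, last + 1, step):
            rows.append((category, path))
    return rows


rows = densify(["Non-Sepsis", "Onset Sepsis", "Sepsis"])
print(len(rows))  # 3 categories x 201 path points = 603
```

Each category ends up with 201 points, which is what later gets turned into a smooth curve.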
#2. Creating calculated fields
A total of 9 calculated fields are needed for the single level dendrogram.
Create the following variables as calculated fields using the given calculations. These variables are used to draw the curves in the dendrogram.
Since the point of interest is the number of patients under each group, the patient ID variable that is already present in the dataset is used.
To add the rounded bars along the dendrogram curves, the Percentage and Percentage Adjusted fields are created.
i. Patient_id - A new variable is created from the existing patient ID variable to reflect the window sum of patients. It is divided by 2 because the data densification duplicated every row in the dataset.
Formula used: WINDOW_SUM(SUM([Patient ID]))/2
By default, WINDOW_SUM returns the total over the entire partition unless a specific range is given to the window.
ii. total patient id - This uses exactly the same calculation as Patient_id. Even though the formula is the same, the two fields are used slightly differently in the nested calculations later on.
Formula used: WINDOW_SUM(SUM([Patient ID]))/2
iii. Percentage
Formula used: [Patient_id]/[total_patient_id]
iv. Percentage Adjusted:
Formula used: [Percentage]/WINDOW_MAX([Percentage])
v. Rank - This variable will help to create a unique ranking system based on the number of patients.
Formula used: RANK_UNIQUE([Patient_id],"desc")
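As a rough check of what Percentage, Percentage Adjusted and Rank work out to, the sketch below reproduces them in Python. It is not the Tableau implementation: it assumes the per-category values are the distinct patient counts quoted earlier for the single level example, and the dictionary names are hypothetical.

```python
# Patient counts per category, taken from the single level example above.
counts = {"Non-Sepsis": 37404, "Onset Sepsis": 2506, "Sepsis": 426}

# Total over all categories (the role played by total patient id).
total = sum(counts.values())

# Percentage: share of each category in the total.
percentage = {name: n / total for name, n in counts.items()}

# Percentage Adjusted: rescaled so that the largest category becomes 1.
max_pct = max(percentage.values())
percentage_adjusted = {name: p / max_pct for name, p in percentage.items()}

# Rank: unique rank in descending order of patient count (1 = largest).
ordered = sorted(counts, key=counts.get, reverse=True)
rank = {name: position + 1 for position, name in enumerate(ordered)}
```

With these counts the shares come out at about 92.7%, 6.2% and 1.1%, and Non-Sepsis gets rank 1 with an adjusted percentage of exactly 1.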
Now let's use these new variables to create the X and Y variables for the X and Y coordinates.
vi. X - This calculation spaces the points generated during data densification equally. With 201 path points (0 to 200), INDEX() runs from 1 to 201, so the x-axis ranges from -6 to 18 with a step size of 0.12.
Formula used: ((INDEX()-1)*0.12)-6
vii & viii. To calculate the Y-axis coordinate, we need a sigmoid function.
Sigmoid
Formula used: 1/(1+exp(-[X]))
Y
Formula used: ([Sigmoid]*([Rank]-(WINDOW_MAX([Rank])+1)/2)/100)
The Y coordinate is needed to draw a separate line for each subcategory.
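The X, Sigmoid and Y formulas can be tried out outside Tableau with a short Python sketch. It assumes 201 path points and three categories (so the maximum rank is 3). The function names are hypothetical stand-ins for the calculated fields.

```python
import math


def x_coord(index):
    """X: equally spaced points, mirroring ((INDEX()-1)*0.12)-6."""
    return (index - 1) * 0.12 - 6


def sigmoid(x):
    """Sigmoid: mirrors 1/(1+exp(-[X]))."""
    return 1 / (1 + math.exp(-x))


def y_coord(x, rank, max_rank):
    """Y: sigmoid scaled by the rank's offset from the middle rank."""
    return sigmoid(x) * (rank - (max_rank + 1) / 2) / 100


# INDEX() runs from 1 to 201 along the densified path.
xs = [x_coord(i) for i in range(1, 202)]

# One curve per rank; with 3 categories the middle rank is 2.
curves = {rank: [y_coord(x, rank, 3) for x in xs] for rank in (1, 2, 3)}
```

All three curves start close to 0 at X = -6. The middle rank stays flat, while ranks 1 and 3 bend away to roughly -0.01 and +0.01 by X = 18, which is what gives the dendrogram its branches.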
ix. Size
Formula used:
IF [X]>=6 and [X]<=6 + (10 * [Percentage(Adjusted)])
then 1
else 0
END
The Size variable adjusts the size of the rounded bars drawn along the dendrogram curves according to the percentage of each subcategory.
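To see what the Size formula does, here is a small Python sketch of the same condition. The function name `size` is hypothetical, and the adjusted percentages passed in are example values rather than output from the workbook.

```python
def size(x, percentage_adjusted):
    """Return 1 where the rounded bar is drawn, 0 elsewhere.

    Mirrors: IF [X]>=6 and [X]<=6 + (10 * [Percentage(Adjusted)]) then 1 else 0 END
    """
    return 1 if 6 <= x <= 6 + 10 * percentage_adjusted else 0


# Largest category (adjusted percentage 1): the bar covers X from 6 to 16.
print(size(16, 1.0))    # 1
print(size(16.1, 1.0))  # 0

# A small category (adjusted percentage 0.067): the bar is much shorter.
print(size(6.5, 0.067))  # 1
print(size(7, 0.067))    # 0
```

Every bar starts at X = 6, where the curve has flattened out, and its length is at most 10 units, so it always fits inside the X range of -6 to 18.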
#3. With all the calculated fields created and at our disposal, let's now create the single level dendrogram.
1. First, change the Marks type from Automatic to Line.
2. Add the category variable to the view (in this example it is Sepsis classification, where the sepsis variable in the dataset is used to label the patients as Sepsis, Non-Sepsis and Onset Sepsis).
3. Drag the created Path (bin) from the Data pane to the Columns shelf and confirm that 'Show Missing Values' is selected in its menu.
4. Once 'Show Missing Values' is selected, drag the Path Bin pill to Detail on the Marks card.
5. In the Data pane, find the X variable and drag it to the Columns shelf.
6. Likewise, find the Y variable and drag it to the Rows shelf.
7. Right-click on the X pill in the Columns shelf, select 'Compute Using' from the menu and then choose 'Path Bin'.
8. Do the same for the Y pill in the Rows shelf.
9. Once Compute Using Path Bin is set for both X and Y, a curve like the one in the picture below should appear.
Now, let's work on the nested calculations.
A. Edit the table calculations
First, right-click on the Y pill in the Rows shelf and select 'Edit Table Calculation'.
i. Select the Y variable under Nested Calculations.
Then, under Compute Using, choose 'Specific Dimensions' from the options Table, Cell and Specific Dimensions. Of the two dimensions listed below, Sepsis classification (category) and Path Bin, check only 'Sepsis classification'.
ii. Next, select the Rank variable under Nested Calculations.
Again, under Compute Using, choose 'Specific Dimensions' and check only 'Sepsis classification'.
Y and Rank are used to create a separate line for each category, with the lines arranged in descending order.
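The difference between computing along Path Bin and computing along Sepsis classification can be sketched in Python. This is only an illustration of the idea, assuming the patient counts from the single level example. The dictionaries `index` and `rank` are hypothetical stand-ins for what INDEX() and Rank return per mark.

```python
# Patient counts per category, taken from the single level example above.
counts = {"Non-Sepsis": 37404, "Onset Sepsis": 2506, "Sepsis": 426}
path_bins = range(0, 201)

# Compute using Path Bin: the calculation runs along the bins and
# restarts for every category, which is what INDEX() inside X needs.
index = {
    (category, path): position
    for category in counts
    for position, path in enumerate(path_bins, start=1)
}

# Compute using Sepsis classification: the calculation runs across the
# categories and restarts for every bin, which is what Rank needs.
ordered = sorted(counts, key=counts.get, reverse=True)
rank = {
    (category, path): ordered.index(category) + 1
    for path in path_bins
    for category in counts
}
```

Every category gets its own run of index values from 1 to 201, while the rank of a category is the same at every path point. That is why each category is drawn as one continuous line.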
The dendrogram has now taken shape. We can now add the Sepsis classification variable and the patient ID variable to Label on the Marks card to name the lines. Note: to get the unique number of patients under each category, use the COUNTD function.
From the picture, we can see that the dendrogram is arranged in descending order of the number of patients in each sepsis category.
Let us now add a cool feature to the dendrogram: rounded bars. To create them, follow the steps below.
A. Size Variable
i. Find the Size variable in the Data pane and drag it to Size on the Marks card.
ii. Then, right-click on this pill, click 'Compute Using' in the menu and select 'Path Bin'.
iii. Then right-click on the Size pill again and go to the nested calculations.
iv. In the nested calculations, select the total patient id variable.
v. Then, select Specific Dimensions under Compute Using.
vi. Below that, check both boxes: Path Bin and Sepsis classification (category).
vii. Sepsis classification should sit above Path Bin in the list. If it does not, drag Path Bin below Sepsis classification.
viii. The rounded bars should now appear, as shown in the picture below.
B. Percentage (Adjusted)
i. In the Nested Calculations drop-down menu, select Percentage (Adjusted).
ii. Then, under Compute Using, select Specific Dimensions.
iii. Below that, check both boxes: Path Bin and Sepsis classification (category).
iv. Sepsis classification should sit above Path Bin in the list. If it does not, drag Path Bin below Sepsis classification.
v. After following these steps, the length of each rounded bar will vary according to the number of patients in that category.
The single level dendrogram is ready! We can now beautify it by hiding the grid lines and the x-axis and y-axis headers.
Learning about dendrograms from online materials and applying these techniques to the current dataset was both challenging and exhilarating. I hope this example of creating a dendrogram with a new dataset will aid your understanding of the concepts. If you found this article helpful, please give it a like/clap. Thanks for reading and stopping by!
If you are interested in trying out an interactive multilevel dendrogram, do take a look at my next blog, "Multilevel Interactive Dendrogram for Data Analysis".