top of page
Search

# Simple Linear Regression Analysis Using Tableau

Regression analysis is a statistical technique used to examine the relationship between one dependent variable and one or more independent variables.

It is used to make predictions, ex., predict student’s exam score (dependent variable) based on the number of hours they study (independent variable), infer causal relationships, ex., understand the causal relationship between smoking (independent variable) and lung cancer incidence (dependent variable) to determine if and how smoking causes changes in lung cancer rates. and identify trends, ex., study of time-series data (independent variable) to grasp economic trends or seasonal patterns in sales (dependent variable), to understand of how things change over time.

Regression analysis is widely employed in various fields, including economics (ex., for economic forecasting and policy impact evaluation), finance (ex., for predicting stock prices and assessing financial risk), social sciences (ex., for analyzing behavioral patterns and educational outcomes), and healthcare (ex., for predicting patient outcomes and estimating disease risk factors).

Various Regression analysis techniques can be grouped based on criteria such as Linearity (Simple, Multiple), Purpose (Predictive, Explanatory), Assumptions (Parametric, Non- Parametric), Methodology (OLS, Ridge, Lasso etc.), Data characteristics (Time Series, Spatial), etc.

Tableau primarily focuses on data visualization and exploration rather than statistical analysis. Tableau offers built-in capabilities for basic regression analysis and trend lines, which when added to visualizations, can provide a visual representation of the direction and strength of the relationship between variables. Trend lines in Tableau can be linear, logarithmic, exponential, polynomial, or power. The Linear Trend Line, depicted as a straight line, represents simple linear regression (linear relationship), while the Logarithmic, Exponential, Polynomial, and Power Trend Lines, depicted as curved lines, represent forms of non-linear regression (non-linear relationships).

For more advanced models, integrating external statistical software such as Python with Tableau Desktop via TabPy provides extensive statistical and Machine learning libraries while still benefiting from Tableau's superior visualization capabilities. TabPy (Tableau Python Server) allows Tableau Desktop users to execute Python scripts directly within calculated fields and table calculations in Tableau, enabling advanced analytics and calculations using Python libraries such as pandas, NumPy, scikit-learn, and more. The results of these advanced models can then be visualized in Tableau as custom calculated fields, where the output (e.g., predicted values, classifications, clustering results) can be displayed in various chart types such as scatterplots, line charts, bar charts, and more.

We can also use calculated fields in Tableau Desktop or Tableau Public to preprocess data or derive variables needed for regression analysis. We can export data from Tableau Public or Tableau Desktop as CSV or Excel format and load it into external statistical software such as Python, R , SAS, or SPSS for regression analysis. Similarly, after performing the regression analysis, we can export the results (e.g., predicted values, residuals, coefficients) to a CSV or Excel file which can be imported back into Tableau Public or Tableau Desktop for visualization and reporting.

Types Of Trend Lines in Tableau

Regression Models

____________________________________________________________________________________

Regression models are statistical models used to estimate trends/relationships between variables.

In Tableau they are represented in the form of trend lines and require two continuous variables.

____________________________________________________________________________________

Regression line (Trend line)

A trend line, also known as a regression line or line of best fit, is a straight line that best represents the relationship between two variables in a scatterplot. It shows the general pattern or direction in which the data points are moving. It's commonly used in regression analysis to model the relationship between an independent variable (often denoted as x) and a dependent variable (often denoted as y).

The concept of the "line of best fit" applies to both straight line from linear regression and curved line from non-linear regression, with both aiming to minimize the residuals, which are the differences between the observed data points and the predicted values from the regression line. Minimizing the distance (sum of squared residuals) is a mathematical way to ensure the line you choose is as close to all data points as possible, making it the best representation of the underlying trend in the data. This process is the core idea behind regression analysis. This helps to identify and visualize the overall trend in the data, making it easier to understand relationships and make predictions.

It's typically represented by the equation:

y=mx+b

where:

y is the dependent variable.

x is the independent variable.

m is the slope of the line, representing the rate of change of y with respect to x.

b is the y-intercept, representing the value of y when x is 0.

To calculate 𝑚 (the slope) and 𝑏 (the y-intercept) for a set of data points (𝑥1, 𝑦1), (𝑥2, 𝑦2) …, (𝑥𝑛, 𝑦𝑛), we can use the following formulas:

where:

The process of finding the best-fitting line (also known as the regression line) using the method of least squares with the given sample data:

Steps in Simple Linear Regression

1.    Plot the data:

we can plot the given data points to visualize the relationship between x (independent variable) and y (dependent variable).

Values of variable y are the observed values.

Points :(1, 4), (2, 6), (3, 7), (4, 9), (5, 10)

2.     Calculate the Sums:

3.    Calculate slope m :

4.   Calculate the intercept b:

5. From the Regression Equation we get Regression line as:

Using this equation, for every value of x we derive y (predicted y values)

Plot the points (1, 4.2), (2, 5.7), (3, 7.2), (4, 8.7), (5, 10.2), (6, 11.7) and connect them with a straight line (regression line).

The least squared method minimizes the sum of the squared residuals (SSR).

6. Calculate the residuals:

Draw them as vertical lines from the actual data points to the regression line:

7. Find Squared Residuals:

8.  Find Sum of Squared residuals (SSR):

How Minimizing SSR Ensures Best Fit

Before we fit the regression line, we can calculate the SSR for any arbitrary line to understand the concept.

Comparison of SSR Values

·         Initial arbitrary line y = x: SSR = 91

·         Best- fitted regression line y = 1.5x + 2.7: SSR = 0.3

·         The SSR decreased from 91 to 0.3 by finding the best-fitting regression line. This demonstrates that the regression line 𝑦=1.5𝑥+2.7is much closer to all data points on average compared to the arbitrary line 𝑦=𝑥

·         The SSR value depends on the data and the variability of the data points around the regression line. Lower SSR indicates a better fit of the regression line to the data points and Higher SSR indicates a poorer fit of the regression line to the data points.

R-square:

R-square value, also known as the coefficient of determination, represents the proportion of the variance in the dependent variable that is predictable from the independent variable. In a scatterplot, variance helps visualize the spread or dispersion of data points along both the horizontal (x-axis) and vertical (y-axis) axes.  R-square value ranges from 0 to 1:

o    R-square value =1

The trend line perfectly fits the data, meaning all data points lie exactly on the line.

o    R-square value =0

The trend line does not explain any of the variance in the dependent variable, meaning it does not fit the data at all.

o    High R-square value (close to 1)

means that the independent variable x explains most of the variance in the dependent variable y. In terms of the spread of data points, a high R-square value means that the data points tend to cluster closely around the regression line, indicating that the model accurately captures the relationship between the variables.

o    Low R-square value (close to 0)

means that the independent variable x explains very little of the variance in the dependent variable y. In terms of the spread of data points, a low R-square value means that the data points are scattered widely around the regression line, indicating that the model does not effectively capture the relationship between the variables.

Relationship between Trendline and R-square:

R-square value provides a measure of how well the trend line (or regression line) represents the data points. i.e. It tells us how close the data is to the trendline (line of best fit or regression line). scatter plot of the data points with the trend line plotted over it helps visualize the fit. The closer the data points are to the trend line, the higher the R-square value. The higher the R-squared, the better the model fits our data.

Pearson correlation coefficient (r):

The Pearson correlation coefficient (often represented as r) is a measure of the linear relationship between two variables. i.e. this formula quantifies the linear relationship between two variables X and Y, with 𝑟 indicating the strength and direction of this relationship. It ranges from -1 to 1,

where:

·         A positive correlation indicates that as values of x increase, values of y also increase.

·         A negative correlation indicates the opposite—as values of x increase, values of y decrease.

·         The closer the correlation r, to -1 or 1, the stronger the relationship between x and y

·         If r is close to or equal to 0, there is a weak relationship or no relationship between the measures.

·        Generally, you can interpret r values this way:

o    +.70 or higher indicates a very strong positive relationship.

o    +.40 to +.69 indicates a strong positive relationship.

o    +.20 to +.39 indicates a moderate positive relationship.

o    -.19 to +.19 indicates no or a weak relationship.

o    -.20 to -.39 indicates a moderate negative relationship.

o    -.40 to -.69 indicates a strong negative relationship.

o    -.70 or lower indicates a very strong negative relationship.

The formula for calculating Pearson correlation coefficient (r) between the two variables x and y is:

Relation between r and R-square:

When dealing with simple linear regression with one independent variable, the R-square value is the square of the Pearson correlation coefficient r.

We can simply square the Pearson correlation coefficient r to get the R-square value.

Visual representation of the relationship between variables using Trend line or regression line in Tableau

Sepsis, a life-threatening condition, triggers a dysregulated immune response to infection, often resulting in widespread inflammation and coagulation abnormalities. Partial Thromboplastin Time (PTT), typically measured in seconds, determines the time for blood clotting, while fibrinogen concentration, measured in grams per liter (g/L), is essential in the clotting process. In sepsis, disruptions in coagulation pathways can lead to prolonged PTT and altered fibrinogen levels. Studying their correlation helps understand how clotting dynamics are affected in septic patients, with trend lines aiding in visualizing any associations between these variables in the scatterplot and assist in predicting patient outcomes, managing treatment, and improving clinical decision-making. Our dataset contains 2932 Sepsis confirmed patients.

Step 1) Create a calculated field using the CORR function and find Pearson correlation Coefficient r.

Step 2) Create Scatter plot to show the relationship between two sets of data.

Fibrinogen concentration (dependent variable) on Y (Vertical) axis and PTT (independent variable) on X (Horizontal) axis.

Fibrinogen concentration is a dependent variable. It is directly influenced by sepsis, as changes in fibrinogen levels respond to the presence or severity of the condition.

Partial Thromboplastin Time -PTT is an independent variable though potentially influenced by sepsis but can also be affected by other factors, making its relationship with sepsis less direct.

1.     Drag PTT to Columns and Fibrinogen to Rows.

2.     In the Analysis menu, uncheck Aggregate Measures

3.     Drag and drop Trend line in the Analytics pane to the Linear regression model of ‘Add a Trendline ‘dialogue box.

4.     Hover over the trend line to locate the R-Squared value.

Interpretation:

R-Square from the trend line is 0.0135971. It indicates that approximately 1.36% of the variance in the dependent variable (Fibrinogen concentration) can be explained by the independent variable (PTT).

The square root of R-Square is the Pearson Correlation Coefficient (r)= 0.116606604

The Pearson Correlation Coefficient (r) using CORR function is - 0.1166, indicating a weak negative correlation and a downward slopping line, inferring if one variable changes, other changes in opposite direction. And we can see that as the PTT (Partial Thromboplastin Time) increases the Fibrinogen concentration is decreased.

Monitoring PTT levels in sepsis patients could provide some insight into their Fibrinogen status. However, given the weak correlation, Fibrinogen concentration should be assessed using additional clinical parameters and laboratory tests to obtain a comprehensive understanding of the patient's coagulation status.

In conclusion, while Tableau primarily emphasizes data visualization, it also offers basic regression analysis tools. Users can create regression lines, assess model fit through residuals, and evaluate performance metrics like R-squared. These features enhance understanding and communication of data relationships within Tableau's visual environment.

References:

114 views1 comment

## 1 Comment

Rated 0 out of 5 stars.
No ratings yet