image source unsplash.com
A window function performs calculations on table rows that are related to the current row and use of the window function does not group results into a single row.
I am using SampleSuperstore data set for my examples and PostgreSQL database
What is SQL Aggregate Function
An aggregate function performs calculation on a set of values to return a single value. Aggregate functions are usually used with the GROUP BY clause of the SELECT statement. HAVING clause is used to write conditions on the value that is returned by the aggregate function. Aggregate functions return the same value each time that it is called with a specific set of input values on same dataset.
Aggregate function examples
SELECT region,
sum(sales) as total_revenue_region
FROM "SampleSuperstore"."Orders" where date_part('year', order_date) = 2021 group by region;
The above example is to get sum of sales by each region. It uses a simple aggregate function which is sum(sales) output by single column grouping. We can add multiple columns in the select statement as long as those columns are included in the group by clause as shown below.
SELECT region, category,
sum(sales) as total_revenue_region_category
FROM "SampleSuperstore"."Orders" where date_part('year', order_date) = 2021 group by region, category;
The above query uses two field for grouping the data and for each group the query return a single row as the result.
If we try to select other columns that are not in the group by clause, the aggregate function query will give error.
Here the Aggregate functions worked on a set of rows to return a single result value. If we need only a high level summary on the data the aggregate functions are useful.
What is SQL Window function
The SQL Window Function calculates an aggregate value based on a group of table records called window frame and return multiple rows for each group. FROM clause in the Query filters the data and those virtual table data are considered in a window function for operations. Multiple window functions can slice up the data in different ways by using OVER clauses. We can specify the window frame partition by using PARTITION BY clauses.
When we have the OVER clause with an empty parameter, it will operate on the whole set of rows. Below example does calculations sum, average , min, max and percentage using the whole set of rows.
select sales,
sum(sales) over () as sum,
avg(sales) over () as avg,
max(sales) over () as max,
min(sales) over () as min,
(sales::numeric/sum(sales) over())*100 as percentage
from "SampleSuperstore"."Orders"
When we expand the above example by adding PARTITION in the OVER clause , we reduce the window frame for the column value group specified in the PARTITION column. Below example does calculations sum, average, min, max and percentage for a set of window frame for each region partition.
select product_name, region, sales,
sum(sales) over (PARTITION BY region) as total_sales_per_region,
avg(sales) over (PARTITION BY region) as avg_sales_per_region,
max(sales) over (PARTITION BY region) as max_sales_per_region,
min(sales) over (PARTITION BY region) as min_sales_per_region,
(sales::numeric/sum(sales) over(PARTITION BY region))*100 as percentage
from "SampleSuperstore"."Orders"
We can add running total of sales in the above example by adding ORDER BY clause in the window frame. It orders the rows within a window frame partition. Below example does calculations sum, average, min, max and percentage for a set of window frame for each region partition and calculates running total for each row by using the ORDER BY clause. Here we can see that the value for running total sales column changes for each row even within a window frame.
select order_date, product_name, region, sales, sum(sales)
over (PARTITION BY region) as total_sales_per_region, sum(sales)
over (PARTITION BY region ORDER BY sales ASC) as running_total_sales_per_region,
avg(sales) over (PARTITION BY region) as avg_sales_per_region,
max(sales) over (PARTITION BY region) as max_sales_per_region,
min(sales) over (PARTITION BY region) as min_sales_per_region,
(sales::numeric/sum(sales) over(PARTITION BY region))*100 as percentage
from "SampleSuperstore"."Orders"
The below example explains the use of multiple columns in the PARTITION BY clause. The query outputs averages and totals for each region, in addition to that it includes values for each region and category combination as well.
SELECT product_name, sales, profit, region, category
, sum(sales) OVER () as "total_sales",
sum(sales) OVER (PARTITION BY region) as "total_sales_per_region",
avg(sales) OVER (PARTITION BY region) as "avg_sale_in_region",
sum(sales) OVER (PARTITION BY region,category) as "total_sales_per_region_category",
avg(sales) OVER (PARTITION BY region,category) as "avg_sale_in_region_category"
FROM "SampleSuperstore"."Orders" where date_part('year', order_date) = 2021
If ordering the rows is not important, the ORDER BY clause can be avoided in the window function. As shown in the above examples, it is also possible to avoid PARTITION BY clause, in that case there will be only one partition containing all the rows.
Conclusion
I have used Avg and Sum in my example, there are many other window functions like RANK(), LAG() or LEAD() available other than aggregates. Windows functions are useful when we need to analyze a set of data within a windows frame.
Comments