A window function performs a calculation across a set of table rows that are related to the current row. Window functions are the same as regular aggregate functions but unlike regular aggregate functions, window function does not cause rows to be grouped into a single output row.
The following query uses the Sum() aggregate function to calculate the total salary of all employees in the company:
SELECT SUM(salary) as sum_salary FROM employees;
As shown clearly in the output, all rows of the employee table are grouped in a single row.
In the following query, we have used the Sum() as a window function. It returns the sum salary of all employees along with the salary of each employee:
SELECT EmpId, EmpName, EmpDOB, Salary, SUM(Salary) OVER() as sum_salary
FROM employee;
This is the same as the regular sum aggregate function, but the OVER clause causes it to be treated as a window function and computed across an appropriate set of rows.
SQL window function syntax
The syntax of the window functions is as follows:
window_function_name (expression) OVER (partition_clause order_clause frame_clause)
A window function call always contains an OVER clause directly following the window function’s name. This is what distinguishes it from a regular function or aggregate function.
window_function_name: The name of the window function such as sum(), rank() etc..
expression: The target expression or column on which the window function operates.
OVER: The OVER clause determines how the rows of the query will be processed by the window function. The OVER clause consists of three clauses: partition, order, and frame clauses.
If the PARTITION BY clause is not specified, then the whole result set is treated as a single partition/group as we have seen in the above example.
The ORDER clause specifies the orders of rows in a partition on which the window function operates.
A FRAME is the subset of the current partition. It specifies the rows on which the window functions work. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row. A window frame is defined by a lower bound and an upper bound relative to the current row. The lowest bound is the first row, which is known as UNBOUNDED PRECEDING. The highest bound is the last row, which is known as UNBOUNDED FOLLOWING. For example, if we only want to get 5 rows before the current row, then we will specify the range using 5 PRECEDING.
Here is an example that shows how to compare each employee’s salary with the average salary in his or her department:
SELECT deptname, empno, salary, AVG(salary) OVER (PARTITION BY deptname) AS avg_salary FROM employee;
The first three columns in the output are directly from the table employee, and there is one output row for each row in the table. The fourth column represents an average taken across all the rows that are grouped by the department name value as the current row.
We can also specify the order in which we want rows to be processed by window functions using ORDER BY within OVER. Here is an example:
SELECT deptname, empno, salary, RANK() OVER (PARTITION BY deptname ORDER BY salary) as emp_rank FROM employee;
Here the RANK function generates the rank for each partition (in this case, deptname) ordered by the value defined in the order by clause(in this case, salary).
The window functions are applied on the ‘virtual table’ produced by the query’s FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses if any. For example, the rows removed after applying where clause will not be seen by the window functions, and window functions will be applied on the filtered rows.
When a query involves multiple window functions, we can write out each one with a separate OVER clause, but this is duplicative and error-prone if we are using the same window function. Instead, each windowing behavior can be named in a WINDOW clause and then referenced in OVER. For example:
SELECT SUM(salary) OVER win as sum_salary, AVG(salary) OVER win as avg_salary FROM employee WINDOW win AS (PARTITION BY deptname ORDER BY salary);
SQL window function types
The window functions are of three types:
Value Window Functions:
LAG() - Accesses the value stored in a row before the current row.
LAST_VALUE() - Returns the last value in an ordered set of values
LEAD() - Accesses the value stored in a row after the current row.
FIRST_VALUE() - Returns the first value in an ordered set of values
Aggregation Window Functions:
AVG() - Find the average of the expression in the partition.
COUNT() - Count the number of rows in the partition.
MAX() - Find the maximum of the expression in the partition.
MIN() - Find the minimum of the expression in the partition.
SUM() - Find the sum of the expression in the partition.
Ranking Window Functions:
CUME_DIST() - The cumulative distribution: the percentage of rows less than or equal to the current row.
NTILE() - Distributes the rows of a partition into a specified number of buckets. For example, if we perform the window function NTILE(5) on a table with 100 rows, then rows 1 to 20 will be in bucket 1, rows 21 to 40 in bucket 2, rows 41 to 60 in bucket 3, rows 61 to 80 in bucket 4 and row 81 to 100 in bucket 5.
RANK() - As the name suggests, the rank function assigns rank to all the rows within every partition. Rank is assigned such that rank 1 is given to the first row and rows having same value are assigned same rank. For the next rank after two same rank values, one rank value will be skipped.(1,2,3,3,5,6...)
DENSE_RANK() - It assigns rank to each row within partition. Just like rank function first row is assigned rank 1 and rows having same value have same rank. The difference between RANK() and DENSE_RANK() is that in DENSE_RANK() no rank is skipped.(1,2,3,4,5...)
PERCENT_RANK() -Assigns the rank number of each row in a partition as a percentage.
ROW_NUMBER() - Assigns a sequential integer to each row within the partition of a result set. Row numbers are not repeated within each partition.
The SQL window functions are very powerful and efficient. The SQL window functions are also often used in the everyday work of data scientists and data analysts. Hope this blog is helpful. Thanks for Reading.