In PostgreSQL, the CUME_DIST() function is a window function used for calculating the cumulative distribution of a value within a group of values. It returns the cumulative probability that a random variable is less than or equal to a specified value.
What is a WINDOW function ?
Window functions, also known as analytic functions or windowing functions, are a powerful feature in PostgreSQL (and other SQL databases) that allow you to perform calculations across a set of rows related to the current row within a query result set. These functions operate on a "window" of rows defined by a partition and an optional ordering.
Here are some key characteristics of window functions in PostgreSQL:
Partitioning: Window functions can be partitioned, meaning that they operate on subsets of rows within the result set defined by one or more columns. The partitioning divides the result set into distinct groups, and the function is applied independently to each group.
Ordering: Within each partition, rows can be ordered using one or more columns. This defines the order in which the function operates on the rows within each partition.
Window Frame: Optionally, you can define a "window frame" that specifies a subset of rows within each partition to which the window function applies. Commonly used window frame specifications include ROWS BETWEEN, RANGE BETWEEN, and GROUPS BETWEEN.
Aggregate and Analytic Functions: Window functions can be aggregate functions (e.g., SUM, AVG, COUNT) or analytic functions (e.g., ROW_NUMBER, RANK, LAG, LEAD). Analytic functions calculate a value for each row based on a set of related rows, while aggregate functions compute a single result from a set of input rows.
Syntax: Window functions are typically used with the OVER() clause, which specifies the partitioning, ordering, and window frame for the function. The OVER() clause follows the function call and is enclosed in parentheses.
Window functions are incredibly versatile and useful for performing complex analytical and reporting tasks within SQL queries. They provide a flexible way to perform calculations that would otherwise be challenging or impossible with standard SQL aggregate functions.
PostgreSQL offers a wide range of window functions for performing various analytical tasks. Here's a list of some commonly used window functions in PostgreSQL:
Aggregate Functions:
AVG(): Calculates the average value.
SUM(): Calculates the sum of values.
COUNT(): Counts the number of rows.
MIN(): Finds the minimum value.
MAX(): Finds the maximum value.
STRING_AGG(): Concatenates strings into a single string.
Analytic Functions:
ROW_NUMBER(): Assigns a unique sequential integer to each row.
RANK(): Assigns a rank to each row based on the ordering.
DENSE_RANK(): Assigns a dense rank to each row based on the ordering.
NTILE(): Divides the rows into a specified number of buckets.
LAG(): Accesses data from a previous row.
LEAD(): Accesses data from a subsequent row.
FIRST_VALUE(): Returns the first value in an ordered set of values.
LAST_VALUE(): Returns the last value in an ordered set of values.
PERCENT_RANK(): Calculates the relative rank of a value as a percentage.
CUME_DIST(): Calculates the cumulative distribution of a value.
These window functions can be extremely powerful for performing complex analytical tasks directly within SQL queries in PostgreSQL, allowing for efficient and flexible data analysis.
The syntax of CUME_DIST() Function is :
PARTITION BY clause :
The PARTITION BY clause divides rows into multiple partitions to which the function is applied and it is optional.
The CUME_DIST() function will treat the whole result set as a single partition if we don’t mention the PARTITION BY clause.
ORDER BY clause:
The ORDER BY clause sorts rows in each partition to which the CUME_DIST() function is applied.
Return Value :
The CUME_DIST() is a double precision value within the below range.
Now let’s go through the PostgreSQL CUME_DIST() example.
CREATE a table or use an existing table from the dataset.
Create a table named film with the below columns.
The syntax for creating a table:
The syntax for inserting data into the created table is below:
I will be using my existing table film from my data set which is exactly like below.
Below is a scenario to calculate the Cumulative Distribution!
"Can you provide a list of films with their IDs, titles, release years, rental rates, and ratings, along with the cumulative distribution of rental rates within each rating category, for films with IDs less than 50?"
OUTPUT :
In this example:
The PARTITION BY clause divided the rows into two partitions by rating.
The ORDER BY rental_rate clause orders the rental rates within each partition.
WHERE film_id < 50: This filters the rows from the film table where the film ID is less than 50.
So, for each row in the result set, the CUME_DIST() function calculates the cumulative distribution of the rental_rate within its rating category. This gives you the relative position of each film's rental rate within its rating, expressed as a value between 0 and 1 (inclusive), representing the cumulative probability. The WHERE clause ensures that only films with an ID less than 50 are included in the result set.
Now let's check on another Windows function ...
In PostgreSQL, the PERCENT_RANK() function is a window function used to calculate the relative rank of a value within a group of values as a percentage. It assigns a percentile rank to each row based on the ordering specified in the ORDER BY clause.
Here's how PERCENT_RANK() works:
I will be utilizing the same film table for this.
Scenario :
"Could you provide a list of films with their IDs, titles, release years, rental rates, and ratings, along with the percentile rank of each film's rental rate within its respective rating category?"
OUTPUT :
In conclusion, window functions in PostgreSQL offer a powerful set of tools for performing advanced analytics and gaining deeper insights into your data. Among these functions, CUME_DIST() and PERCENT_RANK() stand out as valuable tools for analyzing data distributions and understanding the relative positioning of values within partitions.
The CUME_DIST() function calculates the cumulative distribution of a value within a group of values, providing the cumulative probability that a random variable is less than or equal to a specified value. This function is particularly useful for understanding the spread of values within a dataset and identifying outliers.
On the other hand, the PERCENT_RANK() function calculates the relative rank of a value within a group of values as a percentage. It assigns a percentile rank to each row based on the ordering specified in the ORDER BY clause, allowing for a deeper understanding of the distribution of values within partitions.
By leveraging these window functions, PostgreSQL users can gain valuable insights into their data, perform complex analyses, and make more informed decisions based on a deeper understanding of data distributions and relative positioning of values.
Comments