Abdullah G. R Mastan

Dec 8

All You Need to Know About SQL Windows Function

The most basic premise of the Window Function is to perform comparative analysis, which gives rise to its alternate name, i.e., Analytic Functions!

The functionality of Window Functions and GROUP BY Clause is quite similar. Both functions aggregate data, however, the former presents the aggregation against each row, while the latter collapses and summarizes it.

For the time being, just visualize this result obtained through windowing. I'm sure it will give you a rough idea.

Notice how the salary for both departments has been aggregated (averaged) and displayed in the dept_avg_salary column, while the individual salaries remain intact in the salary column!

This is what I meant when I said “The former presents the aggregation against each row.”

Difference Between GroupBy and Window Functions

Sometime in the future, when you learn more about SQL Window Functions, you may ask…

“Wait, wait, wait, what's the difference between GroupBy and Windowing.”

Let's make a distinction between GROUP BY and Windows Function before that day comes. Things are going to get a bit technical!

For each user-specified group, the GROUP BY function applies user-defined aggregations, which collapses multiple rows into a single one. This helps in summarizing the data, specifically!

Referencing the first example, the rows were grouped, however, the rows that lead to obtaining the dept_avg_salary weren't removed. In other words, the rows weren’t collapsed.

Window Functions uses pseudo-GroupBY (I’ll explain this) and aggregation to summarize data, however, it leaves the table as it is, i.e., it does not collapse the data into a single row per group. Instead, the aggregation output is added next to each row/ data point.

SQL Window Function Syntax Breakdown

Before delving any further, I would like you to get to know the syntax used for Windowing. It’s pretty straightforward, making it easy to grasp!

Syntax:

SELECT

col_name1,

...,

WINDOW_FUN(col_name) OVER (PARTITION BY col_name ORDER BY col_name) AS alias

FROM table_name

Syntax Breakdown:

WINDOW_FUNC: The Aggregation you wish to apply to the data within the Window

OVER: ‘No Over clause, no Window’, meaning this clause is mandatory as it instantiates the Window!

PARTITION BY (optional): This is the same as GROUP BY, literally. It groups rows with similar values. Excluding it infers that the entire table should be considered as a single window/ group.

ORDER BY (optional): You know this one, it orders the data within individual windows.

Kindly hold your horses, this was just the introduction. Soon you’ll learn the intricacies that the above clauses hold!

Types of SQL Window Functions

All the windowing functions are condensed into 3 types, each one having its own set of unique functions!

Aggregate Window Functions
Ranking Window Functions
Value Window Functions

You need to know the difference between the above three functions, their capabilities, and usage, which I shall cover all!

This article will contain the most prominent Window Functions, while the rarely to never-used functions will be excluded!

Aggregate Window Functions

These functions simply aggregate data within the window and display the result across individual rows. The aggregate functions include SUM, AVG, COUNT, MIN, and MAX.

SUM()

The SUM() function adds the values of all rows of a column within a user-defined window. Let’s learn by example.

Question

Suppose you have a table called sales that contains information about sales transactions, including the order_id, product_id, sales_date, and sales_amount. You want to calculate the total sales amount for each product, along with the running total sales amount for each product, sorted by sales date. Using SQL, write a query to achieve this.

Solution

Let's first understand what the running total is.

Suppose you have a list of daily sales for a product:

- Day 1: $100

- Day 2: $150

- Day 3: $50

The running total for this data would be calculated as follows:

- Day 1: $100

- Day 2: (including Day 1): $100 + $150 = $250

- Day 3: (including Day 1 and Day 2): $100 + $150 + $50 = $400

Solution Code

SELECT

order_id,

product_id,

sales_date,

sales_amount,

SUM(sales_amount) OVER (PARTITION BY product_id ORDER BY sales_date) AS running_total_sales,

SUM(sales_amount) OVER (PARTITION BY product_id) AS total_sales

FROM sales

Code Breakdown

We use the SUM(sales_amount) function as a window function.

PARTITION BY product_id ensures that the sum is calculated separately for each product. It works exactly like the GROUP BY Clause.

ORDER BY sales_date is necessary since it sorts the rows by date in ascending order, i.e., from the 1st of the month unto the 30/31st of the month.

NOTE: How does a SQL Windows function work?

row1 is loaded into the window → Sum(row1) calculated → Result displayed.

row2 is loaded into the window (row1 still exists) → Sum(row1 + row2) calculated → Result displayed.

Similarly, all rows having the same product_id are loaded → (row1 + row2 + rowN) calculated → Results displayed.

The above steps repeat for all distinct product_ids.

Result

The running total can only be calculated using the ORDER BY Clause, while total_sales cannot be calculated using it.

AVG()

The AVG() function averages the values of all rows of a column within a user-defined window. Let’s learn by example.

Question

Write an SQL query to calculate the 30-day rolling average closing price for each stock symbol in the daily_stock_prices table. The result should include the stock symbol, price date, closing price, and the rolling average closing price.

Solution Code

SELECT

stock_symbol,

price_date,

closing_price,

AVG(closing_price) OVER (

PARTITION BY stock_symbol

ORDER BY price_date

) AS rolling_average

FROM daily_stock_prices;

Code Breakdown

AVG (closing_price): This part calculates the rolling average closing price using the AVG() function as a window function.

PARTITION BY stock_symbol: The PARTITION BY clause divides the data into partitions based on stock symbols, so the rolling average is calculated separately for each stock symbol.

ORDER BY price_date: The ORDER BY clause orders the data by price date to ensure it's processed in chronological order.

Result

One of the salient features of the rolling average is that it helps identify the outliers in the data. Suppose the closing_price on 06-01-2023 explodes, then the rolling_average on that day will also explode!

Ranking Window Functions

These Window functions are specially designed to rank the rows of a dataset. Any query that requires ranking based on sales, salary, purchases, etc. will utilize these functions.

You will grasp this topic once you go through a couple of examples.

ROW_NUMBER()

This function assigns a unique number to each row according to how the data is sorted! In other words, it assigns a row number to each row.

Question

Suppose you have a table named Employees with the following columns: EmployeeID, EmployeeName, and Salary. You want to retrieve a list of employees with their names, salaries, and a unique row number assigned to each employee based on their salary, in descending order of salary.

Solution Code

SELECT

EmployeeName,

Salary,

ROW_NUMBER() OVER (ORDER BY Salary DESC) AS RowNumber

FROM

Employees;

Solution Breakdown

ROW_NUMBER(): This function returns a unique number to each row!

ORDER BY Salary DESC:

This is an essential clause since it defines how the rows are to be ordered!
DESC first sorts the data!
Subsequently, rows are numbered from top to bottom.

Results

This is the simplest way to rank your dataset based on the column you pass in the ORDER BY clause. However, further down you’ll learn that there is one functionality that it lacks!

Can you guess it?

RANK()

The RANK() function ranks all rows within a window. It assigns the same rank to rows with the same score/ value, i.e. rows that have tied. Furthermore, in case of ties, subsequent ranks are skipped (There is an example ahead clearly explaining this point).

Question

Consider a table named orders with the following columns:

order_id: The unique identifier for each order.
customer_id: The ID of the customer who placed the order.
order_date: The date when the order was placed.

Write an SQL query to find the top 3 customers who have placed the most orders. If there are ties, assign the same rank to tied customers. Your query should return the customer_id, the count of orders they've placed, and their rank.

Solution Code

WITH CustomerOrderCounts AS (

SELECT

customer_id,

COUNT(*) AS order_count, MAX(order_date) AS latest_order_date

FROM Orders

GROUP BY customer_id

), RankedCustomers AS (

SELECT

customer_id,

order_count,

latest_order_date,

RANK () OVER (ORDER BY order_count DESC, latest_order_date DESC) AS rank

FROM CustomerOrderCounts

)

SELECT

customer_id,

order_count,

latest_order_date,

rank

FROM RankedCustomers

WHERE rank <= 3

Solution Breakdown

The RANK() window function is used to assign a rank to each row. In this case, it is used to rank customers based on the number of orders they've placed, with the highest order count receiving a lower rank, i.e., 1st.

It also assigns the same rank to rows with the same order count. In other words, if multiple customers have the same number of orders, they will share the same rank.

Result

NOTE: An important feature of RANK() is that in cases of ties, the following rank(s) is/ are skipped. Suppose 3 rows share the 3rd place, the following row will start from the 6th rank, meaning 2 Ranks are skipped.

Visualize:

NOTE:

If, N → No. of rows sharing the same place

Then, No. of ranks to skip → (N - 1)

Ex: 4 rows are tied on 3rd place.

No. of ranks to skip → 4-1 → 3

∴ The next rank → 7th place

NOTE: The ROW_NUMBER() function can also be used for ranking as we saw above, however, it lacks the functionality of handling ties!

DENSE_RANK()

This function is completely similar to the RANK() function explained above. The only difference lies in the ranking procedure. In cases of ties, DENSE_RANK() doesn’t skip the following rank.

YES, That’s it!!

Referencing the above example, the 2nd rank will be assigned after the 1st. Thus, no ranks are skipped.

Here's a task…. Solve the question in RANK() using DENSE_RANK()…

HINT: just interchange RANK()

Result:

You may be thinking… “Sure sir, I’ve grasped the difference between the two rankings due to my competency… however, when is it to be used??”

In my opinion, it’s a good practice to use DENSE_RANK().

Value Window Functions

These functions return the values of the next/ preceding row while the current row is being processed. This may be the most important function, since generally while performing analysis we are required to calculate differences between the current row and the next/ preceding row.

Let's learn by example!

LEAD()

This function returns the value of the next row for the current row being processed!

Syntax

SELECT

LEAD(col_name, offset, default) OVER (user_defined_Window)

FROM table_name

Offset: Consider it to be the index of the row starting from the row next to the current row. Default → 1.

NOTE: If the offset is set to 1, the next row will be chosen. If it's set to 2 the second row will be chosen and so on.

Default: The last row returned by LEAD() is always NULL (you’ll see why!). The value passed into this will be the replacement for NULL.

Question

A table named Sales exists with the following columns: OrderDate, ProductID, and Revenue.

Create a query that returns the ProductID, OrderDate, and the revenue difference (RevenueDiff) between the current order and the next order for each product. If there is no next order for a product, treat the RevenueDiff as 0.

Solution

WITH RankedSales AS (

SELECT

ProductID,

OrderDate,

Revenue,

LEAD(OrderDate) OVER (PARTITION BY ProductID ORDER BY OrderDate) AS NextOrderDate,

LEAD(Revenue) OVER (PARTITION BY ProductID ORDER BY OrderDate) AS NextRevenue

FROM Sales

)

SELECT

a.ProductID,

a.OrderDate,

CASE

WHEN b.ProductID IS NOT NULL THEN b.NextRevenue - a.Revenue

ELSE 0

END AS RevenueDiff

FROM

RankedSales a

LEFT JOIN

RankedSales b ON a.ProductID = b.ProductID AND

a.OrderDate = b.NextOrderDate

ORDER BY

a.ProductID, a.OrderDate;

Solution Breakdown

LEAD (Revenue,1,0): Retrieves the value of the next order's revenue for the same product.

1 → Return the value of the next row.
0 → Replace NULL with 0.

LEAD(OrderDate,1,0): Retrieves the value of the next order date for the same product.

1 → Return the value of the next row.
0 → Replace NULL with 0.

PARTITION BY: Divides the data into partitions based on the ProductID.

ORDER BY: Orders the dates in ascending order. Remember what will happen if this clause is omitted.

Result

The reason why the last row of every product_id group is 0 is that if it takes the value of the next row (i.e. from the next group), the Window function gets violated (PARTITION BY product_id)

LAG()

This function returns the value of the previous row for the current row being processed!

Syntax

SELECT

LAG(col_name, offset, default) OVER (user_defined_window)

FROM table_name

Offset: In this case, the nth previous row will be chosen according to the specified offset.

A task again…. Solve the question in LEAD() using LAG()…

HINT: just interchange LAG()

In this example too, the 1st row is 0 since it’s a violation to take a value from the previous group!

Conclusion

By now, I hope you have gained valuable insights into SQL Window Functions. Also, I hope you comprehended the fundamentals of these functions, including Aggregate, Ranking, and Value Window Functions. Besides, keep in mind the key distinctions between Window Functions and GroupBy, the syntax for their implementation, and their practical use cases.

All in all, Windowing is a powerful tool for performing comparative analysis in SQL!!

HAPPY CODING!!

Kockpit is here to help you

Business Form

Abdullah G. R Mastan

All You Need to Know About SQL Windows Function

Difference Between GroupBy and Window Functions

SQL Window Function Syntax Breakdown

Types of SQL Window Functions

Aggregate Window Functions

SUM()

AVG()

Ranking Window Functions

ROW_NUMBER()

RANK()

DENSE_RANK()

Value Window Functions

LEAD()

LAG()

Conclusion