SQL Explained: Ranking Analytics
If you have experience with one of the popular RDBMS management systems like Oracle, SQL Server, Postgres etc., you might at some stage have come across analytic, sometimes referred to as windowing, functions.
Analytics allow you to calculate aggregates and ranking sequences for groups of rows within your dataset. If you've wondered whether or not it's worthwhile investing some time in getting to know and use analytic functions, I can tell you that the answer to that is an emphatic Yes. They are extremely useful and allow SQL to do things that are difficult if not impossible without them.
In this article, we'll concentrate on a specific category of analytics known as ranking functions by examining four of SQL's most common ranking techniques. We'll explain what they are and provide examples of how they are used.
Ranking analytics syntax
The general format of a ranking analytic function in most modern SQL dialects is,
rank | dense_rank | row_number() | ntile over(partition_by_clause order_by_clause windowing_clause)
Let's briefly look at each part of the above statement in turn.
RANK | DENSE_RANK| ROW_NUMBER() | NTILE
These are the names of the ranking functions and most SQLs support four functions as shown. We'll talk more about each of the individual ranking functions shortly.
OVER(…)
The over
keyword is mandatory and indicates the start of the grouping definition i.e. which rows in the data table are you ranking.
PARTITION_BY_CLAUSE
The partition_by_clause
is optional and contains the names of the columns (one or more) in your dataset that you want to group on. If omitted, SQL considers all records in the table when running the ranking function.
ORDER_BY_CLAUSE
This order_by_clause
specifies how the columns that make up the grouping are to be ordered before being ranked.
WINDOWING_CLAUSE
Within any particular grouping, the windowing_clause
defines a range of records we want the ranking function to operate on with respect to the current row.
There are two different types of windowing,
- a row window i.e. one or more physical rows to look back and/or forward from the current record
- a range window which defines a value to subtract from or add to the current row's value to define a range of rows.
Common windowing frames are:
- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (default value if none given)
- ROWS BETWEEN X PRECEDING AND CURRENT ROW
- ROWS BETWEEN CURRENT ROW AND X FOLLOWING
- RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
I added this explanation of the WINDOWING_CLAUSE for completeness but to be honest, it's unlikely you'll need to use it in your day-to-day job very often. 99% of the time, the default value is all you'll need. In fact, typically, the ranking analytic functions don't use the windowing_clause at all.
Ranking Analytics are kind of hard to explain just using words so the best way forward is to show you some examples of their use.
Setting up a test environment.
I use Oracle's live Sql website to run my tests. I explained how to get access to it and use this service in a previous article I wrote on using Grouping Sets, Rollup and Cube in SQL. It's completely free to set up and use. The link to that article is below.
Sample table creation & population
We only need one table for our examples and our data set is based on students and the test scores in the subject they're studying. The table consists of a student_id, student_name, subject and score.
CREATE TABLE student_scores (
student_id INT,
student_name VARCHAR(50),
subject VARCHAR(50),
score INT
);
INSERT INTO student_scores VALUES (1, 'Alice', 'Math', 95);
INSERT INTO student_scores VALUES (2, 'Bob', 'Math', 85);
INSERT INTO student_scores VALUES (3, 'Charlie', 'Math', 90);
INSERT INTO student_scores VALUES (4, 'David', 'Math', 80);
INSERT INTO student_scores VALUES (5, 'Eva', 'Math', 70);
INSERT INTO student_scores VALUES (6, 'Frank', 'Science', 88);
INSERT INTO student_scores VALUES (7, 'Grace', 'Science', 92);
INSERT INTO student_scores VALUES (8, 'Hannah', 'Science', 85);
INSERT INTO student_scores VALUES (9, 'Ivy', 'Science', 90);
INSERT INTO student_scores VALUES (10, 'Jack', 'Science', 82);
INSERT INTO student_scores VALUES (11, 'Kate', 'History', 78);
INSERT INTO student_scores VALUES (12, 'Leo', 'History', 88);
INSERT INTO student_scores VALUES (13, 'Mia', 'History', 84);
INSERT INTO student_scores VALUES (14, 'Nina', 'History', 90);
INSERT INTO student_scores VALUES (15, 'Oscar', 'History', 92);
select * from student_scores;
+------------+--------------+---------+-------+
| student_id | student_name | subject | score |
+------------+--------------+---------+-------+
| 1 | Alice | Math | 95 |
| 2 | Bob | Math | 85 |
| 3 | Charlie | Math | 90 |
| 4 | David | Math | 80 |
| 5 | Eva | Math | 70 |
| 6 | Frank | Science | 88 |
| 7 | Grace | Science | 92 |
| 8 | Hannah | Science | 85 |
| 9 | Ivy | Science | 90 |
| 10 | Jack | Science | 82 |
| 11 | Kate | History | 78 |
| 12 | Leo | History | 88 |
| 13 | Mia | History | 84 |
| 14 | Nina | History | 90 |
| 15 | Oscar | History | 92 |
+------------+--------------+---------+-------+
15 rows selected.
Now that we have our data, we can showcase the ranking functions.
RANK
The rank
function allows you to assign a sequential integer number to rows in your table, but you should be aware that by using rank,
sequence numbers are not necessarily consecutive.
The best way I like to think of this is to imagine runners in an Olympic sprint. If there is a dead heat for first place between two athletes, they are both assigned first place and win gold. The next athlete home, though, is not assigned the second position, they will be awarded the third position (i.e. the bronze medal).
This is exactly how rank apportions its sequence numbers, as can be seen clearly in the example below.
select subject, student_name, rank() over(order by subject) rnk
from student_scores;
+---------+--------------+-----+
| subject | student_name | rnk |
+---------+--------------+-----+
| History | Kate | 1 |
| History | Leo | 1 |
| History | Mia | 1 |
| History | Nina | 1 |
| History | Oscar | 1 |
| Math | Alice | 6 |
| Math | Bob | 6 |
| Math | Charlie | 6 |
| Math | David | 6 |
| Math | Eva | 6 |
| Science | Frank | 11 |
| Science | Grace | 11 |
| Science | Hannah | 11 |
| Science | Ivy | 11 |
| Science | Jack | 11 |
+---------+--------------+-----+
15 rows selected.
For a more realistic use case, we can bring out the scores for each subject ranked by the highest score in that subject.
SELECT student_name, subject, score,
RANK() OVER (PARTITION BY subject ORDER BY score DESC) AS rank
FROM student_scores;
+--------------+---------+-------+------+
| student_name | subject | score | rank |
+--------------+---------+-------+------+
| Alice | Math | 95 | 1 |
| Charlie | Math | 90 | 2 |
| Bob | Math | 85 | 3 |
| David | Math | 80 | 4 |
| Eva | Math | 70 | 5 |
| Oscar | History | 92 | 1 |
| Nina | History | 90 | 2 |
| Leo | History | 88 | 3 |
| Mia | History | 84 | 4 |
| Kate | History | 78 | 5 |
| Grace | Science | 92 | 1 |
| Ivy | Science | 90 | 2 |
| Frank | Science | 88 | 3 |
| Hannah | Science | 85 | 4 |
| Jack | Science | 82 | 5 |
+--------------+---------+-------+------+
15 rows selected.
Notice that the rank resets at the start of each subject grouping. Using the above data set as our input, it's easy to tweak it just to bring out the individual highest score within each subject.
select * from
(
SELECT student_name, subject, score,
RANK() OVER (PARTITION BY subject ORDER BY score DESC) AS rank
FROM student_scores
)
where rank = 1
+--------------+---------+-------+------+
| student_name | subject | score | rank |
+--------------+---------+-------+------+
| Alice | Math | 95 | 1 |
| Oscar | History | 92 | 1 |
| Grace | Science | 92 | 1 |
+--------------+---------+-------+------+
3 rows selected.
DENSE_RANK
The dense_rank
function is similar to the rank function in that it also allows you to assign a sequential number to rows. The difference is that denserank guarantees_ number sequences with no gaps.
Continuing our Olympic race analogy, under dense_rank conditions, the two dead heaters would still win gold but the next person home would be deemed to be in second place and get a silver medal.
You can see clearly what I mean if we substitute rank with dense_rank in our first ranking SQL above. We get this output,
select subject, student_name,dense_rank() over(order by subject) rnk
from student_scores;
+---------+--------------+-----+
| subject | student_name | rnk |
+---------+--------------+-----+
| History | Kate | 1 |
| History | Leo | 1 |
| History | Mia | 1 |
| History | Nina | 1 |
| History | Oscar | 1 |
| Math | Alice | 2 |
| Math | Bob | 2 |
| Math | Charlie | 2 |
| Math | David | 2 |
| Math | Eva | 2 |
| Science | Frank | 3 |
| Science | Grace | 3 |
| Science | Hannah | 3 |
| Science | Ivy | 3 |
| Science | Jack | 3 |
+---------+--------------+-----+
15 rows selected.
There's not much more to say about denserank. Just use it in place of the rank analytic if you absolutely must_ have gap-free ranking sequences.
ROW_NUMBER
The row_number analytic assigns each row within a partition a unique integer value. On the face of it, this sounds like the other two ranking functions but the key difference is that records that would otherwise be "ties" within a partition are assigned different ranks, and each rank within a partition is unique to that partition and also gap-free.
Again, using our Olympic race analogy, only one of the dead heaters would win gold, the other would win silver and the next person home would win bronze. Here's an example.
SELECT student_name, subject, score,
ROW_NUMBER() OVER (ORDER BY score DESC) AS rn
FROM student_scores;
+--------------+---------+-------+----+
| student_name | subject | score | rn |
+--------------+---------+-------+----+
| Alice | Math | 95 | 1 |
| Grace | Science | 92 | 2 |
| Oscar | History | 92 | 3 |
| Ivy | Science | 90 | 4 |
| Nina | History | 90 | 5 |
| Charlie | Math | 90 | 6 |
| Frank | Science | 88 | 7 |
| Leo | History | 88 | 8 |
| Bob | Math | 85 | 9 |
| Hannah | Science | 85 | 10 |
| Mia | History | 84 | 11 |
| Jack | Science | 82 | 12 |
| David | Math | 80 | 13 |
| Kate | History | 78 | 14 |
| Eva | Math | 70 | 15 |
+--------------+---------+-------+----+
15 rows selected.
You can see that Grace has been ranked above Oscar even though they both scored the same result in their subject. And because their sort order by score is non-deterministic, it also means that their ranking is non-deterministic. If you ran the above same query again there is a chance their order, thus their rank values would be reversed.
Note, because we haven't specified a partition within the over() clause, the ranking is applied to the whole data set so record numbering starts at 1 and increments by 1 until all records have been processed.
In our final example with row_number(), we'll show how using a partition clause with it can be extremely useful if you need to de-duplicate a table of data. First of all, we need to create a bit more data to play with. I'm simply going to insert a duplicate row for three of the students, Kate, Alice and Oscar.
insert into student_scores select * from student_scores
where student_name in ('Alice','Kate','Oscar');
select *
from student_scores;
+------------+--------------+---------+-------+
| student_id | student_name | subject | score |
+------------+--------------+---------+-------+
| 1 | Alice | Math | 95 |
| 1 | Alice | Math | 95 |
| 2 | Bob | Math | 85 |
| 3 | Charlie | Math | 90 |
| 4 | David | Math | 80 |
| 5 | Eva | Math | 70 |
| 6 | Frank | Science | 88 |
| 7 | Grace | Science | 92 |
| 8 | Hannah | Science | 85 |
| 9 | Ivy | Science | 90 |
| 10 | Jack | Science | 82 |
| 11 | Kate | History | 78 |
| 11 | Kate | History | 78 |
| 12 | Leo | History | 88 |
| 13 | Mia | History | 84 |
| 14 | Nina | History | 90 |
| 15 | Oscar | History | 92 |
| 15 | Oscar | History | 92 |
+------------+--------------+---------+-------+
18 rows selected.
Now, by using rownumber() to rank over the student name and choosing any that are ranked > 1, we effectively identify the duplicate records_ in our data table. Here's the SQL to do that.
WITH RankedScores AS (
SELECT student_id, student_name, subject, score,
ROW_NUMBER() OVER (PARTITION BY student_id,
student_name, subject, score ORDER BY student_id) AS rn
FROM student_scores
)
SELECT student_id, student_name, subject, score
FROM RankedScores
WHERE rn > 1;
+------------+--------------+---------+-------+
| student_id | student_name | subject | score |
+------------+--------------+---------+-------+
| 1 | Alice | Math | 95 |
| 11 | Kate | History | 78 |
| 15 | Oscar | History | 92 |
+------------+--------------+---------+-------+
3 rows selected.
We can use these identified records to de-duplicate our table. Here's an example of how to use them to do an in-place delete that will restore our table to its original data set.
DELETE FROM student_scores
WHERE rowid IN (
SELECT rid
FROM (
SELECT rowid AS rid,
ROW_NUMBER() OVER (PARTITION BY student_id,
student_name, subject, score ORDER BY student_id) AS rn
FROM student_scores
)
WHERE rn > 1
);
3 row(s) deleted.
SELECT * FROM student_scores;
+------------+--------------+---------+-------+
| student_id | student_name | subject | score |
+------------+--------------+---------+-------+
| 1 | Alice | Math | 95 |
| 2 | Bob | Math | 85 |
| 3 | Charlie | Math | 90 |
| 4 | David | Math | 80 |
| 5 | Eva | Math | 70 |
| 6 | Frank | Science | 88 |
| 7 | Grace | Science | 92 |
| 8 | Hannah | Science | 85 |
| 9 | Ivy | Science | 90 |
| 10 | Jack | Science | 82 |
| 11 | Kate | History | 78 |
| 12 | Leo | History | 88 |
| 13 | Mia | History | 84 |
| 14 | Nina | History | 90 |
| 15 | Oscar | History | 92 |
+------------+--------------+---------+-------+
15 rows selected.
NTILE
The ntile
function allows you to divide a data set up into approximately equal-sized record groups. These groups are called "tiles" and all entries within the same tile are assigned the same ranking.
In the example below we want to subdivide our table into 4 groupings based on the student scores. In other words, each grouping (or tile) should have broadly similar scores within the confines of the actual data set.
The number of records within each grouping is not known in advance. As you can see we get the requested number of tiles in our output, but two of them have 3 records, and the other two contain 4 and 5 records respectively. All that SQL will guarantee is that, if possible, it will return the number of tiles requested.
SELECT
student_name,
subject,
score,
NTILE(4) OVER (ORDER BY score) AS tile_rank
FROM
student_scores;
+--------------+---------+-------+------------+
| student_name | subject | score | tile_rank |
+--------------+---------+-------+------------+
| Eva | Math | 70 | 1 |
| Kate | History | 78 | 1 |
| David | Math | 80 | 1 |
| Jack | Science | 82 | 2 |
| Mia | History | 84 | 2 |
| Bob | Math | 85 | 2 |
| Hannah | Science | 85 | 2 |
| Frank | Science | 88 | 3 |
| Leo | History | 88 | 3 |
| Ivy | Science | 90 | 3 |
| Nina | History | 90 | 3 |
| Charlie | Math | 90 | 3 |
| Grace | Science | 92 | 4 |
| Oscar | History | 92 | 4 |
| Alice | Math | 95 | 4 |
+--------------+---------+-------+------------+
15 rows selected
NTILE(4)
This specifies that you want to divide the rows into 4 tiles or groups.OVER (ORDER BY score)
Since we haven't specified a partition clause, that indicates to SQL that we want to consider all records in the table for our dataset, then within those records, we order them by thescore
column before dividing them into the four tiles.
Summary
In conclusion, I've highlighted the four most common SQL ranking functions in modern SQL systems and shown examples of their use. They all do a similar task in that they assign ranking values to table records based on their relative order, with RANK, DENSE_RANK, and ROW_NUMBER providing different ways to handle ties and assign sequence numbers. Meanwhile, NTILE comes into its own if you need to perform statistical or segmentation analysis on your data. I would urge you to try and use these functions as much as possible in your daily job as they are powerful tools for data analysis, manipulation and reporting.
_OK, that's all for me just now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content._
I know times are tough and wallets constrained, but if you got real value from this article, please consider buying me a wee dram.
If you liked this content, I think you'll find these articles interesting too.
SQL Explained: Grouping Sets, Rollup, and Cube
Build a MVP version of Google Translate with Groq and Llama3