SQL Explained: Common Table Expressions
Common Table Expressions (or CTEs as they are known) in SQL are temporary, named result sets containing intermediate data that are derived from another SQL query. Once you have data in a CTE, you can reference that data one or more times within the same query.
Given the above description, you might think that CTEs sound a lot like regular temporary tables in SQL and in some respects they are. So, why would you use a CTE rather than a temporary table? To answer this we need to take a look at two of the main disadvantages of temporary tables.
One is that temporary tables can contribute to more complex code, especially if they're used across different parts of a large SQL script. They need explicit creation, deletion and possibly indexes built on them, adding overhead to your SQL and session management.
Secondly, temporary tables consume physical storage which might be a consideration if space is tight and you have lots of them. Additionally, when looking at queries that use temporary tables it might not be clear what data is contained in the temporary table and where the data came from.
CTEs don't have the above issues. First off, they are ephemeral so that once the SQL session has ended the CTE goes out of scope and any memory it uses is freed up.
Secondly, you can see exactly what data is "in" a CTE. Its creation and population are right there in your SQL script.
Now, as well as their advantages, CTEs do have certain disadvantages and probably shouldn't be used in the following cases.
- You have to refer to the data that would be contained in a CTE more than once or twice in your query. This is because the CTE has to be re-populated each time it's referenced. However, if your CTE data volume is small, even this needn't be a bar from using one.
- Related to the above, CTEs cannot be indexed so will not be as performant as an indexed temporary table if the data volume is large.
So, to summarize, use CTEs in the following scenarios.
- You want to avoid using temporary tables
- The CTE data volume is relatively low
- You are referring to the CTE data in your query just once (maybe twice depending on the data volume)
One last advantage of CTEs I haven't mentioned yet is that many modern SQL dialects implement the ability to have recursive CTEs, i.e. CTEs that refer to themselves. Naturally, this means it's much easier to code recursive and hierarchical-based SQL queries. We'll see some examples of that later on.
Now that we have a fuller understanding of what CTEs are and do, it's time to take a look at some examples of their use.
NB: certain SQL operations shown below that include date and date arithmetic operations may not apply to all database SQL dialects. Check your SQL documentation if you're using a database other than Oracle
Setting up a test environment.
I use Oracle's live Sql website to run my tests. I explained how to get access to and use this service in a previous article I wrote on using Grouping Sets, Rollup and Cube in SQL. It's completely free to set up and use. The link to that article is below.
Sample table creation & population
For the non-recursive CTE examples, we'll use a table of customer transactions. Here are the table creation and data insert statements needed to re-create the input table and data.
CREATE TABLE transactions (
TransactionID INT PRIMARY KEY,
CustomerID INT,
Amount DECIMAL(10, 2),
TransactionDate DATE
);
INSERT INTO Transactions (TransactionID, CustomerID, Amount, TransactionDate)
VALUES (1, 1001, 150, TO_DATE('2021-01-01', 'YYYY-MM-DD'));
INSERT INTO Transactions (TransactionID, CustomerID, Amount, TransactionDate)
VALUES (2, 1002, 200, TO_DATE('2021-01-04', 'YYYY-MM-DD'));
INSERT INTO Transactions (TransactionID, CustomerID, Amount, TransactionDate)
VALUES (3, 1001, 100, TO_DATE('2021-01-04', 'YYYY-MM-DD'));
INSERT INTO Transactions (TransactionID, CustomerID, Amount, TransactionDate)
VALUES (4, 1003, 250, TO_DATE('2021-01-05', 'YYYY-MM-DD'));
INSERT INTO Transactions (TransactionID, CustomerID, Amount, TransactionDate)
VALUES (5, 1002, 300, TO_DATE('2021-01-05', 'YYYY-MM-DD'));
INSERT INTO Transactions (TransactionID, CustomerID, Amount, TransactionDate)
VALUES (6, 1003, 180, TO_DATE('2021-01-08', 'YYYY-MM-DD'));
INSERT INTO Transactions (TransactionID, CustomerID, Amount, TransactionDate)
VALUES (7, 1001, 190, TO_DATE('2021-01-08', 'YYYY-MM-DD'));
SELECT * FROM Transactions
TRANSACTIONID CUSTOMERID AMOUNT TRANSACTIONDATE
============= ================= ===============
1 1001 150 01-JAN-21
2 1002 200 04-JAN-21
3 1001 100 04-JAN-21
4 1003 250 05-JAN-21
5 1002 300 05-JAN-21
6 1003 180 08-JAN-21
7 1001 190 08-JAN-21
The syntax for a standard CTE is surprisingly simple, just,
WITH cte_name [(column_list)] AS (cte_query_definition) statement
Where:
cte_name
is the name given to the CTEcolumn_list
is an optional list of column names for the CTEcte_query_definition
is a query that defines the CTE's result setstatement
is a singleSELECT
,INSERT
,UPDATE
,DELETE
, orMERGE
statement that references the CTE
Test 1 – A simple CTE
Retrieve all customers who have a total spend of greater than $250
WITH CustomerTotals AS (
SELECT CustomerID, SUM(Amount) AS TotalSpent
FROM Transactions
GROUP BY CustomerID
)
SELECT CustomerID, TotalSpent
FROM CustomerTotals
WHERE TotalSpent > 250;
CUSTOMERID TOTALSPENT
========== ==========
1001 440
1002 500
1003 430
In this case, the non-CTE version of this query is also straightforward. Just this,
SELECT CustomerID, SUM(Amount) AS TotalSpent
FROM Transactions
GROUP BY CustomerID
HAVING SUM(Amount) > 250;
Test 2 – A more complex CTE
Where CTEs come into their own is if you have a query similar to this.
Retrieve customers whose average spend is greater than the average spend for all customers
Using CTEs, a query like this becomes relatively easy. The two aggregate values we require can naturally be split out into two individual CTEs and we then simply compare the results of both to determine what to retrieve.
WITH CustomerAverages AS (
SELECT CustomerID, AVG(Amount) AS AvgAmount
FROM Transactions
GROUP BY CustomerID
),
OverallAverage AS (
SELECT AVG(Amount) AS OverallAvg
FROM Transactions
)
SELECT a.CustomerID, a.AvgAmount
FROM CustomerAverages a, OverallAverage o
WHERE a.AvgAmount > o.OverallAvg;
CUSTOMERID AVGAMOUNT
========== =========
1002 250
1003 215
Compare this with the non-CTE version,
SELECT CustomerID, AVG(Amount) AS AvgAmount
FROM transactions
GROUP BY CustomerID
HAVING AVG(Amount) > (
SELECT AVG(sub.AvgAmount)
FROM (
SELECT AVG(Amount) AS AvgAmount
FROM transactions
GROUP BY CustomerID
) sub
);
To my mind, using a CTE makes it a lot more obvious what's going on and what the author of the SQL's intent is.
Test 3 – Recursive CTEs
Recursive CTEs are queries where one of the CTEs in a query refers to itself. Sharp-eyed readers may have noticed that it was possible to write our CTE example from above using a recursive CTE.
WITH CustomerAverages AS (
SELECT CustomerID, AVG(Amount) AS AvgAmount
FROM Transactions
GROUP BY CustomerID
),
OverallAverage AS (
SELECT AVG(AvgAmount) AS OverallAvg
FROM CustomerAverages
)
SELECT a.CustomerID, a.AvgAmount
FROM CustomerAverages a, OverallAverage o
WHERE a.AvgAmount > o.OverallAvg;
Since all the information we needed for the second CTE: OverallAverage was ** already contained in the first CTE: CustomerAverage**s, we could use the data in the first CTE in our calculations for the second CTE.
NB: In most SQL dialects, you specify that you're using a recursive CTE by putting the reserved keyword
recursive
in between thewith
clause and the CTE name, like this,
with recursive cte(col1, col2 etc...) as my_cte
...
...
Oracle assumes a CTE is recursive if it refers to itself in its query so we don't need the special recursive
keyword.
For our final example, we'll use a recursive CTE to fill in gaps in our original time-series data.
Referring back to our original table data, you'll notice that we are missing customer data entries for the 2nd & 3rd of January and the 6th & 7th of January.
Our task is to produce a report that shows the sum of the customer spend per day. Where there are no entries for a date, we return zero for the sum of the customer spend value for that date.
This is a good use case for recursive CTEs.
WITH DateRange (dt) AS (
SELECT MIN(TRANSACTIONDATE)
FROM Transactions
UNION ALL
SELECT dt + INTERVAL '1' DAY
FROM DateRange WHERE dt < (SELECT MAX(TRANSACTIONDATE) FROM Transactions)
),
AggregatedData AS (
SELECT TRANSACTIONDATE, SUM(AMOUNT) AS TOTAL_SPEND
FROM Transactions
GROUP BY TRANSACTIONDATE
)
SELECT dr.dt AS TRANSACTIONDATE,
NVL(ad.TOTAL_AMOUNT, 0) AS TOTAL_SPEND
FROM DateRange dr
LEFT JOIN AggregatedData ad ON dr.dt = ad.TRANSACTIONDATE
ORDER BY dr.dt;
TRANSACTIONDATE TOTAL_SPEND
=============== ===========
01-JAN-21 150
02-JAN-21 0
03-JAN-21 0
04-JAN-21 300
05-JAN-21 550
06-JAN-21 0
07-JAN-21 0
08-JAN-21 370
- The DateRange CTE: This part creates a continuous date range from the earliest to the latest transaction date ensuring there are no date gaps.
- The AggregatedData CTE: Sums up the transactions for each date.
- The Final SELECT: Joins the date range with the aggregated transaction data, using an
NVL
function to replaceNULL
values with 0, ensuring days without transactions show a total of 0.
Summary
In conclusion, Common Table Expressions (CTEs) are a versatile and powerful feature in SQL that can enhance the readability, maintainability, and execution of complex queries. Through this article, I've explored the basic structure and functionality of CTEs, looking into both non-recursive and recursive types to demonstrate their utility in a variety of scenarios.
The three test cases I included showed not only how CTEs simplify the management of complex SQL operations by breaking them down into more manageable parts but also showcased their strength in handling recursive data.
I would also argue that the use of CTEs allows for cleaner and more organized SQL scripts, making it easier for developers and analysts to write, debug, and optimize their database queries.
If you're still unsure about CTEs, think about how you would solve the problem shown in my Test 3 example without using them. I think you'll find that traditional SQL approaches to a problem like this are more cumbersome and less efficient.
_OK, that's all for me just now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content._
I know times are tough and wallets constrained, but if you got real value from this article, please consider buying me a wee dram.
If you liked this content, I think you'll find these articles interesting too.