SQL Optimizations - get the most out of them!
In this blog, we will look at the SQL order of execution, query-level optimizations, index optimizations, readability improvements using CTEs in place of subqueries, and more. Always consider how often a query runs and optimize to match that requirement.
SQL Order of Execution
In SQL, the "order of execution" is the order in which the clauses of a query are evaluated, which often differs from the order in which the query is written. Understanding this order is useful when debugging and optimizing queries. Clauses are evaluated as follows:
1. FROM
2. JOIN / ON
3. WHERE - filters at row level; column aliases and aggregate functions are not available here. Where possible, this filter (along with the SELECT column list) can be moved earlier, toward the data source - this is called predicate pushdown.
4. GROUP BY - often used with aggregate functions; can use column aliases.
5. HAVING - applies conditions and/or filters after aggregation takes place; does not allow column aliases. If it contains no aggregate function, a HAVING condition can be pushed down to the storage layer and executed as a WHERE filter.
6. WINDOW functions
7. SELECT - any functions are calculated; column aliases become available here.
8. DISTINCT
9. UNION
10. ORDER BY - sorts the resulting rows after filtering and aggregation; column aliases are allowed.
11. LIMIT / FETCH / OFFSET
A seek (reading only what is required) is faster than a scan (reading the entire table). When the WHERE clause filters on the clustered index column, the records are in physical order and the engine can perform a "clustered index seek" as the logical operation. This can be seen in the Explain Query Plan. If the WHERE clause filters on a non-indexed column, the logical operation is a "clustered index scan". Indexed column => seek; non-indexed column => scan.
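To make the seek-vs-scan difference concrete, here is a minimal sketch using SQLite through Python's sqlite3 (the table and index names are made up for illustration). SQLite reports a seek as SEARCH and a full read as SCAN in its EXPLAIN QUERY PLAN output, the equivalent of the clustered index seek/scan described above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
con.execute("CREATE INDEX idx_customer ON orders (customer)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(i, f"cust{i % 100}") for i in range(1, 1001)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows end with a text description of each step
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Filtering on the indexed column allows a seek (SEARCH)
seek = plan("SELECT * FROM orders WHERE customer = 'cust7'")
# Wrapping the column in an expression defeats the index: full scan (SCAN)
scan = plan("SELECT * FROM orders WHERE customer || '' = 'cust7'")

print(seek)
print(scan)
```

Running this prints a SEARCH step using idx_customer for the first query and a plain SCAN of the table for the second.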
SELECT TOP 3 … ORDER BY … → in this case, the engine can apply a Top N Sort instead of sorting all records and then taking the top ones.
HAVING count(*) > 100 may be executed as filter-then-sort, while HAVING count(*) > 10 may be executed as sort-then-filter - the query optimizer decides internally.
In the execution plan, try to reduce the steps that carry the most cost.
set showplan_all on lets you see the plan as text (SQL Server).
Clustered index vs Non-clustered index: We cannot have two clustered indexes in a single table. We can have a composite clustered index instead with more than one column in the clustered index. Clustered index – data sorted on index. We can have more than one non-clustered index on a table - the process to get a record is two steps - first get the address from the non-clustered index and then get the record from the table. Non-clustered index requires extra storage. Clustered index is created along with the primary key, but we can have a clustered index without a primary key also.
Let us look at some Query Level Optimizations:
Choose datatypes wisely: using VARCHAR(10) to store fixed-length data (like a mobile number) can incur around a 20% higher query-execution penalty compared to CHAR(10).
Avoid implicit conversion: when a query compares values of different datatypes, an implicit datatype conversion is applied. Avoid comparing columns and values of different datatypes, such as: WHERE date >= '2022-01-01' (a string literal compared against a date column).
Avoid using Function-based conditional clause: When a query uses a function in the WHERE or JOIN clause on the column, it would not utilize the index, hence slowing down the execution.
-- WHERE date(ship_date) = '2022-01-01'
-- JOIN T2 ON CONCAT(first_name,' ',last_name) = 'jane doe'
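The usual fix for the date case is to rewrite the filter as a range predicate on the bare column so the index stays usable. A small sketch with SQLite via Python's sqlite3 (a hypothetical shipments table) showing that the function-wrapped filter scans while the equivalent range filter seeks:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shipments (id INTEGER PRIMARY KEY, ship_date TEXT)")
con.execute("CREATE INDEX idx_ship_date ON shipments (ship_date)")
con.executemany("INSERT INTO shipments (ship_date) VALUES (?)",
                [("2022-01-01 10:00:00",), ("2022-01-02 09:30:00",)])

def plan(sql):
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Function on the column: the index cannot be used, so the table is scanned
slow = plan("SELECT * FROM shipments WHERE date(ship_date) = '2022-01-01'")
# Equivalent sargable range filter: the index is used for a seek
fast = plan("SELECT * FROM shipments WHERE ship_date >= '2022-01-01' "
            "AND ship_date < '2022-01-02'")
```

Both predicates select the same day's rows, but only the range form lets the optimizer seek on idx_ship_date.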
Avoid using DISTINCT and GROUP BY at the same time: If you already have GROUP BY in your query, there is no need of having DISTINCT separately.
Avoid using UNION DISTINCT and SELECT DISTINCT at the same time: For queries with UNION DISTINCT, it removes duplicate records natively, and hence there is no need of using SELECT DISTINCT.
Don’t use SELECT * ever: selecting unnecessary columns wastes memory and CPU cycles. It is always better to list the column names you need instead of * or extra columns.
Avoid sub-query where possible: Sub-query creates temp tables to store data and sometimes it creates temp tables on the disk thereby slowing the query execution. Prefer WITH clause over nested sub-queries.
Avoid using Order by in the Sub-query: Ordering in sub-queries is mostly redundant and leads to significant performance issues.
Don’t GROUP numbers: Avoid grouping by columns of type DOUBLE or FLOAT, as this may lead to unexpected behaviour due to rounding & precision issues.
Minimize use of SELF Joins: Self-joins are computationally more expensive and can be replaced with the Window function in many cases.
Don’t join tables with OR conditions: It can be optimized by using UNION ALL in place of OR-based joins.
Avoid joining with the not equal condition: When a query joins with a NOT EQUAL operator it searches all rows and uses a full table scan which is highly inefficient.
Avoid leading-wildcard searches: when a query searches for keywords with a wildcard at the beginning, it cannot use an index, and the database must search all records for a match anywhere within the selected field. If wildcards are needed, prefer them at the end of a phrase only.
SELECT user_name FROM test WHERE user_name LIKE '%abc%'
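A sketch of the same effect in SQLite via Python's sqlite3 (made-up data; PRAGMA case_sensitive_like is enabled so SQLite's LIKE prefix optimization can apply): the leading-wildcard pattern forces a scan, while a trailing-only wildcard allows an index seek:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Required so SQLite can turn a prefix LIKE into an index range seek
con.execute("PRAGMA case_sensitive_like = ON")
con.execute("CREATE TABLE test (user_name TEXT)")
con.execute("CREATE INDEX idx_user_name ON test (user_name)")
con.executemany("INSERT INTO test VALUES (?)", [("abcdef",), ("xyzabc",)])

def plan(sql):
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Leading wildcard: every row must be examined
scan = plan("SELECT user_name FROM test WHERE user_name LIKE '%abc%'")
# Trailing wildcard only: rewritten internally to an index range seek
seek = plan("SELECT user_name FROM test WHERE user_name LIKE 'abc%'")
```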
Use WHERE instead of HAVING: prefer WHERE over HAVING where possible, since HAVING conditions are evaluated after the WHERE stage.
IN versus EXISTS: the IN operator is costlier than EXISTS in terms of scans, especially when the sub-query returns a large data set.
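A small illustration of the IN-to-EXISTS rewrite, using SQLite via Python's sqlite3 with made-up customers/orders tables; both forms return the same rows, and the EXISTS form lets the engine stop probing as soon as one match is found:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# IN: materializes the sub-query result, then probes it
in_rows = con.execute(
    "SELECT name FROM customers "
    "WHERE id IN (SELECT customer_id FROM orders)").fetchall()

# EXISTS: correlated check that can short-circuit on the first match
exists_rows = con.execute(
    "SELECT name FROM customers c "
    "WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)").fetchall()
```

Both queries return the customers with at least one order (Ann and Cid).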
Let's look at some code snippets that enhance SQL Performance:
Use regexp_like to replace chains of LIKE clauses.
Instead of:
SELECT * FROM table1 WHERE
lower(item_name) LIKE '%samsung%' OR
lower(item_name) LIKE '%xiaomi%' OR
lower(item_name) LIKE '%iphone%' OR
lower(item_name) LIKE '%huawei%' -- and so on
write:
SELECT * FROM table1 WHERE
regexp_like(lower(item_name), 'samsung|xiaomi|iphone|huawei') -- and so on
Use regexp_extract to replace case-when-like chains.
Instead of:
SELECT CASE
WHEN concat(' ', item_name, ' ') LIKE '%acer%' THEN 'Acer'
WHEN concat(' ', item_name, ' ') LIKE '%advance%' THEN 'Advance'
WHEN concat(' ', item_name, ' ') LIKE '%alfalink%' THEN 'Alfalink'
END AS brand FROM item_list
write (adjusting capitalization as needed):
SELECT regexp_extract(lower(item_name), 'acer|advance|alfalink') AS brand FROM item_list
Convert a long list of IN clauses into a temporary table.
Instead of:
SELECT * FROM table1 t1 WHERE itemid IN (222, 333, 444, 555, …, 999)
write (Presto/Trino-style syntax):
SELECT * FROM table1 t1
JOIN (
SELECT itemid FROM
(SELECT split('222,333,444,555,…,999', ',') AS bar)
CROSS JOIN UNNEST(bar) AS t (itemid)
) t2
ON t1.itemid = t2.itemid
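The same idea can be sketched in Python with SQLite: load the id list into a temporary table once, then join against it instead of inlining a long IN list (table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (itemid INTEGER PRIMARY KEY, price REAL)")
con.executemany("INSERT INTO table1 VALUES (?, ?)",
                [(i, i * 1.0) for i in range(100, 1000)])

wanted = [222, 333, 444, 555, 999]

# Long IN list, built inline with placeholders
q = "SELECT itemid FROM table1 WHERE itemid IN (%s)" % ",".join("?" * len(wanted))
in_rows = con.execute(q, wanted).fetchall()

# Same filter via a temporary table join: the id list is loaded once,
# indexed by its primary key, and reusable across queries
con.execute("CREATE TEMP TABLE wanted_ids (itemid INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO wanted_ids VALUES (?)", [(i,) for i in wanted])
join_rows = con.execute(
    "SELECT t1.itemid FROM table1 t1 "
    "JOIN wanted_ids w ON t1.itemid = w.itemid").fetchall()
```

Both approaches return the same rows; the temp-table form scales better as the list grows.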
Always order your JOINs from largest tables to smallest tables.
Instead of:
SELECT * FROM small_table JOIN large_table
ON small_table.id = large_table.id
write:
SELECT * FROM large_table JOIN small_table
ON small_table.id = large_table.id
Use simple equi-joins. Example: two tables join on a date string (e.g. '2022-09-01'), but one of the tables only has separate year, month, and day columns.
Instead of:
SELECT * FROM Table1 a JOIN Table2 b
ON a.date = CONCAT(b.year, '-', b.month, '-', b.day)
write:
SELECT * FROM Table1 a JOIN
(SELECT name, CONCAT(year, '-', month, '-', day) AS date FROM Table2) b
ON a.date = b.date
Always GROUP BY the attribute/column with the largest number of unique entries/values first.
Instead of:
SELECT main_category, sub_category, itemid, sum(price) FROM Table1
GROUP BY main_category, sub_category, itemid
write:
SELECT main_category, sub_category, itemid, sum(price) FROM Table1
GROUP BY itemid, sub_category, main_category
Avoid subqueries in the WHERE clause.
Instead of:
SELECT sum(price) FROM Table1 WHERE itemid IN
(SELECT itemid FROM Table2)
write:
WITH t2 AS (SELECT itemid FROM Table2)
SELECT sum(price) FROM Table1 t1 JOIN t2 ON t1.itemid = t2.itemid
Use MAX instead of RANK.
Instead of:
SELECT userid, prdate FROM
(SELECT userid, prdate,
rank() OVER (PARTITION BY userid ORDER BY prdate DESC) AS ranking
FROM Table1)
WHERE ranking = 1
write:
SELECT userid, max(prdate) FROM Table1 GROUP BY 1
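The rewrite above can be checked end to end with SQLite via Python's sqlite3 (window functions need SQLite 3.25+; the data is made up): the ranked version and the MAX/GROUP BY version return the same latest date per user, but the aggregate avoids ranking every row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (userid INTEGER, prdate TEXT)")
con.executemany("INSERT INTO visits VALUES (?, ?)", [
    (1, "2022-01-05"), (1, "2022-03-01"),
    (2, "2022-02-10"), (2, "2022-02-01"),
])

# Window-function version: rank every row per user, keep rank 1
ranked = con.execute("""
    SELECT userid, prdate FROM (
        SELECT userid, prdate,
               rank() OVER (PARTITION BY userid ORDER BY prdate DESC) AS rnk
        FROM visits)
    WHERE rnk = 1
    ORDER BY userid""").fetchall()

# Cheaper aggregate version: one MAX per group
agg = con.execute(
    "SELECT userid, max(prdate) FROM visits "
    "GROUP BY userid ORDER BY userid").fetchall()
```

Note the window version needs PARTITION BY userid to match the per-user MAX; without it, rank 1 would keep only the single latest row overall.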
Use approx_distinct() instead of count(distinct) for very large datasets.
Use approx_percentile(metric, 0.5) for the median.
Avoid UNION where possible: use UNION ALL for better performance, since it skips the duplicate-elimination step.
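A quick demonstration with SQLite via Python's sqlite3: UNION pays for a duplicate-elimination step and returns distinct rows, while UNION ALL simply concatenates the inputs:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (v INTEGER);
CREATE TABLE b (v INTEGER);
INSERT INTO a VALUES (1), (2);
INSERT INTO b VALUES (2), (3);
""")

# UNION deduplicates: the overlapping value 2 appears once
union_rows = con.execute("SELECT v FROM a UNION SELECT v FROM b").fetchall()
# UNION ALL keeps everything: 4 rows, including the duplicate 2
union_all_rows = con.execute("SELECT v FROM a UNION ALL SELECT v FROM b").fetchall()
```

If you know the inputs are already disjoint, UNION ALL gives the same result while skipping the extra sort/hash work.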
Use WITH statements vs nested subqueries
Select only the columns you need - Use Column Names Instead of * in a SELECT Statement.
Filter table as you read data from the first time. This will reduce the amount of data to be processed. Better would be to filter even the partitioned column early in the query.
Avoid including a HAVING clause in SELECT statements
Eliminate Unnecessary DISTINCT Conditions
Consider using an IN predicate when querying an indexed column
Use EXISTS instead of DISTINCT when using table joins
Try to use UNION ALL in place of UNION
Avoid using OR in join conditions
Avoid functions on the right-hand side of the operator
Remove any redundant mathematics
SQL Optimization Dictionary using the Execution Plan: statistics time and I/O, spooling, Hash Match, Key Lookup, subquery minimization, partition elimination, SARGable queries, index seek, and bad views.
When reading the execution plan, focus on reducing logical reads even more than physical reads. Logical reads come from memory; physical reads come from disk. Also try to minimize the scan count for the query where possible.
Check I/O statistics, and in some cases time statistics, using "set statistics io, time on".
First look for spooling → the temporary saving of intermediate data to make computation easier. A Lazy Spool comes into play when there is duplicate aggregation; try optimizing the query that runs before the lazy spool occurs. Spooling is not always bad: in a recursive CTE, the lazy spool cannot be avoided, so ignore it there.
Next look for Hash Match → a sign of unsorted data. Grouping also performs some sorting, so when a query uses both GROUP BY and ORDER BY, sorting happens twice. For a Hash Match, try to use indexes, and if indexes already exist, investigate why they are not being used. A hash match also appears when a function is applied to the join column; try to reduce the cost of the Hash Match operation.
Key Lookup → a column is missing from the index. The solution is to add the missing column to the index. However, when the suggestion is to add many columns, fixing the key lookup may not help: a wide index can degrade performance for updates, inserts, etc.
Bad Views → Views are often used to simplify queries and increase reusability. But sometimes they end up performing badly. E.g. just to get one more column to add to the view results, costs more. Solution - create a new view or alter an existing view.
Unnecessary subqueries → instead of writing a separate subquery to get each column, use CROSS APPLY: for each record from the left query, the right subquery is applied. CROSS APPLY returns only matching (intersection) records; to keep unmatched rows, use OUTER APPLY, which behaves like a LEFT JOIN.
Partition elimination is important and makes a huge impact on performance. All indexes should align to the partition scheme, and queries should align to the partitioning. The rules of thumb: parameter variables should have the same data type as the column, and don't use functions on partition columns in WHERE clauses or join columns. In the execution plan, look at "Actual Partitions" and "Actual Partitions Accessed", and try to reduce the number of actual partitions touched.
SARGable query: SARG stands for Search Argument. Non-SARGable queries ("zombie queries") are a main cause of Hash Match. Wrapping a column in a function in the WHERE clause makes the predicate unusable for an index seek, e.g. LIKE '%abc%'.
Eliminating functions in the WHERE clause helps with Hash Match, partition elimination, and SARGability.
Other topics worth exploring: avoiding external grouping, CROSS APPLY vs UNPIVOT, ROW_NUMBER vs OFFSET &amp; FETCH (prefer OFFSET &amp; FETCH for paging through serial data; ROW_NUMBER is less preferable when less data is needed), RANK vs a subquery, and the power of the OVER clause.
All about indexing:
A good index will: Filter the data efficiently, Eliminate the need to sort the data, Answer your query with data found in the index.
Rules for indexes:
Use statement plans to see how your database services your query: run the query prefixed by EXPLAIN ANALYZE. The statement plan is read right to left, bottom to top. The "actual row count" shows the number of records at each stage.
Filter your query with your index to avoid a full table scan.
Example: CREATE INDEX email_idx ON messages (email);
Sort your query with your index to speed up the query.
When the query looks like
WHERE email = 'firstname.lastname@example.org' ORDER BY datetime DESC
use a composite index on (email, datetime):
CREATE INDEX email_datetime_idx ON messages (email, datetime);
Cover your query with your index to avoid the table entirely.
If the query is SELECT datetime FROM messages and an index exists on datetime, there is no need to go to the table at all: the plan has a single stage (revscan, a reverse scan of the index). This is an example of a covered query, which optimizes reads.
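Both points, sorting and covering, can be sketched with SQLite via Python's sqlite3, using the messages example above: the composite (email, datetime) index both answers the filter and delivers rows already ordered, so the plan shows no separate sort step:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE messages (email TEXT, datetime TEXT, body TEXT)")
con.execute("CREATE INDEX email_datetime_idx ON messages (email, datetime)")
con.executemany("INSERT INTO messages VALUES (?, ?, ?)", [
    ("a@example.org", "2022-01-01", "hi"),
    ("a@example.org", "2022-01-02", "re"),
    ("b@example.org", "2022-01-01", "yo"),
])

def plan(sql):
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Filter on email, sort on datetime, select only indexed columns:
# the composite index covers the query and satisfies the ORDER BY
p1 = plan("SELECT datetime FROM messages "
          "WHERE email = 'a@example.org' ORDER BY datetime DESC")
print(p1)
```

If the index could not satisfy the ordering, SQLite's plan would include a "USE TEMP B-TREE FOR ORDER BY" step; here it does not.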
Three things to keep in mind for using indexes:
First, indexes have to be kept in sync with the table rows they point to. That means that writes to a row have to go to the table and to the index. And because you don’t want the index to be eventually consistent, this means you need a transaction, along with its associated overhead. So don’t build too many indexes on a table, because you’ll have to write to potentially all of them when writing data.
Second, if you index a large field (such as the “body” field in the messages, above), you have a large index, so try to avoid that.
And third, always use your EXPLAIN plans to check that things are working the way you expect.
How to use indexes for better workload performance:
Use the Statements Page to avoid full scans
Use Index Statistics to find indexes with low or no usage
Use WHERE filters on table partitions to make use of partition pruning to read in less data.
Apply WHERE filters in the same order as your partition and cluster keys - this will ensure you make full use of ordered ranges of data
Create partitions and clustered indexes/columns based off usage patterns - date and timestamps/datetime columns are great as partitions usually to minimize the amount of data that’s read, additional cluster columns can be used for frequently used WHERE filters
Reduce the amount of data used for table joins using filtering and pre-aggregation where possible
Look for skewed value distributions on your join columns - you may want to handle hot spots and problem areas separately
Table order in INNER JOIN operations may be important in some database implementations - for example, BigQuery performs a broadcast join when the left table is huge and the right table is smaller and can be sent to every split part for better parallelism, as opposed to a regular hash join when the tables are the other way round
Bucket joins or repartitioning data on the fly can be very helpful (if your db implementation has this capability)
Don’t use ORDER BY in CTEs or sub-queries
Apply complex transformations later on where possible instead of earlier in the query - this should theoretically use less data in most cases if you are applying joins and WHERE filters upstream.
Avoid using datatype conversion in WHERE or JOIN clauses (e.g. WHERE date(insert_date) = '2022-01-01')
Using DISTINCT and GROUP BY in the same query is a very big NO
Use EXISTS instead of 'IN'
Common Table Expression (CTE) query performance - the SQL WITH clause simplifies complex queries: a CTE essentially turns nested subqueries into named, readable steps.
An example: get the list of premium customers (customers who placed more than the average number of orders placed across all customers).
WITH total_orders (order_customer_id, total_orders_per_customer) AS -- Query1
(SELECT order_customer_id, count(*) AS total_orders_per_customer
FROM orders GROUP BY order_customer_id),
average_orders (avg_orders_per_customer) AS -- Query2
(SELECT avg(total_orders_per_customer) AS avg_orders_per_customer
FROM total_orders)
SELECT * FROM total_orders JOIN average_orders -- Query3
ON total_orders.total_orders_per_customer > average_orders.avg_orders_per_customer
After the WITH clause, only the statement that immediately follows can reference the CTEs; the result sets are stored temporarily, so all three queries above must be run together. Query3 can be an INSERT, SELECT, UPDATE, or DELETE. A CTE can define one or more temporary result sets.
A subquery version nests as query3(query2(query1)); in CTE form the same logic reads top-down as query1, then query2, then query3.
The above three queries in subquery form:
SELECT * FROM
(SELECT order_customer_id, count(*) AS total_orders_per_customer
FROM orders GROUP BY order_customer_id) total_orders
JOIN
(SELECT avg(total_orders_per_customer) AS avg_orders_per_customer FROM
(SELECT order_customer_id, count(*) AS total_orders_per_customer
FROM orders GROUP BY order_customer_id) x) average_orders
ON total_orders.total_orders_per_customer > average_orders.avg_orders_per_customer
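The CTE version can be run as-is on SQLite (WITH has been supported since 3.8.3). A sketch via Python's sqlite3 with made-up order data, where customer 1 places more orders than the average:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders "
            "(order_id INTEGER PRIMARY KEY, order_customer_id INTEGER)")
# customer 1 -> 3 orders, customer 2 -> 1 order, customer 3 -> 2 orders;
# average orders per customer = (3 + 1 + 2) / 3 = 2
con.executemany("INSERT INTO orders (order_customer_id) VALUES (?)",
                [(1,), (1,), (1,), (2,), (3,), (3,)])

premium = con.execute("""
WITH total_orders AS (
    SELECT order_customer_id, count(*) AS total_orders_per_customer
    FROM orders GROUP BY order_customer_id),
average_orders AS (
    SELECT avg(total_orders_per_customer) AS avg_orders_per_customer
    FROM total_orders)
SELECT order_customer_id
FROM total_orders JOIN average_orders
  ON total_orders.total_orders_per_customer > average_orders.avg_orders_per_customer
""").fetchall()
```

Only customer 1 (3 orders, above the average of 2) qualifies as premium here.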
We can see in the above example query, how the redundancy in subqueries is tackled with better readability in CTE.
CTE also helps in performance gain, reducing the number of repeated queries to be run among other improvements in performance.
In a CTE, the column list after the CTE name is mandatory when columns would otherwise share the same name; otherwise it is optional (but good practice) and must match the columns actually returned.
CTEs can improve performance in some cases, e.g. when a reusable result is referenced multiple times, and degrade it in others, e.g. when the CTE is materialized and indexes can no longer be applied to it; internal optimization comes into play when the final query needs indexing, and behavior also depends on the database and its version. Recent database versions are increasingly good at optimizing CTEs.
Optimizations can be made at the level of hardware, the database, the query, or data tuning.
Finally, a SQL mindmap to quickly check on available commands in SQL. Hope you find the SQL Optimizations interesting and fun!!!