SQL Performance Tuning and Rules of Thumb, Secrets of the Masters Revealed

0
1919

I have interviewed hundreds of developers over the years, and it is incredible how topical is the knowledge that developers know about this essential critical skill called SQL. It is the one skill that every full stack developer must learn and understand. 

Transactional systems are not going away even though there is a lot of talk about big data and noSQL databases. Relational Database Management Systems (RDMS) will remain at the heart of transactions. RDMS fit the CAP theorem where you only get 2 of the three and, in this case, Consistency across reading and Performance.

  • Consistency: Every read receives the most recent write or an error
  • Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
  • Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

RDMS drives many mission-critical ACID (Atomicity, Consistency, Isolation, Durability) database transactions. When you master SQL, you can dive into analytics and data warehousing, where Performance dealing with millions of rows is expected. Schema databases and SQL still drives analytics at the core. The objective is to squeeze a few more transactions per minute, to get data back to the user ASAP in this era of “I want it now.” 

I am IBM DB2 certified and first developed many of my tricks geared toward the DB2 platform. Many of the points below hold on to many databases. I have translated the same theories to Sybase, Oracle, and MySQL. With that said, let us start exploring.

SELECT Clause

  • Select only the columns needed by your program. Do not use SELECT *.  There are 2 main reasons, you are making the database force more network traffic to your application and you now have to use more CPU to move each column into your program.
  • AVOID USING SELECT DISTINCT without a supporting index unless the number of qualifying rows is less than 5000.  The reason being you will be sorting all the qualifying rows and this can be expensive as the data set grows.
  • Use the AS feature when selecting derived columns. The reason is documentation.
  • Use the COALESECE feature to avoid null indicators variables for columns allowing nulls and results in sets that can be null.  SUM (COL1) is null if null rows qualify.  Here are facts from the one IBM Gold Consultant and my mentor Richard Yevich (RIP).  It reduces the code to maintain and improves access paths if used in predicates.
  • Use CASE over DECODE because it is more meaningful and more feature rich.  CASE complies with ANSI SQL. DECODE is proprietary to Oracle.  CASE is a statement where as DECODE is a function and CASE is used in relational operators where as DECODE is used in equality operators.  CASE can be used in a where clause but you cant use DECODE in the where clause.  DECODE works with expressions that are scalar values only. CASE can work with predicates and subqueries in searchable form.  CASE expects datatype consistency, DECODE does not and CASE manages nulls in a special way.

SQL> select case

2 when null is null

3 then ‘NULL’

4 else ‘NOT NULL’

5 end null_test

6* from dual

SQL> /

FROM Clause

  • USE TABLE EXPRESSION feature over VIEWS. The reason is for documentation and performance analysis.

WHERE Clause

  • Use WHERE Clause to select the rows needed.  There are 2 main reasons, you are making the database force more network traffic to your application and you now have to use several times more CPU to process the result set.
  • Use <> rather than NOT EQUAL. The reason is <> is universally accepted.
  • Only use Literal in a SQL statement when it never changes.  If it changes then use a host variable to the move the literal to the host variable in the host language prior to the SQL statement. The reason is the program will be easier to maintain when the variable changes.
  • Avoid arithmetic expression (including concatenation) in predicates where there are no other stage 1 predicates.  Instead, store expression in host variable and use the host variable in the WHERE statement.  The Reason is to make the predicate stage 1.  Complex predicates are Stage 2.  If they only further filter then they are great however if they are the only predicate index/table scans are the only access paths possible.
  • Avoid negative logic if possible.  The reason is that positive logic is much easier to comprehend.
  • If more than 25% of the rows will be returned, induce a tablespace scan by using the +0 or CONCAT ‘ ‘ techniques. The reason is that due to the prefetching, the database can often scan the entrie table faster than reading 25% of the index pages + 25% of the data pages.

JOINS

  • Ensure the use of a join predicate to avoid a Cartesian product of rows returned.  Reason: Missing Join predicates cost CPU and it might not not give the expected results.
  • Use Joins instead of subselects.  The reason is that the database join operation algorithms are always faster.
  • Do not code RIGHT OUTER JOINS.  Switch the order of the tables in the from clause and code LEFT OUTER JOINS.  This will reduce the confusion with EXPLAIN output (only displays Left and Full joins)
  • Do not code INNER JOINTS with the ON clauses.  Older version of many databases process the where clauses after the join is complete unless table expressions are used.  Stay with the “,” and no ON clause so that normal join algorithms are used.

ORDER BY/GROUP BY

  • Use the column name in the ORDER BY and GROUP BY clauses. Do not use the column relative number from the SELECT clause.  Reason is documentation and ease of reading code.
  • Do not ORDER BY a column that has only one value in the return rows.  There is no need to use CPU and I/O to sort something that does not need sorting.

UNION

  • use UNION ALL when duplicates do not exist or when the number of duplicates is small and can be eliminated in the program of if duplicates are needed.  Reason is to eliminate the sort.

SUBQUERIES

  • Use Select * for correlated subqueries when using EXISTS or NOT EXISTS. The reason is that no rows are returned, only a flag set to true or false
  • Avoid using IN phrase.  Use JOIN or EXISTS in a correlated subquery.  The reason is to avoid creating and sorting a work file.

EXISTANCE CHECKING TECHNIQUES

Do not use SELECT COUNT(*) to check for existance.  Reason is it cost CPU and I/O to count all the rows when the actual count is not needed.  While COUNT(*) scans the entire table, counting up all entries matching your condition, EXISTS() will exit as soon as it sees the result it needs.

SELECT columns needed

FROM tables needed

WHERE EXISTS

( SELECT * FROM table WHERE condition)

correlate the subquery if the checking should be done for each row.  Leave it non correlated if the checking only needs to be done once.

CURSOR PROCESSING

  • Close cursor when the last row is processed rather than at the end of the program unless it is the last action of the program.  If no more processing happens after the cursor then do not close the cursor but let the program close the cursor to release the locks sooner and improve concurrency.
  • Retain cursor positioning after Commit processing to reduce program code by eliminating the need for browsing logic.
  • If cursors are use to drive updates use FOR UPDATE or WHERE CURRENT OF CURSOR to create row level processing.  If updates have to be ordered switch to WHERE KEY =:HVKEY. The reason FOR UPDATE OF and ORDER BY are mutually exclusive.
  • Always check the row counter or timestamp in the UPDATE statement to warn users if the data has changed.  The checking the counter or time stamp in a SELECT statement allows data integrity problems to slip in.

GENERAL TECHNIQUES

  • Build small tables, such as those used for code definition, in memory or application tables if they are used repetitively.  The reason is that is save CPU and I/O by having the database read it only once.
  • Ensure restart logic is coded for applications requiring high availability.
  • For SQL used in applications requiring high concurrency, the SQL should be retried a few times before sending an error message.  5 times is a good start.
  • Do not start coding SQL with out knowing all the indexes available on the table.  Reason is that all possible index-able local and join predicates should be coded to ensure index usage.
  • Run an EXPLAIN on all SQL statements built prior to the release from development. Granted that production and development can have different plans depending on the data.  Analyze results.  The reason is that you can steer the optimizer in the right direction using the two tuning techniques of (OPTIMIZE FOR n ROWS and +0, CONCAT ‘ ‘)

ADHOC QUERIES

Views can be used for these requests.  The reasons is that in insulates the user from tables changes over time.  It can hide SQL complexity, enforce security or pre-join tables.

Stephen Choo Quan

Stephen is a double threat holding both SAP business objects certified architect as well as being a certified IBM DB2 Database Developer. read More…