The Quality of Your SQL
Gavin Nour
nourg@acslink.net.au
The following article was first published in the November 1995 issue of the Informix
User Group NSW Newsletter.
Some organisations are suprised when I suggest that all SQL should be passed through
Quality Assurance (QA) by analysing the output of the command SET EXPLAIN ON.
This article will discuss why we should QA SQL, explain the role of the Informix
optimiser (including how to read the output from SET EXPLAIN) and offer some tips to
improve SQL performance.
The subject should be pertinent to all development where an Informix database is used,
regardless of the tool. The SQL development could be via an Informix tool such as 4GL,
I-SQL, ESQL or NEWERA or it could be via a third party product accessing an Informix
database.
Firstly, why QA your SQL? Generally I recommend placing strict controls on ALL SQL
development. This becomes especially important today as we are finding the size of
databases are growing at an alarming rate. The larger our databases become the more
visible our mistakes become and the more problems we have. Large databases are less
forgiving, especially in the area of performance. Paying careful attention in the
development phase will of course reward us in the future.
Quality assurance will not only help us avoid performance problems, it will also help
us avoid other serious problems such as filling the root database space unnecessarily with
large temporary tables (possibly causing other INFORMIX applications to fail) or filling
UNIX filesystem space with large sort files (possibly causing any UNIX process to fail) or
even filling our logical logs with long transactions and having to restore the database
from an archive.
One of the best ways to avoid such problems is to enforce SQL development standards.
Usually the best approach is to separate the SQL Quality Assurance process from the
Development process. Many large organisations do this and have a team of staff devoted to
general QA others have just one person. My opinion is that usually the DBA is in the best
position to QA SQL statements as he/she is most familiar with the database and knows how
the engine goes about its task. Also the DBA should be aware of the SQLs being executed as
this may highlight the need to alter the database schema (eg.; add an index). Regardless
of who performs the QA, the most important thing is that it gets done and that both the
developer and the person responsible for QAing the SQL understands what to look out for.
Undoubtably the best tool to use for optimising a query or for the QA process is the
command SET EXPLAIN ON.
We should remember first that although we can optimise the queries proper database
design plays a major part in the query performance and often the redesign of tables and an
effective indexing strategy will solve many problems. Maybe a future article will discuss database
schema optimisation. This article will focus on the SET EXPLAIN ON option and
endeavour to explain (pun intended) how this command can help you optimise queries. First
we have to;
Understanding the Informix Optimiser
What is the optimiser?
The optimiser is the part of the INFORMIX engine that has the job of trying to
anticipate the best route to take before running a given query. Because there are often
many paths that can be taken to get to the data, some paths will take much longer than
others. This is especially true when there are many tables involved in a join. Developers
usually develop the query and then leave the task of finding the best route to the
optimiser and assume that the engine will use the best method available. Unfortunately
because the optimiser does a good job most of the time we tend to forget that the engine
does not always make the right decision. Sometimes we know more than the engine does and
can see a better way of doing things. The output from SET EXPLAIN ON explains how the
optimiser has chosen to access the data. With this information available we may discover
the optimiser has chosen a path that will be time consuming and we can takes steps to
restructure the query. Sometimes a small alteration will influence the optimiser to take
an alternative route. Other times we realise that an index can be added to improve
performance or we may just find out that it is fine the way it is but we need to add more
temporary space to cater for a sort.
How does the optimiser make its decision?
The answer to this is very complex. The main aim of the optimiser is to reduce I/O by
limiting the amount of data to search through in order to obtain the requested data in the
most efficient way. The optimiser makes its decisions based on information in the system
catalogues. In version 5 this information consists of :
- the number of rows in each table used in the query (Systables.nrows)
- how many pages are used for data and how many pages are used for indexes
(systables.npused)
- whether columns values are unique or not.(sysconstraints)
- what indexes exist (sysindexes)
- whether the indexes are ascending or descending (sysindexes). Not req in V7.
- whether the data is in the same order as the index; i.e., clustered (sysindexes.clust)
- how many levels there are in an index
- what the second largest and second lowest values are for each column. This gives the
optimiser a rough idea of the range of values (syscolumns.colmin and colmax). Version 7
can obtain more detail about the distribution of data (sysdistrib table).
Using all of this information the optimiser will determine all possible routes and then
weigh each method with an estimated cost. The cost will be based on several considerations
including disk access, CPU resources required and network access. In the process the
optimiser will determine in what order to join tables, whether to perform sequential
scans; whether it should create temporary tables; whether it can use an index for the
select list, for the filter, for the sort or for a group by etc. Once the optimiser
selects a plan which it believes is the most cost effective it passes the query on for
processing and if SET EXPLAIN ON is in effect the chosen method will be recorded in a
file.
To illustrate how important it is for the optimiser to have the right information
available we can look at a very simple query and a decision that would have to be made by
the optimiser to perform a join :
Lets say we have two tables, one with 500,000 rows (tab2) and the other with 200 rows,
both with a unique index on the joining column. A simple select to find related
rows from the two tables would be:
SELECT * FROM tab1, tab2
WHERE tab1.col1=tab2.col2
Hopefully the optimiser would choose to select from the smaller table first and then
join to the second using the index. This would result in 1000 disk reads (assuming one row
per page and it takes three index reads and one data read per row in table two). ie. 200
for table one plus 200 x 4 for table two.
Now, lets say the optimiser did not have accurate information available about the
number of rows and the number of unique values and chose to select fromtable two first.
This would result in two million reads! (assuming it takes just two index reads and one
data read per row in table 1). ie. 500,000 plus 500,000 x 3.
Of course if the optimiser had even less accurate information and didn't know about an
index and used a sequential scan for each row returned from table one we would have to
perform around one billion reads! (200 * 500,000).
We can see from this example using a very simple query that the wrong information and
the wrong decision could have dramatic effects on performance. Eg; 1000 reads verses
1,000,000,000. If there are many tables involved along with more filter conditions in the
WHERE clause the decision process becomes much more complex and the importance of accurate
statistics is magnified.
How accurate are your statistics?
It is very important to remember that the information in the system catalogues used by
the optimiser is only updated when the UPDATE STATISTICS command is executed. The
optimiser is only as good as the information it has been provided with, so, rule number
one is: execute UPDATE STATISTICS as regularly as possible. More often for very dynamic
tables than for tables which rarely change. If the system catalogue information is out of
date the optimiser may make the wrong decision and severe performance problems could be
experienced. It is surprising to see how many sites fail to do this.
We can execute UPDATE STATISTICS for the whole database, for individual tables, for
columns in a table and for stored procedures.
What I usually recommend is that, instead of relying on memory, automate the execution
of UPDATE STATISTICS every night via a cron command. Be careful when running the UPDATE
STATISTICS command whilst other applications are running because the system tables are
momentarily locked when the catalogues are updated. This can result in other processes
receiving errors. Do not update too often because the command takes time.
When does the engine perform the optimisation?
The optimisation occurs whenever an SQL statement is prepared, providing there are no
host variables. If there are host variables the optimiser does not have all the
information required until the variables are passed (when the cursor is opened) and in
this case the optimisation occurs on the OPEN statement. So, with standard SQL (i.e. not
prepared) the query is optimised whenever it is executed. If an SQL statement is used
repeatedly within a program it is best to PREPARE the SQL statement to improve
performance. This way the SQL is only optimised once within a program instead of every
time it is used (unless specifically requested to reoptimise by using the re-optimisation
keyword). With Stored Procedures the SQL is optimised when the procedure is created or
when UPDATE STATISTICS is executed for that stored procedure.
Suddenly slow?
Even after running UPDATE STATISTICS regularly in production we may notice one day that
a query which previously took 10 minutes suddenly takes one hour. This can be the result
of the optimiser suddenly choosing a new path based on new information in the system
catalogues. This may mean the query needs restructuring to influence the optimiser to make
a better decision. Because of this we should have the ability to turn SET EXPLAIN ON for
any program in production instead of being forced to recompile the program to find out the
query plan. I usually recommend building a simple function into all programs which enables
either a parameter to be passed or an environment variable to be set to turn SET EXPLAIN
ON.
Development Verses Production
Remember, not only can the statistics used by the optimiser change but also if you are
testing a query in development on a small database the statistics may be quite different
to what is in production. A query may appear to perform well in development but once it is
in production the optimiser may choose (rightly or wrongly) to take a totally different
route. The only way to get around this problem in the QA process is to run the query
against a test database which is of a similar size to production with the same data, or
alternatively (providing the database schemas are the same) one might consider manually
updating some of the columns in the system catalogue tables after running UPDATE
STATISTICS on the test database just so that the optimiser makes the same decisions,
however this is definitely not recommended and INFORMIX may not support you if corruption
is caused.
Controlling the optimiser
Optimisation level
In version 5.x and above we have the ability to influence how much information the
engine attempts to obtain. We can do this by using the command SET OPTIMISATION HIGH or
LOW. HIGH (which is the default) causes the engine to examine all access paths and LOW
causes the engine to eliminate less likely options at an earlier stage in order to improve
optimisation time. LOW can be useful if many tables are used in a join (e.g. five or more)
as the optimisation time can hinder performance. The downside to using OPTIMISATION LOW is
that an access path may be eliminated too early when in fact it would have been the most
suitable option.
OPTCOMPIND
Beginning with version 7 we have a little more control over the optimiser. Previous
versions assumed that if an index existed this would be the most cost efficient access
path. Now we can specify that we would like the optimiser to compare the cost of an index
read verses the cost of a table scan so that the most cost efficient path can be chosen.
This is done by setting OPTCOMPIND=2 (default). We can still emulate the functionality of
previous versions by setting OPTCOMPIND=0. Setting OPTCOMPIND=1 instructs the optimiser to
work like it would with it set to 2 unless REPEATABLE READ is set in which case it works
like it would if it was to 0. The reason for this option is that a sequential scan with
repeatable read will effectively place a shared lock on the whole table (as it scans all
rows) during the read. Usually undesirable!
Obtaining data distribution for the optimiser
Beginning with version 6, INFORMIX introduced data distribution analysis and storage in
order to help the optimiser make more informed decisions. Data distribution analysis is
used to give the optimiser a better understanding of the values contained in a column,
e.g.; how many unique values there are in each area of the table. This is done by sampling
the data in a column and storing information about sections of the table in various bins.
This information can be extremely valuable to the optimiser when dealing with large
tables. To generate distribution for a column the UPDATE STATISTICS command is used. The
amount of data sampled is controlled by the keywords MEDIUM and HIGH e.g. UPDATE
STATISTICS HIGH FOR table(column). MEDIUM merely samples the data and is very quick
whereas HIGH evaluates all the rows in the table and is therefore slower but more
accurate. LOW is like using previous versions of UPDATE STATISTICS and does not obtain any
distribution data at all. The distribution analysis can be influenced further by the use
of the keyword RESOLUTION and by specifying a resolution value (number of bins) and a
confidence value (level of sampling). Refer to the manual for more information on using
these parameters.
An UPDATE STATISTICS strategy for ONLINE DSA Version 7
The recommended UPDATE STATISTICS strategy for INFORMIX DSA version 7 is to do the
following (in the same order):
1) run UPDATE STATISTICS MEDIUM on all tables using DISTRIBUTIONS ONLY without listing
columns and using the default RESOLUTION parameters. Prior to 7.10.UD1 it is better to
actually list the columns that do NOT head an index for better performance. 2) run UPDATE
STATISTICS HIGH for all columns that head an index or the columns that are definitely part
of a query. Execute a separate command for each column. 3) run UPDATE STATISTICS LOW for
all other columns in composite indexes.
Step one is very fast as it just samples data, but still gathers useful distribution
information. This is used first to obtain information about all the columns that are not
in an index. From 7.10.UD1 it is not worth specifying each column due to the speed of
MEDIUM. Step two is to get as much distribution information as possible for all columns
which are important for query performance (eg joining columns). Because the index can be
used (in version 7) the statistics can be gathered relatively quickly, however do note
that only one column should be specified at a time with
HIGH so that it can make use of an index. In the final step the remaining columns in
all the indexes which are not at the beginning of the index can be referenced in the one
statement (per table) using the LOW parameter.
Do not re-run UPDATE STATISTICS for large static tables.
Once the UPDATE STATISTICS steps have been completed you can view any data distribution
profiles by using the -hd option with dbschema, e.g.: dbschema -d databasename -hd
tablename
TIP: This command is a very useful tool to help determine a good fragmentation
strategy.
Examining the Optimiser's Choice
Once we are comfortable knowing that we have supplied the optimiser with enough
information we can see what query plan the INFORMIX optimiser has chosen. To do this use
SET EXPLAIN ON within the query or within the NewEra, 4GL or esql program. When this is
set the optimiser will write output for all queries (for the same process) to a file
called "sqexplain.out" in the current directory (usually,
however the filename and location depends on the operating system and whether the query is
executed on a remote host). Once the query is submitted it is at the point of no return.
The only way to examine the optimiser output without completing the query is to hit the
interrupt key just after the query starts (after the status line in dbaccess says explain
set).
Typical SET EXPLAIN output:
QUERY:
------
select cust_id, order.* from orders, customers where order_date
> "01/12/1995" AND order_date < "01/01/1996" AND
customers.cust_id = orders.cust_id order by order_date DESC
Estimated Cost: 10
Estimated # of Rows Returned: 200
Temporary Files Required For: Order By
1) informix.orders: INDEX PATH
(1) Index keys: order_date
Lower Index Filter: informix.orders.order_date > "01/12/1995"
Upper Index Filter: informix.orders.order_date < "01/01/1996"
2) informix.customers: INDEX PATH
(1) Index keys: cust_id (Key-Only)
Lower Index Filter: informix.customers.cust_id = informix.orders.cust_id
Understanding the SET EXPLAIN output:
Query: {LOW}
This section of the output shows the actual query which was optimised. LOW is displayed
if SET OPTIMISATION is currently set to LOW. Note that sqexplain.out is appended to if the
file already exists.
Estimated Cost:
This value is simply a number the optimiser has assigned to the chosen access method.
The value is not meaningful except to the optimiser as it bears no relationship to real
time. It cannot be compared to the estimated cost of other queries and is best ignored. It
can be used however to compare changes made for the same query (e.g.; an index change).
Estimated # of Rows Returned:
This is the optimisers estimate based on information in the system catalogue tables.
Remember that the catalogue information is fairly limited (especially prior to version 7)
so this value will often be inaccurate (more so if the query involves a join). In ONLINE
DSA Version 7 distribution information can be obtained for the data which helps the
optimiser estimate the number of rows more accurately.
Temporary Files Required For: Order By Group By
When this is displayed there is a GROUP BY or an ORDER BY statement in the query and
the optimiser has determined that there is no corresponding index available to obtain the
data in the required order. A temporary file will be created to order the result set. This
could be very large (depending on the size of tables) so check available disk space and be
aware of the affect this sort could have on performance. Indexes cannot be used when the
columns to be sorted are from more than one table. Note that in version 7 the optimiser
can choose to traverse an index in the direction of the ORDER BY regardless of whether the
INDEX is in the same order as the ORDER BY. Prior to version 7 the ability of the
optimiser to use the index for an ORDER BY was dependant on whether the
ASCENDING/DESCENDING values on the index and the ORDER BY matched.
1) owner.table: INDEX PATH (Key-Only)
This is the table that the optimiser has chosen to read first (indicated by the 1 ).
Subsequent table accesses (for a nested loop join for example) will be displayed further
down in the explain output and will be indicated by a higher number. For each row returned
at this level the engine will query the tables at a lower level. INDEX PATH indicates an
index will be used to access this table.
The (Key-Only) notation (with ONLINE only) indicates that only the index will be read
and the actual data value (row) will not be read from this table. Key only access would
generally be very efficient (prior to version DSA 7.2 ) due to the smaller size of the
index compared to the row. Not only is the read for the data row eliminated but more index
key values are likely to fit on the one page which in turn reduces I/O. This type of
access is only achieved if no columns are selected from the same table. Tip: Avoid the use
of SELECT * if possible and only select the required columns. Note that with ONLINE DSA
7.2 key only reads are in fact slower in most cases due to the read ahead capabilities.
(1) Index keys: column_name Lower
Index Filter: owner.table.column > x Upper
Index Filter: owner.table.column < y
Column_name is the name of the column to be used in the INDEX PATH read Lower Index
Filter shows the first key value (x) where the index read will begin. Upper Index Filter
shows the key value (y) where the index read will stop.
1) owner.table: SEQUENTIAL SCAN (Serial, fragments: ALL)
In the above case all rows will be read from this table using a sequential scan.
The section in brackets relates to version 7. If Parallel is displayed instead of
Serial then the engine will perform a parallel scan (this behaviour is influenced by the
PDQPRIORITY setting). The ALL notation indicates all fragments have to be scanned because
the optimiser could not eliminate fragments after examining the WHERE predicates. NONE
indicates the opposite, that is the optimiser has eliminated all fragments and therefore
none need to be examined. A number (or list of numbers) indicates the engine will only
examine the fragment/s listed (numbers are relative to the order in the sysfragments
table) Pay special attention if the sequential scan is being performed at a lower level in
the query plan (indicated by a higher number) as this could mean the whole table is being
scanned for each row returned in a previous step. Often this is one of the warning bells
when optimising/QAing a query. Sequential scans are not so bad when they are for small
tables or when they are in the first step of a query plan providing the engine does not
have to scan a large table in order to retrieve just a fraction of the table.
AUTOINDEX PATH: owner.table.column
Used more in Version 4. In order to avoid sequential access a temporary index is built
on this column to perform a join. Used if an index does not exist on this join column and
is generally an indication that you need a permanent index.
SORT SCAN: owner.table.column
Used in combination with a sequential scan when no index is available on the join
column. This column will be sorted for later use with a join (See MERGE JOIN).
MERGE JOIN Merge Filters: owner.table.column = owner.table.column
A Merge join will be used to join the results of the two previous selections sets which
were prepared for a join. Once the join columns are obtained in the appropriate order
(possibly via a SORT SCAN if an index does not exist) the server sequentially reads both
result sets and merges them before accessing the rows. Considered faster than a nested
loop join in many cases.
DYNAMIC HASH JOIN (Build Outer) Dynamic Hash Filters: owner.tab1.col =
owner.tab2.column ...
Version 7 only. A Hash join will be used to join the two preceding tables in the
explain output. The Build Outer notation tells us which table is used first. The Filter
shows how the tables will be joined. Where some complex queries cannot use an index the
hash join takes over. A Hash join is also used instead of a sort-merge join and is
considered more efficient. Whereas a sort-merge join sorts both tables a hash join
typically only sorts one. Hash joins are favoured with large amounts of data especially
for PDQ with fragmentation. Rows are placed in a hash table after using an internal hash
algorithm. The cost of a hash join can be lower than using an index, especially when more
than around 15% of data from a large table needs to be scanned. When the data is not
clustered (in a different order to the index) the cost of traversing the index in addition
to retrieving the actual rows (in a different order) is quite high compared to a table
scan with a hash join. OPTCOMPIND=2 (which is default) will cause the optimiser to
consider hash joins instead of an index. Note OPTCOMPIND ought to be set to 1 when
REPEATABLE READ is being used (and arguably should be the default).
SQL Query Quality Assurance and Optimisation
(Some Tips, Some Warnings, Some Things To Check and Some Things to Avoid):
- Avoid sequential scans on a large table if it is not in the first position of the query
plan as the whole table scan is repeated for every row in the preceding table. This can
severely affect performance. This not only affects the query in action but can also impact
other tasks by changing the recently used pages in shared memory (as it keeps reading the
whole table into shared memory). This increases disk I/O all round as other processes are
forced to read from disk. Consider adding an index if the query cannot be restructured and
performance is slow.
- Avoid potentially large temporary sort files. This can consume all available CPU
resources, increases disk I/O and consume all available disk space. Consider adding an
index on the columns being sorted. A hint: if the optimiser is not using an existing index
for the ORDER BY this may be because the column being ordered is not in the WHERE clause.
In this case the optimiser can be influenced to use the index instead of creating a
temporary file by adding a dummy WHERE condition on the ORDER BY column (e.g.; AND
order_num>=0). Prior to version 7 the index has to be in the same order as the ORDER
BY. When sort files are used check $PSORT_DBTEMP and $DBSPACETEMP settings, these can help
significantly to improve performance by enabling the engine to use more than one disk for
the sort file.
- Correlated subqueries. Subqueries that make reference (in the WHERE clause) to a
selected column in the main query can severely affect performance. This causes the
subquery to be executed repeatedly for each main table row. Be very cautious when using
statements like EXISTS with large tables as logical logs can fill very quickly in the
process. Temporary table space is allocated and logged for each row in the main table
(even if no rows are returned in the final select). The worst affect of this is filling
the logs and having to restore from an archive. Correlated subqueries should be rewritten
to use a join wherever possible. Some more complex rewrites may involve joining to a
temporary table.
- OR statements on different columns can prevent the optimiser from being able to use an
existing index. If an index does exist and the optimiser chooses a sequential scan in the
query plan, consider using a UNION statement (one for each OR condition) as this will
provide the opportunity for the optimiser to use the index.
- If the query is slow even when an INDEX PATH has been chosen on all levels of the query
plan, do not assume the optimiser has made the right decision. Check the query plan to see
if the tables are being filtered in the right order. The aim is usually to eliminate as
many rows as possible in the early stages of the query, but unfortunately the optimiser
does not always have enough information to do this correctly (especially in versions prior
to 6). Using UPDATE STATISTICS HIGH in version 6 and above will give the optimiser more
information about the data distribution so that the right tables are eliminated first.
Also note the INDEX reads are not necessarily the best. For example a sequential scan of a
table can be faster than an index scan if the data pages have to be retrieved and the
pages are not in the same physical order.
- Converting data types and comparing character columns is very expensive. E.g.;
tab.character_col=tab2.integer_col. If possible consider changing the column types to be
numeric. Remember if the join columns are character types the columns have to be compared
byte by byte for each row. Although version 7 handles conversions better, the cost of the
conversion overhead is still not considered by the optimiser.
- Look out for the WHERE predicates which may not be able to use indexes. These include:
OR, LIKE or MATCHES with a wildcard at the beginning (e.g.; MATCHES *NOUR), functions
(e.g.; MONTH, DAY, LENGTH), negative expressions ( e.g.; != NOUR), non initial substring
searches (e.g.; postcode[4,5]>10).
- Except for very old versions of the engine the order of tables in a SELECT list or the
order of elements in the WHERE clause will not have an effect, however in some cases when
the optimiser believes the cost is the same for two different paths, the order of the
statements in the WHERE clause may have an effect. Some tricks have been suggested in the
past (such as repeating predicates in the WHERE clause) to try and force the optimiser to
use a particular index, however this is no longer recommended and will not work with
future versions. E.g. with version 7 the optimiser query rewrite feature eliminates
duplicates in the WHERE clause.
- Avoid long transactions in logged databases. We probably all know by now that long
transactions threaten to destroy our databases by filling the logical logs. Warning bells
should ring with the following statements: LOAD statements, INSERT INTO xx SELECT yy FROM
zz, UPDATE or DELETE statements spanning many rows. As an aside consider locking the table
in exclusive mode to avoid excessive lock usage (hindering performance and even worse
running out of locks).
- Check the placement of WHERE conditions with joining columns to see if all combinations
of the WHERE predicate have been included. For example the WHERE predicate; tab1.x=tab2.x
and tab1.x >1000 would probably cause the optimiser to use the index on tab1.x, however
the index on tab2.x might be much more appropriate. Adding the condition; and tab2.x >
1000 would give the optimiser more choices to evaluate. Another example is tab1.x=tab2.x
AND tab2.x=tab3.x. The optimiser in ver 5 would not consider joining tab1 directly to
tab3, so adding tab1.x=tab3.x would again provide more choices. Note however that the
optimiser in ONLINE DSA V7 rewrites queries so that all possible combinations are examined
and these suggestions are not applicable.
- Select only the columns required, this reduces the communication between the front end
and backend and reduces I/O. Avoid the temptation to SELECT * (all columns when they not
required).
- Use temporary tables when a subset of data is being reselected with different WHERE
predicates. For example a SELECT with the WHERE clause: orderdate> 01/12/1995 and x=1
might be followed by a SELECT with a WHERE clause: orderdate> 01/12/1995 and x=2 in
this case if the table is very large performance could be improved by first selecting all
rows WHERE orderdate>01/12/95 into a temporary table and then performing subsequent
selects on the temporary table.
- Use temporary tables to influence the optimiser to take the route that you know is best.
This can be done by first selecting the rows you want from a large table into a temporary
table and then joining the temp table to the rest of the tables. The optimiser may use a
different query plan knowing that the temporary table is much smaller than the original
large table.
- Consider creating indexes on temporary tables. Often overlooked just because the table
is temporary.
- When using temporary tables for subsequent selects with ORDER BYs, create the temporary
table using an ORDER BY. E.g.; SELECT x FROM y ORDER BY x INTO TEMP temp_tab. This syntax
is not available in earlier versions of the engine (pre 4.1).
- Consider running UPDATE STATISTICS (within the application) for large temporary tables.
Again this is often overlooked just because the table is temporary.
- Use the WITH NO LOG statement when explicitly creating temporary tables, this will help
performance by eliminating the overhead of writing to the logical logs. This also avoids
the possibility of creating a long transaction with a large temporary table. In version 7
temporary tables can be created in a special dbspace which is not logged. Use this feature
whenever possible.
- Time commands with the UNIX time or timex command or use; SELECT CURRENT FROM systables
WHERE tabid=1 before and after the query being tested/optimised.
- When testing/timing queries remember that the second time the query is executed it is
much more likely to be faster because the pages read the first time round will probably
still be in memory. Be aware that this may distort the test results. Use tbstat -p to
monitor disk/buffer usage. When timing queries consider restarting the instance between
tests in order to re- initialise shared memory.
- Fragmentation and Parallel queries. It is important to fragment the data in a way that
makes it easy for the optimiser to determine which fragments are active. Making the WHERE
predicates clear enough to enable optimiser to eliminate fragments will reduce I/O and
enable parallelism. PDQ and fragmentation is a whole new subject in its own right (perhaps
another article). Just remember if you are using ONLINE DSA version 7 the trick is to
divide and conquer!
To Summarise
The Informix Cost based optimiser does an excellent job in doing what it is supposed to
do, that is shielding us from the task of having to think about the best way to retrieve
our data. For this reason we often take it for granted. Just imagine if we had to make the
same complex decisions as the optimiser does every time we wanted to access our data.
Nevertheless, it is important that we understand the optimiser and the consequences of the
decisions it makes. More importantly we need to provide the optimiser with the information
it needs to perform its job effectively.
Informix are constantly refining the optimiser and finding new ways to improve
performance. Some of the comments made in this article may not be valid in future
releases. However this article was checked for accuracy (in Nov 95) by the brains behind
the optimiser (at INFORMIX, Menlo Park, California) and I wish to publicly thank that
person for his valuable time.
I hope this article has been of value to you. Any criticisms, comments or suggestions
would be welcome.
Gavin Nour is the Director of Select Software Solutions Pty Ltd. He has ten years
Informix experience covering a variety of Industries and applications. He was a DP Manager
for a leading financial organisation in the United Kingdom, a technical Support Manager
for a leading Informix Distributor in Australia and spent two years at a leading Informix
Software house in Australia before forming his own company specialising in Informix
services. Gavin is committed to the Informix solution and to helping the Informix
community. He is the Secretary of the NSW Informix User Group and is on the board of
Directors for the International Informix User Group.
|