Section author: Adam Scott

1.12.4. Optimizing queries

  • Most of the time, if you reduce the number of rows the query “scans”, you will get your results back faster.

    See the discussion on the LIMIT clause in SQL Basics and Gotchas
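As a quick sketch (using the ls_dr3.tractor table and columns that appear later in this section), LIMIT caps the number of rows returned while you explore:

SELECT ra, dec FROM ls_dr3.tractor t
 LIMIT 100;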

  • Instead of SELECT *, select only the specific columns of interest. Fetching 20 columns when you need two wastes I/O; if two columns would serve equally well, pick the one that is smaller in size.
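For example, using the smash_dr1.exposure table from the example below:

SELECT * FROM smash_dr1.exposure;               -- pulls every column
SELECT expnum, fieldid FROM smash_dr1.exposure; -- only the columns needed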

  • If the query JOINs one or more tables, be sure the columns you are joining on are indexed.

    To find out if a column is indexed, one way is to look up its table on the Data Lab query webpage and check whether the column name is shown in bold.
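As an illustrative sketch (the smash_dr1.field table and its fieldid column are hypothetical here, used only to show the shape of the join), both sides of the ON condition should be indexed:

SELECT e.expnum, f.fieldid FROM smash_dr1.exposure e
  JOIN smash_dr1.field f ON e.fieldid = f.fieldid;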

  • To take advantage of indexed columns, be sure not to wrap them in functions such as AVG(), SUM(), or any other function call.

SELECT expnum, fieldid FROM smash_dr1.exposure e
 WHERE CAST(e.expnum AS INTEGER) > 187566;

Because CAST() wraps e.expnum, the index on the expnum column (if any) is ignored and the query falls back to scanning rows. (In this particular case the slowdown is minor, since the exposure table has only 5,809 rows; this ties back to the first point above about reducing the number of rows the query scans.)

  • When querying an indexed column, adding WHERE conditions on additional indexed columns usually returns results more quickly than filtering on a single indexed column alone.

Consider the abstracted SELECT.

SELECT <<something>> FROM <<something>> WHERE
<<condition 1>> AND
<<condition 2>> AND
<<condition ...>> AND
<<condition N>>;

The more conditions you can provide on any indexed column, generally the better.
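As a concrete sketch, assuming for illustration that both expnum and fieldid are indexed in smash_dr1.exposure (the values here are arbitrary):

SELECT expnum, fieldid FROM smash_dr1.exposure e
 WHERE e.expnum > 187566
   AND e.fieldid = 169;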

  • Sorting is a relatively expensive operation. If you do not need ordered results, removing the ORDER BY clause can greatly speed up the query. (GROUP BY may also require a sort.)
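For instance, if row order does not matter, dropping ORDER BY avoids a sort over every matching row (table and column names from the examples below):

SELECT ra, dec FROM ls_dr3.tractor t
 WHERE g_r BETWEEN 0.3 AND 1.1
 ORDER BY ra;                     -- forces a sort of all matching rows

SELECT ra, dec FROM ls_dr3.tractor t
 WHERE g_r BETWEEN 0.3 AND 1.1;   -- same rows, no sort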
  • Indexing a column that holds only a single value is a waste: every row satisfies the condition, so the database still scans them all. In general, a WHERE clause on an indexed column with many unique values performs better than one on an indexed column with few unique values.
  • If you search for values in a non-indexed column, the database must scan every row. Use a CSTORE table instead: look for a table name ending in _cs, which indicates a CSTORE table. CSTORE is a columnar storage format that is faster for queries on non-indexed columns. (Technically it does use an index, but of a very different type.)

Here is a query on a non-CSTORE table on an un-indexed column.

SELECT ra, dec FROM ls_dr3.tractor t
 WHERE g_r BETWEEN 0.3 AND 1.1;

Here, because g_r is not indexed (as of this writing), the system scans over 330 GB of data and the query takes over 10 minutes to run. (The entire tractor table holds over 524 million rows.)

If an ls_dr3.tractor_cs table exists, the _cs suffix indicates it is a CSTORE table. The following query against ls_dr3.tractor_cs might run in less than a minute.

SELECT ra, dec FROM ls_dr3.tractor_cs t
 WHERE g_r BETWEEN 0.3 AND 1.1;

Why not make everything a CSTORE table? The main reason, as of this writing, is that CSTORE does not support function-based indexes, which are critical for high-speed positional queries. We look forward to the future when this becomes available.

  • If you are exploring the data, try reducing the number of calculations, or use simpler ones: division is more time-consuming than multiplication, and some divisions can be rewritten as multiplications (e.g. x/10 = x * 0.1).
  • Avoid string or text manipulation, as it is CPU-intensive.
  • If you have applied all of the suggestions above and the query is still slow, you may have hit an optimization wall where hardware performance is the limit. Presently we have parallel query in place, which works well for aggregate queries on large tables but otherwise has limited use; future versions will support parallel operations on even more queries. In the future, look for more options to run queries in parallel across multiple CPUs and nodes.
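The division-to-multiplication rewrite mentioned above can be sketched as follows (the threshold values are illustrative; g_r is from the tractor examples earlier):

SELECT ra, dec FROM ls_dr3.tractor t
 WHERE g_r / 10.0 > 0.05;   -- division in the filter

SELECT ra, dec FROM ls_dr3.tractor t
 WHERE g_r * 0.1 > 0.05;    -- equivalent, using multiplication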

A Data Lab Jupyter notebook on CStore vs Row Store performance has further details and provides practical guidance.