1.11. SQL gotchas

1.11.1. Respect your LIMITS

One table may have hundreds of billions of rows. And you are only interested in a few. This is where the LIMIT clause in SQL is so important to reduce your query time and load on resources.

It is recommended to always include a LIMIT clause as you explore datasets, and just in case.

For example, to get the g magnitudes of 100 objects in SMASH:

select id, gmag from object limit 100;

This returns in less than a second under quiet conditions.

If we had not added the limit 100, the query would run for many minutes, maybe hours and return back over 1.1 billion rows.

1.11.1.1. Using Limits with Joins

In this example we have a LIMIT 10 (the last line of the query). We used a LIMIT so we are safe right?

SELECT L.objid, L.ra, L.dec, L.g, L.r, L.z, L.w1, L.w2, L.g_r, L.r_z, L.z_w1, L.w1_w2, L.type, S.class
  FROM ls_dr3.tractor_primary AS L
 INNER JOIN ls_dr3.dr3_specobj_dr13 AS L2
            ON L.objid = L2.objid
 INNER JOIN sdss_dr13.specobj AS S
            ON  L2.objid  = CAST(S.specobjid AS BIGINT)
      WHERE (L.decam_nobs_2>1 and L.decam_nobs_3>1 and L.decam_nobs_5>1
           and L.g != 'nan' and L.g != 'inf' and L.r != 'nan' and
           L.r != 'inf' and L.z != 'nan' and L.z != 'inf'and
           L.z_w1 != 'nan' and L.w1_w2 != 'nan')
LIMIT 10;

When the database performs this query, it has to satisfy the INNER JOIN on both joins before it can apply the LIMIT 10, which means the database has to scan millions of rows before returning 10. Here, ls_dr3.tractor_primary has 478 million rows, ls_dr3.dr3_spec_obj has 4.4 million rows, and sdss_dr13.specobj has 4.4 million rows. (If they were around 1000 rows each, the impact of not using limit for most types of queries is negligible. Most types of queries encountered in the wild, anyway.)

So along with using LIMIT s, keep in mind the number of rows each table has.

Incidentally (not immediately obvious) is a missing join in this example. When building your query sometimes you might forget to add a join. This can happen while experimenting with queries. As it turns out, the database planner is expecting to work with over 14 trillion rows (we know this from running an EXPLAIN on it). So this query runs for a very long time when we only wanted back 10 rows. Even with a missing join we can protect oursleves by using the limit clause.

What is the missing join? brickid. To join ls_dr3.tractor_primary with ls_dr3.dr3_specobj_dr13 use the columns objid and brickid.

But how do we use the limit clause to protect ourselves here and elsewhere? Inline views.

1.11.1.2. Inline Views

If only we had a way to tell the database to limit the rows back from each table before we join them.

Let’s explicitly tell the database we want only 10 rows from ls_dr3.tractor_primary.

SELECT x.objid, x.ra, x.dec, x.g, x.r, x.z, x.w1, x.w2 ,x.g_r, x.r_z, x.z_w1, x.w1_w2, x.type, S.class
  FROM
      (select * from ls_dr3.tractor_primary L
                where (L.decam_nobs_2>1 and L.decam_nobs_3>1 and L.decam_nobs_5>1
                       and L.g != 'nan' and L.g != 'inf' and L.r != 'nan' and L.r != 'inf'
                       and L.z != 'nan' and L.z != 'inf'and L.z_w1 != 'nan' and
                       L.w1_w2 != 'nan')

       LIMIT 10) AS x

   INNER JOIN ls_dr3.dr3_specobj_dr13 AS L2
           ON x.objid = L2.objid
   INNER JOIN sdss_dr13.specobj AS S
           ON L2.objid  = CAST(S.specobjid AS BIGINT)
   LIMIT 10;

Let’s break down what we did here. We created an inline view with this portion of the statement.

(select * from ls_dr3.tractor_primary L
                where (L.decam_nobs_2>1 and L.decam_nobs_3>1 and L.decam_nobs_5>1
                       and L.g != 'nan' and L.g != 'inf' and L.r != 'nan' and L.r != 'inf'
                       and L.z != 'nan' and L.z != 'inf'and L.z_w1 != 'nan' and
                       L.w1_w2 != 'nan')

       LIMIT 10) AS x

And we gave it a name x (the AS x). With this technique we treat the whole “select” as another table and select from it. Importantly, we put a LIMIT on there. Finally, we moved the WHERE clause up into the inline view since the conditions only apply to L (the alias we gave to ls_dr3.tractor_primary).

This query runs in about 10 seconds, and the database estimates it only needs to work with 2,281,969 rows instead of 14 trillion.


HINT: There’s a lot of text to read with queries, but keep the “skeleton” in mind as you read or scan them. The mental “skeleton” or pattern with this query might be like the following.

SELECT
  FROM
      ( select from where limit 10) AS x
  INNER JOIN something
          ON something
  INNER JOIN something
          ON something
  LIMIT 10;

In summary,

  • use LIMITs to explore your data to prevent long queries and reduce load on computer resources; you won’t have to wait as long to get back results
  • take advantage of inline views to apply LIMITs to tables you join against

1.11.2. Make your JOINs work for you not against you

Here is an example of a JOIN that will work near optimal performance.

SELECT   L.ra, L.dec, S.z, S.plug_ra, S.plug_dec, S.class,
         S.vdisp, S.vdisp_err
FROM     ls_dr3.dr3_specobj_dr13 as L
    JOIN sdss_dr13.specobj as S
      ON L.specobjid = S.specobjid
   WHERE L.ra BETWEEN 126 and 131
         and L.dec BETWEEN 7.0 and 12.0
         and (L.ra_ivar > 0)
ORDER BY L.ra_ivar
   LIMIT 1000;

In this JOIN, the column L.specobjid is indexed and so is S.specobjid. Also, this is a 1-to-1 join, meaning there is exactly 1 row for in both tables for a mathcing specobjid. Hence, the query is optimal for the JOINs.

On the other hand, if specobjid were not indexed in either table, the database executor for this query would have to read every row in both tables, about 4 million each. This is 4000 times more rows than necessary.

(Note: we still have the LIMIT 1000 in there just in case as discussed above.)

If this were a many-to-many join, the number of rows to scan can be the square of the number of rows in worst cases.

In summary for best performance, * join only on indexed columns * make sure your joins are not many-to-many but one-to-one

1.11.3. ORDER BY Trick

Again looking at the query we talked about above in the discussion of JOINs (reshown below), the LIMIT 1000 may look like the database needs only to scan 1000 rows. It is more accurate to say that LIMIT means return only the specified number of rows from the results. The results may require scanning many more rows, we do not always know.

In our example, the ORDER BY forces the database to generate the entire results regardles of the LIMIT.

SELECT   L.ra, L.dec, S.z, S.plug_ra, S.plug_dec, S.class,
         S.vdisp, S.vdisp_err
FROM     ls_dr3.dr3_specobj_dr13 as L
    JOIN sdss_dr13.specobj as S
      ON L.specobjid = S.specobjid
   WHERE L.ra BETWEEN 126 and 131
         and L.dec BETWEEN 7.0 and 12.0
         and (L.ra_ivar > 0)
ORDER BY L.ra_ivar
   LIMIT 1000;

The Trick We re-write the query as shown below.

SELECT * FROM

   (SELECT L.ra, L.dec, S.z, S.plug_ra, S.plug_dec, S.class,
           S.vdisp, S.vdisp_err, /* added: */ L.ra_ivar
    FROM     ls_dr3.dr3_specobj_dr13 as L
    JOIN sdss_dr13.specobj as S
      ON L.specobjid = S.specobjid
   WHERE L.ra BETWEEN 126 and 131
         and L.dec BETWEEN 7.0 and 12.0
         and (L.ra_ivar > 0)
   LIMIT 1000
   ) as x

ORDER BY x.ra_ivar;

We force the database to generate the results of the first 1000 rows before it performs the ORDER BY utilizing an inline view. If we had simply put the LIMIT 1000 before the ORDER BY without the inline view it would have caused a syntax error. Note we had to add the L.ra_ivar column indicated by the SQL comment /* added: */ so the outer SELECT could see it. (You don’t have to add the SQL comment, it’s there for our demonstration).

This runs in less than half the time as the original in our example.

1.11.4. Learning more about SQL

Khan Academy Intro to SQL: Querying and managing.

Online SQL Tutorial; learn the mechanics interactively.

Try some other online exercises geared to PostgreSQL pgexercises.

PostgreSQL SQL Lanuage from the online manual The SQL Language.