1.1.3. Survey datasets

Most Data Lab workflows will begin by posing a question for which the first step in getting an answer will involve a query of one or more survey catalog datasets. In the Data Lab, catalogs are stored in databases, with any given catalog consisting of a number of separate but linked tables. These tables are accessed via Structured Query Language (SQL) or its variant, Astronomical Data Query Language (ADQL). From the beginning, users are thus presented with a set of challenges:

  • Learning what measurements the tables from a given survey dataset contain and what they are named
  • Learning how to construct a database query that will retrieve all of the measurements needed for a given question
  • If measurements from more than one table or more than one survey are needed, learning how to join tables in such a way that all of the information is retrieved
  • For complex questions in particular, learning how to optimize the database query for performance

For many users, the first step in answering a question through the Data Lab will thus be to learn about the particular datasets that it contains.

1.1.3.1. What kinds of datasets does the Data Lab contain?

1.1.3.1.1. Core datasets

These are large, high-value datasets served by the Data Lab, often with value-added data such as pre-computed columns or crossmatches against external tables. Tables are optimized and indexed to support the most common science cases. Examples of current and upcoming datasets are DECaLS and the DESI Targeting Surveys, DES, the DESI survey, and the pixel data contained in the NOAO Science Data Archive.

1.1.3.1.2. Hosted datasets

These are smaller-scale Survey Team or PI datasets: high-level data product collections delivered by users who want to share their data via Data Lab services. They are relatively static in terms of release frequency and versions, but require some level of Data Lab operational support to be made available to community users. Examples are SMASH, the NEO Survey (PI Allen), and the Galactic Bulge Survey (coming soon; PI Saha).

1.1.3.1.3. Reference datasets

These are large external datasets, mirrored through the Data Lab because of their value as photometric, spectroscopic, or astrometric references. Examples are SDSS, WISE, GAIA, and USNO A/B.

1.1.3.2. Guidance on understanding table schema

Given the variety of datasets available through the Data Lab database, learning how to identify the tables and table columns of interest can be a challenge. There are several tools to help with this:

  • The Data Lab query webpage contains a schema browser through which you can browse the available datasets, their tables, and the column descriptions.
  • The datalab command has a schema method that will display the schema and table descriptions.
  • The Survey Data webpage contains full dataset descriptions and links to survey documentation.
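
In addition, if the query interface exposes the standard SQL information_schema (an assumption; the schema browser above is the supported route), a query like this sketch lists a table's columns directly:

SELECT column_name, data_type
  FROM information_schema.columns
 WHERE table_schema = 'smash_dr1'
   AND table_name = 'object'
 ORDER BY ordinal_position;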

In general, the survey datasets hosted by the Data Lab contain a few kinds of tables:

  • Overview: These are tables that provide summary information about the survey, such as the spatial organization of the catalog data. These tables generally have many fewer rows than the main catalog tables, as they do not contain individual objects.
  • Object: These are typically the main catalog tables, and contain aggregated information for the astronomical objects identified by the survey. There are often views of these main tables that apply a constraint to yield subsets of objects with similar properties, e.g. the star and galaxy views of DECaLS DR3. The object tables are sometimes broken into several tables, each with different columns of information but joined by a unique object identifier.
  • Measurement: These are typically tables containing time-stamped individual measurements of objects in the main catalog tables, in general organized by having one row for every individual epoch of every individual object. While the number of columns in these tables is typically smaller than for the object tables, the number of rows can be much larger, and thus care should be exercised when pulling data from them.
  • Neighbors: These are specialized tables that contain information on all the internal spatial matches within a specified radius of all objects in the object table. Depending on the density of the objects on the sky and matching radius, these tables can be very large.
  • Crossmatch: These tables typically contain the spatially matched cross-identifications of the main object table with object catalogs from one or more external surveys.
  • Exposure: These tables typically contain metadata, such as calibration information, airmass, etc., for every individual exposure taken during the survey. By joining these tables through the measurement and object tables, users can assign these metadata values to their objects of interest (see the sketch after this list).
  • Chip: These tables are similar to the Exposure tables, but contain metadata relevant to the individual chips in the mosaics that make up the exposures, e.g. chip-dependent photometric calibration information.
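
As an example of attaching exposure metadata to catalog objects, here is a sketch for SMASH. The expnum join key follows the SMASH examples later in this section, while the airmass column name is an assumption based on the exposure-table description above:

SELECT o.ra, o.dec, e.airmass
  FROM smash_dr1.exposure e
 INNER JOIN smash_dr1.object o ON e.expnum = o.expnum
 WHERE o.fieldid = '169';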

The tables below organize the database tables for the core and reference catalogs hosted by the Data Lab by their type. Table views are listed, indented, underneath the primary table from which they are drawn.


Table of Tables: DECaLS, SMASH, and GAIA


Table Type  | DECaLS (DR3)         | SMASH (DR1)     | GAIA (DR1)
            | root: ls_dr3         | root: smash_dr1 | root: gaia_dr1
------------+----------------------+-----------------+------------------------------------------------------
Overview    | bricks               | field           |
            | bricks_dr3           |                 |
            | depth                |                 |
            | depth_summary        |                 |
Object      | tractor              | object          | gaia_source
            |   tractor_primary    |   stars         | tgas_source
            |   tractor_secondary  |   galaxies      | variable_summary
            |   star               |                 |   rrlyrae
            |   galaxy             |                 |   cepheid
            | apflux               |                 | phot_variable_time_series_gfov_statistical_parameters
            | wise                 |                 |
Measurement |                      | source          | phot_variable_time_series_gfov
Neighbors   | neighbors            |                 |
Crossmatch  | dr3_dr12q            | xmatch          |
            | dr3_dr7q             |                 |
            | dr3_specobj_dr13     |                 |
            | dr3_superset_dr12q   |                 |
Exposure    |                      | exposure        |
Chip        | survey_ccds          | chip            |
            | ccds_annotated       |                 |

Table of Tables: DES SVA1, SDSS DR13, NEO, and USNO


Table Type  | DES (SVA1)             | SDSS (DR13)            | NEO (DR1)     | USNO
            | root: des_sva1         | root: sdss_dr13        | root: neo_dr1 | root: usno
------------+------------------------+------------------------+---------------+-----------
Overview    |                        |                        |               |
Object      | gold_catalog           | specobj                | movgrp        | a2
            | redmapper_exp_catalog  | photoposplate          | movmpc        | b1
            | redmapper_pub_catalog  | calibobj_star          |               |
            | (and many subtables)   | calibobj_gal           |               |
            |                        | dr12qso                |               |
            |                        | emissionlinesport_dr12 |               |
            |                        | galspecline_dr8        |               |
Measurement |                        |                        | movds         |
            |                        |                        | movobs        |
Neighbors   |                        |                        |               |
Crossmatch  |                        | wisematch_gal          |               |
            |                        | wisematch_star         |               |
Exposure    |                        |                        | movexp        |
Chip        |                        |                        |               |

1.1.3.3. Guidance on constructing queries

To turn your science question into an experiment using the data hosted by the Data Lab, you need to:

  • design a basic workflow
  • construct an SQL/ADQL query to retrieve the data that you need

For example, to answer the question, “Are there faint Milky Way dwarf galaxies yet to be discovered in SMASH or DECaLS?”, you might adopt the technique of looking for overdensities of blue, potentially metal-poor, stars in the SMASH and DECaLS catalogs. The columns you would retrieve would at minimum contain RA, Dec, and magnitudes in the survey bands. You would place a constraint on the color of the objects, make a cut on the shape of the objects to retrieve only point sources, and make any data quality cuts suggested by the surveys to limit the number of spurious objects. To identify the tables/views and columns to use, you would browse the table schema following the guidance in the schema section above.

Retrieving all blue stars from SMASH and DECaLS would likely return a very large number of rows and take a long time. Thus, as the first step in a workflow, you should design a query that works on a limited area of sky, e.g. a single brick in DECaLS or a single field in SMASH. Here is the query from the notebook demonstrating the discovery of the Hydra II dwarf galaxy (Martin et al. 2015) in SMASH:

field = 169 # SMASH field number to query

# Create the query string; SQL keywords capitalized for clarity
#   depthflag > 1 = no short exposures, please
#   ndetr, ndetg > 3 = more than 3 detections in the r & g bands
#   abs(sharp) < 0.5 = avoid broad objects
query = """SELECT ra,dec,gmag,rmag,imag
           FROM smash_dr1.object
           WHERE fieldid = '%d' AND
               depthflag > 1 AND
               ndetr > 3 AND ndetg > 3 AND
               abs(sharp) < 0.5 AND
               gmag BETWEEN 9 AND 25 AND
               (gmag-rmag) BETWEEN -0.4 AND 0.4""" % field
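
To execute the query from a notebook, you might hand the string to the Data Lab query client. This is a minimal sketch, assuming the dl.queryClient module from the Data Lab Python client and pandas are available; check the client documentation for the exact signature:

# Sketch: submit the query synchronously and parse the CSV result.
# Assumes the Data Lab Python client (dl.queryClient) and pandas.
from io import StringIO
import pandas as pd
from dl import queryClient

result = queryClient.query(sql=query, fmt='csv')   # CSV text of matching rows
df = pd.read_csv(StringIO(result))                 # parse into a DataFrame
print(len(df), 'objects returned')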

The scientific analysis would then proceed using the data returned from this query.

For more guidance on constructing queries, and for references on learning SQL, see the SQL Basics and Gotchas page.

1.1.3.4. Joins on tables in the Data Lab

For some queries, you will want to retrieve information from more than one table in a survey dataset, or retrieve information across survey datasets, e.g. to identify objects in DECaLS that have spectra in SDSS. To do this, you will need to join one table against one or more other tables in the Data Lab. See SQL Basics and Gotchas for links that explain the various ways in which tables can be joined (Khan Academy has a particularly nice tutorial), and some warnings about sizes when performing joins on large tables.

To join tables, you need to identify the keys in the tables on which you will perform the table match. For tables in the Data Lab, these are typically one or more columns of object IDs. These Entity-Relationship Diagrams (ERDs) for DECaLS/SDSS and for SMASH identify the keys to use in joining tables:

DECaLS/SDSS ERD
SMASH Survey ERD
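
For instance, to pull SDSS redshifts for DECaLS objects, you might join the DECaLS tractor table to the SDSS specobj table through the dr3_specobj_dr13 crossmatch table. This is a sketch only; the key column names (id, specobjid) are assumptions to be verified against the ERDs and the schema browser:

SELECT t.ra, t.dec, s.z
  FROM ls_dr3.tractor t
 INNER JOIN ls_dr3.dr3_specobj_dr13 x ON x.id = t.id
 INNER JOIN sdss_dr13.specobj s ON s.specobjid = x.specobjid
 WHERE t.ra BETWEEN 150.0 AND 150.5
   AND t.dec BETWEEN 2.0 AND 2.5;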

1.1.3.5. Subqueries

If you see two or more SELECTs in a query, it has a subquery in it. Some call this a nested query.

SELECT ra, dec FROM smash_dr1.exposure e
WHERE e.expnum IN ( SELECT expnum FROM smash_dr1.object o
                    INNER JOIN smash_dr1.source s
                    ON o.id = s.id
                    WHERE s.chi > 17.0 );

We take the results of one query, the subquery, and feed them into the outer query.

Besides IN, the =, >, <, ANY, and EXISTS operators are also supported; note that the comparison operators (=, >, <) require the subquery to return a single value.
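
For example, the query above can also be phrased as a correlated subquery with EXISTS (a sketch, using the same tables and join keys):

SELECT ra, dec FROM smash_dr1.exposure e
 WHERE EXISTS ( SELECT 1 FROM smash_dr1.object o
                INNER JOIN smash_dr1.source s
                ON o.id = s.id
                WHERE o.expnum = e.expnum
                  AND s.chi > 17.0 );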

It is possible to change this query to join smash_dr1.exposure to smash_dr1.object instead of using a subquery. Either way should perform equally well if the database optimizer is working correctly.

Many joins can be rewritten as subqueries, and many, though not all, subqueries can be rewritten as joins.

Here is the subquery written as a join.

SELECT ra, dec FROM smash_dr1.exposure e
INNER JOIN smash_dr1.object o
        ON e.expnum = o.expnum
INNER JOIN smash_dr1.source s
        ON o.id = s.id
     WHERE s.chi > 17.0;

Here is a subquery that cannot be rewritten as a join (note the MAX() aggregate function).

SELECT expnum FROM smash_dr1.source s
 WHERE s.chi = ( SELECT MAX(chi)
                   FROM smash_dr1.source s2 );

1.1.3.6. Optimizing queries

  • Most of the time, if you reduce the number of rows the query “scans”, you will get your results back faster.

    See the discussion on the LIMIT clause in SQL Basics and Gotchas
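
    For example, while developing a query you can preview just a handful of rows (a sketch against the SMASH object table):

SELECT id, ra, dec, gmag
  FROM smash_dr1.object
 LIMIT 10;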

  • Instead of SELECT *, select only the specific columns of interest. This holds in general: instead of selecting 20 columns, select only the ones you need, and if two columns would serve equally well, pick the one that is smaller in size.

  • If the query JOINs one or more tables, be sure the columns you are joining on are indexed.

    To find out whether a column is indexed, one way is to look up the table in the schema browser on the Data Lab query webpage and see whether the column name is shown in bold.

  • To take advantage of indexed columns, be sure not to wrap them in functions such as CAST(), ABS(), or any other function call.

SELECT expnum, fieldid  FROM smash_dr1.exposure e
 WHERE CAST(e.expnum AS INTEGER) > 187566;

Because CAST() wraps e.expnum, the index on expnum (if one exists) will be ignored and the query will be slower. (The effect in this example would be small, since the exposure table has only 5,809 rows; this relates to the first point above about reducing the number of rows the query scans.)
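
Dropping the CAST() lets the database use the index on expnum directly:

SELECT expnum, fieldid
  FROM smash_dr1.exposure e
 WHERE e.expnum > 187566;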

  • When querying an indexed column, adding WHERE conditions on other indexed columns will most often return results more quickly than filtering on a single indexed column alone.

Consider this abstracted SELECT:

SELECT <<something>> FROM <<something>> WHERE
<<condition 1>> AND
<<condition 2>> AND
<<condition ...>> AND
<<condition N>>;

Generally, the more conditions you can provide on indexed columns, the better.
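
For example, constraining a second indexed column narrows the scan further (a sketch, assuming both expnum and fieldid are indexed in the exposure table):

SELECT expnum, fieldid
  FROM smash_dr1.exposure e
 WHERE e.expnum > 187566
   AND e.fieldid = '169';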

  • Sorting is a relatively expensive operation. If you do not need to order the results, removing the ORDER BY clause will greatly speed up the query. (GROUP BY also forces a sort.)
  • Indexing a column that has only one value across all rows is a waste, since every row satisfies any condition on it (the database still has to scan through all the rows). Thus, a WHERE clause on an indexed column with many unique values will perform better than one on an indexed column with few unique values.
  • If you search for values in a non-indexed column, the database will scan through every row. In that case, use a CSTORE table instead: look for a table name ending in _cs, which indicates a CSTORE table. CSTORE is a columnar storage format that performs better for queries on non-indexed columns. (Technically, CSTORE tables do use an index, but one of a very different type.)

Here is a query against an unindexed column of a non-CSTORE table.

SELECT ra, dec FROM ls_dr3.tractor t
 WHERE g_r BETWEEN 0.3 AND 1.1;

Here, because g_r is not indexed (as of this writing), the system will scan over 330 GB of data, requiring more than 10 minutes to run. (The entire tractor table contains over 524 million rows.)

If an ls_dr3.tractor_cs table exists, the _cs suffix indicates it is a CSTORE table. The same query against ls_dr3.tractor_cs might run in less than a minute:

SELECT ra, dec FROM ls_dr3.tractor_cs t
 WHERE g_r BETWEEN 0.3 AND 1.1;

Why not make everything a CSTORE table? The main reason, as of this writing, is that CSTORE does not support function-based indexes, which are critical for high-speed positional queries. We look forward to this capability becoming available in the future.
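
For illustration, a high-speed positional query relies on such a function-based index. This sketch assumes the Q3C extension, whose q3c_radial_query(ra, dec, center_ra, center_dec, radius_deg) function is common on PostgreSQL-based astronomy databases; here it selects objects within a 0.1-degree cone around an arbitrary position:

SELECT ra, dec, gmag
  FROM smash_dr1.object
 WHERE q3c_radial_query(ra, dec, 180.0, -30.0, 0.1);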

  • If you are exploring the data, try reducing the number of calculations, or use simpler ones: division is more time-consuming than multiplication, and some divisions can be rewritten as multiplications (e.g. x/10 = x * 0.1).
  • Avoid string or text manipulation, as these are CPU-intensive operations.
  • If you have followed all of the suggestions above and the query is still slow, you may have hit an optimization wall, where performance is limited by the hardware. Look for more options in the future to run queries in parallel across multiple CPUs and nodes. Presently, parallel query support is in place and works well for aggregate queries on large tables, though its use is still limited; future versions will support parallel operation on even more queries.