Last modified 10 years ago Last modified on 07/27/12 09:06:52

SQL Query Construction

This is collection of notes of common things to consider when writing SQL queries. From my (Joe Hourclé's) experience in dealing with solar physics archives since 2003, I've seen a downturn in Oracle databases, and an upturn in PostgreSQL, but mySQL is the most prevalent database in use in our field. I have yet to see any of the 'noSQL' solutions in use in solar physics, and I don't believe that the 'document store' model will take off in solar physics until we have some sort of reliable identifiers for individual objects within the databases. (see my presentations from the 2008 AGU Fall Meeting, Data Relationships: Towards a Conceptual Model of Scientific Data Catalogs and What Are We Tracking ... and Why?)

A Note on Database Times

You should always set your database so it's using UTC (or TAI), not the local system time zone. Otherwise, you need to specify all operations to be converted to UTC to avoid Daylight Saving Time. UTC still has the issue of leap seconds, and neither mySQL nor PostgreSQL adjust for leap seconds at this time. If you're dealing with helioseismology or other analysis that is highly sensitive to leap seconds, you'll need to store seconds since a given epoch, rather than using the database's DATETIME field type.

Dealing with Multiple Spectral Units

The easiest way to deal with spectral units is to store two values in the table you're searching. One is used for presentation back to the user, with the appropriate units for that data. The second is a pre-computed value in consistent units through the whole table.

For example, for STEREO, in the table that I use for VSO searches, I have:

mysql> desc stereo.vso_view;
| Field      | Type                 | Null | Key | Default | Extra |
| wavemin    | float                | YES  | MUL | NULL    |       | 
| wavemax    | float                | YES  | MUL | NULL    |       | 
| wavemin_d  | float                | YES  |     | NULL    |       | 
| wavemax_d  | float                | YES  |     | NULL    |       | 
| waveunit   | varchar(8)           | YES  |     | NULL    |       | 

The fields wavemin_d, wavemax_d and waveunit are the display values -- those are what I return in response to queries. The wavemin and wavemax are normalized to Angstroms, and all incoming queries are converted to Angstroms before searching the table. In this way, I can quickly search the table no matter if it's in Angstrom (SECCHI), keV (PLASTIC), or Hz (SWAVES). I've normalized on Angstrom only because I have more SECCHI data than IMPACT, PLASTIC and SWAVES combined. You can normalize on any spectral unit or make up your own. (I once suggested log of meters, but I've never actually done it)

When we query, we can then use one set of terms in the WHERE clause, but return the other set:

  wavemin_d as wavemin,
  wavemax_d as wavemax,
      ( wavemin >= ? AND wavemax <= ? )
  AND ...

Querying on Range Matches

As we attempt to return too much data rather than too little, when we try to match a range against another range, we return values where there is an intersection. If we've given the range 'A to B' in the search query, and our records have ranges from 'x to y', our matching options are:

  Interval Matching

     .........A============B......... Search Range
     ...x--y..|............|......... (never match)
     .........|............|...x--y.. (never match)
     ......x--|--y.........|......... (1)
     ......x--|------------|---y..... (2)
     .........|...x-----y..|......... (3)
     .........|..........x-|----y.... (4)

        A <= y AND B >= x
        matches examples 1,2,3,4

    (item fully contained by search range)
        A <= x AND B >= y
        matches example 3 only.

    (item fully includes search range)
        A >= x AND B <= y
        matches example 2 only.

The problem is that for most SQL databases, these queries may not use the index (and if they do, they might actually give worse performance than a full table scan). At first glance, you'd think that a multi-column index will work (I know I did), but it won't in the case of unbounded inequality searches like we have here, so you're typically better having the min & max independently indexed.

... but we can also do a trick if we know the maximum value of y - x (eg, if x is DATE_OBS, and y is DATE_END, what the longest duration of any of the records being indexed.) In the case of STEREO, I know that the longest files are 1 day, so know that the DATE_END will always fall within 2 days of DATE_OBS:

   $sql .= 'AND ( ob.date_obs BETWEEN (? - INTERVAL 2 DAY)  and ? ) AND ob.date_end >= ? ';
   push (@bind_vars, $query{'time'}{'start'},$query{'time'}{'end'},$query{'time'}{'start'});

This allows the database to have a bounded range when searching for the database field ob.date_obs, for which it can use an index efficiently, and then further winnow down the result based on `ob.date_end >= ($query{'time'}{'start'}) which does not need to be indexed.