Changes between Initial Version and Version 1 of QueryTuning


Ignore:
Timestamp:
07/27/12 09:06:52 (12 years ago)
Author:
joe
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • QueryTuning

    v1 v1  
     1= SQL Query Construction = 
     2 
     3This is collection of notes of common things to consider when writing SQL queries.  From my ([People/JoeHourcle Joe Hourclé's]) experience in dealing with solar physics archives since 2003, I've seen a downturn in Oracle databases, and an upturn in PostgreSQL, but mySQL is the most prevalent database in use in our field.  I have yet to see any of the 'noSQL' solutions in use in solar physics, and I don't believe that the 'document store' model will take off in solar physics until we have some sort of reliable identifiers for individual objects within the databases.  (see my presentations from the 2008 AGU Fall Meeting, [http://vso1.nascom.nasa.gov/vso/misc/AGU2008_DataRelationships.ppt  Data Relationships: Towards a Conceptual Model of Scientific Data Catalogs] and [http://vso1.nascom.nasa.gov/vso/misc/AGU_2008_provenance_2.pdf What Are We Tracking ... and Why?]) 
     4 
     5== A Note on Database Times == 
     6 
     7You should always set your database so it's using UTC (or TAI), not the local system time zone.  Otherwise, you need to specify all operations to be converted to UTC to avoid Daylight Saving Time.  UTC still has the issue of leap seconds, and neither mySQL nor PostgreSQL adjust for leap seconds at this time.  If you're dealing with helioseismology or other analysis that is highly sensitive to leap seconds, you'll need to store seconds since a given epoch, rather than using the database's `DATETIME` field type. 
     8 
     9== Dealing with Multiple Spectral Units == 
     10 
     11The easiest way to deal with spectral units is to store two values in the table you're searching.  One is used for presentation back to the user, with the appropriate units for that data.  The second is a pre-computed value in consistent units through the whole table. 
     12 
     13For example, for STEREO, in the table that I use for VSO searches, I have: 
     14 
     15{{{ 
     16mysql> desc stereo.vso_view; 
     17+------------+----------------------+------+-----+---------+-------+ 
     18| Field      | Type                 | Null | Key | Default | Extra | 
     19+------------+----------------------+------+-----+---------+-------+ 
     20... 
     21| wavemin    | float                | YES  | MUL | NULL    |       |  
     22| wavemax    | float                | YES  | MUL | NULL    |       |  
     23| wavemin_d  | float                | YES  |     | NULL    |       |  
     24| wavemax_d  | float                | YES  |     | NULL    |       |  
     25| waveunit   | varchar(8)           | YES  |     | NULL    |       |  
     26+------------+----------------------+------+-----+---------+-------+ 
     27}}} 
     28 
     29The fields `wavemin_d`, `wavemax_d` and `waveunit` are the display values -- those are what I return in response to queries.  The `wavemin` and `wavemax` are normalized to Angstroms, and all incoming queries are converted to Angstroms before searching the table.   In this way, I can quickly search the table no matter if it's in Angstrom (SECCHI), keV (PLASTIC), or Hz (SWAVES).  I've normalized on Angstrom only because I have more SECCHI data than IMPACT, PLASTIC and SWAVES combined.  You can normalize on any spectral unit or make up your own.  (I once suggested log of meters, but I've never actually done it) 
     30 
     31When we query, we can then use one set of terms in the `WHERE` clause, but return the other set: 
     32 
     33{{{ 
     34SELECT 
     35  wavemin_d as wavemin, 
     36  wavemax_d as wavemax, 
     37  waveunit 
     38FROM 
     39  stereo.vso_view 
     40WHERE 
     41      ( wavemin >= ? AND wavemax <= ? ) 
     42  AND ... 
     43}}} 
     44 
     45== Querying on Range Matches == 
     46 
     47As we attempt to return too much data rather than too little, when we try to match a range against another range, we return values where there is an intersection.  If we've given the range 'A to B' in the search query, and our records have ranges from 'x to y', our matching options are: 
     48 
     49{{{ 
     50 
     51  Interval Matching 
     52 
     53     .........A============B......... Search Range 
     54     ...x--y..|............|......... (never match) 
     55     .........|............|...x--y.. (never match) 
     56     ......x--|--y.........|......... (1) 
     57     ......x--|------------|---y..... (2) 
     58     .........|...x-----y..|......... (3) 
     59     .........|..........x-|----y.... (4) 
     60 
     61    Intersection 
     62        A <= y AND B >= x 
     63        matches examples 1,2,3,4 
     64 
     65    (item fully contained by search range) 
     66        A <= x AND B >= y 
     67        matches example 3 only. 
     68 
     69    (item fully includes search range) 
     70        A >= x AND B <= y 
     71        matches example 2 only. 
     72}}} 
     73 
     74The problem is that for most SQL databases, these queries may not use the index (and if they do, they might actually give worse performance than a full table scan).   At first glance, you'd think that a multi-column index will work (I know I did), but it won't in the case of unbounded inequality searches like we have here, so you're typically better having the min & max independently indexed. 
     75 
     76... but we can also do a trick if we know the maximum value of `y - x` (eg, if `x` is `DATE_OBS`, and `y` is `DATE_END`, what the longest duration of any of the records being indexed.)  In the case of STEREO, I know that the longest files are 1 day, so know that the DATE_END will always fall within 2 days of DATE_OBS: 
     77 
     78{{{ 
     79   $sql .= 'AND ( ob.date_obs BETWEEN (? - INTERVAL 2 DAY)  and ? ) AND ob.date_end >= ? '; 
     80   push (@bind_vars, $query{'time'}{'start'},$query{'time'}{'end'},$query{'time'}{'start'}); 
     81}}} 
     82 
     83This allows the database to have a bounded range when searching for the database field `ob.date_obs`, for which it can use an index efficiently, and then further winnow down the result based on `ob.date_end >= ($query{'time'}{'start'}) which does not need to be indexed.