wiki:drmsExportDebug

Debugging VSO / NetDRMS / drms_export.cgi Problems

This is an attempt to explain how the VSO / JSOC integration works, and steps to take to debug when someone says they "can't download an AIA image".

(note that when I talk about modules and prefix them with "VSO", it's really "Physics::Solar::VSO", and "JSOC" is shorthand for "Physics::Solar::VSO::DataProvider::JSOC")

How the VSO queries the JSOC

We actually don't, for a "VSO Query". We use slony1 replication to get copies of certain tables from their database. The VSO replicas have triggers which build the 'vso shadow tables', which are de-duplicated (unjournaled?), stripped of fields that we don't care about, and have some pre-computed fields that we use (like "dynamical time" converted to SQL datetime fields). We then run searches against those. But it's fairly complicated, because for increased availability:

  1. There are multiple "NetDRMS sites" that have some software installed (the "JSOC-Lite data provider") that contains the CGI "drms_export.cgi" and the module "JSOC::Registry", which has information about all of the JSOC series that the VSO serves. The CGI accepts requests, checks them against the registry, and then decodes them to build a request to the DRMS program "drms_export_cgi" (note there's an underscore, not a period), which was forked from another program (jsoc_export_as_fits ?) but generates CGI headers and streams the output rather than writing a file out to the local disk.
  1. There are two instances of the full "JSOC data provider", which have the database replicas and run the actual queries. The main one is at the SDAC, and there's a failover at NSO. For "Query" operations, they act like the VSO, but instead of farming out requests to multiple data providers and merging the results, they farm the requests out to modules that each know about a specific DRMS series.

Within the full providers, there is a module "JSOC::Sites" which will expand requests for a specific named site into a URL. Some of the JSOC::Series modules also have hooks to push requests to specific sites (to deal with some problems we were seeing with serving HMI.*_720s series, and an attempt both to distribute traffic evenly and to push traffic to the site that was trying to collect more of a given type of data). But it's important to note that the "JSOC::Sites" at SDAC and NSO are intentionally different -- NSO does *not* send traffic to the SDAC, as there's an assumption that if we've failed over to it, then something is wrong at the SDAC.

Because of all of this, when someone "can't download" a file, we have to track down which VSO instance they went through, which may affect what JSOC provider instance they called, which in turn affects what URL they were given to download. For the web client, they're given a bunch of links, so we can just look at that. For IDL, we can run the vso_get() call with /debug so we can get the list of URLs without downloading them.

It's relatively rare for something to go wrong with the URL generation, unless of course a site has gone offline and needs to be removed or redirected temporarily. The typical problem is that they go to retrieve the URL, and either time out or get a 500 error message.

If you have the URL, you can just try retrieving it and see what happens. It's important to note that due to how NetDRMS works, at periods of high load it's quite possible to get "transient errors" that just go away over time, but it's also possible that there are problems due to over-aggressive mod_security rules that affect only certain download clients, so just because "it works for me" doesn't mean the problem is fixed.
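
A quick way to do that test by hand is something like the following sketch; the output file location is arbitrary, and the second request just fakes a different client to help spot mod_security rules that only bite certain download tools:

# fetch the exact URL the user was given; -w prints the status code and how long it took
curl -sS -o /tmp/test.fits -w 'HTTP %{http_code} in %{time_total}s\n' "$URL"

# repeat with a different User-Agent to see if only some clients are being blocked
curl -sS -o /dev/null -w 'HTTP %{http_code}\n' -A 'Wget/1.20' "$URL"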

How drms_export.cgi works

The URL will end in something similar to:

drms_export.cgi?series=aia__lev1;compress=rice;record=1700_1234000866-1234000866

The CGI unpacks the information in the QUERY_STRING and first checks with the JSOC::Registry whether the 'series' is one that it knows about. If it is, then it gets information about how to parse the 'record' field, and how to use that information to build a "DRMS Recordset Name". For AIA, the format is "(wavelength)_(t_rec_index)-(t_rec_index)", so it can ask for a range of records at once. HMI series are just "(t_rec_index)-(t_rec_index)".
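
As a concrete illustration of how the 'record' field is put together (the values are taken from the example URL above; the HMI series token is a placeholder -- the real names come from JSOC::Registry):

# AIA: record is (wavelength)_(first t_rec_index)-(last t_rec_index)
WAVE=1700; FIRST=1234000866; LAST=1234000866
echo "drms_export.cgi?series=aia__lev1;compress=rice;record=${WAVE}_${FIRST}-${LAST}"

# HMI series drop the wavelength, so record is just (first)-(last)
echo "drms_export.cgi?series=<hmi series token>;compress=rice;record=${FIRST}-${LAST}"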

Occasionally, we get rogue sites trying to mirror data by asking for *every* t_rec_index. But there's currently only one AIA image at a given wavelength for roughly every 12th t_rec_index (white light images are less frequent, and early in the mission it was 1 in every 10 for EUV images). So this will trigger a *lot* of errors. The usual procedure is to ban the site using IPTables or similar, and wait to see who contacts us so we can get them to clean up their scripts.
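
One way to do the temporary ban, if you're doing it with iptables directly (the address is a placeholder; adjust to however your site normally manages firewall rules):

# drop everything from the offending address until they get in touch
iptables -I INPUT -s 192.0.2.10 -j DROP

# and to lift the ban later
iptables -D INPUT -s 192.0.2.10 -j DROP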

If the CGI doesn't like the series name, or doesn't like the other parameters that were passed in, it'll return an HTML message of "ERROR PROCESSING REQUEST", log the reason to the HTTP Error log with a prefix of "drms_export : ABORTING", and *not* hand off to drms_export_cgi.
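
So when a request dies this way, the reason should be in the error log; the path below is the usual CentOS/Apache location, adjust for your site:

grep 'drms_export : ABORTING' /var/log/httpd/error_log | tail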

The program drms_export_cgi is a "NetDRMS" program -- it attempts to retrieve the requested "DRMS RecordSet", which means it needs access to DRMS and SUMS. As it's a fork of jsoc_export_as_fits, which wasn't quite meant to be run this way, it writes a bit of spam to STDERR, which ends up in the HTTP error log. Much of it is crap, but it will write out the command that was sent, with arguments. You can then take this command, wrap the arguments in single quotes, and try it from the command line (first as yourself or the production user, then su to the webserver user and test it again).
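
Something along these lines, as a sketch -- the log path and the 'apache' account are the common defaults, and the drms_export_cgi arguments are whatever the error log actually shows (copy them, don't guess them):

# find the logged invocation
grep drms_export_cgi /var/log/httpd/error_log | tail -n 5

# then get a shell as the webserver user (whose login shell is usually disabled)
# and paste the command back in with each argument wrapped in single quotes
su -s /bin/bash apache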

If the data *isn't* available in SUMS, it will trigger the "load from tape" functionality, which at the remote sites will cause a "USER" priority request to the JMD. (and I should probably look over Niles' notes on the wiki:jmd at some point, as that's more than I really want to explain in here)

The most common error is that the JMD transfers are running slowly, so SUMS doesn't have the data before the web client times out. This will show up as a '500' error in the HTTP access logs. (I attempted to make a smarter wrapper to deal with SIGPIPE and return a 503 error instead for this, but never got it working reliably)
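
A quick way to get a feel for how widespread the timeouts are (assuming the usual Apache combined log format and log location):

# count today's drms_export.cgi requests by status code; a pile of 500s usually
# means the JMD queue is backed up
grep drms_export.cgi /var/log/httpd/access_log | awk '{print $9}' | sort | uniq -c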

Basically, what happens is this:

  1. someone makes a request for a lot of HMI images, in parallel
  2. the JMD goes to download those images, but they're packed multiple per unit, so we have to retrieve really large files from the JSOC
  3. those requests clog the queue until they've finished downloading

If a given SU (storage unit; a set of files) has trouble downloading (failed scp for whatever reason, taken offline at the JSOC, etc.), then the JMD will write out a line marked "FAIL" in its log, and will change the status field in the derby database that it uses to track its queue. It will then not reattempt that particular SU for a few minutes ... and after multiple fails, only tries about once an hour or so.

The actual debugging

Do whatever you can to figure out what URL they're trying to retrieve. This tells us both what site they're trying to download from, and what image(s) they're trying to download. Once you have that URL:

  1. Connect to the DRMS database, and find out what 'sunum' (storage unit number) is associated with that series & t_rec_index:

select sunum from aia.lev1 where wavelnth = ? and t_rec_index = ?;
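
For the example URL above, that works out to something like the following; the host and database name are placeholders for however you normally reach the local DRMS database:

psql -h localhost -d jsoc -c \
  "select sunum, t_rec_index from aia.lev1 where wavelnth = 1700 and t_rec_index between 1234000866 and 1234000866;"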

  1. Connect to the SUMS database to find out if the data is at the local site. Note that 'ds_index' is the same as 'sunum', as chunks of the internals are based upon the 'dsds' system used for SOHO and they used different variable names for things. Also note that I don't remember what the different values for 'status' are. (one is written when sum_alloc is run, then the data gets transferred, and it's changed when sum_put is run to mark it as available. I *think* that available is "2".) You might also run into a strange case where sum_partn_alloc has multiple entries for the same SUNUM (if sum_alloc is run multiple times because an scp fails and it re-assigns it to a new partition, but I think Art's merging of sum_alloc and sum_put will now clean this up)

select wd, status from sum_partn_alloc where ds_index = ?;
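
Run it the same way as the DRMS query; the SUMS database name is a placeholder (it varies between sites), and $SUNUM is the value found in the previous step:

psql -h localhost -d jsoc_sums -c \
  "select wd, status from sum_partn_alloc where ds_index = $SUNUM;"
# more than one row back usually means sum_alloc ran more than once for this SU;
# the row whose wd actually exists on disk is the one that matters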

  1. Check to make sure that there's a directory where there should be one (wd from the query above), that it's not empty, and that the permissions look okay. The top should have a file "Records.txt" and a series of directories sequentially named starting at S00000. Those directories should have one or more files in them, varying based on which series it is. (AIA will have "image_lev1.fits" and "spikes.fits")

ls -lR $WD
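
Since the files get served out by the web server, it's also worth checking that the webserver account can actually read them; 'apache' is just the common account name:

sudo -u apache ls -l "$WD"/S00000/ && echo "readable by the webserver user"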

  1. If it doesn't have an entry in SUMS, or if the directory is empty, check if it's stuck in the JMD queue. (see wiki:jmdDebug for a list of aliases to use). The jmdstale alias will tell you if there are files that are taking a little too long, but you'll probably need to use ij and search for the specific SU:

select retries,state from drms.su_cache where sunum = ?;
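
If you're not using the aliases, the query is run from Derby's ij tool; the connect string below is only a placeholder -- see wiki:jmdDebug and wiki:queryDerby for how to actually attach to the JMD's queue database at your site, and substitute the real sunum:

ij
ij> connect 'jdbc:derby:/path/to/jmd/derby/db';
ij> select retries,state from drms.su_cache where sunum = 1234000866;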

  1. I would also check to see if there are any messages about that particular SU in the most recent JMD log. (if it didn't happen today, then you'll need to look through the specific file for that day)

jmdcat | grep $SUNUM

  1. You *might* find some messages in the HTTP error log for when the command was run, but those logs aren't always time stamped, so you'll first need to look through the HTTP access log for the request. Note that more recent versions of CentOS decided to chmod /var/log/httpd to 700 root root, instead of giving group read to it, so you're going to need root access to look at it:

grep $IP_or_URL_part $HTTP_access_log

  1. Then use that time to look through the error logs. Note that although Igor has set up some things to write to STDERR with timestamps and IP, not all of them do, so you can't just grep for the IP. You have to find roughly the time of the request from the access log, and then slowly page through the error logs around that point looking for anything interesting.
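
One way to do that, assuming the access log is in the usual combined format and using the same log variables as above:

# pull the timestamp(s) of the matching request(s) out of the access log ...
grep "$IP_or_URL_part" $HTTP_access_log | awk '{print $4}'

# ... then open the error log and search ("/" in less) for the nearest timestamped line
less $HTTP_error_log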

See Also

wiki:jmdDebug , wiki:jmd , wiki:queryDerby