wiki:jmdDebug
Last modified 8 years ago Last modified on 03/06/15 08:47:28

Debugging the JMD

Every once in a while, things get stuck in the JMD. Here are some tips to try to figure out what's going wrong.

note : Joe H. wrote up some notes a while back re: the queues / sum service, etc.

Useful Aliases

I use these (note, bash shell) aliases to help with debugging things. You may need to adjust some of the variables to match your system. (see the NetDRMS file 'config.local')

export PROMPT_COMMAND='echo -ne "\033]0;${USER}@${HOSTNAME%%.*}"; echo -ne "\007"'

export JSOC_ROOT=/opt/netdrms/
export JSOC_DBHOST=sdo4localhost
export JSOC_DBNAME=sdac_drms
export JSOC_DBUSER=[CENSORED]

export DRMS=$JSOC_ROOT
export SUMSSERVER=sdo4localhost
export PATH=$PATH:/home/sums/bin:/home/sums/scripts

alias drms="psql -U $JSOC_DBUSER -d $JSOC_DBNAME"
alias sums="psql -U $JSOC_DBUSER -d ${JSOC_DBNAME}_sums -p 5434"
# alias jetty='sudo /opt/jetty/bin/jetty.sh'
# alias apachectl='sudo /usr/sbin/apachectl'

export DERBY_HOME=/opt/derby
alias ij=/opt/JMD/bin/jmdij
alias jmd='sudo /opt/JMD/bin/JMD.sh'

alias sumtail='tail -f `ls -1t /usr/local/logs/SUM//sum_svc*| head -1`'
alias drmsq='drms -c "select count(*) from sunum_queue"'
alias drmsq2='drms -c "select count(*), series_name from sunum_queue group by series_name"'

export IJCMD='/usr/bin/java -classpath '`ls -1t /opt/JMD/jar/*.jar | head -1`' -Dij.connection.jmdDB=jdbc:derby://localhost:31001//opt/derby/derbyDB
org.apache.derby.tools.ij'
alias jmdq='echo "select count(*) as count, request_type as type, substr(series_name,1,20) as series from drms.su_cache where state <> '"'DONE'"' group by request_type, series_name" |
'`ijcmd`;
alias jmdq0='echo "select count(*) from drms.su_cache where state <> '"'DONE'"'" | '`ijcmd`;
alias jmdq2='echo "select count(*) as count, state, request_type as type, substr(series_name,1,20) as series from drms.su_cache group by state,request_type, series_name" | '$IJCMD;
alias jmdstale='echo "select count(*) as count,substr(series_name,1,20) as series,request_type as type, min(elapse_Time) as min_time, max(elapse_time) as max_time from drms.su_cache where elapse_time > 1200 group by series_name, request_type" | '$IJCMD;
alias jmdspeed='tail -n 2000 `ls -1t /opt/JMD/log/*stderr* | head -1` | grep DONE | cut -d: -f1-2 | uniq -c'
alias jmdspeedxl='grep DONE `ls -1t /opt/JMD/log/*stderr* | head -1` | tail -n 2000 | cut -d: -f1-2 | uniq -c'
alias jmdspeed2='grep DONE `ls -1tr /opt/JMD/log/*stderr*  | tail -3`  | cut -d: -f2 | uniq -c'
alias jmdtail='tail -f `ls -1t /opt/JMD/log/*stderr* | head -1`'
alias jmdcat='cat `ls -1t /opt/JMD/log/*stderr* | head -1`'
# alias jmdofln='echo "select su_cache.sunum, sum62.sunum from drms.su_cache as su_cache left join sum62 on su_cache.sunum = sum62.sunum where state='"'OFLN'"'; " | `ijcmd`'

And what they all do:

drms : connect to the DRMS database using psql (command line postgres client)
sums : connect to the SUMS  database using psql (command line postgres client)
ij : connect to the JMD database using ij (command line derby client)

jmd : script to start/stop the JMD service.  (call using `jmd start` or `jmd stop`)

sumtail : tail -f the most recent sums log
jmdtail : tail -f the most recent jmd log
jmdcat : cat the most recent jmd log

drmsq : report the number of items in the DRMS sunum_queue table
drmsq2 : report the number of items per series in the DRMS sunum_queue table

jmdq : report the number of items per series in the JMD drms.su_cache table that are not marked 'DONE'.
jmdq0 : report the number of items (a single number) in the JMD drms.su_cache table that are not marked 'DONE'.
jmdq2 : report the number of items in the JMD drms.su_cache table, broken down by series & processing state.

jmdspeed : a count of how many SCPs per minute have recently completed
jmdspeedxl : a count of how many SCPs per minute have completed (looks further back through the JMD logs)
jmdspeed2 : a count of how many SCPs per hour have completed over the last day or so.

jmdstale : report on things in the JMD drms.su_cache table that have elapsed times of more than 20 min.
jmdofln : (note, requires populating a 'sum62' table) : report on which items marked as 'ofln' were on the JSOC /SUM62 partition

Restarting the JMD

In your JMD directory, there should be a bin/JMD.sh. It behaves like an rc script, so call it as root with either 'start' or 'stop' as an argument. See the above 'jmd' alias which calls it using sudo, so you can do jmd stop; jmd start.

Sniffing Incoming Requests

When nothing else works, we can try monitoring the traffic to make sure that the JMD processes are all happy.

To make sure that user requests are coming from DRMS to the JMD, we can monitor the traffic on port 8080. You'll need to be on the same machine, as it's only opened up to localhost, not to the world. First, install wireshark, then :

tshark -i lo -f 'tcp port 8080' -d tcp.port==8080,http | grep '/JMD'

Then issue a request via the VSO IDL client, making sure to specify 'site' on the vso_get call.

You should see something like:

[oneiros@sdo4 scripts]$ sudo tshark -i lo -f 'tcp port 8080' -d tcp.port==8080,http | grep '/JMD'
Running as user "root" and group "root". This could be dangerous.
Capturing on lo
  0.000176    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=baa20705-b50b-4ac6-a7b8-b099e5f2850a HTTP/1.1
  2.875262    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=c3aa5a5e-95b7-45b0-9de4-df85fa56d4c7 HTTP/1.1
  9.370789    127.0.0.1 -> 127.0.0.1    HTTP 231 POST /JMD/JMD HTTP/1.1  (application/x-www-form-urlencoded)
 10.012264    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=baa20705-b50b-4ac6-a7b8-b099e5f2850a HTTP/1.1
 10.820344    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=a4d74aab-2017-47a8-8014-b1476de52f6c HTTP/1.1
 12.887373    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=c3aa5a5e-95b7-45b0-9de4-df85fa56d4c7 HTTP/1.1
 20.023336    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=baa20705-b50b-4ac6-a7b8-b099e5f2850a HTTP/1.1
 20.829574    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=a4d74aab-2017-47a8-8014-b1476de52f6c HTTP/1.1

The POST is the initial request, and then the subsequent GET requests are to monitor if the retrieval is completed. (note that there were multiple ones running concurently in this example). If you see bogus characters in there, and not a valid UUID, you need to upgrade your JMD to fix a resolved bug.

Sniffing Inter-server Communications

When things are 'stuck' in the JMD's queue, we can also try to monitor to make sure that we're getting good messages between NetDRMS nodes.

First, we'll need to identify the name of our external interface using either ifconfig or ip:

[oneiros@sdo4 scripts]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether bc:30:5b:ee:fb:48 brd ff:ff:ff:ff:ff:ff
    inet 198.118.[CENSORED]/24 brd 198.118.248.255 scope global em1
3: em2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether bc:30:5b:ee:fb:49 brd ff:ff:ff:ff:ff:ff
4: em3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether bc:30:5b:ee:fb:4a brd ff:ff:ff:ff:ff:ff
5: em4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether bc:30:5b:ee:fb:4b brd ff:ff:ff:ff:ff:ff
[oneiros@sdo4 scripts]$ ifconfig -a
em1       Link encap:Ethernet  HWaddr BC:30:5B:EE:FB:48
          inet addr:198.118.[CENSORED]  Bcast:198.118.248.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:25800945176 errors:0 dropped:6701 overruns:95689 frame:0
          TX packets:19038424359 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:23062736230177 (20.9 TiB)  TX bytes:23905987935940 (21.7 TiB)
          Memory:dcb00000-dcbfffff
...

So in this case, our external interface is 'em1', so we can sniff that interface. The censored partially censored IP is that interface's address. Unfortunately, because we accept connections on port 80, and we're trying to monitor port 80 outbound, it's a bit messy. Also, this won't show us the response payload:

[[oneiros@sdo4 scripts]$ sudo tshark -i em1 -f 'tcp dst port 80 and src 198.118.[CENSORED]' -d tcp.port==80,http | grep HTTP
Running as user "root" and group "root". This could be dangerous.
Capturing on em1
  0.013785 198.118.[CENSORED] -> 131.142.[CENSORED] HTTP 166 POST /cgi-bin/VSO/DRMS/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
  0.313581 198.118.[CENSORED] -> 140.252.[CENSORED] HTTP 166 POST /cgi-bin/VSO/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
  0.686565 198.118.[CENSORED] -> 171.64.[CENSORED] HTTP 166 POST /cgi-bin/ajax/jsoc_fetch_VSO HTTP/1.1  (application/x-www-form-urlencoded)
  2.138692 198.118.[CENSORED] -> 131.142.[CENSORED] HTTP 166 POST /cgi-bin/VSO/DRMS/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
  2.439454 198.118.[CENSORED] -> 140.252.[CENSORED] HTTP 166 POST /cgi-bin/VSO/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
  2.807359 198.118.[CENSORED] -> 171.64.[CENSORED] HTTP 166 POST /cgi-bin/ajax/jsoc_fetch_VSO HTTP/1.1  (application/x-www-form-urlencoded)
 11.453624 198.118.[CENSORED] -> 131.142.[CENSORED] HTTP 166 POST /cgi-bin/VSO/DRMS/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
 11.753938 198.118.[CENSORED] -> 140.252.[CENSORED] HTTP 166 POST /cgi-bin/VSO/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
 12.118593 198.118.[CENSORED] -> 171.64.[CENSORED] HTTP 166 POST /cgi-bin/ajax/jsoc_fetch_VSO HTTP/1.1  (application/x-www-form-urlencoded)
 13.580818 198.118.[CENSORED] -> 131.142.[CENSORED] HTTP 166 POST /cgi-bin/VSO/DRMS/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
 13.878723 198.118.[CENSORED] -> 140.252.[CENSORED] HTTP 166 POST /cgi-bin/VSO/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)

To actually sniff the connection requires more work. You might be able to use the full wireshark command if you can can get an X console to the machine, or you can dump the traffic, and then move the dump to another machine to analyze: