wiki:jmdDebug

Version 2 (modified by joe, 9 years ago) (diff)

--

Debugging the JMD

Every once in a while, things get stuck in the JMD. Here are some tips to try to figure out what's going wrong.

note : Joe H. wrote up some notes a while back re: the queues / sum service, etc.

Useful Aliases

I use these (note, bash shell) aliases to help with debugging things. You may need to adjust some of the variables to match your system. (see the NetDRMS file 'config.local')

export PROMPT_COMMAND='echo -ne "\033]0;${USER}@${HOSTNAME%%.*}"; echo -ne "\007"'

export JSOC_ROOT=/opt/netdrms/
export JSOC_DBHOST=sdo4localhost
export JSOC_DBNAME=sdac_drms
export JSOC_DBUSER=[CENSORED]

export DRMS=$JSOC_ROOT
export SUMSSERVER=sdo4localhost
export PATH=$PATH:/home/sums/bin:/home/sums/scripts

alias drms="psql -U $JSOC_DBUSER -d $JSOC_DBNAME"
alias sums="psql -U $JSOC_DBUSER -d ${JSOC_DBNAME}_sums -p 5434"
# alias jetty='sudo /opt/jetty/bin/jetty.sh'
# alias apachectl='sudo /usr/sbin/apachectl'

export DERBY_HOME=/opt/derby
# alias ij=/opt/derby/bin/ij
alias ij=/opt/JMD/bin/jmdij
alias jmd='sudo /opt/JMD/bin/JMD.sh'

alias sumtail='tail -f `ls -1t /usr/local/logs/SUM//sum_svc*| head -1`'
alias drmsq='drms -c "select count(*) from sunum_queue"'
alias drmsq2='drms -c "select count(*), series_name from sunum_queue group by series_name"'
export IJCMD='/usr/bin/java -classpath '`ls -1t /opt/JMD/jar/*.jar | head -1`' -Dij.connection.jmdDB=jdbc:derby://localhost:31001//opt/derby/derbyDB
org.apache.derby.tools.ij'

alias jmdq='echo "select count(*) as count, request_type as type, substr(series_name,1,20) as series from drms.su_cache group by request_type, series_name" | '$IJCMD;
alias jmdq2='echo "select count(*) as count, state, request_type as type, substr(series_name,1,20) as series from drms.su_cache group by state,request_type, series_name" | '$IJCMD;
alias jmdstale='echo "select count(*),substr(series_name,1,20),request_type, min(elapse_Time), max(elapse_time) from drms.su_cache where elapse_time > 1200 group by series_name, request_type" | '$IJCMD;
alias jmdspeed='tail -n 2000 `ls -1t /opt/JMD/log/*stderr* | head -1` | grep DONE | cut -d: -f1-2 | uniq -c'
alias jmdspeedxl='grep DONE `ls -1t /opt/JMD/log/*stderr* | head -1` | tail -n 2000 | cut -d: -f1-2 | uniq -c'
alias jmdspeed2='grep DONE `ls -1tr /opt/JMD/log/*stderr*  | tail -3`  | cut -d: -f2 | uniq -c'
alias jmdtail='tail -f `ls -1t /opt/JMD/log/*stderr* | head -1`'
alias jmdcat='cat `ls -1t /opt/JMD/log/*stderr* | head -1`'

Restarting the JMD

Sniffing Incoming Requests

When nothing else works, we can try monitoring the traffic to make sure that the JMD processes are all happy.

To make sure that user requests are coming from DRMS to the JMD, we can monitor the traffic on port 8080. You'll need to be on the same machine, as it's only opened up to localhost, not to the world. First, install wireshark, then :

tshark -i lo -f 'tcp port 8080' -d tcp.port==8080,http | grep '/JMD'

Then issue a request via the VSO IDL client, making sure to specify 'site' on the vso_get call.

You should see something like:

[oneiros@sdo4 scripts]$ sudo tshark -i lo -f 'tcp port 8080' -d tcp.port==8080,http | grep '/JMD'
Running as user "root" and group "root". This could be dangerous.
Capturing on lo
  0.000176    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=baa20705-b50b-4ac6-a7b8-b099e5f2850a HTTP/1.1
  2.875262    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=c3aa5a5e-95b7-45b0-9de4-df85fa56d4c7 HTTP/1.1
  9.370789    127.0.0.1 -> 127.0.0.1    HTTP 231 POST /JMD/JMD HTTP/1.1  (application/x-www-form-urlencoded)
 10.012264    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=baa20705-b50b-4ac6-a7b8-b099e5f2850a HTTP/1.1
 10.820344    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=a4d74aab-2017-47a8-8014-b1476de52f6c HTTP/1.1
 12.887373    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=c3aa5a5e-95b7-45b0-9de4-df85fa56d4c7 HTTP/1.1
 20.023336    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=baa20705-b50b-4ac6-a7b8-b099e5f2850a HTTP/1.1
 20.829574    127.0.0.1 -> 127.0.0.1    HTTP 172 GET /JMD/JMD?type=query&sessionid=a4d74aab-2017-47a8-8014-b1476de52f6c HTTP/1.1

The POST is the initial request, and then the subsequent GET requests are to monitor if the retrieval is completed. (note that there were multiple ones running concurently in this example). If you see bogus characters in there, and not a valid UUID, you need to upgrade your JMD to fix a resolved bug.

Sniffing Inter-server Communications

When things are 'stuck' in the JMD's queue, we can also try to monitor to make sure that we're getting good messages between NetDRMS nodes.

First, we'll need to identify the name of our external interface using either ifconfig or ip:

[oneiros@sdo4 scripts]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether bc:30:5b:ee:fb:48 brd ff:ff:ff:ff:ff:ff
    inet 198.118.[CENSORED]/24 brd 198.118.248.255 scope global em1
3: em2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether bc:30:5b:ee:fb:49 brd ff:ff:ff:ff:ff:ff
4: em3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether bc:30:5b:ee:fb:4a brd ff:ff:ff:ff:ff:ff
5: em4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether bc:30:5b:ee:fb:4b brd ff:ff:ff:ff:ff:ff
[oneiros@sdo4 scripts]$ ifconfig -a
em1       Link encap:Ethernet  HWaddr BC:30:5B:EE:FB:48
          inet addr:198.118.[CENSORED]  Bcast:198.118.248.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:25800945176 errors:0 dropped:6701 overruns:95689 frame:0
          TX packets:19038424359 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:23062736230177 (20.9 TiB)  TX bytes:23905987935940 (21.7 TiB)
          Memory:dcb00000-dcbfffff
...

So in this case, our external interface is 'em1', so we can sniff that interface. The censored partially censored IP is that interface's address. Unfortunately, because we accept connections on port 80, and we're trying to monitor port 80 outbound, it's a bit messy. Also, this won't show us the response payload:

[[oneiros@sdo4 scripts]$ sudo tshark -i em1 -f 'tcp dst port 80 and src 198.118.[CENSORED]' -d tcp.port==80,http | grep HTTP
Running as user "root" and group "root". This could be dangerous.
Capturing on em1
  0.013785 198.118.[CENSORED] -> 131.142.[CENSORED] HTTP 166 POST /cgi-bin/VSO/DRMS/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
  0.313581 198.118.[CENSORED] -> 140.252.[CENSORED] HTTP 166 POST /cgi-bin/VSO/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
  0.686565 198.118.[CENSORED] -> 171.64.[CENSORED] HTTP 166 POST /cgi-bin/ajax/jsoc_fetch_VSO HTTP/1.1  (application/x-www-form-urlencoded)
  2.138692 198.118.[CENSORED] -> 131.142.[CENSORED] HTTP 166 POST /cgi-bin/VSO/DRMS/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
  2.439454 198.118.[CENSORED] -> 140.252.[CENSORED] HTTP 166 POST /cgi-bin/VSO/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
  2.807359 198.118.[CENSORED] -> 171.64.[CENSORED] HTTP 166 POST /cgi-bin/ajax/jsoc_fetch_VSO HTTP/1.1  (application/x-www-form-urlencoded)
 11.453624 198.118.[CENSORED] -> 131.142.[CENSORED] HTTP 166 POST /cgi-bin/VSO/DRMS/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
 11.753938 198.118.[CENSORED] -> 140.252.[CENSORED] HTTP 166 POST /cgi-bin/VSO/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
 12.118593 198.118.[CENSORED] -> 171.64.[CENSORED] HTTP 166 POST /cgi-bin/ajax/jsoc_fetch_VSO HTTP/1.1  (application/x-www-form-urlencoded)
 13.580818 198.118.[CENSORED] -> 131.142.[CENSORED] HTTP 166 POST /cgi-bin/VSO/DRMS/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)
 13.878723 198.118.[CENSORED] -> 140.252.[CENSORED] HTTP 166 POST /cgi-bin/VSO/vso_jsoc_fetch.cgi HTTP/1.1  (application/x-www-form-urlencoded)

To actually sniff the connection requires more work. You might be able to use the full wireshark command if you can can get an X console to the machine, or you can dump the traffic, and then move the dump to another machine to analyze: