Changes between Initial Version and Version 1 of drmsSumsRpc


Timestamp:
07/30/15 11:52:45 (9 years ago)
Author:
joe
  • drmsSumsRpc

= DRMS / SUMS RPC Issues =

SUMS (Storage Unit Management System) runs a service that requires RPC.  Unfortunately, on some OSes, RPC calls start backing up after a while: clients attempting to connect stall for a minute or two before succeeding.

This results in significant delays for both data exports (`drms_export.cgi`) and JMD downloads.  The delays will not be apparent in the JMD's logs, because the speed listed there covers '''only''' the scp transfer, not the total time the download thread took.

You will also notice processes that should finish near-instantaneously (such as `vso_sum_put` and `vso_sum_alloc`) showing up as multiple copies in `ps` output.

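One quick way to check for this symptom is to count the matching processes.  The following is a minimal sketch, not part of NetDRMS: `count_procs` is a hypothetical helper name, and it assumes a `ps` that supports the POSIX `-e -o comm=` options.

```shell
#!/bin/sh
# Hypothetical helper (not part of NetDRMS): count running processes
# whose command name matches NAME exactly.
count_procs() {
    ps -e -o comm= | awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Report how many copies of each normally short-lived command are running.
for name in vso_sum_put vso_sum_alloc; do
    echo "$name: $(count_procs "$name") running"
done
```

More than one or two copies lingering across repeated runs suggests that RPC calls are backing up.
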
The problem will work itself out if you reboot, but you can also 'kick' the RPC server.  This requires stopping the server, adjusting its configuration so that it won't maintain state, bringing it back up, shutting it down again, then bringing it back up with its original configuration.  You'll then need to restart `sum_svc`.

We use the following script (named `kick_rpcbind`) on a CentOS 6.x system running NetDRMS 7:

{{{
#!/bin/sh
# kick_rpcbind: restart rpcbind with temporary options, restore the
# original configuration, restart it again, then restart SUMS.

LOG=/var/log/sums/rpc_check_log.txt

echo "$(date) : kicking RPC service" >> "$LOG"

# Save the original rpcbind configuration.
cp -p /etc/sysconfig/rpcbind /tmp

# Restart rpcbind with a temporary configuration.
echo 'RPCBIND_ARGS="-i"' > /etc/sysconfig/rpcbind

/etc/init.d/rpcbind stop
/etc/init.d/rpcbind start

# Restore the original configuration and restart once more.
cp -p /tmp/rpcbind /etc/sysconfig

/etc/init.d/rpcbind stop
/etc/init.d/rpcbind start

# Restart SUMS as the sums user.
# su -lm sums --command='/usr/local/scripts/sums/restart_sums_cron'    # netdrms 6
su -lm sums --command='/usr/local/scripts/sums/restart_sums_wrapper'   # netdrms 7

echo "$(date) : RPC service kicked" >> "$LOG"
}}}

You'll need to adjust the username (`sums`, in the `su` command at the end), the log file path, and the command used to restart SUMS.

Note that this script '''should not''' be run from cron -- it doesn't always bring things back up cleanly.  If you notice error messages about 5-20 seconds after running it, kick the server again.  Repeat until you see no error messages for about 5 minutes.
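
The "repeat until quiet" check can be sketched as a small helper.  This is only an illustration: `stable_for` is a hypothetical name, and using `rpcinfo -p` as a stand-in for "no error messages" is an assumption; substitute whatever check actually surfaces the errors on your system.

```shell
#!/bin/sh
# stable_for CHECK DURATION INTERVAL
# Hypothetical helper (not part of NetDRMS): run CHECK every INTERVAL
# seconds for DURATION seconds.  Returns 0 if CHECK never failed,
# or 1 at the first failure.  INTERVAL must be a positive integer.
stable_for() {
    check=$1
    duration=$2
    interval=$3
    elapsed=0
    while [ "$elapsed" -lt "$duration" ]; do
        # $check is left unquoted so that "rpcinfo -p" splits into words.
        $check >/dev/null 2>&1 || return 1
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Example (assumption: a failing "rpcinfo -p" means rpcbind is still unhealthy):
#   stable_for "rpcinfo -p" 300 10 || echo "errors seen; kick the server again"
```

Run it by hand after each kick rather than from cron, in keeping with the warning above.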