Last modified on 07/30/15 13:31:46

DRMS / SUMS RPC Issues

SUMS (the Storage Unit Management System) runs a service that requires RPC. Unfortunately, on some OSes, RPC calls start backing up after a while: clients attempting to connect stall for a minute or two before succeeding.
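One way to confirm the stall is to time a trivial query against the local rpcbind. The following is a minimal sketch, assuming rpcinfo is installed and that a healthy query normally returns in about a second or less. The helper name time_cmd and the 5-second threshold are illustrative and not part of NetDRMS.

```shell
# time_cmd: run a command, discard its output, and print the elapsed
# time in whole seconds. (Hypothetical helper name, for illustration.)
time_cmd() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example use (the 5-second threshold is an assumption; tune it for your site):
#   elapsed=$(time_cmd rpcinfo -p localhost)
#   [ "$elapsed" -gt 5 ] && echo "rpcbind is slow: ${elapsed}s"
```

If the elapsed time is regularly in the tens of seconds, the RPC server is probably in the backed-up state described here.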

This results in significant delays for both data exports (drms_export.cgi) and JMD downloads. The delay will not be apparent in the JMD's logs, because the speed listed there covers only the scp transfer, not the total time the download thread took.

You will also notice processes that should finish near-instantaneously (such as vso_sum_put and vso_sum_alloc) showing up as multiple copies in the output of ps.
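A quick way to spot the pile-up is to count how many copies of each command are running. This is a rough sketch: the helper name count_procs is illustrative, and it assumes ps supports the -e and -o comm= options (as procps on CentOS does).

```shell
# count_procs NAME: count the lines on stdin that exactly match NAME.
# Feed it the output of `ps -e -o comm=` (one command name per line).
# `|| true` keeps the exit status at 0 when the count is zero.
count_procs() {
    grep -c -x -F -- "$1" || true
}

# Example use:
#   ps -e -o comm= | count_procs vso_sum_put
#   ps -e -o comm= | count_procs vso_sum_alloc
```

On a healthy system both counts should usually be 0 or 1. Several lingering copies suggest that calls are stalling.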

A reboot clears the problem, but you can also 'kick' the RPC server instead. This requires stopping the server, adjusting its configuration so that it won't maintain state, bringing it back up, shutting it down again, then bringing it back up with its original configuration. You'll then need to restart sum_svc.

We use the following script (named 'kick_rpcbind') for a CentOS 6.x system running NetDRMS 7:

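#!/bin/sh
# kick_rpcbind: restart rpcbind once with a temporary configuration, then
# again with the original one, and finally restart SUMS. Run as root.

echo "$(date) : kicking RPC service" >> /var/log/sums/rpc_check_log.txt

# Save the current rpcbind configuration
cp -p /etc/sysconfig/rpcbind /tmp

# Temporarily replace the configuration and restart rpcbind with it
echo 'RPCBIND_ARGS="-i"' > /etc/sysconfig/rpcbind

/etc/init.d/rpcbind stop
/etc/init.d/rpcbind start

# Restore the original configuration and restart rpcbind again
cp -p /tmp/rpcbind /etc/sysconfig

/etc/init.d/rpcbind stop
/etc/init.d/rpcbind start

# Restart SUMS as the sums user (use the line matching your NetDRMS version)
# su -lm sums --command='/usr/local/scripts/sums/restart_sums_cron'; # netdrms 6
su -lm sums --command='/usr/local/scripts/sums/restart_sums_wrapper'; # netdrms 7

echo "$(date) : RPC service kicked" >> /var/log/sums/rpc_check_log.txt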

You'll need to adjust the username (sums, in the su command at the end), the paths to the log files, and the command that restarts SUMS.
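One way to keep these site-specific values in one place is to lift them into variables at the top of the script. The following is a minimal sketch that uses the values from the script above as defaults. The variable names SUMS_USER, RPC_LOG and SUMS_RESTART and the helper log_kick are illustrative, not something NetDRMS defines.

```shell
# Site-specific settings; override them in the environment if needed.
# (Hypothetical variable names; defaults are taken from the script above.)
SUMS_USER="${SUMS_USER:-sums}"
RPC_LOG="${RPC_LOG:-/var/log/sums/rpc_check_log.txt}"
SUMS_RESTART="${SUMS_RESTART:-/usr/local/scripts/sums/restart_sums_wrapper}"

# log_kick MESSAGE: append a timestamped line to the log file.
log_kick() {
    echo "$(date) : $1" >> "$RPC_LOG"
}

# Example use inside the script:
#   log_kick "kicking RPC service"
#   su -lm "$SUMS_USER" --command="$SUMS_RESTART"
```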

Note that this script should not be run from cron, because it doesn't always bring things back up cleanly. If you notice error messages about 5-20 seconds after running it, kick the server again, and keep doing so until you see no error messages for about 5 minutes.
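If the error messages land in a log file you can read, a small helper can tell you whether new ones appeared after a kick. This is only a sketch: the helper name count_new_errors, the log path and the pattern "error" are placeholders, since where the messages show up depends on your installation. It reports only, and leaves the decision to kick again to the operator.

```shell
# count_new_errors LOG START PATTERN
# Print how many lines after line number START in LOG match PATTERN
# (case-insensitive). LOG and PATTERN are site-specific assumptions.
count_new_errors() {
    tail -n +"$(( $2 + 1 ))" "$1" | grep -c -i -- "$3" || true
}

# Example: record the log length just before kicking, then check later.
#   before=$(wc -l < /path/to/your/sums.log)
#   ... run kick_rpcbind, wait a few minutes ...
#   count_new_errors /path/to/your/sums.log "$before" error
```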

You'll need to run it as root, or via sudo, so that it has sufficient privileges to modify the RPC config files and restart services.
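A guard at the top of the script can catch an unprivileged run before anything is changed. This is a minimal sketch; the helper name is_root_uid is illustrative.

```shell
# is_root_uid UID: succeed only when UID is 0 (root).
is_root_uid() {
    [ "$1" -eq 0 ]
}

# Example guard for the top of the script:
#   is_root_uid "$(id -u)" || { echo "must be run as root" >&2; exit 1; }
```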