= DRMS / SUMS RPC Issues =

SUMS (Storage Unit Management System) runs a service that requires RPC. Unfortunately, on some OSes, RPC calls start backing up after a while: clients attempting to connect stall for a minute or two before succeeding.
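
One way to spot the stall is to time a simple RPC query. The following is a minimal sketch, not part of NetDRMS: `timed_check` is a hypothetical helper name, and it assumes that an ordinary query such as `rpcinfo -p` is slowed by the same backlog, which has not been confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: run a command and report how many whole seconds it took.
# Returns 0 if it finished within the limit (in seconds), 1 if it was slower.
# The exit status of the command itself is ignored; only the duration matters.
timed_check() {
    limit=$1
    shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    elapsed=$(( $(date +%s) - start ))
    echo "elapsed=${elapsed}s limit=${limit}s"
    [ "$elapsed" -le "$limit" ]
}
```

For example, `timed_check 10 rpcinfo -p localhost || echo "RPC looks slow"` would flag a query that takes longer than 10 seconds. The 10-second limit is an arbitrary choice for illustration.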

This results in significant delays for both data exports (`drms_export.cgi`) and JMD downloads. The delays will not be apparent in the JMD's logs, as the speed listed there is '''only''' for the scp, not for the total time the downloading thread took.

You will also notice processes that should run near-instantaneously (such as `vso_sum_put` and `vso_sum_alloc`) showing up as multiple copies in `ps` output.
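
To check for piled-up helpers without reading the whole process table by eye, something like the sketch below may help. `count_helpers` is a hypothetical name, and the function assumes its input has one command name per line, as produced by `ps -eo comm` on Linux.

```shell
#!/bin/sh
# Hypothetical helper: count how many lines of a process listing are exactly
# vso_sum_put or vso_sum_alloc. Reads one command name per line on stdin.
count_helpers() {
    awk '
        $1 == "vso_sum_put"   { put++ }
        $1 == "vso_sum_alloc" { alloc++ }
        END { printf "vso_sum_put=%d vso_sum_alloc=%d\n", put, alloc }
    '
}
```

Run it as `ps -eo comm | count_helpers`. Counts that stay above zero across several runs suggest that calls are backing up.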

The problem goes away if you reboot, but you can also 'kick' the RPC server. This requires stopping the server, adjusting its configuration so that it won't maintain state, bringing it back up, shutting it down again, then bringing it back up with its original configuration. You'll then need to restart `sum_svc`.

We use the following script (named `kick_rpcbind`) on a CentOS 6.x system running NetDRMS 7:

{{{
#!/bin/sh
# kick_rpcbind: restart rpcbind without saved state, then restore its configuration.

echo "$(date) : kicking RPC service" >> /var/log/sums/rpc_check_log.txt

# Save the current rpcbind configuration.
cp -p /etc/sysconfig/rpcbind /tmp

# Temporarily replace the rpcbind arguments with -i only.
echo 'RPCBIND_ARGS="-i"' > /etc/sysconfig/rpcbind

/etc/init.d/rpcbind stop
/etc/init.d/rpcbind start

# Restore the original configuration and restart once more.
cp -p /tmp/rpcbind /etc/sysconfig

/etc/init.d/rpcbind stop
/etc/init.d/rpcbind start

# Restart SUMS as the sums user.
# su -lm sums --command='/usr/local/scripts/sums/restart_sums_cron'    # NetDRMS 6
su -lm sums --command='/usr/local/scripts/sums/restart_sums_wrapper'   # NetDRMS 7

echo "$(date) : RPC service kicked" >> /var/log/sums/rpc_check_log.txt
}}}

You'll need to adjust the username (`sums`, in the `su` command at the end), the path to the log file, and the path to the command that restarts SUMS to match your installation.
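
If you adapt the script for several sites, it may be easier to collect those site-specific values at the top. The sketch below is one possible layout, not what we run: the variable names `SUMS_USER`, `RPC_LOG` and `SUMS_RESTART` and the helpers `log_msg` and `restart_cmd` are hypothetical, and the defaults simply repeat the values from the script above.

```shell
#!/bin/sh
# Hypothetical site settings; override them in the environment if needed.
SUMS_USER=${SUMS_USER:-sums}
RPC_LOG=${RPC_LOG:-/var/log/sums/rpc_check_log.txt}
SUMS_RESTART=${SUMS_RESTART:-/usr/local/scripts/sums/restart_sums_wrapper}

# Append a timestamped message to the log file.
log_msg() {
    echo "$(date) : $*" >> "$RPC_LOG"
}

# Print the restart command line without running it, so it can be inspected.
restart_cmd() {
    echo "su -lm $SUMS_USER --command='$SUMS_RESTART'"
}
```

The two `echo` lines in the script then become `log_msg "kicking RPC service"` and `log_msg "RPC service kicked"`, and `restart_cmd` shows the exact `su` line that would be used.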

Note that this script '''should not''' be run from cron, because it doesn't always bring things back up cleanly. If you notice error messages about 5-20 seconds after running it, kick the server again. Repeat until you see no error messages for about 5 minutes.