Last modified on 12/09/14 08:18:46

Managing disk use with sum_rm

sum_rm is the program that ages data off disk in the NetDRMS system. Its configuration file resides in the same directory as the program's log files. A heavily commented configuration file is included below, as it pre-dates the wiki and documents most of the relevant points.

Errors like the following are sometimes written to the sum_rm log:

Error in SUM_StatOffline
no data found on line 76
Err: SUM_StatOffline(844424933554303, ...)
Removing sum_main for ds_index = 844424933554303
Error in DS_SumMainDelete
no data found on line 24
**Err: DS_SumMainDelete(844424933554303)
**Err: DS_RmNowX() can't open /SUM02/D844424933554303/Records.txt

This happens when a sunum has an entry in the sum_partn_alloc table of the SUMS database but no corresponding entry in the sum_main table, which generally indicates that a download failed. The error is generally not serious. Provided the failed downloads were caused by a short-lived transient event, sum_rm should work its way through the data from that period and then get back to deleting data.
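
To gauge how many sunums are affected, the log can be scanned for these errors. The following is a minimal sketch rather than part of NetDRMS: it assumes the error lines look exactly like the excerpt above, and the function name orphaned_sunums is made up for illustration.

```python
import re

# Matches lines such as "Err: SUM_StatOffline(844424933554303, ...)".
# The pattern is based only on the log excerpt above (an assumption).
STAT_OFFLINE_ERR = re.compile(r"Err: SUM_StatOffline\((\d+)")


def orphaned_sunums(log_text):
    """Return the distinct sunums named in SUM_StatOffline errors,
    in order of first appearance."""
    seen = []
    for line in log_text.splitlines():
        match = STAT_OFFLINE_ERR.search(line)
        if match:
            sunum = int(match.group(1))
            if sunum not in seen:
                seen.append(sunum)
    return seen


sample = """Error in SUM_StatOffline
no data found on line 76
Err: SUM_StatOffline(844424933554303, ...)
Removing sum_main for ds_index = 844424933554303
"""
print(orphaned_sunums(sample))  # [844424933554303]
```

A list that stops growing between runs suggests that sum_rm has worked through the entries left behind by the failed downloads.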

It's also worth noting that sum_rm is *very* quiet when there are no data whose effective date is in the past: it prints nothing to indicate that this is the case. To check whether this may be the case, list the earliest effective dates with this query:

select effective_date from sum_partn_alloc order by effective_date limit 30;

You can also count the entries whose effective date has already passed:

select count(*) from sum_partn_alloc where effective_date < to_char( now(), 'YYYYMMDDHH24MI' );
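
The comparison in the second query works because effective_date appears to be stored as a fixed-width, zero-padded string (an assumption drawn from the 'YYYYMMDDHH24MI' format passed to to_char), so string order matches time order. A small sketch of the same check outside the database, with hypothetical helper names:

```python
from datetime import datetime


def to_effective_date(moment):
    """Format a datetime like to_char(..., 'YYYYMMDDHH24MI')."""
    return moment.strftime("%Y%m%d%H%M")


def count_expired(effective_dates, now):
    """Count values strictly earlier than 'now', as the query above does.

    Plain string comparison is enough because every value has the same
    width and is zero-padded.
    """
    cutoff = to_effective_date(now)
    return sum(1 for value in effective_dates if value < cutoff)


now = datetime(2014, 12, 9, 8, 18)
dates = ["201412090817", "201412090818", "201501010000"]
print(to_effective_date(now))     # 201412090818
print(count_expired(dates, now))  # 1
```

A count of zero means sum_rm has nothing it is allowed to delete, which would explain a silent log.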

The configuration file for sum_rm is below:

#                                                         
# Configuration file for sum_rm program                   
#                                                         
# The sum_rm program does disk data age off (ie. deletes old data) to stop
# disk partitions from filling up. It is started as                       
# a service along with the rest of sums. It works as follows.             
#                                                                         
# In the sum_partn_alloc table of the DRMS database,                      
# there is an 'effective_date' column. After the                          
# effective_date, the data product may be deleted by sum_rm.              
#                                                                         
# sum_rm runs repeatedly. For each run, sum_rm does the following :       
#                                                                         
# [1] It reads this configuration file (allowing                          
#     dynamic adjustment of these parameters, ie. you do not              
#     need to restart the sum_rm program for a config change              
#     to take effect, which is very handy).                               
# [2] Accesses the database and orders results by effective_date.         
# [3] Starts deleting products from disk if the effective_date is in the past,
#     until one of three conditions is met :                                  
#      (a) The disk has at least PART_PERCENT_FREE percent                    
#          of free space on it (see below), or                                
#      (b) There are no more cases in which effective_date is in the past, or 
#      (c) At least 600 deletions have taken place (this is defined in        
#          the LIMIT statement in the file SUMLIB_RmDoX.pgc).
# [4] Sleeps for the time defined by the SLEEP parameter (see below) before the next run,
#     then goes back to [1] for the next run.
#
############# sum_rm parameters ##########################
#
# Number of seconds to sleep between runs of sum_rm
SLEEP=60
#
# Integer percent of each disk partition to be kept free. Applies to all partitions.
# Defaults to 3 if not specified. Setting PART_PERCENT_FREE=5 will allow
# partitions to fill to 95%. Note that the math the 'df' command uses
# tends to round up, but dividing the number of used blocks by the total
# number of blocks should give the result specified by PART_PERCENT_FREE.
#
# NOTE : This parameter used to be specified as MAX_FREE_0, MAX_FREE_1...
#        ...MAX_FREE_n, which specified the number of free MB on different partitions,
#        but this is no longer the case and specifying
#        MAX_FREE_{n} will have no effect as of NetDRMS 7.0. If
#        MAX_FREE_{n} is specified and PART_PERCENT_FREE is not,
#        then the default PART_PERCENT_FREE=3 will be used and
#        disk partitions will fill to 97%.
PART_PERCENT_FREE=4
#
# Log file (only opened at startup and pid gets appended to this name)
LOG=/usr/local/logs/SUM/sum_rm.log
#
# Who to email when there's a notable problem
MAIL=webmaster@mydomain.edu
#
# To prevent sum_rm from doing anything set NOOP to non-0.
NOOP=0
#
# sum_rm can only be enabled for a single user as defined by USER.
USER=sumsUser
#
# Don't run sum_rm between these NORUN hours of the day (0-23).
# Comment out to ignore, or set them both to the same hour.
# The NORUN_STOP must be >= NORUN_START.
#
# Don't run when the hour first hits NORUN_START
NORUN_START=7
#
# Start running again when the hour first hits NORUN_STOP
NORUN_STOP=7
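
The run described in steps [2] and [3] of the comments above can be illustrated with a small simulation. This is only a sketch of the stopping rules as the comments describe them, not the code in SUMLIB_RmDoX.pgc: it models a single partition, takes effective dates as 'YYYYMMDDHH24MI' strings, and all function and parameter names are made up for illustration.

```python
def percent_free(total_blocks, used_blocks):
    """Free space as a percentage: free blocks divided by total blocks."""
    return 100.0 * (total_blocks - used_blocks) / total_blocks


def run_once(entries, total_blocks, used_blocks, now,
             part_percent_free=3, max_deletions=600):
    """Simulate one sum_rm run on one partition.

    entries is a list of (effective_date, blocks) tuples. Returns the
    effective dates that were deleted and the resulting used block count.
    """
    deleted = []
    # Step [2]: order the candidates by effective_date, oldest first.
    for effective_date, blocks in sorted(entries):
        # (a) Enough free space already: stop deleting.
        if percent_free(total_blocks, used_blocks) >= part_percent_free:
            break
        # (b) Nothing left whose effective_date is in the past.
        if effective_date >= now:
            break
        # (c) Per-run deletion limit reached.
        if len(deleted) >= max_deletions:
            break
        used_blocks -= blocks
        deleted.append(effective_date)
    return deleted, used_blocks


entries = [
    ("201403010000", 20),
    ("201401010000", 20),
    ("209901010000", 20),
    ("201402010000", 20),
]
deleted, used = run_once(entries, total_blocks=1000, used_blocks=990,
                         now="201412090818", part_percent_free=4)
print(deleted)  # ['201401010000', '201402010000']
print(used)     # 950
```

In this example the partition starts at 1% free. Two deletions bring it to 5% free, which satisfies PART_PERCENT_FREE=4, so the third expired entry is left on disk until a later run needs the space.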

If you are setting SUMS up on a test server that doesn't have much disk space, the following trigger sets the effective_date of every row inserted into sum_partn_alloc to six hours after the time of insertion, so that sum_rm is allowed to clean the data up soon afterwards:

CREATE OR REPLACE FUNCTION no_retention ( ) RETURNS TRIGGER LANGUAGE PLPGSQL AS $$
BEGIN
NEW.effective_date := to_char( (now()+INTERVAL '6 hours'), 'YYYYMMDDHH24MI' );
RETURN NEW;
END
$$;

CREATE TRIGGER no_retention BEFORE INSERT ON sum_partn_alloc FOR EACH ROW EXECUTE PROCEDURE no_retention();

The C routine that actually does the deletions is "int DS_RmDoX(char *name, double bytesdel)" in the file base/sums/libs/pg/SUMLIB_RmDoX.pgc