Ticket #323 (closed problem: fixed)

Opened 3 months ago

Last modified 3 months ago

STEREO & AIA data not coming back on sunpy

Reported by: jacob Owned by: jacob
Priority: highest Milestone:
Component: DP:NSO Version: 1.4
Severity: blocker Keywords:


For a while, any query from SunPy to vso03 (Boulder) timed out. Niles and I investigated the top-level CGI script and found that the issue lay deeper, in the core Perl code. After I reverted my changes to vsoi_wsdl.cgi the problem disappeared, so I'm chalking this up to a brief glitch with either HTTP or Perl on vso03.

Change History

comment:1 Changed 3 months ago by jacob

  • Status changed from new to closed
  • Resolution set to worksforme

comment:2 Changed 3 months ago by jacob

  • Status changed from closed to reopened
  • Resolution worksforme deleted

Reopening because the same problem is back. After a little testing I have some information:

The problem either exists in the core code or in the definition of the "$vso" object in the wsdl. The vso object in the wsdl cgi is defined as follows:

our $vso = Physics::Solar::VSO::Core->new(tpool=>$tpool);

Where $tpool is:

our $tpool = Physics::Solar::VSO::Utils::ThreadPool->new(4);

The last line of code that runs in the cgi script is line 87:

my @res = $vso->Query({version => $version, block => $block});

This calls a member function of the vso object (Query). However, print statements at the top of Query in Core.pm do not show up in the live log file, meaning that the process never reaches Query in Core.pm. This leads me to believe that there is something wrong with the definition of the $vso object, but since we are seeing timeouts and not internal server errors, I also think there could be an issue with VSO::Utils::ThreadPool.

comment:3 Changed 3 months ago by jacob

  • Status changed from reopened to closed
  • Resolution set to fixed

We discovered that Perl was generally unable to load other modules while running vsoi_wsdl.cgi, and a reboot fixed the problem, which suggests that this is a server issue rather than a problem in the code. While the problem persisted, instances of vsoi_wsdl.cgi were running indefinitely and consuming 100% of CPU resources. I wrote a cron script that runs every 5 minutes, monitors all processes for 10 seconds, and terminates any instance of vsoi_wsdl.cgi that uses more than 95% of CPU over those 10 seconds. In case that doesn't bring vso03 back up, the script also logs that a process was killed and sends me an email, so I can run a quick test query to verify that vso03 still works and reboot it if it doesn't.
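
For anyone who needs something similar, here is a minimal sketch of the watchdog idea described above. It is not the script running on vso03: the language (Python), the function names, and the use of Linux /proc are all assumptions made for illustration. Only the script name, the 10-second window, and the 95% threshold come from this ticket. The email step is left out.

```python
import logging
import os
import signal
import time

TARGET = "vsoi_wsdl.cgi"  # script name taken from the ticket
CPU_LIMIT = 95.0          # percent of one core, per the ticket
INTERVAL = 10.0           # sampling window in seconds, per the ticket


def parse_stat_ticks(stat_text):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    fields = stat_text.rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])  # fields 14 and 15 of the line


def cpu_percent(ticks_before, ticks_after, interval, ticks_per_second):
    """CPU use over the window, as a percentage of one core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_second * interval)


def find_runaways(before, after, interval, ticks_per_second, limit=CPU_LIMIT):
    """Return PIDs seen in both samples whose CPU use exceeds the limit."""
    return [
        pid for pid, ticks in after.items()
        if pid in before
        and cpu_percent(before[pid], ticks, interval, ticks_per_second) > limit
    ]


def sample(target=TARGET):
    """Map PID -> CPU ticks for processes whose command line names target."""
    ticks = {}
    for entry in os.listdir("/proc"):  # assumption: Linux-style /proc
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as fh:
                if target.encode() not in fh.read():
                    continue
            with open(f"/proc/{entry}/stat") as fh:
                ticks[int(entry)] = parse_stat_ticks(fh.read())
        except OSError:
            continue  # the process exited while it was being read
    return ticks


def run_once():
    """Sample twice, INTERVAL seconds apart, and kill runaway instances."""
    before = sample()
    time.sleep(INTERVAL)
    after = sample()
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    killed = []
    for pid in find_runaways(before, after, INTERVAL, ticks_per_second):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            continue  # already gone
        logging.warning("killed runaway %s process %d", TARGET, pid)
        killed.append(pid)
    return killed
```

To use a sketch like this, one would call run_once() at the bottom of the file and schedule the file from cron with a `*/5 * * * *` entry. Measuring the change in CPU ticks between two samples, rather than reading a lifetime average, is what ties the 95% threshold to the 10-second window.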
