Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: Stalled CA connection (IOC to CS-Studio archiver)
From: Ralph Lange <ralph.lange@gmx.de>
To: "Kasemir, Kay" <kasemirk@ornl.gov>, EPICS Core Talk <core-talk@aps.anl.gov>
Cc: Kunal Shroff <kshroff@ospreydcs.com>
Date: Sat, 26 Aug 2017 12:00:28 +0200

Hi Kay,

We have debugged the engine, and as far as I remember it is busy, yes.

But the issue is not that the client is busy and samples are dropped in the client's ring buffer or the IOC (via congestion mode). If the client can't keep up, samples will have to be dropped somewhere, obviously.

The issue is that caj declares an IOC connection dead while constantly receiving >40K updates/s over it. This is plain wrong, and I would consider it a serious bug.
Overloaded and dead are opposite extremes of the load spectrum, yet the CA client treats them the same. The criterion (no answer to an echo request within 5 sec) is not good enough, as it triggers in both cases.
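
To make the distinction concrete, here is a minimal sketch of a liveness check that counts any received traffic as proof of life, not only the echo reply. This is not caj's actual code: the class name LivenessTracker, its methods, and the State enum are all hypothetical, and the 5 sec timeout is simply taken from the criterion described above.

```java
import java.time.Duration;

/** Hypothetical liveness tracker; a sketch, not caj's implementation. */
final class LivenessTracker {
    enum State { ALIVE, UNRESPONSIVE }

    private final long timeoutNanos;
    private long lastReceivedNanos;

    LivenessTracker(Duration timeout, long nowNanos) {
        this.timeoutNanos = timeout.toNanos();
        this.lastReceivedNanos = nowNanos;
    }

    /**
     * Call for every message read from the connection: echo replies,
     * monitor updates, anything. All of them prove the peer is alive.
     */
    void onBytesReceived(long nowNanos) {
        lastReceivedNanos = nowNanos;
    }

    /** The peer is unresponsive only if nothing at all arrived within the timeout. */
    State check(long nowNanos) {
        return (nowNanos - lastReceivedNanos) > timeoutNanos
                ? State.UNRESPONSIVE
                : State.ALIVE;
    }
}
```

With such a check, a connection delivering monitor updates would never be declared dead just because the echo reply is queued behind them; whether this fits into caj's existing structure is a separate question.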

As a result, caj re-subscribes the channels, leading to more extreme overload and a blocked connection (both sides failing to send, both sides not receiving). Archiving just stops and never resumes until you shut down one side.
I would also consider this a bug, probably on both ends: failing to send should never cause either side to stop reading from the connection.

Cheers,
~Ralph



On Sat, Aug 26, 2017 at 12:14 AM, Kasemir, Kay <kasemirk@ornl.gov> wrote:
Hi:

> The connection is very busy, at more than 40K updates (double+status+timestamp) per second.
> We see congestion mode (events_off / events_on message pairs), i.e. the client is the bottleneck, while the IOC is still sort of relaxed,
> sending out beacons like clockwork.
> At some point, the client gets so busy that it stops decoding the IOC's beacons. The reason for this moment of saturation is not clear. Garbage collection?

Our bottleneck has always been the RDB (Oracle, ..), so the ring buffers for the samples drop values when too many PV updates arrive.
Haven't seen PV updates as a problem.
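
For illustration, a bounded sample buffer of the kind described above might look roughly like the sketch below. It is not the archive engine's actual class: the name SampleRingBuffer, its methods, and the drop-oldest policy are assumptions made for this example.

```java
import java.util.ArrayDeque;

/** Hypothetical bounded sample buffer; drops the oldest sample when full (assumed policy). */
final class SampleRingBuffer<T> {
    private final ArrayDeque<T> samples;
    private final int capacity;
    private long dropped;

    SampleRingBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.samples = new ArrayDeque<>(capacity);
    }

    /** Called by the PV update thread; never blocks, so the reader is not stalled. */
    synchronized void add(T sample) {
        if (samples.size() == capacity) {
            samples.pollFirst();   // discard the oldest sample to make room
            dropped++;
        }
        samples.addLast(sample);
    }

    /** Called by the writer thread (e.g. the RDB writer); returns null when empty. */
    synchronized T poll() {
        return samples.pollFirst();
    }

    synchronized long droppedCount() {
        return dropped;
    }

    synchronized int size() {
        return samples.size();
    }
}
```

The point of the design is that a slow writer costs samples, which can be counted and reported, and does not hold up the thread that receives PV updates.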

Does the client have a high CPU load, or is it somehow stuck?
Can you attach jvisualvm to see what's going on:
Is there a lot of GC activity?
Are threads blocked a lot?
The "Sampler" for the CPU should show where it's spending most of the time.

-Kay

