Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: Stalled CA connection (IOC to CS-Studio archiver)
From: Ralph Lange <ralph.lange@gmx.de>
To: EPICS Core Talk <core-talk@aps.anl.gov>
Date: Fri, 25 Aug 2017 21:52:28 +0200
Update: We have been able to reproduce the issue on a set of wiresharked boxes.

After staring at wireshark captures for a good while, here's our current most probable explanation:
  • The caj Java client (BEAUTY archive engine) is happily connected to the IOC.
  • The connection is very busy, at above 40K updates (double+status+timestamp) per second. We see congestion mode (events_off / events_on message pairs), i.e. the client is the bottleneck, while the IOC is still fairly relaxed, sending out beacons like clockwork.
  • At some point, the client gets so busy that it stops decoding the IOC's beacons. The reason for this moment of saturation is not clear. Garbage collection?
  • When the next beacon gets through (after a few minutes!), the client decides the IOC is unresponsive, and issues a ping (echo request) on the TCP circuit.
  • The TCP circuit is so busy that the echo doesn't return within the 5 second timeout period. The client declares the IOC dead (while continuing to receive lots of updates from it).
  • We see the client doing name resolution broadcasts, the IOC answers.
  • The client - still receiving lots of updates - issues event_add subscription requests for all channels.
  • Trying to send the initial value update responses for the new subscriptions, within a very short time the IOC fills its send buffers and blocks.
  • At the same time, the client fills its send buffers with event_add messages and blocks.
  • Both ends continuously fail to send (send buffers full) and never get around to receiving (receive buffers never drained).
Bottom line: an IOC may be unresponsive but not dead at all. Getting lots of updates should count as a sign of life.
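The bottom line above can be illustrated with a small sketch. This is Python, not CAJ's Java, and it is not CAJ's actual logic: the class name, method names and the `any_traffic_counts` switch are hypothetical, and only the 5 second echo timeout is taken from the description above. It compares a watchdog that accepts only an echo reply with one that treats any received message as a sign of life.

```python
from typing import Optional


class CircuitWatchdog:
    """Hypothetical liveness check for one TCP circuit (illustration only)."""

    def __init__(self, echo_timeout: float = 5.0, any_traffic_counts: bool = True):
        self.echo_timeout = echo_timeout              # seconds to wait for a sign of life
        self.any_traffic_counts = any_traffic_counts  # policy switch (assumption)
        self.echo_sent_at: Optional[float] = None     # None: no echo outstanding

    def echo_sent(self, now: float) -> None:
        """Record that an echo request was written to the circuit."""
        self.echo_sent_at = now

    def received(self, now: float, is_echo_reply: bool = False) -> None:
        """Record an incoming message; it may cancel the outstanding echo."""
        if is_echo_reply or self.any_traffic_counts:
            self.echo_sent_at = None

    def is_dead(self, now: float) -> bool:
        """True if an echo has been outstanding for longer than the timeout."""
        return (self.echo_sent_at is not None
                and now - self.echo_sent_at > self.echo_timeout)


def simulate(any_traffic_counts: bool) -> bool:
    """Echo sent at t=0, monitor updates keep arriving, echo reply never does."""
    dog = CircuitWatchdog(any_traffic_counts=any_traffic_counts)
    dog.echo_sent(0.0)
    for t in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0):
        dog.received(t)              # ordinary updates, not an echo reply
    return dog.is_dead(6.0)


strict_verdict = simulate(any_traffic_counts=False)
lenient_verdict = simulate(any_traffic_counts=True)
```

In this toy scenario the strict policy declares the circuit dead while updates are still arriving, and the lenient policy does not. Whether a real client should adopt the lenient policy is the question raised above, not something this sketch settles.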

Does that sound realistic? We can put the original captures in an accessible place for download if someone is interested.

Thanks for your help,
~Ralph



On Thu, Jun 15, 2017 at 5:53 PM, Michael Davidsaver <mdavidsaver@gmail.com> wrote:
On the RSRV side, my best guess is that the sender thread is in a
blocking send() with the client lock held (cf. cas_send_bs_msg() w/
lock_needed=true).  The recv thread is stuck trying to take the client lock.

A cursory look at src/com/cosylab/epics/caj/impl/CATransport.java
suggests that CAJ also locks around some send().  So it may be the same
situation there.

libca at least claims not to do this (in tcpiiu::sendThreadFlush() ).
If true, then libca would time out if RSRV got into this situation.
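The end state described in this thread (both peers stuck in send, neither draining its receive side) can be reproduced in miniature without any threads or locks. The following is a sketch under stated assumptions: it uses a local `socket.socketpair()` on a Unix-like system instead of a real CA TCP circuit, non-blocking sockets so that "would block" shows up as an exception instead of a hang, and hypothetical helper names. It is not RSRV or CAJ code.

```python
import socket


def fill_send_buffer(sock, chunk=b"x" * 4096):
    """Send until the kernel refuses more data; return the bytes accepted."""
    total = 0
    while True:
        try:
            total += sock.send(chunk)
        except BlockingIOError:
            return total


def can_send(sock):
    """True if a single byte can still be queued for sending."""
    try:
        sock.send(b"x")
        return True
    except BlockingIOError:
        return False


def drain(sock, limit):
    """Read up to `limit` bytes that are already queued; return bytes read."""
    got = 0
    while got < limit:
        try:
            data = sock.recv(min(65536, limit - got))
        except BlockingIOError:
            break
        if not data:
            break
        got += len(data)
    return got


ioc, client = socket.socketpair()   # stand-ins for the two ends of the circuit
ioc.setblocking(False)
client.setblocking(False)

# Both ends only send and never receive, as when the receiving thread on
# each side cannot run.
sent_by_ioc = fill_send_buffer(ioc)
sent_by_client = fill_send_buffer(client)
both_stalled = not can_send(ioc) and not can_send(client)

# As soon as one end reads again, the other end's send path recovers.
drained = drain(client, sent_by_ioc)
ioc_recovered = can_send(ioc)

ioc.close()
client.close()
```

The point of the sketch is that the stall is only permanent while nobody reads. Any design in which the receiving side can keep draining while a send is blocked avoids it.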




On 06/15/2017 05:29 PM, Kasemir, Kay wrote:
> Hi:
>
>
> No real clue.
>
>
> On the archive engine VM, you would issue
>
>
>   kill -QUIT {PID of the java process}
>
>
> which causes Java to dump a stack trace of all threads to its console,
> including locks that each thread has taken or is trying to take.
>
> (You can also use "jps" to list all java processes, then "jstack {PID}"
> to fetch a stack trace.)
>
>
> Maybe do that again 5 minutes later and compare to see if there's one
> thread that's blocked by a lock, or stuck in some function call and not
> progressing for some other reason.
>
>
> -Kay
>
>
>
> ------------------------------------------------------------------------
> *From:* core-talk-bounces@aps.anl.gov <core-talk-bounces@aps.anl.gov> on
> behalf of Ralph Lange <ralph.lange@gmx.de>
> *Sent:* Thursday, June 15, 2017 5:37 AM
> *To:* EPICS Core Talk
> *Subject:* Stalled CA connection (IOC to CS-Studio archiver)
>
> Hi all,
>
> We have an ongoing issue in a test setup that includes a Linux "Fast
> Controller" (IP...37) running IOCs (40k records each) on one end and a
> CS-Studio BEAUTY archiver on a VM (IP...41) on the other end. IOCs are
> running Base 3.15.5, BEAUTY uses a current JCA/CAJ client.
>
> The CA TCP connection is up, but blocked in both directions:
>
> On the fast controller (...37) , netstat shows
>
> tcp        0      0 IP...37:5064   0.0.0.0:*      LISTEN      29499/MAG-CYSI
> tcp    86888 178656 IP...37:5064   IP...41:40147  ESTABLISHED 29499/MAG-CYSI
>
> On the archiver VM (...41), we see
>
> tcp   495144  70184 IP...41:40147  IP...37:5064   ESTABLISHED 9164/java
> tcp        0      0 IP...41:40691  IP...49:5064   ESTABLISHED 9164/java
>
> tcpdump shows no traffic on that connection.
>
> The archive engine logs things like:
>
> 2017-06-12 22:17:53.047 WARNING [Thread 30]
> com.cosylab.epics.caj.impl.CATransport (noSyncSend) - Failed to send
> message to /IP...37:5064 - buffer full, will retry.
>
> and has not written data to the archive from this IOC for a long time.
> It is happily archiving data from other connections (e.g. the one shown
> in line 2 of the netstat output above).
>
> Obviously the TCP connection is blocked and backed up to the other host
> in both directions.
>
> The IOC is alive and casr shows all channels as connected.
>
> Why are both sides not taking data out of their receive-Qs?
>
> In this test setup, this is not happening to us for the first time. Has
> anyone seen such situations before? Any ideas for how to proceed trying
> to find out what's happening?
>
> Thanks a lot
> ~Ralph
>
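The symptom in the netstat output quoted above (non-zero Recv-Q and Send-Q on an established connection, seen from both hosts) can be checked mechanically. A small sketch, assuming the column layout of the quoted lines (protocol, Recv-Q, Send-Q, local address, foreign address, state, PID/program). The function name is hypothetical, and the sample lines are copied from the message above with the elided addresses left as they are.

```python
def backed_up_connections(netstat_lines):
    """Return established connections whose Recv-Q and Send-Q are both non-zero."""
    hits = []
    for line in netstat_lines:
        fields = line.split()
        if len(fields) < 6 or fields[5] != "ESTABLISHED":
            continue                      # skip LISTEN sockets and malformed lines
        recv_q, send_q = int(fields[1]), int(fields[2])
        if recv_q > 0 and send_q > 0:
            hits.append({
                "local": fields[3],
                "foreign": fields[4],
                "recv_q": recv_q,
                "send_q": send_q,
            })
    return hits


# Lines taken from the quoted netstat output of both hosts.
sample = [
    "tcp        0      0 IP...37:5064   0.0.0.0:*      LISTEN      29499/MAG-CYSI",
    "tcp    86888 178656 IP...37:5064   IP...41:40147  ESTABLISHED 29499/MAG-CYSI",
    "tcp   495144  70184 IP...41:40147  IP...37:5064   ESTABLISHED 9164/java",
    "tcp        0      0 IP...41:40691  IP...49:5064   ESTABLISHED 9164/java",
]

stalled = backed_up_connections(sample)
for conn in stalled:
    print(conn["local"], "->", conn["foreign"], conn["recv_q"], conn["send_q"])
```

A single snapshot with both queues non-zero is only a hint. Queues that stay non-zero across several snapshots while tcpdump shows no traffic match the situation described in the message.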


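Kay's suggestion in the quoted reply above (take two thread dumps a few minutes apart and compare them) can be partly automated. A rough sketch, assuming the usual jstack text layout: a header line starting with the quoted thread name, a `java.lang.Thread.State:` line, then `at ...` frames and `- ...` lock lines. The function names are hypothetical, and the two dumps below are invented sample data with made-up class names, not output from the archive engine.

```python
import re

HEADER = re.compile(r'^"(?P<name>[^"]+)"')
STATE = re.compile(r'java\.lang\.Thread\.State: (?P<state>\w+)')


def parse_dump(text):
    """Map thread name -> {'state': ..., 'frames': [...]} for one dump."""
    threads, name = {}, None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            name = header.group("name")
            threads[name] = {"state": None, "frames": []}
            continue
        if name is None:
            continue
        stripped = line.strip()
        state = STATE.search(line)
        if state:
            threads[name]["state"] = state.group("state")
        elif stripped.startswith(("at ", "- ")):
            threads[name]["frames"].append(stripped)
        elif not stripped:
            name = None                   # a blank line ends the thread's block
    return threads


def unchanged_threads(first, second):
    """Threads with an identical, non-empty stack and state in both dumps."""
    a, b = parse_dump(first), parse_dump(second)
    return {n: a[n]["state"] for n in a
            if n in b and a[n]["frames"] and a[n] == b[n]}


# Invented sample dumps (hypothetical thread and class names).
DUMP_1 = '''
"hypothetical-sender" #30 prio=5 tid=0x01 nid=0x02 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
    at example.Sender.flush(Sender.java:10)
    - waiting to lock <0x0001> (a java.lang.Object)

"hypothetical-worker" #31 prio=5 tid=0x03 nid=0x04 runnable
   java.lang.Thread.State: RUNNABLE
    at example.Worker.stepA(Worker.java:20)
'''
DUMP_2 = DUMP_1.replace("stepA(Worker.java:20)", "stepB(Worker.java:35)")

suspects = unchanged_threads(DUMP_1, DUMP_2)
```

Idle pool threads also look unchanged between dumps, so the result is a list of candidates to inspect by hand. Threads in state BLOCKED that wait for the same lock in both dumps are the most interesting ones.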

Replies:
Re: Stalled CA connection (IOC to CS-Studio archiver) Kasemir, Kay
References:
Stalled CA connection (IOC to CS-Studio archiver) Ralph Lange
Re: Stalled CA connection (IOC to CS-Studio archiver) Kasemir, Kay
Re: Stalled CA connection (IOC to CS-Studio archiver) Michael Davidsaver
