EPICS Re: Stalled CA connection (IOC to CS-Studio archiver)


Update: We have been able to reproduce the issue on a set of wiresharked boxes.

After staring at Wireshark captures for a good while, here is our current most probable explanation:

The caj Java client (BEAUTY archive engine) is happily connected to the IOC.
The connection is very busy, at more than 40K updates (double + status + timestamp) per second. We see congestion mode (events_off / events_on message pairs), i.e. the client is the bottleneck, while the IOC is still fairly relaxed, sending out beacons like clockwork.
At some point, the client gets so busy that it stops decoding the IOC's beacons. The reason for this moment of saturation is not clear. Garbage collection?
When the next beacon gets through (after a few minutes!), the client decides the IOC is unresponsive, and issues a ping (echo request) on the TCP circuit.
The TCP circuit is so busy that the echo doesn't return within the 5 second timeout period. The client declares the IOC dead (while continuing to receive lots of updates from it).
We see the client doing name resolution broadcasts, the IOC answers.
The client - still receiving lots of updates - issues event_add subscription requests for all channels.
Trying to send the initial value update responses for the new subscriptions, within a very short time the IOC fills its send buffers and blocks.
At the same time, the client fills its send buffers with event_add messages and blocks.
Both ends continuously fail to send (send buffers full) and never read, although their receive buffers are not empty.
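
The end state of the sequence above can be reproduced in miniature. The following is only a sketch under stated assumptions: it uses a local non-blocking socket pair from the Python standard library instead of a real CA TCP circuit, and the helper names (fill_send_buffer, drain, can_send, demo) are made up for illustration. It shows that once both peers have filled their send buffers and neither reads, both are stuck, and that a single read on one side is enough to let the other side send again.

```python
import socket

CHUNK = b"x" * 4096


def fill_send_buffer(sock):
    """Send until the kernel refuses more data; return the bytes accepted."""
    sent = 0
    while True:
        try:
            sent += sock.send(CHUNK)
        except BlockingIOError:
            return sent


def drain(sock):
    """Read everything currently queued for this socket; return the byte count."""
    received = 0
    while True:
        try:
            data = sock.recv(65536)
        except BlockingIOError:
            return received
        if not data:
            return received
        received += len(data)


def can_send(sock):
    """True if the socket accepts more data right now."""
    try:
        sock.send(CHUNK)
        return True
    except BlockingIOError:
        return False


def demo():
    a, b = socket.socketpair()
    a.setblocking(False)
    b.setblocking(False)
    try:
        # Both peers write without reading, like the IOC and the client.
        fill_send_buffer(a)
        fill_send_buffer(b)
        stalled = (not can_send(a)) and (not can_send(b))
        # One side finally reads: the opposite sender is unblocked.
        drain(b)
        recovered = can_send(a)
        return stalled, recovered
    finally:
        a.close()
        b.close()
```

Calling demo() returns a pair of booleans: whether both sides were stalled, and whether side a could send again after side b drained its receive queue.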

Bottom line: an IOC may be unresponsive but not dead at all. Getting lots of updates should count as a sign of life.
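
To make the bottom line concrete, here is a minimal sketch of a liveness check in which any received data cancels a pending echo request. This is not how CAJ or libca implement their timeouts; the class CircuitLiveness and its method names are hypothetical, and the 5 second default only mirrors the echo timeout mentioned above.

```python
import time


class CircuitLiveness:
    """Tracks signs of life on one circuit (hypothetical helper, not a CA API)."""

    def __init__(self, echo_timeout=5.0, clock=time.monotonic):
        self.echo_timeout = echo_timeout
        self.clock = clock
        self.echo_sent_at = None  # None means no echo is pending

    def on_echo_sent(self):
        # Called when an echo request is written to the circuit.
        self.echo_sent_at = self.clock()

    def on_bytes_received(self):
        # Any inbound traffic (value updates, echo reply) proves the peer
        # is alive, so the pending echo is considered answered.
        self.echo_sent_at = None

    def is_dead(self):
        # Dead only if an echo is pending and nothing at all has arrived
        # on the circuit for the full timeout period.
        if self.echo_sent_at is None:
            return False
        return self.clock() - self.echo_sent_at >= self.echo_timeout
```

With this rule, a circuit that keeps delivering updates is never declared dead while the echo reply is stuck behind them, which would avoid the reconnect and the flood of event_add requests described above.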

Does that sound realistic? We can put the original captures in an accessible place for download if someone is interested.

Thanks for your help,

~Ralph

On Thu, Jun 15, 2017 at 5:53 PM, Michael Davidsaver <[email protected]> wrote:

On the RSRV side, my best guess is that the sender thread is in a
blocking send() with the client lock held (cf. cas_send_bs_msg() w/
lock_needed=true). The recv thread is stuck trying to take the client lock.

A cursory look at src/com/cosylab/epics/caj/impl/CATransport.java
suggests that CAJ also locks around some send(). So it may be the same
situation there.

libca at least claims not to do this (in tcpiiu::sendThreadFlush() ).
If true, then libca would time out if RSRV got into this situation.
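
The locking pattern in question can be sketched as follows. This is an illustration only, not code from RSRV, CAJ or libca: the class Circuit and its methods are hypothetical, and it assumes a single sender thread per circuit. The lock protects the outgoing queue, but the potentially blocking send happens after the lock has been released, so a receive thread that needs the same lock is never stuck behind a full send buffer.

```python
import collections
import threading


class Circuit:
    """Hypothetical circuit with one lock shared by send and receive paths."""

    def __init__(self, send_fn):
        self._lock = threading.Lock()
        self._queue = collections.deque()
        self._send_fn = send_fn  # may block, e.g. a blocking socket send

    def queue_message(self, msg):
        # Short critical section: only the queue is touched.
        with self._lock:
            self._queue.append(msg)

    def flush(self):
        # Take the pending messages under the lock, then release it
        # before calling the send function that may block.
        with self._lock:
            pending = list(self._queue)
            self._queue.clear()
        for msg in pending:
            self._send_fn(msg)
```

If flush() called the send function inside the with block instead, any other thread calling queue_message() (or a receive handler taking the same lock) would wait for as long as the send is blocked.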

On 06/15/2017 05:29 PM, Kasemir, Kay wrote:
> Hi:
>
>
> No real clue.
>
>
> On the archive engine VM, you would issue
>
>
> kill -QUIT {PID of the java process}
>
>
> which causes Java to dump a stack trace of all threads to its console,
> including locks that each thread has taken or is trying to take.
>
> (You can also use "jps" to list all java processes, then "jstack {PID}"
> to fetch a stack trace.)
>
>
> Maybe do that again 5 minutes later and compare to see if there's one
> thread that's blocked by a lock, or stuck in some function call and not
> progressing for some other reason.
>
>
> -Kay
>
>
>
> ------------------------------------------------------------------------
> *From:* [email protected] <[email protected]> on
> behalf of Ralph Lange <[email protected]>
> *Sent:* Thursday, June 15, 2017 5:37 AM
> *To:* EPICS Core Talk
> *Subject:* Stalled CA connection (IOC to CS-Studio archiver)

>
> Hi all,
>
> We have an ongoing issue in a test setup that includes a Linux "Fast
> Controller" (IP...37) running IOCs (40k records each) on one end and a
> CS-Studio BEAUTY archiver on a VM (IP...41) on the other end. IOCs are
> running Base 3.15.5, BEAUTY uses a current JCA/CAJ client.
>
> The CA TCP connection is up, but blocked in both directions:
>
> On the fast controller (...37) , netstat shows
>
> tcp 0 0 IP...37:5064 0.0.0.0:* LISTEN 29499/MAG-CYSI
> tcp 86888 178656 IP...37:5064 IP...41:40147 ESTABLISHED 29499/MAG-CYSI
>
> On the archiver VM (...41), we see
>
> tcp 495144 70184 IP...41:40147 IP...37:5064 ESTABLISHED 9164/java
> tcp 0 0 IP...41:40691 IP...49:5064 ESTABLISHED 9164/java
>
> tcpdump shows no traffic on that connection.
>
> The archive engine logs things like:
>
> 2017-06-12 22:17:53.047 WARNING [Thread 30]
> com.cosylab.epics.caj.impl.CATransport (noSyncSend) - Failed to send
> message to /IP...37:5064 - buffer full, will retry.
>
> and has not written data to the archive from this IOC for a long time.
> It is happily archiving data from other connections (e.g. the one shown
> in line 2 of the netstat output above).
>
> Obviously the TCP connection is blocked and backed up to the other host
> in both directions.
>
> The IOC is alive and casr shows all channels as connected.
>
> Why are both sides not taking data out of their receive-Qs?
>
> In this test setup, this is not happening to us for the first time. Has
> anyone seen such situations before? Any ideas for how to proceed trying
> to find out what's happening?
>
> Thanks a lot
> ~Ralph
>
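
The netstat symptom from the original report (an ESTABLISHED connection with both Recv-Q and Send-Q non-zero) can be picked out of netstat output with a small script. This is a sketch that assumes the column order shown in the report (protocol, Recv-Q, Send-Q, local address, remote address, state); the function name stalled_connections is made up for illustration.

```python
def stalled_connections(netstat_text):
    """Return (local, remote, recv_q, send_q) for ESTABLISHED TCP rows
    whose receive and send queues are both non-empty."""
    stalled = []
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # header line or unrelated protocol
        recv_q, send_q, local, remote, state = fields[1:6]
        if state != "ESTABLISHED":
            continue
        if not (recv_q.isdigit() and send_q.isdigit()):
            continue
        if int(recv_q) > 0 and int(send_q) > 0:
            stalled.append((local, remote, int(recv_q), int(send_q)))
    return stalled
```

A single snapshot with non-zero queues is normal on a busy link. It only suggests a stall when the same numbers show up unchanged in repeated snapshots, as they did here.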

Subject:	Re: Stalled CA connection (IOC to CS-Studio archiver)
From:	Ralph Lange <[email protected]>
To:	EPICS Core Talk <[email protected]>
Date:	Fri, 25 Aug 2017 21:52:28 +0200

Experimental Physics and Industrial Control System