Experimental Physics and Industrial Control System


 

Subject: Re: Stalled CA connection (IOC to CS-Studio archiver)
From: Ralph Lange <[email protected]>
To: EPICS Core Talk <[email protected]>
Date: Fri, 25 Aug 2017 21:52:28 +0200
Update: We have been able to reproduce the issue on a set of wiresharked boxes.

After staring at Wireshark captures for a good while, here's our current most probable explanation:
Bottom line: an IOC may be unresponsive but not dead at all. Getting lots of updates should count as a sign of life.
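
To illustrate the bottom line, here is a minimal sketch of a liveness check in which any inbound data refreshes the "alive" timer. It assumes the client decides liveness from a single inactivity timeout. The class `ConnectionLiveness`, its methods and the 30-second value are made up for this example and are not taken from libca, CAJ or RSRV.

```python
import time


class ConnectionLiveness:
    """Hypothetical helper: decides whether a peer still counts as alive."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock              # injectable so the logic can be tested
        self._last_activity = clock()

    def on_bytes_received(self, nbytes):
        # Any inbound data (monitor updates included) refreshes the timer,
        # not only a reply to an explicit heartbeat request.
        if nbytes > 0:
            self._last_activity = self._clock()

    def is_alive(self):
        return (self._clock() - self._last_activity) < self.timeout_s


if __name__ == "__main__":
    live = ConnectionLiveness(timeout_s=30.0)   # assumed timeout value
    live.on_bytes_received(1024)                # e.g. a burst of updates
    print("alive:", live.is_alive())
```

With this approach a peer that keeps sending updates but is slow to answer requests is not declared dead, at the cost of detecting a truly silent peer no sooner than the timeout.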

Does that sound realistic? We can put the original captures somewhere accessible for download if someone is interested.

Thanks for your help,
~Ralph



On Thu, Jun 15, 2017 at 5:53 PM, Michael Davidsaver <[email protected]> wrote:
On the RSRV side, my best guess is that the sender thread is in a
blocking send() with the client lock held (cf. cas_send_bs_msg() w/
lock_needed=true).  The recv thread is stuck trying to take the client lock.

A cursory look at src/com/cosylab/epics/caj/impl/CATransport.java
suggests that CAJ also locks around some send().  So it may be the same
situation there.

libca at least claims not to do this (in tcpiiu::sendThreadFlush()).
If true, then libca would time out if RSRV got into this situation.
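
The difference between the two locking patterns can be sketched as follows. This is a toy model, not RSRV, CAJ or libca code: `Client`, `flush_holding_lock`, `flush_outside_lock` and `receiver_can_take_lock` are hypothetical names, and the blocking `send()` is simulated with an event rather than a real socket.

```python
import threading


class Client:
    """Toy per-client state with one lock shared by send and receive paths."""

    def __init__(self, send):
        self.lock = threading.Lock()
        self.outbuf = bytearray()
        self._send = send            # stand-in for a blocking socket send()

    def flush_holding_lock(self):
        # Suspected pattern: the lock stays held while send() blocks,
        # so a receive thread that needs the lock is stuck as well.
        with self.lock:
            self._send(bytes(self.outbuf))
            self.outbuf.clear()

    def flush_outside_lock(self):
        # Alternative: take the data out under the lock, then send unlocked.
        with self.lock:
            data = bytes(self.outbuf)
            self.outbuf.clear()
        self._send(data)


def receiver_can_take_lock(flush_name):
    """Return True if the client lock is free while send() is blocked."""
    entered, release = threading.Event(), threading.Event()

    def blocking_send(data):
        entered.set()                # tell the test that send() was entered
        release.wait(5)              # simulate a full TCP send buffer

    client = Client(blocking_send)
    client.outbuf += b"payload"
    sender = threading.Thread(target=getattr(client, flush_name))
    sender.start()
    entered.wait(5)
    got = client.lock.acquire(blocking=False)
    if got:
        client.lock.release()
    release.set()
    sender.join(5)
    return got


if __name__ == "__main__":
    print("lock held during send :", receiver_can_take_lock("flush_holding_lock"))
    print("lock dropped for send :", receiver_can_take_lock("flush_outside_lock"))
```

In the first variant the receive side cannot drain its socket while the send is blocked, which would match both receive queues filling up. The second variant keeps the receive path independent of a stalled send.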




On 06/15/2017 05:29 PM, Kasemir, Kay wrote:
> Hi:
>
>
> No real clue.
>
>
> On the archive engine VM, you would issue
>
>
>   kill -QUIT {PID of the java process}
>
>
> which causes Java to dump a stack trace of all threads to its console,
> including locks that each thread has taken or is trying to take.
>
> (You can also use "jps" to list all java processes, then "jstack {PID}"
> to fetch a stack trace.)
>
>
> Maybe do that again 5 minutes later and compare to see if there's one
> thread that's blocked by a lock, or stuck in some function call and not
> progressing for some other reason.
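
The comparison of two dumps can be sketched in a few lines. This assumes the usual jstack text layout (a quoted thread name, a `java.lang.Thread.State:` line, then `at ...` frames). The function names are hypothetical, and the sample dump in the usage part is invented for illustration, not taken from the archive engine.

```python
import re

HEADER = re.compile(r'^"(?P<name>[^"]+)"')
STATE = re.compile(r'java\.lang\.Thread\.State:\s+(?P<state>\w+)')


def parse_dump(text):
    """Map thread name -> (state, tuple of 'at ...' frames)."""
    threads = {}
    name = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            name = header.group("name")
            threads[name] = [None, []]
            continue
        if name is None:
            continue                      # preamble before the first thread
        stripped = line.strip()
        state = STATE.search(stripped)
        if state:
            threads[name][0] = state.group("state")
        elif stripped.startswith("at "):
            threads[name][1].append(stripped)
    return {n: (st, tuple(frames)) for n, (st, frames) in threads.items()}


def stuck_threads(first, second, states=("BLOCKED",)):
    """Names of threads with identical state and stack in both dumps."""
    a, b = parse_dump(first), parse_dump(second)
    return sorted(n for n in a
                  if n in b and a[n] == b[n] and a[n][0] in states)


if __name__ == "__main__":
    # Invented sample; real dumps would come from two jstack runs.
    dump = (
        '"Thread 30" #30 prio=5\n'
        '   java.lang.Thread.State: BLOCKED (on object monitor)\n'
        '\tat com.example.Sender.flush(Sender.java:10)\n'
    )
    print(stuck_threads(dump, dump))
```

Passing other values for `states` (for example `("RUNNABLE",)`) also catches threads that sit in the same call without being blocked on a lock. Idle pool threads will show up too, so the result still needs a human look.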
>
>
> -Kay
>
>
>
> ------------------------------------------------------------------------
> *From:* [email protected] <[email protected]> on
> behalf of Ralph Lange <[email protected]>
> *Sent:* Thursday, June 15, 2017 5:37 AM
> *To:* EPICS Core Talk
> *Subject:* Stalled CA connection (IOC to CS-Studio archiver)
>
> Hi all,
>
> We have an ongoing issue in a test setup that includes a Linux "Fast
> Controller" (IP...37) running IOCs (40k records each) on one end and a
> CS-Studio BEAUTY archiver on a VM (IP...41) on the other end. IOCs are
> running Base 3.15.5, BEAUTY uses a current JCA/CAJ client.
>
> The CA TCP connection is up, but blocked in both directions:
>
> On the fast controller (...37), netstat shows
>
> tcp        0      0 IP...37:5064   0.0.0.0:*      LISTEN      29499/MAG-CYSI
> tcp    86888 178656 IP...37:5064   IP...41:40147  ESTABLISHED 29499/MAG-CYSI
>
> On the archiver VM (...41), we see
>
> tcp   495144  70184 IP...41:40147  IP...37:5064   ESTABLISHED 9164/java
> tcp        0      0 IP...41:40691  IP...49:5064   ESTABLISHED 9164/java
>
> tcpdump shows no traffic on that connection.
>
> The archive engine logs things like:
>
> 2017-06-12 22:17:53.047 WARNING [Thread 30]
> com.cosylab.epics.caj.impl.CATransport (noSyncSend) - Failed to send
> message to /IP...37:5064 - buffer full, will retry.
>
> and has not written data to the archive from this IOC for a long time.
> It is happily archiving data from other connections (e.g. the one shown
> in line 2 of the netstat output above).
>
> Obviously the TCP connection is blocked and backed up to the other host
> in both directions.
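
Spotting such connections can be automated with a small filter over the netstat output. The sketch below assumes the Linux column layout shown above (protocol, Recv-Q, Send-Q, local address, foreign address, state). The function name `backed_up_connections` and the `threshold` parameter are hypothetical.

```python
def backed_up_connections(netstat_output, threshold=0):
    """Return ESTABLISHED TCP connections whose Recv-Q and Send-Q both
    exceed `threshold`, as (local, foreign, recv_q, send_q) tuples."""
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue                      # header or non-TCP line
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue                      # unexpected layout, skip it
        recv_q, send_q = int(fields[1]), int(fields[2])
        if (fields[5] == "ESTABLISHED"
                and recv_q > threshold and send_q > threshold):
            hits.append((fields[3], fields[4], recv_q, send_q))
    return hits


if __name__ == "__main__":
    # The two lines reported from the fast controller, addresses as given.
    sample = (
        "tcp        0      0 IP...37:5064   0.0.0.0:*      LISTEN      29499/MAG-CYSI\n"
        "tcp    86888 178656 IP...37:5064   IP...41:40147  ESTABLISHED 29499/MAG-CYSI\n"
    )
    for conn in backed_up_connections(sample):
        print(conn)
```

Run periodically, a check like this would flag a connection that has data queued in both directions. Whether the queues are actually stalled still has to be confirmed with a second sample or with tcpdump.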
>
> The IOC is alive and casr shows all channels as connected.
>
> Why are both sides not taking data out of their receive-Qs?
>
> This is not the first time this has happened to us in this test setup. Has
> anyone seen such situations before? Any ideas on how to proceed in finding
> out what's happening?
>
> Thanks a lot
> ~Ralph
>



