Hi Michael,
As Andrew mentions, the CA server runs as the lowest-priority entity
in the IOC, with the CA UDP daemon holding the very lowest slot.
Therefore, of course, if the IOC's CPU is saturated then we certainly
expect that this could impact the server's ability to run (either to
clean up or to allow new clients to attach).
> However if the gateway uses a name-server instead of UDP broadcasts
> this natural throttling will not be happening and I could see your
> symptoms occurring as a result.
Additionally, certain CA clients might be granted more of the CPU in
this situation than others, depending on the priority specified when
the channel is created on the client side (internally, in the GW, when
it creates a channel using the client library).
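For context, the per-channel priority being referred to here is an argument to
`ca_create_channel()` in the CA client library. A minimal sketch, assuming EPICS
base headers and a hypothetical PV name (this is not the gateway's actual code):

```c
/* Sketch only -- assumes EPICS base (cadef.h); not the GW's real code. */
#include <cadef.h>

void example(void)
{
    chid chan;
    /* The 4th argument is a per-channel priority hint
     * (CA_PRIORITY_MIN .. CA_PRIORITY_MAX) that the client library
     * forwards to the server, which may use it when apportioning
     * work among circuits. */
    SEVCHK(ca_create_channel("CAM:STATUS",        /* hypothetical PV   */
                             NULL,                /* no conn callback  */
                             NULL,                /* no user pointer   */
                             CA_PRIORITY_DEFAULT, /* priority hint     */
                             &chan),
           "ca_create_channel");
}
```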
Furthermore, if the IP kernel is low on buffers then activities in the server
might also be stalled. It's even possible that TCP circuit shutdown activities
might stall (i.e., socket close might block) when a network buffer isn't available,
depending on the IP kernel's implementation.
> 6. Running casr reports that practically all of these connections are to
> the gateway, for example:
Starting around R3.14.6 I made some changes in the CA client library so that it
will _not_ disconnect an unresponsive circuit and start a new one in such (heavily
congested) situations. Instead it disconnects the application but does not
disconnect the circuit, and simply waits for TCP to recover the circuit using
its mature built-in capabilities for dealing with congestion.
However, if for any reason the CA GW, of its own volition, were to destroy the
channel and create a new one (when the channel was unresponsive) then this would
circumvent the protections mentioned in the previous paragraph. I didn't write
the GW, but I have looked inside it, and I don't recall that it does this. I do
seem to recall that if a channel isn't used for some time in the gateway then it
will be destroyed and later recreated when a client (of the GW) needs it again, but
the timeouts are probably long enough that they are not coming into play in your
situation. Another possibility might be that this gateway was somehow built
against a pre-R3.14.6 CA client library.
> 8. After `casr 2` has completed, the bogus channels have gone away:
As I recall, casr doesn't perform any cleanup activities, so I don't claim to have
a clear explanation for this behavior. One possible guess: casr, running at higher
priority than the vxWorks shell, temporarily prevents the scan tasks from adding
more work to the event queue for the server to complete, thereby allowing the
network buffer starvation in the IP kernel to clear, and maybe that lets the
server threads finish their cleanup activities. Maybe a stretch.
The best way to find out what _is_ occurring would be to log into that IOC and use the
"tt <task id>" vxWorks target-shell command to determine where the CA server's TCP
threads happen to be loitering.
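For example (an illustrative transcript only; the task ID is hypothetical, and
the CA server task names/IDs on the real IOC come from the `i` task listing):

```
SR01C-DI-IOC-02 -> i                 <- list tasks; note the CA server task IDs
SR01C-DI-IOC-02 -> tt 0x23a5f90     <- trace one server task's call stack
```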
Jeff
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of [email protected]
> Sent: Monday, June 11, 2012 1:17 AM
> To: [email protected]
> Subject: Runaway connection count on IOC
>
> I have a very odd problem with one particular vxWorks (EPICS 3.14.11) IOC,
> where the connection count as reported by casr climbs into the many
> thousands, all to one client (the gateway server). Simply running `casr
> 2` is enough to clear this condition!
>
> Let me try and be precise.
>
> 1. The server is vxWorks 5.5.1 and EPICS 3.14.11
>
> 2. The server is running an asyn driver interfacing to a firewire camera
> and providing images over EPICS
>
> 3. The ethernet connection is horribly horribly overloaded (100MBit link),
> EPICS would like to deliver far more image frames than the network will
> permit
>
> 4. The EPICS gateway clearly struggles to connect to the IOC, typically
> most of the PVs provided by the IOC are inaccessible through the gateway.
>
> 5. At some random point during operation the number of IOC connections
> ($(IOC):CA:CNX as reported by vxStats) starts climbing steadily.
>
> 6. Running casr reports that practically all of these connections are to
> the gateway, for example:
>
> SR01C-DI-IOC-02 -> casr
> Channel Access Server V4.11
> Connected circuits:
> TCP 172.23.194.201:38552(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate",
> V4.11, 8697 Channels, Priority=0
> TCP 172.23.194.38:59307(cs03r-cs-serv-38.pri.diamond.ac.uk):
> User="epics_user", V4.11, 12 Channels, Priority=0
> TCP 172.23.194.201:38553(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate",
> V4.11, 18 Channels, Priority=0
> TCP 172.23.194.27:50143(cs03r-cs-serv-27.pri.diamond.ac.uk):
> User="archiver", V4.11, 1 Channels, Priority=20
> TCP 172.23.194.28:52559(cs03r-cs-serv-28.pri.diamond.ac.uk):
> User="archiver", V4.11, 1 Channels, Priority=20
>
> 7. Running `casr 2` takes forever (well, several minutes), the connection
> is over a 9600 baud serial line. Of the 8697 channels, the same PV is
> reported over and over and over and over again (it's a simple camera
> STATUS provided by the asyn driver).
>
> 8. After `casr 2` has completed, the bogus channels have gone away:
>
> SR01C-DI-IOC-02 -> casr
> Channel Access Server V4.11
> Connected circuits:
> TCP 172.23.194.201:38552(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate",
> V4.11, 6 Channels, Priority=0
> TCP 172.23.194.38:59307(cs03r-cs-serv-38.pri.diamond.ac.uk):
> User="epics_user", V4.11, 12 Channels, Priority=0
> TCP 172.23.194.201:38553(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate",
> V4.11, 18 Channels, Priority=0
> TCP 172.23.194.27:50143(cs03r-cs-serv-27.pri.diamond.ac.uk):
> User="archiver", V4.11, 1 Channels, Priority=20
> TCP 172.23.194.28:52559(cs03r-cs-serv-28.pri.diamond.ac.uk):
> User="archiver", V4.11, 1 Channels, Priority=20
>
>
> Very odd. Any thoughts?
>