Experimental Physics and
Industrial Control System

"Jeff Hill" <[email protected]> · Fri, 1 Dec 2006 11:38:53 -0700

> As I've read in CA documentation, some time ago the behaviour of CA
> library was different. The clients tended to start new search just as
> soon as they discover that connection is "unresponsive". But in the
> current version they wait until TCP connection goes down (and as you
> mentioned it may take pretty long time if the connection is not closed
> in a "polite" manner). So is there any configuration variable or some
> other way to choose which behaviour we need ?

First let me make a point of clarification. The client side application is
always informed of the channel disconnect within approximately
EPICS_CA_CONN_TMO seconds. This is independent of whether the TCP circuit is
closed in a "polite" manner or not.

So the only behavior difference is whether we should abandon the TCP circuit
and start a new one if it is unresponsive for EPICS_CA_CONN_TMO seconds (the
behavior prior to R3.14.6), or whether we should hang on to the unresponsive
TCP circuit and wait for it to become responsive again (the behavior
subsequent to R3.14.6).
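
As an illustration only (this is not the CA client library's code), the two policies described above can be sketched in Python. The names `Circuit`, `channel_is_disconnected`, and `should_abandon_circuit` are hypothetical, and the 30-second value used below is just an example setting for EPICS_CA_CONN_TMO.

```python
from dataclasses import dataclass


# Hypothetical model; none of these names come from the CA library.
@dataclass
class Circuit:
    silent_for: float   # seconds since the server was last heard from
    tcp_closed: bool    # True once the OS reports the TCP circuit is down


def channel_is_disconnected(c: Circuit, conn_tmo: float) -> bool:
    # Identical under both policies: the application is told about the
    # disconnect after roughly conn_tmo seconds of silence.
    return c.tcp_closed or c.silent_for >= conn_tmo


def should_abandon_circuit(c: Circuit, conn_tmo: float,
                           pre_r3_14_6: bool) -> bool:
    if c.tcp_closed:
        return True                      # nothing left to hang on to
    if pre_r3_14_6:
        return c.silent_for >= conn_tmo  # old: drop it and search again
    return False                         # new: wait for it to recover
```

With `Circuit(silent_for=40.0, tcp_closed=False)` and a 30-second timeout, both policies report the channel as disconnected, but only the pre-R3.14.6 policy abandons the circuit.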

Consider what happens with the old behavior if a network is congested, an
IOC's IP kernel is congested, or the IOC is running low on CPU. In that
situation we know that things are really not going well but hopefully only
temporarily. Perhaps this is because there is a temporary burst of traffic
that pushes the load into saturation of the available capacity (on certain
vxWorks versions we suspect that there are resource contention blockages in
the IP kernel implementation that may influence temporary congestion). If
that condition persists for more than EPICS_CA_CONN_TMO seconds, and the IOC
has many clients running a client library prior to R3.14.6, then all of
those clients will try to close their TCP circuits and start new ones. Due
to the congestion the close requests are likely to be postponed in their
delivery. These clients will immediately find the IOC and start building new
TCP circuits. These circuits are likely to be determined to be unresponsive
also, so we start all over again, adding more load onto the unfortunate IOC.
You can see that this is almost like trying to put out a fire by throwing
fuel on it.
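
The feedback loop described above can be shown with a deliberately crude toy model in Python. Everything here is an assumption made for illustration: load is a single number per time step, the IOC is "unresponsive" whenever load exceeds `capacity`, and a mass reconnect adds a fixed `reconnect_load` to the following step. It is not a measurement of any real IOC.

```python
def simulate(offered, capacity, reconnect_load, reconnect_on_timeout):
    """Return the total load seen at each step of the toy model.

    offered              -- baseline load per step (list of numbers)
    capacity             -- load above which circuits look unresponsive
    reconnect_load       -- extra load when all clients rebuild circuits
    reconnect_on_timeout -- True models the pre-R3.14.6 client policy
    """
    history = []
    extra = 0.0
    for base in offered:
        load = base + extra
        history.append(load)
        if reconnect_on_timeout and load > capacity:
            # Clients drop their circuits and build new ones next step.
            extra = reconnect_load
        else:
            extra = 0.0
    return history


burst = [50, 120, 50, 50, 50]          # one temporary burst at step 1
old = simulate(burst, 100, 60, True)   # stays above capacity after burst
new = simulate(burst, 100, 60, False)  # returns to baseline after burst
```

In this model a single one-step burst leaves the old policy stuck above capacity, while the new policy returns to the baseline as soon as the burst ends.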

In contrast we have the post R3.14.6 behavior where the client library will
just wait for the TCP circuit to become responsive again. There are several
benefits. We don't have a propensity for unstable load increases when we reach
load saturation. The client application will actually reconnect with less
latency when the circuit is responsive again because it need not renegotiate
the channel's circuit. Important preexisting clients are not thrown off of
the server and forced to compete for a new circuit when we reach load
saturation. We expect the load on an IOC will be less dynamic ...

What are the negatives? I can think of only a few. 

1) We anticipate with the post R3.14.6 behavior that the system is slightly
less fault tolerant should there be a special type of failure in the server
where one of its clients isn't serviced, but all of its other clients are
serviced. In that situation the user will be forced to wait the full
duration of the TCP keep-alive timer, or manually restart the client (or
server). IMHO this loss of robustness at the single client level is a good
trade off compared to the potential for loss of robustness at the system
level with the pre R3.14.6 client library behavior. We also observe that
with software that auto restarts when it fails there is less tendency for
bugs to be noticed enough that they get fixed. 

2) Furthermore, with the post R3.14.6 behavior we expect that if a user
abruptly turns off an IOC (this is typically associated with using the power
switch to stop an RTEMS or vxWorks based system) and reboots it (or some
fraction of the PVs contained therein) under a new IP address then the
client application will need to wait the full TCP keep-alive timeout until
it reconnects. During the design phase we came to the conclusion that this
situation will be a relatively rare occurrence in an operational system
where proper conservative change controls are in place, and therefore isn't
considered to be a profound negative. The argument goes like this: we don't
care about immediate reconnects when the PV moves to a new IP address unless
we are really depending on the system, and if we are really depending on the
system then why didn't we wait for a maintenance day to move the PV to a new
IP address?

So after that long explanation I hope our rationale for having no option in
the CA configuration to change this behavior is clear. Another, unrelated,
reason to avoid this configuration option is a desire not to make the
configuration any more complex than necessary.

Note also that you _can_ change the TCP keep-alive parameter in the kernel
configuration, but be conservative about doing this, considering the
propensity for unstable load feedback at load saturation and the fact that
this parameter will impact the behavior of many other TCP-based protocols
(FTP, telnet, NFS, HTTP, etc.).

In the future we will have redundant IOCs. There is also potential for
redundant gateways and load sharing gateways. That changes our tolerance for
reconnect delays when the IP address of a PV changes, but does not change
our resolve to avoid unstable load feedback when we approach load
saturation. This new situation possibly could be managed with the TCP
keep-alive parameter, or it may be that we _will_ need to add another
configuration parameter to CA setting a functional equivalent of the
keep-alive timeout, but private to the CA protocol. Needs more thought. 
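
To make the distinction between a kernel-wide keep-alive setting and a timeout scoped to one protocol concrete, here is a minimal Python sketch of per-socket keep-alive tuning. It is not something the CA library is stated to do; the helper name `enable_keepalive` and the timing values are assumptions. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options are platform specific, so the sketch sets only those the running platform exposes.

```python
import socket


def enable_keepalive(sock, idle_s=15, interval_s=5, probes=3):
    """Turn on TCP keep-alive for one socket only (hypothetical helper).

    Other sockets on the host, such as those used by FTP or NFS, keep
    the kernel-wide defaults.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle time before first probe
        ("TCP_KEEPINTVL", interval_s),  # time between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tuning:
        opt = getattr(socket, name, None)
        if opt is not None:             # skip knobs this platform lacks
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```

The caution above about unstable load feedback applies equally to a per-socket timeout: a short value brings back the reconnect storms of the old behavior.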

Of course, with a set of redundant IOCs, we will need the capability to turn
off search responses from the inactive members of the redundant set. And
with a set of load sharing gateways, we will need to have the capability to
configure a single CA UDP daemon, working on behalf of the entire set,
sending search responses with the IP address of the appropriate gateway
depending on load balancing criteria.
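
A rough Python sketch of the selection step such a daemon might perform follows. The data layout, the function name `pick_gateway`, and the "least loaded wins" rule are all assumptions, since no load balancing criteria are specified here. The `active` flag stands in for the ability to keep inactive members silent.

```python
def pick_gateway(pv_name, gateways):
    """Return the address to place in a search reply, or None.

    gateways maps an address string to a dict with hypothetical keys:
      "pvs"    -- set of PV names that member can serve
      "load"   -- current load figure, lower is better
      "active" -- False for members that must not be advertised
    """
    candidates = [
        (info["load"], addr)
        for addr, info in gateways.items()
        if info["active"] and pv_name in info["pvs"]
    ]
    if not candidates:
        return None              # stay silent: send no search reply
    return min(candidates)[1]    # least-loaded active member
```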

Jeff

> -----Original Message-----
> From: Artem Kazakov [mailto:[email protected]]
> Sent: Thursday, November 30, 2006 9:19 PM
> To: Jeff Hill
> Cc: 'Martin L. Smith'; 'Ken Evans'; 'Ralph Lange'; 'EPICS-tech-talk'
> Subject: Re: CA gateway question
> 
> Hi Jeff,
> 
> On Wed, 29 Nov 2006 15:19:58 -0700
> "Jeff Hill" <[email protected]> wrote:
> > The client library would eventually find the PV in the other gateway, and
> > connect through the other gateway. How fast it would switch over depends on
> > how it lost contact with the original gateway. If the network path to the
> > original gateway was severed it might take a relatively longer amount of
> > time to switch over - approximately as long as the TCP keep-alive interval
> > parameter specifies.
> 
> As I've read in CA documentation, some time ago the behaviour of CA
> library was different. The clients tended to start new search just as
> soon as they discover that connection is "unresponsive". But in the
> current version they wait until TCP connection goes down (and as you
> mentioned it may take pretty long time if the connection is not closed
> in a "polite" manner). So is there any configuration variable or some
> other way to choose which behaviour we need ?
> 
> 
> Artem.
