Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: necessary changes in caServer code for redundancy support
From: "Jeffrey O. Hill" <johill@lanl.gov>
To: "Matthias Clausen" <Matthias.Clausen@desy.de>, core-talk@aps.anl.gov
Cc: "Jeff Hill" <johill@lanl.gov>, "Ralph Lange" <ralph.lange@bessy.de>, "Bernd Schoeneburg" <bernd.schoeneburg@desy.de>, "Gongfa Liu" <gongfa.liu@desy.de>
Date: Wed, 7 Feb 2007 10:50:54 -0700 (MST)
Hello Matthias,

Sorry about some delay replying. This is at least partly due to my
workstation's power supply failing, and the necessary repairs taking
longer than expected.

Here is what I expect that we would need to do to the CA server(s) to
support hot switchover between redundant IOCs.

1) We need a function visible from outside the CA server that moves the CA
server to and from a disabled and an enabled state.

2) When entering the disabled state all preexisting channels and TCP
circuits must be disconnected. The R3.13 (most versions, as I recall) and
R3.14 (all versions) protocols allow the server to disconnect individual
channels. All versions of course allow the server to disconnect all
channels by disconnecting the TCP circuit. The primary implementation
headache is interrupting threads blocking in the socket "recv()" function,
but we already know how to solve this problem based on experience with the
multi-threaded R3.14 CA client library.

3) Of course in the disabled state the CA server sends no beacons, does
not respond to UDP requests, and will not allow clients to create TCP
circuits.

4) Another concern is forcing a disconnect when the dysfunctional member
of the redundant pair fails in a way that prevents the CA server from
being forced into a disabled state. This could be due to the serious kinds
of software and/or hardware failure that redundant IOCs are meant to
provide insurance against. In that situation the CA client library will
disconnect the application after EPICS_CA_CONN_TMO seconds, but the TCP
circuit to the unresponsive IOC might not disconnect until the TCP keep
alive timer in the client's IP kernel fires. We need this disconnect to
occur before the client library will proceed with finding the channels on
the redundant backup IOC and before the client library will proceed with
building a circuit to the redundant backup IOC. The keep alive timer's
delay might be considered to be too long to wait when we need to switch
quickly to the redundant backup IOC (this is a globally adjustable
parameter impacting all TCP circuits). This is a more difficult matter
because these types of delays are actually good for large systems, as we
don't like to "add fuel to the fire" when there is temporary congestion
in the network or in an IP kernel, but there may be conflicting
requirements.

5) The CA reference manual briefly discusses some issues that can arise if
we try to have two servers share the same IP address and UDP port. These
issues can be avoided by adding support for multicasting (a good idea for
several reasons), or by enforcing a constraint that each member of the
redundant set of IOCs must have a unique IP address (maybe redundant IOCs
should never share the same network interface).

6) Another matter is that we still have two CA servers and so the work
could end up being done twice. I am in the middle of a project for LANSCE
which has as one of its tasks modifying the portable server for use as the
one and only server in EPICS.
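
As a rough illustration of points 1) to 3), the enable/disable gate could
look something like the sketch below. Every name in it (ca_server_sketch_t,
srv_set_enabled and so on) is hypothetical and is not an existing EPICS
Base API; the circuit teardown is reduced to resetting a counter, where a
real server would have to close sockets and wake threads blocked in recv().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; none of these are EPICS Base APIs. */
typedef enum { SRV_DISABLED, SRV_ENABLED } srv_state_t;

typedef struct {
    srv_state_t state;
    int open_circuits;          /* TCP circuits currently held */
} ca_server_sketch_t;

/* Point 1: the externally visible switch between the two states. */
void srv_set_enabled(ca_server_sketch_t *s, bool enable)
{
    assert(s != NULL);
    if (!enable && s->state == SRV_ENABLED) {
        /* Point 2: entering the disabled state drops every circuit,
         * which also disconnects every channel riding on it.
         * Stand-in only: a real server would close the sockets here. */
        s->open_circuits = 0;
    }
    s->state = enable ? SRV_ENABLED : SRV_DISABLED;
}

/* Point 3: while disabled, no beacons ... */
bool srv_should_send_beacon(const ca_server_sketch_t *s)
{
    return s->state == SRV_ENABLED;
}

/* ... no replies to UDP search requests ... */
bool srv_should_answer_search(const ca_server_sketch_t *s)
{
    return s->state == SRV_ENABLED;
}

/* ... and no new TCP circuits. Returns true if the circuit was accepted. */
bool srv_accept_circuit(ca_server_sketch_t *s)
{
    if (s->state != SRV_ENABLED) {
        return false;
    }
    s->open_circuits++;
    return true;
}
```

The redundancy monitor would then only need to call srv_set_enabled() on
each member of the pair at switchover; the beacon, search and accept paths
each consult the same state rather than carrying their own "if (Master)"
test.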

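On point 4), to give a feel for the delays involved, the sketch below
estimates how long a keep alive timer takes to declare a silent peer dead,
and shows how one circuit's timer could be shortened on a platform that
offers per-socket keep alive options. It assumes Linux, where
TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT exist; other IP kernels may
only have the global setting. The function names are hypothetical, and
nothing here is taken from the CA client library.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Approximate time, in seconds, before keep alive gives up on a silent
 * peer: the idle period, then `count` unanswered probes spaced
 * `intvl_s` seconds apart. */
int keepalive_detect_seconds(int idle_s, int intvl_s, int count)
{
    return idle_s + intvl_s * count;
}

/* Enable keep alive on one socket and override its timing.
 * Linux-specific socket options; returns 0 on success, -1 on failure. */
int set_keepalive_sketch(int fd, int idle_s, int intvl_s, int count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) < 0)
        return -1;
    return 0;
}
```

With the usual Linux defaults (7200 s idle, 75 s between probes, 9 probes)
keepalive_detect_seconds() gives 7875 s, a little over two hours, which is
why the default timer is of no use for a fast switchover. Shortening it
brings back the "fuel on the fire" concern for large systems, so the
values chosen would be a site-level trade-off.
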
I hope that the above will serve at least as a target to throw some
tomatoes at!

PS: We are still deciding if there are sufficient funds for my attendance
at the EPICS meeting.

Jeff

On Thu, February 1, 2007 9:05 am, Matthias Clausen wrote:
> Hi Jeff and Ralph,
>
> the work on redundant IOCs is making progress.
> Gongfa from Hefei is working at DESY to tune the existing redundancy
> monitor task (RMT) and to integrate the different PRRs (primary
> redundancy resources - as we call them) into the RMT.
> The continuous control executive (CCE) written by Bob and his brother is
> already integrated, hooks have been written for the scan tasks, and we
> have an example driver with all the necessary hooks to check the state
> of the driver and to set the driver into the desired state.
>
> Now we have reached the point where we have to deal with the caServer.
> Our approach is that we do NOT want to change the CA protocol, so the
> client side will be untouched.
> We will leave any improvements on this side for later versions of CA and
> the redundancy implementation.
>
> So here's our idea - and we'd like to know how you would propose to
> implement it.
>
> Both caServers on a redundant pair should be up and running.
> This is because we'd like to provide a fast switchover and to avoid any
> interference with other tasks running on an IOC. E.g. we need the
> database enabled (but not scanned) in order to guarantee that we can
> update the database by means of the CCE.
>
> OK - so we have two caServers running - but the client should only see
> ONE.
> Our proposal:
> The caServer on the non-master IOC will not be able to 'find' any
> records. So the caSearch requests from the network will actually reach
> the caServer, but every search will be unsuccessful.
> ==> Where do you think we could implement this:
> if (Master) {
>  return caSearch();
> } else {
>  return FALSE;
> }
>
> The second problem is the reconnect of the CA clients.
> The idea here is that the selected (Master) IOC will send its beacons
> while the unselected one will not.
> What will happen:
> The master will change and stop sending beacons. The second IOC will
> start sending beacons.
> The client connection will time out and try to reconnect. It will
> connect to the new master, which is already available, and continue
> operation. The timeout period of 30 seconds is no problem for our
> applications.
> So where would be the best place to implement:
> if (Master) {
>   sendBeacon();
> }
>
> 2nd topic: caGateway
> We are going to collaborate on a redundant caGateway implementation with
> KEK.
> Furukawa-san will come in February and discuss our implementation. So
> we (he) want to adapt the current RMT to the caGateway, which would also
> need the same changes mentioned above.
>
> As you can see, we are interested in the simplest implementation that
> can be integrated into EPICS Base. In a redundant environment these
> changes will take effect, but otherwise they will not interfere with
> normal IOC / gateway operation.
> More sophisticated implementations (as we discussed previously - which
> include changes in the CA protocol - ) might follow. For now the changes
> proposed would fit our needs.
>
> Hope you can help us get these things right.
> Looking forward to your response.
>
> Take care - hope to see you in April at DESY!
>
> Regards
> Matthias
>
> --
> ------------------------------------------------------------------------
> Matthias Clausen                         Cryogenic Controls Group(MKS-2)
> phone:  +49-40-8998-3256                Deutsches Elektronen Synchrotron
> fax:    +49-40-8994-3256                                    Notkestr. 85
> e-mail: Matthias.Clausen@desy.de                           22607 Hamburg
> WWW-MKS2.desy.de                                                 Germany
> ------------------------------------------------------------------------
>
>


-- 
Jeffrey O. Hill               Mail         JOHill@lanl.gov
LANL MS H820                  Voice        505 665 1831
Los Alamos NM 87545 USA       Fax          505 665 5107











