Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: orderly shutdown
From: "Mark Rivers" <rivers@cars.uchicago.edu>
To: "Jeff Hill" <johill@lanl.gov>, "Ernest L. Williams Jr." <ernesto@ornl.gov>
Cc: "Dirk Zimoch" <dirk.zimoch@psi.ch>, "EPICS tech-talk" <tech-talk@aps.anl.gov>
Date: Wed, 11 Jan 2006 16:04:07 -0600
Jeff,

> There are different perspectives on this. One perspective is that CA
> already has such functions, ca_clear_channel and ca_context_destroy,
> and that all that is needed is a function called dbStopAll that calls
> them ;-). There would be many advantages to such an approach. One of
> them would be that devices could be shut down also. For example, the
> Allen Bradley TCP/IP circuits might also need to be gracefully shut
> down.

I like this suggestion, since in the case of the XPS controller I
mentioned earlier, the connection is not a CA link but a socket opened
with asyn.  asyn or the driver needs to close that socket on shutdown
in order to avoid the serious problems we are having on reboot.

Mark

> 
> Jeff
> 
> > -----Original Message-----
> > From: Ernest L. Williams Jr. [mailto:ernesto@ornl.gov]
> > Sent: Wednesday, January 11, 2006 1:56 PM
> > To: Mark Rivers
> > Cc: Jeff Hill; Dirk Zimoch; EPICS tech-talk
> > Subject: RE: channel access
> > 
> > On Wed, 2006-01-11 at 13:41 -0600, Mark Rivers wrote:
> > > Folks,
> > >
> > > > > we have a problem with CA since we upgraded our MV2300 IOCs
> > > > > to Tornado2.
> > > > >
> > > > > After a reboot, often channel access links don't connect
> > > > > immediately to the server. They connect a few minutes later
> > > > > when this message is printed:
> > > > >
> > > > > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > > > >   22="S_errno_EINVAL"
> > >
> > > This is not just a problem with IOC to IOC sockets, but with any
> > > vxWorks to vxWorks sockets.
> > >
> > > We recently purchased a Newport XPS motor controller.  It
> > > communicates over Ethernet, and uses vxWorks as its operating
> > > system.  We control the XPS from a vxWorks IOC.  When we reboot
> > > our vxWorks IOC the XPS will not communicate again after the IOC
> > > reboots, because it does not know the IOC rebooted, and the same
> > > ports are being used.  It is thus necessary to also reboot the
> > > XPS when rebooting the IOC.  But rebooting the XPS requires
> > > re-homing all of the motors, which is sometimes almost impossible
> > > because of installed equipment!  This is a real pain.
> > >
> > > This problem goes away if we control the XPS with a non-vxWorks
> > > IOC, such as Linux, probably because Linux closes the sockets
> > > when killing the IOC.
> > >
> > > On a related topic, I am appending an exchange I had with Jeff
> > > Hill and others on this topic in October 2003, that was not
> > > posted to tech-talk.
> > >
> > > Cheers,
> > > Mark Rivers
> > >
> > >
> > >
> > > Folks,
> > >
> > > I'd like to revisit the problem of CA disconnects when rebooting
> > > a vxWorks client IOC that has CA links to a vxWorks server IOC
> > > (that does not reboot).
> > >
> > > The EPICS 3.14.3 Release Notes say:
> > >
> > > "Recent versions of vxWorks appear to experience a connect
> > > failure if the vxWorks IP kernel reassigns the same ephemeral TCP
> > > port number as was assigned during a previous lifetime. The IP
> > > kernel on the vxWorks system hosting the CA server might have a
> > > stale entry for this ephemeral port that has not yet timed out
> > > which prevents the client from connecting with the ephemeral port
> > > assigned by the IP kernel. Eventually, after EPICS_CA_CONN_TMO
> > > seconds, the TCP connect sequence is aborted and the client
> > > library closes the socket, opens a new socket, receives a new
> > > ephemeral port assignment, and successfully connects."
> > >
> > > The last sentence is only partially correct.  The problem is that:
> > > - vxWorks assigns these ephemeral port numbers in ascending
> > > numerical order
> > > - It takes a very long time for the server IOC to kill the stale
> > > entries
> > >
> > > Thus, if I reboot the client many times in a row, it does not
> > > just result in one disconnect before the successful connection,
> > > but many.  I just did a test where I rebooted a vxWorks client
> > > IOC 11 times, as one might do when debugging IOC software.  This
> > > IOC is running Marty's example sequence program, with 2 PVs
> > > connecting to a remote vxWorks server IOC.
> > >
> > > Here is the amount of time elapsed before the sequence program
> > > PVs connected:
> > > Reboot #  Time (sec)
> > > 1           0.1
> > > 2           5.7
> > > 3            30
> > > 4            60
> > > 5            90
> > > 6           120
> > > 7            30
> > > 8           150
> > > 9           150
> > > 10          180
> > > 11          210
> > >
> > > Here is the output of "casr" on the vxWorks server IOC that
> > > never rebooted, after client reboot #11:
> > > Channel Access Server V4.11
> > > 164.54.160.74:1067(ioc13bma): User="iocboot", V4.11, Channel Count=1 Priority=80
> > > 164.54.160.100:4453(miata): User="dac_user", V4.8, Channel Count=461 Priority=0
> > > 164.54.160.75:1027(ioc13ida): User="iocboot", V4.11, Channel Count=1 Priority=80
> > > 164.54.160.101:3379(lebaron): User="dac_user", V4.8, Channel Count=18 Priority=0
> > > 164.54.160.73:1025(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1027(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1028(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1029(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1026(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1030(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.73:1031(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > 164.54.160.111:55807(millenia.cars.aps.anl.gov): User="webmaster", V4.8, Channel Count=291 Priority=0
> > > 164.54.160.73:1032(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > >
> > > There should only be one connection from the client,
> > > 164.54.160.73 (ioc13lab).  All but the highest numbered port
> > > (1032) are stale.
> > >
> > > The connection times do not increase by 30 seconds every single
> > > time, because for some reason every once in a while one of the
> > > old port connections times out (?) and is reused.  You can see
> > > that 1026 was reused in the above test.  But in general they do
> > > increase by 30 seconds on each reboot.
> > >
> > > This situation makes it very difficult to do software development
> > > under vxWorks in the case where CA connections to other vxWorks
> > > IOCs are used.  It starts to take 4 or 5 minutes for the CA
> > > connections to get established.  Rebooting the server IOC is
> > > often not an option.
> > >
> > > Here is a proposal for Jeff:
> > >
> > > Would it be possible to create a new function named something
> > > like vxCAClientStopAll?  This command would call close() on the
> > > CA connections for all vxWorks CA clients, and hence would
> > > gracefully close all of the socket connections on the server IOC.
> > >
> > > We could then make another new vxWorks command, "restart", which
> > > does:
> > > vxCAClientStopAll();
> > > reboot();
> > 
> > This is very awesome!!!
> > 
> > Jeff can you implement this for the next EPICS RELEASE???
> > 
> > 
> > Ernest
> > 
> > >
> > > This would not solve the problem for hard reboots, but it would
> > > make it possible in many cases to avoid these long delays in
> > > cases where an IOC is being deliberately rebooted under software
> > > control.
> > >
> > > Cheers,
> > > Mark
> > >
> > > Jeff's reply was:
> > > Mark,
> > >
> > >
> > > > - vxWorks assigns these ephemeral port numbers in ascending
> > > > numerical order
> > >
> > > That's correct; there could be several of these stale circuits,
> > > and the system will sequentially step through ephemeral port
> > > assignments, timing out each one until an open slot is found.
> > > One solution would be for WRS to store the last ephemeral port
> > > assignment in non-volatile RAM between boots.
> > >
> > > It's also true that this problem is mostly a development issue
> > > and not an operational issue, because during operations machines
> > > typically stay in a booted operational state for much longer than
> > > the stale circuit timeout interval.
> > >
> > > > - It takes a very long time for the server IOC to kill the
> > > > stale entries
> > >
> > > Yes, that's true. I do turn on the keep-alive timer, but it has
> > > a very long delay by default. This delay *can* however be changed
> > > globally for all circuits.
> > >
> > > I don't know what RTEMS does, but I strongly suspect that
> > > Windows, UNIX, and VMS systems hang up all connected circuits
> > > when the system is software rebooted.
> > >
> > > Therefore, we have a vxWorks and possibly an RTEMS specific
> > > problem.
> > >
> > > > Would it be possible to create a new function named something
> > > > like vxCAClientStopAll?  This command would call close() on the
> > > > CA connections for all vxWorks CA clients, and hence would
> > > > gracefully close all of the socket connections on the server
> > > > IOC.
> > > >
> > >
> > > Of course ca_context_destroy() and ca_task_exit() are fulfilling
> > > a similar, but context specific role. They do however shut down
> > > only one context at a time, and the context identifier is private
> > > to the context.
> > >
> > > So perhaps we should do this:
> > >
> > > Implement an iocCore shutdown module where subsystems register
> > > for a callback when iocCore is shut down. There would be a
> > > command line function that users call to shut down an IOC
> > > gracefully. This command would call all of the callbacks in LIFO
> > > order. The sequencer and the database links would of course call
> > > ca_context_destroy() in their IOC core shutdown callbacks.
> > >
> > > Jeff
> 
> 

