Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: orderly shutdown
From: "Jeff Hill" <johill@lanl.gov>
To: "'Ernest L. Williams Jr.'" <ernesto@ornl.gov>
Cc: "'Mark Rivers'" <rivers@cars.uchicago.edu>, "'Dirk Zimoch'" <dirk.zimoch@psi.ch>, "'EPICS tech-talk'" <tech-talk@aps.anl.gov>
Date: Thu, 12 Jan 2006 14:12:10 -0700
Thanks for that Ernest.

This looks like a useful code snippet. It is particularly relevant as an
R3.14 patch that could serve as a stopgap until the architectural upgrades
can be made in R3.15.

I created Mantis 235.

Jeff

> -----Original Message-----
> From: Ernest L. Williams Jr. [mailto:ernesto@ornl.gov]
> Sent: Wednesday, January 11, 2006 8:51 PM
> To: Jeff Hill
> Cc: 'Mark Rivers'; 'Dirk Zimoch'; 'EPICS tech-talk'
> Subject: Re: orderly shutdown
> 
> On Wed, 2006-01-11 at 14:40 -0700, Jeff Hill wrote:
> > > > Here is a proposal for Jeff:
> > > >
> > > > Would it be possible to create a new function named something like
> > > > vxCAClientStopAll.  This command would call close() on the CA
> > > > connections for all vxWorks CA clients, and hence would gracefully
> > > > close all of the socket connections on the server IOC.
> > > >
> > > > We could then make another new vxWorks command, "restart" which does
> > > > vxCAClientStopAll();
> > > > reboot();
> > >
> > > This is very awesome!!!
> > >
> > > Jeff can you implement this for the next EPICS RELEASE???
> > >
> > >
> > > Ernest
> > >
> >
> > What Mark suggests is certainly a possible fix. If such a function were
> > written, a better name than vxCAClientStopAll() might be
> > ca_close_circuits_but_dont_shut_anything_else_down(), because if the rest
> > of the CA infrastructure is not left in place, the db threads that are
> > still using it will crash and potentially disrupt the orderly shutdown.
> >
> > There are different perspectives on this. One perspective is that CA
> > already has such functions, ca_clear_channel and ca_context_destroy, and
> > that all that is needed is a function called dbStopAll that calls them
> > ;-). There would be many advantages to such an approach. One of them
> > would be that devices could be shut down also. For example, the Allen
> > Bradley TCP/IP circuits might also need to be gracefully shut down.
> >
> > Jeff
> 
> Jeff, here is some code that WindRiver sent me:
> ==========================================================================
> /* tcpRstAll.c - send RST on All open TCP connections (prior to system
> * reset)
> * Anton Langebner (anton.langebner@windriver.com)
> *
> * $Header:$
> *
> * $Log:$
> */
> 
> #include "vxWorks.h"
> #include "sys/types.h"
> #include "netinet/in.h"
> #include "netinet/in_pcb.h"
> #include "netinet/in_systm.h"
> #include "netinet/ip.h"
> #include "netinet/ip_var.h"
> #include "netinet/tcp.h"
> #include "netinet/tcp_fsm.h"
> #include "netinet/tcp_seq.h"
> #include "netinet/tcp_timer.h"
> #include "netinet/tcp_var.h"
> #include "netinet/tcpip.h"
> #include "net/route.h"
> #include "errno.h"
> #include "string.h"
> #include "stdio.h"
> #include "rebootLib.h" /* rebootHookAdd(), reboot() */
> 
> IMPORT struct inpcbhead *_pTcpPcbHead;
> 
> /* Reboot hook: startType is required by the hook signature but unused. */
> STATUS tcpRstAll(int startType)
> {
>     int s;
>     struct inpcb *inp;    /* TCP: current PCB */
>     struct tcpcb *pTcpCb; /* TCP: current TCP control block */
>
>     s = splnet();         /* lock out the network stack while walking */
>
>     if (_pTcpPcbHead == NULL)
>     {
>         splx(s);
>         printf("Reset TCP: no connections found!\n");
>         return(ERROR);
>     }
>
>     printf("Reset TCP connections");
>
>     for (inp = _pTcpPcbHead->lh_first; inp != NULL;
>          inp = inp->inp_list.le_next)
>     {
>         pTcpCb = (struct tcpcb *)inp->inp_ppcb;
>
>         /* skip PCBs without a TCP control block or not ESTABLISHED */
>         if (pTcpCb == NULL || pTcpCb->t_state != TCPS_ESTABLISHED)
>             continue;
>
>         printf(".");
>         /* BSD-derived stacks send RST from tcp_output() in state CLOSED */
>         pTcpCb->t_state = TCPS_CLOSED;
>         tcp_output(pTcpCb);
>     }
>
>     splx(s);
>     printf("done\n");
>     return(OK);
> }
>
> /* Install tcpRstAll() so that it runs on every software reboot. */
> STATUS tcpRstAllInit(void)
> {
>     printf("Adding tcpRstAll() Reboot Hook\n");
>     return rebootHookAdd((FUNCPTR)tcpRstAll);
> }
>
> /* Reset all connections immediately, then reboot. */
> void tcpRstAllNow(int startType)
> {
>     tcpRstAll(startType);
>     reboot(startType);
> }
> =====================================================================
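
For comparison with the kernel-level routine above: on a host with POSIX
sockets, a single connection can be reset from user space by enabling
SO_LINGER with a zero timeout before close(). The sketch below is only an
illustration of that effect on a loopback connection. The names
abortive_close() and peer_sees_reset() are hypothetical and are not part of
vxWorks or EPICS.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Close a TCP socket abortively: with SO_LINGER on and a zero timeout,
 * close() discards unsent data and sends RST instead of FIN. */
int abortive_close(int fd)
{
    struct linger lg;

    assert(fd >= 0);
    lg.l_onoff = 1;
    lg.l_linger = 0;
    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) < 0)
        return -1;
    return close(fd);
}

/* Loopback demonstration (hypothetical helper). Returns 1 if the accepting
 * side observes ECONNRESET after the client closes abortively, 0 if it
 * observes anything else, and -1 if the test connection cannot be set up. */
int peer_sees_reset(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    char buf[1];
    int lfd, cfd, sfd, result;
    ssize_t n;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick an ephemeral port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0) {
        close(lfd);
        return -1;
    }
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        if (cfd >= 0)
            close(cfd);
        close(lfd);
        return -1;
    }
    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0) {
        close(cfd);
        close(lfd);
        return -1;
    }

    abortive_close(cfd); /* the client side sends RST here */
    n = recv(sfd, buf, sizeof buf, 0);
    result = (n < 0 && errno == ECONNRESET) ? 1 : 0;
    close(sfd);
    close(lfd);
    return result;
}
```

The peer learns immediately that the circuit is gone, which is the effect
the reboot hook is after. An ordinary close() would send FIN instead and
also releases the peer's entry, so the abortive form is only one option.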
> 
> 
> Thanks,
> Ernest
> >
> > > -----Original Message-----
> > > From: Ernest L. Williams Jr. [mailto:ernesto@ornl.gov]
> > > Sent: Wednesday, January 11, 2006 1:56 PM
> > > To: Mark Rivers
> > > Cc: Jeff Hill; Dirk Zimoch; EPICS tech-talk
> > > Subject: RE: channel access
> > >
> > > On Wed, 2006-01-11 at 13:41 -0600, Mark Rivers wrote:
> > > > Folks,
> > > >
> > > > > > > we have a problem with CA since we upgraded our MV2300 IOCs
> > > > > > > to Tornado2.
> > > > > > >
> > > > > > > After a reboot, often channel access links don't connect
> > > > > > > immediately to the server. They connect a few minutes later
> > > > > > > when this message is printed:
> > > > > >
> > > > > > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > > > > >   22="S_errno_EINVAL"
> > > >
> > > > This is not just a problem with IOC to IOC sockets, but with any
> > > > vxWorks to vxWorks sockets.
> > > >
> > > > We recently purchased a Newport XPS motor controller.  It
> > > > communicates over Ethernet, and uses vxWorks as its operating
> > > > system.  We control the XPS from a vxWorks IOC. When we reboot our
> > > > vxWorks IOC the XPS will not communicate again after the IOC
> > > > reboots, because it does not know the IOC rebooted, and the same
> > > > ports are being used.  It is thus necessary to also reboot the XPS
> > > > when rebooting the IOC.  But rebooting the XPS requires re-homing
> > > > all of the motors, which is sometimes almost impossible because of
> > > > installed equipment!  This is a real pain.
> > > >
> > > > This problem goes away if we control the XPS with a non-vxWorks IOC,
> > > > such as Linux, probably because Linux closes the sockets when
> > > > killing the IOC.
> > > >
> > > > On a related topic, I am appending an exchange I had with Jeff Hill
> > > > and others on this topic in October 2003, that was not posted to
> > > > tech-talk.
> > > >
> > > > Cheers,
> > > > Mark Rivers
> > > >
> > > >
> > > >
> > > > Folks,
> > > >
> > > > I'd like to revisit the problem of CA disconnects when rebooting a
> > > > vxWorks client IOC that has CA links to a vxWorks server IOC (that
> > > > does not reboot).
> > > >
> > > > The EPICS 3.14.3 Release Notes say:
> > > >
> > > > "Recent versions of vxWorks appear to experience a connect failure
> > > > if the vxWorks IP kernel reassigns the same ephemeral TCP port
> > > > number as was assigned during a previous lifetime. The IP kernel on
> > > > the vxWorks system hosting the CA server might have a stale entry
> > > > for this ephemeral port that has not yet timed out which prevents
> > > > the client from connecting with the ephemeral port assigned by the
> > > > IP kernel. Eventually, after EPICS_CA_CONN_TMO seconds, the TCP
> > > > connect sequence is aborted and the client library closes the
> > > > socket, opens a new socket, receives a new ephemeral port
> > > > assignment, and successfully connects."
> > > >
> > > > The last sentence is only partially correct.  The problem is that:
> > > > - vxWorks assigns these ephemeral port numbers in ascending
> > > >   numerical order
> > > > - It takes a very long time for the server IOC to kill the stale
> > > >   entries
> > > >
> > > > Thus, if I reboot the client many times in a row, it does not just
> > > > result in one disconnect before the successful connection, but many.
> > > > I just did a test where I rebooted a vxWorks client IOC 11 times, as
> > > > one might do when debugging IOC software.  This IOC is running
> > > > Marty's example sequence program, with 2 PVs connecting to a remote
> > > > vxWorks server IOC.
> > > >
> > > > Here is the amount of time elapsed before the sequence program PVs
> > > > connected:
> > > > Reboot #  Time (sec)
> > > > 1           0.1
> > > > 2           5.7
> > > > 3            30
> > > > 4            60
> > > > 5            90
> > > > 6           120
> > > > 7            30
> > > > 8           150
> > > > 9           150
> > > > 10          180
> > > > 11          210
> > > >
> > > > Here is the output of "casr" on the vxWorks server IOC that never
> > > > rebooted after client reboot #11.
> > > > Channel Access Server V4.11
> > > > 164.54.160.74:1067(ioc13bma): User="iocboot", V4.11, Channel Count=1 Priority=80
> > > > 164.54.160.100:4453(miata): User="dac_user", V4.8, Channel Count=461 Priority=0
> > > > 164.54.160.75:1027(ioc13ida): User="iocboot", V4.11, Channel Count=1 Priority=80
> > > > 164.54.160.101:3379(lebaron): User="dac_user", V4.8, Channel Count=18 Priority=0
> > > > 164.54.160.73:1025(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > 164.54.160.73:1027(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > 164.54.160.73:1028(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > 164.54.160.73:1029(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > 164.54.160.73:1026(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > 164.54.160.73:1030(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > 164.54.160.73:1031(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > 164.54.160.111:55807(millenia.cars.aps.anl.gov): User="webmaster", V4.8, Channel Count=291 Priority=0
> > > > 164.54.160.73:1032(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > >
> > > > There should only be one connection from the client, 164.54.160.73
> > > > (ioc13lab).  All but the highest numbered port (1032) are stale.
> > > >
> > > > The connection times do not increase by 30 seconds every single
> > > > time, because for some reason every once in a while one of the old
> > > > port connections times out (?) and is reused.  You can see that 1026
> > > > was reused in the above test. But in general they do increase by 30
> > > > seconds on each reboot.
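
The pattern in the table is consistent with each stale port costing one
connect timeout. A minimal sketch of that arithmetic follows. It assumes the
client starts from the same first ephemeral port after every reboot (1025 is
taken from the casr output above), tries ports in ascending order, and pays
one EPICS_CA_CONN_TMO interval (assumed to be 30 seconds) for each stale
entry it hits. The function names are hypothetical, and the model ignores
stale entries that time out on their own.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if port appears in the server's list of stale client ports. */
int is_stale(const unsigned short *stale, size_t n, unsigned short port)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (stale[i] == port)
            return 1;
    return 0;
}

/* Estimate the seconds that elapse before the client connects: one
 * conn_tmo interval for every consecutive stale port, counting upward
 * from first_port. At most n attempts can fail, which bounds the loop. */
double estimated_connect_delay(const unsigned short *stale, size_t n,
                               unsigned short first_port, double conn_tmo)
{
    unsigned short port = first_port;
    double delay = 0.0;
    size_t tries;

    assert(stale != NULL || n == 0);
    for (tries = 0; tries < n && is_stale(stale, n, port); tries++) {
        delay += conn_tmo;
        port++;
    }
    return delay;
}
```

With the seven stale ioc13lab ports listed above (1025 through 1031) and a
30 second timeout, the estimate is 210 seconds, which matches the time
measured for reboot #11.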
> > > >
> > > > This situation makes it very difficult to do software development
> > > > under vxWorks in the case where CA connections to other vxWorks IOCs
> > > > are used. It starts to take 4 or 5 minutes for the CA connections to
> > > > get established.  Rebooting the server IOC is often not an option.
> > > >
> > > > Here is a proposal for Jeff:
> > > >
> > > > Would it be possible to create a new function named something like
> > > > vxCAClientStopAll.  This command would call close() on the CA
> > > > connections for all vxWorks CA clients, and hence would gracefully
> > > > close all of the socket connections on the server IOC.
> > > >
> > > > We could then make another new vxWorks command, "restart" which does
> > > > vxCAClientStopAll();
> > > > reboot();
> > >
> > > This is very awesome!!!
> > >
> > > Jeff can you implement this for the next EPICS RELEASE???
> > >
> > >
> > > Ernest
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > >
> > > > > This would not solve the problem for hard reboots, but it would
> > > > > make it possible in many cases to avoid these long delays in cases
> > > > > where an IOC is being deliberately rebooted under software
> > > > > control.
> > > >
> > > > Cheers,
> > > > Mark
> > > >
> > > > Jeff's reply was:
> > > > Mark,
> > > >
> > > >
> > > > > - vxWorks assigns these ephemeral port numbers in ascending
> > > > >   numerical order
> > > >
> > > > That's correct. There could be several of these stale circuits, and
> > > > the system will sequentially step through ephemeral port
> > > > assignments, timing out each one until an open slot is found. One
> > > > solution would be for WRS to store the last ephemeral port
> > > > assignment in non-volatile RAM between boots.
> > > >
> > > > It's also true that this problem is mostly a development issue and
> > > > not an operational issue, because during operations machines
> > > > typically stay in a booted operational state for much longer than
> > > > the stale circuit timeout interval.
> > > >
> > > > > - It takes a very long time for the server IOC to kill the stale
> > > > >   entries
> > > >
> > > > Yes, that's true. I do turn on the keep-alive timer, but it has a
> > > > very long delay by default. This delay *can* however be changed
> > > > globally for all circuits.
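
As a point of reference for the keep-alive timer mentioned above, here is a
minimal sketch of how a server turns keep-alive probing on for one socket
with the standard sockets API. The helper names are hypothetical and this is
not the CA server's actual code. The idle delay before the first probe is
not set here: RFC 1122 requires the default to be at least two hours, and
how to change it depends on the TCP stack.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Turn on TCP keep-alive probing for one socket.
 * Returns 0 on success and -1 on failure, like setsockopt(). */
int enable_keepalive(int fd)
{
    int on = 1;

    assert(fd >= 0);
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}

/* Report whether keep-alive is enabled: 1 if on, 0 if off, -1 on error. */
int keepalive_enabled(int fd)
{
    int value = 0;
    socklen_t len = sizeof value;

    assert(fd >= 0);
    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

Keep-alive only bounds how long a stale circuit survives on the server. With
the default delay, that bound is far longer than a typical reboot cycle
during development, which is why the stale entries accumulate.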
> > > >
> > > > I don't know what RTEMS does, but I strongly suspect that Windows,
> > > > UNIX, and VMS systems hang up all connected circuits when the system
> > > > is software rebooted.
> > > >
> > > > Therefore, we have a vxWorks and possibly an RTEMS specific problem.
> > > >
> > > > > Would it be possible to create a new function named something like
> > > > > vxCAClientStopAll.  This command would call close() on the CA
> > > > > connections for all vxWorks CA clients, and hence would
> > > > > gracefully close all of the socket connections on the server IOC.
> > > > >
> > > >
> > > > Of course ca_context_destroy() and ca_task_exit() are fulfilling a
> > > > similar, but context specific role. They do however shut down only
> > > > one context at a time, and the context identifier is private to the
> > > > context.
> > > >
> > > > So perhaps we should do this:
> > > >
> > > > Implement an iocCore shutdown module where subsystems register for
> > > > a callback when iocCore is shut down. There would be a command line
> > > > function that users call to shut down an IOC gracefully. This
> > > > function would call all of the callbacks in LIFO order. The
> > > > sequencer and the database links would of course call
> > > > ca_context_destroy() in their IOC core shutdown callbacks.
> > > >
> > > > Jeff
> >
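
The shutdown module proposed in the 2003 exchange above (subsystems register
callbacks, and a shutdown command runs them in LIFO order) could look roughly
like the sketch below. Every name here (shutdown_register, shutdown_run_all,
record_shutdown, shutdown_trace) is hypothetical and not an existing EPICS
API. The fixed-size table and the lack of locking are simplifications. The
record_shutdown() callback only records the order of calls and stands in for
a real callback, which per the proposal would call ca_context_destroy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SHUTDOWN_HOOKS 32

typedef void (*shutdown_fn)(void *arg);

struct shutdown_hook {
    shutdown_fn fn;
    void *arg;
};

static struct shutdown_hook hooks[MAX_SHUTDOWN_HOOKS];
static size_t hook_count = 0;

/* Register a callback to run at shutdown. Returns 0, or -1 if full. */
int shutdown_register(shutdown_fn fn, void *arg)
{
    assert(fn != NULL);
    if (hook_count >= MAX_SHUTDOWN_HOOKS)
        return -1;
    hooks[hook_count].fn = fn;
    hooks[hook_count].arg = arg;
    hook_count++;
    return 0;
}

/* Run every registered callback, most recently registered first (LIFO),
 * removing each one as it runs. Returns the number of callbacks run. */
size_t shutdown_run_all(void)
{
    size_t ran = 0;

    while (hook_count > 0) {
        struct shutdown_hook h = hooks[--hook_count];
        h.fn(h.arg);
        ran++;
    }
    return ran;
}

/* Demonstration only: each callback appends its one-character tag here. */
char shutdown_trace[16];

/* Example callback: arg points to a char that identifies the subsystem. */
void record_shutdown(void *arg)
{
    size_t len = strlen(shutdown_trace);

    if (len + 1 < sizeof shutdown_trace) {
        shutdown_trace[len] = *(const char *)arg;
        shutdown_trace[len + 1] = '\0';
    }
}
```

If the database links register first and the sequencer second, the sequencer
is shut down first. LIFO order means a subsystem is torn down before the
things it was built on top of.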


