Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: orderly shutdown
From: "Jeff Hill" <johill@lanl.gov>
To: "'Jeff Hill'" <johill@lanl.gov>, "'Ernest L. Williams Jr.'" <ernesto@ornl.gov>
Cc: "'Mark Rivers'" <rivers@cars.uchicago.edu>, "'Dirk Zimoch'" <dirk.zimoch@psi.ch>, "'EPICS tech-talk'" <tech-talk@aps.anl.gov>
Date: Thu, 12 Jan 2006 14:30:19 -0700
Considering this further.... 

If we crudely close the sockets instead of requesting an orderly shutdown I
expect that CA auxiliary threads, which will still be running, will detect
the close as a circuit disconnect and they will immediately start
reconnecting the channels. So it will be a race to see if they can throw up
new circuits before the IOC can be shut down. One way to avoid their being
very successful might be to assure that the soft reboot is initiated by a
thread at a relatively higher priority (that will typically be the case if
the soft reboot request occurs from the vxWorks shell).

Nevertheless, the above approach smells like a kludge that might require
chronic fiddling. The best approach in the long run will be to initiate an
orderly shutdown by calling ca_clear_channel followed by ca_context_destroy.

Jeff

> -----Original Message-----
> From: Jeff Hill [mailto:johill@lanl.gov]
> Sent: Thursday, January 12, 2006 2:12 PM
> To: 'Ernest L. Williams Jr.'
> Cc: 'Mark Rivers'; 'Dirk Zimoch'; 'EPICS tech-talk'
> Subject: RE: orderly shutdown
> 
> 
> Thanks for that Ernest.
> 
> This appears to be a useful code snippet particularly relevant in the
> context of an R3.14 patch that might be useful as a stop gap solution
> until the architectural upgrades can be accomplished in R3.15.
> 
> I created Mantis 235.
> 
> Jeff
> 
> > -----Original Message-----
> > From: Ernest L. Williams Jr. [mailto:ernesto@ornl.gov]
> > Sent: Wednesday, January 11, 2006 8:51 PM
> > To: Jeff Hill
> > Cc: 'Mark Rivers'; 'Dirk Zimoch'; 'EPICS tech-talk'
> > Subject: Re: orderly shutdown
> >
> > On Wed, 2006-01-11 at 14:40 -0700, Jeff Hill wrote:
> > > > > Here is a proposal for Jeff:
> > > > >
> > > > > > Would it be possible to create a new function named something like
> > > > > > vxCAClientStopAll.  This command would call close() on the CA
> > > > > > connections for all vxWorks CA clients, and hence would gracefully
> > > > > > close all of the socket connections on the server IOC.
> > > > > >
> > > > > > We could then make another new vxWorks command, "restart", which
> > > > > > does
> > > > > > vxCAClientStopAll();
> > > > > > reboot();
> > > >
> > > > This is very awesome!!!
> > > >
> > > > Jeff can you implement this for the next EPICS RELEASE???
> > > >
> > > >
> > > > Ernest
> > > >
> > >
> > > What Mark suggests is certainly a possible fix. If such a function
> > > were written, its name, instead of vxCAClientStopAll(), might be
> > > ca_close_circuits_but_dont_shut_anything_else_down(), because if the
> > > rest of the CA infrastructure is not left in place, the db threads
> > > that are still using it will crash and potentially disrupt the
> > > orderly shutdown.
> > >
> > > There are different perspectives on this. One perspective is that CA
> > > already has such functions, ca_clear_channel and ca_context_destroy,
> > > and that all that is needed is a function called dbStopAll that calls
> > > them ;-). There would be many advantages to such an approach. One of
> > > them would be that devices could be shut down also. For example, the
> > > Allen Bradley TCP/IP circuits might also need to be gracefully shut
> > > down.
> > >
> > > Jeff
> >
> > Jeff, here is some code that WindRiver sent me:
> >
> > =====================================================================
> > /* tcpRstAll.c - send RST on All open TCP connections (prior to system
> > reset)
> > * Anton Langebner (anton.langebner@windriver.com)
> > *
> > * $Header:$
> > *
> > * $Log:$
> > */
> >
> > #include "vxWorks.h"
> > #include "sys/types.h"
> > #include "netinet/in.h"
> > #include "netinet/in_pcb.h"
> > #include "netinet/in_systm.h"
> > #include "netinet/ip.h"
> > #include "netinet/ip_var.h"
> > #include "netinet/tcp.h"
> > #include "netinet/tcp_fsm.h"
> > #include "netinet/tcp_seq.h"
> > #include "netinet/tcp_timer.h"
> > #include "netinet/tcp_var.h"
> > #include "netinet/tcpip.h"
> > #include "net/route.h"
> > #include "errno.h"
> > #include "string.h"
> > #include "stdio.h"
> >
> > IMPORT struct inpcbhead *_pTcpPcbHead;
> >
> > STATUS tcpRstAll(int startType)
> > {
> > int s;
> > struct inpcb *pInpcb; /* TCP: PCB Head */
> > struct inpcb *inp; /* TCP: Current PCB */
> > struct tcpcb *pTcpCb; /* TCP: Current TCP PCB */
> >
> > struct socket *pSock;
> > struct rtentry * pRouteEntry = NULL;
> > struct sockaddr * destAddr = NULL;
> > short timeout;
> >
> > s= splnet();
> >
> > if (_pTcpPcbHead==NULL)
> > {
> > splx(s);
> > printf("Reset TCP: no connections found!\n");
> > return(ERROR);
> > }
> >
> > printf("Reset TCP connections");
> >
> > pInpcb= _pTcpPcbHead->lh_first;
> >
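> > for (inp= pInpcb; inp!= NULL; inp= inp->inp_list.le_next)
> > {
> > pTcpCb= (struct tcpcb *)inp->inp_ppcb;
> >
> > /* skip PCBs without a TCP control block or not in ESTABLISHED state */
> > if (pTcpCb==NULL || pTcpCb->t_state!=TCPS_ESTABLISHED)
> > continue;
> >
> > printf(".");
> > /* forcing CLOSED is intended to make tcp_output() emit a RST segment */
> > pTcpCb->t_state= TCPS_CLOSED;
> > tcp_output(pTcpCb);
> > }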
> >
> > splx(s);
> > printf("done\n");
> > return(OK);
> > }
> >
> > STATUS tcpRstAllInit()
> > {
> > printf("Adding tcpRstAll() Reboot Hook\n");
> > rebootHookAdd((FUNCPTR)tcpRstAll);
> > return(OK);
> > }
> >
> > void tcpRstAllNow(int startType)
> > {
> > tcpRstAll(startType);
> > reboot(startType);
> > }
> > =====================================================================
> >
> >
> > Thanks,
> > Ernest
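
For comparison with the kernel-level tcpRstAll() quoted above, a user-space program on a POSIX system can get a similar effect for one socket at a time: enabling SO_LINGER with a zero timeout makes close() abort the connection with a RST instead of the usual FIN handshake. The following is only a sketch under that assumption (POSIX sockets, not the vxWorks network stack); close_with_reset() and demo_reset_on_loopback() are hypothetical names that do not appear in the thread.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Abortive close: SO_LINGER on with a zero linger time makes close()
 * drop unsent data and send RST rather than FIN. Returns 0 on success. */
int close_with_reset(int fd)
{
    struct linger lg;

    lg.l_onoff = 1;   /* enable lingering ...        */
    lg.l_linger = 0;  /* ... with a zero-second wait */
    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) != 0)
        return -1;
    return close(fd);
}

/* Loopback demonstration. Returns the errno seen by the peer's recv()
 * after the abortive close, 0 if the peer saw an orderly EOF instead,
 * or -1 if the test connection could not be set up. */
int demo_reset_on_loopback(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    int lst, cli = -1, srv = -1, result = -1;
    char byte;
    ssize_t n;

    lst = socket(AF_INET, SOCK_STREAM, 0);
    if (lst < 0)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  /* let the kernel pick a free port */

    if (bind(lst, (struct sockaddr *)&addr, sizeof addr) != 0 ||
        listen(lst, 1) != 0 ||
        getsockname(lst, (struct sockaddr *)&addr, &len) != 0)
        goto done;

    cli = socket(AF_INET, SOCK_STREAM, 0);
    if (cli < 0 || connect(cli, (struct sockaddr *)&addr, sizeof addr) != 0)
        goto done;
    srv = accept(lst, NULL, NULL);
    if (srv < 0)
        goto done;

    if (close_with_reset(cli) != 0)
        goto done;
    cli = -1;

    /* The peer learns about the reset the next time it touches the socket. */
    n = recv(srv, &byte, 1, 0);
    result = (n < 0) ? errno : 0;

done:
    if (srv >= 0) close(srv);
    if (cli >= 0) close(cli);
    close(lst);
    return result;
}
```

On a typical Linux host demo_reset_on_loopback() would be expected to report ECONNRESET, which is the "peer knows immediately" behaviour the reboot hook is trying to obtain for every established connection at once.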
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > >
> > > > -----Original Message-----
> > > > From: Ernest L. Williams Jr. [mailto:ernesto@ornl.gov]
> > > > Sent: Wednesday, January 11, 2006 1:56 PM
> > > > To: Mark Rivers
> > > > Cc: Jeff Hill; Dirk Zimoch; EPICS tech-talk
> > > > Subject: RE: channel access
> > > >
> > > > On Wed, 2006-01-11 at 13:41 -0600, Mark Rivers wrote:
> > > > > Folks,
> > > > >
> > > > > > > > we have a problem with CA since we upgraded our MV2300 IOCs
> > > > > > > > to Tornado2.
> > > > > > > >
> > > > > > > > After a reboot, often channel access links don't connect
> > > > > > > > immediately to the server. They connect a few minutes later
> > > > > > > > when this message is printed:
> > > > > > > >
> > > > > > > > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > > > > > > >   22="S_errno_EINVAL"
> > > > >
> > > > > > This is not just a problem with IOC to IOC sockets, but with any
> > > > > > vxWorks to vxWorks sockets.
> > > > > >
> > > > > > We recently purchased a Newport XPS motor controller.  It
> > > > > > communicates over Ethernet, and uses vxWorks as its operating
> > > > > > system.  We control the XPS from a vxWorks IOC. When we reboot
> > > > > > our vxWorks IOC the XPS will not communicate again after the IOC
> > > > > > reboots, because it does not know the IOC rebooted, and the same
> > > > > > ports are being used.  It is thus necessary to also reboot the
> > > > > > XPS when rebooting the IOC.  But rebooting the XPS requires
> > > > > > re-homing all of the motors, which is sometimes almost
> > > > > > impossible because of installed equipment!  This is a real pain.
> > > > >
> > > > > > This problem goes away if we control the XPS with a non-vxWorks
> > > > > > IOC, such as Linux, probably because Linux closes the sockets
> > > > > > when killing the IOC.
> > > > > >
> > > > > > On a related topic, I am appending an exchange I had with Jeff
> > > > > > Hill and others on this topic in October 2003, that was not
> > > > > > posted to tech-talk.
> > > > >
> > > > > Cheers,
> > > > > Mark Rivers
> > > > >
> > > > >
> > > > >
> > > > > Folks,
> > > > >
> > > > > > I'd like to revisit the problem of CA disconnects when rebooting a
> > > > > > vxWorks client IOC that has CA links to a vxWorks server IOC (that
> > > > > > does not reboot).
> > > > > >
> > > > > > The EPICS 3.14.3 Release Notes say:
> > > > > >
> > > > > > "Recent versions of vxWorks appear to experience a connect failure
> > > > > > if the vxWorks IP kernel reassigns the same ephemeral TCP port
> > > > > > number as was assigned during a previous lifetime. The IP kernel
> > > > > > on the vxWorks system hosting the CA server might have a stale
> > > > > > entry for this ephemeral port that has not yet timed out which
> > > > > > prevents the client from connecting with the ephemeral port
> > > > > > assigned by the IP kernel. Eventually, after EPICS_CA_CONN_TMO
> > > > > > seconds, the TCP connect sequence is aborted and the client
> > > > > > library closes the socket, opens a new socket, receives a new
> > > > > > ephemeral port assignment, and successfully connects."
> > > > >
> > > > > > The last sentence is only partially correct.  The problem is that:
> > > > > > - vxWorks assigns these ephemeral port numbers in ascending
> > > > > >   numerical order
> > > > > > - It takes a very long time for the server IOC to kill the stale
> > > > > >   entries
> > > > > >
> > > > > > Thus, if I reboot the client many times in a row, it does not just
> > > > > > result in one disconnect before the successful connection, but
> > > > > > many.  I just did a test where I rebooted a vxWorks client IOC 11
> > > > > > times, as one might do when debugging IOC software.  This IOC is
> > > > > > running Marty's example sequence program, with 2 PVs connecting to
> > > > > > a remote vxWorks server IOC.
> > > > >
> > > > > Here is the amount of time elapsed before the sequence program PVs
> > > > > connected:
> > > > > Reboot #  Time (sec)
> > > > > 1           0.1
> > > > > 2           5.7
> > > > > 3            30
> > > > > 4            60
> > > > > 5            90
> > > > > 6           120
> > > > > 7            30
> > > > > 8           150
> > > > > 9           150
> > > > > 10          180
> > > > > 11          210
> > > > >
> > > > > > Here is the output of "casr" on the vxWorks server IOC that never
> > > > > > rebooted after client reboot #11.
> > > > > > Channel Access Server V4.11
> > > > > > 164.54.160.74:1067(ioc13bma): User="iocboot", V4.11, Channel Count=1 Priority=80
> > > > > > 164.54.160.100:4453(miata): User="dac_user", V4.8, Channel Count=461 Priority=0
> > > > > > 164.54.160.75:1027(ioc13ida): User="iocboot", V4.11, Channel Count=1 Priority=80
> > > > > > 164.54.160.101:3379(lebaron): User="dac_user", V4.8, Channel Count=18 Priority=0
> > > > > > 164.54.160.73:1025(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > > > 164.54.160.73:1027(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > > > 164.54.160.73:1028(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > > > 164.54.160.73:1029(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > > > 164.54.160.73:1026(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > > > 164.54.160.73:1030(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > > > 164.54.160.73:1031(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > > > 164.54.160.111:55807(millenia.cars.aps.anl.gov): User="webmaster", V4.8, Channel Count=291 Priority=0
> > > > > > 164.54.160.73:1032(ioc13lab): User="iocboot", V4.11, Channel Count=2 Priority=0
> > > > >
> > > > > There should only be one connection from the client, 164.54.160.73
> > > > > (ioc13lab).  All but the highest numbered port (1032) are stale.
> > > > >
> > > > > > The connection times do not increase by 30 seconds every single
> > > > > > time, because for some reason every once in a while one of the old
> > > > > > port connections times out (?) and is reused.  You can see that
> > > > > > 1026 was reused in the above test. But in general they do increase
> > > > > > by 30 seconds on each reboot.
> > > > > >
> > > > > > This situation makes it very difficult to do software development
> > > > > > under vxWorks in the case where CA connections to other vxWorks
> > > > > > IOCs are used. It starts to take 4 or 5 minutes for the CA
> > > > > > connections to get established.  Rebooting the server IOC is often
> > > > > > not an option.
> > > > >
> > > > > Here is a proposal for Jeff:
> > > > >
> > > > > > Would it be possible to create a new function named something like
> > > > > > vxCAClientStopAll.  This command would call close() on the CA
> > > > > > connections for all vxWorks CA clients, and hence would gracefully
> > > > > > close all of the socket connections on the server IOC.
> > > > > >
> > > > > > We could then make another new vxWorks command, "restart", which
> > > > > > does
> > > > > > vxCAClientStopAll();
> > > > > > reboot();
> > > >
> > > > This is very awesome!!!
> > > >
> > > > Jeff can you implement this for the next EPICS RELEASE???
> > > >
> > > >
> > > > Ernest
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > >
> > > > > > This would not solve the problem for hard reboots, but it would
> > > > > > make it possible in many cases to avoid these long delays in cases
> > > > > > where an IOC is being deliberately rebooted under software control.
> > > > >
> > > > > Cheers,
> > > > > Mark
> > > > >
> > > > > Jeff's reply was:
> > > > > Mark,
> > > > >
> > > > >
> > > > > > > - vxWorks assigns these ephemeral port numbers in ascending
> > > > > > >   numerical order
> > > > > >
> > > > > > That's correct; there could be several of these stale circuits, and
> > > > > > the system will sequentially step through ephemeral port
> > > > > > assignments, timing out each one until an open slot is found. One
> > > > > > solution would be for WRS to store the last ephemeral port
> > > > > > assignment in non-volatile RAM between boots.
> > > > >
> > > > > > It's also true that this problem is mostly a development issue and
> > > > > > not an operational issue because during operations machines
> > > > > > typically stay in a booted operational state for much longer than
> > > > > > the stale circuit timeout interval.
> > > > >
> > > > > > > - It takes a very long time for the server IOC to kill the stale
> > > > > > >   entries
> > > > > >
> > > > > > Yes, that's true. I do turn on the keep-alive timer, but it has a
> > > > > > very long delay by default. This delay *can* however be changed
> > > > > > globally for all circuits.
> > > > >
> > > > > > I don't know what RTEMS does, but I strongly suspect that Windows,
> > > > > > UNIX, and VMS systems hang up all connected circuits when the
> > > > > > system is software rebooted.
> > > > > >
> > > > > > Therefore, we have a vxWorks and possibly an RTEMS specific
> > > > > > problem.
> > > > >
> > > > > > > Would it be possible to create a new function named something
> > > > > > > like vxCAClientStopAll.  This command would call close() on the
> > > > > > > CA connections for all vxWorks CA clients, and hence would
> > > > > > > gracefully close all of the socket connections on the server
> > > > > > > IOC.
> > > > > >
> > > > >
> > > > > > Of course ca_context_destroy() and ca_task_exit() are fulfilling a
> > > > > > similar, but context specific role. They do however shut down only
> > > > > > one context at a time, and the context identifier is private to
> > > > > > the context.
> > > > >
> > > > > So perhaps we should do this:
> > > > >
> > > > > > Implement an iocCore shutdown module where subsystems register for
> > > > > > callback when iocCore is shut down. There would be a command line
> > > > > > function that users call to shut down an IOC gracefully. This
> > > > > > command line function would call all of the callbacks in LIFO
> > > > > > order. The sequencer and the database links would of course call
> > > > > > ca_context_destroy() in their IOC core shutdown callbacks.
> > > > >
> > > > > Jeff
> > >
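
The shutdown module proposed at the end of the quoted 2003 exchange could look roughly like the following: a small registry where each subsystem adds a callback, and a single shutdown call that runs them in LIFO order so that the last subsystem started is the first one stopped. This is a sketch of the idea only, not an EPICS API; ioc_shutdown_register(), ioc_shutdown_run(), note_shutdown() and the fixed-size table are all hypothetical, and the code is not thread safe. In a real IOC the sequencer and database-link callbacks would be the place to call ca_clear_channel() and ca_context_destroy().

```c
#include <stddef.h>
#include <string.h>

typedef void (*shutdown_fn)(void *arg);

#define MAX_SHUTDOWN_HOOKS 32

static struct {
    shutdown_fn fn;
    void *arg;
} hooks[MAX_SHUTDOWN_HOOKS];
static size_t n_hooks;

/* Register a callback to run at IOC shutdown.
 * Returns 0 on success, -1 if fn is NULL or the table is full. */
int ioc_shutdown_register(shutdown_fn fn, void *arg)
{
    if (fn == NULL || n_hooks >= MAX_SHUTDOWN_HOOKS)
        return -1;
    hooks[n_hooks].fn = fn;
    hooks[n_hooks].arg = arg;
    n_hooks++;
    return 0;
}

/* Run all registered callbacks, most recently registered first, and
 * empty the table. Returns the number of callbacks that were run. */
size_t ioc_shutdown_run(void)
{
    size_t ran = 0;

    while (n_hooks > 0) {
        n_hooks--;
        hooks[n_hooks].fn(hooks[n_hooks].arg);
        ran++;
    }
    return ran;
}

/* Example callback for illustration: appends a one-character tag to a
 * log string so that the order of the calls can be inspected. */
char call_log[16];

void note_shutdown(void *arg)
{
    size_t len = strlen(call_log);

    if (len + 1 < sizeof call_log) {
        call_log[len] = *(const char *)arg;
        call_log[len + 1] = '\0';
    }
}
```

If device support, database links and the sequencer register in that order with tags "d", "l" and "s", ioc_shutdown_run() calls them as s, l, d: the CA clients are torn down before the devices they may depend on.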



ANJ, 02 Sep 2010