Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: orderly shutdown
From: "Ernest L. Williams Jr." <ernesto@ornl.gov>
To: Jeff Hill <johill@lanl.gov>
Cc: "'Mark Rivers'" <rivers@cars.uchicago.edu>, "'Dirk Zimoch'" <dirk.zimoch@psi.ch>, "'EPICS tech-talk'" <tech-talk@aps.anl.gov>
Date: Wed, 11 Jan 2006 22:50:44 -0500
On Wed, 2006-01-11 at 14:40 -0700, Jeff Hill wrote:
> > > Here is a proposal for Jeff:
> > >
> > > Would it be possible to create a new function named something like
> > > vxCAClientStopAll.  This command would call close() on the CA
> > > connections for all vxWorks CA clients, and hence would gracefully close
> > > all of the socket connections on the server IOC.
> > >
> > > We could then make another new vxWorks command, "restart" which does
> > > vxCAClientStopAll();
> > > reboot();
> > 
> > This is very awesome!!!
> > 
> > Jeff can you implement this for the next EPICS RELEASE???
> > 
> > 
> > Ernest
> >
> 
> What Mark suggests is certainly a possible fix. If such a function were
> written, its name might be, instead of vxCAClientStopAll(),
> ca_close_circuits_but_dont_shut_anything_else_down(), because if the rest of
> the CA infrastructure is not left in place, the db threads that are still
> using it will crash and potentially disrupt the orderly shutdown.
> 
> There are different perspectives on this. One perspective is that CA already
> has such functions, ca_clear_channel and ca_context_destroy, and that all
> that is needed is a function called dbStopAll that calls them ;-). There
> would be many advantages to such an approach. One of them would be that
> devices could be shut down also. For example the Allen Bradley TCP/IP
> circuits might also need to be gracefully shut down.
> 
> Jeff

Jeff, here is some code that WindRiver sent me:
===========================================================================
/* tcpRstAll.c - send RST on all open TCP connections (prior to system reset)
 * Anton Langebner (anton.langebner@windriver.com)
 *
 * $Header:$
 *
 * $Log:$
 */

#include "vxWorks.h"
#include "sys/types.h"
#include "netinet/in.h"
#include "netinet/in_pcb.h"
#include "netinet/in_systm.h"
#include "netinet/ip.h"
#include "netinet/ip_var.h"
#include "netinet/tcp.h"
#include "netinet/tcp_fsm.h"
#include "netinet/tcp_seq.h"
#include "netinet/tcp_timer.h"
#include "netinet/tcp_var.h"
#include "netinet/tcpip.h"
#include "net/route.h"
#include "errno.h"
#include "string.h"
#include "stdio.h"
#include "rebootLib.h"  /* rebootHookAdd(), reboot() */

IMPORT struct inpcbhead *_pTcpPcbHead;

/* Walk the TCP PCB list and send a RST on every established connection. */
STATUS tcpRstAll(int startType)
{
    int s;
    struct inpcb *inp;    /* TCP: current PCB */
    struct tcpcb *pTcpCb; /* TCP: current TCP control block */

    s = splnet();  /* lock out the network stack while walking the list */

    if (_pTcpPcbHead == NULL)
    {
        splx(s);
        printf("Reset TCP: no connections found!\n");
        return(ERROR);
    }

    printf("Reset TCP connections");

    for (inp = _pTcpPcbHead->lh_first; inp != NULL; inp = inp->inp_list.le_next)
    {
        pTcpCb = (struct tcpcb *)inp->inp_ppcb;

        /* Skip PCBs without a TCP control block or not in ESTABLISHED state. */
        if (pTcpCb == NULL || pTcpCb->t_state != TCPS_ESTABLISHED)
            continue;

        printf(".");
        /* With the state forced to CLOSED, tcp_output() is expected to
         * emit a RST segment (BSD-derived stack behavior). */
        pTcpCb->t_state = TCPS_CLOSED;
        tcp_output(pTcpCb);
    }

    splx(s);
    printf("done\n");
    return(OK);
}

/* Install tcpRstAll() so that it runs automatically on reboot(). */
STATUS tcpRstAllInit(void)
{
    printf("Adding tcpRstAll() Reboot Hook\n");
    rebootHookAdd((FUNCPTR)tcpRstAll);
    return(OK);
}

/* Reset all connections immediately, then reboot. */
void tcpRstAllNow(int startType)
{
    tcpRstAll(startType);
    reboot(startType);
}
=====================================================================
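
For comparison, an application on a POSIX-style socket stack can get a
similar effect for a single socket it owns, without touching kernel PCB
lists: setting SO_LINGER with a zero linger time makes close() abort the
connection, so the peer sees a reset instead of keeping a stale circuit.
The following is a minimal sketch under that assumption. It is not part of
the WindRiver code, whether vxWorks honors this option in the same way is
not established in this thread, and close_with_reset and demo_peer_errno
are hypothetical names.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Close fd so that the peer receives a RST instead of a normal FIN.
 * Returns 0 on success, -1 on failure (hypothetical helper). */
int close_with_reset(int fd)
{
    struct linger lg = { .l_onoff = 1, .l_linger = 0 };

    assert(fd >= 0);
    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) != 0)
        return -1;
    return close(fd);
}

/* Loopback demonstration: returns the errno seen by the peer's recv()
 * after the client aborts, 0 on a clean end-of-file, -1 on setup failure. */
int demo_peer_errno(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    char byte;
    ssize_t n;
    int result = -1;
    int srv = -1;
    int lsn = socket(AF_INET, SOCK_STREAM, 0);
    int cli = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  /* let the kernel pick a free port */

    if (lsn < 0 || cli < 0) goto done;
    if (bind(lsn, (struct sockaddr *)&addr, sizeof addr) != 0) goto done;
    if (listen(lsn, 1) != 0) goto done;
    if (getsockname(lsn, (struct sockaddr *)&addr, &len) != 0) goto done;
    if (connect(cli, (struct sockaddr *)&addr, sizeof addr) != 0) goto done;
    srv = accept(lsn, NULL, NULL);
    if (srv < 0) goto done;

    if (close_with_reset(cli) != 0) goto done;
    cli = -1;  /* already closed */

    n = recv(srv, &byte, 1, 0);      /* blocks until the reset arrives */
    result = (n < 0) ? errno : 0;

done:
    if (srv >= 0) close(srv);
    if (cli >= 0) close(cli);
    if (lsn >= 0) close(lsn);
    return result;
}
```

On a Linux host the peer's recv() should fail with ECONNRESET, which is
the application-level counterpart of what tcpRstAll() does for every
established connection at once.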


Thanks,
Ernest









> 
> > -----Original Message-----
> > From: Ernest L. Williams Jr. [mailto:ernesto@ornl.gov]
> > Sent: Wednesday, January 11, 2006 1:56 PM
> > To: Mark Rivers
> > Cc: Jeff Hill; Dirk Zimoch; EPICS tech-talk
> > Subject: RE: channel access
> > 
> > On Wed, 2006-01-11 at 13:41 -0600, Mark Rivers wrote:
> > > Folks,
> > >
> > > > > we have a problem with CA since we upgraded our MV2300 IOCs
> > > > > to Tornado2.
> > > > >
> > > > > After a reboot, often channel access links don't connect
> > > > > immediately to the server. They connect a few minutes later when
> > > > > this message is printed:
> > > > >
> > > > > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > > > >   22="S_errno_EINVAL"
> > >
> > > This is not just a problem with IOC to IOC sockets, but with any vxWorks
> > > to vxWorks sockets.
> > >
> > > We recently purchased a Newport XPS motor controller.  It communicates
> > > over Ethernet, and uses vxWorks as its operating system.  We control
> > > the XPS from a vxWorks IOC. When we reboot our vxWorks IOC the XPS will
> > > not communicate again after the IOC reboots, because it does not know
> > > the IOC rebooted, and the same ports are being used.  It is thus
> > > necessary to also reboot the XPS when rebooting the IOC.  But rebooting
> > > the XPS requires re-homing all of the motors, which is sometimes almost
> > > impossible because of installed equipment!  This is a real pain.
> > >
> > > This problem goes away if we control the XPS with a non-vxWorks IOC,
> > > such as Linux, probably because Linux closes the sockets when killing
> > > the IOC.
> > >
> > > On a related topic, I am appending an exchange I had with Jeff Hill and
> > > others on this topic in October 2003, that was not posted to tech-talk.
> > >
> > > Cheers,
> > > Mark Rivers
> > >
> > >
> > >
> > > Folks,
> > >
> > > I'd like to revisit the problem of CA disconnects when rebooting a
> > > vxWorks client IOC that has CA links to a vxWorks server IOC (that does
> > > not reboot).
> > >
> > > The EPICS 3.14.3 Release Notes say:
> > >
> > > "Recent versions of vxWorks appear to experience a connect failure if
> > > the vxWorks IP kernel reassigns the same ephemeral TCP port number as
> > > was assigned during a previous lifetime. The IP kernel on the vxWorks
> > > system hosting the CA server might have a stale entry for this ephemeral
> > > port that has not yet timed out which prevents the client from
> > > connecting with the ephemeral port assigned by the IP kernel.
> > > Eventually, after EPICS_CA_CONN_TMO seconds, the TCP connect sequence is
> > > aborted and the client library closes the socket, opens a new socket,
> > > receives a new ephemeral port assignment, and successfully connects."
> > >
> > > The last sentence is only partially correct.  The problem is that:
> > > - vxWorks assigns these ephemeral port numbers in ascending numerical
> > > order
> > > - It takes a very long time for the server IOC to kill the stale entries
> > >
> > > Thus, if I reboot the client many times in a row, it does not just
> > > result in one disconnect before the successful connection, but many.  I
> > > just did a test where I rebooted a vxWorks client IOC 11 times, as one
> > > might do when debugging IOC software.  This IOC is running Marty's
> > > example sequence program, with 2 PVs connecting to a remote vxWorks
> > > server IOC.
> > >
> > > Here is the amount of time elapsed before the sequence program PVs
> > > connected:
> > > Reboot #  Time (sec)
> > > 1           0.1
> > > 2           5.7
> > > 3            30
> > > 4            60
> > > 5            90
> > > 6           120
> > > 7            30
> > > 8           150
> > > 9           150
> > > 10          180
> > > 11          210
> > >
> > > Here is the output of "casr" on the vxWorks server IOC (which was never
> > > rebooted), taken after client reboot #11.
> > > Channel Access Server V4.11
> > > 164.54.160.74:1067(ioc13bma): User="iocboot", V4.11, Channel Count=1
> > > Priority=80
> > > 164.54.160.100:4453(miata): User="dac_user", V4.8, Channel Count=461
> > > Priority=0
> > > 164.54.160.75:1027(ioc13ida): User="iocboot", V4.11, Channel Count=1
> > > Priority=80
> > > 164.54.160.101:3379(lebaron): User="dac_user", V4.8, Channel Count=18
> > > Priority=0
> > > 164.54.160.73:1025(ioc13lab): User="iocboot", V4.11, Channel Count=2
> > > Priority=0
> > > 164.54.160.73:1027(ioc13lab): User="iocboot", V4.11, Channel Count=2
> > > Priority=0
> > > 164.54.160.73:1028(ioc13lab): User="iocboot", V4.11, Channel Count=2
> > > Priority=0
> > > 164.54.160.73:1029(ioc13lab): User="iocboot", V4.11, Channel Count=2
> > > Priority=0
> > > 164.54.160.73:1026(ioc13lab): User="iocboot", V4.11, Channel Count=2
> > > Priority=0
> > > 164.54.160.73:1030(ioc13lab): User="iocboot", V4.11, Channel Count=2
> > > Priority=0
> > > 164.54.160.73:1031(ioc13lab): User="iocboot", V4.11, Channel Count=2
> > > Priority=0
> > > 164.54.160.111:55807(millenia.cars.aps.anl.gov): User="webmaster", V4.8,
> > > Channel Count=291 Priority=0
> > > 164.54.160.73:1032(ioc13lab): User="iocboot", V4.11, Channel Count=2
> > > Priority=0
> > >
> > > There should only be one connection from the client, 164.54.160.73
> > > (ioc13lab).  All but the highest numbered port (1032) are stale.
> > >
> > > The connection times do not increase by 30 seconds every single time,
> > > because for some reason every once in a while one of the old port
> > > connections times out (?) and is reused.  You can see that 1026 was
> > > reused in the above test. But in general they do increase by 30 seconds
> > > on each reboot.
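
The stepping behavior described above can be put into a small model. The
sketch below is a simplification written for illustration, not code from
EPICS or vxWorks: it assumes the rebooted client always starts again at
the same first ephemeral port, that every stale port on the server costs
one full connect timeout (30 s, matching the increments in the table), and
that no stale entry expires in the meantime. The names connect_delay and
is_stale are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the server still holds a stale circuit for this client port. */
static bool is_stale(const unsigned *stale, size_t n_stale, unsigned port)
{
    for (size_t i = 0; i < n_stale; i++) {
        if (stale[i] == port)
            return true;
    }
    return false;
}

/* Estimated seconds before the client connects: starting at first_port,
 * each stale port costs one timeout, then the next port number is tried. */
double connect_delay(const unsigned *stale, size_t n_stale,
                     unsigned first_port, double tmo_sec)
{
    double delay = 0.0;
    unsigned port = first_port;

    assert(tmo_sec >= 0.0);
    while (is_stale(stale, n_stale, port)) {
        delay += tmo_sec;   /* connect attempt on a stale port times out */
        port++;             /* ascending ephemeral port assignment */
    }
    return delay;
}
```

With the seven stale ports 1025 through 1031 from the casr listing and a
30 s timeout, connect_delay(stale, 7, 1025, 30.0) gives 210 s, in line with
the time measured for reboot #11. The model does not capture the occasional
early reuse of a port such as 1026.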
> > >
> > > This situation makes it very difficult to do software development under
> > > vxWorks in the case where CA connections to other vxWorks IOCs are used.
> > > It starts to take 4 or 5 minutes for the CA connections to get
> > > established.  Rebooting the server IOC is often not an option.
> > >
> > > Here is a proposal for Jeff:
> > >
> > > Would it be possible to create a new function named something like
> > > vxCAClientStopAll.  This command would call close() on the CA
> > > connections for all vxWorks CA clients, and hence would gracefully close
> > > all of the socket connections on the server IOC.
> > >
> > > We could then make another new vxWorks command, "restart" which does
> > > vxCAClientStopAll();
> > > reboot();
> > 
> > This is very awesome!!!
> > 
> > Jeff can you implement this for the next EPICS RELEASE???
> > 
> > 
> > Ernest
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > >
> > > This would not solve the problem for hard reboots, but it would make it
> > > possible in many cases to avoid these long delays in cases where an IOC
> > > is being deliberately rebooted under software control.
> > >
> > > Cheers,
> > > Mark
> > >
> > > Jeff's reply was:
> > > Mark,
> > >
> > >
> > > > - vxWorks assigns these ephemeral port numbers in ascending numerical
> > > > order
> > >
> > > That's correct; there could be several of these stale circuits, and the
> > > system will sequentially step through ephemeral port assignments, timing
> > > out each one until an open slot is found. One solution would be for WRS
> > > to store the last ephemeral port assignment in non-volatile RAM between
> > > boots.
> > >
> > > It's also true that this problem is mostly a development issue and not
> > > an operational issue because during operations machines typically stay
> > > in a booted operational state for much longer than the stale circuit
> > > timeout interval.
> > >
> > > > - It takes a very long time for the server IOC to kill the stale
> > > > entries
> > >
> > > Yes, that's true. I do turn on the keep-alive timer, but it has a very
> > > long delay by default. This delay *can* however be changed globally for
> > > all circuits.
> > >
> > > I don't know what RTEMS does, but I strongly suspect that Windows, UNIX,
> > > and VMS systems hang up all connected circuits when the system is
> > > software rebooted.
> > >
> > > Therefore, we have a vxWorks and possibly an RTEMS specific problem.
> > >
> > > > Would it be possible to create a new function named something like
> > > > vxCAClientStopAll.  This command would call close() on the CA
> > > > connections for all vxWorks CA clients, and hence would
> > > > gracefully close all of the socket connections on the server IOC.
> > > >
> > >
> > > Of course ca_context_destroy() and ca_task_exit() are fulfilling a
> > > similar, but context-specific, role. They do however shut down only one
> > > context at a time, and the context identifier is private to the context.
> > >
> > > So perhaps we should do this:
> > >
> > > Implement an iocCore shutdown module where subsystems register for a
> > > callback when iocCore is shut down. There would be a command line
> > > function that users call to shut down an IOC gracefully. This command
> > > would call all of the callbacks in LIFO order. The sequencer and the
> > > database links would of course call ca_context_destroy() in their IOC
> > > core shutdown callbacks.
> > >
> > > Jeff
> 
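
Jeff's proposal of a shutdown module whose callbacks run in LIFO order can
be sketched in a few lines of portable C. This is only an illustration of
the idea, not an existing iocCore API: the names iocShutdownRegister,
iocShutdownRun and log_tag are hypothetical, the table is fixed-size, and
there is no locking, which a real multi-threaded IOC would need. log_tag
stands in for a subsystem's real cleanup, such as a call to
ca_context_destroy().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SHUTDOWN_HOOKS 32

typedef void (*shutdown_fn)(void *arg);

struct shutdown_hook {
    shutdown_fn fn;   /* callback to run at shutdown */
    void *arg;        /* private argument for the callback */
};

static struct shutdown_hook hooks[MAX_SHUTDOWN_HOOKS];
static size_t hook_count = 0;

/* Register a callback. Returns 0 on success, -1 if fn is NULL or the
 * table is full. */
int iocShutdownRegister(shutdown_fn fn, void *arg)
{
    if (fn == NULL || hook_count >= MAX_SHUTDOWN_HOOKS)
        return -1;
    hooks[hook_count].fn = fn;
    hooks[hook_count].arg = arg;
    hook_count++;
    return 0;
}

/* Run all callbacks, last registered first. Each entry is removed before
 * it is called, so a callback runs at most once. Returns the number run. */
size_t iocShutdownRun(void)
{
    size_t ran = 0;

    while (hook_count > 0) {
        struct shutdown_hook h = hooks[--hook_count];
        h.fn(h.arg);
        ran++;
    }
    return ran;
}

/* Demonstration callback: appends a one-character tag to a log so the
 * call order can be inspected afterwards. */
static char shutdown_log[MAX_SHUTDOWN_HOOKS + 1];
static size_t shutdown_log_len = 0;

static void log_tag(void *arg)
{
    assert(shutdown_log_len < MAX_SHUTDOWN_HOOKS);
    shutdown_log[shutdown_log_len++] = *(const char *)arg;
    shutdown_log[shutdown_log_len] = '\0';
}
```

If the database links register with tag "d" and the sequencer registers
afterwards with tag "s", iocShutdownRun() calls the sequencer's callback
first and leaves "sd" in shutdown_log. Running later-initialized
subsystems first means they are torn down before the services they depend
on.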

