Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: channel access
From: "Ernest L. Williams Jr." <ernesto@ornl.gov>
To: Mark Rivers <rivers@cars.uchicago.edu>
Cc: Jeff Hill <johill@lanl.gov>, Dirk Zimoch <dirk.zimoch@psi.ch>, EPICS tech-talk <tech-talk@aps.anl.gov>
Date: Wed, 11 Jan 2006 15:55:51 -0500
On Wed, 2006-01-11 at 13:41 -0600, Mark Rivers wrote:
> Folks,
> 
> > > we have a problem with CA since we upgraded our MV2300 IOCs
> > > to Tornado2.
> > >
> > > After a reboot, often channel access links don't connect
> > > immediately to the server. They connect a few minutes later
> > > when this message is printed:
> > >
> > > CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
> > >   22="S_errno_EINVAL"
> 
> This is not just a problem with IOC to IOC sockets, but with any vxWorks
> to vxWorks sockets.
> 
> We recently purchased a Newport XPS motor controller.  It communicates
> over Ethernet, and uses vxWorks as its operating system.  We control
> the XPS from a vxWorks IOC. When we reboot our vxWorks IOC the XPS will
> not communicate again after the IOC reboots, because it does not know
> the IOC rebooted, and the same ports are being used.  It is thus
> necessary to also reboot the XPS when rebooting the IOC.  But rebooting
> the XPS requires re-homing all of the motors, which is sometimes almost
> impossible because of installed equipment!  This is a real pain.
> 
> This problem goes away if we control the XPS with a non-vxWorks IOC,
> such as Linux, probably because Linux closes the sockets when killing
> the IOC.
> 
> On a related topic, I am appending an exchange I had with Jeff Hill and
> others on this topic in October 2003, which was not posted to tech-talk.
> 
> Cheers,
> Mark Rivers
> 
> 
> 
> Folks,
> 
> I'd like to revisit the problem of CA disconnects when rebooting a
> vxWorks client IOC that has CA links to a vxWorks server IOC (that does
> not reboot).
> 
> The EPICS 3.14.3 Release Notes say:
> 
> "Recent versions of vxWorks appear to experience a connect failure if
> the vxWorks IP kernel reassigns the same ephemeral TCP port number as
> was assigned during a previous lifetime. The IP kernel on the vxWorks
> system hosting the CA server might have a stale entry for this ephemeral
> port that has not yet timed out which prevents the client from
> connecting with the ephemeral port assigned by the IP kernel.
> Eventually, after EPICS_CA_CONN_TMO seconds, the TCP connect sequence is
> aborted and the client library closes the socket, opens a new socket,
> receives a new ephemeral port assignment, and successfully connects."
> 
> The last sentence is only partially correct.  The problem is that:
> - vxWorks assigns these ephemeral port numbers in ascending numerical
> order
> - It takes a very long time for the server IOC to kill the stale entries
> 
> Thus, if I reboot the client many times in a row, it does not just
> result in one disconnect before the successful connection, but many.  I
> just did a test where I rebooted a vxWorks client IOC 11 times, as one
> might do when debugging IOC software.  This IOC is running Marty's
> example sequence program, with 2 PVs connecting to a remote vxWorks
> server IOC. 
> 
> Here is the amount of time elapsed before the sequence program PVs
> connected:
> Reboot #  Time (sec)
> 1           0.1
> 2           5.7
> 3            30
> 4            60
> 5            90
> 6           120
> 7            30
> 8           150
> 9           150
> 10          180
> 11          210
> 
> Here is the output of "casr" on the vxWorks server IOC that never
> rebooted after client reboot #11.
> Channel Access Server V4.11
> 164.54.160.74:1067(ioc13bma): User="iocboot", V4.11, Channel Count=1
> Priority=80
> 164.54.160.100:4453(miata): User="dac_user", V4.8, Channel Count=461
> Priority=0
> 164.54.160.75:1027(ioc13ida): User="iocboot", V4.11, Channel Count=1
> Priority=80
> 164.54.160.101:3379(lebaron): User="dac_user", V4.8, Channel Count=18
> Priority=0
> 164.54.160.73:1025(ioc13lab): User="iocboot", V4.11, Channel Count=2
> Priority=0
> 164.54.160.73:1027(ioc13lab): User="iocboot", V4.11, Channel Count=2
> Priority=0
> 164.54.160.73:1028(ioc13lab): User="iocboot", V4.11, Channel Count=2
> Priority=0
> 164.54.160.73:1029(ioc13lab): User="iocboot", V4.11, Channel Count=2
> Priority=0
> 164.54.160.73:1026(ioc13lab): User="iocboot", V4.11, Channel Count=2
> Priority=0
> 164.54.160.73:1030(ioc13lab): User="iocboot", V4.11, Channel Count=2
> Priority=0
> 164.54.160.73:1031(ioc13lab): User="iocboot", V4.11, Channel Count=2
> Priority=0
> 164.54.160.111:55807(millenia.cars.aps.anl.gov): User="webmaster", V4.8,
> Channel Count=291 Priority=0
> 164.54.160.73:1032(ioc13lab): User="iocboot", V4.11, Channel Count=2
> Priority=0
> 
> There should only be one connection from the client, 164.54.160.73
> (ioc13lab).  All but the highest numbered port (1032) are stale.  
> 
> The connection times do not increase by 30 seconds every single time,
> because for some reason every once in a while one of the old port
> connections times out (?) and is reused.  You can see that 1026 was
> reused in the above test. But in general they do increase by 30 seconds
> on each reboot.  
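The stepping-through-stale-ports behavior Mark describes can be modeled with a short C sketch. This is not EPICS code: try_connect() is a toy stand-in for a connect attempt against the server, with ports below 1032 treated as "stale" to match the casr output above, and the comment marks where the roughly 30 s EPICS_CA_CONN_TMO delay is spent on each failed attempt.

```c
#include <assert.h>

/* Toy stand-in for a connect attempt that may hit a stale circuit on
 * the server: in this model, ports below 1032 are "stale" (matching
 * the casr listing above). Returns 0 on success, -1 on failure. */
static int try_connect(int ephemeral_port) {
    return ephemeral_port >= 1032 ? 0 : -1;
}

/* Sketch of the retry sequence: each failed attempt closes the socket,
 * and the fresh socket gets the *next* ephemeral port (vxWorks assigns
 * them in ascending order), so the client steps through the stale
 * entries one by one. */
static int connect_with_retry(int first_port, int max_tries, int *attempts) {
    int port = first_port;
    for (*attempts = 0; *attempts < max_tries; ++(*attempts), ++port) {
        if (try_connect(port) == 0)
            return port;   /* connected */
        /* Roughly EPICS_CA_CONN_TMO seconds elapse here before the
         * socket is closed and reopened with the next port, which is
         * where the ~30 s per stale entry goes. */
    }
    return -1;
}
```

Starting at port 1025 with seven stale entries ahead, the model takes seven failed attempts (several minutes of real time) before connecting on 1032.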
> 
> This situation makes it very difficult to do software development under
> vxWorks in the case where CA connections to other vxWorks IOCs are used.
> It starts to take 4 or 5 minutes for the CA connections to get
> established.  Rebooting the server IOC is often not an option.
> 
> Here is a proposal for Jeff:
> 
> Would it be possible to create a new function named something like
> vxCAClientStopAll.  This command would call close() on the CA
> connections for all vxWorks CA clients, and hence would gracefully close
> all of the socket connections on the server IOC.
> 
> We could then make another new vxWorks command, "restart" which does
> vxCAClientStopAll();
> reboot();
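A rough sketch of what the proposed pair of commands could look like. Neither vxCAClientStopAll() nor restart() exists in EPICS base; the context table here is a toy stand-in for the real list of CA client contexts, and the vxWorks reboot() call is left commented out so the sketch stays runnable.

```c
#include <assert.h>

#define MAX_CONTEXTS 8

/* Toy stand-in for a CA client context holding one TCP circuit. */
struct ca_context { int socket_open; };

static struct ca_context contexts[MAX_CONTEXTS];
static int n_contexts = 0;

/* Proposed vxCAClientStopAll(): close the circuits of every CA client
 * on the IOC, so the server sees a clean FIN instead of being left
 * with a stale half-open connection after the reboot. */
static void vxCAClientStopAll(void) {
    for (int i = 0; i < n_contexts; ++i)
        contexts[i].socket_open = 0;   /* stands in for close(fd) */
}

/* Proposed "restart" command: graceful close, then reboot. */
static void restart(void) {
    vxCAClientStopAll();
    /* reboot(0);  -- the real vxWorks call would go here */
}
```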

This is very awesome!!!

Jeff can you implement this for the next EPICS RELEASE???


Ernest
> 
> This would not solve the problem for hard reboots, but it would make it
> possible in many cases to avoid these long delays in cases where an IOC
> is being deliberately rebooted under software control.
> 
> Cheers,
> Mark
> 
> Jeff's reply was:
> Mark,
> 
> 
> > - vxWorks assigns these ephemeral port numbers in ascending numerical
> > order
> 
> That's correct: there could be several of these stale circuits, and
> the system will sequentially step through ephemeral port assignments,
> timing out each one until an open slot is found. One solution would be
> for WRS to store the last ephemeral port assignment in non-volatile
> RAM between boots.
> 
> It's also true that this problem is mostly a development issue and not
> an operational issue, because during operations machines typically stay
> in a booted operational state for much longer than the stale circuit
> timeout interval.
> 
> > - It takes a very long time for the server IOC to kill the stale 
> > entries
> 
> Yes, that's true. I do turn on the keep-alive timer, but it has a very
> long delay by default. This delay *can*, however, be changed globally
> for all circuits.
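For reference, this is roughly how keep-alive is enabled and its idle delay shortened on a modern POSIX/Linux stack. TCP_KEEPIDLE is a Linux-specific per-socket option; on the vxWorks of this era the keep-alive interval was a global stack tunable rather than per-socket, so treat this as an illustration, not EPICS code.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Turn on TCP keep-alive for one socket and shorten the idle time
 * before the first probe, so a dead peer is detected after
 * idle_seconds instead of the default (commonly two hours). */
static int enable_fast_keepalive(int fd, int idle_seconds) {
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_seconds, sizeof idle_seconds) < 0)
        return -1;
    return 0;
}
```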
> 
> I don't know what RTEMS does, but I strongly suspect that Windows,
> UNIX, and VMS systems hang up all connected circuits when the system
> is rebooted in software.
> 
> Therefore, we have a vxWorks and possibly an RTEMS specific problem. 
> 
> > Would it be possible to create a new function named something like
> > vxCAClientStopAll.  This command would call close() on the CA
> > connections for all vxWorks CA clients, and hence would 
> > gracefully close all of the socket connections on the server IOC.
> >
> 
> Of course ca_context_destroy() and ca_task_exit() fulfill a similar,
> but context-specific, role. They do, however, shut down only one
> context at a time, and the context identifier is private to the
> context.
> 
> So perhaps we should do this:
> 
> Implement an iocCore shutdown module where subsystems register for a
> callback when iocCore is shut down. There would be a command-line
> function that users call to shut down an IOC gracefully. This function
> would call all of the callbacks in LIFO order. The sequencer and the
> database links would of course call ca_context_destroy() in their
> iocCore shutdown callbacks.
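A minimal sketch of such a registry, run in LIFO order like atexit() handlers. The names iocShutdownRegister and iocShutdown are hypothetical, not an actual EPICS base API, and recordHook is a demo callback that just records the order in which it is invoked.

```c
#include <assert.h>

#define MAX_HOOKS 16

typedef void (*shutdownHook)(void *arg);

static struct { shutdownHook fn; void *arg; } hooks[MAX_HOOKS];
static int nHooks = 0;

/* Subsystems (sequencer, CA database links, ...) register a callback
 * during initialization. */
static int iocShutdownRegister(shutdownHook fn, void *arg) {
    if (nHooks >= MAX_HOOKS)
        return -1;
    hooks[nHooks].fn = fn;
    hooks[nHooks].arg = arg;
    ++nHooks;
    return 0;
}

/* The command-line shutdown function: run the callbacks in LIFO order,
 * so the last subsystem initialized is the first torn down. */
static void iocShutdown(void) {
    while (nHooks > 0) {
        --nHooks;
        hooks[nHooks].fn(hooks[nHooks].arg);
    }
}

/* Demo callback: records the order in which it is invoked. */
static int callOrder[MAX_HOOKS], callCount = 0;
static void recordHook(void *arg) { callOrder[callCount++] = *(int *)arg; }
```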
> 
> Jeff


Replies:
orderly shutdown Jeff Hill
References:
RE: channel access Mark Rivers
