EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: channel access
From: "Jeff Hill" <[email protected]>
To: "'Dirk Zimoch'" <[email protected]>
Cc: <[email protected]>
Date: Wed, 11 Jan 2006 11:52:53 -0700
> we have a problem with CA since we upgraded our MV2300 IOCs to Tornado2.
> 
> After a reboot, often channel access links don't connect immediately to
> the server. They connect a few minutes later when this message is printed:
> 
> CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
>   22="S_errno_EINVAL"
>
> My guess is the following:
> 
> The server does not know that the client rebooted.
> The client uses the same settings (e.g. dynamically assigned port
> number, etc) to connect to the server as last time. The packages now
> look exactly the same as during the last reboot.
> 
> Since TCP involves resending lost packages, the server thinks the
> packages are duplicates and drops them.
> 
> The server never replies and connect() fails.

Yes, this is one of the subtle compatibility issues with TCP/IP. For the guy
that is implementing the server part of an IP kernel it's a tradeoff. You
have to be robust when hackers attempt a denial of service attack against
the server, but you also might like to detect stale circuits and disconnect
them when a client who is reusing the same ephemeral port keeps transmitting
with the wrong TCP state or the wrong TCP sequence number. It's hard to have
both.

> I had the idea to install a rebootHook that closes all CAC sockets, just
> to see if that helps.
> 
> Unfortunately, I don't know how to get a list of all CAC sockets. I
> tried to iterate ca_static->ca_iiuList, but that didn't work. ca_static
> was either NULL (when I don't call ca_task_initialize) or the ca_iiuList
> is empty (when I do call ca_task_initialize).

When vxWorks does a soft reboot it does not initiate a TCP shutdown
procedure for any sockets that might be open.

Last time I investigated this idea of implementing a rebootHook in CA I
concluded that vxWorks runs the reboot hooks in the wrong order (FIFO not
LIFO). This means that vxWorks has already placed the network stack in an
inaccessible state when it runs the CA installed reboot hook, and therefore
there is nothing CA can do about closing sockets in a rebootHook. I asked
WRS about this and they admitted that this was a mistake but said that they
would not fix it. That interaction with WRS occurred about 15 years ago.

Ok, it gets worse. In principle we don't care about rebootHook architectural
problems. We can have our own shell command that shuts down an EPICS IOC and
initiates a soft reboot.

A fundamental unresolved problem is that if CA tries to automatically clean
up at exit it can yank the carpet out from under other EPICS threads that do
not have an orderly shutdown and continue to use CA after the exit handlers
run. For example, with DB CA links that attach directly to in-memory DB
records we saw problems where a subscription update from the database would
continue to use a CA entity that had been cleaned up by CA's exit handler. 

Fundamentally, you have to decide who is responsible for initiating cleanup.
CA keeps a list of channels and it could clean them up, but that could cause
problems if the application tries to clean them up also. Everyone arrives
eventually at the same conclusion. It is best to stick strictly to this
rule: whoever created it is responsible for deleting it.

Following that rule, one is led to a conclusion that the application needs
to first delete its CA channels and then delete the CA client context that
it created. 

With the DB CA links that responsibility lies with the database. The
database does not currently have orderly shutdown procedures, but presumably
this is high on the list for future versions of EPICS.

Mantis 126 was originally tracking this issue against R3.14. It has been
marked as "wont fix". This was the last comment on mantis 126: "When
ca_context_destroy was called it caused a crash when db_post_event was
called for a channel access link to a local channel. For 3.14 there are
currently no plans to fix this since it probably means a complete cleanup of
all resources used by database records."

I created mantis 234 (against R3.15).

Jeff 


> -----Original Message-----
> From: Dirk Zimoch [mailto:[email protected]]
> Sent: Wednesday, January 11, 2006 9:12 AM
> To: Jeff Hill
> Subject: channel access
> 
> Hi Jeff,
> 
> we have a problem with CA since we upgraded our MV2300 IOCs to Tornado2.
> 
> After a reboot, often channel access links don't connect immediately to
> the server. They connect a few minutes later when this message is printed:
> 
> CAC: Unable to connect port 5064 on "172.19.157.20:5064" because
>   22="S_errno_EINVAL"
> 
> 
> My guess is the following:
> 
> The server does not know that the client rebooted.
> The client uses the same settings (e.g. dynamically assigned port
> number, etc) to connect to the server as last time. The packages now
> look exactly the same as during the last reboot.
> 
> Since TCP involves resending lost packages, the server thinks the
> packages are duplicates and drops them.
> 
> The server never replies and connect() fails.
> 
> 
> I had the idea to install a rebootHook that closes all CAC sockets, just
> to see if that helps.
> 
> Unfortunately, I don't know how to get a list of all CAC sockets. I
> tried to iterate ca_static->ca_iiuList, but that didn't work. ca_static
> was either NULL (when I don't call ca_task_initialize) or the ca_iiuList
> is empty (when I do call ca_task_initialize).
> 
> How does this work?
> What can I do?
> 
> Yours
> Dirk
> 
> --
> Dr. Dirk Zimoch
> Swiss Light Source
> Paul Scherrer Institut
> Computing and Controls
> phone +41 56 310 5182
> fax   +41 56 310 4413


