Subject: RE: Very slow reconnection to medm after IOC reboot
From: "Mark Rivers" <[email protected]>
To: "Jeff Hill" <[email protected]>, "Andrew Johnson" <[email protected]>
Cc: Core-Talk <[email protected]>
Date: Tue, 3 Feb 2009 20:47:36 -0600
Jeff,
 
> Have you seen that this doesn't happen with R3.14.8? Is there any
> possibility that the network setup is what changed? I will see what happens
> here in the morning.

I definitely don't see it in R3.14.8.2, and absolutely nothing has changed.   On the same PC in the same shell session I run first an IOC application linked with 3.14.10 and then run the same IOC linked with 3.14.8.2.  Same environment variables, same everything.  With 3.14.10 I get the error and there are no CA beacons (with EPICS_CAS_BEACON_ADDR_LIST not set).  3.14.8.2 has no error, and it does send beacons.

I thought this could possibly be because I had a third-party network filter (for GigE cameras) running.  But I disabled that, and the problem persists.  The presence of that filter was what made me set EPICS_CA_AUTO_ADDR_LIST=NO and EPICS_CA_ADDR_LIST=164.54.160.255.
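
For reference, the fallback I would expect for the beacon address list (an explicitly set EPICS_CAS_BEACON_ADDR_LIST wins, otherwise the value of EPICS_CA_ADDR_LIST is used, as Andrew describes in the quoted message further down) can be sketched in Python. The function name is hypothetical, and this only illustrates the expectation; it is not the CA server code.

```python
def expected_beacon_addr_list(env):
    """Beacon destinations one would expect from the environment.

    Assumption taken from this thread: EPICS_CAS_BEACON_ADDR_LIST is used
    when set, otherwise the list falls back to EPICS_CA_ADDR_LIST.
    """
    explicit = env.get("EPICS_CAS_BEACON_ADDR_LIST", "").split()
    if explicit:
        return explicit
    return env.get("EPICS_CA_ADDR_LIST", "").split()


# Environment from the failing case: no beacon list set explicitly.
env = {
    "EPICS_CA_AUTO_ADDR_LIST": "NO",
    "EPICS_CA_ADDR_LIST": "localhost 164.54.160.255",
}
print(expected_beacon_addr_list(env))
```

Under that expectation the list would not be empty with the settings above, which is why the "beacon address list was empty" message looks like a regression.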

Mark

 



________________________________

From: Jeff Hill [mailto:[email protected]]
Sent: Tue 2/3/2009 7:46 PM
To: Mark Rivers; 'Andrew Johnson'
Cc: 'Core-Talk'
Subject: RE: Very slow reconnection to medm after IOC reboot



> Do you understand this?  Why do I have to manually set the beacon
> address list?  Under 3.14.8.2 I did not see those errors after iocInit,
> and casw shows beacons arriving without setting the environment
> variable.

The functions that discover what network interfaces are installed, and query
each of their IP broadcast addresses, have some OS specific variation in
their implementation. The implementation is in libCom/osi/os/win32. I don't
recall changing this recently, but will need to look in cvs in the morning.

Have you seen that this doesn't happen with R3.14.8? Is there any
possibility that the network setup is what changed? I will see what happens
here in the morning.

I don't recall seeing this problem during testing, but I do recall seeing an
issue where search messages were being seen (twice) probably because my
workstation has two network interfaces enabled. I might have manually
configured certain things because of this issue.

> - How is EPICS CA supposed to be working in 3.14.10.  Can you tell me
> quantitatively after an IOC is down for 5 minutes what is the maximum
> amount of time it should take clients to reconnect?  What are the
> specific time constants that govern this.

In later R3.14 releases (post R3.14.7 or so) there are revisions so the
search rate can be handled individually for each channel. Basically, there is
an array of search timers, each with twice the delay of its neighbor at index
minus one. When a timer expires, a search is sent for all of the channels
attached to that timer, and the channels that were searched for are moved to
the timer at index plus one (that's actually an oversimplification for
brevity - the number of channels per search attempt is dynamically adjusted
based on past success rates, similar to the TCP slow start algorithm).
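
A minimal Python sketch of that simplified picture follows. The function names, the base period of 1 time unit, and the choice to stay on the last timer once it is reached are all assumptions for illustration; this is not the client library code, and it leaves out the dynamic adjustment of channels per search attempt.

```python
def timer_periods(base_period, count):
    # Timer i has twice the delay of timer i - 1.
    return [base_period * 2 ** i for i in range(count)]


def after_search(index, count):
    # Channels that were just searched for move to the next slower timer.
    # Staying on the last timer once it is reached is an assumption.
    return min(index + 1, count - 1)


periods = timer_periods(1, 5)  # hypothetical base period and timer count
print(periods)                 # [1, 2, 4, 8, 16]

index = 0                      # a new channel starts on the fastest timer
for _ in range(3):             # three unsuccessful search attempts
    index = after_search(index, len(periods))
print(index, periods[index])   # 3 8
```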

There are two starting points in this array of timers. When a channel is
created it starts at the lowest index timer (the fastest rate). When a
beacon anomaly is detected all disconnected channels are moved to an
intermediate rate timer.

So, in theory, the longest it will take, if a beacon anomaly is detected, is
the number of disconnected channels divided by the number of channels per
search attempt times this intermediate search period.
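
As a rough worked example of that bound in Python: the function name is hypothetical, rounding the number of attempts up is an assumption, and the channels-per-attempt and period values below are made up for illustration rather than taken from the library. Only the 200-channel figure comes from this thread.

```python
import math


def max_reconnect_seconds(disconnected_channels, channels_per_attempt,
                          intermediate_period_s):
    # Number of search attempts needed to cover every disconnected channel
    # (rounded up, which is an assumption), times the intermediate period.
    attempts = math.ceil(disconnected_channels / channels_per_attempt)
    return attempts * intermediate_period_s


# 200 channels as reported in this thread; 10 channels per attempt and an
# 8 second period are hypothetical values chosen only for the arithmetic.
print(max_reconnect_seconds(200, 10, 8.0))  # 160.0
```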

> quantitatively after an IOC is down for 5 minutes what is the maximum
> amount of time it should take clients to reconnect? 

This of course depends on the number of disconnected channels the client
has. In my experience (I have seen this in the LANSCE control room) it is
not as instantaneous as in the past, but it isn't taking a long time either
(this requires detection of a beacon anomaly). I can see the dm screens
cascading over from white boxes to connected widgets. In the past they
switched in an eye blink.

> - How would we actually like it to behave in the future.

Connect times should be as fast, and responsive to the appearance of a new
IOC, as possible without impacting overall network stability, but I suspect
that you are not surprised to hear that answer. I am actually quite
malleable to suggestions on how we can improve performance, but system
stability must be guaranteed as our first priority.

Jeff

> -----Original Message-----
> From: Mark Rivers [mailto:[email protected]]
> Sent: Tuesday, February 03, 2009 5:07 PM
> To: Jeff Hill; Andrew Johnson
> Cc: Core-Talk
> Subject: RE: Very slow reconnection to medm after IOC reboot
>
> Hi Jeff,
>
> I think we need to break this down into 3 questions:
>
> - How is EPICS CA supposed to be working in 3.14.10.  Can you tell me
> quantitatively after an IOC is down for 5 minutes what is the maximum
> amount of time it should take clients to reconnect?  What are the
> specific time constants that govern this.
>
> - Is EPICS CA actually behaving this way on all platforms?
>
> - How would we actually like it to behave in the future.
>
> I have found at least one thing that looks like a bug to me, and one
> that may have been introduced between 3.14.8.2 and 3.14.10.
>
> On a win32-x86 machine I have the following EPICS environment variables
> set:
>
> $ printenv | grep EPICS
> EPICS_CA_AUTO_ADDR_LIST=NO
> EPICS_CA_ADDR_LIST=localhost 164.54.160.255
> EPICS_CA_MAX_ARRAY_BYTES=10000000
>
> I then start my IOC, and I get the following message after iocInit:
>
> The CA server's beacon address list was empty after initialization?
>
> While this IOC was booting I was running casw on a Linux machine.  It
> did not see any CA beacons from the Windows IOC.
>
> I then manually set EPICS_CAS_BEACON_ADDR_LIST to my broadcast address
> 164.54.160.255
>
> $ printenv | grep EPICS
> EPICS_CA_AUTO_ADDR_LIST=NO
> EPICS_CAS_BEACON_ADDR_LIST=164.54.160.255
> EPICS_CA_ADDR_LIST=localhost 164.54.160.255
> EPICS_CA_MAX_ARRAY_BYTES=10000000
>
> and start the IOC again.  This time I do not get the error message after
> iocInit, and casw on the Linux host sees CA beacons from the Windows
> IOC.
>
> Do you understand this?  Why do I have to manually set the beacon
> address list?  Under 3.14.8.2 I did not see those errors after iocInit,
> and casw shows beacons arriving without setting the environment
> variable.
>
> I just did a couple of quick tests again with and without
> EPICS_CAS_BEACON_ADDR_LIST set.
>
> I found the following.  I turned off the IOC for 340 seconds each time.
> Windows IOC built with 3.14.10, Linux medm client built with 3.14.8.2
>
> With it set, so beacons being sent:  reconnect in 10 seconds
> Without it set, so no beacons being sent: reconnect in >3 minutes.
>
> So with this 3.14.8.2 client the response is OK as long as beacons are
> being sent.  I will test again tomorrow with 3.14.10 clients.
>
> Thanks,
> Mark
>
>
> -----Original Message-----
> From: Jeff Hill [mailto:[email protected]]
> Sent: Friday, January 23, 2009 4:40 PM
> To: Mark Rivers; 'Andrew Johnson'
> Cc: 'epics'
> Subject: RE: Very slow reconnection to medm after IOC reboot
>
>
> > But in this case I am doing development on a
> > single PC, running a local client connected to channels on that same PC,
> > but I have to wait 4 minutes for the channels to reconnect just because
> > I rebuilt the application.  This is not acceptable.
>
> Perhaps I am a weasel to look only at this usage case in isolation, but it
> seems that this might be what happened.
>
> O stopping the IOC
> O rebuilding application
> O maybe other stuff that takes a long time
> O starting the IOC
>
> But perhaps this would work just as well (with very fast reconnect times)
>
> O rebuild the application
> O reboot the IOC
>
> Nevertheless, if by chance one ends up with the first scenario, restarting
> the MEDM screen isn't, from my perspective, exactly heavy lifting -
> considering that we are presumably waiting at the edge of the seat for the
> channels to reconnect. Synchronization might be the primary aggravation,
> but perhaps it will even be successful the first try if one waits
> approximately long enough for the IOC to finish rebooting before restarting
> the medm screen.
>
> > It seems like the client library should know what IOC previously hosted
> > the PVs that are now disconnected.  If that same IOC now comes back up,
> > it seems like a reasonable assumption that the same IOC is going to be
> > hosting those channels. So doing a high-rate search for those channels
> > has a very high likelihood of success?
> >
>
> The code doesn't currently do that. It seems that this might be a reasonable
> idea to add to the list for R3.15. The primary negative would be storage
> overhead for keeping the network address of the IOC last connected to with
> every channel. It's preferable to keep the network address in circuit
> specific snap-in since network addresses are circuit specific. Perhaps we
> could create a special type of pseudo circuit (satisfying the polymorphic
> circuit interface) for keeping track of last-connected-to state, and store a
> pointer to such an instance with the channel when it disconnects. Another
> negative of course is complexity, and maybe also CPU overhead for running
> the list of channels and checking each one of them for a match with the new
> IOC's address.
>
> > Is there an environment variable I can set to change the search rate or
> > algorithm?  I agree that in general we want to avoid anything that can
> > generate network storms.
>
> You can set EPICS_CA_MAX_SEARCH_PERIOD, but the code imposes, currently, a
> lower limit of 60 seconds. So this probably isn't what you want, but it is
> maybe what your EPICS system manager wants. Otherwise, sorry, but there are
> currently no other user accessible parameters. We try to minimize the
> complexity, and pitfalls, in the configuration interface.
>
> I am busy with other projects at the moment, but perhaps you might consider
> creating a mantis entry (we want to minimize changes to R3.14 so this would
> be against R3.15) describing the problem (and including a copy of this mail
> exchange), and perhaps some improvements can be made (within the constraints
> of reliable operation of course).
>
> Jeff
>
> > -----Original Message-----
> > From: Mark Rivers [mailto:[email protected]]
> > Sent: Friday, January 23, 2009 12:52 PM
> > To: Jeff Hill; Andrew Johnson
> > Cc: epics
> > Subject: RE: Very slow reconnection to medm after IOC reboot
> >
> > Jeff,
> >
> > As I said in my first message yesterday there were 4 medm screens with a
> > total of about 200 channels.
> >
> > > If, after 15 minutes, the IOC comes up and the CA client library manages to
> > > find some of its disconnected channels it has no way of knowing that the
> > > balance of its remaining disconnected channels are on this new IOC.
> >
> > It seems like the client library should know what IOC previously hosted
> > the PVs that are now disconnected.  If that same IOC now comes back up,
> > it seems like a reasonable assumption that the same IOC is going to be
> > hosting those channels. So doing a high-rate search for those channels
> > has a very high likelihood of success?
> >
> > Is there an environment variable I can set to change the search rate or
> > algorithm?  I agree that in general we want to avoid anything that can
> > generate network storms.  But in this case I am doing development on a
> > single PC, running a local client connected to channels on that same PC,
> > but I have to wait 4 minutes for the channels to reconnect just because
> > I rebuilt the application.  This is not acceptable.  I don't want to
> > have to close and reopen all the windows in medm just so I can see what
> > is going on again.
> >
> >
> > Mark
> >
> > -----Original Message-----
> > From: Jeff Hill [mailto:[email protected]]
> > Sent: Friday, January 23, 2009 12:27 PM
> > To: Mark Rivers; 'Andrew Johnson'
> > Cc: 'epics'
> > Subject: RE: Very slow reconnection to medm after IOC reboot
> >
> >
> > Mark,
> >
> > How many channels were involved in this disconnect / reconnect scenario?
> >
> > > Stop IOC, wait 10 seconds, restart IOC.
> > > Time for medm to reconnect all channels: less than 10 seconds.
> >
> > > Stop IOC, wait 15 minutes, restart IOC.
> > > Time for medm to reconnect all channels: 4 minutes
> >
> > When the CA library disconnects from the IOC it starts looking for the
> > consequently disconnected channels with a fast search interval, but if the
> > library isn't immediately successful it exponentially backs off to a much
> > lower rate. We want that behavior in order to reduce the load on the
> > network.
> >
> > If, after 15 minutes, the IOC comes up and the CA client library manages to
> > find some of its disconnected channels it has no way of knowing that the
> > balance of its remaining disconnected channels are on this new IOC. Consider
> > a worst case scenario; what if this particular CA client had a very large
> > number of disconnected channels. From my perspective, it wouldn't be a good
> > idea to search at the highest rate for these channels just because a new IOC
> > was seen. The client library does however adopt a compromise design where it
> > searches at an intermediate rate if it sees a beacon anomaly, and depending
> > on how many channels are disconnected this intermediate rate might result in
> > a 4 minute reconnect time. Furthermore, I think I recall correctly that the
> > event of connecting to a new IOC will also cause this same intermediate
> > search rate boost that is granted when a beacon anomaly event is detected.
> >
> > Part of the rationale behind the current design is that if it was necessary
> > to wait a long time for the IOC to come up then waiting as long as the
> > number of channels divided by the number of channels per search attempt
> > times this intermediate delay (maybe as much as 4 minutes in this situation)
> > for the channels to reconnect isn't going to shatter our world. It is
> > particularly important to avoid designs which cause unstable traffic
> > insertion feedback when there is network congestion.
> >
> > Jeff
> >
> >
> > > -----Original Message-----
> > > From: Mark Rivers [mailto:[email protected]]
> > > Sent: Thursday, January 22, 2009 3:28 PM
> > > To: Andrew Johnson
> > > Cc: Jeff Hill; epics
> > > Subject: RE: Very slow reconnection to medm after IOC reboot
> > >
> > > Hi Andrew,
> > >
> > > Thanks for the response.
> > >
> > > Here is the output of all env settings that start with the string EPICS,
> > > before I manually set EPICS_CAS_BEACON_ADDR_LIST:
> > >
> > > $ printenv | grep EPICS
> > > EPICS_CA_AUTO_ADDR_LIST=NO
> > > EPICS_HOST_ARCH=win32-x86
> > > EPICS_CA_ADDR_LIST=localhost 164.54.160.255
> > > EPICS_DISPLAY_PATH=C:\EPICS\adls\
> > > EPICS_CA_MAX_ARRAY_BYTES=10000000
> > >
> > > If I run with just those env settings I get the following message when I
> > > boot the IOC:
> > >
> > > The CA server's beacon address list was empty after initialization?
> > >
> > > If I add the env you suggested so they now look like this:
> > >
> > > $ printenv | grep EPICS
> > > EPICS_CA_AUTO_ADDR_LIST=NO
> > > EPICS_CAS_BEACON_ADDR_LIST=164.54.160.255
> > > EPICS_HOST_ARCH=win32-x86
> > > EPICS_CA_ADDR_LIST=localhost 164.54.160.255
> > > EPICS_DISPLAY_PATH=C:\EPICS\adls\
> > > EPICS_CA_MAX_ARRAY_BYTES=10000000
> > >
> > > Then when I start the IOC I don't get the message about the empty beacon
> > > address list.
> > >
> > > This is a new problem with 3.14.10.  I run identical IOCs built with
> > > 3.14.8.2 and I don't get that error message with exactly the same
> > > environment settings.
> > >
> > > However, even though the error message is gone the performance is only
> > > marginally improved, if at all.
> > >
> > > Here is the result with EPICS_CAS_BEACON_ADDR_LIST set:
> > >
> > > ******************************
> > > Windows IOC (3.14.10), Windows medm client (built with 3.14.9) on same
> > > PC.
> > >
> > > Stop IOC, wait 10 seconds, restart IOC.
> > > Time for medm to reconnect all channels: less than 10 seconds.
> > >
> > > Stop IOC, wait 15 minutes, restart IOC.
> > > Time for medm to reconnect first channel: 10 seconds
> > > Time for medm to reconnect all channels: 4 minutes
> > >
> > >
> > > The differences from my previous test (when I had the CA server beacon
> > > address error):
> > >
> > > The time for the very first channel to reconnect was reduced from 60
> > > seconds to 10 seconds.  I am not sure how reproducible this is.
> > >
> > > The time for all channels to connect was reduced from 4 minutes and 25
> > > seconds to 4 minutes.  Again, this may not be a statistically
> > > significant improvement.
> > >
> > > Thus, even with EPICS_CAS_BEACON_ADDR_LIST set the performance is really
> > > bad.  And I am not the only one seeing it, Lewis Muir reported similar
> > > long reconnection times earlier this afternoon.
> > >
> > > Mark
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Andrew Johnson [mailto:[email protected]]
> > > Sent: Thursday, January 22, 2009 3:47 PM
> > > To: Mark Rivers
> > > Cc: Jeff Hill; epics
> > > Subject: Re: Very slow reconnection to medm after IOC reboot
> > >
> > > On Thursday 22 January 2009 15:32:47 Mark Rivers wrote:
> > > > I noticed that when I start win32-x86 IOCs built with 3.14.10 I get the
> > > > following message just after iocInit:
> > > >
> > > > *******
> > > > The CA server's beacon address list was empty after initialization?
> > > > *******
> > > >
> > > > I do not get this message when running the identical IOC built with
> > > > 3.14.8.2.
> > > >
> > > > Is this significant?  Why am I getting this message?
> > >
> > > Highly, it means that the IOC is not sending out any CA beacons at all, which
> > > explains why the clients are not reconnecting to it immediately.  Try setting
> > > EPICS_CAS_BEACON_ADDR_LIST on the IOC to your 164.54.160.255 broadcast
> > > address, although it should default to the value of EPICS_CA_ADDR_LIST if not
> > > set so I don't understand why it's ignoring that (unless you have other env
> > > settings for the IOC that you haven't told us about).
> > >
> > > - Andrew
> > > --
> > > The best FOSS code is written to be read by other humans -- Harold Welte
>






