EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: Delays in receipt of CA monitors
From: [email protected] (Jeff Hill)
To: "Garrett D. Rinehart" <[email protected]>
Cc: "EPICS-tech-talk" <[email protected]>
Date: Wed, 8 Dec 1999 13:43:11 -0700
Garrett,

We have been looking at Ethernet errors such as collisions with
"ifShow" on vxWorks, and with "netstat -i" on UNIX. You can also
look for IP level errors with "ipstatShow" on vxWorks, and
"netstat -s" on UNIX. Are you using a switched Ethernet?

Jeff

> We suffer from this very same affliction. We drive a panel full of
> digital readouts (the old LED kind) with data from a remote IOC at
> 30Hz. The readouts freeze occasionally for as little time as a long
> blink to almost two minutes. MEDM screens with the same data also
> freeze, though not necessarily at the same times.
>
> As in Peregrine's case, it's mostly one IOC, though we see it
> occasionally on others. I can find no correlation to CPU load, file
> descriptor usage, network load, or memory usage. It does not seem to
> be related to "time since reboot" as it fluctuates constantly. It is
> sometimes seconds between stops and sometimes days... well, at least
> so infrequent no one notices 'em.
>
> I have not checked timestamps, but I can see the 30Hz pulse counts
> and toroid summations on the IOC continuing normally. An MEDM screen
> attached to those channels will also temporarily freeze, but all the
> numbers jump to where they ought to be when the pause is over.
>
> Another interesting observation is that one number on the DRO panel
> always updates, even when the rest of 'em freeze. It's the beam
> current number, which is calculated in the local IOC (the one that
> drives the panel) off a toroid value from the "stalled" IOC. The copy
> of that same toroid value that is displayed directly on the DROs is
> stalled, but the one linked between the databases is okay.
>
> We've had this through at least a couple of versions of EPICS
> (currently 3.13.0b11) and two CPUs (mv167 and now mv172). I don't
> think the CPU is the problem though; I think it's in the network (or
> the software that drives it). The problem has been especially bad in
> the last several days and I finally have access to a sniffer and the
> guy to run it, so I'm hoping to learn more in the next day or two. If
> anyone has an idea what to look for, I'm all ears.
>
> Garrett Rinehart
> Intense Pulsed Neutron Source
> Argonne National Laboratory
> 9700 S. Cass Ave
> Argonne, IL  60439
> (630)252-6561
>
> > To: [email protected]
> > Subject: Delays in receipt of CA monitors
> >
> > Here at the Low Energy Demonstration Accelerator we are seeing
> > intermittent significant delays in the receipt of CA monitors. As these
> > delays are large enough to trigger a hardware protection system that
> > places the accelerator in a safe mode (e.g. turn the beam off) we have
> > spent considerable time over the past months to understand this.
> >
> > We don't yet understand the root cause and are asking if anyone has seen
> > a similar effect. We realize that TCP is non-realtime but are concerned
> > since these delays can be as large as 3 to 10 seconds.
> >
> > Here's the context:
> >
> > In order to provide prompt detection of a failure in the control system
> > IOCs or network at timescales shorter than the default CA timeout (30
> > seconds), each IOC provides a heartbeat. This heartbeat is implemented
> > as a calculation record scanned at .5 second that toggles between 0 and
> > 1. On our run permit IOC we have a genSub record scanned at 1 second
> > that has an input CP link to that heartbeat. Thus the genSub record will
> > process both at 1 second intervals and when a monitor comes in from the
> > heartbeat. If a monitor is not received within a specific interval (1.5
> > seconds) we assume that there is a problem and shut down the
> > accelerator.
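
The timeout rule described above can be approximated offline. The following is only a minimal sketch, not the genSub logic running on the run permit IOC: it assumes the monitor arrival times have already been logged as seconds (for example from camonitor output), and the names `find_heartbeat_gaps` and `TIMEOUT_S` are hypothetical.

```python
from typing import Iterable, List, Tuple

# 1.5 s is the interval quoted in the post; everything else here is illustrative.
TIMEOUT_S = 1.5


def find_heartbeat_gaps(
    arrival_times: Iterable[float], timeout: float = TIMEOUT_S
) -> List[Tuple[float, float]]:
    """Return (previous_arrival, gap_length) for each gap longer than timeout.

    arrival_times are the client-side times, in seconds, at which
    heartbeat monitors were received, in the order they arrived.
    """
    gaps = []
    previous = None
    for t in arrival_times:
        if previous is not None and t - previous > timeout:
            gaps.append((previous, t - previous))
        previous = t
    return gaps


# Three monitors arrive on time, then several arrive together after a stall.
arrivals = [0.0, 0.5, 1.0, 4.0, 4.0, 4.0, 4.5]
print(find_heartbeat_gaps(arrivals))
```

Each reported gap corresponds to one event that would have tripped the 1.5 second check.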
> >
> > Here's what we see:
> >
> > For a small subset of IOCs, and mostly from one, we see several times a
> > day, delays of several seconds on the receipt of a monitor. The effect
> > is one of the monitors being buffered, but not lost. We can also observe
> > this behavior by running camonitor on an IOC heartbeat channel from a
> > workstation.
> >
> > The timestamps generated by record processing of the heartbeats are
> > always seen (by dbCaGetTimeStamp) at .5 second intervals.
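
One way to separate "buffered" from "lost" is to compare each update's record timestamp with the time it reached the client. This is a rough sketch under stated assumptions: the IOC and client clocks are taken to be synchronized, the samples are (record_timestamp, arrival_time) pairs in seconds, and the function name and the `tolerance` and `max_latency` thresholds are hypothetical choices, not values from the post.

```python
from typing import Dict, List, Sequence, Tuple


def classify_delivery(
    samples: Sequence[Tuple[float, float]],
    period: float = 0.5,       # heartbeat scan period from the post
    tolerance: float = 0.1,    # assumed slack on the record timestamps
    max_latency: float = 1.0,  # assumed threshold for a "late" delivery
) -> Dict[str, object]:
    """Report whether updates were lost and which ones were delivered late."""
    stamps = [stamp for stamp, _ in samples]
    # A lost update shows up as a hole in the record timestamps.
    updates_lost = any(
        later - earlier > period + tolerance
        for earlier, later in zip(stamps, stamps[1:])
    )
    # A buffered update has a regular timestamp but a late arrival.
    delayed: List[float] = [
        stamp for stamp, arrival in samples if arrival - stamp > max_latency
    ]
    return {"updates_lost": updates_lost, "delayed_stamps": delayed}


# Timestamps stay at 0.5 s spacing while several updates arrive together at t=3.0.
samples = [(0.0, 0.0), (0.5, 0.5), (1.0, 3.0), (1.5, 3.0),
           (2.0, 3.0), (2.5, 3.0), (3.0, 3.0)]
print(classify_delivery(samples))
```

A result with no lost updates but a non-empty list of delayed timestamps matches the pattern reported here: monitors held up somewhere between record processing and the client, then released in a burst.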
> >
> > Here's what we don't see:
> >
> > We don't see dependence on IOC architecture.
> > We don't see correlation with long-term resource (CPU, memory, FDs)
> > usage on either IOC.
> > We don't see correlation with network traffic.
> > We don't see correlation with short-term CPU usage (down to 1 second
> > sampling) on either IOC.
> > We don't see delays in the heartbeat timestamps generated at the target
> > IOC.
> >
> > Here's what we suspect:
> >
> > 1) Possible short term CPU loading on the target IOC by a task with a
> > priority less than the .5 second scan task but greater than CA or TCP.
> > 2) Possible buffering within CA or the TCP/IP stack.
> >
> > Aloha,
> > 	Peregrine
> > --
> > Peregrine M. McGehee	Coordinator, Los Alamos Astrophysics
> > (505) 667-3273 		MS H820, LANL, Los Alamos, NM 87545
>
>


