EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  <20062007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024  Index 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  <20062007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024 
<== Date ==> <== Thread ==>

Subject: RE: R3.13.10 ca_event problem [sls]
From: "Jeff Hill" <[email protected]>
To: "'Al Honey'" <[email protected]>, <[email protected]>
Date: Wed, 25 Jan 2006 12:46:15 -0700
Title: Message

 

> tt 0x1e3a648

> trcStack aborted: error in top frame

 

This may indicate that some errant code has completely obliterated the stack of this thread and, who knows, the ring buffer data structure may also have been obliterated.

 

More recent versions of vxWorks have ICE support and also crash dump analysis support (its about time). Those capabilities might really help in a situation like this.

 

We are purchasing vxWorks, but unfortunately could not afford the ICE upgrade. I think we understand correctly that WRS now has a low cost ICE based product lacking TCP/IP attached debugging. If you need TCP/IP attached debugging then you pay more, $15k per seat, and must go begging for additional $$$ for the ICE upgrade option. Also, I think I understand correctly that only two different types of ICE units, available sole source from WRS of course, are supported.

 

:-(

 

Retooling to use the new memory management features (EX: code page protection, stack overrun protection) in vxWorks 6 just might help to fault isolate situations like this as an errant thread might be suspended before it could do any damage.

 

I am still guessing that there is an errant snap-in (possibly device support) but thats still a wild guess.

 

One way to track this down might be to:

A) Discover first how to reproduce the problem quickly (thats never easy). You might try artificially increasing the external interrupt rate, the scan rates of the records, or the CA client induced load.

B) Next, suspend threads or set break points until the problem goes away.

C) Or, alternatively, eliminate devices until the problem goes away.

 

Should no way be found to reproduce the problem quicky you might try spliting the IOC in half with 1/2 of the devices on one IOC and 1/2 on another.

 

 

> Interestingly, one of the Adaptive Optics IOCs had the identical problem

 

What do these IOCs have in common. Obviously EPICS of course, but if there is a limited set of device support (or some other source code) shared only by the two IOCs that information might be useful input to the debugging process outlined above.

 

Jeff

-----Original Message-----
From: Al Honey [mailto:[email protected]]
Sent: Wednesday, January 25, 2006 11:50 AM
To: Jeff Hill; [email protected]
Subject: RE: R3.13.10 ca_event problem [sls]

Hey Jeff

 

Interestingly, one of the Adaptive Optics IOCs had the identical problem (a couple of months back) with R3.13.0beta12 and generated the identical invalid address. To correct the problem Erik increased the ring buffer to 3000. However, the problem occurred again on that IOC, yesterday, and with the same 0x30303030 address. Erik stated he tried ‘tt’ but it failed.

 

tt 0x1e3a648

> trcStack aborted: error in top frame

 

This is identical to the ‘tt’ attempt on the R3.13.10 empowered IOC last Friday (per K. Tsubota).

 

Seems to me that the overflow affects taskWd in such a way that the memory corruption occurs at a low level and not from within the application code (two different memory maps on two different IOCs with two wildly different EPICS versions generating the same invalid instruction address). If taskWd simply attempts to suspend the offending task then does that mean the corruption occurs when the ring buffer overflow message is generated (perhaps something as simple as a malformed log message)?

 

I believe that the highest rate of interrupts, on the IOCs affected, are from the encoder counter boards (40 hz) and that has been solid for more than a decade, and unchanged since 1998. Of course one of the bar code readers could cause a burst of interrupts greater than that but doubtful during the aforementioned events.

 

Next time the problem occurs I will try to remember to check the processor loading, however, during the events when the ring buffer overflows occurred but a task was not suspended, the loading did not seem excessive.

 

I have not reviewed the callback request code and will not have time to do so today but I will when I get a chance.

 

AH

 

-----Original Message-----
From: Jeff Hill [mailto:[email protected]]
Sent: Tuesday, January 24, 2006 10:10 AM
To: Al Honey; [email protected]
Subject: RE: R3.13.10 ca_event problem [sls]

 

 

Allan,

 

The "callbackRequest ring buffer full" message comes from the database function

callbackRequest(). This typically indicates that a device's interrupt production rate

exceeds the the record processing rate (due to CPU saturation). This is related to

a device/record that are configured "scanned on interrupt".

 

> task 36a5718 CA_event suspended

 

My best guess at this time is that there is object code

corruption resulting from a wild pointer, but that is only

a wild guess, based on very limited information.

 

To debug this, we of course need the stack trace from that CA event thread after

this instruction access exception occurs. Type "tt ( event thread's task id )".

 

> appeared to lose it’s connections to all it’s clients

 

This would happen if there is CPU saturation. Also watch out for MBUF starvation

occurring with newer versions of vxWorks (which dont dynamically expand the

MBUF pool should it becomes depleted).

 

Jeff

 

-----Original Message-----
From: Al Honey [mailto:[email protected]]
Sent: Tuesday, January 24, 2006 12:03 PM
To: [email protected]
Subject: R3.13.10 ca_event problem

Aloha

 

Finally, 8 years after deploying R3.13.0beta12, we at the Keck observatory are attempting to deploy R3.13.10.

 

During about 80 hours of testing we have had 3 events of the form “ring buffer overflow” within a CA_event task. On two of those occasions the Vxworks 5.3.1 system (pPC), on which the error occurred  appeared to lose it’s connections to all it’s clients, even other EPICS empowered IOC systems. The events lasted for 7-12 minutes, after which most connections were re-established (at least one channel to another IOC did not reconnect!). On the third such event a CA_event task was suspended and the following was in the iocLog:

 

tdc2:1040 Fri Jan 20 13:42:39 2006 filename="../taskwd.c" line number=159

tdc2:1040 Fri Jan 20 13:42:39 2006 task 36a5718 CA_event suspended

tdc2:1040 Fri Jan 20 13:43:28 2006 interrupt: ./taskwd.c" line number=159

tdc2:1040 Fri Jan 20 13:43:28 2006 callbackRequest ring buffer full

 

It so happened that I was monitoring the IOC via a telnet connection, thereby logging the following, which was not in the iocLog:

 

instruction access

Exception next instruction address: 0x30303030

Machine Status Register: 0x4000b030

Condition Register: 0x42000040

Task: 0x36a5718 "CA_event"^G

filename="../taskwd.c" line number=159

task 36a5718 CA_event suspended

interrupt: callbackRequest ring buffer full

 

Seeing as the instruction address 0x30303030 is outside of the address space…  life does not seem too good.

I was not physically present during this event so I did not get a chance to perform a task trace before the IOC was rebooted.

 

I really hate to think that there is an errant pointer somewhere.

 

Any thoughts would be appreciated.

 

Thanks,

            Al Honey

            Kevin Tsubota

 

 

 


References:
RE: R3.13.10 ca_event problem [sls] Al Honey

Navigate by Date:
Prev: RE: R3.13.10 ca_event problem [sls] Al Honey
Next: Re: R3.13.10 ca_event problem [sls] Andrew Johnson
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  <20062007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024 
Navigate by Thread:
Prev: RE: R3.13.10 ca_event problem [sls] Al Honey
Next: Re: R3.13.10 ca_event problem [sls] Andrew Johnson
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  <20062007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024 
ANJ, 02 Sep 2010 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· Search · EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·