Subject: RE: R3.13.10 ca_event problem [sls]
From: "Al Honey" <[email protected]>
To: "Jeff Hill" <[email protected]>, <[email protected]>
Date: Wed, 25 Jan 2006 08:49:59 -1000
Hey Jeff
Interestingly, one of the Adaptive Optics IOCs had the identical problem (a couple of months back) with R3.13.0beta12 and generated the identical invalid address. To correct the problem Erik increased the ring buffer to 3000. However, the problem occurred again on that IOC, yesterday, and with the same 0x30303030 address. Erik stated he tried ‘tt’ but it failed.
tt 0x1e3a648 > trcStack aborted: error in top frame
This is identical to the ‘tt’ attempt on the R3.13.10 empowered IOC last Friday (per K. Tsubota).
Seems to me that the overflow affects taskWd in such a way that the memory corruption occurs at a low level and not from within the application code (two different memory maps on two different IOCs with two wildly different EPICS versions generating the same invalid instruction address). If taskWd simply attempts to suspend the offending task then does that mean the corruption occurs when the ring buffer overflow message is generated (perhaps something as simple as a malformed log message)?
I believe that the highest rate of interrupts on the affected IOCs comes from the encoder counter boards (40 Hz), which have been solid for more than a decade and unchanged since 1998. Of course, one of the bar code readers could cause a burst of interrupts greater than that, but that seems doubtful during the aforementioned events.
Next time the problem occurs I will try to remember to check the processor loading. However, during the events when ring buffer overflows occurred but no task was suspended, the loading did not seem excessive.
I have not reviewed the callback request code and will not have time to do so today but I will when I get a chance.
AH
-----Original Message-----
Allan,
The "callbackRequest ring buffer full" message comes from the database function callbackRequest(). This typically indicates that a device's interrupt production rate exceeds the record processing rate (due to CPU saturation). It is associated with a device/record configured to be scanned on interrupt.
> task 36a5718 CA_event suspended
My best guess at this time is that there is object code corruption resulting from a wild pointer, but that is only a wild guess, based on very limited information.
To debug this, we of course need the stack trace from that CA event thread after this instruction access exception occurs. Type "tt ( event thread's task id )".
> appeared to lose its connections to all its clients
This would happen if there is CPU saturation. Also watch out for MBUF starvation with newer versions of vxWorks (which don't dynamically expand the MBUF pool should it become depleted).
Jeff