Subject: RE: R3.13.10 ca_event problem [sls]
From: "Jeff Hill" <[email protected]>
To: "'Al Honey'" <[email protected]>, <[email protected]>
Date: Wed, 25 Jan 2006 12:46:15 -0700
> tt 0x1e3a648
> trcStack aborted: error in top frame
This may indicate that some errant code has completely obliterated the stack of this thread and, who knows, the ring buffer data structure may also have been obliterated.
More recent versions of vxWorks have ICE support and also crash dump analysis support (it's about time). Those capabilities might really help in a situation like this.
We are purchasing vxWorks, but unfortunately could not afford the ICE upgrade. I think we understand correctly that WRS now has a low-cost ICE-based product lacking TCP/IP attached debugging. If you need TCP/IP attached debugging then you pay more, $15k per seat, and must go begging for additional $$$ for the ICE upgrade option. Also, I think I understand correctly that only two different types of ICE units, available sole source from WRS of course, are supported.
:-(
Retooling to use the new memory management features (e.g., code page protection, stack overrun protection) in vxWorks 6 just might help to isolate faults in situations like this, as an errant thread might be suspended before it could do any damage.
I am still guessing that there is an errant snap-in (possibly device support), but that's still a wild guess.
One way to track this down might be to: A) Discover first how to reproduce the problem quickly (that's never easy). You might try artificially increasing the external interrupt rate, the scan rates of the records, or the CA client induced load. B) Next, suspend threads or set breakpoints until the problem goes away. C) Or, alternatively, eliminate devices until the problem goes away.
Should no way be found to reproduce the problem quickly, you might try splitting the IOC in half, with 1/2 of the devices on one IOC and 1/2 on another.
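The halving strategy in the last two paragraphs amounts to a bisection over the device list. Here is a minimal sketch, in Python for brevity, assuming a single faulty device and a hypothetical `reproduces(devices)` check that stands in for booting an IOC with only those devices and waiting for the failure. Neither name comes from EPICS or vxWorks, and the device names are placeholders.

```python
from typing import Callable, List, Sequence


def find_faulty_device(
    devices: Sequence[str],
    reproduces: Callable[[Sequence[str]], bool],
) -> str:
    """Bisect a device list down to the one device that triggers the fault.

    Assumes exactly one faulty device and that the fault appears whenever
    that device is loaded. `reproduces` is a hypothetical stand-in for
    "boot an IOC with only these devices and see whether it fails".
    """
    candidates: List[str] = list(devices)
    if not candidates:
        raise ValueError("no devices to test")
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still shows the problem.
        candidates = first if reproduces(first) else second
    return candidates[0]


# Placeholder device names; the lambda simulates a fault tied to "dev_d".
devices = ["dev_a", "dev_b", "dev_c", "dev_d", "dev_e"]
culprit = find_faulty_device(devices, lambda subset: "dev_d" in subset)
```

With an intermittent fault, each check has to run long enough to be reasonably sure the problem is really gone, and a fault caused by two devices interacting breaks the single-culprit assumption.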
> Interestingly, one of the Adaptive Optics IOCs had the identical problem
What do these IOCs have in common? EPICS, obviously, but if there is a limited set of device support (or some other source code) shared only by the two IOCs, that information might be useful input to the debugging process outlined above.
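Finding code "shared only by the two IOCs" is a set comparison: modules loaded on every failing IOC but on no healthy one are the first suspects. A minimal sketch in Python follows; the module names are invented placeholders, not the actual contents of either IOC.

```python
from typing import Iterable, Set


def shared_suspects(
    failing: Iterable[Iterable[str]],
    healthy: Iterable[Iterable[str]] = (),
) -> Set[str]:
    """Return modules present on every failing IOC and on no healthy IOC.

    `failing` and `healthy` each hold one collection of module names per IOC.
    """
    failing_sets = [set(mods) for mods in failing]
    if not failing_sets:
        return set()
    common = set.intersection(*failing_sets)
    for mods in healthy:
        # Anything a healthy IOC also runs is a less likely culprit.
        common -= set(mods)
    return common


# Placeholder module lists, for illustration only.
ioc_1 = {"base", "devFoo", "devBar"}
ioc_2 = {"base", "devFoo", "devBaz"}
ioc_ok = {"base", "devBar"}
suspects = shared_suspects([ioc_1, ioc_2], [ioc_ok])
```

The result is only a starting list for the elimination steps above; a module that healthy IOCs also load can still be at fault if it misbehaves only under a particular load or configuration.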
Jeff