Hey Jeff
Interestingly, one of the Adaptive Optics
IOCs had the identical problem (a couple of months back) with R3.13.0beta12 and
generated the identical invalid address. To correct the problem Erik increased
the ring buffer to 3000. However, the problem occurred again on that IOC,
yesterday, and with the same 0x30303030 address. Erik stated he tried ‘tt’
but it failed.
tt 0x1e3a648
> trcStack aborted: error
in top frame
This is identical to the ‘tt’
attempt on the R3.13.10 empowered IOC last Friday (per K. Tsubota).
Seems to me that the overflow affects
taskWd in such a way that the memory corruption occurs at a low level and not
from within the application code (two different memory maps on two different
IOCs with two wildly different EPICS versions generating the same invalid
instruction address). If taskWd simply attempts to suspend the offending task
then does that mean the corruption occurs when the ring buffer overflow message
is generated (perhaps something as simple as a malformed log message)?
I believe that the highest rate of
interrupts, on the IOCs affected, are from the encoder counter boards (40 hz) and
that has been solid for more than a decade, and unchanged since 1998. Of course
one of the bar code readers could cause a burst of interrupts greater than that
but doubtful during the aforementioned events.
Next time the problem occurs I will try to
remember to check the processor loading, however, during the events when the
ring buffer overflows occurred but a task was not suspended, the loading did
not seem excessive.
I have not reviewed the callback request
code and will not have time to do so today but I will when I get a chance.
AH
-----Original Message-----
From: Jeff Hill
[mailto:[email protected]]
Sent: Tuesday, January 24, 2006
10:10 AM
To: Al Honey;
[email protected]
Subject: RE: R3.13.10 ca_event
problem [sls]
Allan,
The "callbackRequest ring
buffer full" message comes from the database function
callbackRequest(). This typically
indicates that a device's interrupt production rate
exceeds the the record
processing rate (due to CPU saturation). This is related to
a device/record that are
configured "scanned on interrupt".
> task 36a5718 CA_event suspended
My best guess at this time
is that there is object code
corruption resulting from a
wild pointer, but that is only
a wild guess, based on very
limited information.
To debug this, we of course need the
stack trace from that CA event thread after
this instruction access exception
occurs. Type "tt ( event thread's task id )".
> appeared to lose it’s
connections to all it’s clients
This would happen if there is CPU
saturation. Also watch out for MBUF starvation
occurring with newer versions of
vxWorks (which dont dynamically expand the
MBUF pool should it becomes
depleted).
Jeff
-----Original
Message-----
From: Al Honey
[mailto:[email protected]]
Sent: Tuesday, January 24, 2006
12:03 PM
To: [email protected]
Subject: R3.13.10 ca_event problem
Aloha
Finally, 8 years after deploying
R3.13.0beta12, we at the Keck observatory are attempting to deploy R3.13.10.
During about 80 hours of testing we
have had 3 events of the form “ring buffer overflow” within a
CA_event task. On two of those occasions the Vxworks 5.3.1 system (pPC), on
which the error occurred appeared to lose it’s connections to all
it’s clients, even other EPICS empowered IOC systems. The events lasted
for 7-12 minutes, after which most connections were re-established (at least
one channel to another IOC did not reconnect!). On the third such event a
CA_event task was suspended and the following was in the iocLog:
tdc2:1040 Fri Jan
20 13:42:39 2006 filename="../taskwd.c" line number=159
tdc2:1040 Fri Jan
20 13:42:39 2006 task 36a5718 CA_event suspended
tdc2:1040 Fri Jan
20 13:43:28 2006 interrupt: ./taskwd.c" line number=159
tdc2:1040 Fri Jan
20 13:43:28 2006 callbackRequest ring buffer full
It so happened that I was monitoring
the IOC via a telnet connection, thereby logging the following, which was not
in the iocLog:
instruction access
Exception next instruction
address: 0x30303030
Machine Status Register:
0x4000b030
Condition Register:
0x42000040
Task: 0x36a5718
"CA_event"^G
filename="../taskwd.c"
line number=159
task 36a5718 CA_event
suspended
interrupt: callbackRequest
ring buffer full
Seeing as the instruction address
0x30303030 is outside of the address space… life does not seem too
good.
I was not physically present during
this event so I did not get a chance to perform a task trace before the IOC was
rebooted.
I really hate to think that there is
an errant pointer somewhere.
Any thoughts would be appreciated.
Thanks,
Al Honey
Kevin Tsubota