Hey
Jeff
Interestingly, one of
the Adaptive Optics IOCs had the identical problem (a couple of months back)
with R3.13.0beta12 and generated the identical invalid address. To correct the
problem Erik increased the ring buffer to 3000. However, the problem occurred
again on that IOC, yesterday, and with the same 0x30303030 address. Erik
stated he tried ‘tt’ but it failed.
tt
0x1e3a648
> trcStack aborted:
error in top frame
This is identical to
the ‘tt’ attempt on the R3.13.10 empowered IOC last Friday (per K.
Tsubota).
Seems to me that the
overflow affects taskWd in such a way that the memory corruption occurs at a
low level and not from within the application code (two different memory maps
on two different IOCs with two wildly different EPICS versions generating the
same invalid instruction address). If taskWd simply attempts to suspend the
offending task then does that mean the corruption occurs when the ring buffer
overflow message is generated (perhaps something as simple as a malformed log
message)?
I believe that the
highest rate of interrupts, on the IOCs affected, are from the encoder counter
boards (40 hz) and that has been solid for more than a decade, and unchanged
since 1998. Of course one of the bar code readers could cause a burst of
interrupts greater than that but doubtful during the aforementioned events.
Next time the problem
occurs I will try to remember to check the processor loading, however, during
the events when the ring buffer overflows occurred but a task was not
suspended, the loading did not seem excessive.
I have not reviewed
the callback request code and will not have time to do so today but I will
when I get a chance.
AH
-----Original
Message-----
From: Jeff Hill
[mailto:[email protected]]
Sent: Tuesday, January 24, 2006 10:10
AM
To: Al Honey;
[email protected]
Subject: RE: R3.13.10 ca_event problem
[sls]
Allan,
The "callbackRequest ring buffer
full" message comes from the database function
callbackRequest(). This typically
indicates that a device's interrupt production rate
exceeds the the record
processing rate (due to CPU saturation). This is related to
a device/record that are
configured "scanned on interrupt".
> task
36a5718 CA_event suspended
My best guess
at this time is that there is object code
corruption
resulting from a wild pointer, but that is only
a wild guess,
based on very limited information.
To debug this, we of course need
the stack trace from that CA event thread after
this instruction access exception
occurs. Type "tt ( event thread's task id )".
> appeared to lose it’s
connections to all it’s clients
This would happen if there is CPU
saturation. Also watch out for MBUF starvation
occurring with newer versions of
vxWorks (which dont dynamically expand the
MBUF pool should it becomes
depleted).
Jeff
-----Original
Message-----
From: Al Honey
[mailto:[email protected]]
Sent: Tuesday, January 24, 2006 12:03
PM
To:
[email protected]
Subject: R3.13.10 ca_event
problem
Aloha
Finally, 8 years after deploying
R3.13.0beta12, we at the Keck observatory are attempting to deploy
R3.13.10.
During about 80 hours of testing
we have had 3 events of the form “ring buffer overflow” within a CA_event
task. On two of those occasions the Vxworks 5.3.1 system (pPC), on which the
error occurred appeared to lose it’s connections to all it’s clients,
even other EPICS empowered IOC systems. The events lasted for 7-12 minutes,
after which most connections were re-established (at least one channel to
another IOC did not reconnect!). On the third such event a CA_event task was
suspended and the following was in the iocLog:
tdc2:1040 Fri
Jan 20 13:42:39 2006 filename="../taskwd.c" line
number=159
tdc2:1040 Fri
Jan 20 13:42:39 2006 task 36a5718 CA_event suspended
tdc2:1040 Fri
Jan 20 13:43:28 2006 interrupt: ./taskwd.c" line
number=159
tdc2:1040 Fri
Jan 20 13:43:28 2006 callbackRequest ring buffer full
It so happened that I was
monitoring the IOC via a telnet connection, thereby logging the following,
which was not in the iocLog:
instruction
access
Exception
next instruction address: 0x30303030
Machine
Status Register: 0x4000b030
Condition
Register: 0x42000040
Task:
0x36a5718 "CA_event"^G
filename="../taskwd.c"
line number=159
task
36a5718 CA_event suspended
interrupt:
callbackRequest ring buffer full
Seeing as the instruction
address 0x30303030 is outside of the address space… life does not seem
too good.
I was not physically present
during this event so I did not get a chance to perform a task trace before
the IOC was rebooted.
I really hate to think that
there is an errant pointer somewhere.
Any thoughts would be
appreciated.
Thanks,
Al Honey
Kevin Tsubota