Dear Gurus,
Since Friday we have been observing unpleasant behaviour on one of our
IOCs:
Occasionally (approx. every 20 to 30 minutes at normal load) the
scanOnce task gets suspended due to an Access Fault. A couple of
seconds later, the dbCaLink task follows after writing "rngBufPut overflow
in scanOnce" to the log. After that there is no more CA-based record
processing, i.e. no more CA links and no operator access.
The console output looks like this:
Access Fault
Program Counter: 0x006ec252
Status Register: 0x3004
Access Address : 0xb4007745
Special Status : 0x0505
Task: 0x4550e0 "scanOnce"
task: 0X526b70 taskwd
task 4550e0 scanOnce suspended
task: 0X506370 EV dbCaLink
S_errno_ENOENT rngBufPut overflow in scanOnce
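
The overflow message fits a simple producer/consumer picture. Below is a minimal sketch, assuming that scanOnce requests are queued in a fixed-size ring buffer which only the scanOnce task drains. Once that consumer is suspended, every put fails as soon as the buffer is full. The names and the tiny capacity are illustrative only, not the EPICS or VxWorks implementation.

```c
#include <stddef.h>

#define QUEUE_SLOTS 4   /* tiny capacity, chosen for illustration only */

/* Hypothetical fixed-size queue of record pointers. */
struct ring {
    void  *slot[QUEUE_SLOTS];
    size_t head;    /* next slot to read  */
    size_t tail;    /* next slot to write */
    size_t count;   /* entries currently queued */
};

/* Producer side: returns 1 on success, 0 if the buffer is full
 * (the situation reported as "rngBufPut overflow"). */
static int ring_put(struct ring *r, void *item)
{
    if (r->count == QUEUE_SLOTS)
        return 0;
    r->slot[r->tail] = item;
    r->tail = (r->tail + 1) % QUEUE_SLOTS;
    r->count++;
    return 1;
}

/* Consumer side: returns the oldest entry, or NULL if the queue is empty.
 * If the consumer task is suspended, this is never called again. */
static void *ring_get(struct ring *r)
{
    void *item;

    if (r->count == 0)
        return NULL;
    item = r->slot[r->head];
    r->head = (r->head + 1) % QUEUE_SLOTS;
    r->count--;
    return item;
}
```

With no call to ring_get(), the fifth ring_put() on an empty four-slot queue is the first one to fail; a single ring_get() makes room for one more put.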
We tried to check what's happening ...
-> lkAddr 0x006ec252
0x006ec246 _dbScanLock text
0x006ec308 _dbScanUnlock text
0x006ec33c _dbLockGetLockId text
-> l dbScanLock
_dbScanLock:
6ec246 4e56 0000 LINK .W A6,#0
6ec24a 2f0b MOVE .L A3,-(A7)
6ec24c 2f0a MOVE .L A2,-(A7)
6ec24e 266e 0008 MOVEA .L (0x8,A6),A3
6ec252 246b 00dc MOVEA .L (0xdc,A3),A2
6ec256 4a8a TST .L A2
6ec258 661a BNE 0x006ec274
and after comparing to dbLock.c:
void dbScanLock(dbCommon *precord)
{
    lockRecord  *plockRecord;
    lockSet     *plockSet;
    STATUS      status;

    if(!(plockRecord= (lockRecord *)precord->lset)) {
        epicsPrintf("dbScanLock plockRecord is NULL record %s\n",
            precord->name);
        exit(1);
    }
we assume that the first executable line of dbScanLock (the
dereferencing of precord) crashes, i.e. dbScanLock is called with a
non-NULL junk precord (0xb4007669 in the above example, where 0xdc is the
lset offset).
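
The pointer arithmetic behind that assumption can be checked with a small sketch. It assumes that the faulting instruction is the MOVEA.L (0xdc,A3),A2 at 0x006ec252, so that the reported access address equals precord plus the lset offset. The function name is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: for a faulting base+displacement access,
 * recover the base register value (here: precord) as
 * access address minus displacement. */
static uint32_t fault_base(uint32_t access_addr, uint32_t displacement)
{
    return access_addr - displacement;
}
```

With the values from the console output, fault_base(0xb4007745, 0xdc) yields 0xb4007669.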
Unfortunately the scanOnce task is not restarted by the taskwd (as
opposed to the periodic scan tasks), so rebooting the IOC is the only way
out.
So - who the hell is calling scanOnce() with a junk record pointer
(this is the way we think dbScanLock gets called)?
We inserted a patch in scanOnce that checks the precord argument for
validity and starts a tt() to see which task is the bad guy here. Another
advantage of that patch is that the junk record is not processed, which
should extend the IOC's uptime.
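
A minimal sketch of the kind of validity check such a patch could use, not the actual patch. It assumes that record pointers are 4-byte aligned and lie inside the board's local RAM; the function name, the alignment rule and the RAM bounds passed in are all assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical plausibility check for a record pointer.
 * Returns 1 if addr could be a record pointer, 0 if it is NULL,
 * misaligned, or outside the RAM window [ram_base, ram_base + ram_size).
 * Assumption: records come from the allocator and are 4-byte aligned. */
static int precord_plausible(uint32_t addr, uint32_t ram_base,
                             uint32_t ram_size)
{
    if (addr == 0)
        return 0;                       /* NULL pointer */
    if (addr & 0x3u)
        return 0;                       /* not 4-byte aligned */
    if (addr < ram_base || addr - ram_base >= ram_size)
        return 0;                       /* outside the assumed RAM window */
    return 1;
}

/* In the patched scanOnce, a failed check would log the pointer,
 * start a tt() on the calling task and return without queueing
 * the record. */
```

Assuming 8 MB of RAM starting at address 0, the junk pointer 0xb4007669 fails this check on both alignment and range.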
Some more data:
- IOC: MVME 162-042 (25MHz 8MB), VxWorks 5.2, EPICS core R3.13.0.beta4
- CheckStack looks absolutely normal.
- casr says (at the end):
There are currently 52420 bytes on the server's free list
1 client(s), 134 channel(s), and 133 event(s) (monitors)
The server's resource id conversion table:
Bucket entries in use = 474 bytes in use = 23988
Free list bytes in use = 2144
Bucket entries/hash id - mean = 0.115723 std dev = 0.319892 max = 1
- dbcar states:
ncalinks 1876 not connected 0 no_read_access 0 no_write_access 0
nDisconnect 0 nNoWrite 0
- dbnr lists:
0088 ai
0066 ao
0795 bi
0234 bo
0394 calc
0002 fanout
0114 longout
0080 mbbiDirect
0038 mbboDirect
0038 stringout
0002 sub
Total Records: 1851
- Memory is not critical:
 status    bytes    blocks   avg block   max block
 ------  ---------  ------  ----------  ----------
current
   free    2006276      13      154328     1957672
Anyone any hint anywhere?
Clueless,
Ralph
--
__ Ralph Lange Email: [email protected]
/\ \ WWW: http://www.bessy.de/~lange
/ \ \ BESSY II
/ /\ \ \ Berliner Elektronenspeicherring- Snail: BESSY II
/ / /\ \ \ Gesellschaft fuer Synchrotron- Rudower Chaussee 5
/ / /__\_\ \ strahlung m.b.H. D-12489 Berlin, Germany
/ / /________\ Phone: +49 30 6392-4862
\/___________/ Control System Group Fax: ... -4859