EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: Database hanging
From: Andrew Johnson <[email protected]>
To: "Rees, NP (Nick)" <[email protected]>
Cc: EPICS tech-talk <[email protected]>
Date: Fri, 03 Nov 2006 14:53:22 -0600

Hi Nick,

Rees, NP (Nick) wrote:
We are having channel access problems occasionally on some R3.14.8.2
vxWorks IOCs. It seems that one of the database semaphores isn't being
released for some reason and this is screwing everything up. No task is
suspended, and there are no inverted priorities indicating a deadlock
(but this is no guarantee). Has anyone else seen this?

The details follow.

The simplest symptom is caget fails as follows:

[npr78@i06-ws002 ~]$ caget BL06I-AL-SLITS-01:YA
Read operation timed out: some PV data was not read.
BL06I-AL-SLITS-01:YA           0
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "op=0, channel=BL06I-AL-SLITS-01:YA, type=DBR_TIME_DOUBLE,
count=1, ctx="BL06I-MO-IOC-01.diamond.ac.uk:5064""
    Source File: ../getCopy.cpp line 82
    Current Time: Fri Nov 03 2006 16:30:00.426411000
..................................................................

If I try a dbgf on the IOC to try and get the same parameter it hangs
and the stack trace emitted after Ctrl-C is as follows:

BL06I-MO-IOC-01 -> dbgf "BL06I-AL-SLITS-01:YA"
231f7c vxTaskEntry    +68 : shell ()
1f81e0 shell          +190: 1f820c ()
1f840c shell          +3bc: execute ()
1f8590 execute        +d8 : yyparse ()
2122d0 yyparse        +71c: 210668 ()
2107ec yystart        +96c: dbgf ()
1e752e18 dbgf           +15c: dbGetField ()
1e73f3fc dbGetField     +68 : dbScanLock ()
1e73c70c dbScanLock     +1b4: epicsMutexLock ()
1e8089c0 epicsMutexLock +24 : semTake ()
2283ac semTake        +13c: semMTake ()
tShell restarted.

So the shell is hanging because it can't get a lock in dbScanLock.
Address 0x1e73c70c is somewhere in the middle of dbScanLock - the next
routine is dbScanUnlock, and the addresses of the two routines are:

BL06I-MO-IOC-01 -> lkup "dbScanLock"
dbScanLock                0x1e73c558 text     (BL06I-MO-IOC-01.munch)
BL06I-MO-IOC-01 -> lkup "dbScanUnlock"
dbScanUnlock              0x1e73c8b8 text     (BL06I-MO-IOC-01.munch)

In the middle of dbScanLock there are various statements of the form:
   epicsMutexMustLock(plockSet->lock);
   epicsMutexMustLock(lockSetModifyLock);

... and so the problem is presumably in one of these semaphores.

Has anyone seen something similar? Does anyone have suggestions for what
diagnostics I should run next time?

Unlike dbgf, dbpr doesn't try to lock the lockset, so I would definitely recommend using it instead next time. There are also a couple of tools that you should run: dblsr, which can optionally take the name of a record and an interest level, and dbLockShowLocked, which takes an interest level.

What is the record type and device type of the BL06I-AL-SLITS-01:YA record? From your symptoms it does appear that the record's lockset is locked, but this could easily be due to a device support issue of some kind, which you would have to investigate. Device support is responsible for locking a record before reprocessing it at asynchronous I/O completion time, and it's possible that it might not unlock it again afterwards in some circumstances. Check the state of the tasks associated with all devices that are connected with the record.

This record might not be the one that actually locked the lockset, so bear that in mind during your investigations and look at any other device support related to that lockset.
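
To make the suspected failure mode concrete, here is a minimal toy sketch in plain C of a completion handler that takes a lock and misses the unlock on an error path. It is not EPICS code: ToyLockSet, toyLock, toyUnlock, toyShow, buggyComplete and safeComplete are all hypothetical names invented for illustration, and the toy lock returns -1 where a real mutex (such as the one behind dbScanLock) would block.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-in for a lockset lock (hypothetical type, not EPICS API). */
typedef struct {
    int locked;         /* 1 while the lock is held */
    const char *owner;  /* label of the current holder, for diagnostics */
} ToyLockSet;

/* Try to take the lock. Returns 0 on success and -1 if it is already
 * held; a real mutex would block here, which is what a hung dbgf
 * looks like from the shell. */
int toyLock(ToyLockSet *ls, const char *who)
{
    if (ls->locked)
        return -1;
    ls->locked = 1;
    ls->owner = who;
    return 0;
}

void toyUnlock(ToyLockSet *ls)
{
    assert(ls->locked);   /* releasing an unheld lock is a programming error */
    ls->locked = 0;
    ls->owner = NULL;
}

/* Buggy completion handler: the error path returns with the lock held. */
int buggyComplete(ToyLockSet *ls, int ioStatus)
{
    if (toyLock(ls, "buggyComplete") != 0)
        return -1;
    if (ioStatus != 0)
        return ioStatus;  /* BUG: no toyUnlock() on this path */
    /* ... record processing would happen here ... */
    toyUnlock(ls);
    return 0;
}

/* Safer shape: one exit path, so the lock is released on every return. */
int safeComplete(ToyLockSet *ls, int ioStatus)
{
    if (toyLock(ls, "safeComplete") != 0)
        return -1;
    if (ioStatus == 0) {
        /* ... record processing would happen here ... */
    }
    toyUnlock(ls);
    return ioStatus;
}

/* Report the lock state, loosely in the spirit of a "show locked" tool. */
void toyShow(const ToyLockSet *ls)
{
    if (ls->locked)
        printf("locked by %s\n", ls->owner);
    else
        printf("unlocked\n");
}
```

Calling buggyComplete with a non-zero ioStatus on a fresh ToyLockSet leaves it locked, so every later toyLock on it fails and toyShow names buggyComplete as the holder. The same call through safeComplete leaves it unlocked. The owner label also shows why the record whose read hangs need not be the one whose device support took the lock.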

HTH,

- Andrew
--
There is considerable overlap between the intelligence of the smartest
bears and the dumbest tourists -- Yosemite National Park Ranger
