Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: No more CA connections on different IOC
From: "Jeff Hill" <johill@lanl.gov>
To: "'Kay-Uwe Kasemir'" <kasemirk@ornl.gov>
Cc: "'Ernest Williams'" <ernesto@ornl.gov>, "EPICS-tech-talk" <tech-talk@aps.anl.gov>
Date: Tue, 25 Apr 2006 11:59:35 -0600
> Task trace for that IOC server thread:
> 
> scl-llrf-ioc21d> tt 0xe32580
> 1fa774 vxTaskEntry +68 : 1b60b48 ()
> 1b60bb8 epicsThreadPrivateGet+f8 : camsgtask ()
> 1a777f4 camsgtask +36c: camessage ()
> 1a7c750 camessage +320: 1a79df0 ()
> 1a79df0 casUserNameInitiatingCurrentThread+2404: asAddClient ()
> 1a88cf8 asAddClient +110: epicsMutexLock ()
> 1b57888 epicsMutexLock +24 : semTake ()
> 1f0c74 semTake +13c: semMTake ()
> 
> Does that tell you something?
> 

If the thread remains stuck in the above spot indefinitely, then I can think
of some causes.

1) We might have a deadlock. A deadlock usually looks like this:

a) There is a thread that has the access security lock and is calling
subsystem PDQ that takes its lock.

b) There is a thread that has the PDQ subsystem lock and it is calling the
access security system and is trying to take the access security lock.

Presuming a deadlock, you have caught, in a glass jar so to speak, one half
of what is needed. If you can find the other thread participating in the
deadlock and send its stack trace, we can identify the vulnerability and get
it fixed. This isn't always easy.
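
To make the a) / b) pattern concrete, here is a minimal sketch in Python
(not EPICS code). The two locks only stand in for the access security lock
and the lock of the hypothetical PDQ subsystem, and every name in it is made
up for illustration. The second acquire uses a timeout so the demo reports
the stall instead of hanging the way the real IOC thread does.

```python
import threading


def demonstrate_lock_order_inversion(timeout=0.2):
    """Two threads take the same two locks in opposite order.

    Returns a dict saying whether each thread obtained its second lock.
    """
    as_lock = threading.Lock()    # stand-in for the access security lock
    pdq_lock = threading.Lock()   # stand-in for the hypothetical PDQ lock
    both_hold_first = threading.Barrier(2)    # both threads own their first lock
    both_tried_second = threading.Barrier(2)  # both attempts have finished
    results = {}

    def worker(name, first, second):
        with first:
            # Wait until the other thread also owns its first lock.
            both_hold_first.wait()
            # A real deadlock would block here forever; the timeout
            # lets the demo report the stall and move on.
            got = second.acquire(timeout=timeout)
            results[name] = got
            # Keep holding the first lock until the other thread has
            # finished its own attempt, so the outcome is deterministic.
            both_tried_second.wait()
            if got:
                second.release()

    threads = [
        threading.Thread(target=worker, args=("thread_a", as_lock, pdq_lock)),
        threading.Thread(target=worker, args=("thread_b", pdq_lock, as_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(demonstrate_lock_order_inversion())
```

Neither thread gets its second lock, which is why one stack trace alone
(one half of the pair) does not identify the other participant.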

HOWEVER, I looked at the CA server source code and it does __not__ appear to
hold any CA server lock at all when it calls asAddClient, so it appears (at
least initially) that a deadlock isn't occurring between the CA server and
access security (some other subsystem might still be deadlocked with access
security).

2) A lock might have been compromised in the access security system (i.e.
the lock was taken, but the access security code somehow returned without
releasing it). Do these events correlate in time in any way with changes to
the access security rules (reloading them or changing their CA based
inputs)?
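
As an illustration of cause 2, here is a small Python sketch (again not
EPICS code; the function names and the "rules" argument are hypothetical).
It shows how an early return on an error path can leave a lock held, so that
every later caller blocks the way the CA server thread in the trace does.

```python
import threading


def leaky_update(lock, rules):
    """Buggy pattern: the error path returns without releasing the lock."""
    lock.acquire()
    if not rules:
        return False      # BUG: lock is still held after this return
    lock.release()
    return True


def safe_update(lock, rules):
    """Same logic, but the lock is released on every exit path."""
    with lock:
        if not rules:
            return False
        return True


def lock_is_free(lock):
    """Probe the lock without blocking; release it again if we got it."""
    if lock.acquire(blocking=False):
        lock.release()
        return True
    return False


if __name__ == "__main__":
    leaked = threading.Lock()
    leaky_update(leaked, [])
    print("after leaky_update:", lock_is_free(leaked))

    clean = threading.Lock()
    safe_update(clean, [])
    print("after safe_update:", lock_is_free(clean))
```

If the stalls line up with rule reloads or input changes, an error path of
this kind would be the first place to look.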

3) The lock data structure was damaged. That usually manifests as a lock
request failure error message or, more typically, multiple issues - so this
is less likely.

Jeff

