Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Bug# 1279147 - Gateway sigsegv's when cleaning up channels using ca_clear_channel
From: "Shankar, Murali" <mshankar@slac.stanford.edu>
To: "core-talk@aps.anl.gov" <core-talk@aps.anl.gov>
Date: Tue, 11 Feb 2014 18:08:24 -0800

I’ve opened a bug for this, but I also have an exception-handling question.

 

At LCLS, the archiver appliances connect to the IOCs through a CA gateway. The gateway crashes once in a while. The crashes do not appear to be caused by an “out-of-memory” condition or by the gateway having been running for a long time. Instead, they seem to be related to an IOC that is CPU-overloaded and keeps disconnecting.

 

Unexpected problem with CA circuit to server "eioc-und1-mp01.slac.stanford.edu:5068" was "Connection reset by peer" - disconnecting
Feb 07 02:21:23 Warning: Virtual circuit disconnect eioc-und1-mp01.slac.stanford.edu:5068

Feb 07 02:21:23 !!! Errlog message received (message is above)
Unexpected problem with CA circuit to server "eioc-und1-mp01.slac.stanford.edu:5068" was "Connection reset by peer" - disconnecting

Feb 07 02:41:49 !!! Errlog message received (message is above)
Feb 07 02:41:49 Warning: Virtual circuit disconnect eioc-und1-mp01.slac.stanford.edu:5068
Feb 07 04:42:32 PV Gateway Aborting (SIGSEGV)

I have core dumps and can examine the variables; the gateway is indeed trying to clean up the PVs from this IOC using ca_clear_channel. However, the crash occurs at a fundamental location in EPICS base (tsDLList.h:238). It looks as though an element in the linked list has a pPrev of 0 but is not itself the pFirst element. I can provide more details or the core files if needed.
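
To make the suspected state concrete, here is a minimal sketch of a doubly linked list that refuses to unlink a node whose links disagree with the list's head and tail. The names Node, List, linksConsistent and corruptedNodeIsRejected are hypothetical stand-ins for illustration only; this is not the tsDLList code from EPICS base, and the consistency check is an assumption about what a defensive version could look like.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for a list node and list; NOT the EPICS base classes.
struct Node {
    Node *pNext = nullptr;
    Node *pPrev = nullptr;
};

struct List {
    Node *pFirst = nullptr;
    Node *pLast = nullptr;
    std::size_t itemCount = 0;

    // Append a node at the tail of the list.
    void add(Node &n) {
        n.pNext = nullptr;
        n.pPrev = pLast;
        if (pLast) pLast->pNext = &n; else pFirst = &n;
        pLast = &n;
        ++itemCount;
    }

    // A node may have a null pPrev only if it is pFirst,
    // and a null pNext only if it is pLast.
    bool linksConsistent(const Node &n) const {
        const bool headOk = (n.pPrev == nullptr) == (pFirst == &n);
        const bool tailOk = (n.pNext == nullptr) == (pLast == &n);
        return headOk && tailOk;
    }

    // Unlink a node. Returns false, leaving the list untouched, instead of
    // dereferencing a null neighbour pointer.
    bool remove(Node &n) {
        if (!linksConsistent(n)) return false;
        if (pLast == &n) pLast = n.pPrev; else n.pNext->pPrev = n.pPrev;
        if (pFirst == &n) pFirst = n.pNext; else n.pPrev->pNext = n.pNext;
        n.pNext = n.pPrev = nullptr;
        --itemCount;
        return true;
    }
};

// Recreates the state seen in the core dump: pPrev == 0 on a node that is
// not pFirst. An unchecked remove would write through the null pPrev here.
bool corruptedNodeIsRejected() {
    List list;
    Node a, b, c;
    list.add(a);
    list.add(b);
    list.add(c);
    b.pPrev = nullptr;  // simulate the corruption
    return !list.remove(b) && list.itemCount == 3;
}
```

In this sketch, a remove without the check would reach the equivalent of prevNode.pNext = theNode.pNext with a null prevNode, which matches the statement shown at frame #5 in the session below.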

 

This does not seem to be a gateway bug; it looks like an issue in ca_clear_channel. However, I don’t want to change EPICS base. Perhaps I can catch the exception at gatePv.cc:240 and move on. Should I consider patching the gateway code this way? I know this would leak memory, but the crash does not happen often.

 

Any help is appreciated.

 

Regards,

Murali

 

(gdb) bt
#0  0x0016c410 in __kernel_vsyscall ()
#1  0x0086de30 in raise () from /lib/libc.so.6
#2  0x0086f741 in abort () from /lib/libc.so.6
#3  0x080513a4 in sig_end (sig=11) at ../gateway.cc:300
#4  <signal handler called>
#5  0x0075a8c9 in remove (this=0xaf728260, guard=..., chan=...) at ../../../include/tsDLList.h:238
#6  tcpiiu::uninstallChan (this=0xaf728260, guard=..., chan=...) at ../tcpiiu.cpp:1981
#7  0x007512b7 in nciu::destroy (this=0x17e24b88, guard=...) at ../nciu.cpp:93
#8  0x00768347 in oldChannelNotify::destructor (this=0x17e179f0, guard=...) at ../oldChannelNotify.cpp:71
#9  0x00749039 in ca_clear_channel (pChan=0x17e179f0) at ../access.cpp:386
#10 0x080582e0 in gatePvData::~gatePvData (this=0x157f79b0, __in_chrg=<value optimized out>) at ../gatePv.cc:240
#11 0x08062064 in gatePvNode::destroy (this=0x1ca02110) at ../gateServer.h:69
#12 0x0805d6e7 in gateServer::inactiveDeadCleanup (this=0x925af40) at ../gateServer.cc:1490
#13 0x08060fc8 in gateServer::mainLoop (this=0x925af40) at ../gateServer.cc:285
#14 0x0804ef18 in startEverything (prefix=0xbfd7bbe2 "GWLCLSARCH") at ../gateway.cc:656
#15 0x080511a8 in main (argc=16, argv=0xbfd7b494) at ../gateway.cc:1299
……
(gdb) up
#4  <signal handler called>
(gdb) up
#5  0x0075a8c9 in remove (this=0xaf728260, guard=..., chan=...) at ../../../include/tsDLList.h:238
238                 prevNode.pNext = theNode.pNext;
(gdb) print theNode
$1 = (tsDLNode<nciu> &) @0x17e24b98: {pNext = 0x17d44d68, pPrev = 0x0}
(gdb) up
#6  tcpiiu::uninstallChan (this=0xaf728260, guard=..., chan=...) at ../tcpiiu.cpp:1981
1981               this->createReqPend.remove ( chan );
(gdb) print chan
$2 = (nciu &) @0x17e24b88: {<cacChannel> = {_vptr.cacChannel = 0x781168, static priorityMax = 99, static priorityMin = 0, static priorityDefault = 0, static priorityLinksDB = 99,
    static priorityArchive = 49, static priorityOPI = 0, callback = @0x17e179f0}, <chronIntIdRes<nciu>> = {<chronIntId> = {<intId<unsigned int, 8u, 32u>> = {
        id = 833073}, <No data fields>}, <tsSLNode<nciu>> = {pNext = 0x0}, <No data fields>}, <channelNode> = {<tsDLNode<nciu>> = {pNext = 0x17d44d68, pPrev = 0x0},
    listMember = cs_createReqPend}, <privateInterfaceForIO> = {_vptr.privateInterfaceForIO = 0x7811d8}, eventq = {pFirst = 0x0, pLast = 0x0, itemCount = 0}, accessRightState = {
    f_readPermit = false, f_writePermit = false, f_operatorConfirmationRequest = false}, cacCtx = @0x925e2d8, pNameStr = 0x1c5838a8 "BLM:UND1:MP01:XILINX_CELS.LOW", piiu = 0xaf728260,
  sid = 4294967295, count = 0, retry = 1, nameLength = 30, typeCode = 65535, priority = 0 '\000'}
(gdb) quit

 


Replies:
Re: Bug# 1279147 - Gateway sigsegv's when cleaning up channels using ca_clear_channel Michael Davidsaver
Re: Bug# 1279147 - Gateway sigsegv's when cleaning up channels using ca_clear_channel Andrew Johnson
