EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

2002  2003  2004  2005  2006  2007  2008  2009  <20102011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024  Index 2002  2003  2004  2005  2006  2007  2008  2009  <20102011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024 
<== Date ==> <== Thread ==>

Subject: RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp
From: "Jeff Hill" <[email protected]>
To: "'Ralph Lange'" <[email protected]>
Cc: "'EPICS Core Talk'" <[email protected]>
Date: Wed, 18 Aug 2010 17:09:10 -0600
> My point is: my fix did not change the code in cac::destroyIIU at all.
> I also did not switch back to looping over the circuitList.count()
> instead of looping over iiuExistenceCount.
> My fix just switched from gradually increasing iiuExistenceCount
> whenever a new circuit gets added to the list, towards setting
> iiuExistenceCount to circuitList.count() under the lock before shutting
> down the circuits.
> As (in your fix) iiuExistenceCount++ happens immediately after adding
> the circuit to the circuitList(), I still think these two versions are
> equivalent.

Not a problem with delving into the details, it's good to clarify what 
the issues are.

Consider this potential scenario using the source code in revision 12099 (your version).

o There are some number of TCP send threads simultaneously in cac :: destroyIIU at 
line 1228. They can be at this location in the source for many different reasons; this can be 
precipitated by tcp circuit disconnect, client context shutdown, a c++ exception, and many 
other reasons. The lock is released at line 1228 to avoid shutdown deadlocks. Each of them, one 
at a time, have in their recent past taken the lock and removed themselves from cac's circuitList
(see line 1218 in cac.cpp). They are currently running the tcpiiu dtor w/o holding the lock
(at the before mentioned line 1228 in cac.cpp).

o The user asynchronously destroys the ca context and the user's thread ends up at line 289 
in ~cac where your version is setting "this->iiuExistenceCount = this->circuitList.count();".
This can happen because the lock isn't currently held by any of the send threads.

o Now the iiuExistenceCount is far less than the number of tcpiiu that still exist and have not
finished shutting done.

o ~cac happily continues running, w/o waiting for the send threads to finish shutting down, and 
de-allocates it's free list. This happens before the send threads finish returning the memory 
allocated for their tcpiiu object to the free store. See line 1232 in cac.cpp. Death is imminent.

Jeff
______________________________________________________
Jeffrey O. Hill           Email        [email protected]
LANL MS H820              Voice        505 665 1831
Los Alamos NM 87545 USA   FAX          505 665 5107

Message content: TSPA


> -----Original Message-----
> From: Ralph Lange [mailto:[email protected]]
> Sent: Wednesday, August 18, 2010 8:52 AM
> To: Jeff Hill
> Cc: 'EPICS Core Talk'
> Subject: Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp
> 
>   Hi Jeff,
> 
> On 17.08.2010 20:05, Jeff Hill wrote:
> > Hi Ralph,
> >> I'm not sure about that race condition, though.
> >> But don't see where my fix would break the fixed behavior, as it is
> >> still using a countre, decrementing the iiuExistenceCount is still
> >> happening at the end of cac::destroyIIU(), and incrementing
> >> iiuExistenceCount (original fix) happens exactly when it is added to
> >> the
> >> list, so that (my fix) using circuitList.count() to set
> >> iiuExistenceCount yields the same result.
> > The cac::destroyIIU has roughly three sections of code. In the first
> section there is a lock protected remove of the per-tcp-circuit data
> structure from cac's list etc, in the second section the per-tcp-
> circuit destructor is run without holding the lock so we will not
> deadlock, and in the third section we are protected by the lock when
> returning the per-tcp-circuit's memory to the free list and
> decrementing the iiuExistenceCount.
> >
> > So, of course the business at hand is that we need to prevent the cac
> destructor from finishing until each of the tcp circuits is completely
> shut down. We do not make the cac destructor wait for the list count to
> become zero because we have to wait for stuff that happens after the
> per-tcp-circuit is removed from the list. Hence the need for the
> iiuExistenceCount.
> >
> > Note that _multiple_ threads, each managing private TCP circuits, can
> all be shutting down simultaneously and that the lock is released
> briefly between removing the circuit from the list, and decrementing
> the iiuExistenceCount.
> 
> Thanks for the explanation.
> 
> But - as I said in my mail - I perfectly understand bug #541362 and see
> why a counter fixes the issue that appears when using a simple flag.
> 
> My point is: my fix did not change the code in cac::destroyIIU at all.
> I
> also did not switch back to looping over the circuitList.count()
> instead
> of looping over iiuExistenceCount.
> My fix just switched from gradually increasing iiuExistenceCount
> whenever a new circuit gets added to the list, towards setting
> iiuExistenceCount to circuitList.count() under the lock before shutting
> down the circuits.
> As (in your fix) iiuExistenceCount++ happens immediately after adding
> the circuit to the circuitList(), I still think these two versions are
> equivalent.
> 
> Sorry about insisting here ... This is really not about trying to
> always
> be right, I just want to understand why I'm wrong.
> 
> >   Furthermore, at the same instant in time it is possible that a new
> TCP circuits might also be created.
> 
> How?
> The only way a new circuit gets created is after an incoming result of
> the name resolution process. The UDP task has been shut down earlier in
> the cac dtor. Who is creating the new circuit?
> 
> >   So setting the count to the number of items on the list can
> seriously scramble the proper state of the iiuExistenceCount even
> before the cac dtor runs.
> 
> iiuExistenceCount is not used outside the cac dtor. I am setting it
> inside the dtor with the lock being held. I don't see how it can be
> scrambled after that, and a variable being scrambled before it is set
> is
> not relevant.
> 
> >   And of course if that happens then we can have a crash because the
> cac dtor might not wait for all of the threads managing the circuits to
> shutdown, or alternatively hang forever waiting for phantom circuits to
> shutdown.
> 
> Using your fix, if circuits are added after the unlinkAllChannels block
> (line ~290), they will be waited for in the line ~307 loop (as
> iiuExistenceCount is increased), but unlinkAllChannels has never been
> called for them, so the loop will hang forever.
> Afaik this behavior has never been observed, which I would take as a
> hint that shutting down the UDP task first reliably keeps new circuits
> from being created at that time.
> 
> (This has changed with the introduction of TCP name requests - but
> that's the other mail thread.)
> 
> Cheers,
> Ralph
> 
> >> -----Original Message-----
> >> From: Ralph Lange [mailto:[email protected]]
> >> Sent: Tuesday, August 17, 2010 10:08 AM
> >> To: Jeff Hill
> >> Cc: EPICS Core Talk
> >> Subject: Re: Fixed: Segfaults in 3.14 branch since merging ca-over-
> tcp
> >>
> >>    Hi Jeff,
> >>
> >> thanks for the fix!
> >>
> >>
> >> I'm not sure about that race condition, though.
> >>
> >> In my fix:
> >> Setting iiuExistenceCount to the circuitList.count() (line ~290)
> >> happens
> >> in the cac::~cac destructor while the lock is being held,
> immediately
> >> followed by calling unlinkAllChannels() on all circuits in that
> list.
> >>
> >> The only place where the lock is given up is one block down (line
> >> ~307),
> >> when waiting for the tcp threads to exit.
> >>
> >> What race condition are you seeing? I looked at the 11786 change and
> >> bug#541362. I perfectly see why a counter fixes the issue that
> appears
> >> when using a simple flag.
> >> But don't see where my fix would break the fixed behavior, as it is
> >> still using a countre, decrementing the iiuExistenceCount is still
> >> happening at the end of cac::destroyIIU(), and incrementing
> >> iiuExistenceCount (original fix) happens exactly when it is added to
> >> the
> >> list, so that (my fix) using circuitList.count() to set
> >> iiuExistenceCount yields the same result.
> >>
> >> I'm perfectly fine with using the original (increment) fix, but I
> would
> >> like to understand where the difference is.
> >>
> >>
> >> I think I also see another new race condition introduced by ca-over-
> >> tcp,
> >> which probably breaks the shutdown behavior for clients using
> >> ca-over-tcp, and needs to be fixed.
> >>
> >> When the lock is given up in cac::~cac() and the loop (line ~307)
> waits
> >> for termination of all TCP tasks, an incoming name-res reply from a
> >> still active TCP circuit (the UDP task has exited by that time in
> the
> >> cac::~cac destructor) could open a new circuit (tcpiiu.c #2188),
> which
> >> would then be added to the circuitList (and iiuExistenceCount be
> >> incremented). For that new circuit, that the loop would be waiting
> for,
> >> unlinkAllChannels() is not being called, so it does not have a
> reason
> >> to
> >> terminate its threads.
> >>
> >> Maybe we still need a flag: to avoid creating new circuits while the
> >> cac:~cac destructor runs?!
> >>
> >> What do you think?
> >>
> >>
> >> Thanks a lot,
> >> Ralph
> >>
> >>
> >> On 16.08.2010 19:46, Jeff Hill wrote:
> >>>> + this->iiuExistenceCount = this->circuitList.count();
> >>> Examining this change a bit closer I see that it will introduce a
> >> race condition when circuits are being created and destroyed at
> close
> >> to the same instant in time. See revision 11786 of cac.cpp which
> fixes
> >> mantis 334 if you are interested in what this code does. After
> running
> >> the regression tests, I pushed in a fix (which is removing above
> >> mentioned change and restoring the increment of iiuExistenceCount in
> >> cac::findOrCreateVirtCircuit.
> >>> Jeff
> >>>
> >>>> -----Original Message-----
> >>>> From: Ralph Lange [mailto:[email protected]]
> >>>> Sent: Monday, August 16, 2010 10:03 AM
> >>>> To: Andrew Johnson
> >>>> Cc: Jeff Hill; EPICS Core Talk
> >>>> Subject: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp
> >>>>
> >>>>     Hi Andrew,
> >>>>
> >>>> it looks as I have found the reason for the segfaults, which I
> think
> >> is
> >>>> a bug in CAC.
> >>>>
> >>>> In the cac::~cac() destructor, a private member
> >>>> "this->iiuExistenceCount" is used to determine if cac should wait
> >> for
> >>>> TCP threads to shutdown. While this counter gets correctly
> decreased
> >> in
> >>>> cac::destroyIIU(), it never gets increased or set (except
> >> initialized to
> >>>> 0 in the cpp constructor).
> >>>>
> >>>> So cac::~cac() was never waiting for the TCP threads to shutdown,
> >> and
> >>>> happily kept tearing down the timer/mutex infrastructure. If the
> >> threads
> >>>> were not fast enough, they were hitting on destroyed semaphores
> when
> >>>> cancelling their watchdog timers - that was causing the exceptions
> >> and
> >>>> segfaults.
> >>>>
> >>>> This one-line fix will take care of the problem:
> >>>>
> >>>>
> >>>> === modified file 'src/ca/cac.cpp'
> >>>> --- src/ca/cac.cpp      2010-04-15 21:06:16 +0000
> >>>> +++ src/ca/cac.cpp      2010-08-16 15:24:13 +0000
> >>>> @@ -287,6 +287,7 @@
> >>>>                 //
> >>>>                 // shutdown all tcp circuits
> >>>>                 //
> >>>> +            this->iiuExistenceCount = this->circuitList.count();
> >>>>                 tsDLIter<   tcpiiu>   iter = this-
> >>> circuitList.firstIter ();
> >>>>                 while ( iter.valid() ) {
> >>>>                     // this causes a clean shutdown to occur
> >>>>
> >>>>
> >>>> Maybe it actually takes care of several shutdown related issues...
> >>>>
> >>>> Yay!
> >>>> Ralph



Replies:
Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
References:
Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Jeff Hill
Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Jeff Hill
Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange

Navigate by Date:
Prev: Re: Bazaar commit messages: style guide Andrew Johnson
Next: Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
Index: 2002  2003  2004  2005  2006  2007  2008  2009  <20102011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024 
Navigate by Thread:
Prev: Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
Next: Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
Index: 2002  2003  2004  2005  2006  2007  2008  2009  <20102011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024 
ANJ, 02 Feb 2012 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· Search · EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·