EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

2002  2003  2004  2005  2006  2007  2008  2009  <20102011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024  Index 2002  2003  2004  2005  2006  2007  2008  2009  <20102011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024 
<== Date ==> <== Thread ==>

Subject: RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp
From: "Jeff Hill" <[email protected]>
To: "'Ralph Lange'" <[email protected]>
Cc: "'EPICS Core Talk'" <[email protected]>
Date: Tue, 17 Aug 2010 19:00:19 -0600
oops, didn’t answer all of the message

> I think I also see another new race condition introduced by ca-over-
> tcp, which probably breaks the shutdown behavior for clients using
> ca-over-tcp, and needs to be fixed.
> 
> When the lock is given up in cac::~cac() and the loop (line ~307) waits
> for termination of all TCP tasks, an incoming name-res reply from a
> still active TCP circuit (the UDP task has exited by that time in the
> cac::~cac destructor) could open a new circuit (tcpiiu.c #2188), which
> would then be added to the circuitList (and iiuExistenceCount be
> incremented). For that new circuit, that the loop would be waiting for,
> unlinkAllChannels() is not being called, so it does not have a reason
> to terminate its threads.
> 
> Maybe we still need a flag: to avoid creating new circuits while the
> cac:~cac destructor runs?!

Looks like a potential issue.

option 1) 

You might insert "this->chanTable.removeAll ()" after line 295 in ~cac 
which will cause cac::transferChanToVirtCircuit to noop return at line 
583. You would need to supply a temporary (auto) tsSLList<T> target to 
removeAll but could safely ignore (discard) what gets placed there because 
cleanup responsibilities for the channels lies with the client applications. 
The library unlinks the channels from itself when it exits, but does not 
destroy them. The biggest downside will be some runtime overhead emptying 
large hash tables for many channels. Something like this.

// protected by a Guard ....
{
	tsSLList < tcpiiu > dummy;
	this-> serverTable.removeAll ( dummy );
}

option 2)

Just add a bool flag to cac as you propose, set it just after line 295 
in ~cac, and check it at the beginning of cac::transferChanToVirtCircuit. 
It would also need to be initialized in the cac ctor. Of course the set
and test will need to be protected by a Guard. The downside is that 
this requires slightly more memory at runtime, but this is no-doubt 
insignificant. Perhaps the only downside is needing to remember the
purpose of yet another binary flag a few years down the road. 

Jeff
______________________________________________________
Jeffrey O. Hill           Email        [email protected]
LANL MS H820              Voice        505 665 1831
Los Alamos NM 87545 USA   FAX          505 665 5107

Message content: TSPA


> -----Original Message-----
> From: Ralph Lange [mailto:[email protected]]
> Sent: Tuesday, August 17, 2010 10:08 AM
> To: Jeff Hill
> Cc: EPICS Core Talk
> Subject: Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp
> 
>   Hi Jeff,
> 
> thanks for the fix!
> 
> 
> I'm not sure about that race condition, though.
> 
> In my fix:
> Setting iiuExistenceCount to the circuitList.count() (line ~290)
> happens
> in the cac::~cac destructor while the lock is being held, immediately
> followed by calling unlinkAllChannels() on all circuits in that list.
> 
> The only place where the lock is given up is one block down (line
> ~307),
> when waiting for the tcp threads to exit.
> 
> What race condition are you seeing? I looked at the 11786 change and
> bug#541362. I perfectly see why a counter fixes the issue that appears
> when using a simple flag.
> But don't see where my fix would break the fixed behavior, as it is
> still using a countre, decrementing the iiuExistenceCount is still
> happening at the end of cac::destroyIIU(), and incrementing
> iiuExistenceCount (original fix) happens exactly when it is added to
> the
> list, so that (my fix) using circuitList.count() to set
> iiuExistenceCount yields the same result.
> 
> I'm perfectly fine with using the original (increment) fix, but I would
> like to understand where the difference is.
> 
> 
> I think I also see another new race condition introduced by ca-over-
> tcp,
> which probably breaks the shutdown behavior for clients using
> ca-over-tcp, and needs to be fixed.
> 
> When the lock is given up in cac::~cac() and the loop (line ~307) waits
> for termination of all TCP tasks, an incoming name-res reply from a
> still active TCP circuit (the UDP task has exited by that time in the
> cac::~cac destructor) could open a new circuit (tcpiiu.c #2188), which
> would then be added to the circuitList (and iiuExistenceCount be
> incremented). For that new circuit, that the loop would be waiting for,
> unlinkAllChannels() is not being called, so it does not have a reason
> to
> terminate its threads.
> 
> Maybe we still need a flag: to avoid creating new circuits while the
> cac:~cac destructor runs?!
> 
> What do you think?
> 
> 
> Thanks a lot,
> Ralph
> 
> 
> On 16.08.2010 19:46, Jeff Hill wrote:
> >> + this->iiuExistenceCount = this->circuitList.count();
> > Examining this change a bit closer I see that it will introduce a
> race condition when circuits are being created and destroyed at close
> to the same instant in time. See revision 11786 of cac.cpp which fixes
> mantis 334 if you are interested in what this code does. After running
> the regression tests, I pushed in a fix (which is removing above
> mentioned change and restoring the increment of iiuExistenceCount in
> cac::findOrCreateVirtCircuit.
> >
> > Jeff
> >
> >> -----Original Message-----
> >> From: Ralph Lange [mailto:[email protected]]
> >> Sent: Monday, August 16, 2010 10:03 AM
> >> To: Andrew Johnson
> >> Cc: Jeff Hill; EPICS Core Talk
> >> Subject: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp
> >>
> >>    Hi Andrew,
> >>
> >> it looks as I have found the reason for the segfaults, which I think
> is
> >> a bug in CAC.
> >>
> >> In the cac::~cac() destructor, a private member
> >> "this->iiuExistenceCount" is used to determine if cac should wait
> for
> >> TCP threads to shutdown. While this counter gets correctly decreased
> in
> >> cac::destroyIIU(), it never gets increased or set (except
> initialized to
> >> 0 in the cpp constructor).
> >>
> >> So cac::~cac() was never waiting for the TCP threads to shutdown,
> and
> >> happily kept tearing down the timer/mutex infrastructure. If the
> threads
> >> were not fast enough, they were hitting on destroyed semaphores when
> >> cancelling their watchdog timers - that was causing the exceptions
> and
> >> segfaults.
> >>
> >> This one-line fix will take care of the problem:
> >>
> >>
> >> === modified file 'src/ca/cac.cpp'
> >> --- src/ca/cac.cpp      2010-04-15 21:06:16 +0000
> >> +++ src/ca/cac.cpp      2010-08-16 15:24:13 +0000
> >> @@ -287,6 +287,7 @@
> >>                //
> >>                // shutdown all tcp circuits
> >>                //
> >> +            this->iiuExistenceCount = this->circuitList.count();
> >>                tsDLIter<  tcpiiu>  iter = this-
> >circuitList.firstIter ();
> >>                while ( iter.valid() ) {
> >>                    // this causes a clean shutdown to occur
> >>
> >>
> >> Maybe it actually takes care of several shutdown related issues...
> >>
> >> Yay!
> >> Ralph



Replies:
Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
References:
Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Jeff Hill
Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange

Navigate by Date:
Prev: RE: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Jeff Hill
Next: Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
Index: 2002  2003  2004  2005  2006  2007  2008  2009  <20102011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024 
Navigate by Thread:
Prev: Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
Next: Re: Fixed: Segfaults in 3.14 branch since merging ca-over-tcp Ralph Lange
Index: 2002  2003  2004  2005  2006  2007  2008  2009  <20102011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024 
ANJ, 02 Feb 2012 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· Search · EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·