
Subject: RE: CA subscription synchronisation shutdown problem
From: "Hill, Jeff" <[email protected]>
To: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
Date: Mon, 13 May 2013 19:26:51 +0000
Hi Michael,

Sorry about the late response; this has obviously fallen through the cracks. Maybe it would have helped to make an entry in the bug tracking system. Nevertheless, I apologize; I have been busy and have not been as responsive as I should be.

After changing "#!/bin/sh" to "#!/bin/bash" in make_test_db I am reproducing your issue.

My first comment is that this would not occur in a CA client application running in the default "non-preemptive callback" mode. Nevertheless, this is certainly a serious bug that needs to be fixed.

This is bug entry 1179642. 

I have some ideas about a fix. I will post a summary, hopefully later today.

Jeff

> -----Original Message-----
> From: [email protected] [mailto:tech-talk-
> [email protected]] On Behalf Of [email protected]
> Sent: Monday, May 13, 2013 12:21 AM
> To: [email protected]
> Subject: RE: CA subscription synchronisation shutdown problem
> 
> I'd like to resend the message below.  I would be grateful if someone could
> please try to reproduce the bug using the attached test.
> 
> I'd also like to point out that this bug is not as trivial and contrived as may
> appear -- any client application which closes camonitor subscriptions is liable
> to the synchronisation error described here and may thus suffer a
> segmentation fault or any other misbehaviour as a result.
> 
> 
> > This e-mail is really a follow up to this thread from a year ago:
> > http://www.aps.anl.gov/epics/tech-talk/2012/msg00584.php .  (Alas, I
> > can't check this link because the APS web site seems to be poorly this
> > morning.)
> >
> > Back then I was seeing signs that CA subscription callbacks were being
> > called after returning from ca_clear_subscription ... in this e-mail I
> > have what looks like a definitive demonstration!
> >
> > In the attached test IOC I repeatedly create 500 subscriptions to 500
> > locally published PVs, pause a few hundred microseconds, and then
> > proceed to tear them all down again.  The context pointer I pass
> > (args.usr) just contains a validity flag which I reset after
> > ca_clear_subscription returns -- and which I test in the callback.
> >
> > Below is a typical run:
> >
> > $ ./test 10 500
> > dbLoadDatabase("dbd/TEST.dbd", NULL, NULL)
> > TEST_registerRecordDeviceDriver(pdbbase)
> > dbLoadRecords("db/TEST.db", NULL)
> > iocInit()
> > Starting iocInit
> >
> > ############################################################################
> > ## EPICS R3.14.11 $R3-14-11$ $2009/08/28 18:47:36$
> > ## EPICS Base built Nov  4 2011
> >
> > ############################################################################
> > iocRun: All initialization complete
> > All channels connected
> > Testing 10 cycles, interval 500 us
> > [......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > ...............................................................whoops!
> > ][
> >
> >
> > The two arguments to `test` are number of times to try and how long to
> > pause between create and clear (in microseconds, passed to usleep(3)).
> > [ and ] are printed at the start and end of a cycle (so [ is
> > immediately followed by a burst of ca_create_subscription() calls) and
> > each . represents a successful callback.  An unsuccessful (invalid)
> > callback is shown by 'whoops!' which is followed by an exit() call.
> >
> > This test can be very delicate and difficult to reproduce, and may need
> > to be run many times with slightly different pause intervals before
> > being even partially repeatable -- the fault only appears to show when
> > there isn't time for all 500 PVs to complete their initial updates, but
> > there has to be enough time for them all to make the effort.
> >
> > Another interesting detail follows from some locking I'm doing.  Here
> > is an extract of the relevant code (LOCK() is just
> > pthread_mutex_lock(3p) on a global mutex):
> >
> > 1	static void on_update(struct event_handler_args args)
> > 2	{
> > 3	    struct event *event = args.usr;
> > 4	    LOCK();
> > 5	    bool valid = event->valid;
> > 6	    UNLOCK();
> > 7	    if (valid) ...
> > 8	}
> >
> > 	...
> >
> > 9	    LOCK();		// This should trigger deadlock
> > 10	    ca_clear_subscription(event->event_id);
> > 11	    event->valid = false;
> > 12	    UNLOCK();
> >
> > It seems to me that if ca_clear_subscription() is correctly doing what
> > we discussed a year ago, which is to say, if it is waiting for all
> > outstanding callbacks to complete before returning, then the LOCK() on
> > line 9 should trigger a deadlock when ca_clear_subscription() is called
> > with its associated callback still only on line 3 (or earlier).  But I
> > never see my test deadlock.
> >
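The expectation in the paragraph above can be modelled without Channel Access at all. The sketch below is plain pthreads, not CA code: `clear_and_wait` is a hypothetical stand-in for a ca_clear_subscription() that waits for in-flight callbacks, and it is given a timeout only so that the stall can be observed instead of hanging the program. `user_lock` plays the role of LOCK(); every other name is an assumption made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t user_lock = PTHREAD_MUTEX_INITIALIZER;  /* the LOCK() mutex */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER; /* hypothetical library state */
static pthread_cond_t state_cond = PTHREAD_COND_INITIALIZER;
static bool callback_started, callback_done;

/* Simulated callback: it has been entered (line 3) and now needs LOCK() (line 4). */
static void *callback_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&state_lock);
    callback_started = true;
    pthread_cond_broadcast(&state_cond);
    pthread_mutex_unlock(&state_lock);

    pthread_mutex_lock(&user_lock);     /* blocks while teardown holds the lock */
    pthread_mutex_unlock(&user_lock);

    pthread_mutex_lock(&state_lock);
    callback_done = true;
    pthread_cond_broadcast(&state_cond);
    pthread_mutex_unlock(&state_lock);
    return NULL;
}

/* Hypothetical "clear" that waits for the callback; true if it completed in time. */
static bool clear_and_wait(int timeout_ms)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&state_lock);
    while (!callback_done) {
        if (pthread_cond_timedwait(&state_cond, &state_lock, &deadline) == ETIMEDOUT)
            break;
    }
    bool drained = callback_done;
    pthread_mutex_unlock(&state_lock);
    return drained;
}

/* Returns true if teardown stalled, i.e. would deadlock without the timeout. */
bool teardown_stalls(bool hold_user_lock, int timeout_ms)
{
    pthread_t tid;

    pthread_mutex_lock(&state_lock);
    callback_started = callback_done = false;
    pthread_mutex_unlock(&state_lock);

    if (hold_user_lock)
        pthread_mutex_lock(&user_lock);          /* line 9 */

    int rc = pthread_create(&tid, NULL, callback_thread, NULL);
    assert(rc == 0);
    (void)rc;

    /* Make sure the callback is already in flight before clearing. */
    pthread_mutex_lock(&state_lock);
    while (!callback_started)
        pthread_cond_wait(&state_cond, &state_lock);
    pthread_mutex_unlock(&state_lock);

    bool drained = clear_and_wait(timeout_ms);   /* line 10 */

    if (hold_user_lock)
        pthread_mutex_unlock(&user_lock);        /* line 12 */
    pthread_join(tid, NULL);
    return !drained;
}
```

In this model, calling `teardown_stalls(true, 100)` reports a stall, because the callback cannot take the lock that the teardown is holding, while `teardown_stalls(false, 5000)` does not. A clear operation that never stalls with the lock held is therefore not waiting for in-flight callbacks, which is the inference drawn above.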
> > I'm seeing this problem occur on test code which is repeatedly creating
> > and destroying subscriptions, but I've previously reported this on CA
> > client shutdown, so it does look to me like there is a general
> > synchronisation problem here.  I believe I have a workaround, which is
> > to delay releasing the callback context to give time for outstanding
> > callbacks to complete, but this is a bit worrisome...
> 
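One way to picture the workaround described at the end of the quoted message is a deferred-release list: on teardown the context is marked invalid and parked instead of freed, so a late callback still finds live memory and simply ignores the update. This is only a sketch under stated assumptions. `struct event` here is a hypothetical stand-in for whatever args.usr points to, all function names are invented for illustration, and no CA calls are made; in a real client ca_clear_subscription() would be called just before `event_retire`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-subscription context (what args.usr would point to). */
struct event {
    bool valid;           /* reset once the subscription has been cleared */
    int accepted;         /* callbacks seen while the context was valid */
    int rejected;         /* late callbacks seen after teardown */
    struct event *next;   /* link on the deferred-release list */
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct event *retired = NULL;   /* contexts parked until safe to free */

struct event *event_create(void)
{
    struct event *event = calloc(1, sizeof *event);
    if (event)
        event->valid = true;
    return event;
}

/* Body of a subscription callback: act only on a context that is still valid. */
void event_on_update(struct event *event)
{
    assert(event != NULL);
    pthread_mutex_lock(&lock);
    if (event->valid)
        event->accepted++;
    else
        event->rejected++;   /* late callback: ignored, memory is still live */
    pthread_mutex_unlock(&lock);
}

/* Teardown: invalidate the context and park it instead of freeing it. */
void event_retire(struct event *event)
{
    assert(event != NULL);
    pthread_mutex_lock(&lock);
    event->valid = false;
    event->next = retired;
    retired = event;
    pthread_mutex_unlock(&lock);
}

/* Free parked contexts; call only when no callback can still be in flight.
 * Returns the number of contexts released. */
int event_release_retired(void)
{
    pthread_mutex_lock(&lock);
    struct event *list = retired;
    retired = NULL;
    pthread_mutex_unlock(&lock);

    int freed = 0;
    while (list) {
        struct event *next = list->next;
        free(list);
        list = next;
        freed++;
    }
    return freed;
}
```

The weak point is the same one the message worries about: the sketch cannot tell when it is safe to call `event_release_retired`, so the caller has to choose a delay, and that delay is a guess, not a guarantee.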
> 
> 


