Hi Michael,
Sorry about the late response; this has obviously fallen through the cracks. It might have helped to make an entry in the bug tracking system. Nevertheless, I apologize; I have been busy and have not been as responsive as I should be.
After changing "#!/bin/sh" to "#!/bin/bash" in make_test_db I can reproduce your issue.
My first comment is that this would not occur in a CA client application running in the default "non-preemptive callback" mode. Nevertheless, this is certainly a serious bug that needs to be fixed.
This is bug entry 1179642.
I have some ideas about a fix. I will post a summary, hopefully later today.
Jeff
> -----Original Message-----
> From: [email protected] [mailto:tech-talk-
> [email protected]] On Behalf Of [email protected]
> Sent: Monday, May 13, 2013 12:21 AM
> To: [email protected]
> Subject: RE: CA subscription synchronisation shutdown problem
>
> I'd like to resend the message below. I would be grateful if someone could
> please try to reproduce the bug using the attached test.
>
> I'd also like to point out that this bug is not as trivial and contrived as may
> appear -- any client application which closes camonitor subscriptions is liable
> to the synchronisation error described here and may thus suffer a
> segmentation fault or any other misbehaviour as a result.
>
>
> > This e-mail is really a follow up to this thread from a year ago:
> > http://www.aps.anl.gov/epics/tech-talk/2012/msg00584.php . (Alas, I
> > can't check this link because the APS web site seems to be poorly this
> > morning.)
> >
> > Back then I was seeing signs that CA subscription callbacks were being
> > called after returning from ca_clear_subscription ... in this e-mail I
> > have what looks like a definitive demonstration!
> >
> > In the attached test IOC I repeatedly create 500 subscriptions to 500
> > locally published PVs, pause a few hundred microseconds, and then
> > proceed to tear them all down again. The context pointer I pass
> > (args.usr) just contains a validity flag which I reset after
> > ca_clear_subscription returns -- and which I test in the callback.
> >
> > Below is a typical run:
> >
> > $ ./test 10 500
> > dbLoadDatabase("dbd/TEST.dbd", NULL, NULL)
> > TEST_registerRecordDeviceDriver(pdbbase)
> > dbLoadRecords("db/TEST.db", NULL)
> > iocInit()
> > Starting iocInit
> >
> > ############################################################################
> > ## EPICS R3.14.11 $R3-14-11$ $2009/08/28 18:47:36$
> > ## EPICS Base built Nov 4 2011
> >
> > ############################################################################
> > iocRun: All initialization complete
> > All channels connected
> > Testing 10 cycles, interval 500 us
> > [......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > ...............................................................whoops!
> > ][
> >
> >
> > The two arguments to `test` are number of times to try and how long to
> > pause between create and clear (in microseconds, passed to usleep(3)).
> > [ and ] are printed at the start and end of a cycle (so [ is
> > immediately followed by a burst of ca_create_subscription() calls) and
> > each . represents a successful callback. An unsuccessful (invalid)
> > callback is shown by 'whoops!' which is followed by an exit() call.
> >
> > This test can be very delicate and difficult to reproduce, and may need
> > to be run many times with slightly different pause intervals before
> > being even partially repeatable -- the fault only appears to show when
> > there isn't time for all 500 PVs to complete their initial updates, but
> > there has to be enough time for them all to make the effort.
> >
> > Another interesting detail follows from some locking I'm doing. Here
> > is an extract of the relevant code (LOCK() is just
> > pthread_mutex_lock(3p) on a global mutex):
> >
> > 1 static void on_update(struct event_handler_args args)
> > 2 {
> > 3 struct event *event = args.usr;
> > 4 LOCK();
> > 5 bool valid = event->valid;
> > 6 UNLOCK();
> > 7 if (valid) ...
> > 8 }
> >
> > ...
> >
> > 9 LOCK(); // This should trigger deadlock
> > 10 ca_clear_subscription(event->event_id);
> > 11 event->valid = false;
> > 12 UNLOCK();
> >
> > It seems to me that if ca_clear_subscription() is correctly doing what
> > we discussed a year ago, which is to say, if it is waiting for all
> > outstanding callbacks to complete before returning, then the LOCK() on
> > line 9 should trigger a deadlock when ca_clear_subscription() is called
> > with its associated callback still only on line 3 (or earlier). But I
> > never see my test deadlock.
> >
> > I'm seeing this problem occur on test code which is repeatedly creating
> > and destroying subscriptions, but I've previously reported this on CA
> > client shutdown, so it does look to me like there is a general
> > synchronisation problem here. I believe I have a workaround, which is
> > to delay releasing the callback context to give time for outstanding
> > callbacks to complete, but this is a bit worrisome...
>
>
>
> --
>
> This e-mail and any attachments may contain confidential, copyright and or
> privileged material, and are for the use of the intended addressee only. If you
> are not the intended addressee or an authorised recipient of the addressee
> please notify us of receipt by returning the e-mail and do not use, copy,
> retain, distribute or disclose the information in or attached to the e-mail.
>
> Any opinions expressed within this e-mail are those of the individual and not
> necessarily of Diamond Light Source Ltd.
>
> Diamond Light Source Ltd. cannot guarantee that this e-mail or any
> attachments are free from viruses and we cannot accept liability for any
> damage which you may sustain as a result of software viruses which may be
> transmitted in or with the message.
>
> Diamond Light Source Limited (company no. 4375679). Registered in England
> and Wales with its registered office at Diamond House, Harwell Science and
> Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom
>
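
For readers following the thread, the behaviour the quoted message expects from ca_clear_subscription() (no callback runs once it has returned) can be sketched without EPICS at all. What follows is only a minimal illustration using plain pthreads, and it is not the CA library's implementation or the fix mentioned above. Every name in it (struct sub, sub_dispatch, sub_clear, run_demo) is hypothetical and chosen for this example; only the validity-flag idea is taken from the quoted test.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a subscription; NOT an EPICS type. */
struct sub {
    pthread_mutex_t lock;
    pthread_cond_t idle;   /* signalled when in_flight drops to zero */
    int in_flight;         /* callbacks currently executing */
    bool cleared;          /* set once a clear has been requested */
};

/* Same idea as the validity flag in the quoted test. */
struct event { bool valid; int good; int bad; };

static void on_update(void *usr)
{
    struct event *ev = usr;
    if (ev->valid) ev->good++; else ev->bad++;   /* bad == "whoops!" */
}

/* Dispatcher side: run the callback unless the subscription is cleared. */
static bool sub_dispatch(struct sub *s, void (*cb)(void *), void *usr)
{
    pthread_mutex_lock(&s->lock);
    if (s->cleared) {
        pthread_mutex_unlock(&s->lock);
        return false;
    }
    s->in_flight++;
    pthread_mutex_unlock(&s->lock);

    cb(usr);                       /* user code runs without our lock held */

    pthread_mutex_lock(&s->lock);
    if (--s->in_flight == 0)
        pthread_cond_broadcast(&s->idle);
    pthread_mutex_unlock(&s->lock);
    return true;
}

/* Client side: block new callbacks, then wait for outstanding ones. */
static void sub_clear(struct sub *s)
{
    pthread_mutex_lock(&s->lock);
    s->cleared = true;
    while (s->in_flight > 0)
        pthread_cond_wait(&s->idle, &s->lock);
    pthread_mutex_unlock(&s->lock);
}

struct worker_args { struct sub *s; struct event *ev; int n; };

static void *worker(void *p)
{
    struct worker_args *w = p;
    for (int i = 0; i < w->n; i++)
        if (!sub_dispatch(w->s, on_update, w->ev))
            break;
    return NULL;
}

/* Returns the number of callbacks that saw an invalid context. */
int run_demo(int n)
{
    struct sub s = { .in_flight = 0, .cleared = false };
    struct event ev = { true, 0, 0 };
    struct worker_args w = { &s, &ev, n };
    pthread_t t;

    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.idle, NULL);
    int rc = pthread_create(&t, NULL, worker, &w);
    assert(rc == 0);
    (void)rc;

    sub_clear(&s);        /* after this returns, no callback may run */
    ev.valid = false;     /* so invalidating the context is safe */

    pthread_join(t, NULL);
    pthread_cond_destroy(&s.idle);
    pthread_mutex_destroy(&s.lock);
    return ev.bad;
}
```

With a wait of this kind in place, run_demo() should return 0 however the two threads interleave. Calling sub_clear() while holding a mutex that the callback also takes would block forever if a callback were in flight at that moment, which is the deadlock that line 9 of the quoted test was expecting to see.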