Hi Michael
I think this could be a bug in EPICS base that has been fixed in 3.14.12.2,
see http://www.aps.anl.gov/epics/base/R3-14/12-docs/RELEASE_NOTES.html Section
"Changes between 3.14.12.1 and 3.14.12.2" which links to this
https://bugs.launchpad.net/epics-base/+bug/878372 entry in the bug tracker.
Cheers
Ben
On Monday, May 13, 2013 06:21:07 [email protected] wrote:
> I'd like to resend the message below. I would be grateful if someone could
> please try to reproduce the bug using the attached test.
>
> I'd also like to point out that this bug is not as trivial and contrived as
> may appear -- any client application which closes camonitor subscriptions
> is liable to the synchronisation error described here and may thus suffer a
> segmentation fault or any other misbehaviour as a result.
> > This e-mail is really a follow up to this thread from a year ago:
> > http://www.aps.anl.gov/epics/tech-talk/2012/msg00584.php . (Alas, I
> > can't check this link because the APS web site seems to be poorly this
> > morning.)
> >
> > Back then I was seeing signs that CA subscription callbacks were being
> > called after returning from ca_clear_subscription ... in this e-mail I
> > have what looks like a definitive demonstration!
> >
> > In the attached test IOC I repeatedly create 500 subscriptions to 500
> > locally published PVs, pause a few hundred microseconds, and then
> > proceed to tear them all down again. The context pointer I pass
> > (args.usr) just contains a validity flag which I reset after
> > ca_clear_subscription returns -- and which I test in the callback.
> >
> > Below is a typical run:
> >
> > $ ./test 10 500
> > dbLoadDatabase("dbd/TEST.dbd", NULL, NULL)
> > TEST_registerRecordDeviceDriver(pdbbase)
> > dbLoadRecords("db/TEST.db", NULL)
> > iocInit()
> > Starting iocInit
> > ############################################################################
> > ## EPICS R3.14.11 $R3-14-11$ $2009/08/28 18:47:36$
> > ## EPICS Base built Nov 4 2011
> > ############################################################################
> > iocRun: All initialization complete
> > All channels connected
> > Testing 10 cycles, interval 500 us
> > [......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > ...............................................................whoops!
> > ][
> >
> >
> > The two arguments to `test` are number of times to try and how long to
> > pause between create and clear (in microseconds, passed to usleep(3)).
> > [ and ] are printed at the start and end of a cycle (so [ is
> > immediately followed by a burst of ca_create_subscription() calls) and
> > each . represents a successful callback. An unsuccessful (invalid)
> > callback is shown by 'whoops!' which is followed by an exit() call.
> >
> > This test can be very delicate and difficult to reproduce, and may need
> > to be run many times with slightly different pause intervals before
> > being even partially repeatable -- the fault only appears to show when
> > there isn't time for all 500 PVs to complete their initial updates, but
> > there has to be enough time for them all to make the effort.
> >
> > Another interesting detail follows from some locking I'm doing. Here
> > is an extract of the relevant code (LOCK() is just
> > pthread_mutex_lock(3p) on a global mutex):
> >
> > 1 static void on_update(struct event_handler_args args)
> > 2 {
> > 3     struct event *event = args.usr;
> > 4     LOCK();
> > 5     bool valid = event->valid;
> > 6     UNLOCK();
> > 7     if (valid) ...
> > 8 }
> >
> > ...
> >
> > 9 LOCK(); // This should trigger deadlock
> > 10 ca_clear_subscription(event->event_id);
> > 11 event->valid = false;
> > 12 UNLOCK();
> >
> > It seems to me that if ca_clear_subscription() is correctly doing what
> > we discussed a year ago, which is to say, if it is waiting for all
> > outstanding callbacks to complete before returning, then the LOCK() on
> > line 9 should trigger a deadlock when ca_clear_subscription() is called
> > with its associated callback still only on line 3 (or earlier). But I
> > never see my test deadlock.
> >
> > I'm seeing this problem occur on test code which is repeatedly creating
> > and destroying subscriptions, but I've previously reported this on CA
> > client shutdown, so it does look to me like there is a general
> > synchronisation problem here. I believe I have a workaround, which is
> > to delay releasing the callback context to give time for outstanding
> > callbacks to complete, but this is a bit worrisome...
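
For illustration, the workaround mentioned at the end of the quoted message
(delaying release of the callback context) might look roughly like the sketch
below. This is not the code from the attached test: struct event is reduced to
what the quoted extract shows plus two hypothetical fields (updates, next), and
event_create(), event_on_update(), event_retire() and event_release_retired()
are made-up names. The CA calls themselves are left out; event_retire() is
assumed to be called after ca_clear_subscription() has returned, and how long
to wait before calling event_release_retired() is a heuristic the caller has
to choose.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-subscription context, standing in for args.usr. */
struct event {
    bool valid;          /* cleared once the subscription is torn down */
    int updates;         /* callbacks accepted while still valid */
    struct event *next;  /* link on the retired list */
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct event *retired;  /* contexts waiting for deferred release */

/* Allocate a context that is valid from the start. */
struct event *event_create(void)
{
    struct event *event = calloc(1, sizeof(*event));
    if (event)
        event->valid = true;
    return event;
}

/* Body of the subscription callback: accept the update only if the
 * context is still valid. Returns true if the update was accepted. */
bool event_on_update(struct event *event)
{
    pthread_mutex_lock(&lock);
    bool valid = event->valid;
    if (valid)
        event->updates++;
    pthread_mutex_unlock(&lock);
    return valid;
}

/* Assumed to run after ca_clear_subscription() has returned. The context
 * is marked invalid and parked rather than freed, so a late callback
 * still reads live memory and sees valid == false. */
void event_retire(struct event *event)
{
    pthread_mutex_lock(&lock);
    event->valid = false;
    event->next = retired;
    retired = event;
    pthread_mutex_unlock(&lock);
}

/* Free every parked context. Call only after a grace period judged long
 * enough for late callbacks to finish. Returns the number freed. */
size_t event_release_retired(void)
{
    pthread_mutex_lock(&lock);
    struct event *list = retired;
    retired = NULL;
    pthread_mutex_unlock(&lock);

    size_t count = 0;
    while (list) {
        struct event *next = list->next;
        free(list);
        list = next;
        count++;
    }
    return count;
}
```

Note that the lock is not held across the (omitted) ca_clear_subscription()
call here, unlike lines 9 to 12 of the quoted extract, so this sketch cannot
deadlock against the callback. It only reduces the window for a late callback
to touch freed memory; it does not close it.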