Hi Jeff,
This e-mail is really a follow-up to this thread from a year ago: http://www.aps.anl.gov/epics/tech-talk/2012/msg00584.php . (Alas, I can't check this link because the APS web site seems to be poorly this morning.)
Back then I was seeing signs that CA subscription callbacks were being called after returning from ca_clear_subscription ... in this e-mail I have what looks like a definitive demonstration!
In the attached test IOC I repeatedly create 500 subscriptions to 500 locally published PVs, pause a few hundred microseconds, and then proceed to tear them all down again. The context pointer I pass (args.usr) just contains a validity flag which I reset after ca_clear_subscription returns -- and which I test in the callback.
Below is a typical run:
$ ./test 10 500
dbLoadDatabase("dbd/TEST.dbd", NULL, NULL)
TEST_registerRecordDeviceDriver(pdbbase)
dbLoadRecords("db/TEST.db", NULL)
iocInit()
Starting iocInit
############################################################################
## EPICS R3.14.11 $R3-14-11$ $2009/08/28 18:47:36$
## EPICS Base built Nov 4 2011
############################################################################
iocRun: All initialization complete
All channels connected
Testing 10 cycles, interval 500 us
[........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................whoops!
][
The two arguments to `test` are number of times to try and how long to pause between create and clear (in microseconds, passed to usleep(3)). [ and ] are printed at the start and end of a cycle (so [ is immediately followed by a burst of ca_create_subscription() calls) and each . represents a successful callback. An unsuccessful (invalid) callback is shown by 'whoops!' which is followed by an exit() call.
This test can be very delicate and difficult to reproduce, and may need to be run many times with slightly different pause intervals before being even partially repeatable -- the fault only appears to show when there isn't time for all 500 PVs to complete their initial updates, but there has to be enough time for them all to make the effort.
Another interesting detail follows from some locking I'm doing. Here is an extract of the relevant code (LOCK() is just pthread_mutex_lock(3p) on a global mutex):
 1  static void on_update(struct event_handler_args args)
 2  {
 3      struct event *event = args.usr;
 4      LOCK();
 5      bool valid = event->valid;
 6      UNLOCK();
 7      if (valid) ...
 8  }
    ...
 9  LOCK();    // This should trigger deadlock
10  ca_clear_subscription(event->event_id);
11  event->valid = false;
12  UNLOCK();
It seems to me that if ca_clear_subscription() is correctly doing what we discussed a year ago, which is to say, if it is waiting for all outstanding callbacks to complete before returning, then the LOCK() on line 9 should trigger a deadlock when ca_clear_subscription() is called with its associated callback still only on line 3 (or earlier). But I never see my test deadlock.
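To make that reasoning concrete, here is a minimal sketch using plain pthreads only, with no CA calls. The names callback_thread and callback_blocks_while_clearing are hypothetical stand-ins, not part of the test IOC. It uses pthread_mutex_trylock(3p) so that the sketch reports the would-be block instead of actually hanging: a callback arriving while the clearing thread holds the lock cannot get past line 4, so a ca_clear_subscription() that waited for that callback could never return.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the global mutex behind LOCK()/UNLOCK(). */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for on_update() reaching line 4: instead of
 * blocking in LOCK(), record whether it would have blocked. */
static void *callback_thread(void *arg)
{
    bool *would_block = arg;
    int rc = pthread_mutex_trylock(&lock);
    *would_block = (rc == EBUSY);
    if (rc == 0)
        pthread_mutex_unlock(&lock);
    return NULL;
}

/* Hypothetical stand-in for lines 9 to 12: hold the lock while a
 * callback is in flight, and report whether that callback was stuck. */
bool callback_blocks_while_clearing(void)
{
    bool would_block = false;
    pthread_t thread;

    pthread_mutex_lock(&lock);                      /* line 9 */
    /* The callback runs while we are "inside" the clear call. */
    int rc = pthread_create(&thread, NULL, callback_thread, &would_block);
    assert(rc == 0);
    pthread_join(thread, NULL);
    pthread_mutex_unlock(&lock);                    /* line 12 */
    return would_block;
}
```

In this sketch the callback is always stuck at the lock, so a clear call that truly waited for it would deadlock. That the real test never deadlocks is what suggests the wait is not happening.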
I'm seeing this problem occur on test code which is repeatedly creating and destroying subscriptions, but I've previously reported this on CA client shutdown, so it does look to me like there is a general synchronisation problem here. I believe I have a workaround, which is to delay releasing the callback context to give time for outstanding callbacks to complete, but this is a bit worrisome...
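One possible shape for such a workaround is sketched below. This is an assumption about how delayed release might be arranged, not the code in the attached IOC: the functions event_retire, event_is_valid and event_reap, the retired list, and the tick-based grace period are all hypothetical. Contexts are marked invalid once ca_clear_subscription() has returned, but are only freed after a grace period, so a late callback still finds live memory and sees valid == false.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-subscription context, modelled on struct event. */
struct event {
    bool valid;             /* tested by the callback under the lock */
    long retired_at;        /* tick at which the context was retired */
    struct event *next;     /* link in the retired list */
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct event *retired;   /* contexts waiting out the grace period */

/* Call after ca_clear_subscription() returns: invalidate, do not free. */
void event_retire(struct event *event, long now)
{
    assert(event != NULL);
    pthread_mutex_lock(&lock);
    event->valid = false;
    event->retired_at = now;
    event->next = retired;
    retired = event;
    pthread_mutex_unlock(&lock);
}

/* What a callback would check: a late callback sees false and returns. */
bool event_is_valid(struct event *event)
{
    pthread_mutex_lock(&lock);
    bool valid = event->valid;
    pthread_mutex_unlock(&lock);
    return valid;
}

/* Free contexts retired at least `grace` ticks ago; returns count freed. */
int event_reap(long now, long grace)
{
    int freed = 0;
    pthread_mutex_lock(&lock);
    struct event **link = &retired;
    while (*link) {
        struct event *event = *link;
        if (now - event->retired_at >= grace) {
            *link = event->next;    /* unlink, then release */
            free(event);
            freed++;
        } else {
            link = &event->next;
        }
    }
    pthread_mutex_unlock(&lock);
    return freed;
}
```

The weakness is the one noted above: the grace period is only a guess at how long an outstanding callback can remain in flight, so this narrows the window without closing it.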
Attachment:
test-ioc.tgz
Description: test-ioc.tgz