Hi,
I'm seeing a deadlock reported with the EPICS Java CAJ/JCA library. It is extremely rare, but the CAJChannel.destroy method will sometimes throw a CAException because the ReferenceCountingLock
cannot acquire a lock before the timeout (I believe 20 seconds). In my application there are many threads creating and destroying channels and adding and removing monitors concurrently on the same shared context.
It appears this is caused by inconsistent lock ordering as seen in the source of the create and destroy channel methods CAJContext.createChannel and CAJChannel.destroy (https://github.com/epics-base/jca/blob/master/src/core/com/cosylab/epics/caj/).
Consider the ReentrantLock held inside the ReferenceCountingLock managed by NamedLockPattern (let's call it A) and the
CAJChannel intrinsic synchronization lock (let's call it B). In the CAJContext.createChannel method, lock A is obtained first, then B. However, in the CAJChannel.destroy method the locks are obtained in the opposite order: B, then A. If one thread is attempting
to close a channel while another thread is attempting to create a channel of the same name, a deadlock may occur, and I believe this may be what I am seeing in the following stack trace:
java.io.IOException: Unable to close channel
at org.jlab.epics2web.epics.ChannelMonitor.close(ChannelMonitor.java:126)
at org.jlab.epics2web.epics.ChannelManager.removeFromChannels(ChannelManager.java:333)
at org.jlab.epics2web.epics.ChannelManager.removeListener(ChannelManager.java:353)
at org.jlab.epics2web.websocket.WebSocketSessionManager.removeClient(WebSocketSessionManager.java:169)
at org.jlab.epics2web.Application$1.run(Application.java:86)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: gov.aps.jca.CAException: Failed to obtain synchronization lock for 'R011PROT.F', possible deadlock.
at com.cosylab.epics.caj.CAJContext.destroyChannel(CAJContext.java:1017)
at com.cosylab.epics.caj.CAJChannel.destroy(CAJChannel.java:379)
at com.cosylab.epics.caj.CAJChannel.destroy(CAJChannel.java:366)
at org.jlab.epics2web.epics.ChannelMonitor.close(ChannelMonitor.java:124)
... 9 more
There appears to be an attempt to avoid this scenario, as CAJContext.createChannel checks for an existing channel twice: first while not holding any lock, and then again while holding lock A. This
reduces the opportunity for deadlock, but does not prevent it. In the rare case in which two or more threads are attempting to create the same channel, one must wait for lock A while the other is free to obtain lock B, create the channel, drop both locks, and then
immediately ask to destroy the channel and thus obtain lock B again. A thread context switch occurs. Now the second thread obtains lock A but cannot obtain lock B, so neither thread can continue. Eventually the lock A acquisition times out, thread one fails
to destroy the channel, and an exception is thrown. Thread two, which was waiting on the channel object synchronization lock, is finally able to obtain it and continue in the CAJChannel.addConnectionListenerAndFireIfConnected method.
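The interleaving described above can be reproduced in isolation. The following is only a minimal sketch, not CAJ code: the class and method names are hypothetical, a plain ReentrantLock stands in for lock A, a plain Object monitor stands in for lock B, and latches force the unlucky thread schedule that is rare in practice. The timeout is shortened from the (believed) 20 seconds to a caller-supplied value.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-alone demo; none of these names exist in CAJ/JCA.
public class LockOrderInversionDemo {

    /** Returns true if the "destroy" thread got lock A, false if it timed out. */
    static boolean destroyAcquiresLockA(long timeoutMillis) {
        final ReentrantLock lockA = new ReentrantLock(); // stand-in for the named lock (A)
        final Object lockB = new Object();               // stand-in for the channel monitor (B)
        final CountDownLatch destroyHoldsB = new CountDownLatch(1);
        final CountDownLatch createHoldsA = new CountDownLatch(1);
        final boolean[] acquired = new boolean[1];

        // Mimics the destroy path: B first, then A with a timeout.
        Thread destroy = new Thread(() -> {
            synchronized (lockB) {
                destroyHoldsB.countDown();
                try {
                    createHoldsA.await(); // force the bad interleaving
                    if (lockA.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                        acquired[0] = true;
                        lockA.unlock();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        // Mimics the create path: A first, then B.
        Thread create = new Thread(() -> {
            try {
                destroyHoldsB.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            lockA.lock();
            try {
                createHoldsA.countDown();
                synchronized (lockB) {
                    // Reached only after the destroy thread gives up and releases B.
                }
            } finally {
                lockA.unlock();
            }
        });

        destroy.start();
        create.start();
        try {
            destroy.join();
            create.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return acquired[0];
    }

    public static void main(String[] args) {
        System.out.println("destroy acquired A: " + destroyAcquiresLockA(200));
    }
}
```

In this sketch the destroy thread always times out, and the create thread proceeds only after the timeout, which matches the behavior seen in the stack trace.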
In the epics2web application this scenario (create and then immediately destroy) might occur if a user requests to view a screen and decides to close their web browser before the page fully loads.
Meanwhile another user is requesting the same screen.
I think the fix is to remove the "synchronized" keyword from the CAJChannel.destroy methods (both of them, one calls the other). The methods already ultimately call CAJContext.destroyChannel, which
obtains lock A and then calls CAJChannel.destroyChannel, which obtains lock B. This would make the lock acquisition ordering consistent between create and destroy methods. A workaround in the meantime is to not use the CAJChannel.destroy method and instead
call CAJContext.destroyChannel directly.
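To illustrate the idea behind the proposed fix, here is a minimal sketch in which both the create and destroy paths take lock A before lock B. Again, this is not CAJ code: the class, the methods, and the counter are hypothetical stand-ins, and the 5 second timeout is an arbitrary choice for the demo.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-alone demo; none of these names exist in CAJ/JCA.
public class ConsistentLockOrderDemo {
    private final ReentrantLock lockA = new ReentrantLock(); // stand-in for the named lock (A)
    private final Object lockB = new Object();               // stand-in for the channel monitor (B)
    private int openCount = 0;                               // guarded by lockB

    /** Runs the action holding A, then B. Returns false if A is not acquired in time. */
    private boolean withBothLocks(long timeoutMillis, Runnable action) {
        try {
            if (!lockA.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            synchronized (lockB) {
                action.run();
            }
            return true;
        } finally {
            lockA.unlock();
        }
    }

    // Both paths use the same A-then-B order.
    boolean create(long timeoutMillis) {
        return withBothLocks(timeoutMillis, () -> openCount++);
    }

    boolean destroy(long timeoutMillis) {
        return withBothLocks(timeoutMillis, () -> openCount--);
    }

    /** Hammers create/destroy from several threads and counts lock timeouts. */
    static int countTimeouts(int threads, int iterations) {
        ConsistentLockOrderDemo demo = new ConsistentLockOrderDemo();
        AtomicInteger timeouts = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    if (!demo.create(5000)) timeouts.incrementAndGet();
                    if (!demo.destroy(5000)) timeouts.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return timeouts.get();
    }

    public static void main(String[] args) {
        System.out.println("timeouts: " + countTimeouts(4, 1000));
    }
}
```

Because no thread ever holds B while waiting for A, the circular wait cannot form, so no acquisition should time out however the threads interleave.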
More info:
https://github.com/JeffersonLab/epics2web/issues/2
This is hard to verify, so could someone with more CAJ/JCA experience please take a look?
Ryan Slominski