Experimental Physics and
Industrial Control System

"Jeff Hill" <[email protected]> · Wed, 26 Apr 2006 14:49:47 -0600

Kay,

I believe that these issues are fundamental. How we, the EPICS developers,
address them has a very strong impact on a quality implementation.

Recall that a guard class is a class that takes the mutex lock in its
constructor and releases the mutex lock in its destructor. Placing a
reference to a guard class in the arguments to a member function establishes
a requirement that the lock must be held when calling the member function
(this member function can verify that the guard is using the correct lock
instance of course).

You may have noticed that there were guard classes in recent CA interfaces
that I have proposed. A reviewer might naively conclude that guard classes
(all issues related to locking) are low level details that must always
remain hidden from consumers of the class. Closer inspection of the
situation reveals that __not__ including the guard class in the public
interface can be quite problematic. Here are some reasons that I can think
of at the moment.

1) User code and library code are forced always to use different locks. If
guard classes are in the public interfaces the user can assign a lock to the
object created with the library. The user is in control of locking
granularity. For example, if we have a private timer queue in the CA client
why should it be using a different lock than the rest of the CA client
library? In the timer expiration callback we need to hold the CA client
library's lock. It seems natural in that situation for the lock used to
protect the timer queue to be the same lock that is used to protect the
client library. The client library should choose what locking granularity is
needed. Too many locks can make the design more vulnerable to deadlocks. The
designer/integrator should choose.

2) If guard classes are not in public interfaces we must be very careful to
not create hidden lock hierarchy inversions in a callback based system. We
must, for example, release the library's lock before calling the callback.
Otherwise if a guard class is in the callback interface it is well
documented that a (user assigned) lock is held when in the callback. Nothing
is hidden.

3) When calling a batch of callbacks the library is required to have an
unlock/lock pair around every call to the callback. This is very
inefficient. For example with a timer library several timers might expire at
the same time. We can construct the guard once and pass it to a batch of
timer expiration callbacks.

4) When the user is invoking requests in the library we must have a
lock/unlock pair in every request stub. In contrast, when there is a guard
class in the interface, we can construct a guard once and reuse it over
multiple requests. For example we could lock once, perform 10000 ca gets,
and then unlock.

5) If we have an object oriented systems with many fine grained classes it
is simply too much overhead to embed a lock in every instance of every
class.

6) We can construct the interface necessary for users that do not (should
not) know about multi-threading on top of the guard based interface, but not
visa-versa.

> Depending on the CA implementation, another channel connection 
> callback might also arrive.

Even though you might enable preemptive callback, the client library will
allow only one callback to be active at a time.

> That of course creates a deadlock:
> 1) locks Engine, then CA library
> 2) locks CA library, then Engine

The R3.14 CA client library does not hold any lock when calling its
callbacks which might be locked as the result of calling any of the
functions in its API.

> The other way round, your CA library has to lock the channel
> information, since otherwise a third thread could delete the 
> channel while it's in the middle of the connection callback.

When the user deletes a channel the library first disables future callbacks
through the channel, and then waits until any callback in progress is
completed. The lock required to synchronize this is not held when calling
the callback. Furthermore, not deviating from the CA client library's lock
hierarchy design protects the library from deadlocks in this situation.

> I don't see how dropping locks helps in here.
> Does this require locks with timeouts,
> and then the user code has to deal with the errors?

With a proper lock hierarchy design, and no deviation from it within the
implementation, mutex lock requests that specify timeouts are not required.
Nevertheless, the robustness that the timeouts provide would be, from my
perspective, desirable should the implementation fail to be conformant with
the design. The problem is, with a C based implementation, all of the
spaghetti code required to recover from potential timeouts. Adding
significant amounts of extra code of course has many negative consequences.
However, with a C++ or Java implementation we can throw an exception when
the lock request times out, and with a proper design recover without writing
reams of extra code. That approach (throwing an exception when the lock
request times out) probably should be considered.

In summary: 

1) I believe that placing a guard classes in the public CA client interfaces
is a better way to deal with the types of trouble that you describe when
compared with what I have currently implemented in R3.14.

2) Nevertheless, within the R3.14 CA client library implementation we _are_
releasing any lock, that might be taken as a consequence of the user's
calling the public CA client API, before calling the user's callback. We are
therefore protected in that situation from hidden lock hierarchy inversion.
This should go a long way to eliminate the type of trouble that you
described.

Jeff

> -----Original Message-----
> From: Kay-Uwe Kasemir [mailto:[email protected]]
> Sent: Wednesday, April 26, 2006 1:01 PM
> To: Jeff Hill
> Subject: General lock question
> 
> Hi Jeff:
> 
> I'm working on the archive engine (again), and I'm checking for
> deadlocks
> in there, too.
> I think a future database with online add/delete has the same issues.
> 
> As for your comment
> > O Recent convention in EPICS base is to __not__ hold a lock when
> > calling
> > callbacks.
> 
> Some things that need to happen:
> 1) Engine (or database) startup:
>       foreach channel:
>          start channel
> 2) CA sends connection callback:
>       increment engine's "# of connected channels" count.
> 
> Ideally, this is allowed at runtime, so the Engine's HTTP daemon is
> running
> (respectively IOC shell),
> providing the user with buttons to start/stop the engine, add/delete
> channels...
> 
> To prevent changes to the engine's channel list during startup(),
> step 1) would lock the engine, then invoke ca_create_channel etc.
> 
> 2) would have to lock the engine while the count is modified,
> since otherwise some other HTTP thread might read it while being
> modified. Depending on the CA implementation, another channel
> connection callback might also arrive.
> 
> That of course creates a deadlock:
> 1) locks Engine, then CA library
> 2) locks CA library, then Engine
> 
> Since I cannot drop the CA library lock, I drop the Engine lock
> before invoking ca_create_channel etc.
> But that means somebody can use the HTTPD and delete the channel
> from the engine list while I'm in the middle of starting it.
> 
> The other way round, your CA library has to lock the channel
> information,
> since otherwise a third thread could delete the channel while
> it's in the middle of the connection callback.
> 
> I don't see how dropping locks helps in here.
> Does this require locks with timeouts,
> and then the user code has to deal with the errors?
> 
> -Kay

Experimental Physics and Industrial Control System

Experimental Physics and
Industrial Control System