Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: Race conditions in SNL programs
From: Carl Lionberger <calionberger@lbl.gov>
To: Benjamin Franksen <benjamin.franksen@bessy.de>
Cc: core-talk@aps.anl.gov, tech-talk@aps.anl.gov
Date: Thu, 20 May 2010 16:20:24 -0700

Hi Ben,

I vote for having the pvLock()/pvUnlock(), if only as a stopgap sort of thing.  That would make it possible to work around this until an elegant solution is arrived at.

Carl

On Thu, May 20, 2010 at 3:53 PM, Benjamin Franksen <benjamin.franksen@bessy.de> wrote:
Hi Everyone

sorry for a somewhat lengthy post. Executive summary: many if not most SNL
programs in existence suffer from random misbehaviour due to race
conditions between read access from the program and write access from
monitor callbacks. Actual misbehaviour occurs very rarely for typical
programs, making it hard to reproduce, but the bug can be easily verified
with an artificial test program. Affected are all programs that use
monitored variables of array or string type or any scalar numeric type to
which read access in C is not atomic (which depends on the target
platform), and for which a coincidence between monitor event and read
access is not (typically accidentally) excluded by timing constraints like
auspicious record scan rates. I don't know of any release of the sequencer
that is not affected. Possible solutions are either ugly and hard to use
(at least for array and string variables) or incompatible.

Long story:

Today, working here and there on the new sequencer manual, it occurred to me
to ask myself: what happens if my SNL program contains an assigned and
monitored variable, which it reads (for instance to check a state change
condition) and at the same time a monitor event happens that updates this
variable? More specifically, I asked myself: what mechanism does the
sequencer use to prevent the read access from the program on the one hand,
and the write access from the CA monitor callback on the other hand, from
being interleaved? Surely, I said to myself, there must be some code somewhere in
the sequencer run-time library that somehow achieves this. So I went
looking.

To my great astonishment and dismay I found: nothing. I had a hard time
believing this. It would mean that each time I write seemingly harmless
code like

 double x;
 assign x to "whatever";
 monitor x;
 ...
 ss xxx {
   state bla {
     when(x > 5.0) {
       ..
     } state next
   }
   ...
 }

then my program is completely, utterly broken! Because there is a small
(but, in general, non-zero) chance that a monitor happens precisely at the
time the condition is tested, so -- assuming that read access to a double
is not guaranteed to be atomic -- parts of the read and write access may be
interleaved, resulting in a corrupt value read from the variable. Which
means that once in a while the condition might fire without the value of
the PV actually being > 5.0, while at other times it might indeed be > 5.0
without any state change happening. Similar considerations apply to
programs that access such a variable from within the transition action
statements.

Certainly such a grave error would have been detected during the last
several years (or maybe 10s of years)?

I told my colleague Götz Pfeiffer, then, who couldn't believe it either.
Together we searched through the code again and again, considered various
scenarios, etc, but in the end we came again to the conclusion that there
is nothing to prevent this from happening.

I wrote a test then, to provoke such an error, using a string variable,
assuming that it would be easier to get such an interleaving to happen, and
also assuming that with a string I could be certain that read access is
non-atomic. And indeed: After some trial and error we finally managed to
arrive at a simple program that proves that such an interleaving can happen
and that in this case the value read by the program is indeed inconsistent,
i.e. "half-way" overwritten by the monitor callback.
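
The effect described above can be reproduced deterministically without any threads. The following is a minimal sketch in plain C, not the actual test program mentioned in this post: it simulates a callback that has copied only part of the new string when the program takes its snapshot. The function name `torn_read`, the buffer size and the test values are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 40  /* assumed buffer size, chosen only for illustration */

/* Simulates a monitor callback that is interrupted after copying only the
 * first `split` bytes of `new_val`; the program's read lands in the gap. */
void torn_read(const char *old_val, const char *new_val,
               size_t split, char seen[BUF_SIZE])
{
    char shared[BUF_SIZE] = {0};

    assert(strlen(old_val) < BUF_SIZE);
    assert(split <= strlen(new_val) && strlen(new_val) < BUF_SIZE);

    strcpy(shared, old_val);          /* value before the monitor event   */
    memcpy(shared, new_val, split);   /* callback: first part of the write */
    memcpy(seen, shared, BUF_SIZE);   /* program: reads half-updated data  */
    strcpy(shared, new_val);          /* callback: write completes         */
}
```

Called with an old value of "AAAAAAAA", a new value of "BBBBBBBB" and a split of 4, the snapshot in `seen` is "BBBBAAAA": a string that the PV never actually held.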

This bug must have been in the sequencer for a long, long time, probably
right from the beginning. And strangely, nobody ever complained. I think
the reason is that such an error actually happening is *very* unlikely for
the kinds of programs we typically write. One reason is that the data is
copied in the callback function using memcpy which is very fast, so the
probability that interleaving happens is quite low. It usually takes a
*long* time for normal SNL programs to exhibit an error due to this race
condition, thus it is almost impossible to reproduce, so everybody thinks
it must have been something else (noise, cosmic rays, take your pick).

                       * * * * * * * * *

Now, what can we do about it?

This is not as easy as one might like to think. First, it is not enough to
lock out monitors when checking the when-conditions, as read access to
monitored variables might as well (and often does) happen inside the action
statements. Of course, one could lock out monitors during execution of the
transition actions, too. Note, however, that an SNL program can contain
more than one state set, each running independently in its own task. The
more state sets you have in your program, the more often monitors will
become blocked. It is not clear to me how this will scale, or whether it
might even lead to starvation (of the CA task that executes the callbacks).
Note that at the moment there is only one CA client context for the whole
program, and thus only one task that issues callbacks. That is, only one
callback can be active at any time, thus blocking one blocks all other
pending events, too.

All this suggests that locking might not be such a good idea. There are
alternatives: one could devise a scheme in which the program optimistically
reads the variables, then checks whether a callback intervened, and if it
did repeat the read. This is similar to how things are done in Software
Transactional Memory. It has the advantage that callbacks can proceed
without any delay, and repeating the read is limited to variables that have
actually been written to.
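
One way to picture the optimistic scheme is a sequence counter that the writer bumps before and after each update. The following is only a sketch under stated assumptions, not sequencer code: all names (`guarded_var`, `guarded_try_read`, ...) are hypothetical, there is exactly one writer (matching the single CA callback task described above), and the data copy itself is a plain memcpy, which a strictly conforming C11 program would have to replace with atomic accesses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define VAL_SIZE 40  /* assumed value size, for illustration only */

typedef struct {
    atomic_uint seq;       /* even: value stable, odd: write in progress */
    char value[VAL_SIZE];
} guarded_var;

void guarded_init(guarded_var *v, const char *initial)
{
    assert(initial != NULL);
    atomic_init(&v->seq, 0);
    memset(v->value, 0, VAL_SIZE);
    strncpy(v->value, initial, VAL_SIZE - 1);
}

/* Writer side (the monitor callback): never waits for a reader. */
void guarded_write_begin(guarded_var *v)
{
    unsigned s = atomic_load_explicit(&v->seq, memory_order_relaxed);
    atomic_store_explicit(&v->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

void guarded_write_end(guarded_var *v)
{
    unsigned s = atomic_load_explicit(&v->seq, memory_order_relaxed);
    atomic_store_explicit(&v->seq, s + 1, memory_order_release);
}

void guarded_write(guarded_var *v, const char *new_value)
{
    guarded_write_begin(v);
    strncpy(v->value, new_value, VAL_SIZE - 1);  /* last byte stays 0 */
    guarded_write_end(v);
}

/* Reader side (the program): returns false if a write intervened. */
bool guarded_try_read(guarded_var *v, char out[VAL_SIZE])
{
    unsigned before = atomic_load_explicit(&v->seq, memory_order_acquire);
    if (before & 1u)
        return false;                    /* write currently in progress */
    memcpy(out, v->value, VAL_SIZE);
    atomic_thread_fence(memory_order_acquire);
    unsigned after = atomic_load_explicit(&v->seq, memory_order_relaxed);
    return before == after;              /* unchanged: copy is consistent */
}

void guarded_read(guarded_var *v, char out[VAL_SIZE])
{
    while (!guarded_try_read(v, out)) {
        /* a callback intervened: repeat the read */
    }
}
```

The callback side never blocks here; the cost is moved to the reader, which only repeats the copy when the counter shows that a write overlapped with it.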

Another problem is that (currently) access to the variables from the SNL
program is directly compiled to an equivalent variable access in C. For
scalar (numeric) variables, it would be possible to change this (in the
compiler) and wrap each read access into a C function call (or macro) that
locks out monitors. BUT: this does not work for strings and arrays, because
they are not read en bloc in C, but rather (necessarily) element by
element. There is no syntactically apparent place where lock/unlock pairs
should be inserted by the compiler.

One possible solution would be to offer the SNL programmer a pair of
primitive pvLock/pvUnlock functions, with the attached warning that code in
between pvLock and pvUnlock should be as short as possible and must never
call anything that might block, notably not any synchronous variant of
pvGet or pvPut (even for unrelated variables). (The warning would be less
severe with an optimistic scheme as sketched above.)
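
To make the proposal concrete, here is a rough sketch of how such a pair could behave, written in plain C with a POSIX mutex. It is purely hypothetical: pvLock/pvUnlock do not exist in the sequencer at this point, and the names `monitor_update`, `waveform_sum` and the array size are invented for the example. The program reads an array element by element, but the whole loop sits inside one short critical section.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define NELEM 4  /* assumed array length, for illustration only */

static pthread_mutex_t monitor_lock = PTHREAD_MUTEX_INITIALIZER;
static double waveform[NELEM];   /* stands in for a monitored array */

/* Hypothetical primitives as proposed above; not an existing API. */
void pvLock(void)
{
    int rc = pthread_mutex_lock(&monitor_lock);
    assert(rc == 0);
    (void)rc;
}

void pvUnlock(void)
{
    int rc = pthread_mutex_unlock(&monitor_lock);
    assert(rc == 0);
    (void)rc;
}

/* What the monitor callback would do: update under the same lock. */
void monitor_update(const double *fresh)
{
    pvLock();
    memcpy(waveform, fresh, sizeof waveform);
    pvUnlock();
}

/* Program side: short, non-blocking work only between lock and unlock. */
double waveform_sum(void)
{
    double sum = 0.0;

    pvLock();
    for (size_t i = 0; i < NELEM; i++)
        sum += waveform[i];
    pvUnlock();
    return sum;
}
```

Because the callback takes the same lock, anything slow or blocking between pvLock and pvUnlock would stall every pending monitor, which is the reason for the warning above.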

Another idea is to offer some pvCopy function that atomically creates a copy
of its argument variable. How to best create/store these for strings and
arrays is unclear to me.
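
One conceivable shape for such a copy function, sketched in plain C, is to let the caller supply the destination buffer, so that the library does not have to decide where copies live. This is an assumption for illustration, not a proposal from this post: the names `shared_string` and `pv_copy_string` and the fixed size are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define STR_SIZE 40  /* assumed string size, for illustration only */

typedef struct {
    pthread_mutex_t lock;
    char value[STR_SIZE];
} shared_string;

void shared_string_init(shared_string *s, const char *initial)
{
    int rc = pthread_mutex_init(&s->lock, NULL);
    assert(rc == 0);
    (void)rc;
    memset(s->value, 0, STR_SIZE);
    strncpy(s->value, initial, STR_SIZE - 1);
}

/* Writer side: what a monitor callback would do. */
void shared_string_set(shared_string *s, const char *new_value)
{
    pthread_mutex_lock(&s->lock);
    strncpy(s->value, new_value, STR_SIZE - 1);  /* last byte stays 0 */
    pthread_mutex_unlock(&s->lock);
}

/* Hypothetical copy primitive: the snapshot is taken in one locked step
 * into caller-owned storage and can then be read at leisure. */
void pv_copy_string(shared_string *s, char out[STR_SIZE])
{
    pthread_mutex_lock(&s->lock);
    memcpy(out, s->value, STR_SIZE);
    pthread_mutex_unlock(&s->lock);
}
```

After the call, the program works only on its private copy, so later monitor events cannot change the data it is looking at.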

Note that none of these are programmer friendly APIs, but at least there
would be *some* way to prevent random misbehaviour of SNL programs, whereas
at the moment there is none (apart from sprinkling your program with
intLock/intUnlock calls, which is not portable, and I doubt that similar
calls are available to user level programs on platforms like Linux or
Windows).

I have thought a lot about more elegant ways to solve the problem, but none
of the ideas I had were backwards compatible. To solve the immediate
problem, we need to offer the SNL programmer some way to ensure atomicity
of a read operation even for monitored strings and array vars. I urge
everyone to offer any suggestions you might have!

For me, releasing a sequencer version 2.1 without fixing this problem is
completely out of the question.

In the long term I think we will inevitably have to devise a more abstract API
where such errors are prevented from happening by design. If I ever
complete such a design (and manage to implement it) it will carry a new
major release number, as old programs will no longer compile without
significant changes.

Cheers
Ben




--
Carl Lionberger
Software Developer
LBNL Engineering Division
510 684 7503

References:
Race conditions in SNL programs Benjamin Franksen
