Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Race conditions in SNL programs
From: Benjamin Franksen <benjamin.franksen@bessy.de>
To: tech-talk@aps.anl.gov, core-talk@aps.anl.gov
Date: Fri, 21 May 2010 00:53:04 +0200

Hi Everyone,

Sorry for a somewhat lengthy post. Executive summary: many if not most SNL 
programs in existence suffer from random misbehaviour due to race 
conditions between read access from the program and write access from 
monitor callbacks. Actual misbehaviour occurs very rarely for typical 
programs, making it hard to reproduce, but the bug can be easily verified 
with an artificial test program. Affected are all programs that use 
monitored variables of array or string type or any scalar numeric type to 
which read access in C is not atomic (which depends on the target 
platform), and for which a coincidence between monitor event and read 
access is not (typically accidentally) excluded by timing constraints like 
auspicious record scan rates. I don't know of any release of the sequencer 
that is not affected. Possible solutions are either ugly and hard to use 
(at least for array and string variables) or incompatible.

Long story:

Today, working here and there on the new sequencer manual, it occurred to me 
to ask myself: what happens if my SNL program contains an assigned and 
monitored variable, which it reads (for instance to check a state change 
condition) and at the same time a monitor event happens that updates this 
variable. More specifically I asked myself: what mechanism does the 
sequencer use to prevent the read access from the program on the one hand, 
and the write access from the CA monitor callback on the other hand, to be 
interleaved? Surely, I said to myself, there must be some code somewhere in 
the sequencer run-time library that somehow achieves this. So I went 
looking.

To my great astonishment and dismay I found: nothing. I had a hard time 
believing this. It would mean that each time I write seemingly harmless 
code like

  double x;
  assign x to "whatever";
  monitor x;
  ...
  ss xxx {
    state bla {
      when(x > 5.0) {
        ..
      } state next
    }
    ...
  }

then my program is completely, utterly broken! Because there is a small 
(but, in general, non-zero) chance that a monitor event arrives precisely at 
the time the condition is tested, so -- assuming that read access to a double 
is not guaranteed to be atomic -- parts of the read and write access may be 
interleaved, resulting in a corrupt value read from the variable. Which 
means that once in a while the condition might fire without the value of 
the PV actually being > 5.0, while at other times it might indeed be > 5.0 
without any state change happening. Similar considerations apply to 
programs that access such a variable from within the transition action 
statements.

Certainly such a grave error would have been detected during the last 
several years (or maybe decades)?

I told my colleague Götz Pfeiffer, then, who couldn't believe it either. 
Together we searched through the code again and again, considered various 
scenarios, etc, but in the end we came again to the conclusion that there 
is nothing to prevent this from happening.

I wrote a test then, to provoke such an error, using a string variable, 
assuming that it would be easier to get such an interleaving to happen, and 
also assuming that with a string I could be certain that read access is 
non-atomic. And indeed: After some trial and error we finally managed to 
arrive at a simple program that proves that such an interleaving can happen 
and that in this case the value read by the program is indeed inconsistent, 
i.e. "half-way" overwritten by the monitor callback.
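
To make the "half-way overwritten" case concrete, here is a minimal sketch in 
plain C. It is not the actual test program, and it does not rely on real 
timing: the callback's copy is split into two steps by hand so that the 
program's read deterministically lands in between. The names pv_value, 
monitor_callback_part and read_during_update are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_STRING_SIZE 40  /* size EPICS uses for string values */

/* Stand-in for the SNL program's monitored string variable. */
static char pv_value[MAX_STRING_SIZE] = "AAAAAAAA";

/* Hypothetical monitor callback, split into steps so that a read can be
 * placed between them. The real callback does a single memcpy, but
 * preemption can still interrupt that copy part-way through. */
static void monitor_callback_part(const char *new_value, size_t from, size_t to)
{
    assert(from <= to && to <= MAX_STRING_SIZE);
    memcpy(pv_value + from, new_value + from, to - from);
}

/* Copies into `seen` what the program observes if its read happens
 * while the callback has written only the first four bytes. */
static void read_during_update(char *seen)
{
    static const char new_value[MAX_STRING_SIZE] = "BBBBBBBB";

    monitor_callback_part(new_value, 0, 4);               /* first part written */
    memcpy(seen, pv_value, MAX_STRING_SIZE);              /* program reads here */
    monitor_callback_part(new_value, 4, MAX_STRING_SIZE); /* rest written */
}
```

After read_during_update() the program's copy is a mixture of the old and 
the new value, although the PV itself never held that mixture.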

This bug must have been in the sequencer for a long, long time, probably 
right from the beginning. And strangely, nobody ever complained. I think 
the reason is that such an error actually happening is *very* unlikely for 
the kinds of programs we typically write. One reason is that the data is 
copied in the callback function using memcpy which is very fast, so the 
probability that interleaving happens is quite low. It usually takes a 
*long* time for normal SNL programs to exhibit an error due to this race 
condition, thus it is almost impossible to reproduce, so everybody thinks 
it must have been something else (noise, cosmic rays, take your pick).

                        * * * * * * * * *

Now, what can we do about it?

This is not as easy as one might like to think. First, it is not enough to 
lock out monitors when checking the when-conditions, as read access to 
monitored variables might as well (and often does) happen inside the action 
statements. Of course, one could lock out monitors during execution of the 
transition actions, too. Note, however, that an SNL program can contain 
more than one state set, each running independently in its own task. The 
more state sets you have in your program, the more often monitors will 
become blocked. It is not clear to me how this will scale, or whether it 
might even lead to starvation (of the CA task that executes the callbacks). 
Note that at the moment there is only one CA client context for the whole 
program, and thus only one task that issues callbacks. That is, only one 
callback can be active at any time, thus blocking one blocks all other 
pending events, too.

All this suggests that locking might not be such a good idea. There are 
alternatives: one could devise a scheme in which the program optimistically 
reads the variables, then checks whether a callback intervened, and, if it 
did, repeats the read. This is similar to how things are done in Software 
Transactional Memory. It has the advantage that callbacks can proceed 
without any delay, and repeating the read is limited to variables that have 
actually been written to.
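
The optimistic scheme could be sketched roughly as follows, using a sequence 
counter per variable that is odd while a callback is writing. This is only 
an illustration under assumptions: struct guarded_var and the var_* functions 
are invented here and are not part of the sequencer, there is exactly one 
writer (the single CA callback task), and a production version would need 
more care, because copying non-atomic data concurrently with a writer is 
formally a data race in C11.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define VALUE_SIZE 40

/* Hypothetical per-variable bookkeeping; not part of the sequencer. */
struct guarded_var {
    atomic_uint seq;            /* odd while a callback is writing */
    char value[VALUE_SIZE];
};

/* Writer side, called from the (single) monitor callback task. */
static void var_write_begin(struct guarded_var *v)
{
    assert((atomic_load(&v->seq) & 1u) == 0);  /* single-writer assumption */
    atomic_fetch_add(&v->seq, 1);
}

static void var_write_end(struct guarded_var *v)
{
    atomic_fetch_add(&v->seq, 1);
}

static void var_write(struct guarded_var *v, const char *new_value)
{
    var_write_begin(v);
    memcpy(v->value, new_value, VALUE_SIZE);
    var_write_end(v);
}

/* One optimistic attempt: returns 1 and fills `out` if no write intervened. */
static int var_try_read(struct guarded_var *v, char *out)
{
    unsigned before = atomic_load(&v->seq);
    if (before & 1u)
        return 0;                       /* a write is in progress */
    memcpy(out, v->value, VALUE_SIZE);
    return atomic_load(&v->seq) == before;
}

/* Reader side: repeat until a consistent copy has been obtained. */
static void var_read(struct guarded_var *v, char *out)
{
    while (!var_try_read(v, out))
        ;                               /* retry; the callback never waits */
}
```

The callback is never blocked by a reader; the cost is that a reader may 
have to repeat its copy, and only for variables that were actually written.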

Another problem is that (currently) access to the variables from the SNL 
program is directly compiled to an equivalent variable access in C. For 
scalar (numeric) variables, it would be possible to change this (in the 
compiler) and wrap each read access into a C function call (or macro) that 
locks out monitors. BUT: this does not work for strings and arrays, because 
they are not read en bloc in C, but rather (necessarily) element by 
element. There is no syntactically apparent place where lock/unlock pairs 
should be inserted by the compiler.

One possible solution would be to offer the SNL programmer a pair of 
primitive pvLock/pvUnlock functions, with the attached warning that code in 
between pvLock and pvUnlock should be as short as possible and must never 
call anything that might block, notably not any synchronous variant of 
pvGet or pvPut (even for unrelated variables). (The warning would be less 
severe with an optimistic scheme as sketched above.)
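
As a rough sketch of what the pessimistic variant might look like in C: 
pvLock and pvUnlock do not exist yet, and the single global pthread mutex 
below merely stands in for whatever the run-time library would use (an 
epicsMutex, presumably). The array waveform and the functions 
waveform_callback and sum_waveform are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define NELEM 4

/* One lock for all monitored variables (an assumption of this sketch). */
static pthread_mutex_t pv_mutex = PTHREAD_MUTEX_INITIALIZER;
static double waveform[NELEM];      /* stand-in for a monitored array */

/* Proposed primitives; hypothetical, not an existing sequencer API. */
static void pvLock(void)
{
    int rc = pthread_mutex_lock(&pv_mutex);
    assert(rc == 0);
    (void)rc;
}

static void pvUnlock(void)
{
    int rc = pthread_mutex_unlock(&pv_mutex);
    assert(rc == 0);
    (void)rc;
}

/* Hypothetical monitor callback: updates the array under the same lock. */
static void waveform_callback(const double *new_value)
{
    pvLock();
    memcpy(waveform, new_value, sizeof waveform);
    pvUnlock();
}

/* Program side: element-by-element read, kept short and non-blocking. */
static double sum_waveform(void)
{
    double sum = 0.0;
    pvLock();
    for (int i = 0; i < NELEM; i++)
        sum += waveform[i];
    pvUnlock();
    return sum;
}
```

Everything between pvLock() and pvUnlock() delays the callback task, which 
is why the critical section here does nothing but read the elements.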

Another idea is to offer some pvCopy function that atomically creates a copy 
of its argument variable. How to best create/store these for strings and 
arrays is unclear to me.
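
One conceivable shape for pvCopy, sidestepping the storage question by 
letting the caller supply the destination buffer, is sketched below. Again 
this is an assumption, not a design: pvCopy, copy_mutex, status_pv and 
status_callback are hypothetical names, and the pthread mutex stands in for 
whatever locking the run-time library would really use.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_STRING_SIZE 40  /* size EPICS uses for string values */

static pthread_mutex_t copy_mutex = PTHREAD_MUTEX_INITIALIZER;
static char status_pv[MAX_STRING_SIZE];   /* stand-in monitored string */

/* Hypothetical callback: writes under the same mutex that pvCopy takes. */
static void status_callback(const char *new_value)
{
    pthread_mutex_lock(&copy_mutex);
    memcpy(status_pv, new_value, MAX_STRING_SIZE);
    pthread_mutex_unlock(&copy_mutex);
}

/* Hypothetical pvCopy: snapshot `size` bytes of `var` into `dest`
 * while no callback can be writing to it. */
static void pvCopy(void *dest, const void *var, size_t size)
{
    assert(dest != NULL && var != NULL);
    pthread_mutex_lock(&copy_mutex);
    memcpy(dest, var, size);
    pthread_mutex_unlock(&copy_mutex);
}
```

The program would then work only on its private copy, for example 
pvCopy(local, status_pv, sizeof local), and never touch status_pv directly.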

Note that none of these are programmer friendly APIs, but at least there 
would be *some* way to prevent random misbehaviour of SNL programs, whereas 
at the moment there is none (apart from sprinkling your program with 
intLock/intUnlock calls, which is not portable, and I doubt that similar 
calls are available to user level programs on platforms like Linux or 
Windows).

I have thought a lot about more elegant ways to solve the problem, but none 
of the ideas I had were backwards compatible. To solve the immediate 
problem, we need to offer the SNL programmer some way to ensure atomicity 
of a read operation even for monitored strings and array vars. I urge 
everyone to offer any suggestions you might have!

For me, releasing a sequencer version 2.1 without fixing this problem is 
completely out of the question.

In the long term I think we will inevitably have to devise a more abstract 
API where such errors are prevented from happening by design. If I ever 
complete such a design (and manage to implement it) it will carry a new 
major release number, as old programs will no longer compile without 
significant changes.

Cheers
Ben

