Experimental Physics and
Industrial Control System

Kay-Uwe Kasemir <[email protected]> · Thu, 07 Oct 2004 17:51:43 -0400

Greetings!
We experienced quite some trouble at the SNS as a result of a
save/restore-type program writing NaN = nan = not-a-number
values to process variables all across the linac.
There are actually several binary bit patterns for floating
point values that are considered "NaN".
Accidentally writing all 0xFF into a double could result
in NaN, but in our case it was the "official" IEEE NaN
pattern that was sent to the PVs.
The root cause has hopefully been fixed:
That save/restore program, meant to take snapshots of the machine state
similar to "burt" et al., logged "NaN" for PVs to which it couldn't
connect for various reasons and later simply wrote those "NaN"s back
when restoring the snapshot file.
Problems arose because of the way NaN is handled by C/C++:
   NaN + xxx  ->  NaN (same for -, *, /)
   NaN < xxx  -> false
   NaN > xxx  -> false
   NaN == xxx -> false
   NaN != xxx -> true
   (long)NaN == 0x80000000
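
To make the comparison rules above concrete, here is a minimal sketch in C. It assumes a C99 compiler that provides the NAN macro in <math.h> and follows IEEE 754 comparison semantics; the helper names are made up for illustration. The integer cast is left out on purpose, since the C standard does not define the result of converting NaN to an integer and the value may differ between platforms.

```c
#include <math.h>   /* NAN macro (C99) */

/* Hypothetical helpers: each returns 1 if the comparison is true, else 0. */
int nan_less_than(double x)    { return NAN < x; }    /* always 0 */
int nan_greater_than(double x) { return NAN > x; }    /* always 0 */
int nan_equals(double x)       { return NAN == x; }   /* always 0 */
int nan_differs(double x)      { return NAN != x; }   /* always 1 */

/* Arithmetic with NaN yields NaN, and NaN is never equal to itself. */
int nan_sum_is_nan(double x)
{
    double sum = NAN + x;
    return sum != sum;
}
```

Note that the ordering and equality tests return false even when x is itself NaN, which is why a simple range check cannot reject it.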
Other languages might be similar. In any case, these were
the symptoms on the affected EPICS IOCs:
On analog records, all checks against HOPR, LOPR, DRVH, DRVL
fail for NaN.
-> the value is accepted even though you might think
   you've locked "bad" values out.
The test abs(NaN - previous value)>MDEL  fails,
so no ChannelAccess monitors are sent,
hence you don't see that the records are at NaN
in operator displays or archives.
When NaN ends up in an analog record's RVAL or
is written to other non-floating-point records,
you get a huge or tiny number (depending on whether you
use it signed or unsigned),
which in the case of the AO record sneaked past
the DRVH/DRVL check and can now cause all sorts
of interesting results when passed to the unsuspecting
hardware.
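
The limit and deadband symptoms above can be reproduced with a small sketch. This is not the actual EPICS record support code; the two functions below are hypothetical stand-ins that only mimic the shape of a DRVH/DRVL clamp and an MDEL test, assuming IEEE 754 comparisons and a C99 <math.h> for the NAN macro.

```c
#include <math.h>   /* NAN macro (C99), used by callers of these helpers */

/* Hypothetical drive-limit clamp in the style of a DRVH/DRVL check. */
double clamp_to_drive_limits(double value, double drvl, double drvh)
{
    if (value > drvh) return drvh;   /* false when value is NaN */
    if (value < drvl) return drvl;   /* false when value is NaN */
    return value;                    /* NaN falls through unchanged */
}

/* Hypothetical monitor deadband test in the style of MDEL.
 * Returns 1 if a monitor would be posted, else 0. */
int exceeds_deadband(double value, double previous, double mdel)
{
    double delta = value - previous;     /* NaN if either operand is NaN */
    if (delta < 0.0) delta = -delta;     /* manual abs; skipped for NaN */
    return delta > mdel;                 /* false when delta is NaN */
}
```

With limits of 0 and 10, clamp_to_drive_limits(20.0, 0.0, 10.0) gives 10, but passing NAN returns NaN untouched, and exceeds_deadband(NAN, 1.0, 0.1) reports that nothing changed.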
At the SNS, we have differing opinions on how to react.
For sure the cause, our save/restore program, gets fixed.
Is it worthwhile to prevent the symptoms from happening again?
Should Channel Access and records check for NaN and
block it?
Or should a user be free to send NaN if that's what he/she
really wants, except MDEL/HOPR/LOPR/DRVH/DRVL ought to
handle it?
This brings up practical issues:
Is this a lot of work?
Who volunteers?
How do you catch NaN? Unlike most C/C++ compilers
and standard C/C++ libraries,
the one we use for vxWorks/Tornado 2.0.2 does not define "isnan()".
Shen Peng suggested using something like this:
#define isnan(x)    ((x)!=(x))
Who knows if this works in all cases?
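
As a partial answer, here is a sketch of the same self-comparison test written as a function rather than a macro. It relies on the compiler honoring IEEE 754 comparisons; optimization flags that let the compiler assume NaN never occurs (GCC's -ffast-math, for example) can legally reduce the test to "always false". The name value_is_nan is hypothetical, chosen so that it cannot clash with a real isnan() where one exists, and a function also avoids evaluating its argument twice the way the macro does.

```c
#include <math.h>   /* NAN and INFINITY macros (C99), used by callers */

/* Hypothetical helper: returns 1 if x is NaN, else 0.
 * Assumes IEEE 754 semantics, where NaN is the only value
 * that compares unequal to itself. */
int value_is_nan(double x)
{
    return x != x;
}
```

A record or driver could then reject a write with something like "if (value_is_nan(new_value)) { /* refuse */ }" before any limit checks are applied.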
The bottom line: Avoid NaN.
- SNS controls group
