EPICS Not-a-number (nan) issues

Experimental Physics and Industrial Control System


Subject:	Not-a-number (nan) issues
From:	Kay-Uwe Kasemir <[email protected]>
To:	EPICS Tech Talk <[email protected]>
Date:	Thu, 07 Oct 2004 17:51:43 -0400

Greetings!

We experienced quite some trouble at the SNS as a result of a
save/restore-type program writing NaN = nan = not-a-number
values to process variables all across the linac.

There are actually several binary bit patterns for floating
point values that are considered "NaN".
Accidentally writing all 0xFF into a double could result
in NaN, but in our case it was the "official" IEEE NaN
pattern that was sent to the PVs.
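To illustrate the point about multiple NaN bit patterns, here is a minimal sketch. It assumes IEEE 754 doubles and a C99 <math.h> that provides NAN and isnan(); the helper names are made up for this example.

```c
#include <math.h>
#include <string.h>

/* Build a double whose bytes are all 0xFF (the "accidental" case). */
static double all_ff_double(void)
{
    double d;
    memset(&d, 0xFF, sizeof d);
    return d;
}

/* Return the quiet NaN that the C library provides via the C99 NAN macro. */
static double library_nan(void)
{
    return (double)NAN;
}

/* Return 1 if the two doubles have different bit patterns, else 0. */
static int bits_differ(double a, double b)
{
    return memcmp(&a, &b, sizeof a) != 0;
}
```

On a typical IEEE 754 platform both helpers return a value for which isnan() is true, even though the two bit patterns are not the same.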

The root cause has hopefully been fixed: that save/restore program,
meant to take snapshots of the machine state similar to "burt" et al.,
logged "NaN" for PVs to which it couldn't connect for various reasons
and later simply wrote those "NaN"s back when restoring the snapshot file.

Problems arose because of the way NaN is handled by C/C++:
   NaN + xxx  ->  NaN (same for -, *, /)
   NaN < xxx  -> false
   NaN > xxx  -> false
   NaN == xxx -> false
   NaN != xxx -> true
   (long)NaN == 0x80000000 (as seen on our IOCs; the C standard
                            does not define the result of this cast)
Other languages might be similar; in any case, these were
the symptoms on the affected EPICS IOCs:

On analog records, all comparisons against HOPR, LOPR, DRVH, DRVL
evaluate to false for NaN, so none of the limit checks trigger.
-> the value is accepted even though you might think
   you've locked "bad" values out.
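The effect can be reproduced with a simplified limit check. This is only a sketch of the general pattern, not the actual record support code; clamp_naive and clamp_checked are hypothetical names, and isnan() is assumed to come from a C99 <math.h>.

```c
#include <math.h>

/* Naive clamp in the style of a DRVH/DRVL check: relies only on < and >. */
static double clamp_naive(double val, double drvl, double drvh)
{
    if (val > drvh) return drvh;
    if (val < drvl) return drvl;
    return val;   /* NaN falls through: both comparisons above are false */
}

/* Variant that rejects NaN explicitly.
 * Returns 0 and stores the clamped value on success, -1 if val is NaN. */
static int clamp_checked(double val, double drvl, double drvh, double *out)
{
    if (isnan(val)) return -1;
    *out = clamp_naive(val, drvl, drvh);
    return 0;
}
```

With limits of 0 and 10, clamp_naive() passes NaN through unchanged, while clamp_checked() reports an error and leaves the output untouched.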

The test abs(NaN - previous value) > MDEL evaluates to false,
so no ChannelAccess monitors are sent,
hence you don't see that the records are at NaN
in operator displays or archives.
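A sketch of the deadband problem, plus one possible way around it, might look as follows. The function names are hypothetical, this is not the actual record code, and isnan() is again assumed to be available from <math.h>.

```c
#include <math.h>

/* Naive monitor deadband test: post an update if the change exceeds mdel. */
static int should_post_naive(double val, double last_posted, double mdel)
{
    return fabs(val - last_posted) > mdel;   /* false whenever val is NaN */
}

/* Variant that also posts when the value changes to or from NaN. */
static int should_post_nan_aware(double val, double last_posted, double mdel)
{
    int val_nan  = isnan(val) != 0;
    int last_nan = isnan(last_posted) != 0;

    if (val_nan != last_nan) return 1;   /* NaN-ness changed: always post */
    if (val_nan) return 0;               /* NaN -> NaN: nothing new to report */
    return fabs(val - last_posted) > mdel;
}
```

The second variant treats a transition into or out of NaN as a change worth reporting, regardless of MDEL. Whether that is the desired behavior is exactly the policy question raised further down.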

When NaN ends up in an analog record's RVAL or
is written to other non-floating point records,
you get a huge or tiny number (depending on whether you
use it signed or unsigned),
which in the case of the AO record sneaked past
the DRVH/DRVL check, and can now cause all sorts
of interesting results when passed to the unsuspecting
hardware.
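One way to keep NaN out of an integer raw value is to check before casting. The following is a minimal sketch under the assumption that the raw value is a 32-bit signed integer; double_to_int32_checked is a hypothetical helper, not an existing EPICS routine.

```c
#include <math.h>
#include <stdint.h>

/* Convert a double to a 32-bit raw value.
 * Returns 0 on success, -1 if val is NaN or outside the int32_t range.
 * On failure *out is left unchanged. */
static int double_to_int32_checked(double val, int32_t *out)
{
    if (isnan(val)) return -1;
    /* INT32_MIN and INT32_MAX are exactly representable as doubles. */
    if (val < (double)INT32_MIN || val > (double)INT32_MAX) return -1;
    *out = (int32_t)val;   /* in range, so the cast is well defined */
    return 0;
}
```

The caller can then decide what to do on failure, for example keep the previous raw value and raise an alarm, instead of handing an arbitrary bit pattern to the hardware.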

At the SNS, we have differing opinions on how to react.
The cause, our save/restore program, will certainly be fixed.
Is it worthwhile to prevent the symptoms from happening again?
Should Channel Access and records check for NaN and
block it?
Or should a user be free to send NaN if that's what he/she
really wants, except MDEL/HOPR/LOPR/DRVH/DRVL ought to
handle it?
This brings up practical issues:
Is this a lot of work?
Who volunteers?
How do you catch NaN? Unlike most C/C++ compilers
and standard C/C++ libraries,
the one we use for vxWorks/Tornado 2.0.2 does not define "isnan()".
Shen Peng suggested using something like this:
#define isnan(x)    ((x)!=(x))
Who knows if this works in all cases?
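As a partial answer, the self-comparison test can be tried against several NaN bit patterns. The sketch below assumes 64-bit IEEE 754 doubles; the macro is renamed PORTABLE_ISNAN (a made-up name) so it does not clash with a library isnan(), and double_from_bits is a hypothetical helper.

```c
#include <stdint.h>
#include <string.h>

/* Self-comparison NaN test. Note that the argument is evaluated twice. */
#define PORTABLE_ISNAN(x) ((x) != (x))

/* Reinterpret a 64-bit pattern as a double (assumes IEEE 754 binary64). */
static double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

With IEEE-conforming arithmetic the macro is true for quiet NaN (0x7FF8000000000000), for the all-0xFF pattern and for a signaling NaN such as 0x7FF0000000000001, and false for infinity and ordinary numbers. One known caveat: compiler options that relax floating-point semantics (gcc's -ffast-math, for instance) may optimize the comparison away, so it is worth verifying on each target.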

The bottom line: Avoid NaN.

- SNS controls group

Replies:: RE: Not-a-number (nan) issues Jeff Hill; Re: Not-a-number (nan) issues Andrew Johnson; Re: Not-a-number (nan) issues Marty Kraimer
