Subject: Re: Small bug in caget
From: Torsten Bögershausen <[email protected]>
To: Benjamin Franksen <[email protected]>, <[email protected]>
Date: Mon, 24 Nov 2014 08:37:39 +0100


On 22/11/14 15:15, Benjamin Franksen wrote:
> On Friday, 21 November 2014 at 12:23:16, Andrew Johnson wrote:
>> Hi Michael,
>>
>> On 11/20/2014 05:48 AM, [email protected] wrote:
>>> I've just noticed that caget incorrectly prints values of type
>>> DBR_CHAR, or to be precise, the behaviour of caget depends on whether
>>> the compiler is operating with signed or unsigned characters.
>>
>> This is actually the case within the IOC as well, a DBF_CHAR field (and
>> the epicsInt8 C type) may be signed or unsigned depending on the ABI of
>> the IOC's target architecture. I'm thus not 100% convinced whether this
>> is actually a bug or not.
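A minimal sketch of that ABI dependence, in plain C with no EPICS types assumed:

#include <stdio.h>

int main(void)
{
    char c = (char)0xC8;  /* bit pattern 1100 1000 */

    /* Prints -56 where plain char is signed (e.g. x86 Linux) and 200
     * where it is unsigned (e.g. ARM Linux); the integer promotion in
     * the printf() call is what exposes the difference. */
    printf("%d\n", c);
    return 0;
}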

>> I was slightly surprised to see that the dbr_char_t is explicitly
>> defined in db_access.h to be unsigned (it's a typedef for epicsUInt8),
>> since dbr_short_t and dbr_long_t are both signed types. However for
>> historical reasons many of our types are a bit messed up anyway, for
>> example a dbr_int_t is a 16-bit quantity, reflecting the processor
>> standards at the time when this code was created.
>>
>> I have looked at cleaning this kind of thing up inside the IOC at least,
>> but it's not easy since compilers don't like passing pointers to
>> unsigned char into functions such as strlen() that expect unqualified
>> char pointers.

> It is notable how pedantic C compilers can be about perfectly harmless
> re-interpretations via pointer casting such as in this case, where the
> compiler *knows* that the underlying types have the same representation.
> The compiler warning suggests that it would be better to add a type cast,
> which we do, and so get rid of the warning. Unfortunately this makes the
> code *worse*, because the type cast asserts that "the programmer knows
> what he is doing", so the compiler never complains or warns again. If we
> later change the source type to something that has a different
> representation, the code with the cast remains "valid" when in fact it
> completely breaks the program. Had the compiler accepted the code with
> the implicit char* to unsigned char* pointer conversion (because it can
> see that it is a safe conversion) it could now complain and would save
> us from shooting ourselves in the foot.
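A small illustration of that trap (count_bytes() is a made-up helper standing in for any function with an unsigned char* interface):

#include <stdio.h>
#include <string.h>

/* Made-up helper standing in for any unsigned char* interface. */
static size_t count_bytes(const unsigned char *p)
{
    return strlen((const char *)p);
}

int main(void)
{
    char name[] = "hello";

    /* Without the cast the compiler warns about char* -> unsigned char*;
     * the cast silences it... */
    printf("%zu\n", count_bytes((const unsigned char *)name));

    /* ...and keeps it silent forever: redeclare 'name' as, say,
     * 'short name[] = {1, 2, 3};' and this call still compiles, now
     * scanning memory for a terminating zero byte that may not exist. */
    return 0;
}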

>> We should probably have separate DBF_CHAR and DBF_INT8
>> field types to distinguish whether a field should be represented as a
>> character or integer value, but we don't currently.

> Slightly paraphrasing, DBF_CHAR would represent text, while DBF_INT8 and
> DBF_UINT8 represent (8 bit) integer values. That means we make a /semantic/
> distinction in the type, not just a representational one. I guess this was
> the idea behind making char, signed char, and unsigned char three distinct
> types in C. (The problem with this idea is of course that char should have
> >=21 bits, representing a Unicode code point, not 8 bits. Which is the
> reason most modern languages offer a dedicated string or text type that
> hides the internal representation. We don't have the luxury of using such
> a language, so we have to compromise.)
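As an aside, that distinctness is directly observable in C11: a _Generic selection may list all three types at once, which would be a constraint violation if any two of them were the same type.

#include <stdio.h>

#define KIND(x) _Generic((x),           \
    char:          "plain char",        \
    signed char:   "signed char",       \
    unsigned char: "unsigned char")

int main(void)
{
    char c = 'a';
    signed char s = -1;
    unsigned char u = 255;

    puts(KIND(c));  /* plain char    */
    puts(KIND(s));  /* signed char   */
    puts(KIND(u));  /* unsigned char */
    return 0;
}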

It all depends on how text is stored in memory when using C (or C++).
In the old days ASCII was used (or EBCDIC, or something different), but in any
case a char was big enough to hold a "character".
A "char" with 8 bits could hold ASCII, which defines 7 bits, and even more, like
ISO-8859-1.
Then Unicode was developed, a "character" became a "code point", and char was
too small to hold all code points, so wchar_t was introduced.

wchar_t is 16 bits under Windows, which was OK for Unicode 1.0.
When Unicode 2.0 came, more than 0xFFFF code points were defined.
wchar_t under Windows is still 16 bits, meaning that code points > 0xFFFF must be
encoded as surrogate pairs in UTF-16. (That is what the Unicode-aware file system
API uses, e.g. _wopen().)

wchar_t is typically 32 bits under Linux and Mac OS X; I don't know about other systems.
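The surrogate mechanism is simple enough to show inline; this sketch computes the UTF-16 pair for an arbitrary code point above 0xFFFF and reports the local wchar_t width:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    unsigned long cp = 0x1F600UL;  /* any code point above 0xFFFF */

    /* UTF-16 surrogate pair: subtract 0x10000, then split the
     * remaining 20 bits across the two surrogate ranges. */
    unsigned long v = cp - 0x10000UL;
    unsigned hi = 0xD800u + (unsigned)(v >> 10);     /* high surrogate */
    unsigned lo = 0xDC00u + (unsigned)(v & 0x3FFu);  /* low surrogate  */

    printf("U+%05lX -> 0x%04X 0x%04X\n", cp, hi, lo);  /* 0xD83D 0xDE00 */
    printf("sizeof(wchar_t) here: %u\n", (unsigned)sizeof(wchar_t));
    return 0;
}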

Whenever a system communicates with another system, most often neither UTF-32 nor
UTF-16 is used, but rather UTF-8, as it typically needs less bandwidth.
And when Unicode code points need to be stored in memory, many systems keep them
there as UTF-8 strings as well; or, put differently, as C strings encoded in UTF-8.
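A bare-bones encoder makes the "C string encoded in UTF-8" idea concrete (illustration only: it skips the surrogate and range checks a real one needs):

#include <stdio.h>

/* Encode one code point as UTF-8; returns the byte count. */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0xB5, buf);     /* U+00B5 MICRO SIGN */
    for (i = 0; i < n; i++)
        printf("0x%02X ", (unsigned)buf[i]);  /* 0xC2 0xB5 */
    printf("\n");
    return 0;
}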

When it comes to EPICS, I have seen both ISO-8859-1 and UTF-8, and there are
probably other encodings in use as well.
Question: which encodings are used?

I could imagine it being useful if Channel Access (or pvAccess) could tell
the remote side which encoding it uses for strings, as I can foresee a transition
from e.g. ISO-8859-1 to UTF-8, with a mixture of old and new systems speaking
different encodings.
I don't know if this is feasible at all, or if it is too much work when most
strings are ASCII and only a handful of code points outside ASCII are used
(like '°' for degree, or 'µ' for micro).
Once again: what do you use in reality?
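To make the mixture problem concrete, here is the degree sign as it would arrive from an ISO-8859-1 speaker versus a UTF-8 speaker; a receiver that guesses the wrong encoding either hits an invalid byte or shows the classic mojibake:

#include <stdio.h>

int main(void)
{
    /* DEGREE SIGN, U+00B0, on the wire: */
    unsigned char latin1[] = { 0xB0 };        /* ISO-8859-1: one byte  */
    unsigned char utf8[]   = { 0xC2, 0xB0 };  /* UTF-8:      two bytes */

    /* Read the UTF-8 bytes as ISO-8859-1 and you see two characters
     * of garbage (A-circumflex followed by the degree sign); read the
     * Latin-1 byte as UTF-8 and 0xB0 is an invalid lone continuation
     * byte. */
    printf("ISO-8859-1: 0x%02X\n", (unsigned)latin1[0]);
    printf("UTF-8:      0x%02X 0x%02X\n", (unsigned)utf8[0], (unsigned)utf8[1]);
    return 0;
}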

Back to Andrew's question:

> Slightly paraphrasing, DBF_CHAR would represent text, while DBF_INT8 and
> DBF_UINT8 represent (8 bit) integer values. That means we make a /semantic/
> ...

What do we gain from such a distinction?
One advantage is that strlen() wants "char *" rather than "unsigned char *",
but what more do we gain?

My feeling is that we may want to stick with DBF_CHAR being defined as
"unsigned char" (or UINT8), but does it make sense to define a
DBF_UTF_8?

I tend to think the answer is yes.
You would not need to use it, but when it is used, we are sure that the string is UTF-8.
Most (if not all) Linux systems today are configured to understand UTF-8, and so is Mac OS X.
CYGWIN does support UTF-8 in the terminal; I am not sure about MSYS or native Windows.
(However, you would not need to use DBF_UTF_8; the old DBF_CHAR would still work.)
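If such a DBF_UTF_8 were added (it is only a proposal here), the server could at least verify well-formedness on every put. A rough sketch, which for brevity accepts overlong forms and surrogates that a real validator must reject:

#include <stddef.h>

/* Is buf[0..len) structurally valid UTF-8? Sketch only. */
static int utf8_wellformed(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i++];
        size_t n;
        if      (b < 0x80)           n = 0;  /* ASCII           */
        else if ((b & 0xE0) == 0xC0) n = 1;  /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0) n = 2;  /* 3-byte sequence */
        else if ((b & 0xF8) == 0xF0) n = 3;  /* 4-byte sequence */
        else return 0;                       /* stray trailer or 0xFE/0xFF */
        while (n--)
            if (i >= len || (buf[i++] & 0xC0) != 0x80)
                return 0;                    /* truncated or bad trailer */
    }
    return 1;
}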

As we may want to introduce 64-bit integers, and my understanding is that this
breaks binary compatibility (is this right?), we could introduce UTF-8 at the
same time.

Which leads to the next question:
how does one contribute code to EPICS Base?
Do I need a Launchpad account, which seems to be connected to an Ubuntu One
account (https://login.launchpad.net/privacy/), or are there other ways?

https://en.wikipedia.org/wiki/C_string_handling#wchar_t
http://en.wikipedia.org/wiki/Unicode
References:
Small bug in caget michael.abbott
Re: Small bug in caget Andrew Johnson
Re: Small bug in caget Benjamin Franksen
