On 22/11/14 15:15, Benjamin Franksen wrote:
On Friday, 21 November 2014, 12:23:16, Andrew Johnson wrote:
Hi Michael,
On 11/20/2014 05:48 AM, [email protected] wrote:
I've just noticed that caget incorrectly prints values of type
DBR_CHAR, or to be precise, the behaviour of caget depends on whether
the compiler is operating with signed or unsigned characters.
This is actually the case within the IOC as well, a DBF_CHAR field (and
the epicsInt8 C type) may be signed or unsigned depending on the ABI of
the IOC's target architecture. I'm thus not 100% convinced whether this
is actually a bug or not.
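
The ABI dependence described above can be illustrated with a small sketch. It is not the caget code; the function names are hypothetical and only show how the same byte reads differently through plain char and through unsigned char.

```c
#include <assert.h>
#include <limits.h>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

/* Interpret a raw byte through plain char. Whether the result can be
 * negative depends on the target ABI (hypothetical helper). */
int value_via_plain_char(unsigned char raw)
{
    char c = (char)raw;   /* implementation-defined when raw > CHAR_MAX */
    return c;
}

/* Interpret the same byte through unsigned char: always 0..255. */
int value_via_unsigned_char(unsigned char raw)
{
    return raw;
}

/* Nonzero when plain char is a signed type on this target. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

On a target where plain char is signed, a byte of 200 typically comes back as a negative number from the first function, while the second always returns 200.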
I was slightly surprised to see that the dbr_char_t is explicitly
defined in db_access.h to be unsigned (it's a typedef for epicsUInt8),
since dbr_short_t and dbr_long_t are both signed types. However for
historical reasons many of our types are a bit messed up anyway, for
example a dbr_int_t is a 16-bit quantity, reflecting the processor
standards at the time when this code was created.
I have looked at cleaning this kind of thing up inside the IOC at least,
but it's not easy since compilers don't like passing pointers to
unsigned char into functions such as strlen() that expect unqualified
char pointers.
It is notable how pedantic C compilers can be about perfectly harmless
reinterpretations via pointer casting such as in this case, where the
compiler *knows* that the underlying types have the same representation. The
compiler warning suggests that it would be better to add a type cast, which we
do and so get rid of the warning. Unfortunately this makes the code *worse*,
because the type cast asserts that "the programmer knows what he is doing", so the
compiler never complains or warns again. If we later change the source type to
something that has a different representation, the code with the cast remains
"valid" when in fact it completely breaks the program now. Had the compiler
accepted the code with the implicit char* to unsigned char* pointer conversion
(because it can see that it is a safe conversion) it could now complain and
would save us from shooting ourselves in the foot.
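
One possible way to limit the damage, sketched here under assumptions: keep the cast in a single wrapper and guard it with a compile-time check. The names field_char_t and field_strlen are hypothetical and not part of EPICS. The check only catches a change in element size, not every possible change of representation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an unsigned 8-bit field type. */
typedef unsigned char field_char_t;

/* The only place where the cast appears. If field_char_t is later changed
 * to a wider type, the static assertion stops the build instead of letting
 * the cast silently reinterpret the data. */
size_t field_strlen(const field_char_t *s)
{
    static_assert(sizeof(field_char_t) == sizeof(char),
                  "field_char_t must remain a one-byte type");
    return strlen((const char *)s);
}
```

Callers then use field_strlen() without any cast of their own, so a change of the typedef is reported once, at the wrapper.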
We should probably have separate DBF_CHAR and DBF_INT8
field types to distinguish whether a field should be represented as a
character or integer value, but we don't currently.
Slightly paraphrasing, DBF_CHAR would represent text, while DBF_INT8 and
DBF_UINT8 represent (8 bit) integer values. That means we make a /semantic/
distinction in the type, not just a representational one. I guess this was the
idea behind making char, signed char, and unsigned char three distinct types
in C. (The problem with this idea is of course that char should have >=21
bits, representing a Unicode code point, not 8 bit. Which is the reason most
modern languages offer a dedicated string or text type that hides the internal
representation. We don't have the luxury to use such a language, so we have to
compromise.)
It all depends how text is stored in memory when using C (or C++).
In the old days, ASCII was used (or EBCDIC, or something else), but in any
case a char was big enough to hold a "character".
A "char" with 8 bits could hold ASCII which defines 7 bits, and even more, like
ISO-8859-1.
Then Unicode was developed, in which a "character" is called a "code point". The char
was too small to hold all code points, and wchar_t was introduced.
wchar_t is 16 bit under Windows, which was OK for Unicode 1.0.
When Unicode 2.0 came, code points above 0xFFFF were introduced.
wchar_t under Windows is still 16 bit, so code points >0xFFFF have to be encoded
as UTF-16 surrogate pairs. (That is what the Unicode aware file system API uses, e.g. _wopen()).
wchar_t is typically 32 bits under Linux and Mac OS X, I don't know about other systems.
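
To show what a 16-bit wchar_t implies for code points above 0xFFFF, here is a minimal sketch of UTF-16 encoding. The function name utf16_encode is hypothetical; the arithmetic follows the standard surrogate-pair scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 (hypothetical helper).
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * not a valid code point for encoding. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                     /* surrogate range is reserved */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;        /* fits in one unit */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                     /* beyond the Unicode code space */
    uint32_t v = cp - 0x10000;        /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (v >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (v & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+00B0 ('°') needs one unit, while U+1F600 needs the pair 0xD83D 0xDE00.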
Whenever a system communicates with another system, most often neither UTF-32 nor UTF-16
is used, but rather UTF-8, as it typically needs less bandwidth.
When Unicode code points need to be stored in memory, many systems store them as UTF-8 strings,
or in other words, as C strings encoded in UTF-8.
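
As an illustration of why the encoding matters for 8-bit strings, the following sketch encodes a single code point as UTF-8. The function name utf8_encode is hypothetical. Note that '°' (U+00B0) is the single byte 0xB0 in ISO-8859-1 but two bytes in UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 (hypothetical helper).
 * Returns the number of bytes written (1..4), or 0 if cp is invalid. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                          /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                         /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                       /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                         /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                     /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                 /* beyond the Unicode code space */
}
```

A receiver that assumes ISO-8859-1 would show the two UTF-8 bytes 0xC2 0xB0 as two separate characters, which is the kind of mixture a declared encoding would avoid.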
When it comes to EPICS, I have seen both ISO-8859-1 and UTF-8, and there are probably other encodings
used as well.
Question: Which encodings are used?
I would find it useful if Channel Access (or PV Access) could tell
the remote side which encoding it uses for strings, as I can see a transition from e.g. ISO-8859 to UTF-8,
and a mixture of old and new, speaking different encodings.
I don't know if this is feasible at all, or if this is too much work when most strings are ASCII
and only a handful of code points outside ASCII are used (like '°' for degree or 'µ' for micro).
Once again: What do you use in reality?
Back to Andrew's question:
Slightly paraphrasing, DBF_CHAR would represent text, while DBF_INT8 and
DBF_UINT8 represent (8 bit) integer values. That means we make a /semantic/
...
What do we gain from such a distinction?
One advantage is that strlen() wants "char *" rather than "unsigned char *",
but what else do we gain?
My feeling is that we may want to stick with DBF_CHAR being defined as "unsigned char" (or UINT8),
but does it make sense to define
DBF_UTF_8?
I tend to think the answer is yes.
You do not need to use it, but when it is used, we are sure that the string is UTF-8.
Most (if not all) Linux systems today are configured to understand UTF-8, and so is Mac OS X.
Cygwin does support UTF-8 in the terminal; I am not sure about MSYS or native Windows.
(However, you do not need to use DBF_UTF_8; the old DBF_CHAR works well.)
As we may want to introduce 64 bit integers, and my understanding is that this breaks
binary compatibility (is this right?), we may introduce UTF-8 at the same time.
Which leads to the next question:
How do I contribute code to EPICS Base?
Do I need a Launchpad account, which seems to be connected to an Ubuntu One account
(https://login.launchpad.net/privacy/), or are there other ways?
https://en.wikipedia.org/wiki/C_string_handling#wchar_t
http://en.wikipedia.org/wiki/Unicode