Benjamin Franksen wrote:
Just one little thought about STRING data type:
UINTN the number of UTF-8 tokens
OCTET sequence UTF-8 encoded character string sequence
I take it that 'number of UTF-8 tokens' means 'number of octets',
right?
It must do, for reasons discussed below.
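One practical reason an octet count is convenient: a receiver can cut the string out of the buffer without decoding it. The Python sketch below is an illustration only. The 4-byte big-endian length prefix and the names encode_string/decode_string are hypothetical; this thread does not say how UINTN is actually encoded on the wire.

```python
import struct

def encode_string(text):
    """Hypothetical wire form: 4-byte big-endian octet count, then the UTF-8 octets."""
    payload = text.encode("utf-8")
    return struct.pack(">I", len(payload)) + payload

def decode_string(buf):
    """Return the decoded string and whatever bytes follow it in the buffer."""
    (count,) = struct.unpack_from(">I", buf, 0)
    end = 4 + count                      # the octet count says exactly where the string ends
    return buf[4:end].decode("utf-8"), buf[end:]

# "Größe" has 5 code points but 7 octets in UTF-8.
wire = encode_string("Größe") + b"next-field"
text, rest = decode_string(wire)
```

With a code point count in the prefix instead, the receiver would have to walk the UTF-8 sequence just to find where the next field starts.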
Maybe it would be worthwhile to consider adding a 'number of
/characters/' count in addition to the byte count. This could improve
performance, particularly when converting to other encodings on the
client side. Of course any gain must be offset against the increased
protocol overhead.
In EPICS we really don't want to deal with UTF-8 'characters' or the
Unicode code points they encode, we'd much rather leave all that up to
the user interfaces and just count octets everywhere.
This is exactly what most operating system routines do too - if you call
printf() in a UTF-8 locale and give a %s with a width or precision
specification in your format string, it will be counted in bytes (C
chars), so a precision could cut a UTF-8 multi-byte character off in the
middle of its sequence.
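To illustrate how a limit counted in bytes can split a multi-byte character, here is a minimal Python sketch. It does not call printf(); it only slices the encoded bytes the way a byte-counted limit would, and the sample word is an arbitrary choice.

```python
text = "naïve"                     # 'ï' is U+00EF, two octets in UTF-8
octets = text.encode("utf-8")      # b'na\xc3\xafve': 6 octets for 5 code points

truncated = octets[:3]             # keep 3 bytes, as a byte-counted limit would

try:
    truncated.decode("utf-8")
    split_in_middle = False
except UnicodeDecodeError:
    # The cut fell between the lead byte 0xC3 and its continuation byte 0xAF.
    split_in_middle = True
```

Code that only stores and forwards the octets never notices this; it only matters to whatever finally decodes the string.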
Marty Kraimer replied:
Java 5 uses 16 bits for char, which is not sufficient to encode all
Unicode characters.
It uses 2 consecutive chars (a surrogate pair) to hold a Unicode
character that does not fit in 16 bits.
At least some C/C++ implementations use 32 bits for wchar_t, which is
sufficient for all Unicode characters.
But what if an implementation uses 16 bits?
So how would the number of characters in a UTF-8 string be used?
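The 16-bit point can be checked with a short Python sketch that counts UTF-16 code units. The helper name utf16_units and the two sample characters are illustrative choices, not anything from the protocol.

```python
def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp = "\u00e9"          # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001D11E"   # U+1D11E MUSICAL SYMBOL G CLEF, outside it

bmp_units = utf16_units(bmp)          # fits in one 16-bit unit
astral_units = utf16_units(astral)    # needs a surrogate pair
astral_octets = len(astral.encode("utf-8"))
```

So one code point can be 1 or 2 units in a 16-bit representation and 1 to 4 octets in UTF-8, which means a single "character count" does not size a buffer for every client.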
Unicode/UTF-8 (which is what we really mean when we say UTF-8) is
well-defined in that if a routine understands the multi-byte encoding
rules it can scan a UTF-8 string and count the number of Unicode 'code
points' contained in it, which is probably what Benjamin means when he
talks about a character count.
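A minimal Python sketch of such a scan, assuming the input is well-formed UTF-8: every code point contributes exactly one byte that is not a continuation byte (bit pattern 10xxxxxx), so counting those bytes counts code points. The function name and sample text are made up for the example.

```python
def count_code_points(octets):
    """Count code points in well-formed UTF-8 by skipping continuation bytes."""
    return sum(1 for b in octets if (b & 0xC0) != 0x80)

sample = "Größe €".encode("utf-8")   # 'ö' and 'ß' take 2 octets each, '€' takes 3

n_octets = len(sample)
n_code_points = count_code_points(sample)
```

The scan is linear in the number of octets and does no validation; malformed input would give a meaningless count.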
However, like Marty, I would strongly question the usefulness of this
information to anything other than the final GUI display widget that is
going to put the thing on a screen; even if it were using a monospaced
font, some Unicode code points actually encode 'combining' characters
like accents so the number of code points wouldn't always match the
width of the final output.
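The combining-character case is easy to reproduce with Python's standard unicodedata module. The helper non_combining below is only a rough heuristic written for this example; it is not a real display-width calculation.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)   # the single code point U+00E9

def non_combining(text):
    """Count code points whose canonical combining class is 0 (rough heuristic)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

cells_guess = non_combining(decomposed)
```

Both strings would normally be drawn as one glyph, yet they hold 2 and 1 code points respectively, and 3 and 2 octets in UTF-8.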
Visit http://www.unicode.org/faq/ to find out more than you ever wanted
to know about Unicode...
Better to just let the final sender/receiver of the character string handle it.
That part I agree with.
- Andrew
--
English probably arose from Normans trying to pick up Saxon girls.