EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: CA V4 Protocol Specification
From: Andrew Johnson <[email protected]>
To: Marty Kraimer <[email protected]>
Cc: [email protected]
Date: Wed, 26 Oct 2005 09:42:28 -0500
Benjamin Franksen wrote:

> Just one little thought about STRING data type:
>
>     UINTN             the number of UTF-8 tokens
>     OCTET sequence    UTF-8 encoded character string sequence
>
> I take it that 'number of UTF-8 tokens' means 'number of octets', right?

It must do, for reasons discussed below.

> Maybe it would be worthwhile to consider adding a 'number of /characters/' count in addition to the byte count. This could improve performance, particularly when converting to other encodings on the client side. Of course any gain must be offset against the increased protocol overhead.

In EPICS we really don't want to deal with UTF-8 'characters' or the Unicode code points they encode; we'd much rather leave all that up to the user interfaces and just count octets everywhere.

This is exactly what most operating system routines do too: if you call printf() in a UTF-8 locale and give a %s with a width or precision specification in your format string, the count is in bytes (C chars), so a precision that truncates the string could break up a UTF-8 multi-byte character in the middle of its sequence.

Marty Kraimer replied:
> Java 5 uses 16 bits for char, which is not sufficient to encode all Unicode characters. It uses 2 consecutive chars (a surrogate pair) to hold a Unicode character that does not fit in 16 bits.
>
> At least some C/C++ implementations use 32 bits for wchar_t, which is sufficient for all Unicode characters.
> But what if an implementation uses 16 bits?
>
> Thus how will the number of characters in a UTF-8 string be used?

Unicode/UTF-8 (which is what we really mean when we say UTF-8) is well-defined in that if a routine understands the multi-byte encoding rules it can scan a UTF-8 string and count the number of Unicode 'code points' contained in it, which is probably what Benjamin means when he talks about a character count.

However, like Marty, I would strongly question the usefulness of this information to anything other than the final GUI display widget that is going to put the thing on a screen. Even if it were using a monospaced font, some Unicode code points actually encode 'combining' characters like accents, so the number of code points wouldn't always match the width of the final output.

Visit http://www.unicode.org/faq/ to find out more than you ever wanted to know about Unicode...

> Better to just let final sender/receiver of the character string handle it.

That part I agree with.

- Andrew
--
English probably arose from Normans trying to pick up Saxon girls.

Replies:
Re: CA V4 Protocol Specification Benjamin Franksen
References:
CA V4 Protocol Specification Jeff Hill
Re: CA V4 Protocol Specification Benjamin Franksen
Re: CA V4 Protocol Specification Marty Kraimer

ANJ, 02 Feb 2012