Experimental Physics and
Industrial Control System

Goetz Pfeiffer <[email protected]> · Wed, 5 Nov 2014 10:25:59 +0100

Hello Everybody,

When using catools with strings that contain non-ASCII characters, these
characters are always printed or read as octal constants, no matter what the
locale settings are.

Note: In the following text all command examples or outputs on the console
are indented by two characters and preceded by a double colon (::). This is
taken from the reStructuredText format
( http://docutils.sourceforge.net/rst.html ).

In the following example we want to use a character of the ISO-8859-1
character set. Why not simply use Unicode UTF-8? The reason is that display
managers like DM2K and EDM do not support Unicode. If we want to display
non-ASCII characters in string fields of records with these display managers,
we must use a character set like ISO-8859-1 (also known as Latin 1).

Here is an example on a Linux host with a Unicode UTF-8 locale:

First we write the degree character '°' in ISO-8859-1 encoding to the
EGU field of a record::

  > echo "°" | iconv -f UTF-8 -t ISO_8859-1 | xargs caput U49ID8R:AmsTempT1.EGU

When we now read the value::

  > caget U49ID8R:AmsTempT1.EGU

we get::

  U49ID8R:AmsTempT1.EGU          \260

The '°' character (byte 0xB0 in ISO-8859-1) is printed as the octal escape
"\260". This is okay, since with UTF-8 on our host system we couldn't display
an ISO-8859-1 character anyway.

This is our locale::

  > locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE="en_US.UTF-8"
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

Now we change the locale to ISO-8859-1::

  > export LC_ALL=de_DE.iso88591

  > locale
  LANG=en_US.UTF-8
  LC_CTYPE="de_DE.iso88591"
  LC_NUMERIC="de_DE.iso88591"
  LC_TIME="de_DE.iso88591"
  LC_COLLATE="de_DE.iso88591"
  LC_MONETARY="de_DE.iso88591"
  LC_MESSAGES="de_DE.iso88591"
  LC_PAPER="de_DE.iso88591"
  LC_NAME="de_DE.iso88591"
  LC_ADDRESS="de_DE.iso88591"
  LC_TELEPHONE="de_DE.iso88591"
  LC_MEASUREMENT="de_DE.iso88591"
  LC_IDENTIFICATION="de_DE.iso88591"
  LC_ALL=de_DE.iso88591

Now we call caget again::

  > caget U49ID8R:AmsTempT1.EGU
  U49ID8R:AmsTempT1.EGU          \260

The character is still printed as an octal escape although our locale
settings (LC_ALL) define it as a printable character. caget uses the function
epicsStrnEscapedFromRaw() from libCom in EPICS base to convert a string to a
printable form. This function calls isprint() to determine which characters
are printable. Since caget never calls setlocale(), it runs in the default
"C" locale, and the locale settings from the environment are ignored.

Using locale settings from the environment in C is simple. The C program must
have this include::

  #include <locale.h>

And it has to call setlocale like this::

  setlocale(LC_ALL, "");

Here is, as an example, my patch of caget.c in EPICS base:

---------------------------------

--- caget.c.old    2014-11-05 09:31:48.010589013 +0100
+++ caget.c    2014-11-05 09:43:28.611042679 +0100
@@ -28,6 +28,7 @@
 
 #include <stdio.h>
 #include <string.h>
+#include <locale.h>
 #include <epicsStdlib.h>
 #include <epicsString.h>
 
@@ -59,6 +60,10 @@
     "  -w <sec>: Wait time, specifies CA timeout, default is %f second(s)\n"
     "  -c: Asynchronous get (use ca_get_callback and wait for completion)\n"
     "  -p <prio>: CA priority (0-%u, default 0=lowest)\n"
+    "Locale:\n"
+    "  -L: use locale according to environment variables in order to\n"
+    "      determine what characters are printable. Non printable characters\n"
+    "      are shown as 3 digit octal numbers preceded by a backslash\n"
     "Format options:\n"
     "      Default output format is \"name value\"\n"
     "  -t: Terse mode - print only value, without name\n"
@@ -389,11 +394,14 @@
 
     LINE_BUFFER(stdout);        /* Configure stdout buffering */
 
-    while ((opt = getopt(argc, argv, ":taicnhsSe:f:g:l:#:d:0:w:p:F:")) != -1) {
+    while ((opt = getopt(argc, argv, ":taicnhLsSe:f:g:l:#:d:0:w:p:F:")) != -1) {
         switch (opt) {
         case 'h':               /* Print usage */
             usage();
             return 0;
+        case 'L':               /* use environment locale settings */
+            setlocale(LC_ALL, "");
+            break;
         case 't':               /* Terse output mode */
             complainIfNotPlainAndSet(&format, terse);
             break;
---------------------------------

With these changes the new option "-L" causes caget to use the locale
settings from the environment. Here is an example of how to use it::

  > export LC_ALL=de_DE.iso88591
  > caget -L U49ID8R:AmsTempT1.EGU
  U49ID8R:AmsTempT1.EGU          °

If the encoding of the terminal emulator (xterm, konsole etc.) is also set to
ISO-8859-1 (Latin 1), the "°" character is now displayed correctly.

Maybe we could add support for locale settings from the environment to all
catools programs and possibly the IOC shell. I would propose an option "-L"
that enables this feature. What is your opinion?

Greetings,

  Goetz Pfeiffer



