Experimental Physics and
Industrial Control System

Jeff Hill <[email protected]> · Wed, 02 Mar 2005 14:02:01 -0700

Ben,

My solution to the wide character issue was to have the putChar 
and getChar interfaces pass type int. UTF-8 then becomes an 
implementation (an internal storage compression) issue.

> > I would bet that such an implementation is in the end a lot
> > more efficient than any implementation based on mutability,
> > such as imposed by the dataAccess string interface.

Sorry, I reread your discussion about immutable strings and now
better understand your suggestion. A string must be written at
some point in its lifetime, but I am supposing that your immutable
string would only receive its value when it was constructed? I
think I see the distinction: under the immutable model, if an
existing string is written, then a new string is created and the
old string is thrown away. I don't think that the dataAccess
interface precludes the internal implementation from doing exactly
that with its storage buffers, even if the interface makes it look
like this is not the case.

> Functional style:
> 
> 	res = concat(s1,s2);
> 
> Imperative style:
> 
> 	res = new string( s1.length() + s2.length() ); // or was it -1 or +1 ???
> 	res.copy( s1 );
> 	res.append( s2 );

For example, I could easily design an interface 
that looks like "Functional Style" and implement 
it internally as "Imperative style".
Ditto for vice versa.
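
A minimal sketch of that point, assuming std::string as the
storage type. The name concat() comes from the quoted example;
the body is hypothetical and is not part of dataAccess. The caller
sees the functional form, while the body is the imperative
sequence:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of dataAccess: a functional-looking
// concat() whose body is the imperative allocate/copy/append sequence.
std::string concat(const std::string& s1, const std::string& s2)
{
    std::string res;
    // The length bookkeeping (the "-1 or +1" question) stays in here.
    res.reserve(s1.size() + s2.size());
    res.append(s1);
    res.append(s2);
    assert(res.size() == s1.size() + s2.size());
    return res;
}
```

The inputs are taken by const reference and never modified, so
the caller cannot tell which style is used inside.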

One could argue that, ignoring the implementation, the 
"Functional Style" programming interface is easier 
for programmers to use. Maybe so. The stringSegment interface
concentrates on being the simplest and clearest possible
interface to an implementation, but the best design for an
implementation interface and for a programming interface might
be incompatible. Therefore, we might also need to design
a programming interface that uses a private stringSegment
to get at the implementation, depending on how often there
is direct access to stringSegment in user plug-ins.

Also, bear in mind that one of the fundamental data access
premises is that the user has a data container with properties
that may be written. Therefore, a mutable interface to strings
is required. This certainly does not preclude throwing away the
internal storage for an old string when a new string is
written, should that turn out to be the best implementation.
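
As an illustration only, here is a sketch of a writable container
that replaces its storage on every write instead of modifying it
in place. StringProperty, write() and snapshot() are hypothetical
names, not the dataAccess API, and std::shared_ptr stands in for
whatever reference counting an implementation would really use:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical container, not the dataAccess API: the interface is
// mutable, but each write swaps in a fresh immutable buffer.
class StringProperty {
public:
    explicit StringProperty(std::string initial = std::string())
        : buf_(std::make_shared<const std::string>(std::move(initial))) {}

    // "Write": the old buffer is released here, never modified in place.
    // It is freed once the last reader holding a snapshot lets go of it.
    void write(std::string value)
    {
        buf_ = std::make_shared<const std::string>(std::move(value));
    }

    // Readers get a shared snapshot that later writes cannot change.
    std::shared_ptr<const std::string> snapshot() const
    {
        assert(buf_ != nullptr); // invariant: a buffer always exists
        return buf_;
    }

private:
    std::shared_ptr<const std::string> buf_;
};
```

A reader that took a snapshot before a write keeps seeing the old
value, which is the behaviour of the immutable model behind a
mutable-looking interface.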

I agree that your constant time internal implementation based on
careful maintenance of reference counting might be very efficient,
but I don't see that the stringSegment interface precludes that
implementation.
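
For reference, a rough sketch of what such a reference-counted,
non-contiguous implementation might look like. This is my reading
of the idea, not Ben's code: Rope, makeLeaf(), concat() and
flatten() are hypothetical names, and std::shared_ptr provides the
reference counting. Note that this version still allocates one
small node per concatenation, although no character data is
copied:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable rope node: either a leaf holding text, or a
// concatenation that shares both operands by reference count.
struct Rope {
    std::string leaf;                        // used when left/right are null
    std::shared_ptr<const Rope> left, right; // both set for a concatenation
    std::size_t length = 0;
};
using RopePtr = std::shared_ptr<const Rope>;

RopePtr makeLeaf(std::string text)
{
    auto node = std::make_shared<Rope>();
    node->length = text.size();
    node->leaf = std::move(text);
    return node;
}

// Constant time: one small node is allocated, no character data is copied.
RopePtr concat(const RopePtr& a, const RopePtr& b)
{
    assert(a && b);
    auto node = std::make_shared<Rope>();
    node->left = a;
    node->right = b;
    node->length = a->length + b->length;
    return node;
}

// Linear-time flattening, needed only when contiguous storage is required.
void flatten(const Rope& r, std::string& out)
{
    if (r.left) {
        flatten(*r.left, out);
        flatten(*r.right, out);
    } else {
        out += r.leaf;
    }
}
```

Sharing is only safe because nodes are never modified after
construction, which is the immutability requirement under
discussion.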

> Another advantage of functional/immutable strings is that support
> for unicode encodings is a lot easier and less error-prone. For
> instance, since a UTF-8 character may be longer than one byte, a
> UTF-8 encoded string should never be written to at an arbitrary
> byte index. With immutable strings it is much easier to maintain
> such invariants.

And the internal implementation under dataAccess could also employ
such optimizations when concatenating strings, should it arrange
its storage the same way.

The stringSegment interface *is* indexable by the stream element.
A stream element could be mapped by the implementation to a
"UTF-8 character longer than one byte". The stream maintains a
current position which would always be placed at a UTF-8 character
boundary. So when reading or writing a sequence of tokens the
overhead is low. When moving the index, it would of course be
necessary to scan the UTF-8 tokens one by one, but that can't
be avoided by any UTF-8 implementation with random access by
token index.
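
To make the cost argument concrete, here is a minimal sketch of
such a stream. It is not the stringSegment API: Utf8Stream, seek()
and byteSize() are hypothetical names, and only putChar and
getChar passing type int come from the proposal above. The sketch
assumes the stored bytes are always valid UTF-8, does not reject
surrogate values, and only appends on write:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch, not the stringSegment API. Code points are passed
// as int; UTF-8 is purely an internal storage detail.
class Utf8Stream {
public:
    // Append one code point, encoded as 1 to 4 bytes.
    void putChar(int cp)
    {
        assert(cp >= 0 && cp <= 0x10FFFF);
        if (cp < 0x80) {
            bytes_ += static_cast<char>(cp);
        } else if (cp < 0x800) {
            bytes_ += static_cast<char>(0xC0 | (cp >> 6));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            bytes_ += static_cast<char>(0xE0 | (cp >> 12));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            bytes_ += static_cast<char>(0xF0 | (cp >> 18));
            bytes_ += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    // Decode the code point at the current position and advance past it.
    // Sequential access is cheap. Returns -1 at end of data.
    int getChar()
    {
        if (pos_ >= bytes_.size()) return -1;
        const unsigned char lead = static_cast<unsigned char>(bytes_[pos_]);
        const std::size_t len = seqLength(lead);
        int cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes_[pos_ + i]) & 0x3F);
        pos_ += len;
        return cp;
    }

    // Random access by token index: scan lead bytes from the start, O(n).
    // Returns false if the index lies beyond the end of the data.
    bool seek(std::size_t tokenIndex)
    {
        std::size_t p = 0;
        for (std::size_t i = 0; i < tokenIndex; ++i) {
            if (p >= bytes_.size()) return false;
            p += seqLength(static_cast<unsigned char>(bytes_[p]));
        }
        pos_ = p;
        return true;
    }

    std::size_t byteSize() const { return bytes_.size(); }

private:
    // Sequence length implied by a lead byte (valid input assumed).
    static std::size_t seqLength(unsigned char lead)
    {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x6) return 2;
        if ((lead >> 4) == 0xE) return 3;
        return 4;
    }

    std::string bytes_;
    std::size_t pos_ = 0; // always at the first byte of a sequence, or at end
};
```

The current position only ever moves by whole sequences, so it
stays on a character boundary; the linear scan in seek() is the
random access cost mentioned above.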

I guess you have to ask if random access is useful or not.
If useful, then it *is* a bit less efficient with a UTF-8
implementation. That can't be avoided. Otherwise, if it's not
needed, or we don't want to implement it, then we could drop
that feature from the interface.

Jeff

> -----Original Message-----
> From: Benjamin Franksen [mailto:[email protected]] 
> Sent: Wednesday, March 02, 2005 5:34 AM
> To: Jeff Hill
> Cc: 'Eric Norum'; 'Ralph Lange'; 'Matej Sekoranja'; 'Marty 
> Kraimer'; 'Andrew Johnson'; 'Ken Evans'; 'Bob Dalesio';
> 'Kasemir, Kay'
> Subject: Re: memory management
> 
> 
> On Wednesday 02 March 2005 01:54, Benjamin Franksen wrote:
> > An implementation based on non-contiguous storage, could take
> > advantage of its storage model, and almost completely avoid
> > copying (at the cost of slightly increasing the overall memory
> > footprint). For instance, functional concatenation can be done
> > in constant time (avoiding all allocation and copying). As long
> > as strings are immutable and references are properly tracked, an
> > implementation can easily share the storage between different
> > strings (except the meta data). I would bet that such an
> > implementation is in the end a lot more efficient than any
> > implementation based on mutability, such as imposed by the
> > dataAccess string interface.
> 
> Another advantage of functional/immutable strings is that support
> for unicode encodings is a lot easier and less error-prone. For
> instance, since a UTF-8 character may be longer than one byte, a
> UTF-8 encoded string should never be written to at an arbitrary
> byte index. With immutable strings it is much easier to maintain
> such invariants.
> 
> The burden, in this case, would be with string analyzing
> functions such as a generic 'split' function that turns a string
> into a pair (front,back) of strings according to some character
> or substring predicate that determines the split position. Note
> that such a function would need to traverse the string character
> by character anyway (to find the split position). Thus, observing
> UTF-8 character boundaries would cause almost no additional
> overhead.
> 
> Ben
> 
