Ben,
My solution to the wide character issue was to have the putChar
and getChar interfaces pass type int. UTF-8 then becomes an
implementation (an internal storage compression) issue.
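
A minimal sketch of that idea, assuming a hypothetical Utf8Stream class (the real putChar and getChar signatures are not shown in this thread, so everything except those two names is an assumption). The caller only ever sees whole code points as int, and UTF-8 is confined to the private buffer:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical sketch, not the dataAccess API: callers pass and receive
// code points as int, while the private storage is UTF-8 bytes.
class Utf8Stream {
public:
    // Append one code point, encoding it as 1 to 4 UTF-8 bytes.
    void putChar(int c) {
        if (c < 0 || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (c < 0x80) {
            buf_ += static_cast<char>(c);
        } else if (c < 0x800) {
            buf_ += static_cast<char>(0xC0 | (c >> 6));
            buf_ += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            buf_ += static_cast<char>(0xE0 | (c >> 12));
            buf_ += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            buf_ += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            buf_ += static_cast<char>(0xF0 | (c >> 18));
            buf_ += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            buf_ += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            buf_ += static_cast<char>(0x80 | (c & 0x3F));
        }
    }

    // Read the code point at the current position and advance past it.
    // Returns -1 at end of data. Assumes the buffer is well-formed UTF-8,
    // which holds as long as it was only filled through putChar.
    int getChar() {
        if (pos_ >= buf_.size()) return -1;
        unsigned char lead = static_cast<unsigned char>(buf_[pos_++]);
        int c;
        int extra;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)      { c = lead;        extra = 0; }
        else if (lead < 0xE0) { c = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { c = lead & 0x0F; extra = 2; }
        else                  { c = lead & 0x07; extra = 3; }
        while (extra-- > 0)
            c = (c << 6) | (static_cast<unsigned char>(buf_[pos_++]) & 0x3F);
        return c;
    }

    void rewind() { pos_ = 0; }                            // back to the first code point
    std::size_t byteSize() const { return buf_.size(); }   // storage cost in bytes

private:
    std::string buf_;      // UTF-8 encoded storage
    std::size_t pos_ = 0;  // always sits on a code point boundary
};
```

Because the position only ever moves by whole code points, the caller has no way to land in the middle of a multi-byte sequence.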
> > I would bet that such an implementation is in the end a lot
> > more efficient than any implementation based on mutability, such as
> > imposed by the dataAccess string interface.
Sorry, I reread your discussion about immutable strings and now
understand your suggestion better. A string must be written at some
point in its lifetime, but I am supposing that your immutable string
would only receive its value when it is constructed? I think I see the
distinction: under the immutable model, if an existing string is
written, then a new string is created and the old string is thrown
away. I don't think that the dataAccess interface precludes the
internal implementation from doing exactly that with its storage
buffers, even if the interface makes it look like this is not the case.
> Functional style:
>
> res = concat(s1,s2);
>
> Imperative style:
>
> res = new string( s1.length() + s2.length() ); // or was it -1 or +1 ???
> res.copy( s1 );
> res.append( s2 );
For example, I could easily design an interface
that looks like "Functional Style" and implement
it internally as "Imperative Style".
Ditto for vice versa.
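
To illustrate, here is a minimal sketch using std::string rather than any dataAccess type (that substitution is an assumption made only for the example). The signature is functional, since the arguments are left untouched and a new value is returned, while the body is the imperative reserve/copy/append sequence from the quoted snippet:

```cpp
#include <string>

// Functional-looking interface: s1 and s2 are not modified,
// and the result is a new value.
std::string concat(const std::string& s1, const std::string& s2) {
    // Imperative implementation hidden inside: size the buffer once,
    // then copy and append in place.
    std::string res;
    res.reserve(s1.size() + s2.size());  // std::string handles the terminator itself
    res.append(s1);
    res.append(s2);
    return res;
}
```

The "-1 or +1" question from the imperative version never reaches the caller, which is the usability argument for the functional-looking interface.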
One could argue that, ignoring the implementation, the
"Functional Style" programming interface is easier
for programmers to use. Maybe so. The stringSegment interface
concentrates on being the simplest and clearest possible
interface to an implementation, but the best design for an
implementation interface and the best design for a programming
interface might be incompatible. Therefore, depending on how
often user plug-ins access stringSegment directly, we might
also need to design a programming interface that uses a private
stringSegment to get at the implementation.
Also, bear in mind that one of the fundamental data access
premises is that the user has a data container with properties
that may be written. Therefore, a mutable interface to strings
is required. This certainly does not preclude throwing the internal
storage for an old string away when a new string is
written, should that turn out to be the best implementation.
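
A minimal sketch of how the two views can coexist, assuming a hypothetical StringProperty class (not part of dataAccess) built on std::shared_ptr. The container is writable from the outside, but each write installs a fresh immutable buffer instead of modifying the old one:

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical property container: mutable from the outside,
// immutable storage on the inside.
class StringProperty {
public:
    explicit StringProperty(std::string v = std::string())
        : value_(std::make_shared<const std::string>(std::move(v))) {}

    // A write never touches existing storage. It installs a new buffer,
    // and the old one is freed once the last reader lets go of it.
    void write(std::string v) {
        value_ = std::make_shared<const std::string>(std::move(v));
    }

    // Readers get a reference-counted snapshot that later writes cannot change.
    std::shared_ptr<const std::string> read() const { return value_; }

private:
    std::shared_ptr<const std::string> value_;
};
```

A reader holding a snapshot keeps seeing the old value after a write, and the old storage is discarded when that snapshot is released.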
I agree that your constant-time internal implementation based on
careful maintenance of reference counts might be very efficient,
but I don't see that the stringSegment interface precludes that
implementation.
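
For reference, constant-time concatenation over shared, non-contiguous storage could look roughly like the sketch below. It assumes a rope-like tree with std::shared_ptr doing the reference counting; the Node, makeLeaf, ropeConcat, and flatten names are hypothetical and the thread does not specify the actual representation. Only a small metadata node is allocated per concatenation, and no character data is copied until a contiguous view is requested:

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable rope node: either a leaf holding text,
// or an inner node that shares two existing ropes.
struct Node {
    std::shared_ptr<const Node> left;
    std::shared_ptr<const Node> right;
    std::string leaf;    // used only when left and right are null
    std::size_t length;  // total number of bytes below this node
};
using Rope = std::shared_ptr<const Node>;

Rope makeLeaf(std::string text) {
    std::size_t n = text.size();
    return std::make_shared<const Node>(Node{nullptr, nullptr, std::move(text), n});
}

// Constant time: allocates one metadata node and copies no characters.
Rope ropeConcat(const Rope& a, const Rope& b) {
    return std::make_shared<const Node>(
        Node{a, b, std::string(), a->length + b->length});
}

// Copying happens only here, when a contiguous string is needed.
void flattenInto(const Rope& r, std::string& out) {
    if (!r->left) { out += r->leaf; return; }
    flattenInto(r->left, out);
    flattenInto(r->right, out);
}

std::string flatten(const Rope& r) {
    std::string out;
    out.reserve(r->length);
    flattenInto(r, out);
    return out;
}
```

Nothing in this sketch requires the public interface to be immutable; it only requires that the storage nodes themselves are never modified after construction.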
> Another advantage of functional/immutable strings is that support for
> unicode encodings is a lot easier and less error-prone. For instance,
> since a UTF-8 character may be longer than one byte, a UTF-8 encoded
> string should never be written to at an arbitrary byte index. With
> immutable strings it is much easier to maintain such invariants.
And the internal implementation under dataAccess could also employ
such optimizations when concatenating strings, should it arrange its
storage the same way.
The stringSegment interface *is* indexable by the stream element.
A stream element could be mapped by the implementation to a
"UTF-8 character longer than one byte". The stream maintains a
current position which would always be placed at a UTF-8 character
boundary. So when reading or writing a sequence of tokens the
overhead is low. When moving the index, it would of course be
necessary to scan the UTF-8 tokens one by one, but that can't
be avoided by any UTF-8 implementation with random access by
token index.
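
The cost of moving the index can be sketched as follows, assuming well-formed UTF-8 held in a std::string (the helper name byteOffsetOfCodePoint is hypothetical). Finding token n means walking the bytes from the start and counting lead bytes, which is linear in the distance moved:

```cpp
#include <cstddef>
#include <string>

// Byte offset at which the code point with index n starts, or
// utf8.size() if the string holds n or fewer code points.
// Assumes well-formed UTF-8: every byte of the form 10xxxxxx is a
// continuation byte, and every other byte starts a new code point.
std::size_t byteOffsetOfCodePoint(const std::string& utf8, std::size_t n) {
    std::size_t offset = 0;
    while (offset < utf8.size()) {
        bool isContinuation =
            (static_cast<unsigned char>(utf8[offset]) & 0xC0) == 0x80;
        if (!isContinuation) {
            if (n == 0) return offset;  // reached the requested code point
            --n;                        // skip this code point
        }
        ++offset;
    }
    return utf8.size();
}
```

Sequential reads and writes avoid this scan entirely, because the current position is already known to sit on a boundary.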
I guess you have to ask whether random access is useful or not.
If it is useful, then it *is* a bit less efficient with a UTF-8
implementation. That can't be avoided. Otherwise, if it's not
needed, or we don't want to implement it, then we could drop
that feature from the interface.
Jeff
> -----Original Message-----
> From: Benjamin Franksen [mailto:[email protected]]
> Sent: Wednesday, March 02, 2005 5:34 AM
> To: Jeff Hill
> Cc: 'Eric Norum'; 'Ralph Lange'; 'Matej Sekoranja'; 'Marty
> Kraimer'; 'Andrew Johnson'; 'Ken Evans'; 'Bob Dalesio'; 'Kasemir, Kay'
> Subject: Re: memory management
>
>
> On Wednesday 02 March 2005 01:54, Benjamin Franksen wrote:
> > An implementation based on non-contiguous storage, could take
> > advantage of its storage model, and almost completely avoid copying
> > (at the cost of slightly increasing the overall memory footprint).
> > For instance, functional concatenation can be done in constant time
> > (avoiding all allocation and copying). As long as strings are
> > immutable and references are properly tracked, an implementation can
> > easily share the storage between different strings (except the meta
> > data). I would bet that such an implementation is in the end a lot
> > more efficient than any implementation based on mutability, such as
> > imposed by the dataAccess string interface.
>
> Another advantage of functional/immutable strings is that support for
> unicode encodings is a lot easier and less error-prone. For instance,
> since a UTF-8 character may be longer than one byte, a UTF-8 encoded
> string should never be written to at an arbitrary byte index. With
> immutable strings it is much easier to maintain such invariants.
>
> The burden, in this case, would be with string analyzing
> functions such as a generic 'split' function that turns a string
> into a pair (front,back) of strings according to some character or
> substring predicate that determines the split position. Note that
> such a function would need to traverse the string character by
> character anyway (to find the split position). Thus, observing
> UTF-8 character boundaries would cause almost no additional overhead.
>
> Ben
>