Red is growing up fast, even if just born two weeks ago! It is time we implement basic string support so we can do our first, real, hello-word. ;-)
Red strings will natively support Unicode. In order to achieve that in an efficient and cross-platform way, we need a good plan. Here is the list of Unicode native formats used by our main target platforms API:
Windows : UTF-16 Linux : UTF-8 MacOSX/Cocoa : UTF-16 MacOSX/Darwin : UTF-8 Java : UTF-16 .Net : UTF-16 Javascript : UTF-8 Syllable : UTF-8
All these formats are variable-width encodings, requiring any indexed access to pay the cost of walking through the string.
Fortunately, there are also fixed-width Unicode encodings that can be used to give us back constant time for indexed accesses. So, in order to make it the most space-efficient, Red strings will internally support only these encoding formats:
Latin-1 (1 byte/codepoint) UCS-2 (2 bytes/codepoint) UCS-4 (4 bytes/codepoint)
This is not something new, at least Python 3.3 does it in the same way.
Additionally, UTF-8 and UTF-16 codecs will be supported, in order to deal with I/O accesses on host platforms.
Red will use UTF-8 for exchanging strings with outer world by default, except when accessing a UTF-16 API is necessary. Conversion for input and output strings will be done on-the-fly between one of the internal representation and UTF-8/UTF-16. When reading an input string, Red will select the most space-efficient internal format depending on highest codepoint in the input string. Also users should be able to force the encoding of a string to a given internal format, when possible.
So far, this is the plan for additing Unicode to Red, a prototype implementation will be done quickly, so we can fine-tune it if required.
Comments and suggestions are welcome.