This package provides the basic types and methods for dealing with character sets, encodings, and character set encodings.
Currently, there are the following types:

* `CodeUnitTypes`: a Union of the 3 code unit types (`UInt8`, `UInt16`, `UInt32`), for convenience
* `CharSet`: a struct type, parameterized by the name of the character set and the type needed to represent a code point
* `Encoding`: a struct type, parameterized by the name of the encoding

The following character sets are defined:

* `Binary`: for storing non-textual data as a sequence of bytes, 0-0xff
* `ASCII`: ASCII (Unicode subset, 0-0x7f)
* `Latin`: Latin-1 (ISO-8859-1) (Unicode subset, 0-0xff)
* `UCS2`: UCS-2 (Unicode subset, 0-0xd7ff, 0xe000-0xffff, BMP only, no surrogates)
* `UTF32`: UTF-32 (full Unicode, 0-0xd7ff, 0xe000-0x10ffff)
* `UniPlus`: unvalidated Unicode (i.e. like `String`, can contain invalid code points)
* `Text1`: unknown 1-byte character set
* `Text2`: unknown 2-byte character set
* `Text4`: unknown 4-byte character set
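
The code point ranges listed above can be turned into a small lookup. The sketch below uses only Base Julia; `smallest_charset` is a hypothetical helper written for illustration and is not part of this package. It returns `nothing` for surrogates and for values beyond 0x10ffff, since none of the validated ranges above cover them.

```julia
# Hypothetical helper (not part of the package): name the smallest character
# set from the list above whose documented range contains the code point `cp`.
function smallest_charset(cp::Integer)
    cp < 0 && return nothing
    cp <= 0x7f && return :ASCII
    cp <= 0xff && return :Latin
    0xd800 <= cp <= 0xdfff && return nothing   # surrogates are excluded from UCS2 and UTF32
    cp <= 0xffff && return :UCS2
    cp <= 0x10ffff && return :UTF32
    return nothing                             # outside the Unicode range
end

# Print the result for one character from each range.
for c in ('A', '\u00e9', '\u20ac', '\U1f600')
    println(repr(c), " => ", smallest_charset(UInt32(c)))
end
```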
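The following encodings are defined:

* `UTF8Encoding`
* `Native1Byte`
* `Native2Byte`
* `Native4Byte`
* `NativeUTF16`
* `Swapped4Byte`
* `Swapped2Byte`
* `SwappedUTF16`
* `LE2`
* `BE2`
* `LE4`
* `BE4`
* `UTF16LE`
* `UTF16BE`
* `2Byte`
* `4Byte`
* `UTF16`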
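
The native, swapped, little-endian, and big-endian names above refer to the byte order of multi-byte code units. As a rough illustration using only Base Julia (none of these variables come from the package, and `native_is_little` is only an assumed analogue of the exported endianness flags), the same 2-byte code unit can be viewed in each order:

```julia
# Base-only illustration of byte order for a 2-byte code unit.
unit = 0x20ac                                  # U+20AC as a UInt16 code unit

# Base.ENDIAN_BOM is 0x04030201 on little-endian hosts.
native_is_little = Base.ENDIAN_BOM == 0x04030201

swapped = bswap(unit)                          # same two bytes, opposite order

# Byte sequences as they would be laid out in memory for each explicit order.
le_bytes = reinterpret(UInt8, [htol(unit)])    # little-endian layout
be_bytes = reinterpret(UInt8, [hton(unit)])    # big-endian layout

println("native little-endian: ", native_is_little)
println("swapped: ", repr(swapped))
println("LE bytes: ", collect(le_bytes), "  BE bytes: ", collect(be_bytes))
```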
The following character set encodings (CSEs) are defined:

* `BinaryCSE`, `Text1CSE`, `ASCIICSE`, `LatinCSE`
* `Text2CSE`, `UCS2CSE`
* `Text4CSE`, `UTF32CSE`
* `UTF8CSE`: `UTF32CharSet`, all valid, using `UTF8Encoding`, conforming to the Unicode Standard, i.e. no overlong encodings, surrogates, or invalid bytes.
* `RawUTF8CSE`: `UniPlusCharSet`, not validated, using `UTF8Encoding`; may have invalid sequences, overlong encodings, encoded surrogates, and characters up to 0x7fffffff.
* `UTF16CSE`: `UTF32CharSet`, all valid, using `UTF16` encoding (native order), conforming to the Unicode Standard, i.e. no out-of-order or isolated surrogates.
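
To make the difference between validated and unvalidated UTF-8 concrete, the sketch below uses Base Julia's `isvalid` on raw bytes. It does not call this package; it only shows the kinds of byte sequences that the description of `UTF8CSE` rules out and that `RawUTF8CSE` (like `String`) may contain. The variable names are illustrative.

```julia
# Byte sequences that well-formed UTF-8 excludes, checked with Base only.
wellformed = Vector{UInt8}("h\u00e9llo")   # ordinary, valid UTF-8
overlong   = UInt8[0xc0, 0x80]             # overlong 2-byte encoding of U+0000
surrogate  = UInt8[0xed, 0xa0, 0x80]       # UTF-8-style encoding of surrogate 0xd800

for (name, bytes) in (("wellformed", wellformed),
                      ("overlong", overlong),
                      ("surrogate", surrogate))
    println(name, ": valid UTF-8? ", isvalid(String, bytes))
end

# A Base String is not validated on construction, so it can hold such bytes.
raw = String(copy(overlong))
println("String holds ", ncodeunits(raw), " code units, valid? ", isvalid(raw))
```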
The following underscore-prefixed CSEs indicate which range of characters is present:

* `_LatinCSE`: indicates at least 1 character > 0x7f, all <= 0xff
* `_UCS2CSE`: indicates at least 1 character > 0xff, all <= 0xffff
* `_UTF32CSE`: indicates at least 1 non-BMP character

The `cse` function returns the character set encoding for a string type or string. It returns `RawUTF8CSE` as a fallback for `AbstractString` (i.e. the same as for `String`).

The `charset` function returns the character set for a string type, string, character type, or character.

The `encoding` function returns the encoding for a type or string.

The `codeunit` function returns the code unit type used for a character set encoding.

The `cs"..."` string macro creates a `CharSet` type with that name.

The `enc"..."` string macro creates an `Encoding` type with that name.

The `@cse(cs, enc)` macro creates a character set encoding with the given character set and encoding.

The package also exports the helpful constant `Bool` flags `BIG_ENDIAN` and `LITTLE_ENDIAN`.
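
The idea of types parameterized by a name, with accessor functions that recover the character set and encoding, can be sketched in plain Julia. Everything below is a hypothetical stand-in written for illustration (all `Demo...` and `demo_...` names are invented here); it is not the package's actual definitions and may differ from them in detail.

```julia
# Hypothetical stand-ins, NOT the package's own types.
struct DemoCharSet{Name, CodePoint} end   # name plus the type that holds a code point
struct DemoEncoding{Name} end             # name of the encoding
struct DemoCSE{CS, ENC} end               # pairs a character set with an encoding

const DemoUTF32Set = DemoCharSet{:UTF32, UInt32}
const DemoUTF8Enc  = DemoEncoding{:UTF8}
const DemoUTF8CSE  = DemoCSE{DemoUTF32Set, DemoUTF8Enc}

# Accessors dispatch on the type parameters, so no instances are needed.
demo_charset(::Type{DemoCSE{CS, ENC}}) where {CS, ENC} = CS
demo_encoding(::Type{DemoCSE{CS, ENC}}) where {CS, ENC} = ENC
demo_codepoint_type(::Type{DemoCharSet{Name, T}}) where {Name, T} = T

println(demo_charset(DemoUTF8CSE))
println(demo_encoding(DemoUTF8CSE))
println(demo_codepoint_type(DemoUTF32Set))
```

Because all the information lives in type parameters, lookups like these can be resolved by dispatch at compile time rather than by inspecting values at run time.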