std.uni
The std.uni module provides an implementation of fundamental Unicode algorithms and data structures. This doesn't include UTF encoding and decoding primitives; see std.utf.decode and std.utf.encode in std.utf for this functionality.
Category | Functions |
---|---|
Decode | byCodePoint byGrapheme decodeGrapheme graphemeStride |
Comparison | icmp sicmp |
Classification | isAlpha isAlphaNum isCodepointSet isControl isFormat isGraphical isIntegralPair isMark isNonCharacter isNumber isPrivateUse isPunctuation isSpace isSurrogate isSurrogateHi isSurrogateLo isSymbol isWhite |
Normalization | NFC NFD NFKC NFKD NormalizationForm normalize |
Decompose | decompose decomposeHangul UnicodeDecomposition |
Compose | compose composeJamo |
Sets | CodepointInterval CodepointSet InversionList unicode |
Trie | codepointSetTrie CodepointSetTrie codepointTrie CodepointTrie toTrie toDelegate |
Casing | asCapitalized asLowerCase asUpperCase isLower isUpper toLower toLowerInPlace toUpper toUpperInPlace |
Utf8Matcher | isUtfMatcher MatcherConcept utfMatcher |
Separators | lineSep nelSep paraSep |
Building blocks | allowedIn combiningClass Grapheme |
All primitives listed operate on Unicode characters and sets of characters. For functions which operate on ASCII characters and ignore Unicode characters, see std.ascii. For definitions of Unicode character, code point and other terms used throughout this module see the terminology section below.
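To make the difference between the two modules concrete, here is a minimal sketch comparing same-named functions from std.ascii and std.uni. The choice of the Cyrillic letter 'я' as the test character is only for illustration.

```d
static import std.ascii;
static import std.uni;

void main()
{
    // Both modules agree on ASCII letters.
    assert(std.ascii.isAlpha('a'));
    assert(std.uni.isAlpha('a'));

    // Cyrillic 'я' (U+044F) is a letter only for the Unicode-aware version.
    assert(!std.ascii.isAlpha('я'));
    assert(std.uni.isAlpha('я'));

    // The same holds for case conversion: std.ascii leaves it untouched.
    assert(std.ascii.toUpper('я') == 'я');
    assert(std.uni.toUpper('я') == 'Я');
}
```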
The focus of this module is the core needs of developing Unicode-aware applications. To that effect it provides the following optimized primitives:
- Character classification by category and common properties: isAlpha, isWhite and others.
- Case-insensitive string comparison (sicmp, icmp).
- Converting text to any of the four normalization forms via normalize.
- Decoding (decodeGrapheme) and iteration (byGrapheme, graphemeStride) by user-perceived characters, that is by Grapheme clusters.
- Decomposing and composing of individual character(s) according to canonical or compatibility rules, see compose and decompose, including the specific version for Hangul syllables composeJamo and decomposeHangul.
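As a quick tour of the primitives listed above, the following sketch exercises classification, case-insensitive comparison and composition. The sample strings are illustrative choices, not part of any required usage.

```d
import std.uni;

void main()
{
    // Classification by category and common properties.
    assert(isAlpha('я'));
    assert(isWhite('\u2028')); // line separator counts as white space
    assert(!isWhite('x'));

    // Case-insensitive comparison: sicmp applies simple case folding,
    // icmp applies full case folding (so 'ß' matches "ss").
    assert(sicmp("Август", "авгусТ") == 0);
    assert(icmp("Rußland", "Russland") == 0);

    // Canonical composition of a base letter and a combining mark.
    assert(compose('A', '\u0308') == '\u00C4'); // Ä
    assert(compose('A', 'B') == dchar.init);    // this pair does not compose
}
```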
It's recognized that an application may need further enhancements and extensions, such as less commonly known algorithms, or tailoring existing ones for region specific needs. To help users with building any extra functionality beyond the core primitives, the module provides:
- CodepointSet, a type for easy manipulation of sets of characters. Besides the typical set algebra it provides an unusual feature: a D source code generator for detection of code points in this set. This is a boon for meta-programming parser frameworks, and is used internally to power classification in small sets like isWhite.
- A way to construct optimal packed multi-stage tables, also known as a special case of Trie. The functions codepointTrie, codepointSetTrie construct custom tries that map dchar to value. The end result is a fast and predictable O(1) lookup that powers functions like isAlpha and combiningClass, but for user-defined data sets.
- A useful technique for Unicode-aware parsers that perform character classification of encoded code points is to avoid unnecessary decoding at all costs. utfMatcher provides an improvement over the usual workflow of decode-classify-process, combining the decoding and classification steps. By extracting necessary bits directly from encoded code units, matchers achieve significant performance improvements. See MatcherConcept for the common interface of UTF matchers.
- Generally useful building blocks for customized normalization: combiningClass for querying combining class and allowedIn for testing the Quick_Check property of a given normalization form.
- Access to a large selection of commonly used sets of code points. Supported sets include Script, Block and General Category. The exact contents of a set can be observed in the CLDR utility, on the property index page of the Unicode website. See unicode for easy and (optionally) compile-time checked set queries.
Synopsis
import std.algorithm.searching : find;
import std.uni;
void main()
{
// initialize code point sets using script/block or property name
// now 'set' contains code points from both scripts.
auto set = unicode("Cyrillic") | unicode("Armenian");
// same thing but simpler and checked at compile-time
auto ascii = unicode.ASCII;
auto currency = unicode.Currency_Symbol;
// easy set ops
auto a = set & ascii;
assert(a.empty); // as it has no intersection with ascii
a = set | ascii;
auto b = currency - a; // subtract all ASCII, Cyrillic and Armenian
// some properties of code point sets
assert(b.length > 45); // 46 items in Unicode 6.1, even more in 6.2
// testing presence of a code point in a set
// is just fine, it is O(logN)
assert(!b['$']);
assert(!b['\u058F']); // Armenian dram sign
assert(b['¥']);
// building fast lookup tables, these guarantee O(1) complexity
// 1-level Trie lookup table essentially a huge bit-set ~262Kb
auto oneTrie = toTrie!1(b);
// 2-level far more compact but typically slightly slower
auto twoTrie = toTrie!2(b);
// 3-level even smaller, and a bit slower yet
auto threeTrie = toTrie!3(b);
assert(oneTrie['£']);
assert(twoTrie['£']);
assert(threeTrie['£']);
// build the trie with the most sensible trie level
// and bind it as a functor
auto cyrillicOrArmenian = toDelegate(set);
auto balance = find!(cyrillicOrArmenian)("Hello ընկեր!");
assert(balance == "ընկեր!");
// compatible with bool delegate(dchar)
bool delegate(dchar) bindIt = cyrillicOrArmenian;
// Normalization
string s = "Plain ascii (and not only), is always normalized!";
assert(s is normalize(s));// is the same string
string nonS = "A\u0308ffin"; // 'A' followed by a combining diaeresis
auto nS = normalize(nonS); // to NFC, the W3C endorsed standard
assert(nS == "Äffin");
assert(nS != nonS);
string composed = "Äffin";
assert(normalize!NFD(composed) == "A\u0308ffin");
// to NFKD, compatibility decomposition useful for fuzzy matching/searching
assert(normalize!NFKD("2¹⁰") == "210");
}
Terminology
The following is a list of important Unicode notions and definitions. Any conventions used specifically in this module alone are marked as such. The descriptions are based on the formal definition as found in chapter three of The Unicode Standard Core Specification.
Abstract character
A unit of information used for the organization, control, or representation of textual data. Note that:
- When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, visual).
- An abstract character has no concrete form and should not be confused with a glyph.
- An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a Grapheme.
- The abstract characters encoded (see Encoded character) are known as Unicode abstract characters.
- Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.
char), 16-bit code units in the UTF-16 (wchar), and 32-bit code units in the UTF-32 (dchar). Note that in UTF-32, a code unit is a code point and is represented by the D dchar type.
Combining character
A character with the General Category of Combining Mark(M).
- All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero combining class.
- These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
- The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
- A grapheme cluster typically starts with a grapheme base and then extends across any subsequent sequence of nonspacing marks. A grapheme cluster is most directly relevant to text rendering and processes such as cursor placement and text selection in editing, but may also be relevant to comparison and searching.
- For many processes, a grapheme cluster behaves as if it was a single character with the same properties as its grapheme base. Effectively, nonspacing marks apply graphically to the base, but do not change its properties.
This module defines a number of primitives that work with graphemes: Grapheme, decodeGrapheme and graphemeStride. All of them are using extended grapheme boundaries as defined in the aforementioned standard annex.
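The grapheme primitives can be illustrated with a short sketch. The helper name graphemeCount is hypothetical and only wraps byGrapheme for the example; the sample word contains an 'e' followed by a combining diaeresis (U+0308).

```d
import std.range.primitives : walkLength;
import std.uni : byGrapheme, decodeGrapheme, graphemeStride;

// Hypothetical helper: number of user-perceived characters in a string.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string word = "noe\u0308l";

    assert(word.walkLength == 5);     // five code points
    assert(graphemeCount(word) == 4); // but four grapheme clusters

    // The cluster starting at index 2 spans 'e' (1 code unit)
    // plus U+0308 (2 UTF-8 code units).
    assert(graphemeStride(word, 2) == 3);

    // decodeGrapheme consumes one whole cluster from the front.
    string rest = word[2 .. $];
    auto g = decodeGrapheme(rest);
    assert(g.length == 2); // two code points in the cluster
    assert(rest == "l");
}
```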
Normalization
The concepts of canonical equivalent or compatibility equivalent characters in the Unicode Standard make it necessary to have a full, formal definition of equivalence for Unicode strings. String equivalence is determined by a process called normalization, whereby strings are converted into forms that can be compared directly for identity. This is the primary goal of the normalization process; see the function normalize to convert into any of the four defined forms.
A very important attribute of the Unicode Normalization Forms is that they must remain stable between versions of the Unicode Standard. A Unicode string normalized to a particular Unicode Normalization Form in one version of the standard is guaranteed to remain in that Normalization Form for implementations of future versions of the standard.
The Unicode Standard specifies four normalization forms. Informally, two of these forms are defined by maximal decomposition of equivalent sequences, and two of these forms are defined by maximal composition of equivalent sequences.
- Normalization Form D (NFD): The canonical decomposition of a character sequence.
- Normalization Form KD (NFKD): The compatibility decomposition of a character sequence.
- Normalization Form C (NFC): The canonical composition of the canonical decomposition of a coded character sequence.
- Normalization Form KC (NFKC): The canonical composition of the compatibility decomposition of a character sequence.
The choice of the normalization form depends on the particular use case. NFC is the best form for general text, since it's more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns. NFD and NFKD are the most useful for internal processing.
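The four forms can be compared side by side with normalize. This is a minimal sketch; the helper canonicallyEqual is a hypothetical name introduced here to show equivalence testing through NFC.

```d
import std.uni;

// Hypothetical helper: two strings are canonically equivalent
// when their NFC forms are identical.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string decomposed = "A\u0308"; // 'A' + combining diaeresis
    string composed = "\u00C4";    // precomposed 'Ä'

    // Different code point sequences, same abstract text.
    assert(decomposed != composed);
    assert(canonicallyEqual(decomposed, composed));

    // Canonical forms convert between the two spellings.
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFD(composed) == decomposed);

    // Compatibility forms additionally fold formatting variants.
    assert(normalize!NFKC("\uFB01") == "fi"); // 'fi' ligature
    assert(normalize!NFKD("2¹⁰") == "210");   // superscript digits
}
```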
Construction of lookup tables
The Unicode standard describes a set of algorithms that depend on having the ability to quickly look up various properties of a code point. Given the codespace of about 1 million code points, it is not a trivial task to provide a space-efficient solution for the multitude of properties.
Common approaches such as hash-tables or binary search over sorted code point intervals (as in InversionList) are insufficient. Hash-tables have an enormous memory footprint and binary search over intervals is not fast enough for some heavy-duty algorithms.
The recommended solution (see Unicode Implementation Guidelines) is using multi-stage tables that are an implementation of the Trie data structure with integer keys and a fixed number of stages. For the remainder of the section this will be called a fixed trie. The following describes a particular implementation that is aimed for the speed of access at the expense of ideal size savings.
Taking a 2-level Trie as an example the principle of operation is as follows. Split the number of bits in a key (code point, 21 bits) into 2 components (e.g. 13 and 8). The first is the number of bits in the index of the trie and the other is the number of bits in each page of the trie. The layout of the trie is then an array of size 2^^bits-of-index followed by an array of memory chunks of size 2^^bits-of-page/bits-per-element.
The number of pages is variable (but not less than 1) unlike the number of entries in the index. The slots of the index all have to contain a number of a page that is present. The lookup is then just a couple of operations - slice the upper bits, look up an index for these, take a page at this index and use the lower bits as an offset within this page.
Assuming that pages are laid out consecutively in one array at pages, the pseudo-code is:
auto elemsPerPage = (2 ^^ bits_per_page) / Value.sizeOfInBits;
pages[index[n >> bits_per_page]][n & (elemsPerPage - 1)];
Where if elemsPerPage is a power of 2 the whole process is a handful of simple instructions and 2 array reads. Subsequent levels of the trie are introduced by recursing on this notion - the index array is treated as values. The number of bits in index is then again split into 2 parts, with pages over 'current-index' and the new 'upper-index'.
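The two-stage lookup described above can be sketched by hand. This is an illustration of the principle only, not the module's implementation: the names TwoStageTable and build are hypothetical, values are stored one per byte rather than bit-packed, and the split is 13 index bits and 8 page bits.

```d
import std.uni : isAlpha;

enum pageBits = 8;             // low bits: offset within a page
enum pageSize = 1 << pageBits; // 256 entries per page
enum codeSpace = 0x110000;     // number of code points

// Hypothetical hand-rolled two-stage table (not the std.uni Trie type).
struct TwoStageTable
{
    ushort[] index; // page number for each block of pageSize code points
    ubyte[] pages;  // distinct pages stored back to back, one byte per entry

    bool opIndex(dchar ch) const
    {
        // Upper bits select the page, lower bits the slot inside it.
        immutable page = index[ch >> pageBits];
        return pages[page * pageSize + (ch & (pageSize - 1))] != 0;
    }
}

TwoStageTable build(alias pred)()
{
    TwoStageTable t;
    ubyte[pageSize] buf;
    foreach (uint block; 0 .. codeSpace / pageSize)
    {
        foreach (uint i; 0 .. pageSize)
            buf[i] = pred(cast(dchar)(block * pageSize + i)); // bool becomes 0 or 1

        // Reuse an identical page if one was already stored.
        size_t page = 0;
        immutable stored = t.pages.length / pageSize;
        while (page < stored
            && t.pages[page * pageSize .. (page + 1) * pageSize] != buf[])
            ++page;
        if (page == stored)
            t.pages ~= buf[];
        t.index ~= cast(ushort) page;
    }
    return t;
}

void main()
{
    auto table = build!isAlpha();
    assert(table['a'] && table['я']);
    assert(!table['1'] && !table['$']);
    // Merging identical pages stores fewer pages than there are blocks.
    assert(table.pages.length / pageSize < codeSpace / pageSize);
}
```

Merging identical pages is what keeps the table small: large unassigned or uniform stretches of the codespace all point at the same page.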
For completeness a level 1 trie is simply an array. The current implementation takes advantage of bit-packing values when the range is known to be limited in advance (such as bool). See also BitPacked for enforcing it manually. The major size advantage however comes from the fact that multiple identical pages on every level are merged by construction.
The process of constructing a trie is more involved and is hidden from the user in a form of the convenience functions codepointTrie, codepointSetTrie and the even more convenient toTrie. In general a set or built-in AA with dchar type can be turned into a trie. The trie object in this module is read-only (immutable); it's effectively frozen after construction.
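A short sketch of the convenience functions applied to a set follows. The (8, 5, 8) split is an arbitrary illustrative choice; any split that sums to 21 bits should be accepted.

```d
import std.uni;

void main()
{
    auto cyrillic = unicode.Cyrillic;

    // Let toTrie pick the bit split for a 2-level trie.
    auto simple = toTrie!2(cyrillic);

    // Or choose the bits per level explicitly; they must add up to 21.
    auto custom = codepointSetTrie!(8, 5, 8)(cyrillic);

    foreach (dchar ch; "яЯzZ$")
    {
        // Both tries answer the same membership question as the set.
        assert(simple[ch] == cyrillic[ch]);
        assert(custom[ch] == cyrillic[ch]);
    }
    assert(simple['я'] && !simple['z']);
}
```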
Unicode properties
This is a full list of Unicode properties accessible through unicode with specific helpers per category nested within. Consult the CLDR utility when in doubt about the contents of a particular set.
General category sets listed below are only accessible with the unicode shorthand accessor.
Abb. | Long form | Abb. | Long form | Abb. | Long form |
---|---|---|---|---|---|
L | Letter | Cn | Unassigned | Po | Other_Punctuation |
Ll | Lowercase_Letter | Co | Private_Use | Ps | Open_Punctuation |
Lm | Modifier_Letter | Cs | Surrogate | S | Symbol |
Lo | Other_Letter | N | Number | Sc | Currency_Symbol |
Lt | Titlecase_Letter | Nd | Decimal_Number | Sk | Modifier_Symbol |
Lu | Uppercase_Letter | Nl | Letter_Number | Sm | Math_Symbol |
M | Mark | No | Other_Number | So | Other_Symbol |
Mc | Spacing_Mark | P | Punctuation | Z | Separator |
Me | Enclosing_Mark | Pc | Connector_Punctuation | Zl | Line_Separator |
Mn | Nonspacing_Mark | Pd | Dash_Punctuation | Zp | Paragraph_Separator |
C | Other | Pe | Close_Punctuation | Zs | Space_Separator |
Cc | Control | Pf | Final_Punctuation | - | Any |
Cf | Format | Pi | Initial_Punctuation | - | ASCII |
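As a small sketch of how the names in the table above are used, the abbreviated and long forms are expected to select the same set, and a name can also be supplied as a run-time string.

```d
import std.uni;

void main()
{
    // Abbreviated and long names select the same set.
    assert(unicode.Lu == unicode.Uppercase_Letter);
    assert(unicode.Nd == unicode.Decimal_Number);

    // Names can also be given at run time.
    auto digits = unicode("Nd");
    assert(digits['7'] && !digits['x']);

    assert(unicode.Lu['Я'] && !unicode.Lu['я']);
}
```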
Sets for other commonly useful properties that are accessible with unicode:
Name | Name | Name |
---|---|---|
Alphabetic | Ideographic | Other_Uppercase |
ASCII_Hex_Digit | IDS_Binary_Operator | Pattern_Syntax |
Bidi_Control | ID_Start | Pattern_White_Space |
Cased | IDS_Trinary_Operator | Quotation_Mark |
Case_Ignorable | Join_Control | Radical |
Dash | Logical_Order_Exception | Soft_Dotted |
Default_Ignorable_Code_Point | Lowercase | STerm |
Deprecated | Math | Terminal_Punctuation |
Diacritic | Noncharacter_Code_Point | Unified_Ideograph |
Extender | Other_Alphabetic | Uppercase |
Grapheme_Base | Other_Default_Ignorable_Code_Point | Variation_Selector |
Grapheme_Extend | Other_Grapheme_Extend | White_Space |
Grapheme_Link | Other_ID_Continue | XID_Continue |
Hex_Digit | Other_ID_Start | XID_Start |
Hyphen | Other_Lowercase | |
ID_Continue | Other_Math |
Below is the table with block names accepted by unicode.block. Note that the shorthand version unicode requires "In" to be prepended to the names of blocks so as to disambiguate scripts and blocks.
Aegean Numbers | Ethiopic Extended | Mongolian |
Alchemical Symbols | Ethiopic Extended-A | Musical Symbols |
Alphabetic Presentation Forms | Ethiopic Supplement | Myanmar |
Ancient Greek Musical Notation | General Punctuation | Myanmar Extended-A |
Ancient Greek Numbers | Geometric Shapes | New Tai Lue |
Ancient Symbols | Georgian | NKo |
Arabic | Georgian Supplement | Number Forms |
Arabic Extended-A | Glagolitic | Ogham |
Arabic Mathematical Alphabetic Symbols | Gothic | Ol Chiki |
Arabic Presentation Forms-A | Greek and Coptic | Old Italic |
Arabic Presentation Forms-B | Greek Extended | Old Persian |
Arabic Supplement | Gujarati | Old South Arabian |
Armenian | Gurmukhi | Old Turkic |
Arrows | Halfwidth and Fullwidth Forms | Optical Character Recognition |
Avestan | Hangul Compatibility Jamo | Oriya |
Balinese | Hangul Jamo | Osmanya |
Bamum | Hangul Jamo Extended-A | Phags-pa |
Bamum Supplement | Hangul Jamo Extended-B | Phaistos Disc |
Basic Latin | Hangul Syllables | Phoenician |
Batak | Hanunoo | Phonetic Extensions |
Bengali | Hebrew | Phonetic Extensions Supplement |
Block Elements | High Private Use Surrogates | Playing Cards |
Bopomofo | High Surrogates | Private Use Area |
Bopomofo Extended | Hiragana | Rejang |
Box Drawing | Ideographic Description Characters | Rumi Numeral Symbols |
Brahmi | Imperial Aramaic | Runic |
Braille Patterns | Inscriptional Pahlavi | Samaritan |
Buginese | Inscriptional Parthian | Saurashtra |
Buhid | IPA Extensions | Sharada |
Byzantine Musical Symbols | Javanese | Shavian |
Carian | Kaithi | Sinhala |
Chakma | Kana Supplement | Small Form Variants |
Cham | Kanbun | Sora Sompeng |
Cherokee | Kangxi Radicals | Spacing Modifier Letters |
CJK Compatibility | Kannada | Specials |
CJK Compatibility Forms | Katakana | Sundanese |
CJK Compatibility Ideographs | Katakana Phonetic Extensions | Sundanese Supplement |
CJK Compatibility Ideographs Supplement | Kayah Li | Superscripts and Subscripts |
CJK Radicals Supplement | Kharoshthi | Supplemental Arrows-A |
CJK Strokes | Khmer | Supplemental Arrows-B |
CJK Symbols and Punctuation | Khmer Symbols | Supplemental Mathematical Operators |
CJK Unified Ideographs | Lao | Supplemental Punctuation |
CJK Unified Ideographs Extension A | Latin-1 Supplement | Supplementary Private Use Area-A |
CJK Unified Ideographs Extension B | Latin Extended-A | Supplementary Private Use Area-B |
CJK Unified Ideographs Extension C | Latin Extended Additional | Syloti Nagri |
CJK Unified Ideographs Extension D | Latin Extended-B | Syriac |
Combining Diacritical Marks | Latin Extended-C | Tagalog |
Combining Diacritical Marks for Symbols | Latin Extended-D | Tagbanwa |
Combining Diacritical Marks Supplement | Lepcha | Tags |
Combining Half Marks | Letterlike Symbols | Tai Le |
Common Indic Number Forms | Limbu | Tai Tham |
Control Pictures | Linear B Ideograms | Tai Viet |
Coptic | Linear B Syllabary | Tai Xuan Jing Symbols |
Counting Rod Numerals | Lisu | Takri |
Cuneiform | Low Surrogates | Tamil |
Cuneiform Numbers and Punctuation | Lycian | Telugu |
Currency Symbols | Lydian | Thaana |
Cypriot Syllabary | Mahjong Tiles | Thai |
Cyrillic | Malayalam | Tibetan |
Cyrillic Extended-A | Mandaic | Tifinagh |
Cyrillic Extended-B | Mathematical Alphanumeric Symbols | Transport And Map Symbols |
Cyrillic Supplement | Mathematical Operators | Ugaritic |
Deseret | Meetei Mayek | Unified Canadian Aboriginal Syllabics |
Devanagari | Meetei Mayek Extensions | Unified Canadian Aboriginal Syllabics Extended |
Devanagari Extended | Meroitic Cursive | Vai |
Dingbats | Meroitic Hieroglyphs | Variation Selectors |
Domino Tiles | Miao | Variation Selectors Supplement |
Egyptian Hieroglyphs | Miscellaneous Mathematical Symbols-A | Vedic Extensions |
Emoticons | Miscellaneous Mathematical Symbols-B | Vertical Forms |
Enclosed Alphanumerics | Miscellaneous Symbols | Yijing Hexagram Symbols |
Enclosed Alphanumeric Supplement | Miscellaneous Symbols and Arrows | Yi Radicals |
Enclosed CJK Letters and Months | Miscellaneous Symbols And Pictographs | Yi Syllables |
Enclosed Ideographic Supplement | Miscellaneous Technical | |
Ethiopic | Modifier Tone Letters |
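The "In" prefix rule for block names can be seen in a short sketch. Arabic is used here because the same name denotes both a block and a script; U+0601 is just one sample code point that falls into both.

```d
import std.uni;

void main()
{
    // unicode.block makes the intent explicit; the shorthand needs "In".
    assert(unicode.block.Greek_and_Coptic == unicode.InGreek_and_Coptic);

    auto arabicScript = unicode.script.arabic;
    auto arabicBlock = unicode.block.arabic;

    // The script and the block overlap ...
    assert(arabicBlock['\u0601'] && arabicScript['\u0601']);
    // ... but they are different sets.
    assert(arabicBlock != arabicScript);

    // Without a prefix the shorthand resolves to the script.
    assert(unicode.Arabic == arabicScript);
    assert(unicode.InArabic == arabicBlock);
}
```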
Below is the table with script names accepted by unicode.script and by the shorthand version unicode:
Arabic | Hanunoo | Old_Italic |
Armenian | Hebrew | Old_Persian |
Avestan | Hiragana | Old_South_Arabian |
Balinese | Imperial_Aramaic | Old_Turkic |
Bamum | Inherited | Oriya |
Batak | Inscriptional_Pahlavi | Osmanya |
Bengali | Inscriptional_Parthian | Phags_Pa |
Bopomofo | Javanese | Phoenician |
Brahmi | Kaithi | Rejang |
Braille | Kannada | Runic |
Buginese | Katakana | Samaritan |
Buhid | Kayah_Li | Saurashtra |
Canadian_Aboriginal | Kharoshthi | Sharada |
Carian | Khmer | Shavian |
Chakma | Lao | Sinhala |
Cham | Latin | Sora_Sompeng |
Cherokee | Lepcha | Sundanese |
Common | Limbu | Syloti_Nagri |
Coptic | Linear_B | Syriac |
Cuneiform | Lisu | Tagalog |
Cypriot | Lycian | Tagbanwa |
Cyrillic | Lydian | Tai_Le |
Deseret | Malayalam | Tai_Tham |
Devanagari | Mandaic | Tai_Viet |
Egyptian_Hieroglyphs | Meetei_Mayek | Takri |
Ethiopic | Meroitic_Cursive | Tamil |
Georgian | Meroitic_Hieroglyphs | Telugu |
Glagolitic | Miao | Thaana |
Gothic | Mongolian | Thai |
Greek | Myanmar | Tibetan |
Gujarati | New_Tai_Lue | Tifinagh |
Gurmukhi | Nko | Ugaritic |
Han | Ogham | Vai |
Hangul | Ol_Chiki | Yi |
Below is the table of names accepted by unicode.hangulSyllableType.
Abb. | Long form |
---|---|
L | Leading_Jamo |
LV | LV_Syllable |
LVT | LVT_Syllable |
T | Trailing_Jamo |
V | Vowel_Jamo |
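A brief sketch of querying Hangul syllable types follows. The code points U+AC00 and U+AC01 are picked as examples of an LV and an LVT syllable respectively.

```d
import std.uni;

void main()
{
    // "L" here is the syllable type Leading_Jamo, not the Letter category.
    auto leading = unicode.hangulSyllableType("L");
    foreach (dchar jamo; '\u1110' .. '\u115F')
        assert(leading[jamo]);

    auto lv = unicode.hangulSyllableType.LV;
    auto lvt = unicode.hangulSyllableType.LVT;
    assert(lv != lvt);
    assert(lv['\uAC00']);  // syllable without a trailing consonant
    assert(lvt['\uAC01']); // syllable with a trailing consonant
}
```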
- References
- ASCII Table, Wikipedia, The Unicode Consortium, Unicode normalization forms, Unicode text segmentation, Unicode Implementation Guidelines, Unicode Conformance
- Trademarks
- Unicode(tm) is a trademark of Unicode, Inc.
- License:
- Boost License 1.0.
- Authors:
- Dmitry Olshansky
- Source
- std/uni/package.d
- Standards:
- Unicode v6.2
- enum dchar lineSep;
-
Constant code point (0x2028) - line separator.
- enum dchar paraSep;
-
Constant code point (0x2029) - paragraph separator.
- enum dchar nelSep;
-
Constant code point (0x0085) - next line.
- template isCodepointSet(T)
-
Tests if T is some kind of set of code points. Intended for template constraints.
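A minimal sketch of the intended use in a template constraint. The function countMembers is a hypothetical example, not part of std.uni.

```d
import std.uni;

// Hypothetical helper: accepts any code point set type,
// not just the CodepointSet alias.
size_t countMembers(Set)(Set set, string text)
if (isCodepointSet!Set)
{
    size_t n;
    foreach (dchar ch; text) // decode the string code point by code point
        if (set[ch])
            ++n;
    return n;
}

void main()
{
    static assert(isCodepointSet!CodepointSet);
    static assert(!isCodepointSet!(dchar[]));

    assert(countMembers(unicode.Cyrillic, "Hello, мир!") == 3);

    // A plain string is rejected by the constraint at compile time.
    static assert(!__traits(compiles, countMembers("abc", "abc")));
}
```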
- enum auto isIntegralPair(T, V = uint);
-
Tests if T is a pair of integers that implicitly convert to V. The following code must compile for any pair T:
    (T x){ V a = x[0]; V b = x[1];}
The following must not compile:
    (T x){ V c = x[2];}
- alias CodepointSet = InversionList!(GcPolicy).InversionList;
-
The recommended default type for set of code points. For details, see the current implementation: InversionList.
- struct CodepointInterval;
-
The recommended type of std.typecons.Tuple to represent [a, b) intervals of code points, as used in InversionList. Any interval type should pass the isIntegralPair trait.
- struct InversionList(SP = GcPolicy);
-
InversionList is a set of code points represented as an array of open-right [a, b) intervals (see CodepointInterval above). The name comes from the way the representation reads left to right. For instance a set of all values [10, 50), [80, 90), plus a singular value 60 looks like this:
10, 50, 60, 61, 80, 90
The way to read this is: start with negative meaning that all numbers smaller than the next one are not present in this set (and positive - the contrary). Then switch positive/negative after each number passed from left to right.
This way negative spans until 10, then positive until 50, then negative until 60, then positive until 61, and so on. As seen this provides a space-efficient storage of highly redundant data that comes in long runs, a description which Unicode character properties fit nicely. The technique itself could be seen as a variation on RLE encoding.
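The reading rule above can be written out directly. This is a sketch for illustration only: the function contains is hypothetical and scans linearly, whereas a real implementation would use binary search over the boundaries. It is checked against CodepointSet built from the same numbers.

```d
import std.uni : CodepointSet;

// Hypothetical helper: walk the boundaries left to right,
// flipping between "absent" and "present" at each one passed.
bool contains(const(uint)[] boundaries, uint value)
{
    bool present = false; // start in the negative state
    foreach (b; boundaries)
    {
        if (value < b)
            break;          // value lies before this boundary
        present = !present; // boundary passed: switch state
    }
    return present;
}

void main()
{
    immutable uint[] boundaries = [10, 50, 60, 61, 80, 90];
    auto set = CodepointSet(10, 50, 60, 61, 80, 90);

    // The hand-written reading agrees with the library type.
    foreach (uint v; 0 .. 100)
        assert(contains(boundaries, v) == set[v]);

    assert(contains(boundaries, 60) && !contains(boundaries, 61));
}
```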
Sets are value types (just like int is) thus they are never aliased.
- Example
    auto a = CodepointSet('a', 'z'+1);
    auto b = CodepointSet('A', 'Z'+1);
    auto c = a;
    a = a | b;
    assert(a == CodepointSet('A', 'Z'+1, 'a', 'z'+1));
    assert(a != c);
See also unicode for simpler construction of sets from predefined ones.
Memory usage is 8 bytes per each contiguous interval in a set. The value semantics are achieved by using the COW technique and thus it's not safe to cast this type to shared.
- Note
It's not recommended to rely on the template parameters or the exact type of a current code point set in std.uni. The type and parameters may change when the standard allocators design is finalized. Use isCodepointSet with templates or just stick with the default alias CodepointSet throughout the whole code base.
-
pure this(Set)(Set set)
Constraints: if (isCodepointSet!Set); -
Construct from another code point set of any type.
-
pure this(Range)(Range intervals)
Constraints: if (isForwardRange!Range && isIntegralPair!(ElementType!Range)); -
Construct a set from a forward range of code point intervals.
- this()(uint[] intervals...);
-
Construct a set from plain values of code point intervals.
- Examples:
-
    import std.algorithm.comparison : equal;

    auto set = CodepointSet('a', 'z'+1, 'а', 'я'+1);
    foreach (v; 'a'..'z'+1)
        assert(set[v]);
    // Cyrillic lowercase interval
    foreach (v; 'а'..'я'+1)
        assert(set[v]);
    // specific order is not required, intervals may intersect
    auto set2 = CodepointSet('а', 'я'+1, 'a', 'd', 'b', 'z'+1);
    // the same end result
    assert(set2.byInterval.equal(set.byInterval));
    // test constructor this(Range)(Range intervals)
    auto chessPiecesWhite = CodepointInterval(9812, 9818);
    auto chessPiecesBlack = CodepointInterval(9818, 9824);
    auto set3 = CodepointSet([chessPiecesWhite, chessPiecesBlack]);
    foreach (v; '♔'..'♟'+1)
        assert(set3[v]);
- @property scope auto byInterval();
-
Get range that spans all of the code point intervals in this InversionList.
- const bool opIndex(uint val);
-
Tests the presence of code point val in this set.
- Examples:
-
    auto gothic = unicode.Gothic;
    // Gothic letter ahsa
    assert(gothic['\U00010330']);
    // no ascii in Gothic obviously
    assert(!gothic['$']);
- @property size_t length();
-
Number of code points in this set
-
This opBinary(string op, U)(U rhs)
Constraints: if (isCodepointSet!U || is(U : dchar)); -
Sets support natural syntax for set algebra, namely:
Operator | Math notation | Description |
---|---|---|
& | a ∩ b | intersection |
\| | a ∪ b | union |
- | a ∖ b | subtraction |
~ | a ~ b | symmetric set difference i.e. (a ∪ b) \ (a ∩ b) |
- Examples:
-
    import std.algorithm.comparison : equal;
    import std.range : iota;
    import std.stdio : writeln;

    auto lower = unicode.LowerCase;
    auto upper = unicode.UpperCase;
    auto ascii = unicode.ASCII;
    assert((lower & upper).empty); // no intersection
    auto lowerASCII = lower & ascii;
    assert(lowerASCII.byCodepoint.equal(iota('a', 'z'+1)));
    // throw away all of the lowercase ASCII
    writeln((ascii - lower).length); // 128 - 26
    auto onlyOneOf = lower ~ ascii;
    assert(!onlyOneOf['Δ']); // not ASCII and not lowercase
    assert(onlyOneOf['$']); // ASCII and not lowercase
    assert(!onlyOneOf['a']); // ASCII and lowercase
    assert(onlyOneOf['я']); // not ASCII but lowercase
    // throw away all cased letters from ASCII
    auto noLetters = ascii - (lower | upper);
    writeln(noLetters.length); // 128 - 26 * 2
-
ref This opOpAssign(string op, U)(U rhs)
Constraints: if (isCodepointSet!U || is(U : dchar)); -
The 'op=' versions of the above overloaded operators.
-
const bool opBinaryRight(string op : "in", U)(U ch)
Constraints: if (is(U : dchar)); -
Tests the presence of code point ch in this set, the same as opIndex.
- Examples:
-
    assert('я' in unicode.Cyrillic);
    assert(!('z' in unicode.Cyrillic));
- auto opUnary(string op : "!")();
-
Obtains a set that is the inversion of this set.
- See Also:
-
inverted
- @property auto byCodepoint();
-
A range that spans each code point in this set.
- Examples:
-
    import std.algorithm.comparison : equal;
    import std.range : iota;

    auto set = unicode.ASCII;
    assert(set.byCodepoint.equal(iota(0, 0x80)));
- void toString(Writer)(scope Writer sink, ref scope const FormatSpec!char fmt);
-
Obtain a textual representation of this InversionList in form of open-right intervals.
The formatting flag is applied individually to each value, for example:
- %s and %d format the intervals as a [low .. high) range of integrals
- %x formats the intervals as a [low .. high) range of lowercase hex characters
- %X formats the intervals as a [low .. high) range of uppercase hex characters
- Examples:
-
    import std.conv : to;
    import std.format : format;
    import std.uni : unicode;

    assert(unicode.Cyrillic.to!string ==
        "[1024..1157) [1159..1320) [7467..7468) [7544..7545) [11744..11776) [42560..42648) [42655..42656)");
    // The specs '%s' and '%d' are equivalent to the to!string call above.
    assert(format("%d", unicode.Cyrillic) == unicode.Cyrillic.to!string);
    assert(format("%#x", unicode.Cyrillic) ==
        "[0x400..0x485) [0x487..0x528) [0x1d2b..0x1d2c) [0x1d78..0x1d79) [0x2de0..0x2e00) "
        ~"[0xa640..0xa698) [0xa69f..0xa6a0)");
    assert(format("%#X", unicode.Cyrillic) ==
        "[0X400..0X485) [0X487..0X528) [0X1D2B..0X1D2C) [0X1D78..0X1D79) [0X2DE0..0X2E00) "
        ~"[0XA640..0XA698) [0XA69F..0XA6A0)");
- ref auto add()(uint a, uint b);
-
Add an interval [a, b) to this set.
- Examples:
-
    CodepointSet someSet;
    someSet.add('0', '5').add('A','Z'+1);
    someSet.add('5', '9'+1);
    assert(someSet['0']);
    assert(someSet['5']);
    assert(someSet['9']);
    assert(someSet['Z']);
- @property auto inverted();
-
Obtains a set that is the inversion of this set.
See the '!' opUnary for the same but using operators.
- Examples:
-
    auto set = unicode.ASCII;
    // union with the inverse gets all of the code points in the Unicode
    assert((set | set.inverted).length == 0x110000);
    // no intersection with the inverse
    assert((set & set.inverted).empty);
- string toSourceCode(string funcName = "");
-
Generates string with D source code of unary function with name of funcName taking a single dchar argument. If funcName is empty the code is adjusted to be a lambda function.
The function generated tests if the code point passed belongs to this set or not. The result is to be used with string mixin. The intended usage area is aggressive optimization via meta programming in parser generators and the like.
- Note
- Use with care for relatively small or regular sets. It could end up being slower than just using multi-staged tables.
- Example
    import std.stdio;

    // construct set directly from [a, b) intervals
    auto set = CodepointSet(10, 12, 45, 65, 100, 200);
    writeln(set);
    writeln(set.toSourceCode("func"));
The above outputs something along the lines of:
    bool func(dchar ch) @safe pure nothrow @nogc
    {
        if (ch < 45)
        {
            if (ch == 10 || ch == 11) return true;
            return false;
        }
        else if (ch < 65) return true;
        else
        {
            if (ch < 100) return false;
            if (ch < 200) return true;
            return false;
        }
    }
- const @property bool empty();
-
True if this set doesn't contain any code points.
- Examples:
-
    CodepointSet emptySet;
    assert(emptySet.length == 0);
    assert(emptySet.empty);
- template codepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)
-
A shorthand for creating a custom multi-level fixed Trie from a CodepointSet. sizes are numbers of bits per level, with the most significant bits used first.
- Note
-
The sum of sizes must be equal to 21.
- See Also:
toTrie, which is even simpler.
- Example
    import std.stdio;
    import std.uni;

    void main()
    {
        auto set = unicode("Number");
        auto trie = codepointSetTrie!(8, 5, 8)(set);
        writeln("Input code points to test:");
        foreach (line; stdin.byLine)
        {
            int count = 0;
            foreach (dchar ch; line)
                if (trie[ch]) // is number
                    count++;
            writefln("Contains %d number code points.", count);
        }
    }
- template CodepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)
-
Type of Trie generated by codepointSetTrie function.
-
template codepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21)
template CodepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21) -
A slightly more general tool for building fixed Trie for the Unicode data.
Specifically, unlike codepointSetTrie it allows creating mappings of dchar to an arbitrary type T.
- Note
-
Overload taking CodepointSets will naturally convert only to bool mapping Tries.
- auto codepointTrie()(T[dchar] map, T defValue = T.init);
-
auto codepointTrie(R)(R range, T defValue = T.init)
Constraints: if (isInputRange!R && is(typeof(ElementType!R.init[0]) : T) && is(typeof(ElementType!R.init[1]) : dchar));
- struct MatcherConcept;
-
Conceptual type that outlines the common properties of all UTF Matchers.
- Note
-
For illustration purposes only, every method call results in assertion failure. Use
utfMatcher
to obtain a concrete matcher for UTF-8 or UTF-16 encodings.
-
bool match(Range)(ref Range inp)
Constraints: if (isRandomAccessRange!Range && is(ElementType!Range : char));
bool skip(Range)(ref Range inp)
Constraints: if (isRandomAccessRange!Range && is(ElementType!Range : char));
bool test(Range)(ref Range inp)
Constraints: if (isRandomAccessRange!Range && is(ElementType!Range : char)); -
Performs the semantic equivalent of two operations: decoding a code point at the front of inp and testing whether it belongs to the set of code points of this matcher.
The effect on inp depends on the kind of function called:
Match. If the code point is found in the set, the range inp is advanced by its size in code units; otherwise the range is not modified.
Skip. The range is always advanced by the size of the tested code point regardless of the result of the test.
Test. The range is left unaffected regardless of the result of the test.
- Examples:
-
string truth = "2² = 4";
auto m = utfMatcher!char(unicode.Number);
assert(m.match(truth)); // '2' is a number all right
assert(truth == "² = 4"); // skips on match
assert(m.match(truth)); // so is the superscript '2'
assert(!m.match(truth)); // space is not a number
assert(truth == " = 4"); // unaffected on no match
assert(!m.skip(truth)); // same test ...
assert(truth == "= 4"); // but skips a codepoint regardless
assert(!m.test(truth)); // '=' is not a number
assert(truth == "= 4"); // test never affects argument
- @property auto subMatcher(Lengths...)();
-
Advanced feature: provides direct access to a subset of the matcher based on a set of known encoding lengths. Lengths are provided in code units. The sub-matcher may then perform fewer operations per test/match.
Use with care, as the sub-matcher won't match any code points whose encoded length doesn't belong to the selected set of lengths. Also, the sub-matcher object references the parent matcher and must not be used past the lifetime of the latter.
Another caveat of using a sub-matcher is that skip is not available, precisely because the sub-matcher doesn't detect all lengths.
- enum auto isUtfMatcher(M, C);
-
Tests if M is a UTF Matcher for ranges of Char.
-
auto utfMatcher(Char, Set)(Set set)
Constraints: if (isCodepointSet!Set); -
Constructs a matcher object to classify code points from the set for an encoding that has Char as its code unit.
See MatcherConcept for the API outline.
-
auto toTrie(size_t level, Set)(Set set)
Constraints: if (isCodepointSet!Set); -
Convenience function to construct optimal configurations for a packed Trie from any set of code points.
The parameter level indicates the number of trie levels to use; allowed values are 1, 2, 3 or 4. Levels represent different speed-size trade-offs.
Level 1 is the fastest and the most memory hungry (a bit array).
Level 4 is the slowest and has the smallest footprint.
See the Synopsis section for an example.
- Note
-
Level 4 stays very practical (being faster and more predictable) compared to using direct lookup on the set itself.
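As a rough illustration of the level trade-off described above, the following sketch builds a level 1 and a level 4 trie from the same set and checks that both agree with direct lookup on the set. The choice of the Cyrillic script and of the probe characters is an assumption made purely for the example.

```d
import std.uni;

void main()
{
    // Any CodepointSet works here; Cyrillic is only an example choice.
    auto set = unicode.Cyrillic;

    auto fast = toTrie!1(set);    // level 1: a bit array, fastest and largest
    auto compact = toTrie!4(set); // level 4: slowest, smallest footprint

    // U+044F is CYRILLIC SMALL LETTER YA; the others are outside the script.
    foreach (dchar ch; "A\u044F1 "d)
    {
        // Both tries must answer exactly like the set they were built from.
        assert(fast[ch] == set[ch]);
        assert(compact[ch] == set[ch]);
    }
    assert(fast['\u044F'] && !fast['A']);
}
```

The lookup syntax is the same for every level, so the level can be tuned later without touching the call sites.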
-
auto toDelegate(Set)(Set set)
Constraints: if (isCodepointSet!Set); -
Builds a Trie with a typically optimal speed-size trade-off and wraps it into a delegate of the following type: bool delegate(dchar ch).
Effectively this creates a 'tester' lambda suitable for algorithms like std.algorithm.find that take unary predicates.
See the Synopsis section for an example. - struct unicode;
-
A single entry point to lookup Unicode code point sets by name or alias of a block, script or general category.
It uses well defined standard rules of property name lookup. This includes fuzzy matching of names, so that 'White_Space', 'white-SpAce' and 'whitespace' are all considered equal and yield the same set of white space characters.
- pure @property auto opDispatch(string name)();
-
Performs the lookup of a set of code points with compile-time correctness checking. This short-cut version combines 3 searches: across blocks, scripts, and common binary properties.
Note that since scripts and blocks overlap, the usual trick to disambiguate is used: to get a block use unicode.InBlockName, to search a script use unicode.ScriptName.
- See Also:
block, script and (not included in this search) hangulSyllableType.
-
auto opCall(C)(scope const C[] name)
Constraints: if (is(C : dchar)); -
The same lookup across blocks, scripts, or binary properties, but performed at run-time. This version is provided for cases where name is not known beforehand; otherwise the compile-time checked opDispatch is typically a better choice.
See the table of properties for available sets.
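A small sketch contrasting the two lookup styles may help. It assumes the name arrives at run time in a variable called name (a hypothetical stand-in for user input or configuration) and compares the result with the compile-time checked form.

```d
import std.uni;

void main()
{
    // Compile-time checked lookup through opDispatch.
    auto staticSet = unicode.Cyrillic;

    // Run-time lookup through opCall; `name` is a hypothetical input value.
    string name = "Cyrillic";
    auto dynamicSet = unicode(name);

    // Both routes yield the same set of code points.
    assert(staticSet == dynamicSet);
    assert(dynamicSet['\u044F']); // CYRILLIC SMALL LETTER YA
    assert(!dynamicSet['z']);

    // Name matching is loose, as described for struct unicode.
    assert(unicode("white-SpAce")[' ']);
}
```

A misspelled name is rejected during compilation in the first form, while the second form can only report the problem when it runs.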
- struct block;
-
Narrows down the search for sets of code points to all Unicode blocks.
- Note
-
Here block names are unambiguous as no scripts are searched, so simply use the unicode.block.BlockName notation.
- See Also:
- table of properties.
- Examples:
-
// use .block for explicitness
writeln(unicode.block.Greek_and_Coptic); // unicode.InGreek_and_Coptic
- struct script;
-
Narrows down the search for sets of code points to all Unicode scripts.
See the table of properties for available sets.
- Examples:
-
auto arabicScript = unicode.script.arabic;
auto arabicBlock = unicode.block.arabic;
// there is an intersection between script and block
assert(arabicBlock['\u0627']); // e.g. ARABIC LETTER ALEF (U+0627)
assert(arabicScript['\u0627']);
// but they are different
assert(arabicBlock != arabicScript);
writeln(arabicBlock); // unicode.inArabic
writeln(arabicScript); // unicode.arabic
- struct hangulSyllableType;
-
Fetch a set of code points that have the given hangul syllable type.
Other non-binary properties (once supported) follow the same notation: unicode.propertyName.propertyValue for compile-time checked access and unicode.propertyName(propertyValue) for run-time checked access.
See the table of properties for available sets.
- Examples:
-
// L here is syllable type not Letter as in unicode.L short-cut
auto leadingConsonant = unicode.hangulSyllableType("L");
// check that some leading consonants are present
foreach (jamo; '\u1110'..'\u115F')
    assert(leadingConsonant[jamo]);
writeln(leadingConsonant); // unicode.hangulSyllableType.L
-
CodepointSet parseSet(Range)(ref Range range, bool casefold = false)
Constraints: if (isInputRange!Range && is(ElementType!Range : dchar)); -
Parses a Unicode code point set from the given range using standard regex syntax '[...]'. The range is advanced, skipping over the regex set definition. The casefold parameter determines whether the set should be casefolded, that is, include both lower and upper case versions of any letters in the set.
-
pure @safe size_t graphemeStride(C)(scope const C[] input, size_t index)
Constraints: if (is(C : dchar)); -
Computes the length of the grapheme cluster starting at index. Both the resulting length and the index are measured in code units.
- Parameters:
-
C | type that is implicitly convertible to dchars |
C[] input | array of grapheme clusters |
size_t index | starting index into input[] |
- Returns:
- length of grapheme cluster
- Examples:
-
writeln(graphemeStride("  ", 1)); // 1
// A + combining ring above
string city = "A\u030Arhus";
size_t first = graphemeStride(city, 0);
assert(first == 3); // \u030A has 2 UTF-8 code units
writeln(city[0 .. first]); // "A\u030A"
writeln(city[first .. $]); // "rhus"
-
Grapheme decodeGrapheme(Input)(ref Input inp)
Constraints: if (isInputRange!Input && is(immutable(ElementType!Input) == immutable(dchar))); -
Reads one full grapheme cluster from an input range of dchar inp.
For examples see the Grapheme below.
- Note
-
This function modifies inp and thus inp must be an L-value.
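To make the note about L-values concrete, here is a minimal sketch showing how decodeGrapheme consumes its argument. The sample text is an assumption chosen so that the first cluster spans two code points.

```d
import std.uni;

void main()
{
    // 'e' followed by COMBINING ACUTE ACCENT (U+0301), then a plain 'x'.
    string s = "e\u0301x";

    // s is passed by reference and advanced past the decoded cluster.
    auto first = decodeGrapheme(s);
    assert(first.length == 2);
    assert(first[0] == 'e' && first[1] == '\u0301');
    assert(s == "x");

    auto second = decodeGrapheme(s);
    assert(second.length == 1 && second[0] == 'x');
    assert(s.length == 0); // the whole input has been consumed
}
```

Passing a temporary such as a string literal would not compile, since there is nothing for the function to advance.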
-
auto byGrapheme(Range)(Range range)
Constraints: if (isInputRange!Range && is(immutable(ElementType!Range) == immutable(dchar))); -
Iterate a string by
Grapheme
.Useful for doing string manipulation that needs to be aware of graphemes.
- See Also:
-
byCodePoint
- Examples:
-
import std.algorithm.comparison : equal; import std.range.primitives : walkLength; import std.range : take, drop; auto text = "noe\u0308l"; // noël using e + combining diaeresis assert(text.walkLength == 5); // 5 code points auto gText = text.byGrapheme; assert(gText.walkLength == 4); // 4 graphemes assert(gText.take(3).equal("noe\u0308".byGrapheme)); assert(gText.drop(3).equal("l".byGrapheme));
-
auto byCodePoint(Range)(Range range)
Constraints: if (isInputRange!Range && is(immutable(ElementType!Range) == immutable(Grapheme)));
auto byCodePoint(Range)(Range range)
Constraints: if (isInputRange!Range && is(immutable(ElementType!Range) == immutable(dchar))); -
Lazily transform a range of
Grapheme
s to a range of code points.Useful for converting the result to a string after doing operations on graphemes.
If passed in a range of code points, returns a range with equivalent capabilities.
- Examples:
-
import std.array : array; import std.conv : text; import std.range : retro; string s = "noe\u0308l"; // noël // reverse it and convert the result to a string string reverse = s.byGrapheme .array .retro .byCodePoint .text; assert(reverse == "le\u0308on"); // lëon
- struct Grapheme;
-
A structure designed to efficiently pack the characters of a grapheme cluster.
Grapheme has value semantics, so 2 copies of a Grapheme always refer to distinct objects. In most actual scenarios a Grapheme fits on the stack and avoids memory allocation overhead for all but quite long clusters.
- See Also:
decodeGrapheme, graphemeStride
-
this(C)(scope const C[] chars...)
Constraints: if (is(C : dchar));
this(Input)(Input seq)
Constraints: if (!isDynamicArray!Input && isInputRange!Input && is(ElementType!Input : dchar)); -
Constructs a Grapheme from the given code points, or from an input range of code points.
- const pure nothrow @nogc @trusted dchar opIndex(size_t index);
-
Gets a code point at the given index in this cluster.
- pure nothrow @nogc @trusted void opIndexAssign(dchar ch, size_t index);
-
Writes a code point ch at the given index in this cluster.
- Warning
-
Use of this facility may invalidate the grapheme cluster, see also Grapheme.valid.
- Examples:
-
auto g = Grapheme("A\u0302");
writeln(g[0]); // 'A'
assert(g.valid);
g[1] = '~'; // ASCII tilde is not a combining mark
writeln(g[1]); // '~'
assert(!g.valid);
-
pure nothrow @nogc @safe SliceOverIndexed!Grapheme opSlice(size_t a, size_t b) return;
pure nothrow @nogc @safe SliceOverIndexed!Grapheme opSlice() return; -
Random-access range over Grapheme's characters.
- Warning
- The returned slice is invalidated when this Grapheme leaves scope; attempting to use it afterwards leads to memory corruption.
- const pure nothrow @nogc @property @safe size_t length();
-
Grapheme cluster length in code points.
- ref @trusted auto opOpAssign(string op)(dchar ch);
-
Append character ch to this grapheme.
- Warning
-
Use of this facility may invalidate the grapheme cluster, see also valid.
- See Also:
-
Grapheme.valid
- Examples:
-
import std.algorithm.comparison : equal; auto g = Grapheme("A"); assert(g.valid); g ~= '\u0301'; assert(g[].equal("A\u0301")); assert(g.valid); g ~= "B"; // not a valid grapheme cluster anymore assert(!g.valid); // still could be useful though assert(g[].equal("A\u0301B"));
-
ref auto opOpAssign(string op, Input)(scope Input inp)
Constraints: if (isInputRange!Input && is(ElementType!Input : dchar)); -
Append all characters from the input range
inp
to this Grapheme. - @property bool valid()();
-
True if this object contains a valid extended grapheme cluster. Decoding primitives of this module always return a valid Grapheme.
Appending to and direct manipulation of a grapheme's characters may render it no longer valid. Certain applications may choose to use Grapheme as a "small string" of any code points and ignore this property entirely.
-
int sicmp(S1, S2)(scope S1 r1, scope S2 r2)
Constraints: if (isInputRange!S1 && isSomeChar!(ElementEncodingType!S1) && isInputRange!S2 && isSomeChar!(ElementEncodingType!S2)); -
Does basic case-insensitive comparison of r1 and r2. This function uses a simpler comparison rule, thus achieving better performance than icmp. However, keep in mind the warning below.
- Parameters:
-
S1 r1 | an input range of characters |
S2 r2 | an input range of characters |
- Returns:
-
An int that is 0 if the strings match, <0 if r1 is lexicographically "less" than r2, >0 if r1 is lexicographically "greater" than r2
- Warning
- This function only handles 1:1 code point mapping and thus is not sufficient for certain alphabets like German, Greek and a few others.
- See Also:
icmp
std.algorithm.comparison.cmp
- Examples:
-
writeln(sicmp("Август", "авгусТ")); // 0 // Greek also works as long as there is no 1:M mapping in sight writeln(sicmp("ΌΎ", "όύ")); // 0 // things like the following won't get matched as equal // Greek small letter iota with dialytika and tonos assert(sicmp("ΐ", "\u03B9\u0308\u0301") != 0); // while icmp has no problem with that writeln(icmp("ΐ", "\u03B9\u0308\u0301")); // 0 writeln(icmp("ΌΎ", "όύ")); // 0
-
int icmp(S1, S2)(S1 r1, S2 r2)
Constraints: if (isForwardRange!S1 && isSomeChar!(ElementEncodingType!S1) && isForwardRange!S2 && isSomeChar!(ElementEncodingType!S2)); -
Does case-insensitive comparison of r1 and r2. Follows the rules of full case-folding mapping. This includes matching as equal German ß with "ss" and other 1:M code point mappings, unlike sicmp. The cost of icmp being pedantically correct is slightly worse performance.
- Parameters:
-
S1 r1 | a forward range of characters |
S2 r2 | a forward range of characters |
- Returns:
-
An int that is 0 if the strings match, <0 if r1 is lexicographically "less" than r2, >0 if r1 is lexicographically "greater" than r2
- See Also:
sicmp
std.algorithm.comparison.cmp
- Examples:
-
writeln(icmp("Rußland", "Russland")); // 0 writeln(icmp("ᾩ -> \u1F70\u03B9", "\u1F61\u03B9 -> ᾲ")); // 0
- Examples:
-
By using std.utf.byUTF and its aliases, GC allocations via auto-decoding and thrown exceptions can be avoided, making icmp @safe @nogc nothrow pure.
import std.utf : byDchar;
writeln(icmp("Rußland".byDchar, "Russland".byDchar)); // 0
writeln(icmp("ᾩ -> \u1F70\u03B9".byDchar, "\u1F61\u03B9 -> ᾲ".byDchar)); // 0
- pure nothrow @nogc @safe ubyte combiningClass(dchar ch);
-
Returns the combining class of
ch
.- Examples:
-
// shorten the code
alias CC = combiningClass;
// combining tilde
writeln(CC('\u0303')); // 230
// combining ring below
writeln(CC('\u0325')); // 220
// the simple consequence is that "tilde" should be
// placed after a "ring below" in a sequence
- enum UnicodeDecomposition: int;
-
Unicode character decomposition type.
- Canonical
-
Canonical decomposition. The result is a canonically equivalent sequence.
- Compatibility
-
Compatibility decomposition. The result is a compatibility equivalent sequence.
- Note
- Compatibility decomposition is a lossy conversion, typically suitable only for fuzzy matching and internal processing.
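The difference between the two decomposition types can be seen with decompose, which is documented further down this page. The sketch below is only an illustration; the two sample characters are assumptions picked because one has a canonical decomposition and the other has only a compatibility one.

```d
import std.algorithm.comparison : equal;
import std.uni;

void main()
{
    // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE decomposes canonically.
    assert(decompose!Canonical('\u00C5')[].equal("A\u030A"));

    // U+FB01 LATIN SMALL LIGATURE FI has no canonical decomposition,
    // so the canonical result is the character itself...
    assert(decompose!Canonical('\uFB01')[].equal("\uFB01"));

    // ...while the compatibility decomposition drops the ligature (lossy).
    assert(decompose!Compatibility('\uFB01')[].equal("fi"));
}
```

After the compatibility step there is no way to tell that the text once contained a ligature, which is why it suits fuzzy matching better than storage.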
- pure nothrow @safe dchar compose(dchar first, dchar second);
-
Try to canonically compose 2 characters. Returns the composed character if they do compose and dchar.init otherwise.
The assumption is that first comes before second in the original text, usually meaning that the first is a starter.
- Note
-
Hangul syllables are not covered by this function. See composeJamo below.
- Examples:
-
writeln(compose('A', '\u0308')); // '\u00C4' writeln(compose('A', 'B')); // dchar.init writeln(compose('C', '\u0301')); // '\u0106' // note that the starter is the first one // thus the following doesn't compose writeln(compose('\u0308', 'A')); // dchar.init
- @safe Grapheme decompose(UnicodeDecomposition decompType = Canonical)(dchar ch);
-
Returns a full Canonical (by default) or Compatibility decomposition of character ch. If no decomposition is available, returns a Grapheme with ch itself.
- Note
- This function also decomposes hangul syllables as prescribed by the standard.
- See Also:
decomposeHangul for a restricted version that takes into account only hangul syllables but no other decompositions.
- Examples:
-
import std.algorithm.comparison : equal; writeln(compose('A', '\u0308')); // '\u00C4' writeln(compose('A', 'B')); // dchar.init writeln(compose('C', '\u0301')); // '\u0106' // note that the starter is the first one // thus the following doesn't compose writeln(compose('\u0308', 'A')); // dchar.init assert(decompose('Ĉ')[].equal("C\u0302")); assert(decompose('D')[].equal("D")); assert(decompose('\uD4DC')[].equal("\u1111\u1171\u11B7")); assert(decompose!Compatibility('¹')[].equal("1"));
- @safe Grapheme decomposeHangul(dchar ch);
-
Decomposes a Hangul syllable. If ch is not a composed syllable, then this function returns a Grapheme containing only ch as is.
- Examples:
-
import std.algorithm.comparison : equal; assert(decomposeHangul('\uD4DB')[].equal("\u1111\u1171\u11B6"));
- pure nothrow @nogc @safe dchar composeJamo(dchar lead, dchar vowel, dchar trailing = (dchar).init);
-
Try to compose a hangul syllable out of a leading consonant (lead), a vowel and an optional trailing consonant jamo.
On success returns the composed LV or LVT hangul syllable.
If either lead or vowel is not a valid hangul jamo of the respective character class, returns dchar.init.
- Examples:
-
writeln(composeJamo('\u1111', '\u1171', '\u11B6')); // '\uD4DB'
// leaving out the trailing consonant, or passing any code point
// that is not a trailing consonant, composes an LV-syllable
writeln(composeJamo('\u1111', '\u1171')); // '\uD4CC'
writeln(composeJamo('\u1111', '\u1171', ' ')); // '\uD4CC'
writeln(composeJamo('\u1111', 'A')); // dchar.init
writeln(composeJamo('A', '\u1171')); // dchar.init
- enum NormalizationForm: int;
-
Enumeration type for normalization forms, passed as template parameter for functions like
normalize
. - NFC
NFD
NFKC
NFKD -
Shorthand aliases for values indicating normalization forms.
- inout(C)[] normalize(NormalizationForm norm = NFC, C)(inout(C)[] input);
-
Returns the input string normalized to the chosen form. Form C is used by default.
For more information on normalization forms see the normalization section.
- Note
- In cases where the string in question is already normalized, it is returned unmodified and no memory allocation happens.
- Examples:
-
// any encoding works wstring greet = "Hello world"; assert(normalize(greet) is greet); // the same exact slice // An example of a character with all 4 forms being different: // Greek upsilon with acute and hook symbol (code point 0x03D3) writeln(normalize!NFC("ϓ")); // "\u03D3" writeln(normalize!NFD("ϓ")); // "\u03D2\u0301" writeln(normalize!NFKC("ϓ")); // "\u038E" writeln(normalize!NFKD("ϓ")); // "\u03A5\u0301"
- bool allowedIn(NormalizationForm norm)(dchar ch);
-
Tests if dchar ch is always allowed (Quick_Check=YES) in normalization form norm.
- Examples:
-
// e.g. Cyrillic is always allowed, so is ASCII assert(allowedIn!NFC('я')); assert(allowedIn!NFD('я')); assert(allowedIn!NFKC('я')); assert(allowedIn!NFKD('я')); assert(allowedIn!NFC('Z'));
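For contrast with the always-allowed cases above, this sketch shows characters that are rejected by some forms. The two sample characters are assumptions chosen because their Quick_Check values differ between forms.

```d
import std.uni;

void main()
{
    // U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS is a precomposed letter:
    // fine in the composed form, never allowed in the decomposed one.
    dchar precomposed = '\u00C4';
    assert(allowedIn!NFC(precomposed));
    assert(!allowedIn!NFD(precomposed));

    // U+FB01 LATIN SMALL LIGATURE FI only has a compatibility decomposition,
    // so it passes the canonical forms but not the compatibility ones.
    dchar ligature = '\uFB01';
    assert(allowedIn!NFC(ligature));
    assert(allowedIn!NFD(ligature));
    assert(!allowedIn!NFKC(ligature));
    assert(!allowedIn!NFKD(ligature));
}
```

A check like this can serve as a cheap pre-filter before calling normalize on text that is mostly plain.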
- pure nothrow @nogc @safe bool isWhite(dchar c);
-
Whether or not c is a Unicode whitespace character (general Unicode category: part of C0 (tab, vertical tab, form feed, carriage return, and linefeed characters), Zs, Zl, Zp, and NEL (U+0085)). - pure nothrow @nogc @safe bool isLower(dchar c);
-
Return whether
c
is a Unicode lowercase character. - pure nothrow @nogc @safe bool isUpper(dchar c);
-
Return whether
c
is a Unicode uppercase character. -
auto asLowerCase(Range)(Range str)
Constraints: if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) && !isConvertibleToString!Range);
auto asUpperCase(Range)(Range str)
Constraints: if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) && !isConvertibleToString!Range); -
Convert an input range or a string to upper or lower case.
Does not allocate memory. Characters in UTF-8 or UTF-16 format that cannot be decoded are treated as
std.utf.replacementDchar
.- Parameters:
-
Range str
string or range of characters
- Returns:
-
an input range of
dchar
s
- Examples:
-
import std.algorithm.comparison : equal; assert("hEllo".asUpperCase.equal("HELLO"));
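The example above covers asUpperCase only; the following sketch, with sample strings that are assumptions for illustration, shows asLowerCase and how the lazy result composes with other range algorithms.

```d
import std.algorithm.comparison : equal;
import std.range : take;
import std.uni;

void main()
{
    assert("hEllo".asLowerCase.equal("hello"));

    // The result is a range, so it can be truncated or piped further
    // without building an intermediate string.
    assert("HELLO WORLD".asLowerCase.take(5).equal("hello"));

    // Non-ASCII letters are mapped as well: U+00C9 (E with acute) -> U+00E9.
    assert("\u00C9T\u00C9".asLowerCase.equal("\u00E9t\u00E9"));
}
```

When an actual string is needed, the toLower and toUpper overloads that take a whole string are the more direct route.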
-
auto asCapitalized(Range)(Range str)
Constraints: if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) && !isConvertibleToString!Range); -
Capitalize an input range or string, meaning convert the first character to upper case and subsequent characters to lower case.
Does not allocate memory. Characters in UTF-8 or UTF-16 format that cannot be decoded are treated as
std.utf.replacementDchar
.- Parameters:
-
Range str
string or range of characters
- Returns:
- an InputRange of dchars
- See Also:
toUpper
,toLower
asUpperCase
,asLowerCase
- Examples:
-
import std.algorithm.comparison : equal; assert("hEllo".asCapitalized.equal("Hello"));
-
pure @trusted void toLowerInPlace(C)(ref C[] s)
Constraints: if (is(C == char) || is(C == wchar) || is(C == dchar)); -
Converts s to lowercase (by performing Unicode lowercase mapping) in place. For a few characters the string length may increase after the transformation; in such a case the function reallocates exactly once. If s does not have any uppercase characters, then s is unaltered.
-
pure @trusted void toUpperInPlace(C)(ref C[] s)
Constraints: if (is(C == char) || is(C == wchar) || is(C == dchar)); -
Converts s to uppercase (by performing Unicode uppercase mapping) in place. For a few characters the string length may increase after the transformation; in such a case the function reallocates exactly once. If s does not have any lowercase characters, then s is unaltered. - pure nothrow @nogc @safe dchar toLower(dchar c);
-
If c is a Unicode uppercase character, then its lowercase equivalent is returned. Otherwise c is returned.
- Warning
- Certain alphabets like German and Greek have no 1:1 upper-lower mapping. Use the overload of toLower which takes a full string instead.
-
ElementEncodingType!S[] toLower(S)(S s)
Constraints: if (isSomeString!S || isRandomAccessRange!S && hasLength!S && hasSlicing!S && isSomeChar!(ElementType!S)); -
Creates a new array which is identical to s except that all of its characters are converted to lowercase (by performing Unicode lowercase mapping). If none of the characters in s were affected, then s itself is returned if s is a string-like type.
- Parameters:
-
S s
A random access range of characters
- Returns:
-
An array with the same element type as
s
.
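A brief sketch of both toLower overloads follows. The sample values are assumptions; the second half checks the documented behaviour that an unaffected string is handed back as is.

```d
import std.uni;

void main()
{
    // Single code point overload.
    assert(toLower('\u0416') == '\u0436'); // CYRILLIC ZHE, capital -> small
    assert(toLower('1') == '1');           // non-letters pass through

    // Whole-string overload.
    assert(toLower("Hello, World") == "hello, world");

    // Nothing to convert: the very same slice is returned, no copy is made.
    string s = "already lower";
    assert(toLower(s) is s);
}
```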
- pure nothrow @nogc @safe dchar toUpper(dchar c);
-
If c is a Unicode lowercase character, then its uppercase equivalent is returned. Otherwise c is returned.
- Warning
- Certain alphabets like German and Greek have no 1:1 upper-lower mapping. Use the overload of toUpper which takes a full string instead.
toUpper can be used as an argument to std.algorithm.iteration.map to produce an algorithm that can convert a range of characters to upper case without allocating memory. A string can then be produced by using std.algorithm.mutation.copy to send it to an std.array.appender.
- Examples:
-
import std.algorithm.iteration : map; import std.algorithm.mutation : copy; import std.array : appender; auto abuf = appender!(char[])(); "hello".map!toUpper.copy(abuf); writeln(abuf.data); // "HELLO"
-
ElementEncodingType!S[] toUpper(S)(S s)
Constraints: if (isSomeString!S || isRandomAccessRange!S && hasLength!S && hasSlicing!S && isSomeChar!(ElementType!S)); -
Allocates a new array which is identical to s except that all of its characters are converted to uppercase (by performing Unicode uppercase mapping). If none of the characters in s were affected, then s itself is returned if s is a string-like type.
- Parameters:
-
S s | A random access range of characters |
- Returns:
-
A new array with the same element type as s.
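The allocating and in-place variants can be compared side by side. This is a minimal sketch using an ASCII sample (an assumption, chosen so that the length does not change).

```d
import std.uni;

void main()
{
    // Allocating overload: the input is immutable, a new string comes back.
    assert(toUpper("hello") == "HELLO");

    // In-place variants need a mutable buffer passed by reference.
    char[] buf = "hello".dup;
    toUpperInPlace(buf);
    assert(buf == "HELLO");
    toLowerInPlace(buf);
    assert(buf == "hello");
}
```

The in-place functions take the slice by ref because a mapping that lengthens the text forces a reallocation, after which the slice points at new memory.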
- pure nothrow @nogc @safe bool isAlpha(dchar c);
-
Returns whether
c
is a Unicode alphabetic character (general Unicode category: Alphabetic). - pure nothrow @nogc @safe bool isMark(dchar c);
-
Returns whether
c
is a Unicode mark (general Unicode category: Mn, Me, Mc). - pure nothrow @nogc @safe bool isNumber(dchar c);
-
Returns whether
c
is a Unicode numerical character (general Unicode category: Nd, Nl, No). - pure nothrow @nogc @safe bool isAlphaNum(dchar c);
-
Returns whether
c
is a Unicode alphabetic character or number. (general Unicode category: Alphabetic, Nd, Nl, No).- Parameters:
-
dchar c
any Unicode character
- Returns:
true
if the character is in the Alphabetic, Nd, Nl, or No Unicode categories
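The relationship between isAlpha, isNumber, isAlphaNum and isMark can be sketched as below. The sample characters are assumptions picked to fall clearly into one category each.

```d
import std.uni;

void main()
{
    // Letters from different scripts are alphabetic.
    assert(isAlpha('a') && isAlpha('\u044F')); // CYRILLIC SMALL LETTER YA

    // Numbers cover more than ASCII digits: U+00B2 SUPERSCRIPT TWO is No.
    assert(isNumber('7') && isNumber('\u00B2'));
    assert(!isAlpha('\u00B2'));

    // isAlphaNum accepts either kind.
    assert(isAlphaNum('a') && isAlphaNum('7') && isAlphaNum('\u00B2'));
    assert(!isAlphaNum('-') && !isAlphaNum(' '));

    // Combining marks are a separate class: U+0301 COMBINING ACUTE ACCENT.
    assert(isMark('\u0301') && !isMark('a'));
}
```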
- pure nothrow @nogc @safe bool isPunctuation(dchar c);
-
Returns whether
c
is a Unicode punctuation character (general Unicode category: Pd, Ps, Pe, Pc, Po, Pi, Pf). - pure nothrow @nogc @safe bool isSymbol(dchar c);
-
Returns whether
c
is a Unicode symbol character (general Unicode category: Sm, Sc, Sk, So). - pure nothrow @nogc @safe bool isSpace(dchar c);
-
Returns whether
c
is a Unicode space character (general Unicode category: Zs) - pure nothrow @nogc @safe bool isGraphical(dchar c);
-
Returns whether
c
is a Unicode graphical character (general Unicode category: L, M, N, P, S, Zs). - pure nothrow @nogc @safe bool isControl(dchar c);
-
Returns whether
c
is a Unicode control character (general Unicode category: Cc). - pure nothrow @nogc @safe bool isFormat(dchar c);
-
Returns whether
c
is a Unicode formatting character (general Unicode category: Cf). - pure nothrow @nogc @safe bool isPrivateUse(dchar c);
-
Returns whether
c
is a Unicode Private Use code point (general Unicode category: Co). - pure nothrow @nogc @safe bool isSurrogate(dchar c);
-
Returns whether
c
is a Unicode surrogate code point (general Unicode category: Cs). - pure nothrow @nogc @safe bool isSurrogateHi(dchar c);
-
Returns whether
c
is a Unicode high surrogate (lead surrogate). - pure nothrow @nogc @safe bool isSurrogateLo(dchar c);
-
Returns whether
c
is a Unicode low surrogate (trail surrogate). - pure nothrow @nogc @safe bool isNonCharacter(dchar c);
-
Returns whether
c
is a Unicode non-character i.e. a code point with no assigned abstract character. (general Unicode category: Cn)
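As a quick tour of the remaining classification functions, the sketch below probes one representative code point per category. The specific code points are assumptions chosen for illustration; surrogates and U+FFFF are built with casts rather than character literals.

```d
import std.uni;

void main()
{
    // isSpace covers only Zs, isWhite also accepts tab and line separators.
    assert(isSpace(' ') && !isSpace('\t'));
    assert(isWhite('\t') && isWhite('\u2028')); // U+2028 LINE SEPARATOR

    assert(isControl('\n'));        // Cc
    assert(isFormat('\u200E'));     // Cf: LEFT-TO-RIGHT MARK
    assert(isPrivateUse('\uE000')); // Co: start of the Private Use Area
    assert(isGraphical('a') && !isGraphical('\n'));

    // Surrogate code points.
    dchar hi = cast(dchar) 0xD800;
    dchar lo = cast(dchar) 0xDC00;
    assert(isSurrogate(hi) && isSurrogateHi(hi) && !isSurrogateLo(hi));
    assert(isSurrogate(lo) && isSurrogateLo(lo) && !isSurrogateHi(lo));

    // U+FFFF is a code point with no assigned abstract character.
    assert(isNonCharacter(cast(dchar) 0xFFFF));
}
```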
© 1999–2021 The D Language Foundation
Licensed under the Boost License 1.0.
https://dlang.org/phobos/std_uni.html