Monotone initially dealt with only ASCII characters, in file path names, certificate names, key names, and packets. Some conservative extensions are provided to permit internationalized use. These extensions can be summarized as follows:
The remainder of this section is a precise specification of monotone’s internationalization behavior.
Character set conversion: the process of mapping a string of bytes representing wide characters from one encoding to another. Per-file character set conversions are specified by a Lua hook get_charset_conv, which takes a filename and returns a table of two strings: the first represents the "internal" (database) charset, the second represents the "external" (file system) charset.
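As a rough illustration of what a per-file conversion pair means, the following Python sketch models the idea. The real hook is written in Lua; the rule table, the glob patterns, the charset names, and the helper functions `to_internal` and `to_external` are all hypothetical and exist only for this example.

```python
import fnmatch

# Hypothetical per-file rules: (glob pattern, internal charset, external charset).
# The patterns and charset names are assumptions made for illustration only.
CHARSET_RULES = [
    ("*.txt", "utf-8", "latin-1"),
]


def get_charset_conv(filename):
    """Python model of the hook: return (internal, external), or None for no conversion."""
    for pattern, internal, external in CHARSET_RULES:
        if fnmatch.fnmatch(filename, pattern):
            return internal, external
    return None


def to_internal(filename, data):
    """Convert bytes read from the file system into the database charset."""
    conv = get_charset_conv(filename)
    if conv is None:
        return data
    internal, external = conv
    return data.decode(external).encode(internal)


def to_external(filename, data):
    """Convert bytes stored in the database into the file system charset."""
    conv = get_charset_conv(filename)
    if conv is None:
        return data
    internal, external = conv
    return data.decode(internal).encode(external)
```

Files that match no rule pass through unchanged, which is how this sketch models a file with no conversion configured.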
Letters, digits, and hyphen (LDH): the set of ASCII bytes 0x2D, 0x30..0x39, 0x41..0x5A, and 0x61..0x7A.
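The byte ranges above translate directly into a membership test. This is a minimal sketch; the names `LDH_BYTES` and `is_ldh` are hypothetical, and treating the empty string as not LDH is an assumption of this example rather than something the text specifies.

```python
# The LDH set, built from the byte ranges listed above.
LDH_BYTES = frozenset(
    [0x2D]                      # hyphen
    + list(range(0x30, 0x3A))   # digits 0-9
    + list(range(0x41, 0x5B))   # letters A-Z
    + list(range(0x61, 0x7B))   # letters a-z
)


def is_ldh(data: bytes) -> bool:
    """Return True if data is non-empty and every byte is in the LDH set.

    Rejecting the empty string is an assumption of this sketch.
    """
    return len(data) > 0 and all(byte in LDH_BYTES for byte in data)
```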
RFC 3454 (stringprep), a general framework for mapping, normalizing, prohibiting, and bidirectionality checking of international names prior to use in public network protocols.
RFC 3491 (nameprep), a specific profile of stringprep, used for preparing international domain names (IDNs).
RFC 3492 (punycode), a "bootstring" encoding of Unicode into ASCII.
RFC 3490 (IDNA), international domain names for applications, a combination of the above technologies (nameprep, punycoding, limiting to LDH characters) to form a specific "ASCII compatible encoding" (ACE) of Unicode, signified by the presence of an "unlikely" ACE prefix string "xn--". IDNA is intended to make it possible to use Unicode relatively "safely" over legacy ASCII-based applications. The general picture of an IDNA string is this:
{ACE-prefix}{LDH-sanitized(punycode(nameprep(UTF-8-string)))}
It is important to understand that IDNA encoding does not preserve the input string: it both prohibits a wide variety of possible strings and normalizes non-equal strings to supposedly "equivalent" forms.
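The loss of information can be seen with any RFC 3490 implementation. The sketch below uses Python's built-in `idna` codec, not monotone's own code, and the helper names `to_ace` and `from_ace` are hypothetical. It shows two different inputs mapping to the same ACE string, so decoding cannot recover the original spelling.

```python
def to_ace(label: str) -> str:
    """Encode one label with Python's built-in RFC 3490 codec (nameprep + punycode)."""
    return label.encode("idna").decode("ascii")


def from_ace(ace: str) -> str:
    """Decode an ACE label back to Unicode with the same codec."""
    return ace.encode("ascii").decode("idna")


upper = to_ace("Bücher")   # capital B
lower = to_ace("bücher")   # lowercase b

# Nameprep case-folds the non-ASCII label, so both inputs collapse to one ACE form.
print(upper, lower)        # xn--bcher-kva xn--bcher-kva
print(from_ace(upper))     # bücher (the capital B is gone)
```

Note that a label consisting only of ASCII characters skips nameprep in RFC 3490, so this folding is only visible on labels that contain at least one non-ASCII character.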
By default, monotone does not decode IDNA when printing to the console (IDNA names are ASCII, which is a subset of UTF-8, so this normal form conversion can still apply, albeit oddly). This behavior is to protect users against security problems associated with malicious use of "similar-looking" characters.
0x5C ’\’ path separator to 0x2F ’/’. This extra processing is performed by boost::filesystem.
0x2F (ASCII ’/’), and without a leading or trailing 0x2F.
0x2F and any ASCII "control codes" (0x00..0x1F and 0x7F).
sha1sum will produce different results than those entries shown in a corresponding manifest.
UI messages are displayed via calls to gettext().
Host names are read on the command line and subject to normal form conversion. Host names are then split at 0x2E (ASCII ’.’), each component is subject to IDNA encoding, and the components are rejoined.
After processing, host names are stored internally as ASCII. The invariant is that a host name inside monotone contains only sequences of LDH separated by 0x2E.
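The split, encode, and rejoin steps, together with the resulting invariant, can be sketched as follows. This is an illustration using Python's built-in `idna` codec rather than monotone's implementation. The function names are hypothetical, normal form conversion is assumed to have happened already, and raising `ValueError` on a non-LDH component is an assumption about how a violation might be reported.

```python
import string

# LDH: ASCII letters, digits, and hyphen.
LDH_CHARS = frozenset(string.ascii_letters + string.digits + "-")


def encode_host_name(host: str) -> str:
    """Split at '.', IDNA-encode each component, rejoin, and check the invariant."""
    encoded = []
    for component in host.split("."):
        # Python's codec performs nameprep and punycode for non-ASCII components.
        ace = component.encode("idna").decode("ascii")
        # All-ASCII components pass through the codec unchanged, so the LDH
        # restriction has to be checked explicitly.
        if not ace or not all(ch in LDH_CHARS for ch in ace):
            raise ValueError("host name component is not LDH: %r" % component)
        encoded.append(ace)
    return ".".join(encoded)
```

For example, `encode_host_name("bücher.example.com")` yields `xn--bcher-kva.example.com`, while a component such as `bad_name` is rejected because the underscore is outside LDH.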
Read on the command line and subject to normal form conversion and IDNA encoding as a single component. The invariant is that a cert name inside monotone is a single LDH ASCII string.
Cert values may be either text or binary, depending on the return value of the hook cert_is_binary. If binary, the cert value is never printed to the screen (the literal string "<binary>" is displayed instead), and is never subjected to line ending or character conversion. If text, the cert value is subject to normal form conversion, as well as having all UTF-8 codes corresponding to ASCII control codes (0x00..0x1F and 0x7F) prohibited in the normal form, except 0x0A (ASCII LF).
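The text and binary cases can be modeled with a small sketch. The function names, the `is_binary` flag standing in for the result of the cert_is_binary hook, and the use of `ValueError` are all hypothetical; normal form conversion is assumed to have been applied to text values beforehand.

```python
# Control codes prohibited in text cert values: 0x00..0x1F and 0x7F, except LF (0x0A).
PROHIBITED_CODES = (frozenset(range(0x00, 0x20)) - {0x0A}) | {0x7F}


def check_text_cert_value(value: str) -> str:
    """Return value unchanged if it contains no prohibited control code."""
    for position, ch in enumerate(value):
        if ord(ch) in PROHIBITED_CODES:
            raise ValueError(
                "control code 0x%02X at offset %d" % (ord(ch), position)
            )
    return value


def display_cert_value(value, is_binary: bool) -> str:
    """Model of what is shown on screen for a cert value.

    is_binary stands in for the result of the cert_is_binary hook.
    """
    if is_binary:
        return "<binary>"
    return check_text_cert_value(value)
```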
Read on the command line and subject to normal form conversion and IDNA encoding as a single component. The invariant is that a var domain inside monotone is a single LDH ASCII string.
Var names and values are assumed to be text, and subject to normal form conversion.
Read on the command line and subject to normal form conversion and IDNA encoding as an email address (split and joined at ’.’ and ’@’ characters). The invariant is that a key name inside monotone contains only LDH, 0x2E (ASCII ’.’), and 0x40 (ASCII ’@’) characters.
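Encoding a key name as an email address can be sketched in the same style: split at both separators, encode each piece, and put the separators back where they were. This uses Python's built-in `idna` codec rather than monotone's implementation, the function names are hypothetical, and normal form conversion is assumed to have happened already.

```python
import re
import string

# Characters allowed in a stored key name: LDH plus '.' and '@'.
KEY_NAME_CHARS = frozenset(string.ascii_letters + string.digits + "-.@")


def encode_key_name(name: str) -> str:
    """IDNA-encode a key name as an email address, keeping '.' and '@' in place."""
    # The capturing group makes re.split return the separators too,
    # at the odd indices of the resulting list.
    pieces = re.split(r"([.@])", name)
    encoded = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            encoded.append(piece)  # a '.' or '@' separator, kept verbatim
        else:
            encoded.append(piece.encode("idna").decode("ascii"))
    return "".join(encoded)


def satisfies_key_name_invariant(stored: str) -> bool:
    """Check that a stored key name contains only LDH, '.', and '@'."""
    return all(ch in KEY_NAME_CHARS for ch in stored)
```

For example, `encode_key_name("müller@bücher.example")` yields `xn--mller-kva@xn--bcher-kva.example`, which satisfies the invariant.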
Packets are 7-bit ASCII. The characters permitted in packets are the union of these character sets: