


7.1 Internationalization

Monotone initially dealt with only ASCII characters, in file path names, certificate names, key names, and packets. Some conservative extensions are provided to permit internationalized use. These extensions can be summarized as follows:

The remainder of this section is a precise specification of monotone's internationalization behavior.

General Terms

Character set conversion
The process of mapping a string of bytes representing wide characters from one encoding to another. Per-file character set conversions are specified by a Lua hook get_charset_conv which takes a filename and returns a table of two strings: the first represents the "internal" (database) charset, the second represents the "external" (file system) charset.
Line ending conversion
The process of converting platform-dependent end-of-line codes (0x0D, 0x0A, or the pair 0x0D 0x0A) from one convention to another. Per-file line ending conversion is specified by a Lua hook get_linesep_conv which takes a filename and returns a table of two strings: the first represents the "internal" (database) line ending convention, the second represents the "external" (file system) line ending convention. Each string should be one of the three strings "CR", "LF", or "CRLF".

Note that line ending conversion is always performed on the internal character set when both character set and line ending conversion are enabled; this behavior is meant to encourage the use of monotone's “normal form” (UTF-8, '\n') as an internal form for your source files when working with multiple external forms. Also note that line ending conversion only works on character encodings that use the specific code bytes described above, such as ASCII, ISO-8859-x, and UTF-8.
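The ordering described above can be sketched in Python. This is an illustration only, not monotone's implementation: the function names are hypothetical, and the two tuples stand in for the tables returned by get_charset_conv and get_linesep_conv. The sketch also assumes the input uses its declared line ending convention consistently.

```python
# Hypothetical helpers; monotone itself configures this through Lua hooks.
LINESEP = {"CR": b"\r", "LF": b"\n", "CRLF": b"\r\n"}

def convert_linesep(data, src, dst):
    """Replace the end-of-line bytes of convention src with those of dst."""
    if src == dst:
        return data
    return data.replace(LINESEP[src], LINESEP[dst])

def to_internal(raw, charset_conv, linesep_conv):
    """External (file system) bytes -> internal (database) bytes."""
    internal_cs, external_cs = charset_conv
    internal_ls, external_ls = linesep_conv
    # Character set conversion first, so that the line ending
    # conversion runs on the internal character set.
    data = raw.decode(external_cs).encode(internal_cs)
    return convert_linesep(data, external_ls, internal_ls)

def to_external(raw, charset_conv, linesep_conv):
    """Internal (database) bytes -> external (file system) bytes."""
    internal_cs, external_cs = charset_conv
    internal_ls, external_ls = linesep_conv
    # Line ending conversion first, while the data is still in the
    # internal character set; character set conversion comes last.
    data = convert_linesep(raw, internal_ls, external_ls)
    return data.decode(internal_cs).encode(external_cs)

charsets = ("utf-8", "iso-8859-1")   # (internal, external)
lineseps = ("LF", "CRLF")            # (internal, external)
stored = to_internal(b"caf\xe9\r\n", charsets, lineseps)
print(stored)                                   # b'caf\xc3\xa9\n'
print(to_external(stored, charsets, lineseps))  # b'caf\xe9\r\n'
```

Working at the byte level mirrors the restriction noted above: the replacement is only meaningful for encodings in which 0x0D and 0x0A appear as those exact bytes.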

Normal form conversion
Character set and line ending conversions done between a "system form" and a "normal form". The system character set form is inferred from the environment, using the various locale environment variables. The system line ending form can be additionally specialized using the get_system_linesep hook. No hooks exist for adjusting the system character set, since the system character set must be known during command-line argument processing, before any Lua hooks are loaded.

Monotone's normal form is the UTF-8 character set and the 0x0A (LF) line ending form. This form is used in any files monotone needs to read, write, and interpret itself, such as MT/revision, MT/work, MT/options, and .mt-attrs.

LDH
Letters, digits, and hyphen: the set of ASCII bytes 0x2D, 0x30..0x39, 0x41..0x5A, and 0x61..0x7A.
stringprep
RFC 3454, a general framework for mapping, normalizing, prohibiting and bidirectionality checking for international names prior to use in public network protocols.
nameprep
RFC 3491, a specific profile of stringprep, used for preparing international domain names (IDNs)
punycode
RFC 3492, a "bootstring" encoding of Unicode into ASCII.
IDNA
RFC 3490, international domain names for applications, a combination of the above technologies (nameprep, punycoding, limiting to LDH characters) to form a specific "ASCII compatible encoding" (ACE) of Unicode, signified by the presence of an "unlikely" ACE prefix string "xn--". IDNA is intended to make it possible to use Unicode relatively "safely" over legacy ASCII-based applications. The general picture of an IDNA string is this:
                {ACE-prefix}{LDH-sanitized(punycode(nameprep(UTF-8-string)))}
     

It is important to understand that IDNA encoding does not preserve the input string: it both prohibits a wide variety of possible strings and normalizes non-equal strings to supposedly "equivalent" forms.
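The lossy nature of the encoding can be seen with Python's built-in "idna" codec, which implements RFC 3490. This is only an illustration of the IDNA rules and says nothing about how monotone performs the encoding internally; the helper name to_ace is hypothetical.

```python
def to_ace(label):
    """IDNA-encode a single label with Python's RFC 3490 codec."""
    return label.encode("idna").decode("ascii")

# Two non-equal inputs are normalized (case-folded by nameprep)
# to the same ASCII compatible encoding.
print(to_ace("bücher"))   # xn--bcher-kva
print(to_ace("BÜCHER"))   # xn--bcher-kva

# Decoding therefore cannot recover the original spelling.
print(b"xn--bcher-kva".decode("idna"))   # bücher
```

Because both spellings collapse to one encoded form, any name that passes through IDNA encoding should be treated as identified by its encoded form, not by the string the user typed.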

By default, monotone does not decode IDNA when printing to the console (IDNA names are ASCII, which is a subset of UTF-8, so this normal form conversion can still apply, albeit oddly). This behavior is to protect users against security problems associated with malicious use of "similar-looking" characters. If the hook display_decoded_idna returns true, IDNA names are decoded for display.

Filenames

File contents

UI messages

UI messages are displayed via calls to gettext().

Host names

Host names are read on the command-line and subject to normal form conversion. Host names are then split at 0x2E (ASCII '.'), each component is subject to IDNA encoding, and the components are rejoined.

After processing, host names are stored internally as ASCII. The invariant is that a host name inside monotone contains only sequences of LDH separated by 0x2E.
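The split, encode, and rejoin steps, together with the LDH invariant, can be sketched as follows. The function name encode_host is hypothetical, the input is assumed to be already in normal form (here, a Python string), and Python's "idna" codec stands in for whatever IDNA implementation is actually used. The explicit LDH check is added because Python's codec does not itself reject non-LDH ASCII characters.

```python
import re

# One or more letters, digits, or hyphens (the LDH set).
LDH_LABEL = re.compile(r"[A-Za-z0-9-]+")

def encode_host(host):
    """Split at '.', IDNA-encode each component, check LDH, rejoin."""
    encoded = []
    for label in host.split("."):
        ace = label.encode("idna").decode("ascii")
        if not LDH_LABEL.fullmatch(ace):
            raise ValueError("host name component is not LDH: %r" % label)
        encoded.append(ace)
    return ".".join(encoded)

print(encode_host("bücher.example"))   # xn--bcher-kva.example
print(encode_host("www.example.com"))  # www.example.com
```

A name such as "bad_host.example" is rejected by this sketch, since the underscore falls outside the LDH set.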

Cert names

Read on the command line and subject to normal form conversion and IDNA encoding as a single component. The invariant is that a cert name inside monotone is a single LDH ASCII string.

Cert values

Cert values may be either text or binary, depending on the return value of the hook cert_is_binary. If binary, the cert value is never printed to the screen (the literal string "<binary>" is displayed, instead), and is never subjected to line ending or character conversion. If text, the cert value is subject to normal form conversion, as well as having all UTF-8 codes corresponding to ASCII control codes (0x0..0x1F and 0x7F) prohibited in the normal form, except 0x0A (ASCII LF).
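A minimal sketch of these two rules follows. The function names are hypothetical, the is_binary argument stands in for the result of the cert_is_binary hook, and the text value is assumed to have already been converted to normal form.

```python
def check_text_cert_value(value):
    """Reject ASCII control codes other than LF in a text cert value."""
    for ch in value:
        code = ord(ch)
        is_control = code <= 0x1F or code == 0x7F
        if is_control and code != 0x0A:
            raise ValueError("prohibited control code 0x%02X" % code)
    return value

def display_cert_value(value, is_binary):
    """Return what would be shown on screen for a cert value."""
    if is_binary:
        # Binary values are never printed or converted.
        return "<binary>"
    return check_text_cert_value(value)

print(display_cert_value("line one\nline two", False))
print(display_cert_value(b"\x00\x01\x02", True))   # <binary>
```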

Var domains

Read on the command line and subject to normal form conversion and IDNA encoding as a single component. The invariant is that a var domain inside monotone is a single LDH ASCII string.

Var names and values

Var names and values are assumed to be text, and subject to normal form conversion.

Key names

Read on the command line and subject to normal form conversion and IDNA encoding as an email address (split and joined at '.' and '@' characters). The invariant is that a key name inside monotone contains only LDH, 0x2E (ASCII '.') and 0x40 (ASCII '@') characters.
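The email-style treatment of key names can be sketched in the same way as host names, with '@' acting as an additional separator. As before, encode_key_name is a hypothetical name, the input is assumed to be in normal form, and Python's "idna" codec is used purely for illustration.

```python
import re

def encode_key_name(name):
    """IDNA-encode each component between '.' and '@' separators."""
    # The capture group makes re.split keep the separators.
    parts = re.split(r"([.@])", name)
    out = []
    for part in parts:
        if part in (".", "@"):
            out.append(part)
        else:
            out.append(part.encode("idna").decode("ascii"))
    result = "".join(out)
    # Invariant: only LDH, '.' and '@' characters remain.
    if not re.fullmatch(r"[A-Za-z0-9.@-]+", result):
        raise ValueError("key name contains prohibited characters")
    return result

print(encode_key_name("user@bücher.example"))  # user@xn--bcher-kva.example
```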

Packets

Packets are 7-bit ASCII. The characters permitted in packets are the union of these character sets:

The .mt-attrs file

The .mt-attrs file now uses 0x0A (ASCII LF) as a delimiter, to permit 0x20 (ASCII space) in filenames. This may change in the future.