Network Working Group K. Davies
Internet-Draft ICANN
Intended status: Informational A. Freytag
Expires: March 27, 2014 ASMUS Inc.
September 23, 2013
Representing Label Generation Rulesets using XML
draft-davies-idntables-04
Abstract
This memo describes a method of representing the domain name
registration policy for a zone administrator using Extensible Markup
Language (XML). These policies, known as "Label Generation Rulesets"
(LGRs), are particularly used for the implementation of
Internationalised Domain Names (IDNs). The rulesets are used to
implement and share policy on which specific Unicode code points are
permitted for registrations, which alternative code points are
considered variants, and what actions may be performed on those
variants.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on March 27, 2014.
Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
Davies & Freytag Expires March 27, 2014 [Page 1]
Internet-Draft Label Generation Rulesets in XML September 2013
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. Design Goals . . . . . . . . . . . . . . . . . . . . . . . . . 5
3. Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 6
4. LGR Format . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.1. Namespace . . . . . . . . . . . . . . . . . . . . . . . . 7
4.2. Basic structure . . . . . . . . . . . . . . . . . . . . . 7
4.3. Metadata . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.3.1. The version element . . . . . . . . . . . . . . . . . 8
4.3.2. The date element . . . . . . . . . . . . . . . . . . . 8
4.3.3. The language element . . . . . . . . . . . . . . . . . 8
4.3.4. The domain element . . . . . . . . . . . . . . . . . . 9
4.3.5. The description element . . . . . . . . . . . . . . . 9
4.3.6. The validity-start and validity-end elements . . . . . 9
4.3.7. The unicode-version element . . . . . . . . . . . . . 10
4.3.8. The references element . . . . . . . . . . . . . . . . 10
5. Code Point Rules . . . . . . . . . . . . . . . . . . . . . . . 12
5.1. Sequences . . . . . . . . . . . . . . . . . . . . . . . . 12
5.2. Variants . . . . . . . . . . . . . . . . . . . . . . . . . 13
5.2.1. Basic variants . . . . . . . . . . . . . . . . . . . . 13
5.2.2. Null variants . . . . . . . . . . . . . . . . . . . . 14
5.2.3. Dispositions . . . . . . . . . . . . . . . . . . . . . 14
5.2.4. The ref attribute . . . . . . . . . . . . . . . . . . 14
5.2.5. Conditional variants . . . . . . . . . . . . . . . . . 15
5.2.6. The comment attribute . . . . . . . . . . . . . . . . 16
5.3. Code point tagging . . . . . . . . . . . . . . . . . . . . 16
6. Whole Label and Context Evaluation . . . . . . . . . . . . . . 17
6.1. Basic concepts . . . . . . . . . . . . . . . . . . . . . . 17
6.2. Character Classes . . . . . . . . . . . . . . . . . . . . 17
6.2.1. Tag-based classes . . . . . . . . . . . . . . . . . . 18
6.2.2. Unicode property based classes . . . . . . . . . . . . 18
6.2.3. Explicitly declared classes . . . . . . . . . . . . . 19
6.2.4. Combined classes . . . . . . . . . . . . . . . . . . . 20
6.3. Whole Label and Context Rules . . . . . . . . . . . . . . 21
6.3.1. The rule element . . . . . . . . . . . . . . . . . . . 21
6.3.2. Parameterized Context or When Rule . . . . . . . . . . 25
6.4. Action elements . . . . . . . . . . . . . . . . . . . . . 27
6.4.1. Recommended Disposition Values . . . . . . . . . . . . 28
6.4.2. Precedence . . . . . . . . . . . . . . . . . . . . . . 28
6.4.3. Implied actions . . . . . . . . . . . . . . . . . . . 28
Davies & Freytag Expires March 27, 2014 [Page 2]
Internet-Draft Label Generation Rulesets in XML September 2013
7. Example table . . . . . . . . . . . . . . . . . . . . . . . . 30
8. Processing a label against an LGR . . . . . . . . . . . . . . 32
8.1. Determining eligibility for a label . . . . . . . . . . . 32
8.2. Determining variants for a label . . . . . . . . . . . . . 32
8.3. Determining a disposition for a label or variant label . . 32
9. Conversion between other formats . . . . . . . . . . . . . . . 34
10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 35
11. Security Considerations . . . . . . . . . . . . . . . . . . . 36
12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Appendix A. RelaxNG Schema . . . . . . . . . . . . . . . . . . . 38
Appendix B. Acknowledgements . . . . . . . . . . . . . . . . . . 45
Appendix C. Editorial Notes . . . . . . . . . . . . . . . . . . . 46
C.1. Known Issues and Future Work . . . . . . . . . . . . . . . 46
C.2. Change History . . . . . . . . . . . . . . . . . . . . . . 46
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 47
Davies & Freytag Expires March 27, 2014 [Page 3]
Internet-Draft Label Generation Rulesets in XML September 2013
1. Introduction
This memo describes a method of using Extensible Markup Language
(XML) to describe the algorithm used to determine whether a given
domain label is permitted, and under which conditions based on the
code points it contains and their context. These algorithms are
comprised of a list of permissible code points, variants, and a
number of rules describing where certain relationships are applied.
These algorithms form part of a zone administrator's policies, and
can be referred to as Label Generation Rulesets (LGRs), or IDN
tables.
Administrators of the zones for top-level domain registries have
historically published their LGRs using ASCII text or HTML. The
formatting of these documents has been loosely based on the format
used for the Language Variant Table in [RFC3743]. [RFC4290] also
provides a "model table format" that describes a similar set of
functionality.
Through the first decade of IDN deployment, experience has shown that
LGRs derived from these formats are difficult to consistently
implement and compare due to their differing formats. A universal
format, such as one using a structured XML format, will assist by
improving machine-readability, consistency, reusability and
maintainability of LGRs. It also provides for more complex
conditional implementation of variants that reflects the known
requirements of current zone administrator policies.
While the predominant usage of this specification is to represent IDN
label policy, the format is not limited to IDN usage may also be used
for describing ASCII domain name label rulesets.
Davies & Freytag Expires March 27, 2014 [Page 4]
Internet-Draft Label Generation Rulesets in XML September 2013
2. Design Goals
The following items are explicit design goals of this format:
o MUST be in a format that can be implemented in a reasonably
straightforward manner in software;
o The format SHOULD be able to be checked for formatting errors,
such that common mistakes can be caught;
o An LGR MUST be able to express the set of valid code points that
are allowed for registration under a specific zone administrator's
policies;
o MUST be able to express computed alternatives to a given domain
name based on a one-to-one, or one-to-many relationship. These
computed alternatives are commonly known as "variants";
o Variants SHOULD be able to be tagged with specific categories,
such that the categories can be used to support registry policy
(such as whether to list the computed variant in the zone, or to
merely block it from registration);
o Variants MUST be able to stipulated based on contextual
information. For example, specific variants may only be
applicable when they follow another specific code point, or when
the code point is displayed in a specific presentation form;
o The data contained within an LGR MUST be unambiguous, such that
independent implementations that utilise the contents will arrive
at the same results;
o LGRs SHOULD be suitable for comparison and re-use, such that one
could easily compare the contents of two or more to see the
differences, to merge them, and so on.
o As many existing IDN tables are practicable SHOULD be able to be
migrated to the LGR format with all applicable logic retained.
It is explicitly NOT the goal of this format to stipulate what code
points should be listed in an LGR by a zone administrator. Which
registration policies are used for a particular zone is outside the
scope of this memo.
Davies & Freytag Expires March 27, 2014 [Page 5]
Internet-Draft Label Generation Rulesets in XML September 2013
3. Requirements
To be able to fulfil the known utilisation of LGRs, the existing
corpus of published IDN tables were reviewed to prepare this
specification.
In addition, the requirements of ICANN's work to implement an LGR for
the DNS Root Zone [LGR-PROCEDURE] were also considered. In Section B
of that document, five specific requirements for an LGR methodology
were identified:
o The ability to identify a set of code points that are permitted.
o The ability to represent a list of variants, if any, for each code
point.
o A method of identifying code points that are related, using a tag.
o The ability to describe rules regarding the possible actions that
may be performed on the resulting label (such as blocked,
allocatable, etc.)
o The ability to describe rules that check for ill-formed
combinations across the whole label.
Finally, the syntax and rules in [RFC5892] were reviewed.
Davies & Freytag Expires March 27, 2014 [Page 6]
Internet-Draft Label Generation Rulesets in XML September 2013
4. LGR Format
An LGR is expressed as a well-formed XML Document[XML].
4.1. Namespace
The XML Namespace URI is [TBD].
4.2. Basic structure
The basic XML framework of the document is as follows:
...
Within the "lgr" element rests several sub-elements. Firstly is a
"meta" element that contains all meta-data associated with the IDN
table, such as its authorship, what it is used for, implementation
notes and references. This is followed by a "data" element that
contains the substantive code point data. Finally, an optional
"rules" element contains information on whole-label evaluation rules,
if any, along with any specific rules regarding the disposition of
computed variants.
...
...
...
A document MUST contain exactly one "lgr" element. Each "lgr"
element MUST contain exactly one "data" element, optionally preceded
by one "meta" element and optionally followed by one "rules" element.
4.3. Metadata
The "meta" element is used to express meta-data associated within the
LGR. It can be used to identify the author or relevant contact
person, explain the intended usage of the LGR, and provide
Davies & Freytag Expires March 27, 2014 [Page 7]
Internet-Draft Label Generation Rulesets in XML September 2013
implementation notes as well as references. The data contained
within is not required by software consuming the LGR in order to
calculate valid labels, or to calculate variants. However, the
"unicode-version" element should be used to identify whether a
consumer of the table has the right Unicode data to perform
operations on the table.
4.3.1. The version element
The "version" element is used to uniquely identify each version of
the LGR being represented. No specific format is required, but it is
RECOMMENDED that it be a numerical positive integer, which is
incremented with each revision of the file.
An example of a typical first edition of a document:
1
4.3.2. The date element
The "date" element is used to identify the date the LGR was posted.
The contents of this element MUST be a valid ISO 8601 date string as
described in[RFC3339].
Example of a date:
2009-11-01
4.3.3. The language element
The "language" element signals that the LGR is associated with a
specific language or script. The value of the language element must
be a valid language tag as described in[RFC5646]. The tag may simply
refer to a script if the LGR is not referring to a specific language.
Example of an English language LGR:
en
If the LGR applies to a specific script, rather than a language, the
"und" language tag should be used followed by the relevant [RFC5646]
script subtag. For example, for a Cyrillic script LGR:
und-Cyrl
If the LGR covers a specific set of multiple languages or scripts,
the language element can be repeated. However, for cases of
Davies & Freytag Expires March 27, 2014 [Page 8]
Internet-Draft Label Generation Rulesets in XML September 2013
insignificant admixture of characters from other scripts, the use of
a single "language" element identifying the predominant script is
RECOMMENDED. In the exceptional case where no script is predominant,
use Zyyy (Common):
und-Zyyy
Note that that for the particular case of Japanese, a script tag
"Japn" exists that matches the mixture of scripts used in writing
that language. The preferred language element would be:
und-Japn
4.3.4. The domain element
This optional element refers to a domain to which this policy is
applied.
example.com
There may be multiple tags used to reflect a list of
domains.
4.3.5. The description element
The "description" element is a free-form element that contains any
additional relevant description that is useful for the user in its
interpretation. Typically, this field contains authorship
information, as well as additional context on how the LGR was
formulated (such as citations and references), and how it has been
applied.
The element has an optional "type" attribute, which refers to the
media type of the enclosed data. If the description lacks a type
field, it will be assumed to be plain text.
The "type" attribute may be used to specify the encoding within
description element. The attribute should be a valid MIME type. If
supplied, it will be assumed the contents is content of that
encoding. Typical types would be "text/plain" or "text/html". "text/
plain" will be assumed if no type attribute is specified.
4.3.6. The validity-start and validity-end elements
The "validity-start" and "validity-end" elements are optional
elements that describe the time period from which the contents of the
LGR become valid (i.e. are used in registry policy), and the contents
of the LGR cease to be used.
Davies & Freytag Expires March 27, 2014 [Page 9]
Internet-Draft Label Generation Rulesets in XML September 2013
The times should conform to the format described in section 5.6 of
[RFC5646]. It may be comprised of a date, or a date and time stamp.
4.3.7. The unicode-version element
Whenever an IDN table depends on character properties from a given
version of the Unicode standard, the minimum version number MUST be
listed. If any software processing the table does not have the
minimum requisite version, it MUST NOT perform any operations
relating to whole-label evaluation. This is because some Unicode
code points may not have been assigned in an earlier version, leaving
properties for these code points undefined. It is RECOMMENDED to
only reference stable or immutable properties as other may change
between versions.
6.2
It is not necessary to include a "unicode-version" element for files
that do not make use of Unicode properties. Because Unicode has been
strictly additive from Version 1.1, the required minimum version for
the repertoire can be uniquely determined by checking the code point
values in any "cp" attributes against the "age" property in [UAX42].
4.3.8. The references element
An IDN table may define a list of references in an optional
"references" element. The references element contains any number of
"reference" elements, each containing an "id" attribute. It is
RECOMMENDED that the "id" attribute be an integer. The "id"
attributes MUST be unique. The value of the element may a citation
of a standard, dictionary or other specification, in any suitable
format. A reference can be associated with many elements contained
in the "data" or "rules" elements, by using an optional "ref"
attribute.
The Unicode Standard, Version 7.0
Big-5: Computer Chinese Glyph and Character
Code Mapping Table, Technical Report C-26, 1984
ISO/IEC 10646:2012 3rd edition
...
...
...
Davies & Freytag Expires March 27, 2014 [Page 10]
Internet-Draft Label Generation Rulesets in XML September 2013
A "ref" attribute may not occur on elements that are named references
to character classes and rules and on certain specific other element
types. See description of these elements below.
Davies & Freytag Expires March 27, 2014 [Page 11]
Internet-Draft Label Generation Rulesets in XML September 2013
5. Code Point Rules
The bulk of a label generation ruleset is a description of which set
of code points are eligible for a given label. For rulesets that
perform operations that result in potential variants, the code point-
level relationships between variants need to also be described.
The code point data is collected within a "data" element. Within
this element, a series of "char" and "range" elements describe
eligible code points, or ranges of code points, respectively.
Discrete permissible code points or code point sequences are declared
with a "char" element, e.g.
Ranges of permissible code points may be stipulated with a "range"
element, e.g.
The range is inclusive of the first and last code points. Whether
code points are specified individually or as part of a range makes no
difference in processing the data, and tools reading or writing the
XML format may not retain a distinction. All attributes defined for
a range element are as if applied to each code point within.
Code points must be expressed in uppercase, hexadecimal, and zero
padded to a minimum of 4 digits - in other words according to the
standard Unicode convention but without the prefix "U+". The
rationale for not allowing other encoding formats, including native
Unicode encoding in XML, is explored in[UAX42]. The XML conventions
used in this format, including the element and attribute names,
mirror this document where practical and reasonable to do so.
5.1. Sequences
A sequence of two or more code points may be specified in a LGR,, for
example, when defining the source for n:m variant mappings. Another
use of sequences would be in cases when the exact sequence of code
points is required to occur in order for the constituent elements to
be eligible, such as when a specific code point is only eligible when
preceded or followed by another code point. The following would
define the eligibility of the MIDDLE DOT (U+00B7) only when both
preceded and followed by the LATIN SMALL LETTER L (U+006C):
Davies & Freytag Expires March 27, 2014 [Page 12]
Internet-Draft Label Generation Rulesets in XML September 2013
As an alternative to using sequences to define a required context, a
"char" or "range" element may specify conditional context in a "when"
attribute as described below inSection 5.2.5. The latter method is
more flexible in that such conditional context is not limited to
specific code point. and allows prohibited, as well as required
context to be specified.
5.2. Variants
While most LGRs typically only determine code point eligibility,
others additionally specify a mapping of code points to other code
points, known as "variants". What constitutes a variant is a matter
of policy, and varies for each implementation. The following
examples are intended to demonstrate the syntax; they are not
necessarily typical.
5.2.1. Basic variants
Variants are specified using one of more "var" elements as children
of a "char" element.
For example, to map LATIN SMALL LETTER V (U+0076) as a variant of
LATIN SMALL LETTER U (U+0075):
A sequence of multiple code points can be specified as a variant of a
single code point. For example, the sequence of LATIN SMALL LETTER O
(U+006F) then LATIN SMALL LETTER E (U+0065) might hypothetically be
specified as a variant for an LATIN SMALL LETTER O WITH DIAERESIS
(U+00F6) as follows:
The "var" element specifies Variants in only one direction, even
though the variant relation is usually considered symmetric, that is,
if A is a variant of B then B is typically also a variant of A. The
format requires that the inverse of the variant is given explicitly
to fully specify symmetric variant relations in the IDN table. This
has the beneficial side effect of making the symmetry explicit:
Davies & Freytag Expires March 27, 2014 [Page 13]
Internet-Draft Label Generation Rulesets in XML September 2013
Both the source and target of a variant mapping may be sequences. As
it is not possible to specify variants for ranges, ranges cannot be
used for characters for which variant relations need to be defined.
5.2.2. Null variants
To specify a null variant, which is a variant string that maps to no
code point, use an empty cp attribute. For example, to mark a string
with a ZERO WIDTH NON-JOINER (U+200C) to the same string without the
ZERO WIDTH NON-JOINER:
The symmetric form of a null variant cannot be expressed in the IDN
table format.
5.2.3. Dispositions
Variants may be given dispositions. These describe the policy state
for a variant label that was generated using a particular variant.
The dispositions are the same as described below inSection 6.4.
A disposition may be of any value, but several conventional
dispositions are predefined below in Section 6.4 to encourage common
conventions in their application. If these values can represent
registry policy, they SHOULD be used.
Usually, if a variant label contains any instance of one of the
blocked variants the label would be blocked, but if it contained only
instances of allocated variants it could be allocated. See the
discussion about implied actions inSection 6.4.3.
5.2.4. The ref attribute
Reference information may optionally be specified by a "ref"
attribute, consisting of a space delimited sequence of reference
identifiers.
Davies & Freytag Expires March 27, 2014 [Page 14]
Internet-Draft Label Generation Rulesets in XML September 2013
This facility is typically used to give source information for
characters or variant relations. This information is ignored when
machine-processing an LGR. Specifying a "ref" attribute on a range
element is equivalent to specifying the same ref attribute on every
single code point of the range. The reference identifiers MUST match
those declared in the "references" element (see Section 4.3.8).
In addition to "char", "range" and "var" elements in the data
section, a ref attribute may be present for literals ("char" inside a
rule) as well as rules and class definitions, but not for named
references to them.
5.2.5. Conditional variants
Fundamentally, variants are mappings between two sequences of code
points. However, in some instances for a variant relationship to
exist, some context external to the code point sequence must be
considered. For example, in some cases the positional context
determines whether two code point sequences are variants of each
other. An example are the Arabic characters, which can have
different forms based on position. This positional context cannot be
solely derived from the code point, as the code point is the same for
the various forms.
To specify a conditional variant relationship the "when" attribute is
used. The variant relationship exists when the condition in the
"when" attribute is satisfied. A "not-when" attribute may be used
for conditions that must not be satisfied. The value of each "when"
or "not-when" attributes is a a parameterized context rule as
described below inSection 6.3.2.
Assuming the "rules" element contains suitably defined rules for
"arabic-isolated" and "arabic-final", the following example shows how
to mark ARABIC LETTER ALEF WITH WAVY HAMZA BELOW (U+0673) as a
variant of ARABIC LETTER ALEF WITH HAMZA BELOW (U+0625), but only
when it appears in isolated or final forms:
Only a single "when" or "not-when" attribute can be applied to any
Davies & Freytag Expires March 27, 2014 [Page 15]
Internet-Draft Label Generation Rulesets in XML September 2013
"var" element, however, multiple "var" elements using the same
mapping, but different "when" or "not-when" attributes may be
specified.
While currently Arabic is the only script known for which such
conditional variants are defined. there are other scripts, such as
Mongolian, which share the concept of positional forms. By requiring
explicit definitions for these rules, this mechanism can easily
handle any additional types of conditional variants that are
required.
As described in Section 5.1 a "when" or "not-when" attribute may also
be specified to any "char" element in the data section to define
required or prohibited contextual conditions under which a code point
is valid.
5.2.6. The comment attribute
Any "char", "range" or "variant" element may contain a comment in a
"comment" attribute. The contents of a comment attribute are free-
form plain text. Comments are ignored in machine processing of the
table. Comment attributes may also be placed on certain elements in
the "rules" section of the document, such as actions and literals
("char"), as well as definitions of classes and rules, but not named
references to them. Finally, in the metadata the "version" and
"reference" elements may have comment attributes to match the syntax
in [RFC3743]
5.3. Code point tagging
Typically, LGRs are used to explicitly designate allowable code
points, with any label with a code point not explicitly listed in the
LGR being considered an ineligible label according to the ruleset.
For more complex registry rules, there may be a need to discern code
points of certain types. This can be accomplished by applying a
"tag" attribute, and then filtering on results based on the tag using
whole label evaluation. Tag attributes may be of any value, and
multiple values are separated by space.
A simple example would be to label preferred code points (as
in[RFC3743]) by adding "preferred" to the tag, and then using a rule
such as shown in Section 6.3.1 to filter out labels that consist
entirely of such preferred code points.
Davies & Freytag Expires March 27, 2014 [Page 16]
Internet-Draft Label Generation Rulesets in XML September 2013
6. Whole Label and Context Evaluation
6.1. Basic concepts
The code points in a label sometimes need to satisfy context-based
rules, for example for the label to be considered valid, or to
satisfy the context for a variant mapping (see the description of the
"when" attribute inSection 6.3.2).
A Whole Label Evaluation rule (WLE) is applied to the whole label.
It is used to validate both original labels and variant labels
derived from them. A conditional context rules is a specialized form
of WLE specific to the context around a single code point or code
point sequence. For example, if a rule is referenced in the "when"
attribute of a variant mapping it is used to describe the conditional
context under which the particular variant mapping is defined to
exist.
Each rule is defined in a "rule" element. A rule may contain the
following as child elements:
o literal code points or code point sequences
o character classes, which defines sets of code points to be used
for context comparisons; and
o context operators, which define when character classes and
literals may appear
6.2. Character Classes
Character classes are sets of characters, that often share a
particular property. They can be specified in several ways:
1. by defining the property via matching a tag in the code point
data. All characters with the same tag attribute are part of the
same class.
2. by referencing one of the Unicode character properties defined in
the Unicode Character Database[UAX42];
3. by explicitly listing all the code points in the class; or
4. by defining the class as a combination of any number of these
definitions or other classes.
A character class has an optional "name" attribute, consisting of a
single identifier not containing spaces. If it is omitted, the class
Davies & Freytag Expires March 27, 2014 [Page 17]
Internet-Draft Label Generation Rulesets in XML September 2013
is anonymous and exists only inside the rule or combined class where
it is defined. A named character class is defined independently and
can be referenced by name by both rules and other character classes.
...
An empty "class" element with a name attribute is a reference to an
existing named class. Such an element MUST not have either "comment"
or "ref" attributes as those may only be placed on a class
definition.
6.2.1. Tag-based classes
The char element may contain a tag attribute that consists of one or
more space separated identifiers, for example:
This defines two tags for use with code point U+0061, the tag
"letter" and the tag "lower". Implicitly, this defines two named
character classes, the class "letter" and the class "lower", the
first with 0061 and 4E00 as elements and the latter with 0061, but
not 4E00 as an element. The document MUST not contain an explicitly
named class definition of the same name as an implicitly named tag-
derived class.
6.2.2. Unicode property based classes
A class is defined in terms of Unicode properties by giving the
Unicode property alias and the property value or property value
alias, separated by a colon.
The example above selects all characters for which the Unicode
canonical combining class (ccc) value is 9. This value of the ccc is
assigned in the to all characters that are viramas. The string "ccc"
is the short-alias for the canonical combining class, as defined in
the Unicode Character Database [UAX42].
Davies & Freytag Expires March 27, 2014 [Page 18]
Internet-Draft Label Generation Rulesets in XML September 2013
Unicode properties may, in principle, change between versions of the
Unicode Standard. However, the values assigned for a given version
are fixed. If Unicode Properties are used, a minimum Unicode version
MUST be declared in the header. (Note, some Unicode properties are
by definition stable across versions and do not change once
assigned.)
6.2.3. Explicitly declared classes
A class of code points may also be declared by listing the code
points that are a member of the class. This is useful when tagging
cannot be used because code points are not listed individually as
part of the eligible set of code points for the given LGR, for
example because they only occur in code point sequences.
To define a class in terms of an explicit list of code points:
This defines a class named "abc" containing the code points for
characters "a", "b" and "c". The ordering of the code points is not
material, but it is RECOMMENDED to list them in ascending order.
Range operators may also be used to represent any series of
consecutive code points. The same declaration can be made as
follows:
Range and code point declarations can be freely intermixed. A
shorthand notation exists where code points are directly represented
by space separated hexadecimal values, and ranges are represented by
a start and end value separated by a hyphen.
0061 0062-0063
would be a more streamlined expression of the same class using the
shorthand notation.
Davies & Freytag Expires March 27, 2014 [Page 19]
Internet-Draft Label Generation Rulesets in XML September 2013
6.2.4. Combined classes
Classes may be combined using logical operators for inversion, union,
intersection, difference and symmetric difference (exclusive-or).
+-------------------+------------------------------------+
| Logical Operation | Example |
+-------------------+------------------------------------+
| Inversion | |
+-------------------+------------------------------------+
| Union | |
| | |
| | |
| | |
+-------------------+------------------------------------+
| Intersection | |
| | |
| | |
| | |
+-------------------+------------------------------------+
| Difference | |
| | |
| | |
| | |
+-------------------+------------------------------------+
| Symmetric | |
| Difference | |
| | |
| | |
+-------------------+------------------------------------+
Combinations can be anonymous or named.
This creates a named class that represents the union of classes "xxx"
and "yyy", and which can be referenced in other classes or rules as
.
Note that the reference to a named class is always via a "class"
element, independent of how the character class was defined.
An "intersection", "symmetric-difference" or "difference" element
MUST contain precisely two, and a "not" element MUST contain
Davies & Freytag Expires March 27, 2014 [Page 20]
Internet-Draft Label Generation Rulesets in XML September 2013
precisely one "class" or one of the operator elements, while a
"union" element MUST contain two or more elements.
6.3. Whole Label and Context Rules
Each rule is comprised of a series of matching operators that must be
satisfied in order to determine whether a label meets a given
condition. Rules may reference other rules or character classes
defined elsewhere in the table.
6.3.1. The rule element
A matching rule is defined by a "rule" element, which contains
combinations character classes with literal code point sequences and
context operators contained in child elements. In evaluating a rule,
each child element is matched in order.
Rules may optionally be named using a "name" attribute containing a
single identifier string with no spaces. If the name attribute is
omitted, the rule is anonymous and may not incorporated by reference
into another rule or referenced by an action or "when" attribute.
A simple rule to match a label where all characters are members of
the class "preferred":
Rules are paired with explicit and implied actions, triggering these
actions when a rule matches a label. For example, a simple explicit
action for the rule shown above would be:
which has the effect of setting the policy disposition for a label
made up entirely of "preferred" code points, to "preferred".
Explicit actions are further discussed in Section 6.4 and use of
rules in conditional context for implied actions is discussed in
Section 5.2.5 and Section 6.4.3.
6.3.1.1. The count attribute
The number of times a specific character class or rule may appear in
an expression defined by a rule is given by the "count" attribute.
The attribute consists of a number, optionally followed by a "+"
sign. The number MUST be an integer of value 0 or higher, and gives
the number of times the class or rule may appear in matching. If the
Davies & Freytag Expires March 27, 2014 [Page 21]
Internet-Draft Label Generation Rulesets in XML September 2013
number is followed by a plus sign ("+"), it means that any number of
additional occurrences are allowed beyond the number stated.
Therefore, "1" would mean exactly one occurrence, whereas "1+" would
indicate one or more occurrences.
If no count attribute is specified, the number of occurrences is "1".
6.3.1.2. The choice element
For cases where several alternates could be chosen, the "choice"
element can encode a list of choices:
Each child element of a "choice" represents one alternative. The
first matching alternative determines the match for the choice
element. To express a choice where one alternative consists of a
sequence of elements, they can be wrapped in an anonymous rule.
6.3.1.3. Literal code point sequences
A literal code point sequence matches a single code point or a
sequence. It is defined by a "char" element, with the code point or
sequence to be matched given by the "cp" attribute. When used as a
literal, a "char" element may contain a "count" in addition to the
"cp" attribute, comments or references, but no conditional contexts
or child elements.
6.3.1.4. The any element
The "any" element matches any single code point. It may have a
"count" attribute. For an example see Section 6.3.1.8
The "any" element" may have neither a "comment" nor a "ref"
attribute.
6.3.1.5. The start and end elements
To match the beginning or end of a label, use the "start" or "end"
element.
Davies & Freytag Expires March 27, 2014 [Page 22]
Internet-Draft Label Generation Rulesets in XML September 2013
Start and end elements do not have a "count" or any other attribute.
When their use is not required, it is RECOMMENDED to us the "match"
attribute instead. One case where start or end elements are required
is when only some, but not all of the alternatives in a "choice" need
to match beginning or end of a label.
6.3.1.6. The match attribute
Whole Label Evaluation Rules in principle always apply to the entire
label, but in practice, for example to express a requirement to not
start a label with a digit, some rules do not need to cover the whole
label. Use attribute "match" with value "whole-label" to identify a
rule applicable to the entire label. For other rules use "from-
start", "anywhere" and "to-end" to define rules that need to match in
specific positions of the label. Certain parameterized context rules
(see Section 6.3.2) have a match attribute value of "context". The
defaults is "anywhere".
A "match" attribute present in the definition of a rule is ignored if
a rule is referenced by name inside another rule. An anonymous rule
may not have a "match" attribute.
6.3.1.7. The name attribute
Rules and classes may be named using a "name" attribute and can be
nested either directly or, if named, by reference.
Here's an example of a rule requiring that all labels be letters
(optionally followed by combining marks) and possibly digits. The
example shows rules and classes referenced by name.
Davies & Freytag Expires March 27, 2014 [Page 23]
Internet-Draft Label Generation Rulesets in XML September 2013
6.3.1.8. Example rule from IDNA2008
This sections shows an example of the whole label evaluation rule
from[RFC5892]forbidding the mixture of the Arabic-Indic and extended
Arabic-Indic digits in the same label.
The preceding example also demonstrates several instances of the use
of anonymous rules for grouping.
Davies & Freytag Expires March 27, 2014 [Page 24]
Internet-Draft Label Generation Rulesets in XML September 2013
6.3.2. Parameterized Context or When Rule
A special type of rule provides a context for evaluating the validity
of a code point or variant mapping. This rule is invoked by the
"when" attribute described inSection 5.2.5. For a context rule, the
match attribute is normally "context". Such "when rules" contain a
special place holder, represented by a "match" element (not to be
confused with the "match" attribute). When evaluated, the "match"
element is replaced by a literal corresponding to the "cp" attribute
of the element for which the rule in its "when" attribute is being
evaluated.
For example, the Greek lower numeral sign is invalid if not
immediately preceding a character in the Greek script. This is most
naturally addressed with a when rule using look-ahead:
...
In evaluating this rule, the "match" element is treated as if it was
replaced by a literal
The action implied by a context rule is always a disposition of
"invalid" if the when rule is not matched. Unlike other rules, these
rules may not be associated with arbitrary actions via "action"
elements.
6.3.2.1. The look-behind and look-ahead elements
Context rules use the "look-behind" and "look-ahead" elements to
define context before and after the code point sequence matched by
the "match" element. If the "match" element is omitted, neither the
"look-behind" nor the "look-ahead" element may be present.
Here is an example of a rule that defines an "initial" context for an
Arabic code point:
Davies & Freytag Expires March 27, 2014 [Page 25]
Internet-Draft Label Generation Rulesets in XML September 2013
A when rule contains any combination of "look-ahead" , "match" and
"look-behind" elements in that order. Each of these elements occurs
at most once, and none have a "count" attribute. If a context rule
contains a look-ahead or look-behind element, it MUST contain a
"match" element. If a "match" element is present the rule MUST have
a "match" attribute with a value of "context".
6.3.2.2. Omitting the match element
If the "match" element is omitted, the evaluation of the context rule
is not tied to the position of the code point or sequence associated
with the "when" attribute.
Katakana middle dot is invalid in any label not containing at least
one Japanese character anywhere in the label. Because this
requirement is independent of the position of the middle dot, the
rule does not require a "match" element and the "match" attribute is
"anywhere".
Davies & Freytag Expires March 27, 2014 [Page 26]
Internet-Draft Label Generation Rulesets in XML September 2013
The Katakana middle dot is used only with Han, Katakana or Hiragana.
The "when" rule requires that at least one code point in the label is
in one of these scripts. (Note that the Katakana middle dot itself
is of script Common).
6.4. Action elements
The purpose of a rule is to trigger a specific action. Often, the
action simply results in blocking a label that does not match a rule.
An example of an action invalidating a label:
An action may contain precisely one "match" or "not-match" attribute,
but not both. Because rules may be compound rules that contain other
rules, only a single rule may be named as the value of the "match" or
"not-match" attribute.
An action may contain either one of a set of optional attributes
matching the variant disposition from the "disposition" attributed of
any "var" element used in generating the variant label being
evaluated. Assuming all variants have been given suitable
"disposition" attributes of "blocked" or "allocate" and that a rule
is defined matching labels consisting entirely of code points tagged
as "preferred" the following actions evaluate the disposition for the
variant label:
The first action matches any variant label for which at least one of
the code point variants carries the disposition "blocked". The
second matches any variant label for which all of the code point
variants carry the disposition "allocate". Neither action matches a
label that is not a variant label. If necessary repeat an action so
it applies to an ordinary label:
Davies & Freytag Expires March 27, 2014 [Page 27]
Internet-Draft Label Generation Rulesets in XML September 2013
6.4.1. Recommended Disposition Values
The precise nature of the policy action taken in response to a
disposition and the name of the corresponding "disposition"
attributes are only partially defined here. It is strongly
RECOMMENDED to use the following actions only with their conventional
sense.
invalid The resulting string is not a valid label. This disposition
may be assigned implicitly, seeSection 6.4.3.
block The resulting string is a valid label, but should be blocked
from registration. This would typically apply for a derived
variant that has no practical use, such as blocking confusingly
similar by undesirable variants.
allocate The resulting string should be reserved for use by the same
operator of the origin string, but not automatically allocated
for use.
activate The resulting string should be activated for use. (This is
the typical default action if no tagging is used, and is known
as a "preferred" variant in [RFC3743])
6.4.2. Precedence
Actions are applied in the order of their appearance in the file.
This defines their relative precedence. The first action for which
the rule is matched or not-matched as required for a particular label
defines the disposition for that label. The conventional order of
precedence for the actions defined here is "invalid", "block",
"allocate", "activate". In order to define a different order of
precedence or when additional actions are defined, list the actions
in the appropriate order.
6.4.3. Implied actions
The context, or "when" rules carry an implied action with a
disposition of "invalid". These rules are evaluated at the time a
label's code points and variants are checked for validity
(seeSection 8) . In other words, before any whole-label evaluation
rules and with higher precedence. The context rules for variant
mappings are evaluated when variants are generated and / or when
variant tables are made symmetric and transitive. They have an
implied action with a disposition of "invalid" which means a putative
variant mapping doesn't exist in the given context.
Note that such non-existing variant mapping is different from a
Davies & Freytag Expires March 27, 2014 [Page 28]
Internet-Draft Label Generation Rulesets in XML September 2013
blocked variant, which is variant code point mapping that exists, but
results in a label that may not be allocated.
Variant mappings may be given a disposition attribute . An implied
action relates these to the disposition for the entire variant label.
For example, a variant label in which any variant code point is a
blocked code point variant is blocked. The default order of
precedence for evaluating dispositions is as given above. The
default precedence applies if no actions are defined that match
specific variant dispositions.
Davies & Freytag Expires March 27, 2014 [Page 29]
Internet-Draft Label Generation Rulesets in XML September 2013
7. Example table
The following presents a sample XML LGR showing a near complete
collection of most of the elements and attributes defined in this
specification in somewhat typical context.
1
2010-01-01
sv
example
Swedish
examples institute.
]]>
The Unicode Standard 6.3
RFC 5892
Big-5: Computer Chinese Glyph and Character
Code Mapping Table, Technical Report C-26, 1984
Davies & Freytag Expires March 27, 2014 [Page 30]
Internet-Draft Label Generation Rulesets in XML September 2013
006E 0070-0078
Davies & Freytag Expires March 27, 2014 [Page 31]
Internet-Draft Label Generation Rulesets in XML September 2013
8. Processing a label against an LGR
8.1. Determining eligibility for a label
In order to use a table to test a specific domain label for
membership in the LGR, a consumer of the LGR must iterate through
each code point within a given U-label, and test that each code point
is a member of the LGR. If any code point is not a member of the
LGR, it shall be deemed as not eligible in accordance with the table.
A code point is deemed a member of the table when it is listed with
the "char" element, and all necessary condition listed in "when" or
"not-when" attributes are correctly satisfied.
8.2. Determining variants for a label
For a given eligible label, the set of variants is deemed to be each
possible permutation of "var" elements, whereby all "when" and "not-
when" attributes are correctly satisfied for each var element in the
given permutation and all applicable whole label evaluation rules are
satisfied as follows:
o Create each possible permutation of a label, by substituting each
code point or code point sequence in turn by any defined variant
mapping
o Apply variant mappings with "when" or "not-when" attributes only
if the conditions are satisfied
o Record each of the "disposition" values on the variant mappings
used in creating a given variant label
o Evaluate each variant against any actions for which the
disposition is "invalid", remove any that satisfy the conditions.
8.3. Determining a disposition for a label or variant label
For a given label, the disposition for the is determined by
evaluating in order of their appearance all actions for which the
label or variant label satisfies the conditions.
o For any label, the "disposition" value for the first action
applies, for which the label matches or doesn't match the whole
label evaluation rule, given in the "match" or "not-match"
attribute for that action,
o For any variant label, the "disposition" value for the first
action applies, for which the label matches or doesn't match the
Davies & Freytag Expires March 27, 2014 [Page 32]
Internet-Draft Label Generation Rulesets in XML September 2013
whole label evaluation rule, given in the "match" or "not-match"
attribute, and for which any or all of the recorded variant
dispositions match the conditions for that action.
o For any remaining variant label, assign the variant label the
disposition matching the most restrictive disposition recorded.
for any of its variants The order from most restrictive to least
is "invalid", "blocked", "allocated", "active".
o Variants dispositions outside the predefined default set, and for
which no action is defined are ignored.
Davies & Freytag Expires March 27, 2014 [Page 33]
Internet-Draft Label Generation Rulesets in XML September 2013
9. Conversion between other formats
Both [RFC3743] and [RFC4290] provide different grammars for IDN
tables. These formats are unable to fully cater for the increased
requirements of contemporary IDN variant policies.
This specification is a superset of functionality provided by these
IDN table formats, thus any table expressed in those formats can be
expressed in this format. Automated conversion can be conducted
between tables conformant with the grammar specified in each
document.
Davies & Freytag Expires March 27, 2014 [Page 34]
Internet-Draft Label Generation Rulesets in XML September 2013
10. IANA Considerations
This document does not specify any IANA actions.
Davies & Freytag Expires March 27, 2014 [Page 35]
Internet-Draft Label Generation Rulesets in XML September 2013
11. Security Considerations
There are no security considerations for this memo.
Davies & Freytag Expires March 27, 2014 [Page 36]
Internet-Draft Label Generation Rulesets in XML September 2013
12. References
[LGR-PROCEDURE]
Internet Corporation for Assigned Names and Numbers,
"Procedure to Develop and Maintain the Label Generation
Rules for the Root Zone in Respect of IDNA Labels".
[RFC3339] Klyne, G., Ed. and C. Newman, "Date and Time on the
Internet: Timestamps", RFC 3339, July 2002.
[RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
Engineering Team (JET) Guidelines for Internationalized
Domain Names (IDN) Registration and Administration for
Chinese, Japanese, and Korean", RFC 3743, April 2004.
[RFC4290] Klensin, J., "Suggested Practices for Registration of
Internationalized Domain Names (IDN)", RFC 4290,
December 2005.
[RFC5564] El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman,
"Linguistic Guidelines for the Use of the Arabic Language
in Internet Domains", RFC 5564, February 2010.
[RFC5646] Phillips, A. and M. Davis, "Tags for Identifying
Languages", BCP 47, RFC 5646, September 2009.
[RFC5892] Faltstrom, P., "The Unicode Code Points and
Internationalized Domain Names for Applications (IDNA)",
RFC 5892, August 2010.
[UAX42] Unicode Consortium, "Unicode Character Database in XML".
[XML] "Extensible Markup Language (XML) 1.0".
Davies & Freytag Expires March 27, 2014 [Page 37]
Internet-Draft Label Generation Rulesets in XML September 2013
Appendix A. RelaxNG Schema
[TODO: this needs to be updated to reflect additions to the syntax.]
Davies & Freytag Expires March 27, 2014 [Page 38]
Internet-Draft Label Generation Rulesets in XML September 2013
Davies & Freytag Expires March 27, 2014 [Page 39]
Internet-Draft Label Generation Rulesets in XML September 2013
Davies & Freytag Expires March 27, 2014 [Page 40]
Internet-Draft Label Generation Rulesets in XML September 2013
Davies & Freytag Expires March 27, 2014 [Page 41]
Internet-Draft Label Generation Rulesets in XML September 2013
Davies & Freytag Expires March 27, 2014 [Page 42]
Internet-Draft Label Generation Rulesets in XML September 2013
Davies & Freytag Expires March 27, 2014 [Page 43]
Internet-Draft Label Generation Rulesets in XML September 2013
Davies & Freytag Expires March 27, 2014 [Page 44]
Internet-Draft Label Generation Rulesets in XML September 2013
Appendix B. Acknowledgements
This format builds upon the work on documenting IDN tables by many
different registry operators. Notably, a comprehensive language
table for Chinese, Japanese and Korean was developed by the "Joint
Engineering Team" [RFC3743] that is the basis of many registry
policies; and a set of guidelines for Arabic script registrations
[RFC5564] was published by the Arabic-language community.
Contributions that have shaped this document have been provided by
Francisco Arias, Mark Davis, Nicholas Ostler, Thomas Roessler, Steve
Sheng, Michel Suignard, John Yunker and Andrew Sullivan.
Davies & Freytag Expires March 27, 2014 [Page 45]
Internet-Draft Label Generation Rulesets in XML September 2013
Appendix C. Editorial Notes
This appendix to be removed prior to final publication.
C.1. Known Issues and Future Work
o A method of specifying the origin URI for a table, and an
expiration or refresh policy, as meta-data may be a useful way to
declare how the table will be updated.
C.2. Change History
-00 Initial draft.
-01 Add an XML Namespace, and fix other XML nits. Add support for
sequences of code points. Improve on consistently using Unicode
nomenclature.
-02 Add support for validity periods.
-03 Incorporate requirements from the Label Generation Ruleset
Procedure for the DNS Root Zone. These requirements include a
detailed grammar for specifying whole-label variants, and the
ability to explicitly declare of the actions associated with a
specific variant. The document also consistently applies the
term "Label Generation Ruleset", rather than "IDN table", to
reflect the policy term now being used to describe these.
-04 Support reference information per [RFC3743]. Update description
in response to feedback. Extend the context rules to "char"
elements and allow for inverse matching ("not-when"). Extend
the description of label processing and implied actions, and
allow for actions that reference disposition attributes on any
or all variant mappings used in the generation of a variant
label.
Davies & Freytag Expires March 27, 2014 [Page 46]
Internet-Draft Label Generation Rulesets in XML September 2013
Authors' Addresses
Kim Davies
Internet Corporation for Assigned Names and Numbers
12025 Waterfront Drive
Los Angeles, CA 90094
US
Phone: +1 310 301 5800
Email: kim.davies@icann.org
URI: http://www.iana.org/
Asmus Freytag
ASMUS Inc.
Email: asmus@unicode.org
Davies & Freytag Expires March 27, 2014 [Page 47]