Diff: rfc9485.original

	rfc9485.original	rfc9485.txt


	Network Working Group C. Bormann	Internet Engineering Task Force (IETF) C. Bormann
	Internet-Draft Universität Bremen TZI	Request for Comments: 9485 Universität Bremen TZI
	Intended status: Standards Track T. Bray	Category: Standards Track T. Bray
	Expires: 31 December 2023 Textuality	ISSN: 2070-1721 Textuality
	29 June 2023	October 2023


	I-Regexp: An Interoperable Regexp Format	I-Regexp: An Interoperable Regular Expression Format
	draft-ietf-jsonpath-iregexp-08

	Abstract	Abstract


	This document specifies I-Regexp, a flavor of regular expressions	This document specifies I-Regexp, a flavor of regular expression that
	that is limited in scope with the goal of interoperation across many	is limited in scope with the goal of interoperation across many
	different regular-expression libraries.	different regular expression libraries.

	About This Document

	This note is to be removed before publishing as an RFC.

	Status information for this document may be found at
	https://datatracker.ietf.org/doc/draft-ietf-jsonpath-iregexp/.

	Discussion of this document takes place on the JSONPath Working Group
	mailing list (mailto:JSONPath@ietf.org), which is archived at
	https://mailarchive.ietf.org/arch/browse/JSONPath/. Subscribe at
	https://www.ietf.org/mailman/listinfo/JSONPath/.

	Source for this draft and an issue tracker can be found at
	https://github.com/ietf-wg-jsonpath/iregexp.

	Status of This Memo	Status of This Memo


	This Internet-Draft is submitted in full conformance with the	This is an Internet Standards Track document.
	provisions of BCP 78 and BCP 79.

	Internet-Drafts are working documents of the Internet Engineering
	Task Force (IETF). Note that other groups may also distribute
	working documents as Internet-Drafts. The list of current Internet-
	Drafts is at https://datatracker.ietf.org/drafts/current/.


	Internet-Drafts are draft documents valid for a maximum of six months	This document is a product of the Internet Engineering Task Force
	and may be updated, replaced, or obsoleted by other documents at any	(IETF). It represents the consensus of the IETF community. It has
	time. It is inappropriate to use Internet-Drafts as reference	received public review and has been approved for publication by the
	material or to cite them other than as "work in progress."	Internet Engineering Steering Group (IESG). Further information on
		Internet Standards is available in Section 2 of RFC 7841.


	This Internet-Draft will expire on 31 December 2023.	Information about the current status of this document, any errata,
		and how to provide feedback on it may be obtained at
		https://www.rfc-editor.org/info/rfc9485.

	Copyright Notice	Copyright Notice

	Copyright (c) 2023 IETF Trust and the persons identified as the	Copyright (c) 2023 IETF Trust and the persons identified as the
	document authors. All rights reserved.	document authors. All rights reserved.

	This document is subject to BCP 78 and the IETF Trust's Legal	This document is subject to BCP 78 and the IETF Trust's Legal

	Provisions Relating to IETF Documents (https://trustee.ietf.org/	Provisions Relating to IETF Documents
	license-info) in effect on the date of publication of this document.	(https://trustee.ietf.org/license-info) in effect on the date of
	Please review these documents carefully, as they describe your rights	publication of this document. Please review these documents
	and restrictions with respect to this document. Code Components	carefully, as they describe your rights and restrictions with respect
	extracted from this document must include Revised BSD License text as	to this document. Code Components extracted from this document must
	described in Section 4.e of the Trust Legal Provisions and are	include Revised BSD License text as described in Section 4.e of the
	provided without warranty as described in the Revised BSD License.	Trust Legal Provisions and are provided without warranty as described
		in the Revised BSD License.

	Table of Contents	Table of Contents


	1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2	1. Introduction
	1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 3	1.1. Terminology
	2. Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 3	2. Objectives
	3. I-Regexp Syntax . . . . . . . . . . . . . . . . . . . . . . . 4	3. I-Regexp Syntax
	3.1. Checking Implementations . . . . . . . . . . . . . . . . 5	3.1. Checking Implementations
	4. I-Regexp Semantics . . . . . . . . . . . . . . . . . . . . . 5	4. I-Regexp Semantics
	5. Mapping I-Regexp to Regexp Dialects . . . . . . . . . . . . . 5	5. Mapping I-Regexp to Regexp Dialects
	5.1. Multi-Character Escapes . . . . . . . . . . . . . . . . . 6	5.1. Multi-Character Escapes
	5.2. XSD Regexps . . . . . . . . . . . . . . . . . . . . . . . 6	5.2. XSD Regexps
	5.3. ECMAScript Regexps . . . . . . . . . . . . . . . . . . . 6	5.3. ECMAScript Regexps
	5.4. PCRE, RE2, Ruby Regexps . . . . . . . . . . . . . . . . . 7	5.4. PCRE, RE2, and Ruby Regexps
	6. Motivation and Background . . . . . . . . . . . . . . . . . . 7	6. Motivation and Background
	6.1. Implementing I-Regexp . . . . . . . . . . . . . . . . . . 7	6.1. Implementing I-Regexp
	7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 8	7. IANA Considerations
	8. Security considerations . . . . . . . . . . . . . . . . . . . 8	8. Security Considerations
	9. References . . . . . . . . . . . . . . . . . . . . . . . . . 9	9. References
	9.1. Normative References . . . . . . . . . . . . . . . . . . 9	9.1. Normative References
	9.2. Informative References . . . . . . . . . . . . . . . . . 10	9.2. Informative References
	Appendix A. Regexps and Similar Constructs in Recent Published	Acknowledgements
	RFCs . . . . . . . . . . . . . . . . . . . . . . . . . . 10	Authors' Addresses
	Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 12
	Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12

	1. Introduction	1. Introduction

	This specification describes an interoperable regular expression	This specification describes an interoperable regular expression

	("regexp") flavor, I-Regexp.	(abbreviated as "regexp") flavor, I-Regexp.

	I-Regexp does not provide advanced regular expression features such	I-Regexp does not provide advanced regular expression features such
	as capture groups, lookahead, or backreferences. It supports only a	as capture groups, lookahead, or backreferences. It supports only a
	Boolean matching capability, i.e., testing whether a given regular	Boolean matching capability, i.e., testing whether a given regular
	expression matches a given piece of text.	expression matches a given piece of text.

	I-Regexp supports the entire repertoire of Unicode characters	I-Regexp supports the entire repertoire of Unicode characters
	(Unicode scalar values); both the I-Regexp strings themselves and the	(Unicode scalar values); both the I-Regexp strings themselves and the
	strings they are matched against are sequences of Unicode scalar	strings they are matched against are sequences of Unicode scalar

	values (often represented in UTF-8 encoding form [STD63] for	values (often represented in UTF-8 encoding form [RFC3629] for
	interchange).	interchange).


	I-Regexp is a subset of XSD regular expressions [XSD-2].	I-Regexp is a subset of XML Schema Definition (XSD) regular
		expressions [XSD-2].

	This document includes guidance for converting I-Regexps for use with	This document includes guidance for converting I-Regexps for use with
	several well-known regular expression idioms.	several well-known regular expression idioms.

	The development of I-Regexp was motivated by the work of the JSONPath	The development of I-Regexp was motivated by the work of the JSONPath

	Working Group. The Working Group wanted to include in its	Working Group (WG). The WG wanted to include support for the use of
	specification [I-D.ietf-jsonpath-base] support for the use of regular	regular expressions in JSONPath filters in its specification
	expressions in JSONPath filters, but was unable to find a useful	[JSONPATH-BASE], but was unable to find a useful specification for
	specification for regular expressions which would be interoperable	regular expressions that would be interoperable across the popular
	across the popular libraries.	libraries.

	1.1. Terminology	1.1. Terminology


	This document uses the abbreviation "regexp" for what are usually	This document uses the abbreviation "regexp" for what is usually
	called regular expressions in programming. "I-Regexp" is used as a	called a "regular expression" in programming. The term "I-Regexp" is
	noun meaning a character string (sequence of Unicode scalar values)	used as a noun meaning a character string (sequence of Unicode scalar
	that conforms to the requirements in this specification; the plural	values) that conforms to the requirements in this specification; the
	is "I-Regexps".	plural is "I-Regexps".


	This specification uses Unicode terminology. A good entry point into	This specification uses Unicode terminology; a good entry point is
	that is provided by [UNICODE-GLOSSARY].	provided by [UNICODE-GLOSSARY].

	The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",	The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
	"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and	"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
	"OPTIONAL" in this document are to be interpreted as described in	"OPTIONAL" in this document are to be interpreted as described in
	BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all	BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
	capitals, as shown here.	capitals, as shown here.

	The grammatical rules in this document are to be interpreted as ABNF,	The grammatical rules in this document are to be interpreted as ABNF,
	as described in [RFC5234] and [RFC7405], where the "characters" of	as described in [RFC5234] and [RFC7405], where the "characters" of
	Section 2.3 of [RFC5234] are Unicode scalar values.	Section 2.3 of [RFC5234] are Unicode scalar values.

	2. Objectives	2. Objectives

	I-Regexps should handle the vast majority of practical cases where a	I-Regexps should handle the vast majority of practical cases where a

	matching regexp is needed in a data model specification or a query	matching regexp is needed in a data-model specification or a query-
	language expression.	language expression.


	The editors of this document conducted a survey of the regexp syntax	At the time of writing, an editor of this document conducted a survey
	used in published RFCs. All examples found there should be covered	of the regexp syntax used in recently published RFCs. All examples
	by I-Regexps, both syntactically and with their intended semantics.	found there should be covered by I-Regexps, both syntactically and
	The exception is the use of multi-character escapes, for which	with their intended semantics. The exception is the use of multi-
	workaround guidance is provided in Section 5.	character escapes, for which workaround guidance is provided in
		Section 5.

	3. I-Regexp Syntax	3. I-Regexp Syntax

	An I-Regexp MUST conform to the ABNF specification in Figure 1.	An I-Regexp MUST conform to the ABNF specification in Figure 1.

	i-regexp = branch *( "|" branch )	i-regexp = branch *( "|" branch )
	branch = *piece	branch = *piece
	piece = atom [ quantifier ]	piece = atom [ quantifier ]
	quantifier = ( "*" / "+" / "?" ) / range-quantifier	quantifier = ( "*" / "+" / "?" ) / range-quantifier
	range-quantifier = "{" QuantExact [ "," [ QuantExact ] ] "}"	range-quantifier = "{" QuantExact [ "," [ QuantExact ] ] "}"
	QuantExact = 1*%x30-39 ; '0'-'9'	QuantExact = 1*%x30-39 ; '0'-'9'

	atom = NormalChar / charClass / ( "(" i-regexp ")" )	atom = NormalChar / charClass / ( "(" i-regexp ")" )
	NormalChar = ( %x00-27 / "," / "-" / %x2F-3E ; '/'-'>'	NormalChar = ( %x00-27 / "," / "-" / %x2F-3E ; '/'-'>'
	/ %x40-5A ; '@'-'Z'	/ %x40-5A ; '@'-'Z'
	/ %x5E-7A ; '^'-'z'	/ %x5E-7A ; '^'-'z'

	/ %x7E-10FFFF )	/ %x7E-D7FF ; skip surrogate code points
		/ %xE000-10FFFF )
	charClass = "." / SingleCharEsc / charClassEsc / charClassExpr	charClass = "." / SingleCharEsc / charClassEsc / charClassExpr
	SingleCharEsc = "\" ( %x28-2B ; '('-'+'	SingleCharEsc = "\" ( %x28-2B ; '('-'+'
	/ "-" / "." / "?" / %x5B-5E ; '['-'^'	/ "-" / "." / "?" / %x5B-5E ; '['-'^'
	/ %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}'	/ %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}'
	)	)
	charClassEsc = catEsc / complEsc	charClassEsc = catEsc / complEsc
	charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"	charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"
	CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc	CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc
	CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z'	CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z'

	/ %x5E-10FFFF ) / SingleCharEsc	/ %x5E-D7FF ; skip surrogate code points
		/ %xE000-10FFFF ) / SingleCharEsc
	catEsc = %s"\p{" charProp "}"	catEsc = %s"\p{" charProp "}"
	complEsc = %s"\P{" charProp "}"	complEsc = %s"\P{" charProp "}"
	charProp = IsCategory	charProp = IsCategory
	IsCategory = Letters / Marks / Numbers / Punctuation / Separators /	IsCategory = Letters / Marks / Numbers / Punctuation / Separators /
	Symbols / Others	Symbols / Others
	Letters = %s"L" [ ( %s"l" / %s"m" / %s"o" / %s"t" / %s"u" ) ]	Letters = %s"L" [ ( %s"l" / %s"m" / %s"o" / %s"t" / %s"u" ) ]
	Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]	Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]
	Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]	Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]
	Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f'	Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f'
	/ %s"i" / %s"o" / %s"s" ) ]	/ %s"i" / %s"o" / %s"s" ) ]
	Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]	Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]
	Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]	Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]
	Others = %s"C" [ ( %s"c" / %s"f" / %s"n" / %s"o" ) ]	Others = %s"C" [ ( %s"c" / %s"f" / %s"n" / %s"o" ) ]

	Figure 1: I-Regexp Syntax in ABNF	Figure 1: I-Regexp Syntax in ABNF

	As an additional restriction, charClassExpr is not allowed to match	As an additional restriction, charClassExpr is not allowed to match

	[^], which according to this grammar would parse as a positive	[^], which, according to this grammar, would parse as a positive
	character class containing the single character ^.	character class containing the single character ^.


	This is essentially XSD regexp without character class subtraction,	This is essentially an XSD regexp without:
	without multi-character escapes such as \s, \S, and \w, and without
	Unicode blocks.	* character class subtraction,

		* multi-character escapes such as \s, \S, and \w, and

		* Unicode blocks.

	An I-Regexp implementation MUST be a complete implementation of this	An I-Regexp implementation MUST be a complete implementation of this
	limited subset. In particular, full support for the Unicode	limited subset. In particular, full support for the Unicode

	functionality defined in this specification is REQUIRED; the	functionality defined in this specification is REQUIRED. The
	implementation MUST NOT limit itself to 7- or 8-bit character sets	implementation:
	such as ASCII and MUST support the Unicode character property set in
	character classes.	* MUST NOT limit itself to 7- or 8-bit character sets such as ASCII,
		and

		* MUST support the Unicode character property set in character
		classes.

	3.1. Checking Implementations	3.1. Checking Implementations

	A _checking_ I-Regexp implementation is one that checks a supplied	A _checking_ I-Regexp implementation is one that checks a supplied
	regexp for compliance with this specification and reports any	regexp for compliance with this specification and reports any
	problems. Checking implementations give their users confidence that	problems. Checking implementations give their users confidence that

	they didn't accidentally insert non-interoperable syntax, so checking	they didn't accidentally insert syntax that is not interoperable, so
	is RECOMMENDED. Exceptions to this rule may be made for low-effort	checking is RECOMMENDED. Exceptions to this rule may be made for
	implementations that map I-Regexp to another regexp library by simple	low-effort implementations that map I-Regexp to another regexp
	steps such as performing the mapping operations discussed in	library by simple steps such as performing the mapping operations
	Section 5; here, the effort needed to do full checking may dwarf the	discussed in Section 5. Here, the effort needed to do full checking
	rest of the implementation effort. Implementations SHOULD document	might dwarf the rest of the implementation effort. Implementations
	whether they are checking or not.	SHOULD document whether or not they are checking.

	Specifications that employ I-Regexp may want to define in which cases	Specifications that employ I-Regexp may want to define in which cases
	their implementations can work with a non-checking I-Regexp	their implementations can work with a non-checking I-Regexp
	implementation and when full checking is needed, possibly in the	implementation and when full checking is needed, possibly in the
	process of defining their own implementation classes.	process of defining their own implementation classes.
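As a rough illustration of what a checking implementation might do, the following Python sketch is a small recursive-descent recognizer for the ABNF in Figure 1. It is not part of the specification: the function name check_iregexp and all helper names are hypothetical, only the syntax of Figure 1 plus the [^] restriction is checked (no XSD-level semantic checks such as range ordering), and treating overly deep nesting as a failure is an assumption of this sketch.

```python
import re

SINGLE_ESC = set("()*+-.?[\\]^nrt{|}")   # characters allowed after "\"
NOT_NORMAL = set("()*+.?[\\]{|}")        # excluded from NormalChar
NOT_CCCHAR = set("-[\\]")                # excluded from a bare CCchar
RANGE_Q = re.compile(r"\{[0-9]+(,[0-9]*)?\}")
PROP_ESC = re.compile(  # catEsc / complEsc with the IsCategory names
    r"\\[pP]\{(L[lmotu]?|M[cen]?|N[dlo]?|P[c-fios]?|Z[lps]?|S[ckmo]?|C[cfno]?)\}")

class NotIRegexp(ValueError):
    """Raised internally when the input leaves the grammar."""

def check_iregexp(s):
    """Return True if s as a whole matches the i-regexp production."""
    try:
        return _regexp(s, 0) == len(s)
    except (NotIRegexp, RecursionError):  # very deep nesting is rejected too
        return False

def _scalar(ch):
    return not 0xD800 <= ord(ch) <= 0xDFFF   # surrogates are excluded

def _regexp(s, i):                    # branch *( "|" branch )
    i = _branch(s, i)
    while s[i:i + 1] == "|":
        i = _branch(s, i + 1)
    return i

def _branch(s, i):                    # *piece, piece = atom [ quantifier ]
    while i < len(s) and s[i] not in "|)":
        i = _quantifier(s, _atom(s, i))
    return i

def _atom(s, i):
    ch = s[i]
    if ch == "(":
        i = _regexp(s, i + 1)
        if s[i:i + 1] != ")":
            raise NotIRegexp("unclosed group")
        return i + 1
    if ch == "[":
        return _class_expr(s, i + 1)
    if ch == "\\":
        return _escape(s, i)
    if ch == "." or (ch not in NOT_NORMAL and _scalar(ch)):
        return i + 1
    raise NotIRegexp(f"unexpected {ch!r} at offset {i}")

def _quantifier(s, i):
    if s[i:i + 1] in ("*", "+", "?"):
        return i + 1
    if s[i:i + 1] == "{":
        m = RANGE_Q.match(s, i)
        if m is None:
            raise NotIRegexp("malformed range quantifier")
        return m.end()
    return i

def _escape(s, i):                    # charClassEsc or SingleCharEsc
    m = PROP_ESC.match(s, i)
    if m:
        return m.end()
    if s[i + 1:i + 2] in SINGLE_ESC:
        return i + 2
    raise NotIRegexp(f"bad escape at offset {i}")

def _class_expr(s, i):                # i is just past the opening "["
    if s[i:i + 1] == "^":             # taken as negation, so "[^]" fails below
        i += 1
    i = i + 1 if s[i:i + 1] == "-" else _cce1(s, i)   # first item is required
    while i < len(s) and s[i] not in "-]":
        i = _cce1(s, i)
    if s[i:i + 1] == "-":             # optional trailing literal hyphen
        i += 1
    if s[i:i + 1] != "]":
        raise NotIRegexp("malformed character class")
    return i + 1

def _cce1(s, i):                      # ( CCchar [ "-" CCchar ] ) / charClassEsc
    m = PROP_ESC.match(s, i)
    if m:
        return m.end()
    i = _ccchar(s, i)
    if s[i:i + 1] == "-" and s[i + 1:i + 2] not in ("", "]"):
        i = _ccchar(s, i + 1)         # a range such as a-z
    return i

def _ccchar(s, i):
    ch = s[i:i + 1]
    if ch == "\\" and s[i + 1:i + 2] in SINGLE_ESC:
        return i + 2
    if ch and ch not in NOT_CCCHAR and _scalar(ch):
        return i + 1
    raise NotIRegexp(f"bad class character at offset {i}")
```

With this sketch, check_iregexp("[0-9]+(\\.[0-9]+)?") accepts, while inputs using syntax outside the subset, such as "\\d+" or "(?:a)", are reported as non-conforming.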

	4. I-Regexp Semantics	4. I-Regexp Semantics


	This syntax is a subset of that of [XSD-2]. Implementations which	This syntax is a subset of that of [XSD-2]. Implementations that
	interpret I-Regexps MUST yield Boolean results as specified in	interpret I-Regexps MUST yield Boolean results as specified in
	[XSD-2]. (See also Section 5.2.)	[XSD-2]. (See also Section 5.2.)

	5. Mapping I-Regexp to Regexp Dialects	5. Mapping I-Regexp to Regexp Dialects


	The material in this section is non-normative, provided as guidance	The material in this section is not normative; it is provided as
	to developers who want to use I-Regexps in the context of other	guidance to developers who want to use I-Regexps in the context of
	regular expression dialects.	other regular expression dialects.

	5.1. Multi-Character Escapes	5.1. Multi-Character Escapes


	Common multi-character escapes (MCEs), and character classes built	I-Regexp does not support common multi-character escapes (MCEs) and
	around them, which are not supported in I-Regexp, can usually be	character classes built around them. These can usually be replaced
	replaced as shown for example in Table 1.	as shown by the examples in Table 1.


	+===========+==============+	+============+===============+
	| MCE/class | Replace with |	| MCE/class: | Replace with: |
	+===========+==============+	+============+===============+
	| \S        | [^ \t\n\r]   |	| \S         | [^ \t\n\r]    |
	+-----------+--------------+	+------------+---------------+
	| [\S ]     | [^\t\n\r]    |	| [\S ]      | [^\t\n\r]     |
	+-----------+--------------+	+------------+---------------+
	| \d        | [0-9]        |	| \d         | [0-9]         |
	+-----------+--------------+	+------------+---------------+

	Table 1: Example substitutes for multi-character escapes	Table 1: Example Substitutes for Multi-Character Escapes

	Note that the semantics of \d in XSD regular expressions is that of	Note that the semantics of \d in XSD regular expressions is that of
	\p{Nd}; however, this would include all Unicode characters that are	\p{Nd}; however, this would include all Unicode characters that are
	digits in various writing systems, which is almost certainly not what	digits in various writing systems, which is almost certainly not what
	is required in IETF publications.	is required in IETF publications.
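The difference between a Unicode-aware \d and [0-9] can be seen with a short experiment. The sketch below uses Python's re module purely as a stand-in for an engine whose \d covers Unicode decimal digits; it is not an XSD or I-Regexp engine, and the helper name matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """Return True if the whole of text matches pattern (Python re syntax)."""
    return re.fullmatch(pattern, text) is not None

ARABIC_INDIC_THREE = "\u0663"   # a decimal digit (category Nd) outside ASCII

# For str patterns, Python's \d accepts any Unicode decimal digit ...
unicode_digit_ok = matches_whole(r"\d", ARABIC_INDIC_THREE)
# ... whereas the explicit class from Table 1 accepts ASCII digits only.
ascii_class_ok = matches_whole(r"[0-9]", ARABIC_INDIC_THREE)
```

Here unicode_digit_ok is True and ascii_class_ok is False, which is why [0-9] is usually the safer substitute when only ASCII digits are intended.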

	The construct \p{IsBasicLatin} is essentially a reference to legacy	The construct \p{IsBasicLatin} is essentially a reference to legacy

	ASCII, it can be replaced by the character class [\u0000-\u007f].	ASCII; it can be replaced by the character class [\u0000-\u007f].

	5.2. XSD Regexps	5.2. XSD Regexps


	Any I-Regexp also is an XSD Regexp [XSD-2], so the mapping is an	Any I-Regexp is also an XSD regexp [XSD-2], so the mapping is an
	identity function.	identity function.


	Note that a few errata for [XSD-2] have been fixed in [XSD11-2],	Note that a few errata for [XSD-2] have been fixed in [XSD-1.1-2];
	which is therefore also included as a normative reference. XSD 1.1	therefore, it is also included in the Normative References
	is less widely implemented than XSD 1.0, and implementations of XSD	(Section 9.1). XSD 1.1 is less widely implemented than XSD 1.0, and
	1.0 are likely to include these bugfixes, so for the intents and	implementations of XSD 1.0 are likely to include these bugfixes; for
	purposes of this specification an implementation of XSD 1.0 regexps	the intents and purposes of this specification, an implementation of
	is equivalent to an implementation of XSD 1.1 regexps.	XSD 1.0 regexps is equivalent to an implementation of XSD 1.1
		regexps.

	5.3. ECMAScript Regexps	5.3. ECMAScript Regexps

	Perform the following steps on an I-Regexp to obtain an ECMAScript	Perform the following steps on an I-Regexp to obtain an ECMAScript
	regexp [ECMA-262]:	regexp [ECMA-262]:

	* For any unescaped dots (.) outside character classes (first	* For any unescaped dots (.) outside character classes (first

	alternative of charClass production): replace dot by [^\n\r].	alternative of charClass production), replace the dot with
		[^\n\r].

	* Envelope the result in ^(?: and )$.	* Envelope the result in ^(?: and )$.

	The ECMAScript regexp is to be interpreted as a Unicode pattern ("u"	The ECMAScript regexp is to be interpreted as a Unicode pattern ("u"
	flag; see Section 21.2.2 "Pattern Semantics" of [ECMA-262]).	flag; see Section 21.2.2 "Pattern Semantics" of [ECMA-262]).

	Note that where a regexp literal is required, the actual regexp needs	Note that where a regexp literal is required, the actual regexp needs
	to be enclosed in /.	to be enclosed in /.


	5.4. PCRE, RE2, Ruby Regexps	5.4. PCRE, RE2, and Ruby Regexps


	Perform the same steps as in Section 5.3 to obtain a valid regexp in	To obtain a valid regexp in Perl Compatible Regular Expressions
	PCRE [PCRE2], the Go programming language [RE2], and the Ruby	(PCRE) [PCRE2], the Go programming language's RE2 regexp library
	programming language, except that the last step is:	[RE2], and the Ruby programming language, perform the same steps as
		in Section 5.3, except that the last step is:

	* Enclose the regexp in \A(?: and )\z.	* Enclose the regexp in \A(?: and )\z.
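The mapping steps of Sections 5.3 and 5.4 amount to a simple string transformation. The following Python sketch shows one way to perform it; the function name map_iregexp and the dialect labels are hypothetical, and the input is assumed to already be a valid I-Regexp (so every "[" or "]" that is not escaped delimits a character class). The function only produces the pattern text; compiling it (for ECMAScript, with the "u" flag) is left to the target engine.

```python
def map_iregexp(iregexp, dialect="ecmascript"):
    """Map a valid I-Regexp to pattern text for another regexp dialect.

    dialect is "ecmascript" (Section 5.3) or anything else for the
    PCRE/RE2/Ruby form (Section 5.4); both labels are illustrative.
    """
    out = []
    in_class = False
    i = 0
    while i < len(iregexp):
        ch = iregexp[i]
        if ch == "\\":
            # Copy an escape (backslash plus next character) unchanged,
            # so that an escaped dot such as "\." is never rewritten.
            out.append(iregexp[i:i + 2])
            i += 2
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        if ch == "." and not in_class:
            out.append("[^\\n\\r]")   # unescaped dot outside a class
        else:
            out.append(ch)
        i += 1
    body = "".join(out)
    if dialect == "ecmascript":
        return "^(?:" + body + ")$"
    return "\\A(?:" + body + ")\\z"
```

For example, map_iregexp("a.b") yields ^(?:a[^\n\r]b)$ and map_iregexp("a.b", "pcre") yields \A(?:a[^\n\r]b)\z, while a dot inside a character class or after a backslash is left as is.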

	6. Motivation and Background	6. Motivation and Background

	While regular expressions originally were intended to describe a	While regular expressions originally were intended to describe a
	formal language to support a Boolean matching function, they have	formal language to support a Boolean matching function, they have
	been enhanced with parsing functions that support the extraction and	been enhanced with parsing functions that support the extraction and
	replacement of arbitrary portions of the matched text. With this	replacement of arbitrary portions of the matched text. With this

	accretion of features, parsing regexp libraries have become more	accretion of features, parsing-regexp libraries have become more
	susceptible to bugs and surprising performance degradations which can	susceptible to bugs and surprising performance degradations that can
	be exploited in Denial of Service attacks by an attacker who controls	be exploited in denial-of-service attacks by an attacker who controls
	the regexp submitted for processing. I-Regexp is designed to offer	the regexp submitted for processing. I-Regexp is designed to offer

	interoperability, and to be less vulnerable to such attacks, with the	interoperability and to be less vulnerable to such attacks, with the
	trade-off that its only function is to offer a boolean response as to	trade-off that its only function is to offer a Boolean response as to
	whether a character sequence is matched by a regexp.	whether a character sequence is matched by a regexp.

	6.1. Implementing I-Regexp	6.1. Implementing I-Regexp

	XSD regexps are relatively easy to implement or map to widely	XSD regexps are relatively easy to implement or map to widely

	implemented parsing regexp dialects, with these notable exceptions:	implemented parsing-regexp dialects, with these notable exceptions:

	* Character class subtraction. This is a very useful feature in	* Character class subtraction. This is a very useful feature in
	many specifications, but it is unfortunately mostly absent from	many specifications, but it is unfortunately mostly absent from

	parsing regexp dialects. Thus, it is omitted from I-Regexp.	parsing-regexp dialects. Thus, it is omitted from I-Regexp.

	* Multi-character escapes. \d, \w, \s and their uppercase	* Multi-character escapes. \d, \w, \s and their uppercase
	complement classes exhibit a large amount of variation between	complement classes exhibit a large amount of variation between
	regexp flavors. Thus, they are omitted from I-Regexp.	regexp flavors. Thus, they are omitted from I-Regexp.


	* Not all regexp implementations support accesses to Unicode tables	* Not all regexp implementations support access to Unicode tables
	that enable executing constructs such as \p{Nd}, although the	that enable executing constructs such as \p{Nd}, although the

	\p/\P feature in general is now quite widely available. While in	\p/\P feature in general is now quite widely available. While, in
	principle it is possible to translate these into character-class	principle, it is possible to translate these into character-class
	matches, this also requires access to those tables. Thus, regexp	matches, this also requires access to those tables. Thus, regexp
	libraries in severely constrained environments may not be able to	libraries in severely constrained environments may not be able to
	support I-Regexp conformance.	support I-Regexp conformance.

	7. IANA Considerations	7. IANA Considerations


	This document makes no requests of IANA.	This document has no IANA actions.


	8. Security considerations	8. Security Considerations


	While technically out of scope of this specification, Section 10	While technically out of the scope of this specification, Section 10
	(Security Considerations) of [STD63] applies to implementations.	("Security Considerations") of [RFC3629] applies to implementations.
	Particular note needs to be taken of the last paragraph of Section 3	Particular note needs to be taken of the last paragraph of Section 3

	(UTF-8 definition) of [STD63]; an I-Regexp implementation may need to	("UTF-8 definition") of [RFC3629]; an I-Regexp implementation may
	mitigate limitations of the platform implementation in this regard.	need to mitigate limitations of the platform implementation in this
		regard.

	As discussed in Section 6, more complex regexp libraries may contain	As discussed in Section 6, more complex regexp libraries may contain

	exploitable bugs leading to crashes and remote code execution. There	exploitable bugs, which can lead to crashes and remote code
	is also the problem that such libraries often have hard-to-predict	execution. There is also the problem that such libraries often have
	performance characteristics, leading to attacks that overload an	performance characteristics that are hard to predict, leading to
	implementation by matching against an expensive attacker-controlled	attacks that overload an implementation by matching against an
	regexp.	expensive attacker-controlled regexp.

	I-Regexps have been designed to allow implementation in a way that is	I-Regexps have been designed to allow implementation in a way that is
	resilient to both threats; this objective needs to be addressed	resilient to both threats; this objective needs to be addressed
	throughout the implementation effort. Non-checking implementations	throughout the implementation effort. Non-checking implementations
	(see Section 3.1) are likely to expose security limitations of any	(see Section 3.1) are likely to expose security limitations of any
	regexp engine they use, which may be less problematic if that engine	regexp engine they use, which may be less problematic if that engine

	has been built with security considerations in mind (e.g., [RE2]); a	has been built with security considerations in mind (e.g., [RE2]).
	checking implementation is still RECOMMENDED.	In any case, a checking implementation is still RECOMMENDED.

	Implementations that specifically implement the I-Regexp subset can,	Implementations that specifically implement the I-Regexp subset can,
	with care, be designed to generally run in linear time and space in	with care, be designed to generally run in linear time and space in

	the input, and to detect when that would not be the case (see below).	the input and to detect when that would not be the case (see below).

	Existing regexp engines should be able to easily handle most	Existing regexp engines should be able to easily handle most
	I-Regexps (after the adjustments discussed in Section 5), but may	I-Regexps (after the adjustments discussed in Section 5), but may
	consume excessive resources for some types of I-Regexps or outright	consume excessive resources for some types of I-Regexps or outright
	reject them because they cannot guarantee efficient execution. (Note	reject them because they cannot guarantee efficient execution. (Note
	that different versions of the same regexp library may be more or	that different versions of the same regexp library may be more or
	less vulnerable to excessive resource consumption for these cases.)	less vulnerable to excessive resource consumption for these cases.)


	Specifically, range quantifiers (as in a{2,4}) provide particular	Specifically, range quantifiers (as in a{2,4}) provide particular
	challenges for both existing and I-Regexp focused implementations.	challenges for both existing and I-Regexp focused implementations.

	These may therefore limit range quantifiers in composability	Implementations may therefore limit range quantifiers in
	(disallowing nested range quantifiers such as (a{2,4}){2,4}) or range	composability (disallowing nested range quantifiers such as
	(disallowing very large ranges such as a{20,200000}), or detect and	(a{2,4}){2,4}) or range (disallowing very large ranges such as
	reject any excessive resource consumption caused by them.	a{20,200000}), or detect and reject any excessive resource
		consumption caused by range quantifiers.
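One of the mitigations named above, disallowing very large ranges, could be approximated by a pre-check along the following lines. This is only a sketch: the function name and the default limit of 1000 are assumptions chosen for illustration, the input is assumed to be a valid I-Regexp, and nested range quantifiers are not detected.

```python
import re

RANGE_Q = re.compile(r"\{([0-9]+)(?:,([0-9]*))?\}")

def range_quantifiers_within(iregexp, max_count=1000):
    """Return True if no range quantifier bound exceeds max_count.

    max_count is an arbitrary illustrative limit, not a value taken
    from the specification.
    """
    in_class = False
    i = 0
    while i < len(iregexp):
        ch = iregexp[i]
        if ch == "\\":
            if iregexp[i + 1:i + 2] in ("p", "P"):
                # Skip \p{..} / \P{..}: these braces are not quantifiers.
                end = iregexp.find("}", i)
                i = end + 1 if end != -1 else len(iregexp)
            else:
                i += 2            # skip a single-character escape
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        elif ch == "{" and not in_class:
            m = RANGE_Q.match(iregexp, i)
            if m:
                bounds = [int(g) for g in m.groups() if g]
                if max(bounds) > max_count:
                    return False
                i = m.end()
                continue
        i += 1
    return True
```

With the default limit, a{2,4} passes and a{20,200000} is flagged. An implementation might combine such a check with whatever resource limits its underlying regexp library offers.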

	I-Regexp implementations that are used to evaluate regexps from	I-Regexp implementations that are used to evaluate regexps from

	untrusted sources need to be robust to these cases. Implementers	untrusted sources need to be robust in these cases. Implementers
	using existing regexp libraries are encouraged to check their	using existing regexp libraries are encouraged:
	documentation to see if mitigations are configurable, such as limits
	in resource consumption, and to document their own degree of	* to check their documentation to see if mitigations are
	robustness resulting from employing such mitigations.	configurable, such as limits in resource consumption, and

		* to document their own degree of robustness resulting from
		employing such mitigations.

	9. References	9. References

	9.1. Normative References	9.1. Normative References

	[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate	[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
	Requirement Levels", BCP 14, RFC 2119,	Requirement Levels", BCP 14, RFC 2119,
	DOI 10.17487/RFC2119, March 1997,	DOI 10.17487/RFC2119, March 1997,

	<https://www.rfc-editor.org/rfc/rfc2119>.	<https://www.rfc-editor.org/info/rfc2119>.

	[RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax	[RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
	Specifications: ABNF", STD 68, RFC 5234,	Specifications: ABNF", STD 68, RFC 5234,
	DOI 10.17487/RFC5234, January 2008,	DOI 10.17487/RFC5234, January 2008,

	<https://www.rfc-editor.org/rfc/rfc5234>.	<https://www.rfc-editor.org/info/rfc5234>.

	[RFC7405] Kyzivat, P., "Case-Sensitive String Support in ABNF",	[RFC7405] Kyzivat, P., "Case-Sensitive String Support in ABNF",
	RFC 7405, DOI 10.17487/RFC7405, December 2014,	RFC 7405, DOI 10.17487/RFC7405, December 2014,

	<https://www.rfc-editor.org/rfc/rfc7405>.	<https://www.rfc-editor.org/info/rfc7405>.

	[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC	[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
	2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,	2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,

	May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.	May 2017, <https://www.rfc-editor.org/info/rfc8174>.


	[XSD-2] Malhotra, A., Ed. and P. V. Biron, Ed., "XML Schema Part	[XSD-1.1-2]
	2: Datatypes Second Edition", W3C REC REC-xmlschema-	Peterson, D., Ed., Gao, S., Ed., Malhotra, A., Ed.,
		Sperberg-McQueen, C. M., Ed., Thompson, H., Ed., and P.
		Biron, Ed., "W3C XML Schema Definition Language (XSD) 1.1
		Part 2: Datatypes", W3C REC REC-xmlschema11-2-20120405,
		W3C REC-xmlschema11-2-20120405, 5 April 2012,
		<https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/>.

		[XSD-2] Biron, P., Ed. and A. Malhotra, Ed., "XML Schema Part 2:
		Datatypes Second Edition", W3C REC REC-xmlschema-
	2-20041028, W3C REC-xmlschema-2-20041028, 28 October 2004,	2-20041028, W3C REC-xmlschema-2-20041028, 28 October 2004,
	<https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/>.	<https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/>.


	[XSD11-2] Malhotra, A., Ed., Peterson, D., Ed., Thompson, H., Ed.,
	Sperberg-McQueen, M., Ed., Biron, P. V., Ed., and S. Gao,
	Ed., "W3C XML Schema Definition Language (XSD) 1.1 Part 2:
	Datatypes", W3C REC REC-xmlschema11-2-20120405, W3C REC-
	xmlschema11-2-20120405, 5 April 2012,
	<https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/>.

	9.2. Informative References	9.2. Informative References

	[ECMA-262] Ecma International, "ECMAScript 2020 Language	[ECMA-262] Ecma International, "ECMAScript 2020 Language

	Specification", ECMA Standard ECMA-262, 11th Edition, June	Specification", Standard ECMA-262, 11th Edition, June
	2020, <https://www.ecma-international.org/wp-	2020, <https://www.ecma-international.org/wp-
	content/uploads/ECMA-262.pdf>.	content/uploads/ECMA-262.pdf>.


	[I-D.ietf-jsonpath-base]	[JSONPATH-BASE]
	Gössner, S., Normington, G., and C. Bormann, "JSONPath:	Gössner, S., Ed., Normington, G., Ed., and C. Bormann,
	Query expressions for JSON", Work in Progress, Internet-	Ed., "JSONPath: Query expressions for JSON", Work in
	Draft, draft-ietf-jsonpath-base-14, 10 June 2023,	Progress, Internet-Draft, draft-ietf-jsonpath-base-20, 25
	<https://datatracker.ietf.org/doc/html/draft-ietf-	August 2023, <https://datatracker.ietf.org/doc/html/draft-
	jsonpath-base-14>.	ietf-jsonpath-base-20>.

	[PCRE2] "Perl-compatible Regular Expressions (revised API:	[PCRE2] "Perl-compatible Regular Expressions (revised API:

	PCRE2)", n.d., <http://pcre.org/current/doc/html/>.	PCRE2)", <http://pcre.org/current/doc/html/>.

	[RE2] "RE2 is a fast, safe, thread-friendly alternative to	[RE2] "RE2 is a fast, safe, thread-friendly alternative to
	backtracking regular expression engines like those used in	backtracking regular expression engines like those used in

	PCRE, Perl, and Python. It is a C++ library.", n.d.,	PCRE, Perl, and Python. It is a C++ library.", commit
	<https://github.com/google/re2>.	73031bb, <https://github.com/google/re2>.

		[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
		10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
		2003, <https://www.rfc-editor.org/info/rfc3629>.

	[RFC7493] Bray, T., Ed., "The I-JSON Message Format", RFC 7493,	[RFC7493] Bray, T., Ed., "The I-JSON Message Format", RFC 7493,
	DOI 10.17487/RFC7493, March 2015,	DOI 10.17487/RFC7493, March 2015,

	<https://www.rfc-editor.org/rfc/rfc7493>.	<https://www.rfc-editor.org/info/rfc7493>.

	[STD63] Yergeau, F., "UTF-8, a transformation format of ISO
	10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
	2003, <https://www.rfc-editor.org/rfc/rfc3629>.

	[UNICODE-GLOSSARY]	[UNICODE-GLOSSARY]
	Unicode, Inc., "Glossary of Unicode Terms",	Unicode, Inc., "Glossary of Unicode Terms",
	<https://unicode.org/glossary/>.	<https://unicode.org/glossary/>.


	Appendix A. Regexps and Similar Constructs in Recent Published RFCs

	This section is to be removed before publishing as an RFC.

	This appendix contains a number of regular expressions that have been
	extracted from some recently published RFCs based on some ad-hoc
	matching. Multi-line constructions were not included. With the
	exception of some (often surprisingly dubious) usage of multi-
	character escapes and a reference to the IsBasicLatin Unicode block,
	all regular expressions validate against the ABNF in Figure 1.

	rfc6021.txt 459 (([0-1](\.[1-3]?[0-9]))\|(2\.(0\|([1-9]\d*))))
	rfc6021.txt 513 \d(\.\d){1,127}
	rfc6021.txt 529 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
	rfc6021.txt 631 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
	rfc6021.txt 647 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
	rfc6021.txt 933 ((:\|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
	rfc6021.txt 938 (([^:]+:){6}(([^:]+:[^:]+)\|(.\..)))\|
	rfc6021.txt 1026 ((:\|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
	rfc6021.txt 1031 (([^:]+:){6}(([^:]+:[^:]+)\|(.\..)))\|
	rfc6020.txt 6647 [0-9a-fA-F]*
	rfc6095.txt 2544 \S(.*\S)?
	rfc6110.txt 1583 [aeiouy]*
	rfc6110.txt 3222 [A-Z][a-z]*
	rfc6536.txt 1583 \*
	rfc6536.txt 1632 [^\].
	rfc6643.txt 524 \p{IsBasicLatin}{0,255}
	rfc6728.txt 3480 \S+
	rfc6728.txt 3500 \S(.*\S)?
	rfc6991.txt 477 (([0-1](\.[1-3]?[0-9]))\|(2\.(0\|([1-9]\d*))))
	rfc6991.txt 525 \d(\.\d){1,127}
	rfc6991.txt 541 [a-zA-Z_][a-zA-Z0-9\-_.]*
	rfc6991.txt 542 .\|..\|[^xX].\|.[^mM].\|..[^lL].*
	rfc6991.txt 571 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
	rfc6991.txt 665 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
	rfc6991.txt 693 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
	rfc6991.txt 725 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
	rfc6991.txt 743 [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-
	rfc6991.txt 1041 ((:\|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
	rfc6991.txt 1046 (([^:]+:){6}(([^:]+:[^:]+)\|(.\..)))\|
	rfc6991.txt 1099 [0-9\.]*
	rfc6991.txt 1109 [0-9a-fA-F:\.]*
	rfc6991.txt 1164 ((:\|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
	rfc6991.txt 1169 (([^:]+:){6}(([^:]+:[^:]+)\|(.\..)))\|
	rfc7407.txt 933 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){0,254}
	rfc7407.txt 1494 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){4,31}
	rfc7758.txt 703 \d{2}:\d{2}:\d{2}(\.\d+)?
	rfc7758.txt 1358 \d{2}:\d{2}:\d{2}(\.\d+)?
	rfc7895.txt 349 \d{4}-\d{2}-\d{2}
	rfc7950.txt 8323 [0-9a-fA-F]*
	rfc7950.txt 8355 [a-zA-Z_][a-zA-Z0-9\-_.]*
	rfc7950.txt 8356 [xX][mM][lL].*
	rfc8040.txt 4713 \d{4}-\d{2}-\d{2}
	rfc8049.txt 6704 [A-Z]{2}
	rfc8194.txt 629 \*
	rfc8194.txt 637 [0-9]{8}\.[0-9]{6}
	rfc8194.txt 905 Z\|[\+\-]\d{2}:\d{2}
	rfc8194.txt 963 (2((2[4-9])\|(3[0-9]))\.).*
	rfc8194.txt 974 (([fF]{2}[0-9a-fA-F]{2}):).*
	rfc8299.txt 7986 [A-Z]{2}
	rfc8341.txt 1878 \*
	rfc8341.txt 1927 [^\].
	rfc8407.txt 1723 [0-9\.]*
	rfc8407.txt 1749 [a-zA-Z_][a-zA-Z0-9\-_.]*
	rfc8407.txt 1750 .\|..\|[^xX].\|.[^mM].\|..[^lL].*
	rfc8525.txt 550 \d{4}-\d{2}-\d{2}
	rfc8776.txt 838 /?([a-zA-Z0-9\-_.]+)(/[a-zA-Z0-9\-_.]+)*
	rfc8776.txt 874 ([a-zA-Z0-9\-_.]+:)*
	rfc8819.txt 311 [\S ]+
	rfc8944.txt 596 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){7}

	Figure 2: Example regular expressions extracted from RFCs

	Acknowledgements	Acknowledgements


	This specification has been motivated by the discussion in the IETF	Discussion in the IETF JSONPATH WG about whether to include a regexp
	JSONPATH WG about whether to include a regexp mechanism into the	mechanism into the JSONPath query expression specification and
	JSONPath query expression specification, as well as by previous	previous discussions about the YANG pattern and Concise Data
	discussions about the YANG pattern and CDDL .regexp features.	Definition Language (CDDL) .regexp features motivated this
		specification.


	The basic approach for this specification was inspired by The I-JSON	The basic approach for this specification was inspired by "The I-JSON
	Message Format [RFC7493].	Message Format" [RFC7493].

	Authors' Addresses	Authors' Addresses

	Carsten Bormann	Carsten Bormann
	Universität Bremen TZI	Universität Bremen TZI
	Postfach 330440	Postfach 330440
	D-28359 Bremen	D-28359 Bremen
	Germany	Germany
	Phone: +49-421-218-63921	Phone: +49-421-218-63921
	Email: cabo@tzi.org	Email: cabo@tzi.org

End of changes. 63 change blocks.
	275 lines changed or deleted	205 lines changed or added