Network Working Group Y.-K. Wang Internet Draft Qualcomm Intended status: Standards track Y. Sanchez Expires: March 2014 T. Schierl Fraunhofer HHI S. Wenger Vidyo M. M. Hannuksela Nokia September 6, 2013 RTP Payload Format for High Efficiency Video Coding draft-ietf-payload-rtp-h265-01.txt Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on December 11, 2013. Wang, et al Expires March 6, 2014 [Page 1] Internet-Draft RTP Payload Format for HEVC September 6, 2013 Copyright and License Notice Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Wang, et al Expires March 6, 2014 [Page 2] Internet-Draft RTP Payload Format for HEVC September 6, 2013 Abstract This memo describes an RTP payload format for the video coding standard ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, both also known as High Efficiency Video Coding (HEVC) [HEVC], developed by the Joint Collaborative Team on Video Coding (JCT-VC). The RTP payload format allows for packetization of one or more Network Abstraction Layer (NAL) units in each RTP packet payload, as well as fragmentation of a NAL unit into multiple RTP packets. Furthermore, it supports transmission of an HEVC stream over a single as well as multiple RTP flows. The payload format has wide applicability in videoconferencing, Internet video streaming, and high bit-rate entertainment-quality video, among others. Table of Contents Status of this Memo...............................................1 Abstract..........................................................3 Table of Contents.................................................3 1 . Introduction..................................................5 1.1 . Overview of the HEVC Codec...............................5 1.1.1 Coding-Tool Features..................................5 1.1.2 Systems and Transport Interfaces......................7 1.1.3 Parallel Processing Support..........................13 1.1.4 NAL Unit Header......................................15 1.2 . Overview of the Payload Format..........................17 2 . Conventions..................................................17 3 . Definitions and Abbreviations................................17 3.1 Definitions...............................................17 3.1.1 Definitions from the HEVC Specification..............18 3.1.2 Definitions Specific to This Memo....................19 3.2 Abbreviations.............................................20 4 . RTP Payload Format...........................................22 4.1 RTP Header Usage..........................................22 4.2 Payload Structures........................................23 4.3 Transmission Modes........................................24 4.4 Decoding Order Number.....................................25 4.5 Single NAL Unit Packets...................................27 4.6 Aggregation Packets (APs).................................27 Wang, et al Expires March 6, 2014 [Page 3] Internet-Draft RTP Payload Format for HEVC September 6, 2013 4.7 Fragmentation Units (FUs).................................32 5 . Packetization Rules..........................................36 6 . De-packetization Process.....................................37 7 . Payload Format Parameters....................................38 7.1 Media Type Registration...................................39 7.2 SDP Parameters............................................52 7.2.1 Mapping of Payload Type Parameters to SDP............53 7.2.2 Usage with SDP Offer/Answer Model....................54 7.2.3 Usage in Declarative Session Descriptions............58 7.2.4 Dependency Signaling in Multi-Session Transmission...60 8 . Use with Feedback Messages...................................60 8.1 Definition of the SPLI Feedback Message...................62 8.2 Use of HEVC with the RPSI Feedback Message................63 8.3 Use of HEVC with the SPLI Feedback Message................63 9 . Security Considerations......................................63 10 . Congestion Control..........................................65 11 . IANA Consideration..........................................66 12 . Acknowledgements............................................66 13 . References..................................................66 13.1 Normative References.....................................66 13.2 Informative References...................................67 14 . Authors' Addresses..........................................68 Wang, et al Expires March 6, 2014 [Page 4] Internet-Draft RTP Payload Format for HEVC September 6, 2013 1. Introduction 1.1. Overview of the HEVC Codec High Efficiency Video Coding [HEVC], formally known as ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2 was ratified by ITU-T in April 2013 and reportedly provides significant coding efficiency gains over H.264 [H.264]. As both H.264 [H.264] and its RTP payload format [RFC6184] are widely deployed and generally known in the relevant implementer communities, frequently only the differences between those two specifications are highlighted in non-normative, explanatory parts of this memo. Basic familiarity with both specifications is assumed for those parts. However, the normative parts of this memo do not require study of H.264 or its RTP payload format. H.264 and HEVC share a similar hybrid video codec design. Conceptually, both technologies include a video coding layer (VCL), which is often used to refer to the coding-tool features, and a network abstraction layer (NAL), which is often used to refer to the systems and transport interface aspects of the codecs. 1.1.1 Coding-Tool Features Similarly to earlier hybrid-video-coding-based standards, including H.264, the following basic video coding design is employed by HEVC. A prediction signal is first formed either by intra or motion compensated prediction, and the residual (the difference between the original and the prediction) is then coded. The gains in coding efficiency are achieved by redesigning and improving almost all parts of the codec over earlier designs. In addition, HEVC includes several tools to make the implementation on parallel architectures easier. Below is a summary of HEVC coding-tool features. Quad-tree block and transform structure One of the major tools that contribute significantly to the coding efficiency of HEVC is the usage of flexible coding blocks and transforms, which are defined in a hierarchical quad-tree manner. Unlike H.264, where the basic coding block is a macroblock of fixed Wang, et al Expires March 6, 2014 [Page 5] Internet-Draft RTP Payload Format for HEVC September 6, 2013 size 16x16, HEVC defines a Coding Tree Unit (CTU) of a maximum size of 64x64. Each CTU can be divided into smaller units in a hierarchical quad-tree manner and can represent smaller blocks down to size 4x4. Similarly, the transforms used in HEVC can have different sizes, starting from 4x4 and going up to 32x32. Utilizing large blocks and transforms contribute to the major gain of HEVC, especially at high resolutions. Entropy coding HEVC uses a single entropy coding engine, which is based on Context Adaptive Binary Arithmetic Coding (CABAC), whereas H.264 uses two distinct entropy coding engines. CABAC in HEVC shares many similarities with CABAC of H.264, but contains several improvements. Those include improvements in coding efficiency and lowered implementation complexity, especially for parallel architectures. In-loop filtering H.264 includes an in-loop adaptive deblocking filter, where the blocking artifacts around the transform edges in the reconstructed picture are smoothed to improve the picture quality and compression efficiency. In HEVC, a similar deblocking filter is employed but with somewhat lower complexity. In addition, pictures undergo a subsequent filtering operation called Sample Adaptive Offset (SAO), which is a new design element in HEVC. SAO basically adds a pixel- level offset in an adaptive manner and usually acts as a de-ringing filter. It is observed that SAO improves the picture quality, especially around sharp edges contributing substantially to visual quality improvements of HEVC. Motion prediction and coding There have been a number of improvements in this area that are summarized as follows. The first category is motion merge and advanced motion vector prediction (AMVP) modes. The motion information of a prediction block can be inferred from the spatially or temporally neighboring blocks. This is similar to the DIRECT mode in H.264 but includes new aspects to incorporate the flexible quad-tree structure and methods to improve the parallel implementations. In addition, the motion vector predictor can be Wang, et al Expires March 6, 2014 [Page 6] Internet-Draft RTP Payload Format for HEVC September 6, 2013 signaled for improved efficiency. The second category is high- precision interpolation. The interpolation filter length is increased to 8-tap from 6-tap, which improves the coding efficiency but also comes with increased complexity. In addition, the interpolation filter is defined with higher precision without any intermediate rounding operations to further improve the coding efficiency. Intra prediction and intra coding Compared to 8 intra prediction modes in H.264, HEVC supports angular intra prediction with 33 directions. This increased flexibility improves both objective coding efficiency and visual quality as the edges can be better predicted and ringing artifacts around the edges can be reduced. In addition, the reference samples are adaptively smoothed based on the prediction direction. To avoid contouring artifacts a new interpolative prediction generation is included to improve the visual quality. Furthermore, discrete sine transform (DST) is utilized instead of traditional discrete cosine transform (DCT) for 4x4 intra transform blocks. Other coding-tool features HEVC includes some tools for lossless coding and efficient screen content coding, such as skipping the transform for certain blocks. These tools are particularly useful for example when streaming the user-interface of a mobile device to a large display. 1.1.2 Systems and Transport Interfaces HEVC inherited the basic systems and transport interfaces designs, such as the NAL-unit-based syntax structure, the hierarchical syntax and data unit structure from sequence-level parameter sets, multi- picture-level or picture-level parameter sets, slice-level header parameters, lower-level parameters, the supplemental enhancement information (SEI) message mechanism, the hypothetical reference decoder (HRD) based video buffering model, and so on. In the following, a list of differences in these aspects compared to H.264 is summarized. Wang, et al Expires March 6, 2014 [Page 7] Internet-Draft RTP Payload Format for HEVC September 6, 2013 Video parameter set A new type of parameter set, called video parameter set (VPS), was introduced. For the first (2013) version of [HEVC], the video parameter set NAL unit is required to be available prior to its activation, while the information contained in the video parameter set is not necessary for operation of the decoding process. For future HEVC extensions, such as the 3D or scalable extensions, the video parameter set is expected to include information necessary for operation of the decoding process, e.g. decoding dependency or information for reference picture set construction of enhancement layers. The VPS provides a "big picture" of a bitstream, including what types of operation points are provided, the profile, tier, and level of the operation points, and some other high-level properties of the bitstream that can be used as the basis for session negotiation and content selection, etc. (see section 7.1). Profile, tier and level The profile, tier and level syntax structure that can be included in both VPS and sequence parameter set (SPS) includes 12 bytes data to describe the entire bitstream (including all temporally scalable layers, which are referred to as sub-layers in the HEVC specification), and can optionally include more profile, tier and level information pertaining to individual temporally scalable layers. The profile indicator indicates the "best viewed as" profile when the bitstream conforms to multiple profiles, similar to the major brand concept in the ISO base media file format (ISOBMFF) [ISOBMFF] and file formats derived based on ISOBMFF, such as the 3GPP file format [3GP]. The profile, tier and level syntax structure also includes the indications of whether the bitstream is free of frame-packed content, whether the bitstream is free of interlaced source content and free of field pictures, i.e., contains only frame pictures of progressive source, such that clients/players with no support of post-processing functionalities for handling of frame-packed or interlaced source content or field pictures can reject those bitstreams. Wang, et al Expires March 6, 2014 [Page 8] Internet-Draft RTP Payload Format for HEVC September 6, 2013 Bitstream and elementary stream HEVC includes a definition of an elementary stream, which is new compared to H.264. An elementary stream consists of a sequence of one or more bitstreams. An elementary stream that consists of two or more bitstreams has typically been formed by splicing together two or more bitstreams (or parts thereof). When an elementary stream contains more than one bitstream, the last NAL unit of the last access unit of a bitstream (except the last bitstream in the elementary stream) must contain an end of bitstream NAL unit and the first access unit of the subsequent bitstream must be an intra random access point (IRAP) access unit. This IRAP access unit may be a clean random access (CRA), broken link access (BLA), or instantaneous decoding refresh (IDR) access unit. Random access support HEVC includes signaling in NAL unit header, through NAL unit types, of IRAP pictures beyond IDR pictures. Three types of IRAP pictures, namely IDR, CRA and BLA pictures are supported, wherein IDR pictures are conventionally referred to as closed group-of-pictures (closed- GOP) random access points, and CRA and BLA pictures are those conventionally referred to as open-GOP random access points. BLA pictures usually originate from splicing of two bitstreams or part thereof at a CRA picture, e.g. during stream switching. To enable better systems usage of IRAP pictures, altogether six different NAL units are defined to signal the properties of the IRAP pictures, which can be used to better match the stream access point (SAP) types as defined in the ISOBMFF [ISOBMFF], which are utilized for random access support in both 3GP-DASH [3GPDASH] and MPEG DASH [MPEGDASH]. Pictures following an IRAP picture in decoding order and preceding the IRAP picture in output order are referred to as leading pictures associated with the IRAP picture. There are two types of leading pictures, namely random access decodable leading (RADL) pictures and random access skipped leading (RASL) pictures. RADL pictures are decodable when the decoding started at the associated IRAP picture, and RASL pictures are not decodable when the decoding started at the associated IRAP picture and are usually discarded. HEVC provides mechanisms to enable the specification of conformance of bitstreams with RASL pictures being discarded, thus Wang, et al Expires March 6, 2014 [Page 9] Internet-Draft RTP Payload Format for HEVC September 6, 2013 to provide a standard-compliant way to enable systems components to discard RASL pictures when needed. Temporal scalability support HEVC includes an improved support of temporal scalability, by inclusion of the signaling of TemporalId in the NAL unit header, the restriction that pictures of a particular temporal sub-layer cannot be used for inter prediction reference by pictures of a higher temporal sub-layer, the sub-bitstream extraction process, and the requirement that each sub-bitstream extraction output be a conforming bitstream. Media-aware network elements (MANEs) can utilize the TemporalId in the NAL unit header for stream adaptation purposes based on temporal scalability. Temporal sub-layer switching support HEVC specifies, through NAL unit types present in the NAL unit header, the signaling of temporal sub-layer access (TSA) and stepwise temporal sub-layer access (STSA). A TSA picture and pictures following the TSA picture in decoding order do not use pictures prior to the TSA picture in decoding order with TemporalId greater than or equal to that of the TSA picture for inter prediction reference. A TSA picture enables up-switching, at the TSA picture, to the sub-layer containing the TSA picture or any higher sub-layer, from the immediately lower sub-layer. An STSA picture does not use pictures with the same TemporalId as the STSA picture for inter prediction reference. Pictures following an STSA picture in decoding order with the same TemporalId as the STSA picture do not use pictures prior to the STSA picture in decoding order with the same TemporalId as the STSA picture for inter prediction reference. An STSA picture enables up-switching, at the STSA picture, to the sub-layer containing the STSA picture, from the immediately lower sub-layer. Sub-layer reference or non-reference pictures The concept and signaling of reference/non-reference pictures in HEVC are different from H.264. In H.264, if a picture may be used by any other picture for inter prediction reference, it is a reference picture; otherwise it is a non-reference picture, and this Wang, et al Expires March 6, 2014 [Page 10] Internet-Draft RTP Payload Format for HEVC September 6, 2013 is signaled by two bits in the NAL unit header. In HEVC, a picture is called a reference picture only when it is marked as "used for reference". In addition, the concept of sub-layer reference picture was introduced. If a picture may be used by another other picture with the same TemporalId for inter prediction reference, it is a sub-layer reference picture; otherwise it is a sub-layer non- reference picture. Whether a picture is a sub-layer reference picture or sub-layer non-reference picture is signaled through NAL unit type values. Extensibility Besides the TemporalId in the NAL unit header, HEVC also includes the signaling of a six-bit layer ID in the NAL unit header, which must be equal to 0 for a single-layer bitstream. Extension mechanisms have been included in VPS, SPS, PPS, SEI NAL unit, slice headers, and so on. All these extension mechanisms enable future extensions in a backward compatible manner, such that bitstreams encoded according to potential future HEVC extensions can be fed to then-legacy decoders (e.g. HEVC version 1 decoders) and the then- legacy decoders can decode and output the base layer bitstream. Bitstream extraction HEVC includes a bitstream extraction process as an integral part of the overall decoding process, as well as specification of the use of the bitstream extraction process in description of bitstream conformance tests as part of the hypothetical reference decoder (HRD) specification. Reference picture management The reference picture management of HEVC, including reference picture marking and removal from the decoded picture buffer (DPB) as well as reference picture list construction (RPLC), differs from that of H.264. Instead of the sliding window plus adaptive memory management control operation (MMCO) based reference picture marking mechanism in H.264, HEVC specifies a reference picture set (RPS) based reference picture management and marking mechanism, and the RPLC is consequently based on the RPS mechanism. A reference picture set consists of a set of reference pictures associated with Wang, et al Expires March 6, 2014 [Page 11] Internet-Draft RTP Payload Format for HEVC September 6, 2013 a picture, consisting of all reference pictures that are prior to the associated picture in decoding order, that may be used for inter prediction of the associated picture or any picture following the associated picture in decoding order. The reference picture set consists of five lists of reference pictures; RefPicSetStCurrBefore, RefPicSetStCurrAfter, RefPicSetStFoll, RefPicSetLtCurr and RefPicSetLtFoll. RefPicSetStCurrBefore, RefPicSetStCurrAfter and RefPicSetLtCurr contain all reference pictures that may be used in inter prediction of the current picture and that may be used in inter prediction of one or more of the pictures following the current picture in decoding order. RefPicSetStFoll and RefPicSetLtFoll consist of all reference pictures that are not used in inter prediction of the current picture but may be used in inter prediction of one or more of the pictures following the current picture in decoding order. RPS provides an "intra-coded" signaling of the DPB status, instead of an "inter-coded" signaling, mainly for improved error resilience. The RPLC process in HEVC is based on the RPS, by signaling an index to an RPS subset for each reference index. The RPLC process has been simplified compared to that in H.264, by removal of the reference picture list modification (also referred to as reference picture list reordering) process. Ultra low delay support HEVC specifies a sub-picture-level HRD operation, for support of the so-called ultra-low delay. The mechanism specifies a standard- compliant way to enable delay reduction below one picture interval. Sub-picture-level coded picture buffer (CPB) and DPB parameters may be signaled, and utilization of these information for the derivation of CPB timing (wherein the CPB removal time corresponds to decoding time) and DPB output timing (display time) is specified. Decoders are allowed to operate the HRD at the conventional access-unit- level, even when the sub-picture-level HRD parameters are present. New SEI messages HEVC inherits many H.264 SEI messages with changes in syntax and/or semantics making them applicable to HEVC. Additionally, there are a few new SEI messages reviewed briefly in the following paragraphs. Wang, et al Expires March 6, 2014 [Page 12] Internet-Draft RTP Payload Format for HEVC September 6, 2013 The structure of pictures SEI message provides information on the NAL unit types, picture order count values, and prediction dependencies of a sequence of pictures. The SEI message can be used for example for concluding what impact a lost picture has on other pictures. The decoded picture hash SEI message provides a checksum derived from the sample values of a decoded picture. It can be used for detecting whether a picture was correctly received and decoded. The active parameter sets SEI message includes the IDs of the active video parameter set and the active sequence parameter set and can be used to activate VPSs and SPSs. In addition, the SEI message includes the following indications: 1) An indication of whether "full random accessibility" is supported (when supported, all parameter sets needed for decoding of the remaining of the bitstream when random accessing from the beginning of the current coded video sequence by completely discarding all access units earlier in decoding order are present in the remaining bitstream and all coded pictures in the remaining bitstream can be correctly decoded); 2) An indication of whether there is no parameter set within the current coded video sequence that updates another parameter set of the same type preceding in decoding order. An update of a parameter set refers to the use of the same parameter set ID but with some other parameters changed. If this property is true for all coded video sequences in the bitstream, then all parameter sets can be sent out- of-band before session start. The decoding unit information SEI message provides coded picture buffer removal delay information for a decoding unit. The message can be used in very-low-delay buffering operations. The region refresh information SEI message can be used together with the recovery point SEI message (present in both H.264 and HEVC) for improved support of gradual decoding refresh (GDR). This supports random access from inter-coded pictures, wherein complete pictures can be correctly decoded or recovered after an indicated number of pictures in output/display order. Wang, et al Expires March 6, 2014 [Page 13] Internet-Draft RTP Payload Format for HEVC September 6, 2013 1.1.3 Parallel Processing Support The reportedly significantly higher encoding computational demand of HEVC over H.264, in conjunction with the ever increasing video resolution (both spatially and temporally) required by the market, led to the adoption of VCL coding tools specifically targeted to allow for parallelization on the sub-picture level. That is, parallelization occurs, at the minimum, at the granularity of an integer number of CTUs. The targets for this type of high-level parallelization are multicore CPUs and DSPs as well as multiprocessor systems. In a system design, to be useful, these tools require signaling support, which is provided in Section 7 of this memo. This section provides a brief overview of the tools available in [HEVC]. Many of the tools incorporated in HEVC were designed keeping in mind the potential parallel implementations in multi-core/multi-processor architectures. Specifically, for parallelization, four picture partition strategies are available. Slices are segments of the bitstream that can be reconstructed independently from other slices within the same picture (though there may still be interdependencies through loop filtering operations). Slices are the only tool that can be used for parallelization that is also available, in virtually identical form, in H.264. Slices based parallelization does not require much inter- processor or inter-core communication (except for inter-processor or inter-core data sharing for motion compensation when decoding a predictively coded picture, which is typically much heavier than inter-processor or inter-core data sharing due to in-picture prediction), as slices are designed to be independently decodable. However, for the same reason, slices can require some coding overhead. Further, slices (in contrast to some of the other tools mentioned below) also serve as the key mechanism for bitstream partitioning to match Maximum Transfer Unit (MTU) size requirements, due to the in-picture independence of slices and the fact that each regular slice is encapsulated in its own NAL unit. In many cases, the goal of parallelization and the goal of MTU size matching can place contradicting demands to the slice layout in a picture. The realization of this situation led to the development of the more advanced tools mentioned below. This payload format does not Wang, et al Expires March 6, 2014 [Page 14] Internet-Draft RTP Payload Format for HEVC September 6, 2013 contain any specific mechanisms aiding parallelization through slices. Dependent slice segments allow for fragmentation of a coded slice into fragments at CTU boundaries without breaking any in-picture prediction mechanism. They are complementary to the fragmentation mechanism described in this memo in that they need the cooperation of the encoder. As a dependent slice segment necessarily contains an integer number of CTUs, a decoder using multiple cores operating on CTUs can process a dependent slice segment without communicating parts of the slice segment's bitstream to other cores. Fragmentation, as specified in this memo, in contrast, does not guarantee that a fragment contains an integer number of CTUs. In wavefront parallel processing (WPP), the picture is partitioned into rows of CTUs. Entropy decoding and prediction are allowed to use data from CTUs in other partitions. Parallel processing is possible through parallel decoding of CTU rows, where the start of the decoding of a row is delayed by two CTUs, so to ensure that data related to a CTU above and to the right of the subject CTU is available before the subject CTU is being decoded. Using this staggered start (which appears like a wavefront when represented graphically), parallelization is possible with up to as many processors/cores as the picture contains CTU rows. Because in-picture prediction between neighboring CTU rows within a picture is allowed, the required inter-processor/inter-core communication to enable in-picture prediction can be substantial. The WPP partitioning does not result in the creation of more NAL units compared to when it is not applied, thus WPP cannot be used for MTU size matching, though slices can be used in combination for that purpose. Tiles define horizontal and vertical boundaries that partition a picture into tile columns and rows. The scan order of CTUs is changed to be local within a tile (in the order of a CTU raster scan of a tile), before decoding the top-left CTU of the next tile in the order of tile raster scan of a picture. Similar to slices, tiles break in-picture prediction dependencies (including entropy decoding dependencies). However, they do not need to be included into individual NAL units (same as WPP in this regard), hence tiles Wang, et al Expires March 6, 2014 [Page 15] Internet-Draft RTP Payload Format for HEVC September 6, 2013 cannot be used for MTU size matching, though slices can be used in combination for that purpose. Each tile can be processed by one processor/core, and the inter-processor/inter-core communication required for in-picture prediction between processing units decoding neighboring tiles is limited to conveying the shared slice header in cases a slice is spanning more than one tile, and loop filtering related sharing of reconstructed samples and metadata. Insofar, tiles are less demanding in terms of inter-processor communication bandwidth compared to WPP due to the in-picture independence between two neighboring partitions. 1.1.4 NAL Unit Header HEVC maintains the NAL unit concept of H.264 with modifications. HEVC uses a two-byte NAL unit header, as shown in Figure 1. The payload of a NAL unit refers to the NAL unit excluding the NAL unit header. +---------------+---------------+ |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |F| Type | LayerId | TID | +-------------+-----------------+ Figure 1 The structure of HEVC NAL unit header The semantics of the fields in the NAL unit header are as specified in [HEVC] and described briefly below for convenience. In addition to the name and size of each field, the corresponding syntax element name in [HEVC] is also provided. F: 1 bit forbidden_zero_bit. MUST be zero. HEVC declares a value of 1 as a syntax violation. Note that the inclusion of this bit in the NAL unit header is to enable transport of HEVC video over MPEG-2 transport systems (avoidance of start code emulations) [MPEG2S]. Type: 6 bits nal_unit_type. This field specifies the NAL unit type as defined in Table 7-1 of [HEVC]. If the most significant bit of this Wang, et al Expires March 6, 2014 [Page 16] Internet-Draft RTP Payload Format for HEVC September 6, 2013 field of a NAL unit is equal to 0 (i.e. the value of this field is less than 32), the NAL unit is a VCL NAL unit. Otherwise, the NAL unit is a non-VCL NAL unit. For a reference of all currently defined NAL unit types and their semantics, please refer to Section 7.4.1 in [HEVC]. LayerId: 6 bits nuh_layer_id. MUST be equal to zero. It is anticipated that in future scalable or 3D video coding extensions of this specification, this syntax element will be used to identify additional layers that may be present in the coded video sequence, wherein a layer may be, e.g. a spatial scalable layer, a quality scalable layer, a texture view, or a depth view. TID: 3 bits nuh_temporal_id_plus1. This field specifies the temporal identifier of the NAL unit plus 1. The value of TemporalId is equal to TID minus 1. A TID value of 0 is illegal to ensure that there is at least one bit in the NAL unit header equal to 1, so to enable independent considerations of start code emulations in the NAL unit header and in the NAL unit payload data. 1.2. Overview of the Payload Format This payload format defines the following processes required for transport of HEVC coded data over RTP [RFC3550]: o Usage of RTP header with this payload format o Packetization of HEVC coded NAL units into RTP packets using three types of payload structures, namely single NAL unit packet, aggregation packet, and fragment unit o Transmission of HEVC NAL units of the same bitstream within a single RTP session or multiple RTP sessions o Media type parameters to be used with the Session Description Protocol (SDP) [RFC4566] Wang, et al Expires March 6, 2014 [Page 17] Internet-Draft RTP Payload Format for HEVC September 6, 2013 2. Conventions The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC 2119 [RFC2119]. This specification uses the notion of setting and clearing a bit when bit fields are handled. Setting a bit is the same as assigning that bit the value of 1 (On). Clearing a bit is the same as assigning that bit the value of 0 (Off). 3. Definitions and Abbreviations 3.1 Definitions This document uses the terms and definitions of [HEVC]. Section 3.1.1 lists relevant definitions copied from [HEVC] for convenience. Section 3.1.2 gives definitions specific to this memo. 3.1.1 Definitions from the HEVC Specification access unit: A set of NAL units that are associated with each other according to a specified classification rule, are consecutive in decoding order, and contain exactly one coded picture. BLA access unit: An access unit in which the coded picture is a BLA picture. BLA picture: An IRAP picture for which each VCL NAL unit has nal_unit_type equal to BLA_W_LP, BLA_W_RADL, or BLA_N_LP. coded video sequence: A sequence of access units that consists, in decoding order, of an IRAP access unit with NoRaslOutputFlag equal to 1, followed by zero or more access units that are not IRAP access units with NoRaslOutputFlag equal to 1, including all subsequent access units up to but not including any subsequent access unit that is an IRAP access unit with NoRaslOutputFlag equal to 1. Informative note: An IRAP access unit may be an IDR access unit, a BLA access unit, or a CRA access unit. The value of NoRaslOutputFlag is equal to 1 for each IDR access unit, each BLA Wang, et al Expires March 6, 2014 [Page 18] Internet-Draft RTP Payload Format for HEVC September 6, 2013 access unit, and each CRA access unit that is the first access unit in the bitstream in decoding order, is the first access unit that follows an end of sequence NAL unit in decoding order, or has HandleCraAsBlaFlag equal to 1. CRA access unit: An access unit in which the coded picture is a CRA picture. CRA picture: A RAP picture for which each VCL NAL unit has nal_unit_type equal to CRA_NUT. IDR access unit: An access unit in which the coded picture is an IDR picture. IDR picture: A RAP picture for which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or IDR_N_LP. IRAP access unit: An access unit in which the coded picture is an IRAP picture. IRAP picture: A coded picture for which each VCL NAL unit has nal_unit_type in the range of BLA_W_LP to RSV_IRAP_VCL23, inclusive. layer: A set of VCL NAL units that all have a particular value of nuh_layer_id and the associated non-VCL NAL units, or one of a set of syntactical structures having a hierarchical relationship. operation point: bitstream created from another bitstream by operation of the sub-bitstream extraction process with the another bitstream, a target highest TemporalId, and a target layer identifier list as inputs. random access: The act of starting the decoding process for a bitstream at a point other than the beginning of the stream. sub-layer: A temporal scalable layer of a temporal scalable bitstream consisting of VCL NAL units with a particular value of the TemporalId variable, and the associated non-VCL NAL units. tile: A rectangular region of coding tree blocks within a particular tile column and a particular tile row in a picture. Wang, et al Expires March 6, 2014 [Page 19] Internet-Draft RTP Payload Format for HEVC September 6, 2013 tile column: A rectangular region of coding tree blocks having a height equal to the height of the picture and a width specified by syntax elements in the picture parameter set. tile row: A rectangular region of coding tree blocks having a height specified by syntax elements in the picture parameter set and a width equal to the width of the picture. 3.1.2 Definitions Specific to This Memo media aware network element (MANE): A network element, such as a middlebox or application layer gateway that is capable of parsing certain aspects of the RTP payload headers or the RTP payload and reacting to their contents. Informative note: The concept of a MANE goes beyond normal routers or gateways in that a MANE has to be aware of the signaling (e.g., to learn about the payload type mappings of the media streams), and in that it has to be trusted when working with SRTP. The advantage of using MANEs is that they allow packets to be dropped according to the needs of the media coding. For example, if a MANE has to drop packets due to congestion on a certain link, it can identify and remove those packets whose elimination produces the least adverse effect on the user experience. After dropping packets, MANEs must rewrite RTCP packets to match the changes to the RTP packet stream as specified in Section 7 of [RFC3550]. NAL unit decoding order: A NAL unit order that conforms to the constraints on NAL unit order given in Section 7.4.2.4 in [HEVC]. NALU-time: The value that the RTP timestamp would have if the NAL unit would be transported in its own RTP packet. RTP packet stream: A sequence of RTP packets with increasing sequence numbers (except for wrap-around), identical PT and identical SSRC (Synchronization Source), carried in one RTP session. Within the scope of this memo, one RTP packet stream is utilized to transport one or more temporal sub-layers. Wang, et al Expires March 6, 2014 [Page 20] Internet-Draft RTP Payload Format for HEVC September 6, 2013 transmission order: The order of packets in ascending RTP sequence number order (in modulo arithmetic). Within an aggregation packet, the NAL unit transmission order is the same as the order of appearance of NAL units in the packet. base session: an RTP session in Multi-Session Transmission mode that transports a bitstream subset which the rest of RTP sessions in the Multi-Session Transmission depends on. [Ed. (YK): Check the need of this definition after the draft is more complete.] 3.2 Abbreviations AP Aggregation Packet BLA Broken Link Access CRA Clean Random Access CTB Coding Tree Block CTU Coding Tree Unit CVS Coded Video Sequence FU Fragmentation Unit GDR Gradual Decoding Refresh HRD Hypothetical Reference Decoder IDR Instantaneous Decoding Refresh IRAP Intra Random Access Point MANE Media Aware Network Element MST Multi-Session Transmission MTU Maximum Transfer Unit NAL Network Abstraction Layer Wang, et al Expires March 6, 2014 [Page 21] Internet-Draft RTP Payload Format for HEVC September 6, 2013 NALU Network Abstraction Layer Unit PPS Picture Parameter Set RADL Random Access Decodable Leading (Picture) RASL Random Access Skipped Leading (Picture) RPS Reference Picture Set SEI Supplemental Enhancement Information SPS Sequence Parameter Set SST Single-Session Transmission STSA Step-wise Temporal Sub-layer Access TSA Temporal Sub-layer Access VCL Video Coding Layer VPS Video Parameter Set 4. RTP Payload Format 4.1 RTP Header Usage The format of the RTP header is specified in [RFC3550] and reprinted in Figure 2 for convenience. This payload format uses the fields of the header in a manner consistent with that specification. The RTP payload (and the settings for some RTP header bits) for aggregation packets and fragmentation units are specified in Sections 4.7 and 4.8, respectively. Wang, et al Expires March 6, 2014 [Page 22] Internet-Draft RTP Payload Format for HEVC September 6, 2013 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |V=2|P|X| CC |M| PT | sequence number | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | timestamp | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | synchronization source (SSRC) identifier | +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ | contributing source (CSRC) identifiers | | .... | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 2 RTP header according to [RFC3550] The RTP header information to be set according to this RTP payload format is set as follows: Marker bit (M): 1 bit Set for the last packet of the access unit indicated by the RTP timestamp, in line with the normal use of the M bit in video formats, to allow an efficient playout buffer handling. Decoders can use this bit as an early indication of the last packet of an access unit. Informative note: The content of a NAL unit does not tell whether or not the NAL unit is the last NAL unit, in decoding order, of an access unit. An RTP sender implementation may obtain this information from the video encoder. If, however, the implementation cannot obtain this information directly from the encoder, e.g., when the stream was pre-encoded, and also there is no timestamp allocated for each NAL unit, then the sender implementation can inspect subsequent NAL units in decoding order to determine whether or not the NAL unit is the last NAL unit of an access unit as follows. A NAL unit naluX is the last NAL unit of an access unit if it is the last NAL unit of the stream or the next VCL NAL unit naluY in decoding order has the high-order bit of the first byte after its NAL unit header equal to 1, and all NAL units between naluX and naluY, when present, have nal_unit_type in the range of 32 to Wang, et al Expires March 6, 2014 [Page 23] Internet-Draft RTP Payload Format for HEVC September 6, 2013 35, inclusive, equal to 39, or in the ranges of 41 to 44, inclusive, or 48 to 55, inclusive. Payload type (PT): 7 bits The assignment of an RTP payload type for this new packet format is outside the scope of this document and will not be specified here. The assignment of a payload type has to be performed either through the profile used or in a dynamic way. Sequence number (SN): 16 bits Set and used in accordance with RFC 3550. Timestamp: 32 bits The RTP timestamp is set to the sampling timestamp of the content. A 90 kHz clock rate MUST be used. If the NAL unit has no timing properties of its own (e.g., parameter set and SEI NAL units), the RTP timestamp is set to the RTP timestamp of the coded picture of the access unit in which the NAL unit is included, according to Section 7.4.2.4.4 of [HEVC]. Receivers SHOULD ignore the picture output timing information in any picture timing SEI messages or decoding unit information SEI messages as specified in [HEVC]. Instead, receivers SHOULD use the RTP timestamp for the display process. Receivers MUST pass picture timing SEI messages and decoding unit information SEI messages to the decoder and MAY use the field/frame related information for the display process e.g. when frame doubling or frame tripling is indicated by the field/frame related information. 4.2 Payload Header Usage The TID value indicates (among other things) the relative importance of an RTP packet, for example because NAL units belonging to higher temporal sub-layers are not used for the decoding of lower temporal sub-layers. A lower value of TID indicates a higher importance. Wang, et al Expires March 6, 2014 [Page 24] Internet-Draft RTP Payload Format for HEVC September 6, 2013 More important NAL units MAY be better protected against transmission losses than less important NAL units. 4.3 Payload Structures The first two bytes of the payload of an RTP packet are referred to as the payload header. The payload header consists of the same fields (F, Type, LayerId, and TID) as the NAL unit header as shown in section 1.1.4, irrespective of the type of the payload structure. Three different types of RTP packet payload structures are specified. A receiver can identify the type of an RTP packet payload through the Type field in the payload header. The three different payload structures are as follows: o Single NAL unit packet: Contains a single NAL unit in the payload, and the NAL unit header of the NAL unit also serves as the payload header. This payload structure is specified in section 4.6. o Aggregation packet (AP): Contains more than one NAL unit within one access unit. This payload structure is specified in section 4.7. o Fragmentation unit (FU): Contains a subset of a single NAL unit. This payload structure is specified in section 4.8. 4.4 Transmission Modes This memo enables transmission of an HEVC bitstream over a single RTP session or multiple RTP sessions. The concept and working principle is inherited from [RFC6190] and follows a similar design. If only one RTP session is used for transmission of the HEVC bitstream, the transmission mode is referred to as single-session transmission (SST); otherwise (more than one RTP session is used for transmission of the HEVC bitstream), the transmission mode is referred to as multi-session transmission (MST). [Ed. (YK): Unify the style of abbreviated words throughout the document.] Wang, et al Expires March 6, 2014 [Page 25] Internet-Draft RTP Payload Format for HEVC September 6, 2013 SST SHOULD be used for point-to-point unicast scenarios, while MST SHOULD be used for point-to-multipoint multicast scenarios where different receivers require different operation points of the same HEVC bitstream, to improve bandwidth utilizing efficiency. Informative note: A multicast may degrade to a unicast after all but one receivers have left (this is a justification of the first "SHOULD" instead of "MUST"), and there might be scenarios where MST is desirable but not possible e.g. when IP multicast is not deployed in certain network (this is a justification of the second "SHOULD" instead of "MUST"). The transmission mode is indicated by the tx-mode media parameter (see section 7.1). If tx-mode is equal to "SST", SST MUST be used. Otherwise (tx-mode is equal to "MST"), MST MUST be used. 4.5 Decoding Order Number For each NAL unit, the variable AbsDon is derived, representing the decoding order number that is indicative of the NAL unit decoding order. Let NAL unit n be the n-th NAL unit in transmission order within an RTP session. If tx-mode is equal to "SST" and sprop-depack-buf-nalus is equal to 0, AbsDon[n], the value of AbsDon for NAL unit n, is derived as equal to n. Otherwise (tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater than 0), AbsDon[n] is derived as follows, where DON[n] is the value of the variable DON for NAL unit n: o If n is equal to 0 (i.e. NAL unit n is the very first NAL unit in transmission order), AbsDon[0] is set equal to DON[0]. o Otherwise (n is greater than 0), the following applies for derivation of AbsDon[n]: If DON[n] == DON[n-1], AbsDon[n] = AbsDon[n-1] Wang, et al Expires March 6, 2014 [Page 26] Internet-Draft RTP Payload Format for HEVC September 6, 2013 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768), AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1] If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768), AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n] If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768), AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - DON[n]) If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768), AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n]) For any two NAL units m and n, the following applies: o AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows NAL unit m in NAL unit decoding order. o When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order of the two NAL units can be in either order. o AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes NAL unit m in decoding order. When two consecutive NAL units in the NAL unit decoding order have different values of AbsDon, the value of AbsDon for the second NAL unit in decoding order MUST be greater than the value of AbsDon for the first NAL unit, and the absolute difference between the two AbsDon values MAY be greater than or equal to 1. Informative note: There are multiple reasons to allow for the absolute difference of the values of AbsDon for two consecutive NAL units in the NAL unit decoding order to be greater than one. An increment by one is not required, as at the time of associating values of AbsDon to NAL units, it may not be known whether all NAL units are to be delivered to the receiver. For example, a gateway may not forward VCL NAL units of higher sub- layers or some SEI NAL units when there is congestion in the network. In another example, the first intra picture of a pre- encoded clip is transmitted in advance to ensure that it is readily available in the receiver, and when transmitting the first intra picture, the originator does not exactly know how Wang, et al Expires March 6, 2014 [Page 27] Internet-Draft RTP Payload Format for HEVC September 6, 2013 many NAL units will be encoded before the first intra picture of the pre-encoded clip follows in decoding order. Thus, the values of AbsDon for the NAL units of the first intra picture of the pre-encoded clip have to be estimated when they are transmitted, and gaps in values of AbsDon may occur. Another example is MST where the AbsDon values must indicate cross-layer decoding order for NAL units conveyed in all the RTP sessions. 4.6 Single NAL Unit Packets A single NAL unit packet contains exactly one NAL unit, and consists of a payload header (denoted as PayloadHdr), an optional 16-bit DONL field (in network byte order), and the NAL unit payload data (the NAL unit excluding its NAL unit header) of the contained NAL unit, as shown in Figure 3. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | PayloadHdr | DONL (optional) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | NAL unit payload data | | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | :...OPTIONAL RTP padding | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 3 The structure a single NAL unit packet The payload header SHOULD be an exact copy of the NAL unit header of the contained NAL unit. However, the Type (i.e. nal_unit_type) field MAY be changed, e.g. when it is desirable to handle a CRA picture to be a BLA picture [JCTVC-J0107]. The DONL field, when present, specifies the value of the 16 least significant bits of the decoding order number of the contained NAL unit. If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater than 0, the DONL field MUST be present, and the variable DON for the Wang, et al Expires March 6, 2014 [Page 28] Internet-Draft RTP Payload Format for HEVC September 6, 2013 contained NAL unit is derived as equal to the value of the DONL field. Otherwise (tx-mode is equal to "SST" and sprop-depack-buf- nalus is equal to 0), the DONL field MUST NOT be present. 4.7 Aggregation Packets (APs) Aggregation packets (APs) are introduced to enable the reduction of packetization overhead for small NAL units, such as most of the non- VCL NAL units, which are often only a few octets in size. An AP aggregates NAL units within one access unit. Each NAL unit to be carried in an AP is encapsulated in an aggregation unit. NAL units aggregated in one AP are in NAL unit decoding order. An AP consists of a payload header (denoted as PayloadHdr) followed by two or more aggregation units, as shown in Figure 4. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | PayloadHdr | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | | one or more aggregation units | | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | :...OPTIONAL RTP padding | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 4 The structure of an aggregation packet The fields in the payload header are set as follows. The F bit MUST be equal to 0 if the F bit of each aggregated NAL unit is equal to zero; otherwise, it MUST be equal to 1. The Type field MUST be equal to 48. The value of LayerId MUST be equal to the lowest value of LayerId of all the aggregated NAL units. The value of TID MUST be the lowest value of TID of all the aggregated NAL units. Informative Note: All VCL NAL units in an AP have the same TID value since they belong to the same access unit. However, an AP may contain non-VCL NAL units for which the TID value in the NAL Wang, et al Expires March 6, 2014 [Page 29] Internet-Draft RTP Payload Format for HEVC September 6, 2013 unit header may be different than the TID value of the VCL NAL units in the same AP. An AP MUST carry at least two aggregation units and can carry as many aggregation units as necessary; however, the total amount of data in an AP obviously MUST fit into an IP packet, and the size SHOULD be chosen so that the resulting IP packet is smaller than the MTU size so to avoid IP layer fragmentation. An AP MUST NOT contain Fragmentation Units (FUs) specified in section 4.8. APs MUST NOT be nested; i.e., an AP MUST NOT contain another AP. The first aggregation unit in an AP consists of an optional 16-bit DONL field (in network byte order) followed by a 16-bit unsigned size information (in network byte order) that indicates the size of the NAL unit in bytes (excluding these two octets, but including the NAL unit header), followed by the NAL unit itself, including its NAL unit header, as shown in Figure 5. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : DONL (optional) | NALU size | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | NALU size | | +-+-+-+-+-+-+-+-+ NAL unit | | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 5 The structure of the first aggregation unit in an AP The DONL field, when present, specifies the value of the 16 least significant bits of the decoding order number of the aggregated NAL unit. If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater than 0, the DONL field MUST be present in an aggregation unit that is the first aggregation unit in an AP, and the variable DON for the aggregated NAL unit is derived as equal to the value of the DONL field. Otherwise (tx-mode is equal to "SST" and sprop-depack-buf- Wang, et al Expires March 6, 2014 [Page 30] Internet-Draft RTP Payload Format for HEVC September 6, 2013 nalus is equal to 0), the DONL field MUST NOT be present in an aggregation unit that is the first aggregation unit in an AP. An aggregation unit that is not the first aggregation unit in an AP consists of an optional 8-bit DOND field followed by a 16-bit unsigned size information (in network byte order) that indicates the size of the NAL unit in bytes (excluding these two octets, but including the NAL unit header), followed by the NAL unit itself, including its NAL unit header, as shown in Figure 6. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : DOND(optional)| NALU size | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | NAL unit | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 6 The structure of an aggregation unit that is not the first aggregation unit in an AP When present, the DOND field plus 1 specifies the difference between the decoding order number values of the current aggregated NAL unit and the preceding aggregated NAL unit in the same AP. If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater than 0, the DOND field MUST be present in an aggregation unit that is not the first aggregation unit in an AP, and the variable DON for the aggregated NAL unit is derived as equal to the DON of the preceding aggregated NAL unit in the same AP plus the value of the DOND field plus 1 modulo 65536. Otherwise (tx-mode is equal to "SST" and sprop-depack-buf-nalus is equal to 0), the DOND field MUST NOT be present in an aggregation unit that is not the first aggregation unit in an AP. Figure 7 presents an example of an AP that contains two aggregation units, labeled as 1 and 2 in the figure, without the DONL and DOND fields being present. Wang, et al Expires March 6, 2014 [Page 31] Internet-Draft RTP Payload Format for HEVC September 6, 2013 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | RTP Header | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | PayloadHdr | NALU 1 Size | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | NALU 1 HDR | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 1 Data | | . . . | | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | . . . | NALU 2 Size | NALU 2 HDR | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | NALU 2 HDR | | +-+-+-+-+-+-+-+-+ NALU 2 Data | | . . . | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | :...OPTIONAL RTP padding | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 7 An example of an AP packet containing two aggregation units without the DONL and DOND fields Figure 8 presents an example of an AP that contains two aggregation units, labeled as 1 and 2 in the figure, with the DONL and DOND fields being present. Wang, et al Expires March 6, 2014 [Page 32] Internet-Draft RTP Payload Format for HEVC September 6, 2013 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | RTP Header | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | PayloadHdr | NALU 1 DONL | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | NALU 1 Size | NALU 1 HDR | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | NALU 1 Data . . . | | | + . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | NALU 2 DOND | NALU 2 Size | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | NALU 2 HDR | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 2 Data | | | | . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | :...OPTIONAL RTP padding | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 8 An example of an AP containing two aggregation units with the DONL and DOND fields 4.8 Fragmentation Units (FUs) Fragmentation units (FUs) are introduced to enable fragmenting a single NAL unit into multiple RTP packets, possibly without cooperation or knowledge of the HEVC encoder. A fragment of a NAL unit consists of an integer number of consecutive octets of that NAL unit. Fragments of the same NAL unit MUST be sent in consecutive order with ascending RTP sequence numbers (with no other RTP packets within the same RTP packet stream being sent between the first and last fragment). When a NAL unit is fragmented and conveyed within FUs, it is referred to as a fragmented NAL unit. APs MUST NOT be fragmented. FUs MUST NOT be nested; i.e., an FU MUST NOT contain a subset of another FU. Wang, et al Expires March 6, 2014 [Page 33] Internet-Draft RTP Payload Format for HEVC September 6, 2013 The RTP timestamp of an RTP packet carrying an FU is set to the NALU-time of the fragmented NAL unit. An FU consists of a payload header (denoted as PayloadHdr), an FU header of one octet, an optional 16-bit DONL field (in network byte order), and an FU payload, as shown in Figure 9. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | PayloadHdr | FU header | DONL(optional)| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| | DONL(optional)| | |-+-+-+-+-+-+-+-+ | | FU payload | | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | :...OPTIONAL RTP padding | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 9 The structure of an FU The fields in the payload header are set as follows. The Type field MUST be equal to 49. The fields F, LayerId, and TID MUST be equal to the fields F, LayerId, and TID, respectively, of the fragmented NAL unit. The FU header consists of an S bit, an E bit, and a 6-bit FuType field, as shown in Figure 10. +---------------+ |0|1|2|3|4|5|6|7| +-+-+-+-+-+-+-+-+ |S|E| FuType | +---------------+ Figure 10 The structure of FU header Wang, et al Expires March 6, 2014 [Page 34] Internet-Draft RTP Payload Format for HEVC September 6, 2013 The semantics of the FU header fields are as follows: S: 1 bit When set to one, the S bit indicates the start of a fragmented NAL unit i.e., the first byte of the FU payload is also the first byte of the payload of the fragmented NAL unit. When the FU payload is not the start of the fragmented NAL unit payload, the S bit MUST be set to zero. E: 1 bit When set to one, the E bit indicates the end of a fragmented NAL unit, i.e., the last byte of the payload is also the last byte of the fragmented NAL unit. When the FU payload is not the last fragment of a fragmented NAL unit, the E bit MUST be set to zero. FuType: 6 bits The field FuType MUST be equal to the field Type of the fragmented NAL unit. The DONL field, when present, specifies the value of the 16 least significant bits of the decoding order number of the fragmented NAL unit. If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater than 0, and the S bit is equal to 1, the DONL field MUST be present in the FU, and the variable DON for the fragmented NAL unit is derived as equal to the value of the DONL field. Otherwise (tx-mode is equal to "SST" and sprop-depack-buf-nalus is equal to 0, or the S bit is equal to 0), the DONL field MUST NOT be present in the FU. A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e., the Start bit and End bit MUST NOT both be set to one in the same FU header. The FU payload consists of fragments of the payload of the fragmented NAL unit so that if the FU payloads of consecutive FUs, starting with an FU with the S bit equal to 1 and ending with an FU with the E bit equal to 1, are sequentially concatenated, the payload of the fragmented NAL unit can be reconstructed. The NAL unit header of the fragmented NAL unit is not included as such in Wang, et al Expires March 6, 2014 [Page 35] Internet-Draft RTP Payload Format for HEVC September 6, 2013 the FU payload, but rather the information of the NAL unit header of the fragmented NAL unit is conveyed in F, LayerId, and TID fields of the FU payload headers of the FUs and the FuType field of the FU header of the FUs. An FU payload MAY have any number of octets and MAY be empty. Informative note: Empty FU payloads are allowed to reduce the latency of a certain class of senders in nearly lossless environments. These senders can be characterized in that they packetize fragments of a NAL unit before the NAL unit is completely generated and, hence, before the NAL unit size is known. If zero-length FU payloads were not allowed, the sender would have to generate at least one bit of data of the following fragment of the NAL unit before the current FU could be sent. Due to the characteristics of HEVC, where sometimes several CTUs occupy zero bits, this is undesirable and can add delay. However, the (potential) use of zero-length FU payloads should be carefully weighted against the increased risk of the loss of at least a part of the fragmented NAL unit because of the additional packets employed for its transmission. If an FU is lost, the receiver SHOULD discard all following fragmentation units in transmission order corresponding to the same fragmented NAL unit, unless the decoder in the receiver is known to be prepared to gracefully handle incomplete NAL units. A receiver in an endpoint or in a MANE MAY aggregate the first n-1 fragments of a NAL unit to an (incomplete) NAL unit, even if fragment n of that NAL unit is not received. In this case, the forbidden_zero_bit of the NAL unit MUST be set to one to indicate a syntax violation. Wang, et al Expires March 6, 2014 [Page 36] Internet-Draft RTP Payload Format for HEVC September 6, 2013 5. Packetization Rules The following packetization rules apply: o If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater than 0 for an RTP session, the transmission order of NAL units carried in the RTP session MAY be different than the NAL unit decoding order. Otherwise (tx-mode is equal to "SST" and sprop- depack-buf-nalus is equal to 0 for an RTP session), the transmission order of NAL units carried in the RTP session MUST be the same as the NAL unit decoding order. o A NAL unit of a small size SHOULD be encapsulated in an aggregation packet together with one or more other NAL units in order to avoid the unnecessary packetization overhead for small NAL units. For example, non-VCL NAL units such as access unit delimiters, parameter sets, or SEI NAL units are typically small and can often be aggregated with VCL NAL units without violating MTU size constraints. o Each non-VCL NAL unit SHOULD be encapsulated in an aggregation packet together with its associated VCL NAL unit, as typically a non-VCL NAL unit would be meaningless without the associated VCL NAL unit being available. o For carrying exactly one NAL unit in an RTP packet, a single NAL unit packet MUST be used. 6. De-packetization Process The general concept behind de-packetization is to get the NAL units out of the RTP packets in an RTP session and all the dependent RTP sessions, if any, and pass them to the decoder in the NAL unit decoding order. The de-packetization process is implementation dependent. Therefore, the following description should be seen as an example of a suitable implementation. Other schemes may be used as well as long as the output for the same input is the same as the process described below. The output is the same when the set of NAL units Wang, et al Expires March 6, 2014 [Page 37] Internet-Draft RTP Payload Format for HEVC September 6, 2013 and their order are both identical. Optimizations relative to the described algorithms are possible. All normal RTP mechanisms related to buffer management apply. In particular, duplicated or outdated RTP packets (as indicated by the RTP sequences number and the RTP timestamp) are removed. To determine the exact time for decoding, factors such as a possible intentional delay to allow for proper inter-stream synchronization must be factored in. NAL units with NAL unit type values in the range of 0 to 47, inclusive may be passed to the decoder. NAL-unit-like structures with NAL unit type values in the range of 48 to 63, inclusive, MUST NOT be passed to the decoder. The receiver includes a receiver buffer, which is used to compensate for transmission delay jitter, to reorder NAL units from transmission order to the NAL unit decoding order, and to recover the NAL unit decoding order in MST, when applicable. In this section, the receiver operation is described under the assumption that there is no transmission delay jitter. To make a difference from a practical receiver buffer that is also used for compensation of transmission delay jitter, the receiver buffer is here after called the de-packetization buffer in this section. Receivers SHOULD also prepare for transmission delay jitter; i.e., either reserve separate buffers for transmission delay jitter buffering and de-packetization buffering or use a receiver buffer for both transmission delay jitter and de-packetization. Moreover, receivers SHOULD take transmission delay jitter into account in the buffering operation; e.g., by additional initial buffering before starting of decoding and playback. There are two buffering states in the receiver: initial buffering and buffering while playing. Initial buffering starts when the reception is initialized. After initial buffering, decoding and playback are started, and the buffering-while-playing mode is used. Regardless of the buffering state, the receiver stores incoming NAL units, in reception order, into the de-packetization buffer. NAL units carried in single NAL unit packets, APs, and FUs are stored in the de-packetization buffer individually, and the value of AbsDon is Wang, et al Expires March 6, 2014 [Page 38] Internet-Draft RTP Payload Format for HEVC September 6, 2013 calculated and stored for each NAL unit. When MST is in use, NAL units of all RTP packet streams are stored in the same de- packetization buffer. Initial buffering lasts until condition A (the number of NAL units in the de-packetization buffer is greater than the value of sprop- depack-buf-nalus of the highest RTP session) is true. After initial buffering, whenever condition A is true, the following operation is repeatedly applied until condition A becomes false: o The NAL unit in the de-packetization buffer with the smallest value of AbsDon is removed from the de-packetization buffer and passed to the decoder. When no more NAL units are flowing into the de-packetization buffer, all NAL units remaining in the de-packetization buffer are removed from the buffer and passed to the decoder in the order of increasing AbsDon values. 7. Payload Format Parameters This section specifies the parameters that MAY be used to select optional features of the payload format and certain features or properties of the bitstream. The parameters are specified here as part of the media type registration for the HEVC codec. A mapping of the parameters into the Session Description Protocol (SDP) [RFC4566] is also provided for applications that use SDP. Equivalent parameters could be defined elsewhere for use with control protocols that do not use SDP. 7.1 Media Type Registration The media subtype for the HEVC codec is allocated from the IETF tree. The receiver MUST ignore any unspecified parameter. Media Type name: video Media subtype name: H265 Wang, et al Expires March 6, 2014 [Page 39] Internet-Draft RTP Payload Format for HEVC September 6, 2013 Required parameters: none OPTIONAL parameters: In the following definitions of parameters, "the stream" or "the NAL unit stream" refers to all NAL units conveyed in the current RTP session in SST, and all NAL units conveyed in the current RTP session and all NAL units conveyed in other RTP sessions that the current RTP session depends on in MST. profile-space, profile-id: The profile-space parameter indicates the context for interpretation of the profile-id parameter value. The profile, which specifies the subset of coding tools that may have been used to generate the stream or that the receiver supports, as specified in [HEVC], is defined by the combination of profile-space and profile-id. Note that profile-space is required to be equal to 0 in [HEVC], but other values for it may be specified in the future by ITU-T or ISO/IEC. If the profile-space and profile-id parameters are used to indicate properties of a NAL unit stream, it indicates that, to decode the stream, the minimum subset of coding tools a decoder has to support is the profile specified by both parameters. If the profile-space and profile-id parameters are used for capability exchange or session setup, it indicates the subset of coding tools, which is equal to the profile, that the codec supports for both receiving and sending. If no profile-space is present, a value of 0 MUST be inferred and if no profile-id is present the Main profile (i.e. a value of 1) MUST be inferred. The profile-space and profile-id parameters are derived from the sequence parameter set or video parameter set NAL units, as specified in [HEVC], as follows. Wang, et al Expires March 6, 2014 [Page 40] Internet-Draft RTP Payload Format for HEVC September 6, 2013 For SST or for the stream corresponding to the highest RTP session of MST when MST is applied, the following applies: o profile_space = general_profile_space o profile_id = general_profile_idc For streams not corresponding to the highest RTP session of MST when MST is applied, the following applies, with j being the value of the sub-layer-id parameter: o profile_space = sub_layer_profile_space[j] o profile_id = sub_layer_profile_idc[j] tier-flag, level-id: The tier-flag parameter indicates the context for interpretation of the level-id value. The default level, which limits values of syntax elements or on arithmetic combinations of values of syntax elements, as specified in [HEVC], is defined by the combination of tier-flag and level- id. If the tier-flag and level-id parameters are used to indicate properties of a NAL unit stream, it indicates that, to decode the stream the lowest level the decoder has to support is the default level. If the tier-flag and level-id parameters are used for capability exchange or session setup, the following applies. If max-recv-level-id is not present, the default level defined by tier-flag and level-id indicates the highest level the codec wishes to support. Otherwise, tier-flag and max-recv- level-id indicate the highest level the codec supports for receiving. For either receiving or sending, all levels that are lower than the highest level supported MUST also be supported. If no tier-flag is present, a value of 0 MUST be inferred and if no level-id is present, a value of 93 (i.e. level 3.1) MUST be inferred. Wang, et al Expires March 6, 2014 [Page 41] Internet-Draft RTP Payload Format for HEVC September 6, 2013 The tier-flag and level-id parameters are derived from the sequence parameter set or video parameter set NAL units, as specified in [HEVC], as follows. For SST or for the stream corresponding to the highest RTP session of MST when MST is applied, the following applies: o tier-flag = general_tier_flag o level-id = general_level_idc For streams not corresponding to the highest RTP session of MST when MST is applied, the following applies, with j being the value of the sub-layer-id parameter: o tier-flag = sub_layer_tier_flag[j] o level-id = sub_layer_level_idc[j] interop-constraints: A base16 [RFC4648] (hexadecimal) representation of the six bytes derived from the sequence parameter set or video parameter set NAL units as specified in [HEVC] consisting of progressive_source_flag, interlaced_source_flag, non_packed_constraint_flag, frame_only_constraint_flag, and reserved_zero_44bits. Note that reserved_zero_44bits is required to be equal to 0 in [HEVC], but other values for it may be specified in the future by ITU-T or ISO/IEC. If no interop-constraints are present, the following MUST be inferred: o progressive_source_flag = 1 o interlaced_source_flag = 0 o non_packed_constraint_flag = 1 o frame_only_constraint_flag = 1 o reserved_zero_44bits = 0 For SST or for the stream corresponding to the highest RTP session of MST when MST is applied, the following applies: Wang, et al Expires March 6, 2014 [Page 42] Internet-Draft RTP Payload Format for HEVC September 6, 2013 o progressive_source_flag = general_progressive_source_flag o interlaced_source_flag = general_interlaced_source_flag o non_packed_constraint_flag = general_non_packed_constraint_flag o frame_only_constraint_flag = general_frame_only_constraint_flag o reserved_zero_44bits = general_reserved_zero_44bits For streams not corresponding to the highest RTP session of MST when MST is applied, the following applies, with j being the value of the sub-layer-id parameter: o progressive_source_flag = sub_layer_progressive_source_flag[j] o interlaced_source_flag = sub_layer_interlaced_source_flag[j] o non_packed_constraint_flag = sub_layer_non_packed_constraint_flag[j] o frame_only_constraint_flag = sub_layer_frame_only_constraint_flag[j] o reserved_zero_44bits = sub_layer_reserved_zero_44bits[j] profile-compatibility-indicator: A base16 [RFC4648] representation of the four bytes representing the 32 profile compatibility flags in the sequence parameter set or video parameter set NAL units. A decoder conforming to a certain profile may be able to decode bitstreams conforming to other profiles. The profile- compatibility-indicator provides exact information of the ability of a decoder conforming to a certain profile to decode bitstreams conforming to another profile. More concretely, if the profile compatibility flag corresponding to the profile, which a decoder conforms to, is set, then the decoder is able to decode that bitstream with the flag set, irrespective of the profile, which a bitstream conforms to (provided that the decoder supports the highest level of the bitstream). For SST or for the stream corresponding to highest RTP session of MST when MST is used with temporal scalability the following applies with j = 0..31: Wang, et al Expires March 6, 2014 [Page 43] Internet-Draft RTP Payload Format for HEVC September 6, 2013 o The 32 flags = general_profile_compatibility_flag[j] When MST is in use, for streams not corresponding to the highest RTP session, the following applies with i being the value of the sub-layer-id parameter and j = 0..31: o The 32 flags = sub_layer_profile_compatibility_flag[i][j] sub-layer-id: This parameter MAY be used to indicate the highest allowed value of TID in the stream. When not present, the value of sub-layer-id is inferred to be equal to 6. recv-sub-layer-id: This parameter MAY be used to signal a receiver's choice of the offers or declared sub-layers in the sprop-vps. The value of recv-sub-layer-id indicates the index of the highest sub- layer of the stream that a receiver supports. When not present, the value of recv-sub-layer-id is inferred to be equal to sub-layer-id. max-recv-level-id: This parameter MAY be used, together with tier-flag, to indicate the highest level a receiver supports. The highest level the receiver supports is equal to the value of max-recv- level-id divided by 30 for the Main or High tier (as determined by tier-flag equal to 0 or 1, respectively). When max-recv-level-id is not present, the value is inferred to be equal to level-id. max-recv-level-id MUST NOT be present when the highest level the receiver supports is not higher than the default level. sprop-vps: This parameter MAY be used to convey any video parameter set NAL unit of the stream. When present, the parameter MAY be used to indicate codec capability and sub-stream Wang, et al Expires March 6, 2014 [Page 44] Internet-Draft RTP Payload Format for HEVC September 6, 2013 characteristics (i.e. properties of sub-layer representations as defined in [HEVC]) as well as for out-of-band transmission of video parameter sets. The value of the parameter is a comma-separated (',') list of base64 [RFC4648] representations of the video parameter set NAL units as specified in Section 7.3.2.1 of [HEVC]. sprop-sps: This parameter MAY be used to convey sequence parameter set NAL units of the stream for out-of-band transmission of sequence parameter sets. The value of the parameter is a comma-separated (',') list of base64 [RFC4648] representations of the sequence parameter set NAL units as specified in Section 7.3.2.2 of [HEVC]. sprop-pps: This parameter MAY be used to convey picture parameter set NAL units of the stream for out-of-band transmission of picture parameter sets. The value of the parameter is a comma- separated (',') list of base64 [RFC4648] representations of the picture parameter set NAL units as specified in Section 7.3.2.3 of [HEVC]. max-ls, max-lps, max-cpb, max-dpb, max-br, max-tr, max-tc: These parameters MAY be used to signal the capabilities of a receiver implementation. These parameters MUST NOT be used for any other purpose. The highest level (specified by tier- flag and max-recv-level-id) MUST be such that the receiver is fully capable of supporting. max-ls, max-lps, max-cpb, max- dpb, max-br, max-tr, and max-tc MAY be used to indicate capabilities of the receiver that extend the required capabilities of the highest level, as specified below. When more than one parameter from the set (max-ls, max-lps, max-cpb, max-dpb, max-br, max-tr, max-tc) is present, the receiver MUST support all signaled capabilities simultaneously. For example, if both max-ls and max-br are present, the highest level with the extension of both the Wang, et al Expires March 6, 2014 [Page 45] Internet-Draft RTP Payload Format for HEVC September 6, 2013 picture rate and bitrate is supported. That is, the receiver is able to decode NAL unit streams in which the luma sample rate is up to max-ls (inclusive), the bitrate is up to max-br (inclusive), the coded picture buffer size is derived as specified in the semantics of the max-br parameter below, and the other properties comply with the highest level specified by tier-flag and max-recv-level-id. Informative note: When the OPTIONAL media type parameters are used to signal the properties of a NAL unit stream, and max-ls, max-lps, max-cpb, max-dpb, max-br, max-tr, and max- tc are not present, the values of profile-space, profile- id, tier-flag, and level-id must always be such that the NAL unit stream complies fully with the specified profile and level. max-ls: The value of max-ls is an integer indicating the maximum processing rate in units of luma samples per second. The max- ls parameter signals that the receiver is capable of decoding video at a higher rate than is required by the highest level. When max-ls is signaled, the receiver MUST be able to decode NAL unit streams that conform to the highest level, with the exception that the MaxLumaSR value in Table A-2 of [HEVC] for the highest level is replaced with the value of max-ls. The value of max-ls MUST be greater than or equal to the value of MaxLumaSR given in Table A-2 of [HEVC] for the highest level. Senders MAY use this knowledge to send pictures of a given size at a higher picture rate than is indicated in the highest level. When not present, the value of max-ls is inferred to be equal to the value of MaxLumaSR given in Table A-2 of [HEVC] for the highest level. max-lps: The value of max-lps is an integer indicating the maximum picture size in units of luma samples. The max-lps parameter signals that the receiver is capable of decoding larger picture sizes than are required by the highest level. When Wang, et al Expires March 6, 2014 [Page 46] Internet-Draft RTP Payload Format for HEVC September 6, 2013 max-lps is signaled, the receiver MUST be able to decode NAL unit streams that conform to the highest level, with the exception that the MaxLumaPS value in Table A-1 of [HEVC] for the highest level is replaced with the value of max-lps. The value of max-lps MUST be greater than or equal to the value of MaxLumaPS given in Table A-1 of [HEVC] for the highest level. Senders MAY use this knowledge to send larger pictures at a proportionally lower picture rate than is indicated in the highest level. When not present, the value of max-lps is inferred to be equal to the value of MaxLumaPS given in Table A-1 of [HEVC] for the highest level. max-cpb: The value of max-cpb is an integer indicating the maximum coded picture buffer size in units of CpbBrVclFactor bits for the VCL HRD parameters and in units of CpbBrNalFactor bits for the NAL HRD parameters, where CpbBrVclFactor and CpbBrNalFactor are defined in Section A.4 of [HEVC]. The max- cpb parameter signals that the receiver has more memory than the minimum amount of coded picture buffer memory required by the highest level. When max-cpb is signaled, the receiver MUST be able to decode NAL unit streams that conform to the highest level, with the exception that the MaxCPB value in Table A-1 of [HEVC] for the highest level is replaced with the value of max-cpb. The value of max-cpb MUST be greater than or equal to the value of MaxCPB given in Table A-1 of [HEVC] for the highest level. Senders MAY use this knowledge to construct coded video streams with greater variation of bitrate than can be achieved with the MaxCPB value in Table A- 1 of [HEVC]. When not present, the value of max-cpb is inferred to be equal to the value of MaxCPB given in Table A-1 of [HEVC] for the highest level. Informative note: The coded picture buffer is used in the hypothetical reference decoder (Annex C of HEVC). The use of the hypothetical reference decoder is recommended in HEVC encoders to verify that the produced bitstream Wang, et al Expires March 6, 2014 [Page 47] Internet-Draft RTP Payload Format for HEVC September 6, 2013 conforms to the standard and to control the output bitrate. Thus, the coded picture buffer is conceptually independent of any other potential buffers in the receiver, including de-packetization and de-jitter buffers. The coded picture buffer need not be implemented in decoders as specified in Annex C of HEVC, but rather standard-compliant decoders can have any buffering arrangements provided that they can decode standard-compliant bitstreams. Thus, in practice, the input buffer for a video decoder can be integrated with de-packetization and de-jitter buffers of the receiver. max-dpb: The value of max-dpb is an integer indicating the maximum decoded picture buffer size in units decoded pictures at the MaxLumaPS for the highest level, i.e. number of decoded pictures at the maximum picture size defined by the highest level. The value of max-dpb MUST be smaller than or equal to 16. The max-dpb parameter signals that the receiver has more memory than the minimum amount of decoded picture buffer memory required by default, which is MaxDpbPicBuf as defined in [HEVC] (equal to 6). When max-dpb is signaled, the receiver MUST be able to decode NAL unit streams that conform to the highest level, with the exception that the MaxDpbPicBuff value defined in [HEVC] as 6 is replaced with the value of max-dpb. Consequently, a receiver that signals max-dpb MUST be capable of storing the following number of decoded pictures (MaxDpbSize) in its decoded picture buffer: if( PicSizeInSamplesY <= ( MaxLumaPS >> 2 ) ) MaxDpbSize = Min( 4 * max-dpb, 16 ) else if ( PicSizeInSamplesY <= ( MaxLumaPS >> 1 ) ) MaxDpbSize = Min( 2 * max-dpb, 16 ) else if ( PicSizeInSamplesY <= ( ( 3 * MaxLumaPS ) >> 2 ) ) MaxDpbSize = Min( (4 * max-dpb) / 3, 16 ) else MaxDpbSize = max-dpb Wherein MaxLumaPS given in Table A-1 of [HEVC] for the highest level and PicSizeInSamplesY is the current size of each decoded picture in units of luma samples as defined in [HEVC]. Wang, et al Expires March 6, 2014 [Page 48] Internet-Draft RTP Payload Format for HEVC September 6, 2013 The value of max-dpb MUST be greater than or equal to the value of MaxDpbPicBuf (i.e. 6) as defined in [HEVC]. Senders MAY use this knowledge to construct coded video streams with improved compression. When not present, the value of max-dpb is inferred to be equal to the value of MaxDpbPicBuf (i.e. 6) as defined in [HEVC]. Informative note: This parameter was added primarily to complement a similar codepoint in the ITU-T Recommendation H.245, so as to facilitate signaling gateway designs. The decoded picture buffer stores reconstructed samples. There is no relationship between the size of the decoded picture buffer and the buffers used in RTP, especially de- packetization and de-jitter buffers. max-br: The value of max-br is an integer indicating the maximum video bitrate in units of CpbBrVclFactor bits per second for the VCL HRD parameters and in units of CpbBrNalFactor bits per second for the NAL HRD parameters, where CpbBrVclFactor and CpbBrNalFactor are defined in Section A.4 of [HEVC]. The max-br parameter signals that the video decoder of the receiver is capable of decoding video at a higher bitrate than is required by the highest level. When max-br is signaled, the video codec of the receiver MUST be able to decode NAL unit streams that conform to the highest level, with the following exceptions in the limits specified by the highest level: o The value of max-br replaces the MaxBR value in Table A-2 of [HEVC] for the highest level. o When the max-cpb parameter is not present, the result of the following formula replaces the value of MaxCPB in Table A-1 of [HEVC]: (MaxCPB of the highest level) * max-br / (MaxBR of the highest level) Wang, et al Expires March 6, 2014 [Page 49] Internet-Draft RTP Payload Format for HEVC September 6, 2013 For example, if a receiver signals capability for Main profile Level 2 with max-br equal to 2000, this indicates a maximum video bitrate of 2000 kbits/sec for VCL HRD parameters, a maximum video bitrate of 2200 kbits/sec for NAL HRD parameters, and a CPB size of 2000000 bits (2000000 / 1500000 * 1500000). The value of max-br MUST be greater than or equal to the value MaxBR given in Table A-2 of [HEVC] for the highest level. Senders MAY use this knowledge to send higher bitrate video as allowed in the level definition of Annex A of HEVC to achieve improved video quality. When not present, the value of max-br is inferred to be equal to the value of MaxBR given in Table A-2 of [HEVC] for the highest level. Informative note: This parameter was added primarily to complement a similar codepoint in the ITU-T Recommendation H.245, so as to facilitate signaling gateway designs. The assumption that the network is capable of handling such bitrates at any given time cannot be made from the value of this parameter. In particular, no conclusion can be drawn that the signaled bitrate is possible under congestion control constraints. max-tr: The value of max-tr is an integer indication the maximum number of tile rows. The max-tr parameter signals that the receiver is capable of decoding video with a larger number of tile rows than the value allowed by the highest level. When max-tr is signaled, the receiver MUST be able to decode NAL unit streams that conform to the highest level, with the exception that the MaxTileRows value in Table A-1 of [HEVC] for the highest level is replaced with the value of max-tr. The value of max-tr MUST be greater than or equal to the value of MaxTileRows given in Table A-1 of [HEVC] for the highest level. Senders MAY use this knowledge to send pictures Wang, et al Expires March 6, 2014 [Page 50] Internet-Draft RTP Payload Format for HEVC September 6, 2013 utilizing a larger number of tile rows than the value allowed by the highest level. When not present, the value of max-tr is inferred to be equal to the value of MaxTileRows given in Table A-1 of [HEVC] for the highest level. max-tc: The value of max-tc is an integer indication the maximum number of tile columns. The max-tc parameter signals that the receiver is capable of decoding video with a larger number of tile columns than the value allowed by the highest level. When max-tc is signaled, the receiver MUST be able to decode NAL unit streams that conform to the highest level, with the exception that the MaxTileCols value in Table A-1 of [HEVC] for the highest level is replaced with the value of max-tc. The value of max-tc MUST be greater than or equal to the value of MaxTileCols given in Table A-1 of [HEVC] for the highest level. Senders MAY use this knowledge to send pictures utilizing a larger number of tile columns than the value allowed by the highest level. When not present, the value of max-tc is inferred to be equal to the value of MaxTileCols given in Table A-1 of [HEVC] for the highest level. max-fps: The value of max-fps is an integer indicating the maximum picture rate in units of hundreds of pictures per second that can be efficiently received. The max-fps parameter MAY be used to signal that the receiver has a constraint in that it is not capable of decoding video efficiently at the full picture rate that is implied by the highest level and, when present, one or more of the parameters max-ls, max-lps, and max-br. Wang, et al Expires March 6, 2014 [Page 51] Internet-Draft RTP Payload Format for HEVC September 6, 2013 The value of max-fps is not necessarily the picture rate at which the maximum picture size can be sent, it constitutes a constraint on maximum picture rate for all resolutions. Informative note: The max-fps parameter is semantically different from max-ls, max-lps, max-cpb, max-dpb, max-br, max-tr, and max-tc in that max-fps is used to signal a constraint, lowering the maximum picture rate from what is implied by other parameters. The encoder MUST use a picture rate equal to or less than this value. In cases where the max-fps parameter is absent the encoder is free to choose any picture rate according to the highest level and any signaled optional parameters. tx-mode: This parameter indicates whether the transmission mode is SST or MST. The value of tx-mode MUST be equal to either "MST" or "SST". When not present, the value of tx-mode is inferred to be equal to "SST". If the value is equal to "MST", MST MUST be in use. Otherwise (the value is equal to "SST"), SST MUST be in use. The value of tx-mode MUST be equal to "MST" for all RTP sessions in an MST. sprop-depack-buf-nalus: This parameter specifies the maximum number of NAL units that precede a NAL unit in the de-packetization buffer in reception order and follow the NAL unit in decoding order. The value of sprop-depack-buf-nalus MUST be an integer in the range of 0 to 32767, inclusive. When not present, the value of sprop-depack-buf-nalus is inferred to be equal to 0. Wang, et al Expires March 6, 2014 [Page 52] Internet-Draft RTP Payload Format for HEVC September 6, 2013 When the RTP session depends on one or more other RTP sessions (in this case tx-mode MUST be equal to "MST"), this parameter MUST be present and the value of sprop-depack-buf-nalus MUST be greater than 0. sprop-depack-buf-bytes: This parameter signals the required size of the de- packetization buffer in units of bytes. The value of the parameter MUST be greater than or equal to the maximum buffer occupancy (in units of bytes) of the de-packetization buffer as specified in section 6. The value of sprop-depack-buf-bytes MUST be an integer in the range of 0 to 4294967295, inclusive. When the RTP session depends on one or more other RTP sessions (in this case tx-mode MUST be equal to "MST") or sprop-depack- buf-nalus is present and is greater than 0, this parameter MUST be present and the value of sprop-depack-buf-bytes MUST be greater than 0. Informative note: The value of sprop-depack-buf-bytes indicates the required size of the de-packetization buffer only. When network jitter can occur, an appropriately sized jitter buffer has to be available as well. depack-buf-cap: This parameter signals the capabilities of a receiver implementation and indicates the amount of de-packetization buffer space in units of bytes that the receiver has available for reconstructing the NAL unit decoding order. A receiver is able to handle any stream for which the value of the sprop- depack-buf-bytes parameter is smaller than or equal to this parameter. When not present, the value of depack-buf-cap is inferred to be equal to 0. The value of depack-buf-cap MUST be an integer in the range of 0 to 4294967295, inclusive. Wang, et al Expires March 6, 2014 [Page 53] Internet-Draft RTP Payload Format for HEVC September 6, 2013 Informative note: depack-buf-cap indicates the maximum possible size of the de-packetization buffer of the receiver only. When network jitter can occur, an appropriately sized jitter buffer has to be available as well. sprop-segmentation-id: This parameter MAY be used to signal the segmentation tools present in the stream and that can be used for parallelization. The value of sprop-segmentation-id MUST be an integer in the range of 0 to 3, inclusive. When not present, the value of sprop-segmentation-id is inferred to be equal to 0. When sprop-segmentation-id is equal to 0, no information about the segmentation tools is provided. When sprop-segmentation- id is equal to 1, it indicates that slices are present in the stream. When sprop-segmentation-id is equal to 2, it indicates that tiles are present in the stream. When sprop- segmentation-id is equal to 3, it indicates that WPP is used in the stream. sprop-spatial-segmentation-idc: A base16 [RFC4648] representation of the syntax element min_spatial_segmentation_idc as specified in [HEVC]. This parameter MAY be used to describe parallelization capabilities of the stream. dec-parallel-cap: This parameter MAY be used to indicate the decoder's additional decoding capabilities given the presence of tools enabling parallel decoding, such as slices, tiles, and WPP, in the video stream. The decoding capability of the decoder may vary with the setting of the parallel decoding tools present in the stream, e.g. the size of the tiles that are present in a stream. Therefore, multiple capability points may be provided, each indicating the minimum required decoding capability that is associated with a parallelism requirement, Wang, et al Expires March 6, 2014 [Page 54] Internet-Draft RTP Payload Format for HEVC September 6, 2013 which is a requirement on the video stream that enables parallel decoding. Each capability point is defined as a combination of 1) a parallelism requirement, 2) a profile (determined by profile- space and profile-id), 3) a highest level, and 4) a maximum processing rate, a maximum picture size, and a maximum video bitrate that may be equal to or greater than that determined by the highest level.The parameter's syntax in ABNF [RFC5234] is as follows: dec-parallel-cap = "dec-parallel-cap={" cap-point *("," cap-point) "}" cap-point = ("w" / "t") ":" spatial-seg-idc 1*(";" cap-parameter) spatial-seg-idc = 1*4DIGIT ; 1-4095 cap-parameter = tier-flag / level-id / max-ls / max-lps / max-br The set of capability points expressed by the dec-parallel-cap parameter is enclosed in a pair of curly braces ("{}"). Each set of two consecutive capability points is separated by a comma (','). Within each capability point, each set of two consecutive parameters, and when present, their values, is separated by a semicolon (';'). The profile of all capability points is determined by profile- space and profile-id that are outside the dec-parallel-cap parameter. Each capability point starts with an indication of the parallelism requirement, which consists of a parallel tool type, which may be equal to 'w' or 't', and a decimal value of the spatial-seg-idc parameter. When the type is 'w', the capability point is valid only for H.265 bitstreams with WPP in use, i.e., entropy_coding_sync_enabled_flag equal to 1. When the type is 't', the capability point is valid only for H.265 bitstreams with WPP not in use (i.e. Wang, et al Expires March 6, 2014 [Page 55] Internet-Draft RTP Payload Format for HEVC September 6, 2013 entropy_coding_sync_enabled_flag equal to 0). The capability- point is valid only for H.265 bitstreams with min_spatial_segmentation_idc equal to or greater than spatial- seg-idc. The value of spatial-seg-idc MUST be greater than 0. After the parallelism requirement indication, each capability point continues with one or more pairs of parameter and value in any order for any of the following parameters: o tier-flag o level-id o max-ls o max-lps o max-br At most one occurrence of each of the above five parameters is allowed within each capability point. The values of dec-parallel-cap.tier-flag and dec-parallel- cap.level-id for a capability point indicate the highest level of the capability point. The values of dec-parallel-cap.max- ls, dec-parallel-cap.max-lps, and dec-parallel-cap.max-br for a capability point indicate the maximum processing rate in units of luma samples per second, the maximum picture size in units of luma samples, and the maximum video bitrate (in units of CpbBrVclFactor bits per second for the VCL HRD parameters and in units of CpbBrNalFactor bits per second for the NAL HRD parameters) where CpbBrVclFactor and CpbBrNalFactor are defined in Section A.4 of [HEVC]). When not present, the value of dec-parallel-cap.tier-flag is inferred to be equal to the value of tier-flag outside the dec-parallel-cap parameter. When not present, the value of dec-parallel-cap.level-id is inferred to be equal to the value of max-recv-level-id outside the dec-parallel-cap parameter. When not present, the value of dec-parallel-cap.max-ls, dec- parallel-cap.max-lps, or dec-parallel-cap.max-br is inferred to be equal to the value of max-ls, max-lps, or max-br, respectively, outside the dec-parallel-cap parameter. Wang, et al Expires March 6, 2014 [Page 56] Internet-Draft RTP Payload Format for HEVC September 6, 2013 The general decoding capability, expressed by the set of parameters outside of dec-parallel-cap, is defined as the capability point that is determined by the following combination of parameters: 1) the parallelism requirement corresponding to the value of sprop-segmentation-id equal to 0 for a stream, 2) the profile determined by profile-space and profile-id, 3) the highest level determined by tier-flag and max-recv-level-id, and 4) the maximum processing rate, the maximum picture size, and the maximum video bitrate determined by the highest level. The general decoding capability MUST NOT be included as one of the set of capability points in the dec-parallel-cap parameter. For example, the following parameters express the general decoding capability of 720p30 (Level 3.1) plus an additional decoding capability of 1080p30 (Level 4) given that the spatially largest tile or slice used in the bitstream is equal to or less than 1/3 of the picture size: a=fmtp:98 level-id=93;dec-parallel-cap={t:8;level-id=120} For another example, the following parameters express an additional decoding capability of 1080p30, using dec-parallel- cap.max-ls and dec-parallel-cap.max-lps, given that WPP is used in the stream: a=fmtp:98 level-id=93;dec-parallel-cap={w:8; max-ls=2088960;max-lps=62668800} Informative note: When min_spatial_segmentation_idc is present in a stream and WPP is not used, [HEVC] specifies that there is no slice or no tile in the stream containing more than 4 * PicSizeInSamplesY / ( min_spatial_segmentation_idc + 4 ) luma samples. Encoding considerations: This type is only defined for transfer via RTP (RFC 3550). Security considerations: Wang, et al Expires March 6, 2014 [Page 57] Internet-Draft RTP Payload Format for HEVC September 6, 2013 See Section 9 of RFC XXXX. Public specification: Please refer to Section 13 of RFC XXXX. Additional information: None File extensions: none Macintosh file type code: none Object identifier or OID: none Person & email address to contact for further information: Intended usage: COMMON Author: See Section 14 of RFC XXXX. Change controller: IETF Audio/Video Transport Payloads working group delegated from the IESG. 7.2 SDP Parameters The receiver MUST ignore any parameter unspecified in this memo. 7.2.1 Mapping of Payload Type Parameters to SDP The media type video/H265 string is mapped to fields in the Session Description Protocol (SDP) [RFC4566] as follows: o The media name in the "m=" line of SDP MUST be video. o The encoding name in the "a=rtpmap" line of SDP MUST be H265 (the media subtype). o The clock rate in the "a=rtpmap" line MUST be 90000. Wang, et al Expires March 6, 2014 [Page 58] Internet-Draft RTP Payload Format for HEVC September 6, 2013 o The OPTIONAL parameters "profile-space", "profile-id", "tier- flag", "level-id", "interop-constraints", "profile-compatibility- indicator", "sub-layer-id", "recv-sub-layer-id", "max-recv-level- id", "max-ls", "max-lps", "max-cpb", "max-dpb", "max-br", "max- tr", "max-tc", "max-fps", "tx-mode", "sprop-depack-buf-nalus", "sprop-depack-buf-bytes", "depack-buf-cap", "sprop-segmentation- id", "sprop-spatial-segmentation-idc", and "dec-parallel-cap", when present, MUST be included in the "a=fmtp" line of SDP. This parameter is expressed as a media type string, in the form of a semicolon separated list of parameter=value pairs. o The OPTIONAL parameters "sprop-vps", "sprop-sps", and "sprop- pps", when present, MUST be included in the "a=fmtp" line of SDP or conveyed using the "fmtp" source attribute as specified in section 6.3 of [RFC5576]. For a particular media format (i.e., RTP payload type), "sprop-vps" "sprop-sps", or "sprop-pps" MUST NOT be both included in the "a=fmtp" line of SDP and conveyed using the "fmtp" source attribute. When included in the "a=fmtp" line of SDP, these parameters are expressed as a media type string, in the form of a semicolon separated list of parameter=value pairs. When conveyed using the "fmtp" source attribute, these parameters are only associated with the given source and payload type as parts of the "fmtp" source attribute. Informative note: Conveyance of "sprop-vps", "sprop-sps", and "sprop-pps" using the "fmtp" source attribute allows for out- of-band transport of parameter sets in topologies like Topo- Video-switch-MCU as specified in [RFC5117]. An example of media representation in SDP is as follows: m=video 49170 RTP/AVP 98 a=rtpmap:98 H265/90000 a=fmtp:98 profile-id=1; sprop-vps=