CLUE                                                      C. Groves, Ed.
Internet-Draft                                                   W. Yang
Intended status: Informational                                   R. Even
Expires: February 13, 2014                                        Huawei
                                                         August 12, 2013


  Describing Captures in CLUE and relation to multipoint conferencing
                   draft-groves-clue-multi-content-00

Abstract

   In a multipoint Telepresence conference, there are more than two
   sites participating.  Additional complexity is required to enable
   media streams from each participant to show up on the displays of the
   other participants.  Common policies to address the multipoint case
   include "site-switch" and "segment-switch".  The document will
   discuss these policies as well as the "composed" policy and how they
   work in the multipoint case.

   The current CLUE framework document contains the "composed" and
   "switched" attributes to describe situations where a capture is a mix
   or composition of streams or where the capture represents a dynamic
   subset of streams.  "Composed" and "switched" are capture level
   attributes.  In addition to these attributes the framework defines an
   attribute "Scene-switch-policy" on a capture scene entry (CSE) level
   which indicates how the captures are switched.

   This draft discusses composition/switching in CLUE and makes a number
   of proposals to better define and support these capabilities.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 13, 2014.


Groves, et al.          Expires February 13, 2014               [Page 1]

Internet-Draft              Abbreviated Title                August 2013


Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Issues  . . . . . . . . . . . . . . . . . . . . . . . . . . .   4
     2.1.  Role of an MCU in a multipoint conference . . . . . . . .   4
     2.2.  Relation to scene . . . . . . . . . . . . . . . . . . . .   6
     2.3.  Description of the contents of a switched/composed
           capture . . . . . . . . . . . . . . . . . . . . . . . . .   6
     2.4.  Attribute interaction . . . . . . . . . . . . . . . . . .   7
     2.5.  Policy  . . . . . . . . . . . . . . . . . . . . . . . . .   8
     2.6.  Media stream composition and encodings  . . . . . . . . .   8
     2.7.  Relation of switched captures to simultaneous
           transmission sets . . . . . . . . . . . . . . . . . . . .   9
     2.8.  Conveying spatial information for switched/composed
           captures  . . . . . . . . . . . . . . . . . . . . . . . .   9
     2.9.  Consumer selection  . . . . . . . . . . . . . . . . . . .  10
   3.  Proposal  . . . . . . . . . . . . . . . . . . . . . . . . . .  10
     3.1.  CLUE Syntax Updates . . . . . . . . . . . . . . . . . . .  11
       3.1.1.  Definitions . . . . . . . . . . . . . . . . . . . . .  12
       3.1.2.  Multiple Content Capture Details  . . . . . . . . . .  12
       3.1.3.  MCC Attributes  . . . . . . . . . . . . . . . . . . .  13
       3.1.4.  MCC Attributes  . . . . . . . . . . . . . . . . . . .  13
       3.1.5.  Composition policy  . . . . . . . . . . . . . . . . .  14
       3.1.6.  Synchronisation . . . . . . . . . . . . . . . . . . .  14
       3.1.7.  MCC and encodings . . . . . . . . . . . . . . . . . .  15
       3.1.8.  MCCs and STSs . . . . . . . . . . . . . . . . . . . .  16
       3.1.9.  Consumer Behaviour  . . . . . . . . . . . . . . . . .  16
       3.1.10. MCU Behaviour . . . . . . . . . . . . . . . . . . . .  17
         3.1.10.1.  Single content captures and multiple contents
                    capture in the same Advertisement  . . . . . . .  17
         3.1.10.2.  Several multiple content captures in the same
                    Advertisement  . . . . . . . . . . . . . . . . .  18
     3.2.  Multipoint Conferencing Framework Updates . . . . . . . .  19
     3.3.  Existing Parameter Updates  . . . . . . . . . . . . . . .  20
       3.3.1.  Composed  . . . . . . . . . . . . . . . . . . . . . .  20
       3.3.2.  Switched  . . . . . . . . . . . . . . . . . . . . . .  21
       3.3.3.  Scene-switch-policy . . . . . . . . . . . . . . . . .  22
       3.3.4.  MCU behaviour . . . . . . . . . . . . . . . . . . . .  24
   4.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  24
   5.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  24
   6.  Security Considerations . . . . . . . . . . . . . . . . . . .  25
   7.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  25
     7.1.  Normative References  . . . . . . . . . . . . . . . . . .  25
     7.2.  Informative References  . . . . . . . . . . . . . . . . .  25
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  26

1.  Introduction

   One major objective for Telepresence is to be able to preserve the
   "Being there" user experience.  However, in multi-site conferences it
   is often (in fact usually) not possible to simultaneously provide
   full size video, eye contact, common perception of gestures and gaze
   by all participants.  Several policies can be used for stream
   distribution and display: all provide good results but they all make
   different compromises.

   The policies are described in [I-D.ietf-clue-telepresence-use-cases].
   [RFC6501] has the following requirement:

   REQMT-14:  The solution MUST support mechanisms to make possible for
           either or both site switching or segment switching.  [Edt:
           This needs rewording.  Deferred until layout discussion is
           resolved.]

   The policies described in the use case draft include the site-switch,
   segment-switch and composed policies.

   Site switch is described in the CLUE use case document as follows:
   "One common policy is called site switching.  Let's say the speaker
   is at site A and everyone else is at a "remote" site.  When the room
   at site A is shown, all the camera images from site A are forwarded
   to the remote sites.  Therefore at each receiving remote site, all
   the screens display camera images from site A.  This can be used to
   preserve full size image display, and also provide full visual
   context of the displayed far end, site A.  In site switching, there
   is a fixed relation between the cameras in each room and the displays
   in remote rooms.  The room or participants being shown is switched
   from time to time based on who is speaking or by manual control,
   e.g., from site A to site B."

   These policies are mirrored in the framework document through a
   number of attributes.

   Currently in the CLUE framework document [I-D.ietf-clue-framework]
   there are two media capture attributes: Composed and Switched.

   Composed is defined as:

           A field with a Boolean value which indicates whether or not
           the Media Capture is a mix (audio) or composition (video) of
           streams.

           This attribute is useful for a media consumer to avoid
           nesting a composed video capture into another composed
           capture or rendering.  This attribute is not intended to
           describe the layout a media provider uses when composing
           video streams.

   Switched is defined as:

           A field with a Boolean value which indicates whether or not
           the Media Capture represents the (dynamic) most appropriate
           subset of a 'whole'.  What is 'most appropriate' is up to the
           provider and could be the active speaker, a lecturer or a
           VIP.

   There is also a Capture Scene Entry (CSE) attribute "scene switch
   policy" defined as:

           A media provider uses this scene-switch-policy attribute to
           indicate its support for different switching policies.

2.  Issues

   This section discusses a number of issues in the current framework
   around the support of switched/composed captures and media streams
   when considering multipoint conferencing.  Some issues concern
   functions that are required but missing, and some relate to the
   current description in the framework document.

2.1.  Role of an MCU in a multipoint conference

   In a multipoint conference there is a central control point (MCU).
   The MCU will have the CLUE advertisements from all the conference
   participants and will prepare and send advertisements to all the
   conference participants.  The MCU will also have more information
   about the conference, participants and media which it receives at
   conference creation and via call signalling.  This data is not stable
   since each user who joins or leaves the conference causes a change in
   conference state.  An MCU supporting SIP may utilise the Conference
   event package, XCON and CCMP to maintain and distribute conference
   state.

   [RFC4575] defines a conference event package.  Using the event
   framework, notifications are sent about changes in the membership of
   this conference and optionally about changes in the state of
   additional conference components.  The conference information is
   composed of the conference description, host information, conference
   state, and users that have endpoints, where each endpoint includes
   the media description.

   [RFC6501] extends the conference event package and tries to be
   signalling protocol agnostic.  RFC6501 adds new elements but also
   provides values for some of the elements defined in RFC4575; for
   example, it defines roles (like "administrator", "moderator", "user",
   "participant", "observer", and "none").

   [RFC6503] Centralized Conferencing Manipulation Protocol (CCMP)
   allows authenticated and authorized users to create, manipulate, and
   delete conference objects.  Operations on conferences include adding
   and removing participants, changing their roles, as well as adding
   and removing media streams and associated endpoints.

   CCMP implements the client-server model within the XCON framework,
   with the conferencing client and conference server acting as client
   and server, respectively.  CCMP uses HTTP as the protocol to transfer
   requests and responses, which contain the domain-specific XML-encoded
   data objects defined in [RFC6501] "Conference Information Data Model
   for Centralized Conferencing (XCON)".

   The XCON data model and CCMP provide a generic way to create and
   control conferences.  CCMP is not SIP specific but a SIP endpoint
   will subscribe to the conference event package to get information
   about changes in the conference state.

   Therefore when an MCU implements the above protocols there will be an
   interaction between any CLUE states and those within a conferencing
   framework.  For example: if an endpoint leaves a conference this will
   mean that an MCU may need to indicate via CLUE to the other endpoints
   that those captures are no longer available and it would also need to
   indicate via the Conferencing framework that the endpoint is no
   longer part of the conference.

   The question is how these concepts relate, as the Conferencing
   framework does not have the concept of captures or scenes.  Other
   aspects overlap, for example:

           The conference framework has "available media", CLUE has
           encodings to indicate codec.

           The conference framework has "users", CLUE has no concept of
           users although it has capture attributes that relate to the
           users in a capture.

   It is noted that point-to-point calls may not implement the
   conferencing framework.  It is desirable that CLUE procedures be the
   same whether an endpoint is communicating with a peer endpoint or an
   MCU.

2.2.  Relation to scene

   One of the early justifications for switching / composition was the
   ability to switch between sites.  When looking at the CLUE framework
   there is no concept of "site" in the CLUE hierarchy.  The closest
   concept is an "endpoint" but this has no identity within the CLUE
   syntax.  The highest level is the "clueInfo" that includes
   captureScenes and an endpoint may have multiple capture scenes.

   If the switched and composed attributes are specified at a capture
   level it is not clear what the correlation is between the capture and
   the endpoint / scenes, particularly when the attributes are described
   in the context of sites.  A scene may be composed of multiple
   captures.  Where an MCU is involved in a conference with multiple
   endpoints, multiple capture scenes are involved.  It becomes
   difficult to map all the scenes and capture information from the
   source endpoints into one capture scene sent to an endpoint.
   Discussion of switching, composition et al. needs to be described in
   terms of the CLUE concepts.

   When considering the SIP conferencing framework it can be seen that
   there are complications with interworking with the scene concept.
   There may be multiple media of the same type e.g. room view and
   presentation but they are easily identified.  This also needs to be
   considered.

2.3.  Description of the contents of a switched/composed capture

   When considering switching and composition, whilst the result may be
   represented by one capture and one resulting media stream, there may
   be multiple original source captures.  Each of these source captures
   would have had its own set of attributes.  A media capture with the
   composed attribute allows the description of the capture as a whole
   but not a description of the constituent parts.  In the case of an
   MCU taking multiple input media captures and compositing them into
   one output capture, the CLUE related characteristics of these inputs
   are lost in the current solution.  Alternate methods such as the
   RFC6501 layout field etc. may need to be investigated.

   Consider the case where MCUs receive CLUE advertisements from various
   endpoints.  Having a single capture with a switched attribute makes
   it difficult to fully express what the content is when it is from
   multiple endpoints.  It may be possible to specify lists of capture
   attribute values when sending an advertisement from the MCU, i.e.
   role=speaker,audience but it becomes difficult to relate multiple
   attributes, i.e.
   (role=speaker,language=English),(role=audience,language=french).
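
   The difficulty of relating multiple attributes can be illustrated
   with a minimal sketch.  The Python structures and the helper
   `languages_for` below are hypothetical illustrations, not CLUE
   syntax; the attribute values are those of the example above.

```python
# Flat per-attribute lists: nothing ties "speaker" to "English", so the
# pairing between role and language cannot be recovered reliably.
flat = {"role": ["speaker", "audience"], "language": ["English", "french"]}

# Grouped per source capture: each group keeps its attributes together.
grouped = [
    {"role": "speaker", "language": "English"},
    {"role": "audience", "language": "french"},
]


def languages_for(role, groups):
    """Return the languages of the groups that carry the given role."""
    return [g["language"] for g in groups if g["role"] == role]


speaker_languages = languages_for("speaker", grouped)
```

   Only the grouped form lets a consumer answer a question such as
   "which language does the speaker use?".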

   One capture could represent source captures from multiple locations.
   A consumer may wish to examine the inputs to a switched capture,
   i.e. choose which of the original endpoints it wants to see/hear.  In
   order to do this the original capture information would need to be
   conveyed in a manner that minimises overhead for the MCU.

   Being able to link multiple source captures to one mixed
   (switched/composed) capture in a CLUE advertisement would allow a
   fuller description of the content of the capture.

2.4.  Attribute interaction

   Today the "composed" and "switched" attributes appear at a media
   capture level.  If "switched" is specified for multiple captures in a
   capture scene it's not clear from the framework what the switching
   policy is.  For example: If a CSE contains three VCs each with
   "switched" does the switch occur between these captures?  Does the
   switch occur internal to each capture?

   The "scene-switch-policy" CSE attribute has been defined to indicate
   switch policy but there doesn't appear to be a description of whether
   this only relates to captures marked with "switched" and/or
   "composed".  If a CSE marked with "scene-switch-policy" contains
   non-switched, non-composed captures what does this mean?

   What are the interactions between the two properties?  E.g. Are
   "switched" and "composed" attributes mutually exclusive or not?  Is a
   switched capture with a scene switch policy of "segment-switched" a
   "composed" capture?

   These issues need to be clarified in the framework.

2.5.  Policy

   The "Scene-switch-policy" attribute allows the indication of whether
   switched captures are "site" or "segment" switched.  However there is
   no indication of what the switch or the composition "trigger" policy
   is.  Content could be provided based on a round robin view, loudest
   speaker etc.  Where an advertising endpoint supports different
   algorithms it would be advantageous for a consumer to know and select
   an applicable policy.

2.6.  Media stream composition and encodings

   Whether single or multiple streams are used for switched captures is
   not clear from the capture description.  For example:

   There are 3 endpoints (A,B,C) each with 3 video captures
   (VCa1,VCa2,VCa3, etc.).  An MCU wants to indicate to endpoint C that
   it can offer a switched view of endpoints A and B.

   It could send an Advertisement with CSE (VCa1,VCa2,VCa3,
   VCb1,VCb2,VCb3),scene-switch-policy=site-switch.

   Normally such a configuration (without the switch policy) would
   relate to 6 different media streams.  Switching introduces several
   possibilities.

   For site switching:

   a)      There could be one media stream with the contents of all 6
           captures.  The MCU always sends a composed image with the VCs
           from the applicable endpoint.

   b)      There could be two media streams each containing the VCs from
           one endpoint, the MCU chooses which stream to send.

   c)      There could be 6 media streams.  The MCU chooses which 3
           streams to send.

   For segment switching this is further complicated because the MCU may
   choose to send media related to endpoint A or B.  There is no text
   describing any limitation so the MCU may send 1 VC or 5.
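
   The three site-switching possibilities listed above can be compared
   with a small sketch.  The function name `site_switch_options` and
   the option labels are assumptions made for illustration only; the
   sketch simply counts how many media streams each option sets up and
   how many carry media at any one time.

```python
def site_switch_options(endpoints, captures_per_endpoint):
    """Return (streams set up, streams sent at once) for options a) to c)."""
    total_captures = endpoints * captures_per_endpoint
    return {
        # a) one composed stream carrying the VCs of the active endpoint
        "a": (1, 1),
        # b) one stream per endpoint; the MCU chooses which one to send
        "b": (endpoints, 1),
        # c) one stream per capture; the MCU sends one endpoint's worth
        "c": (total_captures, captures_per_endpoint),
    }


# Endpoints A and B, each with 3 video captures, as in the example above.
options = site_switch_options(endpoints=2, captures_per_endpoint=3)
```

   Nothing in the capture description tells a consumer which of these
   three outcomes to expect.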

   Utilising CLUE "encodings" may be a way to describe how the switch is
   taking place in terms of media provided but such description is
   missing from the framework.  One could assume that an individual
   encoding could be assigned to multiple media captures (i.e. multiple
   VCs to indicate they are encoded in the same stream) but again this
   is problematic as the framework indicates that "An Individual
   Encoding can be assigned to at most one Capture Encoding at any given
   time."

   This could do with further clarification in the framework.

2.7.  Relation of switched captures to simultaneous transmission sets

   Simultaneous Transmission Set is defined as "a set of Media Captures
   that can be transmitted simultaneously from a Media Provider."  It's
   not clear how this definition would relate to switched or composed
   streams.  The captures may not be able to be sent at the same time
   but may form a timeslot on a particular stream.  They may be provided
   together but not at precisely the same time.

   Section 6.3 of the current version of the framework indicates that:

           "It is a syntax conformance requirement that the simultaneous
           transmission sets must allow all the media captures in any
           particular Capture Scene Entry to be used simultaneously."

   If switching or composition is specified at a capture level only it
   is evident that simultaneity constraints do not come into play.
   However if multiple captures are used in a single media stream, i.e.
   associated with the CSE, then these may be subject to a simultaneous
   transmission set description.

   It is also noted that there is a similar issue for encoding groups.
   See section 8/[Framework]:

           "It is a protocol conformance requirement that the Encoding
           Groups must allow all the Captures in a particular Capture
           Scene Entry to be used simultaneously."

   If "switching" is used then there is no need to send the encodings at
   the same time.

   This needs to be clarified.

2.8.  Conveying spatial information for switched/composed captures

   CLUE currently allows the ability to signal spatial information
   related to a media capture.  It is unclear in the current draft how
   this would work with switching/composition.  In section 6.1 /
   [I-D.ietf-clue-telepresence-use-cases] it does say:

           "For a switched capture that switches between different
           sections within a larger area, the area of capture should use
           coordinates for the larger potential area."

   This describes a single capture, not the case where there are
   multiple switched captures.  It appears to focus on segment switching
   rather than site switching and does not appear to cover "composed"
   (if it is related).

   An advertiser may or may not want to use common spatial attributes
   for captures associated with a switched capture.  For example: it
   may be beneficial for the Advertiser in a composed image to indicate
   that different captures have a different capture area in a virtual
   space.

   This should be given consideration in the framework.

2.9.  Consumer selection

   Section 6.2.2 of version 9 of [I-D.ietf-clue-framework] indicates
   that an Advertiser may provide multiple values for the
   "scene-switch-policy" and that the Consumer may choose and return the
   value it requires.

   In version 9 of the framework there was no mechanism in CLUE for a
   Consumer to choose and return individual values from capture scene,
   CSE or media capture attributes.

   In version 10 of the framework the text was updated to indicate that
   the consumer could choose values from a list.  It is not clear that
   this capability is needed as the procedure only relates to the
   "scene-switch-policy".  The switching policy may be better specified
   by other means.

3.  Proposal

   As has been discussed above there are a number of issues with regards
   to the support of switched/composed captures/streams in CLUE
   particularly when considering MCUs.  The authors believe that there
   is no single action that can address the above issues.  Several
   options are discussed below.  The options are not mutually exclusive.

   1)      Introduce syntax to CLUE to better describe source captures

   2)      Introduce updates to the XCON conferencing framework (e.g.
           Conference package, XCON etc.) to introduce CLUE concepts.

   3)      Update CLUE to better describe the current suite of
           attributes with the understanding these provide limited
           information with respect to source information.

3.1.  CLUE Syntax Updates

   The authors believe that there are a number of requirements for this:

   -       It should be possible to advertise the individual captures
           that make up a single switched/composed media stream before
           receiving the actual media stream.

   -       It should be possible to describe the relationship between
           captures that make up a single switched/composed media
           stream.

   -       It should be possible to describe this using CLUE semantics
           rather than with terms such as "site" or "segment" which need
           their own definition.

   The authors also believe that whether media is composed, segment
   switched or site switched, the common element is that the media
   stream contains multiple captures from potentially multiple sources.

   [I-D.ietf-clue-framework] does have the "Scene-switch-policy"
   attribute at a CSE level but as described in section 2 it is not
   sufficient for several reasons.  E.g. it is not possible to assign an
   encoding to a CSE, a CSE cannot reference captures from multiple
   scenes and there is a relationship with STSs that needs to be
   considered.

   In order to be able to fully express and support media streams with
   multiple captures the authors propose a new type of capture, the
   "multiple content capture" (MCC).  The MCC is essentially the same as
   audio or video captures in that it may have its own attributes.  The
   main difference is that it can also include other captures, which
   indicates that the MCC is composed of those captures.  This
   composition may be positional (i.e. segments/tiling) or time
   composition (switched) etc. and is specified by a policy attribute.
   The MCC can be assigned an encoding.  For example:

           MCC1(VC1,VC2,VC3),[POLICY]

   This would indicate that MCC1 is composed of 3 video captures
   according to the policy.

   One further difference is that an MCC may reference individual
   captures from multiple scenes.  For example:

           CS#1(VC1,VC2)

           CS#2(VC3,VC4)

           CS#3(MCC1(VC1,VC3))

   This would indicate that scene #3 contains an MCC that is composed
   from individual captures VC1 and VC3.  This allows the consumer to
   associate any capture scene properties from the original scene with
   the multiple content capture.
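
   As a minimal sketch of how a consumer might trace the captures of an
   MCC back to their defining scenes, the example above can be modelled
   as follows.  The `CaptureScene` class and the `source_scenes` helper
   are hypothetical and are not part of any CLUE data model.

```python
from dataclasses import dataclass, field


@dataclass
class CaptureScene:
    scene_id: str
    captures: list = field(default_factory=list)  # capture IDs defined here
    mccs: dict = field(default_factory=dict)      # MCC ID -> referenced IDs


def source_scenes(scenes, mcc_id):
    """Map each capture referenced by mcc_id to the scene defining it."""
    defined_in = {c: s.scene_id for s in scenes for c in s.captures}
    for scene in scenes:
        if mcc_id in scene.mccs:
            return {c: defined_in[c] for c in scene.mccs[mcc_id]}
    raise KeyError(mcc_id)


scenes = [
    CaptureScene("CS#1", captures=["VC1", "VC2"]),
    CaptureScene("CS#2", captures=["VC3", "VC4"]),
    CaptureScene("CS#3", mccs={"MCC1": ["VC1", "VC3"]}),
]
origins = source_scenes(scenes, "MCC1")
```

   With this mapping the consumer can look up the capture scene
   properties of CS#1 for VC1 and of CS#2 for VC3.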

   The MCC would be able to be utilised by both normal endpoints and
   MCUs.  For example: it would allow an endpoint to construct a mixed
   video stream that is a virtual scene with a composition of
   presentation video and individual captures.

   This proposal does not consider any relation to the SIP conferencing
   framework.

   The sections below provide more detail on the proposal.

3.1.1.  Definitions

   Multiple content capture: Media capture for audio or video that
   indicates the capture contains multiple audio or video captures.
   Individual media captures may or may not be present in the resultant
   capture encoding depending on time or space.  Denoted as MCCn in the
   example cases in this document.

3.1.2.  Multiple Content Capture Details

   The MCC indicates that multiple captures are contained in one media
   capture by referencing the applicable individual media captures.
   Only one capture type (i.e. audio, video, etc.) is allowed in each
   MCC instance.  The MCC contains a reference to the media captures as
   well as attributes associated with the MCC itself.  The MCC may
   reference individual captures from other capture scenes.  If an MCC
   is used in a CSE that CSE may also reference captures from other
   Capture Scenes.

   Note: Different Capture Scenes are not spatially related.

   Each instance of the MCC has its own captureID, e.g. MCC1.  This
   allows all the individual captures contained in the MCC to be
   referenced by a single ID.

   The example below shows the use of a MultipleContent capture:

    CaptureScene1 [VC1 {attributes},
                   VC2 {attributes},
                   VC3 {attributes},
                   MCC1(VC1,VC2,VC3){attributes}]


   This indicates that MCC1 is a single capture that contains the
   captures VC1, VC2 and VC3 according to any MCC1 attributes.
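
   The rules in this section (an MCC references existing captures and
   holds only one capture type) could be checked along the following
   lines.  This is a sketch only: `validate_mcc`, the `media_types`
   mapping and the audio capture `AC1` are assumptions introduced for
   illustration.

```python
def validate_mcc(mcc_id, referenced, media_types):
    """Check an MCC against the rules of this section.

    media_types maps every advertised capture ID to "audio" or "video".
    Returns the single media type shared by the referenced captures.
    """
    unknown = [c for c in referenced if c not in media_types]
    if unknown:
        raise ValueError(f"{mcc_id} references unknown captures: {unknown}")
    types = {media_types[c] for c in referenced}
    if len(types) != 1:
        # Covers both mixed audio/video references and an empty MCC.
        raise ValueError(f"{mcc_id} must hold exactly one capture type")
    return types.pop()


media_types = {"VC1": "video", "VC2": "video", "VC3": "video", "AC1": "audio"}
mcc1_type = validate_mcc("MCC1", ["VC1", "VC2", "VC3"], media_types)
```

   An MCC that referenced both VC1 and AC1 would be rejected by this
   check.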

   One or more MCCs may also be specified in a CSE.  This allows an
   Advertiser to indicate that several MCC captures are used to
   represent a capture scene.

   Note: Section 6.1/[I-D.ietf-clue-framework] indicates that "A Media
   Capture is associated with exactly one Capture Scene".  For MCC this
   could be further clarified to indicate that "A Media Capture is
   defined in a capture scene and is given an advertisement unique
   identity.  The identity may be referenced outside the Capture Scene
   that defines it through a multiple content capture (MCC)."

3.1.3.  MCC Attributes

   Attributes may be associated with the MCC instance and the individual
   captures that the MCC references.  A provider should avoid providing
   conflicting attribute values between the MCC and individual captures.
   Where there is conflict the attributes of the MCC override any that
   may be present in the individual captures.
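
   The override rule can be sketched as a simple merge in which MCC
   values take precedence.  The function `effective_attributes` is
   hypothetical, and the attribute names are borrowed from the example
   in section 2.3 purely for illustration.

```python
def effective_attributes(mcc_attrs, capture_attrs):
    """Merge attributes; where both define a name, the MCC value wins."""
    merged = dict(capture_attrs)
    merged.update(mcc_attrs)
    return merged


vc1 = {"role": "audience", "language": "French"}
mcc1 = {"role": "speaker"}
resolved = effective_attributes(mcc1, vc1)
```

   Attributes that only the individual capture defines ("language"
   here) are left untouched by the merge.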

   There are two MCC specific attributes "MaxCaptures" and "Policy"
   which are used to give more information regarding when the individual
   captures appear and what policy is used to determine this.

   The spatial related attributes can be further used to determine how
   the individual captures "appear" within a stream.  For example a
   virtual scene could be constructed for the MCC capture with two video
   captures with a "MaxCaptures" attribute of 2 and an "area of capture"
   attribute provided with an overall area.  Each of the individual
   captures could then also include an "area of capture" attribute with
   a sub-set of the overall area.  The consumer would then know the
   relative position of the content in the composed stream.  For
   example: The above capture scene may indicate that VC1 has an x-axis
   capture area 1-5, VC2 6-10 and VC3 11-15.  The MCC capture may
   indicate an x-axis capture area 1-15.
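
   A consumer could use these x-axis ranges as sketched below.  The
   helper names are hypothetical, and real "area of capture" values
   describe more than a single axis; only the x-axis ranges from the
   example above are modelled.

```python
def within_overall_area(mcc_area, capture_areas):
    """True if every capture's x-axis range lies inside the MCC's range."""
    low, high = mcc_area
    return all(
        low <= start <= end <= high for start, end in capture_areas.values()
    )


def relative_order(capture_areas):
    """Capture IDs ordered by increasing start of their x-axis range."""
    return sorted(capture_areas, key=lambda c: capture_areas[c][0])


areas = {"VC1": (1, 5), "VC2": (6, 10), "VC3": (11, 15)}
ok = within_overall_area((1, 15), areas)
order = relative_order(areas)
```

   The ordering gives the relative position of each individual capture
   within the overall area advertised for the MCC.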

3.1.4.  MCC Attributes

   MaxCaptures:{integer}

   This field is only associated with MCCs and indicates the maximum
   number of individual captures that may appear in a capture encoding
   at a time.  It may be used to derive how the individual captures
   within the MCC are composed with regard to space and time.
   Individual content in the capture may be switched in time so that
   only one of the individual captures/CSEs is shown (MaxCaptures:1).
   The individual captures may be composed so that they are all shown in
   the MCC (MaxCaptures:n).

   For example:

           MCC1(VC1,VC2,VC3),MaxCaptures:1

   This would indicate that the Advertiser would switch (or compose,
   depending on policy) between VC1, VC2 and VC3 in the capture
   encoding, as there may be a maximum of one capture at a time.
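
   The constraint expressed by MaxCaptures can be sketched as a simple
   check on which captures are present in the capture encoding at one
   instant.  The "MCC" class and the "may_appear_together" function
   below are hypothetical names used only for this illustration.

```python
# Illustrative only: a minimal model of an MCC and its MaxCaptures attribute.
from dataclasses import dataclass

@dataclass(frozen=True)
class MCC:
    name: str
    captures: tuple      # identities of the referenced individual captures
    max_captures: int    # value of the MaxCaptures attribute

def may_appear_together(mcc, shown):
    """Check that the captures shown at one instant respect MaxCaptures."""
    shown = set(shown)
    return shown <= set(mcc.captures) and len(shown) <= mcc.max_captures

mcc1 = MCC("MCC1", ("VC1", "VC2", "VC3"), max_captures=1)
print(may_appear_together(mcc1, {"VC2"}))          # a single capture
print(may_appear_together(mcc1, {"VC1", "VC2"}))   # exceeds MaxCaptures:1
```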

3.1.5.  Composition policy

   TBD - This attribute is to address what algorithm the endpoint/MCU
   uses to determine what appears in the MCC captures.  E.g. loudest,
   round robin.

3.1.6.  Synchronisation

   Note: The {scene-switch-policy} attribute has values that indicate
   "site-switch" or "segment-switch".  The distinction between these is
   that "site-switch" indicates that when there is mixed content,
   captures related to an endpoint appear together.  "segment-switch"
   indicates that captures from different endpoints could appear
   together.  An issue is that a Consumer has no concept of "endpoints",
   only "capture scenes".  Also, as highlighted, a Consumer has no
   method to return parameters for CSEs.

   The use of MCCs enables the Advertiser to communicate to the Consumer
   that captures originate from different capture scenes.  In cases
   where multiple MCCs represent a scene (i.e. multiple MCCs in a CSE)
   an Advertiser may wish to indicate that captures from one capture
   scene are present in the capture encodings of specified MCCs at the
   same time.  Having an attribute at capture level removes the need for
   CSE level attributes which are problematic for consumers.

   Synch-id:{integer}

   This MCC attribute indicates how the individual captures in multiple
   MCCs are synchronised.  To indicate that the capture encodings
   associated with MCCs contain captures from the same source at the
   same time, the Advertiser should set the same Synch-id on each of
   the concerned MCCs.  It is the provider that determines what the
   source for the captures is.  For example when the provider is in an
   MCU it may determine that each separate CLUE endpoint is a remote
   source of media.

   For example:

    CaptureScene1[Description=AustralianConfRoom,
                  VC1(left),VC2(middle),VC3(right),
                  CSE1(VC1,VC2,VC3)]
    CaptureScene2[Description=ChinaConfRoom,
                  VC4(left),VC5(middle),VC6(right),
                  CSE2(VC4,VC5,VC6)]
    CaptureScene3[MCC1(VC1,VC4){Synch-id:1}{encodinggroup1},
                  MCC2(VC2,VC5){Synch-id:1}{encodinggroup2},
                  MCC3(VC3,VC6){encodinggroup3},
                  CSE3(MCC1,MCC2,MCC3)]

                     Figure 1: Synchronisation Example

   The above advertisement would indicate that MCC1, MCC2 and MCC3 make
   up a capture scene.  There would be three capture encodings.  Because
   MCC1 and MCC2 have the same Synch-id, encoding1 and encoding2 would
   together have content from only capture scene 1 or only capture scene
   2 at a particular point in time.  Encoding3 would not be synchronised
   with encoding1 or encoding2.

   Without this attribute it is assumed that multiple MCCs may provide
   different sources at any particular point in time.
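
   A Consumer's interpretation of the attribute in Figure 1 could be
   sketched as follows.  The dictionary representation (MCC name mapped
   to its Synch-id, or None when the attribute is absent) and the
   function names are assumptions made for illustration only.

```python
# Illustrative only: group MCCs by Synch-id. MCCs without a Synch-id are
# treated as unsynchronised, as the text above assumes.
from collections import defaultdict

def synchronised_groups(mccs):
    """Map each Synch-id to the names of the MCCs that share it."""
    groups = defaultdict(list)
    for name, synch_id in mccs.items():
        if synch_id is not None:
            groups[synch_id].append(name)
    return dict(groups)

def are_synchronised(mccs, a, b):
    """True if MCCs a and b carry the same (non-absent) Synch-id."""
    return mccs[a] is not None and mccs[a] == mccs[b]

# Figure 1: MCC1 and MCC2 share Synch-id 1; MCC3 has none.
figure1 = {"MCC1": 1, "MCC2": 1, "MCC3": None}
print(synchronised_groups(figure1))
print(are_synchronised(figure1, "MCC1", "MCC3"))
```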

3.1.7.  MCC and encodings

   MCCs shall be assigned an encoding group and thus become a capture
   encoding.  The captures referenced by the MCC do not need to be
   assigned to an encoding group.  This means that all the individual
   captures referenced by the MCC will appear in the capture encoding
   according to any MCC attributes.  This allows an Advertiser to
   specify capture attributes associated with the individual captures
   without the need to provide an individual capture encoding for each
   of the inputs.

   If an encoding group is assigned to an individual capture referenced
   by the MCC it indicates that this capture may also have an individual
   capture encoding.

   For example:


    CaptureScene1 [VC1 {encoding group1},
                   VC2,
                   MCC1(VC1,VC2){encoding group3}]


   This would indicate that VC1 may be sent as its own capture encoding
   from encoding group1 or that it may be sent as part of a capture
   encoding from encoding group3 along with VC2.
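
   The rule can be sketched by listing, for each capture, the capture
   encodings in which it may appear.  The "encoding_options" function
   and the two dictionaries are hypothetical structures used only for
   this illustration.

```python
# Illustrative only: list the capture encodings in which a capture may appear.

def encoding_options(capture, encoding_groups, mccs):
    """Return (carrier, encoding group) pairs through which a capture is sent.

    encoding_groups maps a capture or MCC name to its encoding group (if any);
    mccs maps an MCC name to the captures it references.
    """
    options = []
    # The capture has its own encoding group: it may be sent individually.
    if capture in encoding_groups:
        options.append((capture, encoding_groups[capture]))
    # The capture is referenced by an MCC that has an encoding group.
    for mcc, members in mccs.items():
        if capture in members and mcc in encoding_groups:
            options.append((mcc, encoding_groups[mcc]))
    return options

encoding_groups = {"VC1": "encoding group1", "MCC1": "encoding group3"}
mccs = {"MCC1": ("VC1", "VC2")}

print(encoding_options("VC1", encoding_groups, mccs))
print(encoding_options("VC2", encoding_groups, mccs))
```

   In this sketch VC2 has no encoding group of its own, so it can only
   be delivered within the capture encoding of MCC1.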

   Note: Section 8 of [I-D.ietf-clue-framework] indicates that every
   capture is associated with an encoding group.  To utilise MCCs this
   requirement has to be relaxed.

3.1.8.  MCCs and STSs

   The MCC can be used in simultaneous sets, therefore providing a means
   to indicate whether several multiple content captures can be provided
   at the same time.  Captures within an MCC can be provided together
   but not necessarily at the same time.  Therefore specifying an MCC in
   an STS does not indicate that all the referenced individual captures
   may be present at a time.  The MaxCaptures attribute indicates the
   maximum number of captures that may be present.

   An MCC instance is limited to one media type, e.g. video, audio or
   text.

   Note: This gets around the problem where the framework says that all
   captures (even switched ones) within a CSE have to be allowed in an
   STS to be sent at the same time.

3.1.9.  Consumer Behaviour

   On receipt of an advertisement with an MCC the Consumer treats the
   MCC as per other individual captures with the following differences:

   -       The Consumer would understand that the MCC is a capture that
           includes the referenced individual captures and that these
           individual captures would be delivered as part of the MCC's
           capture encoding.

   -       The Consumer may utilise any of the attributes associated
           with the referenced individual captures and any capture scene
           attributes from where the individual capture was defined to
           choose the captures.

   -       The Consumer may or may not want to receive all the indicated
           captures.  Therefore it can choose to receive a sub-set of
           captures indicated by the MCC.


   For example if the Consumer receives:

           MCC1(VC1,VC2,VC3){attributes}

   A Consumer should choose all the captures within an MCC.  However, if
   the Consumer determines that it doesn't want VC3 it can return
   MCC1(VC1,VC2).  If it wants all the individual captures then it
   returns just a reference to the MCC (i.e. MCC1).
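
   The choice described above can be sketched as a small helper that
   produces the reference a Consumer returns.  The output strings use
   this document's informal notation rather than any defined message
   syntax, and "configure_reference" is a hypothetical name.

```python
# Illustrative only: format the capture reference a Consumer might return.

def configure_reference(mcc_name, advertised, wanted):
    """Return "MCC1" when all captures are wanted, else "MCC1(VC1,VC2)"."""
    unknown = [c for c in wanted if c not in advertised]
    if unknown:
        raise ValueError(f"not advertised in {mcc_name}: {unknown}")
    # Preserve the advertised order in the returned subset.
    chosen = [c for c in advertised if c in wanted]
    if not chosen:
        # Assumption of this sketch: an empty selection is not meaningful.
        raise ValueError("at least one capture must be chosen")
    if chosen == list(advertised):
        return mcc_name
    return f"{mcc_name}({','.join(chosen)})"

advertised = ["VC1", "VC2", "VC3"]
print(configure_reference("MCC1", advertised, {"VC1", "VC2"}))
print(configure_reference("MCC1", advertised, {"VC1", "VC2", "VC3"}))
```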

   Note: The ability to return a subset of captures is for consistency
   with the current framework.  It says that a Consumer should choose
   all the captures from a CSE but it allows it to select a subset (if
   the STS is provided).  The intent was to provide equivalent
   functionality for an MCC.

3.1.10.  MCU Behaviour

   The use of MCCs allows the MCU to easily construct outgoing
   Advertisements.  The following sections provide several examples.

3.1.10.1.  Single content captures and multiple content captures in the
           same Advertisement

   Four endpoints are involved in a CLUE session.  To formulate an
   Advertisement to endpoint 4, the MCU uses the following
   Advertisements received from endpoints 1 to 3.  Note: The IDs overlap
   in the incoming advertisements.  The MCU is responsible for making
   these unique in the outgoing advertisement.

    Endpoint 1 CaptureScene1[Description=AustralianConfRoom,
                             VC1(role=audience)]
    Endpoint 2 CaptureScene1[Description=ChinaConfRoom,
                             VC1(role=speaker),VC2(role=audience),
                             CSE1(VC1,VC2)]
    Endpoint 3 CaptureScene1[Description=USAConfRoom,
                             VC1(role=audience)]

                Figure 2: MCU case: Received advertisements

   Note: Endpoint 2 above indicates that it sends two streams.

   If the MCU wanted to provide a multiple content capture containing
   the audience of the 3 endpoints and the speaker it could construct
   the following advertisement:

    CaptureScene1[Description=AustralianConfRoom,
                  VC1(role=audience)]
    CaptureScene2[Description=ChinaConfRoom,
                  VC2(role=speaker),VC3(role=audience),
                  CSE1(VC2,VC3)]
    CaptureScene3[Description=USAConfRoom,
                  VC4(role=audience)]
    CaptureScene4[MCC1(VC1,VC2,VC3,VC4){encodinggroup1}]

        Figure 3: MCU case: MCC with multiple audience and speaker
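
   The renumbering that leads from the received advertisements in
   Figure 2 to the advertisement in Figure 3 could be sketched as
   follows.  Sequential renumbering in order of arrival is only one
   possible scheme, and the "renumber" function and its data layout are
   assumptions made for this illustration.

```python
# Illustrative only: renumber overlapping capture identities so that they are
# unique in the outgoing advertisement, then reference them all from one MCC.

def renumber(received):
    """Assign VC1, VC2, ... in order of arrival.

    received is a list of (endpoint, [capture ids]) pairs. Returns a dict
    mapping (endpoint, original id) to the new advertisement-unique id.
    """
    mapping = {}
    counter = 0
    for endpoint, captures in received:
        for original in captures:
            counter += 1
            mapping[(endpoint, original)] = f"VC{counter}"
    return mapping

received = [
    ("Endpoint 1", ["VC1"]),
    ("Endpoint 2", ["VC1", "VC2"]),
    ("Endpoint 3", ["VC1"]),
]
mapping = renumber(received)
# Reference every renumbered capture from a single MCC, as in Figure 3.
mcc1 = "MCC1(" + ",".join(mapping.values()) + ")"
print(mapping)
print(mcc1)
```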

   Alternatively if the MCU wanted to provide the speaker as one stream
   and the audiences as another it could assign an encoding group to VC2
   in Capture Scene 2 and provide a CSE in Capture Scene 4:

    CaptureScene1[Description=AustralianConfRoom,
                  VC1(role=audience)]
    CaptureScene2[Description=ChinaConfRoom,
                  VC2(role=speaker){encodinggroup2},
                  VC3(role=audience),
                  CSE1(VC2,VC3)]
    CaptureScene3[Description=USAConfRoom,
                  VC4(role=audience)]
    CaptureScene4[MCC1(VC1,VC3,VC4){encodinggroup1},
                  CSE2(MCC1,VC2)]

        Figure 4: MCU case: MCC with audience and separate speaker

   Therefore a Consumer could choose whether or not to have a separate
   "role=speaker" stream and could choose which endpoints to see.  If it
   wanted the second stream but not the Australian conference room it
   could indicate the following captures in the Configure message:

    MCC1(VC3,VC4),VC2

                   Figure 5: MCU case: Consumer Response

3.1.10.2.  Several multiple content captures in the same Advertisement

   Multiple MCCs can be used where multiple streams are used to carry
   media from multiple endpoints.  For example:

   A conference has three endpoints D, E and F.  Each endpoint has three
   video captures covering the left, middle and right regions of each
   conference room.  The MCU receives the following advertisements from
   D and E:

    Endpoint D CaptureScene1[Description=AustralianConfRoom,
                             VC1(left){encodinggroup1},
                             VC2(middle){encodinggroup2},
                             VC3(right){encodinggroup3},
                             CSE1(VC1,VC2,VC3)]
    Endpoint E CaptureScene1[Description=ChinaConfRoom,
                             VC1(left){encodinggroup1},
                             VC2(middle){encodinggroup2},
                             VC3(right){encodinggroup3},
                             CSE1(VC1,VC2,VC3)]

       Figure 6: MCU case: Multiple captures from multiple endpoints

   Note: The Advertisement uses the same identities.  There is no co-
   ordination between endpoints so it is likely there would be identity
   overlap between received advertisements.

   The MCU wants to offer Endpoint F three capture encodings.  Each
   capture encoding would contain a capture from either Endpoint D or
   Endpoint E depending on the policy.  The MCU would send the
   following:

    CaptureScene1[Description=AustralianConfRoom,
                  VC1(left),VC2(middle),VC3(right),
                  CSE1(VC1,VC2,VC3)]
    CaptureScene2[Description=ChinaConfRoom,
                  VC4(left),VC5(middle),VC6(right),
                  CSE2(VC4,VC5,VC6)]
    CaptureScene3[MCC1(VC1,VC4){encodinggroup1},
                  MCC2(VC2,VC5){encodinggroup2},
                  MCC3(VC3,VC6){encodinggroup3},
                  CSE3(MCC1,MCC2,MCC3)]

          Figure 7: MCU case: Multiple MCCs for multiple captures

   Note: The identities from Endpoint E have been renumbered so that
   they are unique in the outgoing advertisement.
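
   One way an MCU might derive the MCCs of Figure 7 is to group the
   renumbered captures by their spatial position.  This is a sketch
   under that assumption; the "positional_mccs" function, the position
   labels as dictionary keys and the one-to-one assignment of encoding
   groups are illustrative choices, not behaviour required by this
   document.

```python
# Illustrative only: build one MCC per spatial position from the renumbered
# captures of two endpoints, as in Figure 7.

def positional_mccs(scenes):
    """Group captures that share a position label into one MCC each.

    scenes maps a scene description to an ordered {position: capture id} dict.
    Returns a list of (mcc name, [capture ids], encoding group) tuples.
    """
    by_position = {}
    for captures in scenes.values():
        for position, capture in captures.items():
            by_position.setdefault(position, []).append(capture)
    return [
        (f"MCC{i}", members, f"encodinggroup{i}")
        for i, members in enumerate(by_position.values(), start=1)
    ]

scenes = {
    "AustralianConfRoom": {"left": "VC1", "middle": "VC2", "right": "VC3"},
    "ChinaConfRoom": {"left": "VC4", "middle": "VC5", "right": "VC6"},
}
for name, members, group in positional_mccs(scenes):
    # Print each MCC in the informal notation used by the figures.
    print(f"{name}({','.join(members)}){{{group}}}")
```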

3.2.  Multipoint Conferencing Framework Updates

   The CLUE protocol extends the endpoint (EP) description defined in
   the signalling protocol (SDP for SIP) by providing more information
   about the available media.  XCON also uses the information available
   from the signalling protocol, but it does not use SDP to distribute
   the participants' information or to control the multipoint
   conference.  Instead this is done using a data structure defined in
   XML using the CCMP protocol over HTTP (note that CCMP can also be
   used over the CLUE channel if required).  XCON provides a hierarchy
   that starts from conference information that includes users having
   endpoints that have media.


   The role is part of the user structure while the mixing mode is part
   of the conference level information specifying the mixing mode per
   each of the media available in the conference.

   CLUE on the other hand does not have such a structure.  It starts
   from what is probably, in XCON terms, an endpoint that has media
   structured by scenes.  There is no user or conference level
   information, though the "role" proposal tries to add the user
   information (note that user information is different from the role
   in the call or the conference).

   The XCON structure looks better when looking at a multipoint
   conference.  Yet it does not make sense to have such a data model for
   the point to point calls.  Therefore only going with this option
   means that capture attribute information will not be available for
   point to point calls.

3.3.  Existing Parameter Updates

   As discussed in section 2 the existing CLUE attributes surrounding
   switching and composition have a number of open issues.  This section
   proposes changes to the text describing the attributes to better
   describe their usage and interaction.  It is also assumed that by
   using these attributes there is no attempt to describe any component
   source capture information.

3.3.1.  Composed

   The current CLUE framework describes the "Composed" attribute as:

           A boolean field which indicates whether or not the Media
           Capture is a mix (audio) or composition (video) of streams.

           This attribute is useful for a media consumer to avoid
           nesting a composed video capture into another composed
           capture or rendering.  This attribute is not intended to
           describe the layout a media provider uses when composing
           video streams.

   It is proposed to update the description:


           A boolean field which indicates whether or not the Media
           Capture has been composed from a mix of audio sources or
           several video sources.  The sources may be local to the
           provider (i.e. video capture device) or remote to the
           provider (i.e. a media stream received by the provider from a
           remote endpoint).  This attribute is useful for a media
           consumer to avoid nesting a composed video capture into
           another composed capture or rendering.

           This attribute does not imply anything with regards to the
           attributes of the source audio or video except that the
           composed capture will be contained in a capture encoding from
           a single source.  This attribute is not intended to describe
           the layout a media provider uses when composing video
           streams.

           The "composed" attribute may be used in conjunction with a
           "switched" attribute when one or more of the dynamic sources
           is a composition.

3.3.2.  Switched

   The current CLUE framework describes the "Switched" attribute as:

           A boolean field which indicates whether or not the Media
           Capture represents the (dynamic) most appropriate subset of a
           'whole'.  What is 'most appropriate' is up to the provider
           and could be the active speaker, a lecturer or a VIP.

   It is proposed to update the description:

           A boolean field which indicates whether the Media Capture
           represents a dynamic representation of the capture scene that
           contains the capture.  It applies to both audio and video
           captures.

           A dynamic representation is one that provides alternate
           capture sub-areas within the overall area of capture
           associated with the capture over time in a single capture
           encoding from one source.  What capture area is contained in
           the capture encoding at a particular time is dependent on the
           provider policy.  For example: a provider may encode the
           active speaker or lecturer based on volume level.  It is not
           possible for consumers to associate attributes with a
           particular capture sub-area nor to indicate which capture
           sub-area they require.


3.3.3.  Scene-switch-policy

   The current CLUE framework describes the "Scene Switch Policy"
   attribute as:

           Scene-switch-policy: {site-switch, segment-switch}

           A media provider uses this scene-switch-policy attribute to
           indicate its support for different switching policies.  In
           the provider's Advertisement, this attribute can have
           multiple values, which means the provider supports each of
           the indicated policies.

           The consumer, when it requests media captures from this
           Capture Scene Entry, should also include this attribute but
           with only the single value (from among the values indicated
           by the provider) indicating the Consumer's choice for which
           policy it wants the provider to use.  The Consumer must
           choose the same value for all the Media Captures in the
           Capture Scene Entry.  If the provider does not support any of
           these policies, it should omit this attribute.

           The "site-switch" policy means all captures are switched at
           the same time to keep captures from the same endpoint site
           together.  Let's say the speaker is at site A and everyone
           else is at a "remote" site.

           When the room at site A shown, all the camera images from
           site A are forwarded to the remote sites.  Therefore at each
           receiving remote site, all the screens display camera images
           from site A. This can be used to preserve full size image
           display, and also provide full visual context of the
           displayed far end, site A. In site switching, there is a
           fixed relation between the cameras in each room and the
           displays in remote rooms.  The room or participants being
           shown is switched from time to time based on who is speaking
           or by manual control.

           The "segment-switch" policy means different captures can
           switch at different times, and can be coming from different
           endpoints.  Still using site A as where the speaker is, and
           "remote" to refer to all the other sites, in segment
           switching, rather than sending all the images from site A,
           only the image containing the speaker at site A is shown.
           The camera images of the current speaker and previous
           speakers (if any) are forwarded to the other sites in the
           conference.


           Therefore the screens in each site are usually displaying
           images from different remote sites - the current speaker at
           site A and the previous ones.  This strategy can be used to
           preserve full size image display, and also capture the non-
           verbal communication between the speakers.  In segment
           switching, the display depends on the activity in the remote
           rooms - generally, but not necessarily based on audio /
           speech detection.

   Firstly it is proposed to rename this attribute to "Capture Source
   Synchronisation" in order to remove any confusion with the "Switched"
   attribute and also to remove the association with a scene, as any
   information regarding source scenes is lost.  This is because the
   CSE represents the current scene.  No change in functionality is
   intended by the renaming.  It is proposed to describe it as follows:

           Capture Source Synchronisation: {source-synch,asynch}

           By setting this attribute against a CSE it indicates that
           each of the media captures specified within the CSE results
           in a capture encoding that contains media related to
           different remote sources.  For example if CSE1 contains
           VC1,VC2,VC3 then there will be three capture encodings sent
           from the provider each displaying captures from different
           remote sources.  It is the provider that determines what the
           source for the captures is.  For example when the provider is
           in an MCU it may determine that each separate CLUE endpoint
           is a remote source of media.  Likewise it is the provider
           that determines how many remote sources are involved.
           However it is assumed that each capture within the CSE will
           contain the same number and set of sources.

           "Source-synch" indicates that each capture encoding related
           to the captures within the CSE contains media related to one
           remote source at the same point in time.

           "Asynch" indicates that each capture encoding may contain
           media related to any remote source at any point in time.

           If a provider supports both synchronisation methods it should
           send separate CSEs containing separate captures, each CSE
           with a separate capture source synchronisation label.

           A provider when setting attributes against captures within a
           Capture Source Synchronisation marked CSE should consider
           that the media related to the remote sources may have its own
           separate characteristics.  For example: each source may have
           its own capture area, therefore this needs to be taken into
           account in the provider's advertisement.

           The "Switched" attribute may be used with a capture in a
           "Capture Source Synchronisation" marked CSE.  This indicates
           that one or more of the remote sources associated with the
           capture has dynamic media that may change within its own time
           frame, i.e. the media from a remote source may change without
           an impact on the other captures.

           The "Composed" attribute may be used with captures in the
           "Capture Source Synchronisation" marked CSE.  This indicates
           the capture encoding contains a composition of multiple
           sources from one remote endpoint at a particular point in
           time.
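
   The difference between the proposed "source-synch" and "asynch"
   values can be sketched as a check over a timeline of capture
   encodings.  The snapshot representation (capture name mapped to the
   remote source carried at that instant) and the "is_source_synch"
   function are assumptions made for this illustration.

```python
# Illustrative only: check whether a timeline of capture encodings is
# consistent with the proposed "source-synch" value.

def is_source_synch(timeline):
    """True if, at every instant, all capture encodings show the same source.

    timeline is a list of snapshots; each snapshot maps a capture name to the
    remote source whose media its capture encoding carries at that instant.
    """
    return all(len(set(snapshot.values())) <= 1 for snapshot in timeline)

synch_timeline = [
    {"VC1": "site A", "VC2": "site A", "VC3": "site A"},
    {"VC1": "site B", "VC2": "site B", "VC3": "site B"},
]
asynch_timeline = [
    {"VC1": "site A", "VC2": "site B", "VC3": "site C"},
]
print(is_source_synch(synch_timeline))
print(is_source_synch(asynch_timeline))
```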

   Furthermore it is assumed that if the current set of parameters is
   maintained that the indication of the mechanism for the trigger of
   switching sources (e.g. loudest source, round robin) is not possible
   because the Consumer only chooses captures and not sources.  If it's
   purely up to the provider then this information would be superfluous.
   It is proposed to capture this:

           The trigger (or policy) that decides when a source is present
           is up to the provider.  The ability to provide detailed
           information about sources is for further study.

3.3.4.  MCU behaviour

   When a CLUE endpoint is acting as an MCU it implies the need for an
   advertisement aggregation function.  That is, the endpoint receives
   CLUE advertisements from multiple endpoints and uses this
   information, its media processing capabilities and any policy
   information to form advertisements to the other endpoints.

   Contributor's note: TBD I think there needs to be a discussion here
   about the fact that source information is lost and how individual
   attributes are affected, i.e. it may be possible to simply aggregate
   language information but not so simple when there is different
   spatial information.  Also need to consider capture encodings.

4.  Acknowledgements

   This template was derived from an initial version written by Pekka
   Savola and contributed by him to the xml2rfc project.

5.  IANA Considerations


   It is not expected that the proposed changes present the need for any
   IANA registrations.

6.  Security Considerations

   It is not expected that the proposed changes present any additional
   security issues to the current framework.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

7.2.  Informative References

   [I-D.groves-clue-capture-attr]
              Groves, C., Yang, W., and R. Even, "CLUE media capture
              description", draft-groves-clue-capture-attr-01 (work in
              progress), February 2013.

   [I-D.ietf-clue-framework]
              Duckworth, M., Pepperell, A., and S. Wenger, "Framework
              for Telepresence Multi-Streams", draft-ietf-clue-
              framework-11 (work in progress), July 2013.

   [I-D.ietf-clue-telepresence-requirements]
              Romanow, A., Botzko, S., and M. Barnes, "Requirements for
              Telepresence Multi-Streams", draft-ietf-clue-telepresence-
              requirements-04 (work in progress), July 2013.

   [I-D.ietf-clue-telepresence-use-cases]
              Romanow, A., Botzko, S., Duckworth, M., and R. Even, "Use
              Cases for Telepresence Multi-streams", draft-ietf-clue-
              telepresence-use-cases-05 (work in progress), April 2013.

   [RFC2629]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
              June 1999.

   [RFC4575]  Rosenberg, J., Schulzrinne, H., and O. Levin, "A Session
              Initiation Protocol (SIP) Event Package for Conference
              State", RFC 4575, August 2006.

   [RFC6501]  Novo, O., Camarillo, G., Morgan, D., and J. Urpalainen,
              "Conference Information Data Model for Centralized
              Conferencing (XCON)", RFC 6501, March 2012.


   [RFC6503]  Barnes, M., Boulton, C., Romano, S., and H. Schulzrinne,
              "Centralized Conferencing Manipulation Protocol", RFC
              6503, March 2012.

Authors' Addresses

   Christian Groves (editor)
   Huawei
   Melbourne
   Australia

   Email: Christian.Groves@nteczone.com


   Weiwei Yang
   Huawei
   P.R.China

   Email: tommy@huawei.com


   Roni Even
   Huawei
   Tel Aviv
   Israel

   Email: roni.even@mail01.huawei.com

