Internet Research Task Force P. Ashwood-Smith Internet Draft Huawei Intended status: Informational M. Soliman Expires: February 03, 2014 Carleton University T. Wan Huawei July 03, 2013 SDN State Reduction draft-ashwood-sdnrg-state-reduction-00.txt Abstract This document makes the argument that to support the centralized control of a substantial number of forwarding devices (as Software Defined Networking (SDN) proposes) that the scale, speed, cost and general quality of such a solution will be improved by reducing the state needed to be distributed into the network of devices by the controller(s). To this end we re-visit forms of Source Routing (SR), in particular Strict Link Source Routing (SLSR) and suggest that light weight SLSR could allow substantial reduction in controller burden while potentially reducing the costs/complexity on forwarding devices. We discuss some simulation results that demonstrate these advantages and how the advantages grow substantially as the network diameter grows. We also look at various implementation possibilities including existing IPV4, V6, MPLS, new/modified MPLS vs. something brand new that could possibly be implemented with new SDN technology like Protocol Oblivious Forwarding-POF. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt Ashwood-Smith, et al. Expires February 03 2014 [Page 1] Internet-Draft SDN state reduction July 2013 The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html This Internet-Draft will expire on December 3, 2012. Copyright Notice Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Table of Contents 1. Terminology 3 2. Introduction 3 3. Logical Example 5 4. Expressing a Path 6 5. Computing a Path 7 6. Downloading Forwarding State 8 7. Logically Forwarding SLSR 10 7.1. Ingress Logical Unicast Forwarding 10 7.2. Tandem Logical Unicast Forwarding 11 7.3. Egress Logical Unicast Forwarding 12 8. Logical Multicast Forwarding SLSR Packets 13 9. Failure Recovery 14 10. Comparison of Logical Model to Existing Source Routing 15 10.1. MPLS as a SLSR 15 10.2. IPV4/6 Options as SLSR 18 10.3. Protocol Oblivious Forwarding as SLSR mechanism 19 11. Security Considerations 20 12. Conclusions and Future work 21 13. IANA Considerations 21 14. References 21 14.1. Informative References 21 15. Authors' Addresses 23 16. Contributors 23 17. Acknowledgements 23 Ashwood-Smith, et al. Expires February 18, 2012 [Page 2] Internet-Draft SDN state reduction July 2013 1. Terminology ATM Asynchronous Transfer Mode (a cell based network) BGP Boarder Gateway Protocol CSPF Constrained Shortest Path First DOS Denial of Service (attack) ECMP Equal Cost Multi Path flow Logically related packets following the same path IS-IS Intermediate System to Intermediate System LACP Link Aggregation Control Protocol LAG Link Aggregation Loose A source route that enumerates only some of all hops MPLS Multi Protocol Label Switching MPLS-TE MPLS Traffic Engineering. NPU Network Processor Unit (programmable forwarding) OpenFlow Open data path programming protocol OSPF Open Shortest Path First PCE Path Computation Element (used with MPLS-TE) PNNI Private Network to Network Interface (link state ATM) POF Protocol Oblivious Forwarding - more generic OpenFlow) RSVP-TE Resource Reservation Protocol - Traffic Engineering SDN Software Defined Networking (as per [OPENFLOW]) SDN-domain Set of forwarding devices controlled as a unit. SR Source Routing - enumerating hops to traverse SLSR Strict Link Source Routing - enumerating links [SLSR] SPF Shortest Path First - (Dijkstra etc.) SSRR Strict Source Record Route - IPV4 header option 9 Strict Source route that enumerates every hop(unlike Loose) TE Traffic Engineering VFI Virtual Forwarding Instance (layer 2) VPLS Virtual Private Lan Service VRF Virtual Routing and Forwarding (layer 3) 2. Introduction The centralized control of a network is not a new idea. Indeed centralized control was widely deployed in voice networks and some early data networks but of course gave way to distributed control for IP. Centralized computation is however still widely used for traffic engineered networks, like MPLS-TE and GMPLS where a Path Computation Engine (PCE) makes use of a global view of a sub- network and its resource usage for the planning of new paths and their resources. The data path state distribution with these models is however not initiated centrally and relies on protocols like RSVP-TE to install the hop by hop state. In fact this form Ashwood-Smith, et al. Expires February 18, 2012 [Page 3] Internet-Draft SDN state reduction July 2013 of distributed control with centralized traffic engineering computations is the norm today. Notwithstanding the massive deployment of this kind of hybrid distributed/central control, we have in the last several years seen a huge resurgence of interest in fully centralized control of at least a set of forwarding devices [ONF] [OPENFLOW] with Software Defined Networking (SDN). This SDN proposes a central controller (or controllers) using IP protocols such as TCP to talk to a set of arbitrarily interconnected (and cheap/dumb) forwarding devices (SDN-domain) and which is responsible for the configuration of the majority of forwarding state on those devices. This state may be produced either as a result of pro- active configuration, or based on re-active responses to packet flow indications from the forwarding devices themselves. Since this central controller has knowledge of the entire sub- network of devices, and potentially of the traffic demands into/out of the sub-network, it can perform a variety of path optimization computations similar to CSPF/MPLS-TE/PCE/GMPLS, or even more elaborate forms of optimization (trading flows against each other rather than individually optimizing them, exploiting quiet areas of the network to offload busy areas etc), the output of which is forwarding state for all meta flows in the entire sub network of devices and a sub network which more optimally meets the desired local constraints. One such deployment reports a substantial increase in network utilization from 30% to 70%-90% [SDNGOOG]. A central controller can also more effectively solve problems such as bin-packing and path blocking [SDNGOOG], which occur when flows are optimized individually with greedy type algorithms rather than considering other orderings of the flows. The finer grained ability to place traffic can also permit much more detailed placement of traffic after a failure, including traffic not directly affected by the failure but the replacement of which is critical to achieving fair/efficient use of the remaining bandwidth subsequent to the failure. Since the output of the controller is much closer to a TE (Traffic Engineered) type solution from a PCE (Path Control Element) than an SPF (Shortest Path First) solution the controller cannot simply install destination based forwarding entries. A controller either needs to install tunnels that follow the explicit routes it wishes and then map traffic to those tunnels at the edges, or it must install n-tuple < etc.> state and configure these n-tuple matches on every hop along the desired path. Packets which fail to match an n-tuple are either discarded or sent to the controller. Ashwood-Smith, et al. Expires February 18, 2012 [Page 4] Internet-Draft SDN state reduction July 2013 In the normal case of SDN (as given in [OPENFLOW]) the controller is required to send configuration information to all devices along the path from ingress of the SDN-domain of this controller to the egress of that SDN-domain, alternatively a tunnel setup protocol like RSVP-TE is required to be triggered to distribute the per hop state between the ingress and egress. This draft proposes that since the controller knows the exact end- to-end path (down to the level of the links it wishes the packets to traverse) and that the diameter of an SDN-domain is likely to be a reasonable number of hops, that the controller should instead simply insert into a header the exact links it wishes the packet to traverse and thereby not have to deal either with per hop n- tuple state installation (very expensive) or with MPLS tunnel installation via RSVP-TE(complex). Such a mechanism also eliminates any concerns about Equal Cost Multi Path (ECMP) and/or Link Aggregation (LAG) as the controller can place traffic on exact links. Operations, Administration and Management (OAM) is also greatly simplified since data packets will flow on invariant paths that are known by both ends of the flow and can be the same as any OAM packets that probe the flow. This OAM "fate sharing" property is widely valued by network operators and considerable effort has already been expended to permit similar fate sharing between OAM and data paths with other carrier scale networking protocols such as 802.1ag and MPLS-TP. Of course if a controller does not wish to enforce symmetry and congruence it need not. 3. Logical Example The following is an example of an idealized strict link based source routing (SLSR) forwarding. We talk about possible implementations including MPLS methods after looking at the logical ideal. Consider the simple 7 node network shown in Figure 1 below. Here the nodes are named {A, B, C, D, E, F, G} and where each node has locally numbered interfaces named {1, 2, 3, 4, 5, 6}. For example node A has interfaces named 1, 2, 3, 4 and where interfaces 4 and 2 both go to node B. Node B has local interfaces 1, 2, 3, 4 and 5 but the two interfaces going back to node A are locally named 1 and 3. Clearly node interface names are likely (but not necessarily) different at both ends of a link. Ashwood-Smith, et al. Expires February 18, 2012 [Page 5] Internet-Draft SDN state reduction July 2013 +----+ +----+ | | 4 ----- 3 | | | A | 2 ----- 1 | B | +----+ +----+ 1 3 4 2 5 / \ / \ \ 3 4 3 2 3 +----+ +----+ +----+ | C | 2----1 | D | 2------1 | E | +----+ +----+ +----+ 1 6 5 4 \ / \ / 3 2 1 2 +----+ +----+ | F | 1-------3 | G | +----+ +----+ Figure 1 - simple 7 node network with local link identifiers. 4. Expressing a Path A path through a network labeled as per Figure 1 can clearly be expressed as a sequence of link names (an SLSR). For example, between nodes C and E the following are all valid paths. C.3 -> A.2 -> B.2 C.2 -> D.2 C.1 -> F.2 -> D.2 C.3 -> A.4 -> B.5 Now since the links lead unambiguously to a known node, the paths can be more compactly expressed without the node names as follows: {3,2,2} {2,2} {1,2,2} {3,4,5} As long as we know the origin of the path (in this case node C), the list of link names unambiguously identifies a path and an egress point. In addition it identifies unambiguously which link from among parallel links between neighbors should be traversed. Of course it is possible to give a name to the set of links that all attach to the same neighbor and thereby leave the exact link in that path deliberately ambiguous and thereby subject to a local forwarding decision as to exactly which link in the set to follow. Ashwood-Smith, et al. Expires February 18, 2012 [Page 6] Internet-Draft SDN state reduction July 2013 Each path of course also has exactly one perfectly symmetric reverse. Note that the symmetric reverse path is not simply the same list of link names in reverse order. A reverse path has to be specified from the opposite end of the path so in this example the origin has to be E. The forward and corresponding reverse paths are therefore. C->E E->C {3,2,2} {2,1,1} {2,2} {1,1} {1,2,2} {1,6,3} {3,4,5} {3,3,1} Various very efficient encodings of these kinds of paths in source routed headers are possible. Even a simple encoding using 8 bits per hop can encode every path in a large 8 hop network with fewer bits than an IP in IP tunnel. 5. Computing a Path It should be obvious that the output of any graph based computation which has as its goal various optimization criteria for flows can express its results as a series of such paths where each path is expressed as a Strict Link based Source Route (SLSR). This includes multiple different metric Dijkstra computations (i.e. shortest path, multi topology shortest path), CSPF type and of course more elaborate linear-programming or other convex type optimizations. The expression of the path as an SLSR imposes no constraints on the type of computation being performed except possibly in path length. However in any real network under the control of a single controller it is not likely that path length would be a real issue unless perhaps unreasonably large link names are encoded. Convex and linear-programming type solutions to traffic placement are of particular interest because to do a good job they must exploit a considerable number of paths through a network (many more than shortest). These algorithms take the matrix of ingress/egress flows in a network together with all the usable paths between all sources and destinations and will assign percentages of the ingress/egress flows to the available paths in ratios that can optimize a number of simultaneous constraints. For example they can optimize the network's total throughput, the average link utilization, the fairness of the bandwidth available to each flow and can even optimize different linear and non linear combinations of those goals. What is interesting about all of Ashwood-Smith, et al. Expires February 18, 2012 [Page 7] Internet-Draft SDN state reduction July 2013 these kinds of optimizations however is that they need access to all of the reasonable paths across the network since it is by making trade-offs between busy and less busy parts of the network that they achieve their goals. Unfortunately the number of paths (shortest or otherwise) in a network grows exponentially with network size and with it the state distribution problem (or burden) the controller must deal with. It is important to make the distinction between a flow and a path. This draft concerns itself not with the immense numbers of micro flows but with the very large numbers of paths required to be supported onto which those micro flows are then aggregated. A set of micro flows can be treated as a single flow, and a single flow has a unique path through the network. 6. Downloading Forwarding State A controller likely takes as input the fields that identify the flow and its various statistical attributes. The controller then likely computes an end to end path for this flow either based on the single flow's attributes (in a re-active manner), or on more global knowledge of multiple flow attributes (in a pro-active manner). Flows may be meta (many micro flows) or individual micro flows depending on the implementation and its scale. The output of course is just a list of links that must be traversed for this flow together with matching rules to identify the flow at the ingress. The controller then delivers the flow matching rules and the Strict Link Based Source Route to the *single* node where the flow is to be encapsulated (i.e. where the flow first enters the SDN- domain). The fact of only having to communicate with the *single* node at the head end of the path means that the controller experiences a reduction in its work load directly proportional to the number of hops in the path (as compared to traditional SDN which must program every hop along the path). Intuitively this translates to the following I/O burden reduction at the controller based on the number of links that must be traversed per average path. Ashwood-Smith, et al. Expires February 18, 2012 [Page 8] Internet-Draft SDN state reduction July 2013 +----------------+-------------------------+ | #Avg Path Len | % I/O Burden Reduction | +----------------+-------------------------+ | 1 | 0% | | 2 | 50% | | 3 | 66% | | 4 | 75% | | 5 | 80% | | .. | .. | | N | (100-100/n) % | +----------------+-------------------------+ Since forwarding state download is typically a substantial part of a "normal" routers' re-convergence time, it seems reasonable that this will become a similar bottleneck for a central controller and quite possibly be further aggravated by the increased delays and larger amount of state that the central device must deal with. As a result this reduction in state and I/O burden should have a marked impact on convergence times assuming there are appropriate forwarding mechanisms that can implement the Strict Link Source Route (SLSR). Note also that the position of the controller relative to the ingress/egress nodes is now more important than its position relative to all nodes. Therefore studies as to controller optimum placement as defined by the Controller Placement Problem [PLACEMENT] would require different optimization goals. An additional 50% reduction can also be obtained should the implementation of the forwarding be able to reverse the path on the fly. Such a reversal permits the implicit communication of the desired reverse path to the receiver thereby eliminating communication with the controller to obtain a reverse path. Of course if symmetry is not desired this further optimization is not possible. For example, consider a network with 1000 nodes. It therefore has O(1,000,000) meta flows and assuming 10 possible paths for each flow has O(10,000,000) ingress forwarding entries that must be centrally configured (its burden). If each path on average takes 5 hops then the burden on the controller grows 5 fold to O(50,000,000) entries but with SLSR the burden remains at O(10,000,000). If path reversal is supported and symmetric routing is desired then the burden with SLSR drops further to O(5,000,000). Ashwood-Smith, et al. Expires February 18, 2012 [Page 9] Internet-Draft SDN state reduction July 2013 Simulations done by one of us in [SRSDN] provide additional weight to the above arguments. In particular we simulated for various network sizes and diameters the differences between hop by hop SDN and SLSR and saw up to 3 x performance improvements in convergence times with SLSR. There were also a number of other benefits such as a markedly reduced standard deviation in convergence times for the different nodes (81% decreases) and a significantly reduced sensitivity to the placement of the controller (80% reduction in standard deviation). The performance improvements can perhaps better be understood by an analogy comparing the work required to fill in the area of an object (traditional SDN) vs. simply drawing the circumference of that object (SLSR). Since the circumference varies as a function of the diameter but the area varies as a function of the diameter squared the relative burden reduction with just dealing with the circumference (the edge of the network) becomes apparent. In fact in this simulation study we varied the radius and then plotted the relative convergence times of SLSR and traditional hop by hop forwarded SDN and saw a ratio of convergence times as a function of radius that indicated a trend towards 1/R as expected. Simply stated, the bigger the centrally controlled network the better source routing performs compared to hop by hop. 7. Logically Forwarding SLSR There are three distinct phases to be performed to logically forward unicast SLSR. These are similar to any tunnel technology and consist of 1) Ingress Encapsulation, 2) Tandem Forwarding, and 3) Egress Decapsulation and Forwarding. We address the generic concepts first before looking at possible existing or new encapsulations and their applicability. Multicast SLSR is also possible (but with limitations to keep the header sizes from growing too large) and is briefly discussed after unicast. 7.1. Ingress Logical Unicast Forwarding Here the flow information, likely IP header(s) + UDP/TCP header(s) is looked up and a sequence of link identifiers and a current hop must be placed on the packet, the packet must then be forwarded to the first of those links. This operation is identical to almost every tunnel protocol except that IP ECMP and/or LAG hash would potentially be unnecessary because the first link name would often resolve to a physical link not a LAG bundle. For example: Ashwood-Smith, et al. Expires February 18, 2012 [Page 10] Internet-Draft SDN state reduction July 2013 +----------+----------+---------+--------+--------+ |SrcIP | DstIP | SrcPrt | DstPrt | SLSR | +----------+----------+---------+--------+--------+ |192.0.2.4 |192.0.2.9 | 1000 | 98 |{3,2,5} | |192.0.2.4 |192.0.2.9 | 1001 | 99 |{3,4,5} | +----------+----------+---------+--------+--------+ Of course nothing precludes the use of LAG and the link identifier therefore identifying an entire LAG bundle rather than an element of that LAG. In fact it is possible to simultaneously support both concepts so that some traffic can be forwarded to the entire LAG while other traffic could be placed on a particular LAG bundle member at the discretion of the central computation. 7.2. Tandem Logical Unicast Forwarding At tandem devices the operation would start by incrementing the current hop in the packet header (shown with a ^ symbol) and then forward to the link identified in the new current hop. If we support reversal, we change the previous link name to the local link name for that link. For example, referring to Figure 1 (and disregarding non relevant headers/options) after matching the first flow tuple at the ingress node C the packet is encapsulated with the SLSR header {3,2,5} and then leaves node C on interface 3 toward A. Then: Packet arrives at node A on local interface 1 where it looks like this: +--------------------------------------~~~------------+ | 3 | 2 | 5 | 192.0.2.9 | 192.0.2.4 |.. | +-^------------------------------------~~~------------+ Current hop is incremented while previous hop is changed to local interface name (3 changes to a 1). +--------------------------------------~~~------------+ | 1 | 2 | 5 | 192.0.2.9 | 192.0.2.4 |.. | +-----^--------------------------------~~~------------+ Packet is forwarded to interface for current hop i.e. 2. Ashwood-Smith, et al. Expires February 18, 2012 [Page 11] Internet-Draft SDN state reduction July 2013 Packet arrives at node B on local interface 1. +--------------------------------------~~~------------+ | 1 | 2 | 5 | 192.0.2.9 | 192.0.2.4 |.. | +-----^--------------------------------~~~------------+ Current hop is incremented while previous hop is changed to local interface name 1 (2 changes to a 1). +--------------------------------------~~~------------+ | 1 | 1 | 5 | 192.0.2.9 | 192.0.2.4 |.. | +---------^----------------------------~~~------------+ Packet is forwarded to interface for current hop i.e. 5. Packet arrives at node E on local interface 3. +--------------------------------------~~~------------+ | 1 | 1 | 5 | 192.0.2.9 | 192.0.2.4 |.. | +---------^----------------------------~~~------------+ Current hop is incremented while previous hop is changed to local interface name (5 changes to 3). +--------------------------------------~~~------------+ | 1 | 1 | 3 | 192.0.2.9 | 192.0.2.4 |.. | +-----------^--------------------------~~~------------+ We are at the end of the path, so egress processing begins. One additional step not described above is a reverse path check. Prior to substituting the reverse link identifier into the SLSR header, the link identifier from the neighbor can be validated and the packet discarded if the neighbor link identifier in the packet is incorrect for the port the packet arrived on. This would reduce the chances of mis-delivery of the packet should a link identifier change or a link destination change while a packet is in flight. 7.3. Egress Logical Unicast Forwarding Here the operation consists of normal IP/Ethernet etc. forwarding based on the IP destination / MAC or other ACL rules. Basically the SLSR header is stripped and the packet is submitted to the Virtual Forwarding Instance, or Virtual Forwarding Function (VFI or VRF) for further processing. Optionally the link identifier from the neighbor can be validated against what is expected and the packet discarded in the case of a Ashwood-Smith, et al. Expires February 18, 2012 [Page 12] Internet-Draft SDN state reduction July 2013 mismatch. This reduces the chance of mis-delivery as in the tandem case. In addition, if path reversal is supported, the reverse path is compared against the current reverse path for this reverse flow and if it has changed the local forwarding state for the reverse flow would be updated. This would allow the head end to always dictate the forward and reverse path to be used for all packets in the flow without involving the controller on the egress side (and of course not needing to communicate with any tandem device). Processing the reverse flow/path in this manner means that a flow is already present for the reverse direction without having to re- actively or pro-actively consult the controller. This results in a further 50% reduction in controller load. In the case of asymmetry this optimization is of course not possible. 8. Logical Multicast Forwarding SLSR Packets Multicasting packets usually involve one of two approaches. The first approach simply re-uses unicast and sends multiple copies to a pre-determined list of receivers. There is little to discuss with this approach as we can replicate SLSR based unicast packets just as easily as any other tunneling mechanism. Clearly such a serial unicast approach has nearly identical bandwidth overhead as other protocols like VPLS which also use this serial unicast mechanism. It is therefore interesting to look at more efficient methods that involve the second multicast mechanism, which uses replication points in the network. These replication points are chosen so that copies are more efficiently made thereby eliminating multiple copies of the packet traversing any given link. Various logical tree structures are usually involved e.g. STP, SPB, TRILL, PIM, MOSPF etc. These tree based mechanisms could in theory be implemented without requiring tandem state as an SLSR by introducing a branch point concept into the list of indexes. In this manner a complete tree as a pre-order traversal could be encoded along with the packet payload. It is not difficult to define a variety of different encodings that would accomplish this. The obvious objection to such a scheme is the sheer size of header required especially where a large network with many multicast receivers is concerned. It is therefore unlikely to be practical to encode any large tree of receivers and the SLSRs between them in any single header. Ashwood-Smith, et al. Expires February 18, 2012 [Page 13] Internet-Draft SDN state reduction July 2013 This leads to a hybrid approach which would encode a subset of the tree, say a single replication point and 5 or so recipients. This little tree or 'tree-let', would efficiently get a single packet to 5 (or some suitably small number) of recipients with an SLSR to the replication point and then SLSRs to each of the receivers. Such an encoding is much more reasonable than trying to encode all receivers and all replication points of a single tree in one packet. However, since this one packet would not reach all receivers, the head end would have to generate as many copies of the data packets as necessary to cover all recipients. As a result this approach would be a compromise between a full tree, and full head end replication. Variations in the size of the 'tree-let' header would allow for space v.s. bandwidth efficiency trade-offs while meeting the goal of remaining stateless in the core. In the literature there are also non-exact methods to multicast without state such as with Bloom filters [BLOOM]. In this approach the links to be traversed are logically mapped into a field which is carried in the packet (for example if the links are given unique 128 bit sparse addresses then a 128 bit union of all the links to be traversed on the tree is encoded in the header). These mechanisms guarantee that all receivers will get a copy of the packet (because they check each link for inclusion in the Bloom Filter at each hop) however they do so at the expense of sending false positive copies to unintended receivers which must then filter the unwanted packets egress. Depending on the size of the Bloom Filter and the link identifiers various statistical trade- offs in false positives vs. packet header size can be made. Other exact methods to encode and methods to compute SLSR multicast etc. are FFS. 9. Failure Recovery A variety of failure recovery techniques can be employed with SLSR. The most obvious is to just re-compute all affected paths on indication of a link failure. This won't be discussed further. More interesting are the so called fast restoration mechanisms. These can broadly be broken down into head end and tandem restoration. Head end mechanisms that provide 1+1 protection have been around for a long time with MPLS-TP, PBT and SONET/DWDM. Similar mechanisms can be used with any tunnel type and of course SLSR is no exception. Probes can be sent down one source route, reflected back along the reverse source route and in this manner the forward and reverse paths can be simultaneously probed for failure. In the event of a failure a diverse alternate source route can rapidly be Ashwood-Smith, et al. Expires February 18, 2012 [Page 14] Internet-Draft SDN state reduction July 2013 added to the packet and the flow restored. The advantage of course with SLSR is that no state is required for either the primary or the backup path. As a result there is little added cost to having even greater redundancy than 1+1 with SLSR. The mechanisms to accomplish this are fairly obvious. Having the reverse path available at the egress means that fate sharing the forward and reverse probes is easy. In addition to 1+1 protection it is possible to do hop by hop fast reroute type detour protection. This can be done by substitution of a failed link identifier with a set of link identifiers that merge with the path downstream of the failure. An example is given further below for MPLS label stacks, however many other possibilities exist when a history of the packet's path is available to the detour mechanism. The history would permit the detour mechanism to spread the failing packets over different detours and thereby reduce the concentration of additional load imposed by the failure on the same set of links. 10. Comparison of Logical Model to Existing Source Routing There are a number of existing protocols that support forms of source routing (or can be used to do something close to source routing). IPV4 and V6 had strict and loose node-by-node source routing options (now deprecated) and we'll discuss them briefly. Likewise MPLS behavior can be used to do strict link source routing where a label stack represents a list of link names, this has recently been called segment routing in [SEGMENT]. 10.1. MPLS as a SLSR MPLS is of course not a source routed forwarding protocol, at least not by design. Rather, packets follow an arbitrary path by substitution of a previous hop label with a next hop label and each hop must be pre-configured with the to relationship. This is clearly not source routing because tandem configuration is required per path and per hop. However MPLS has a stacking mechanism that can be exploited to create a consumable list of link names to be traversed as they are popped. The MPLS label stack can therefore be used to implement a flavor of SLSR. This is accomplished by pre-assigning a locally unique MPLS label to each outgoing link of a node. For example in figure 1, node D's link 3 would be assigned MPLS label 3 (but more likely a label value which is 1:1 related to link 3, however we stick with label=link for simplicity of explanation). Ashwood-Smith, et al. Expires February 18, 2012 [Page 15] Internet-Draft SDN state reduction July 2013 The tunnel encapsulation operation would therefore be to push a set of labels onto the frame where each label indicates which link to follow at that given exact strict hop. For example: +----------+----------+--------+--------+-----------------------+ |SrcIP | DstIP | SrcPrt | DstPrt | MPLS SLSR | +----------+----------+--------+--------+-----------------------+ |192.0.2.4 |192.0.2.9 | 1000 | 98 |push(3,push(2,push(5)))| |192.0.2.4 |192.0.2.9 | 1001 | 99 |push(3,push(4,push(5)))| +----------+----------+--------+--------+-----------------------+ The tunnel tandem operation would then be to pop the label on the incoming frame (after optionally validating its reverse link identifier) and forward to the interface specified by the just popped label value. Every tandem node would be pre-configured approximately as per below. Note that as with any source routing mechanism, this tandem pre-configuration is independent of the actual paths that traverse the node. A table like the one below, with a few hundred interfaces and hence a few hundred labels, could support the transit of an infinite number of TE (or SPF) paths. For clarity we use label N = interface/N but in reality it would be label N = F(interface/N) since a 1:1 mapping 'F' is almost certainly required. +-------------------+-------+--------------------------------+ |Incoming Interface | Label | Actions | +-------------------+-------+--------------------------------+ | any | 1 | Pop, forward to interface/1 | | any | 2 | Pop, forward to interface/2 | | : | : | : | | any | N | Pop, forward to interface/N | +-------------------+-------+--------------------------------+ If reverse validation is required the tables would be a bit different because they must match the label to the incoming interface and then pop it and then forward based on the next label. Reverse validation therefore requires two label lookups per forwarding operation. Finally the tunnel egress operation would be normal forwarding to a VFI or VRF. MPLS in this manner could be made to do SLSR of unicast frames but cannot be made to reverse the route because the route is consumed in transit. This method also uses many more bits than are really necessary. Each label consumes 32 bits which is rather more than required to express the number of links/adjacencies on a typical switch or router. For example, if the average packet size if 512 bytes, a 5 hop MPLS source route imposes a 4% overhead (20/512) on Ashwood-Smith, et al. Expires February 18, 2012 [Page 16] Internet-Draft SDN state reduction July 2013 some links with the largest overhead on the first few links. For larger packets this is likely not an issue, for smaller packets it is possibly a concern. A more realistic number actually required per hop is probably 8 or 12 bits (256-4K links) and if more bits are required two hops can be consumed by any node with such a large nodal degree. The MPLS label also has an 8 bit TTL which is of course redundant in any source routing mechanism. This begs the question of if a smaller MPLS label would not be more suitable? There are other issues with the use of MPLS, in particular current hardware can usually not stack very many labels at a time (3 on some popular ASICs). This would limit the network diameter to 4 hops. Of course NPUs or new ASICs could be extended to allow further ingress stacking. It does not seem possible to do SLSR multicast with MPLS except of course via head end replication. The hop(ingress stack size) limit, lack of reverse, consumable route and lack of efficient multicast still do not invalidate use of MPLS source routing for many networks and its use would have a noticeable positive impact on the scale/speed of a central controller in such environments. MPLS fast reroute mechanisms can also be implemented locally in a similar fashion thus further improving controller scale by alleviating the need for 50ms responses network wide from the controller and giving the controller more lee-way to recover after the fast reroutes have detoured traffic around the failed nodes and/or links. Consider possible local actions when the link A.2 between nodes A and B in Figure 1 fails. Since there is still a link A.4 available, the node A can locally change the action associated with label 2 to instead send to interface 4 when interface 2 fails. If an entire adjacency fails, such as would happen when both A.2 and A.4 fail, then a link detour can be locally performed by reprogramming the actions for labels 2 and 4 to now push labels 3,3 and send to interface 3. This will cause a detour via D back to B. More elaborate kinds of detour are possible by processing two link names ahead instead of one, including nodal detours. These can be done locally without end to end path knowledge and hence scale independently to the number of paths. Eventually the controller will detect the failure and reconstruct the SLSRs at the head end and the use of the detour will stop without having to withdraw any state in the core. Ashwood-Smith, et al. Expires February 18, 2012 [Page 17] Internet-Draft SDN state reduction July 2013 If MPLS is of use in the context of SLSR then it would be worth considering a number of future extensions to MPLS. Some things to consider could be a smaller MPLS label option, say 16 bits with no TTL and the possibility of not popping but rotating the label to the bottom of the stack to preserve the path history for OAM and reversibility reasons. While these sorts of things are of course not possible with existing ASICs they are easy to do on existing NPU's and new work on Protocol Oblivious Forwarding [POF] allows near arbitrary bit pattern/action matches to be programmed by an SDN controller permitting a more optimal data path encoding of SLSR than can be obtained by simply reusing MPLS. 10.2. IPV4/6 Options as SLSR IP header option 9 [RFC791] defined (but now deprecated) the Strict Source and Record Route (SSRR) option for IPV4 packets. This option has(had) a 'length' field, a 'pointer' field and an array of 'route data' fields. The element in the array of 'route data' indexed by the 'pointer' field contains the IP address of the immediate next hop towards which the packet must be forwarded, the 'pointer' field is incremented, and the previous hop is filled in with the IP address of the current device prior to actually forwarding the packet. Up to 9 hops could be specified in this manner. IPV6 also had a similar option "RH0" which is also now deprecated [SRBAD]. IPV4 and V6 Strict Source and Record Route methods could be used to implement Strict Link Source Routing. This would be accomplished by assigning a 32 bit number to the link and then using the 32 bit number in place of the IPV4 or V6 address in the route list. In both IPV4 and IPV6 the source routing options were found to be harmful to the Internet at large for a number of reasons. These reasons are described in [SRBAD] but briefly there were two broad classes of problem encountered. 1) Harm to intermediate links and 2) harm to end hosts. For example: - Since it was possible to list a waypoint more than once in the route data, it was possible to loop traffic around multiple times (9 times in the case of IPV4 and 90 times in the case of IPV6). This looping allowed saturation of high speed links by hosts that had an order (or two) smaller bandwidth access to the Internet. A congestion style DOS was therefore possible from low speed access links against higher speed core links. - Various schemes such as bypassing of firewalls etc. are of course easy to do when a host can specify waypoints that detour around a firewall. Ashwood-Smith, et al. Expires February 18, 2012 [Page 18] Internet-Draft SDN state reduction July 2013 - Spoofing using the reverse route. Since the reverse source route is installed against the IP SA by a host that receives it, it is possible to use a bogus IP SA in combination with a reverse source route that detours the packets to the imposter host. 10.3. Protocol Oblivious Forwarding as SLSR mechanism The OpenFlow [OPENFLOW] protocol defines methods for an external controller to cause the manipulation of known packet headers and fields within those headers by a forwarding element. As such it is currently limited to matching on known fields like MPLS, IP, Ethernet etc. and taking actions on those fields. While flexible there are still many things at the data path level that OpenFlow cannot do including generic source routing such as SLSR. The Protocol Oblivious Forwarding [POF] protocol is a proposed extension to OpenFlow which permits arbitrary bit pattern matching/actions and is therefore much more flexible. The goal of POF is to allow a controller to define a new data path in addition to a new control plane and to then program the data path on the forwarding elements to its specifications. POF is therefore not limited to existing IP, MPLS, Ethernet fields. It would therefore be possible with POF to implement a highly flexible SDN tunnel data plane that closely resembles the idealized SLSR data path. Strictly by way of example POF could implement a flexible SLSR header along the following lines: +----------+---------+---------+------+------+-~~------+ | NextHop | Hop | Hop | Hop | Hop | Hop | | Index:4 | Count:4 | Size:4 | 0 | 1 | N | +----------+---------+---------+------+------+-~~------+ With only five bytes, this header could represent 3 hops with 256 links per hop, 4 hops with 64 links per hop, or 6 hops with 16 links per hop, etc. With additional bytes of course more/longer combinations are possible with very reasonable overhead. This is considerably more compact than the other described options and without sacrificing reversibility or giving up the OAM benefits of knowing the exact path the packet has taken. POF however could also implement other variations of SLSR based on MPLS. For example POF could implement a smaller MPLS label, say a 16 bit label without a TTL. POF could theoretically also implement a rotating label list instead of a popping label stack. POF appears to be ideally suited for SLSR developments beyond what can currently be done with MPLS. Ashwood-Smith, et al. Expires February 18, 2012 [Page 19] Internet-Draft SDN state reduction July 2013 11. Security Considerations Source Routing security concerns are also discussed in the previous section related to IPV4 and IPV6 now deprecated nodal source routing. This draft is proposing link based source routing and that it be used as a tunneling mechanism only. This means that only devices that are at the edge of an SDN sub-network would be allowed to insert strict link source routes. Note that an MPLS label can only be inserted by a Label Edge Router (LER) and processed by Label Switch Routers (LSR) and not by end hosts. Therefore SLSR should be no more or less secure than MPLS. In fact the absence of signaling protocols like RSVP-TE removes a point of attack. The fact that this mechanism is intended for use by a central controller further mitigates the possible attacks as encrypted communications are used to the edge devices which are the only device able to insert the strict link source routes. There is however the possibility that an attacker could attach to a core device and inject strict link source routed packets. Methods to prevent this however are not hard, in particular the adjacency would have to be reported to the controller and the controller would have to enable packet forwarding. Unless the controller recognized both ends of the link as being part of its controlled domain it should not enable the strict link source routing capability on that interface thus preventing the threat. Other interfaces, such as those facing a network of hosts or devices not in the domain of the controller would, as with current BCP's, drop any source routed frame in any format (new or old). As previously mentioned there are ways to spread the link names into a 32 bit space such that the exact mappings are only known by the controller and the tandem node in question. This would prevent any easy form of guessing being used to construct an SLSR. One such example of this kind of secure source routing is given in [SANE]. Source Routing also is unique in that the packets themselves give details about slices/cuts through the topology, therefore with sufficient interception of packets from diverse sources and destinations in the network, an attacker could build up a detailed view of the network topology, this would be a concern for a carrier SDN network in particular where details of topology are considered a valuable asset, although exploiting knowledge of the topology would be more challenging given the secure protocols that exist between a controller and the forwarding entities. Ashwood-Smith, et al. Expires February 18, 2012 [Page 20] Internet-Draft SDN state reduction July 2013 In the SDN context there appears to be little need for a loose source route. Loose source routing adds additional security concerns because it does not require knowledge of the entire path to construct an attack. If loose source routing is included the security concerns should be addressed. 12. Conclusions and Future work SDN where a central controller creates either pro-actively or re- actively the state for a sub-network of forwarding devices will have performance limitations that are related to network diameter/size, network recovery requirements and the amount of state they need to distribute. Strict Link Source Routing mechanisms can alleviate these problems allowing greater scale and faster recovery. MPLS can be used to implement this on a small scale with some of the benefits. IPV4 and IPV6 source routing options can be used to implement this on a larger scale with more of the benefits but at much larger packet overhead but are however perceived as risky and have been deprecated from IP. These risks however can be mitigated in this specific use. No existing mechanism however is optimum, and therefore there is room for a new mechanism that addresses these requirements and includes multicast methods and more efficient encoding of link names than is currently possible. One possible solution is to look at a smaller MPLS label for this purpose and to look at ways to retain the popped labels for the purposes of end to end path reversal and OAM. New work in SDN, in particular Protocol Oblivious Forwarding may make these kinds of things possible in a generic manner. 13. IANA Considerations This memo includes no request to IANA. 14. References 14.1. Informative References [BLOOM] Active Bloom Filters for Multicast Addressing, Z. Heszberger et. al. Budapest University of Technology and Economics. [OPENFLOW] www.openflow.org [ONF] www.opennetworking.org [POF] Protocol Oblivious Forwarding: http://www.poforwarding.org/ Ashwood-Smith, et al. Expires February 18, 2012 [Page 21] Internet-Draft SDN state reduction July 2013 [PLACEMENT] The Controller Placement Problem, Nick McKeon, Brandon Heller, Rob Sherwood. HotSDN'12, August 13, 2012, Helsinki, Finland. 2012 ACM 978-1- 4503-1477-0/12/08, http://conferences.sigcomm.org/sigcomm/2012/paper /hotsdn/p7.pdf [RFC791] Internet Protocol, Information Sciences Institute, RFC 791, September 1981. [SANE] SANE: A Protection Architecture for Enterprise Networks, Martin Casado, Nick McKeown, http://yuba.stanford.edu/~casado/sane.pdf, Stanford and ICSI 2005. [SDNGOOG] SDN at Google - Opportunities for WAN optimization, E. Crabbe, V. Valancius, 8/1/2012. Presentation at IETF84 SDN BOF. [SEGMENT] Segment Routing with IS-IS, S.Previdi et. Al. http://tools.ietf.org/html/draft-previdi- filsfils-isis-segment-routing-00 [SLSR] Software Defined Networking and Centralized Controller State Distribution Reduction, www.ieee802.org/1/files/public/ docs2012/new- ashwood-sdn-optimizations-0712-v01.pdf [SRBAD] Deprecation of Source Routing Options in IPV4 http://tools.ietf.org/html/draft-reitzel-ipv4- source-routing-is-evil-00 [SRSDN] Source Routed Forwarding with SDN, M. Soliman http://conferences.sigcomm.org/co- next/2012/eproceedings/student/p43.pdf Ashwood-Smith, et al. Expires February 18, 2012 [Page 22] Internet-Draft SDN state reduction July 2013 15. Authors' Addresses Peter Ashwood-Smith Huawei Canada Inc. 303 Terry Fox Drive, Suite 400, Kanata, Ontario K2K 3J1 Email: Peter.AshwoodSmith@huawei.com Mourad Soliman Carleton University, 1125 Colonel By Drive Ottawa, Ontario K1S 5B6 Canada Email: MouradSoliman@cmail.carleton.ca Tao Wan Huawei Canada Inc. 303 Terry Fox Drive, Suite 400, Kanata, Ontario K2K 3J1 Email: Tao.Wan@huawei.com 16. Contributors We invite more contributors. 17. Acknowledgements We gratefully appreciate the feedback of Nigel Bragg, Sue Hares, Peter Willis, Biswajit Nandy and Linda Dunbar. Ashwood-Smith, et al. Expires February 18, 2012 [Page 23]