Developing a Scale-Out Data Center Network with the Cloud
Living in an information society, we produce an exponentially growing volume of information year by year. Because data is primarily stored, processed, and analyzed in data centers, this growth places new demands and challenges on the data center network, and we must now explore ways to meet them.
As we all know, the core concepts of cloud computing are pooled hardware resources, fully distributed software, and fully automated operations. The basic need of this new distributed computing and storage architecture is to access data across multiple compute nodes. As a result, east-west traffic within the data center far exceeds north-south traffic between the data center and users; in some scenarios, such as search, east-west traffic can be 40 times the north-south traffic. Since non-blocking networks are essential for implementing cloud computing, the current converged Clos data center networking architecture is confronted with new challenges.
A typical cloud data center consists of 50,000 to 100,000 servers. These servers can be either located within one large data center base or distributed across multiple server rooms within a radius of 200 kilometers. Three to four thousand servers can constitute a Point-of-Delivery (PoD) cluster in which a strictly non-blocking network is implemented. Between clusters, the non-blocking network is implemented to the maximum possible extent so as to enable computing and sharing on a larger scale. In this scenario, the demand for switching capacity is extremely high. With four 10G ports on each server, the aggregate server-facing capacity reaches 2 to 4 Pbit/s, so the network capacity demand of the cloud data center remains at the petabit level (1 Pbit/s = 1,000 Tbit/s) even with a 1:4 convergence ratio between clusters.
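The figures above can be checked with a quick back-of-envelope calculation (the server counts, port speeds, and 1:4 convergence ratio are taken from the text; the helper name is ours):

```python
# Aggregate server-facing bandwidth for the data center sizes in the text:
# 50,000-100,000 servers, four 10G ports per server, 1:4 convergence
# between clusters.

def total_server_capacity_tbps(servers: int, ports: int = 4, port_gbps: int = 10) -> float:
    """Aggregate server-facing bandwidth in Tbit/s."""
    return servers * ports * port_gbps / 1000

for servers in (50_000, 100_000):
    full = total_server_capacity_tbps(servers)
    converged = full / 4  # 1:4 convergence ratio between clusters
    print(f"{servers} servers: {full:.0f} Tbit/s total, "
          f"{converged:.0f} Tbit/s after 1:4 convergence")
# 50,000 servers give 2,000 Tbit/s (2 Pbit/s); 100,000 give 4 Pbit/s,
# still 1 Pbit/s even after 1:4 convergence.
```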
For a conventional networking architecture based on device convergence, the maximum capacity of a core switch is about 50 Tbit/s. If the networking architecture remains unchanged, meeting future demand would require core-switch capacity of over 200 Tbit/s; however, it is very difficult to keep increasing SerDes speed and per-device capacity with technologies based on electrical interconnection. Moreover, the impact of a Single Point of Failure (SPoF) grows with device size, and the resulting costs make this scale-up path unsustainable.
Power consumption is also a huge challenge for data centers. Many facilities in a data center are big power consumers that have long been dubbed 'power killers.' More importantly, a uniform power density must be maintained, because uneven power density can greatly affect the power system, cooling system, data center space, and data center security.
Because of its huge capacity, a typical core switch draws nearly 30 kilowatts. The power supply capacity of a single rack in an old server room, however, is merely four to five kilowatts, while that in a new server room ranges from eight to 12 kilowatts. If the power consumption of a single device is too large, adequate space must be kept all around it to ensure its power supply. Moreover, since the working conditions of the cooling system must be strictly maintained, it becomes difficult to increase the spatial density of the entire server room, and both cooling and power supply face great challenges. As the network grows, power consumption and cooling become increasingly pressing issues that call for more efficient solutions.
In a conventional three-layer network, traffic between groups of Top-of-Rack (ToR) switches must ultimately be forwarded by the core switch. In other words, the optical fibers must converge on the core server room. The result is a 'fiber wall' problem, in which the dense fibers resemble a wall, complicating Operations and Maintenance (O&M) of the data center.
The number of fibers can be reduced using large-capacity ports, such as 40 Gigabit Ethernet (GE) or 100 GE ports. However, given the cost of optical modules, multi-mode parallel optical modules of 4 x 10G or 10 x 10G are usually adopted, so a 40 GE port requires four pairs of optical fibers while a 100 GE port requires 10 pairs. This means that the number of fibers is not actually reduced, and the O&M challenges remain. Meanwhile, the ToR design of a server room generally supports about 2,000 bundles of fibers. The number of fibers that can be terminated in the core room therefore limits the non-blocking switching capacity of the entire network to a maximum of roughly 200 Tbit/s (2,000 x 100 GE).
Therefore, with the development of cloud computing, the cloud data center keeps growing in size, with ever-increasing east-west traffic. The data center network faces new demands, especially the demand for petabit-level non-blocking capacity; however, conventional networking architectures are confronted with a series of difficult problems in capacity, power supply, power consumption, scalability, and O&M, all of which can only be resolved with a new architecture.
Besides the scale-up approach that makes larger single devices, is there an alternative solution using the approach of scale-out that can resolve the capacity problem? Our answer is yes, and we have proposed the MESH2 networking architecture.
MESH2 networking architecture, also known as two-level Mesh networking architecture, is a distributed network topology that adopts the scale-out approach of cloud computing. The idea of cloud computing is to build super-large storage and computing capacity using 'small particles of hardware + large-scale distributed software.' With this approach, the cost of the system can be lowered by replacing the reliability of a single device with that of a distributed system. The MESH2 networking architecture results from this idea; its overall logical structure is shown in Figure 1.
Figure 1: MESH2 networking architecture with its overall logical structure
The MESH2 networking architecture has a number of key features. The first feature is super-flat, as there is only one layer of ToR switches in the entire network, directly deployed in each server cabinet. By changing the multi-level convergent structure of the data center network into a one-layer physical network structure, the entire network is connected by small switches of the same specification and configuration. Each switch is connected by both intra-group MESH and inter-group MESH, eliminating the need for the large-capacity convergent and core switches in a conventional architecture.
The ports of each ToR switch are divided into three groups. The first group is the local ports that connect to the servers. The second group is the intra-group connection ports that connect to other ToR switches within the same PoD, forming the intra-group one-level MESH connection. The third group is the inter-group connection ports that connect ToR switches across different PoDs in the same inter-group plane, forming the inter-group two-level MESH connection. A standard two-level MESH network consists of N x N ToR nodes: there are N PoDs, each of which has N ToR nodes.
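The neighbor relation of the two-level MESH can be sketched in a few lines. Here a ToR is addressed as (pod, index); the tuple encoding and the small N are illustrative assumptions, not part of the architecture specification:

```python
# Two-level MESH neighbor sketch: ToR (p, i) sits in PoD p at position i.
# Intra-group MESH: links to every other ToR in the same PoD.
# Inter-group MESH: links to the ToR at the same position in every other
# PoD (one inter-group plane), as described in the text.

def mesh2_neighbors(p: int, i: int, n: int):
    intra = [(p, j) for j in range(n) if j != i]  # one-level MESH
    inter = [(q, i) for q in range(n) if q != p]  # two-level MESH
    return intra, inter

intra, inter = mesh2_neighbors(0, 1, n=4)
print(intra)  # [(0, 0), (0, 2), (0, 3)]
print(inter)  # [(1, 1), (2, 1), (3, 1)]
```

Each ToR thus carries 2(N-1) mesh links in addition to its server-facing ports, which is where the three port groups come from.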
The second feature is that, as the optical network enters the data center, Wavelength-Division Multiplexing (WDM) and the passive optical device Cyclic Arrayed Waveguide Grating (CAWG) are used to implement the MESH interconnection. Both the intra- and inter-group MESH connections require direct fiber connections to the associated nodes. If the network is very large, say 48 x 48 nodes, the number of fibers can be enormous, as hundreds of thousands of fiber pairs may be required, and each fiber runs toward a different node. To resolve the MESH fiber-connection problem, WDM interfaces and CAWG are introduced. WDM interfaces can be either built into the switches or deployed independently. After multiplexing, the N transmit wavelengths of each ToR switch are carried over a single fiber to one input port of the CAWG. This optical device not only converts the logical MESH connections among switches into a physical star connection but also resolves the problem of massive fiber counts in a large-scale data center network.
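A common textbook model of an N x N cyclic AWG is a fixed wavelength-routing function: light entering input port i on wavelength index k exits output port (i + k) mod N. The sketch below shows why a star-wired CAWG realizes a logical full mesh; the formula is the standard model, not the spec of any particular device:

```python
# Cyclic AWG routing model (illustrative): input port i, wavelength index
# k -> output port (i + k) mod N. Every input reaches every output on
# exactly one wavelength, so N ToRs wired in a star through one CAWG get
# a logical full mesh.

def cawg_output(input_port: int, wavelength: int, n: int) -> int:
    return (input_port + wavelength) % n

n = 4
for i in range(n):
    # Outputs reached by input i across all n wavelengths: a permutation,
    # so no two wavelengths from one input collide at an output.
    print(i, [cawg_output(i, k, n) for k in range(n)])
```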
The third feature is that the distributed forwarding of the MESH2 network enables non-blocking switching and smart route scheduling, thereby improving network throughput. The MESH network is physically a one-layer network, but its forwarding model is still a three-layer Clos network. It is distributed: the ToR switches perform all the functions of the switches at the three layers (ToR, convergence, and core). In this way, the capabilities of the convergence and core switches are distributed to each ToR switch, eliminating the system's central point and bottleneck. Moreover, unlike the conventional Clos architecture, the hop count of traffic forwarded in the data center can be reduced by a smart Unequal-Cost Multi-Path (UCMP) route-scheduling algorithm, thanks to the presence of direct paths in the MESH network. This significantly reduces delay and improves the forwarding efficiency of the MESH network.
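The hop-count advantage can be illustrated with a minimal sketch. ToRs are addressed as (pod, index); the addressing and the choice of relay node are illustrative assumptions, while the one-hop/two-hop structure follows from the intra- and inter-group MESH links described in the text:

```python
# Minimal hop count between ToRs in a two-level MESH. Same PoD or same
# inter-group plane (same index) -> a direct one-hop link exists;
# otherwise traffic relays once, e.g. via (src_pod, dst_index).
# Contrast with a three-layer Clos, where inter-rack traffic always
# crosses aggregation and core switches.

def min_hops(src, dst):
    if src == dst:
        return 0
    sp, si = src
    dp, di = dst
    if sp == dp or si == di:
        return 1  # direct intra-group or inter-group link
    return 2      # one relay hop, e.g. via (sp, di)

print(min_hops((0, 1), (0, 3)))  # 1: same PoD
print(min_hops((0, 1), (2, 1)))  # 1: same inter-group plane
print(min_hops((0, 1), (2, 3)))  # 2: relay via (0, 3) or (2, 1)
```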
The new architecture essentially reallocates the switching capacity of the aggregation and core switches in the traditional three-layer Clos network to the ToRs in the MESH2 network, breaking through the bottlenecks of traditional networks. Its value is as follows:
First, a fully distributed flat architecture breaks the limit on capacity, allowing the construction of a network with super-large capacity. A two-level MESH2 network can be used to set up a non-blocking data center network with Pbit/s-level switching capacity; 1 Pbit/s is enough to support 50,000 servers, each with dual 10 Gbit/s ports. The capacity demand on each ToR is 5 x 48 x 10 Gbit/s = 2.4 Tbit/s, which can be provided by two hundred and forty 10 Gbit/s ports, or by forty-eight 10 Gbit/s ports (connecting to the servers) plus ninety-six 25 Gbit/s ports (connecting between the ToRs). It is easy for ToRs to deliver such capacity. In contrast, if a conventional Clos network is adopted, the core switches need a switching capacity of more than 200 Tbit/s, which is extremely challenging.
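The per-ToR port budget above checks out arithmetically (all figures are from the text):

```python
# Per-ToR capacity demand stated in the text: 5 x 48 x 10 Gbit/s.
demand_gbps = 5 * 48 * 10
print(demand_gbps)     # 2400 Gbit/s = 2.4 Tbit/s

# One stated port mix: 48 x 10G server-facing ports + 96 x 25G mesh ports.
supplied_gbps = 48 * 10 + 96 * 25
print(supplied_gbps)   # 2880 Gbit/s = 2.88 Tbit/s, covering the demand

# Whole-network sanity check: 50,000 dual-10G servers need 1 Pbit/s.
print(50_000 * 2 * 10 / 1_000_000)  # 1.0 Pbit/s
```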
Second, the introduction of a decentralized architecture and optical technology eliminates the engineering limits on power consumption, cooling, wiring, and maintenance. The new architecture removes the need for large devices such as core and convergence switches, leaving only ToR switches, which are similar in form to rack servers. As a result, the system is rid of the big power consumers, so power supply, cooling, and security are no longer problems for data centers. At the same time, the introduction of WDM and CAWG reduces the number of fiber connections in the entire network by several dozen times and distributes the connections evenly among server room modules, greatly simplifying cabling and O&M and significantly reducing O&M costs.
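The "several dozen times" fiber reduction is consistent with a simple count, assuming each ToR's N mesh links would otherwise need their own fiber pair but can share one WDM fiber pair into the CAWG (N = 48 is our illustrative choice, matching the 48 x 48 example earlier in the text):

```python
# Rough check of the fiber-reduction factor from WDM + CAWG.
# Assumption: one fiber pair per mesh link without WDM, versus one
# N-wavelength fiber pair per ToR with WDM.

n = 48                    # mesh links per ToR / wavelengths per fiber (illustrative)
fibers_without_wdm = n    # one pair per mesh link
fibers_with_wdm = 1       # all links multiplexed onto one pair
print(fibers_without_wdm / fibers_with_wdm)  # 48.0, i.e. a few dozen times fewer
```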
Third, replacing the reliability of an individual large device with that of a distributed system eliminates the risk of SPoFs. In a conventional data center, as switching capacity increases, the nodes at the convergence and core layers become more critical. Especially in a converged topology, the failure of a core switch can have a tremendous impact on traffic switching across the entire network, so staff must be particularly careful when maintaining convergence and core nodes. In the new architecture, however, there is only one layer of physical network nodes (ToRs). Because there are so many ToRs, the failure of a single node affects only the traffic of the servers in the corresponding cabinet, a few thousandths of the entire network. This rules out the possibility of a single node failure causing a large-scale network breakdown and greatly enhances the reliability of the network.
Nevertheless, the scale-out network has two shortcomings. One is that the CAWG provides only fixed-direction wavelength cross-connection, constraining flexible networking and seamless scaling. The other is that the bandwidths of the interconnection ports between ToRs are identical, allowing only a wholesale upgrade rather than a flexible, incremental one. Although these problems can be avoided or mitigated through engineering methods, deployment modes, or practical applications, they cannot be completely resolved without further innovation, such as flexible optical cross-connection technologies and optical ports with variable bandwidths. Progress in these optical technologies will become a focus of the future development of data center networks, leading to data center networks built on the basis of optical technologies and optical networks.
With the development of cloud computing and services, the explosive growth of information and changes in the data traffic model are bringing unprecedented demands and challenges to the data center network. In this regard, it is necessary to adopt new thinking, designs, and technical architectures to reshape the future of the data center network.
A scale-out networking architecture can be built in a MESH2-based data center network by adopting the concepts of cloud computing and using optical network technologies. It resolves problems that are difficult to overcome in the conventional Clos networking architecture, builds super-large petabit-level capacity, and achieves higher network efficiency through a fully distributed one-layer networking architecture and smart route-scheduling algorithms. As it also resolves the engineering problems of power consumption, cooling, cabling, and maintenance, and reduces the risk of SPoFs, the scale-out networking architecture is poised to become the focus of future development of cloud data center networks.