Simplifying O&M in the Cloud Era
Software-Defined Networking (SDN) has become an obvious choice for enterprise CTOs because of its open interface protocols. Given the magnitude of the transformation, customers are naturally concerned about whether their existing Operations & Maintenance (O&M) systems can be used to supervise the new SDN technology.
SDN O&M has been a critical target for Huawei research and development since the company first embraced the technology. Huawei examined the O&M lifecycle based on new SDN features and built a closed-loop negative feedback system called Fabric Insight.
Why Is SDN O&M Needed?
Compared with traditional networks, SDN-enabled networks have the following three features:
- Dynamic O&M: Logical networks are built or deleted on the fly based on changes to application traffic. In legacy O&M processes, 50 percent of a customer’s workload can be spent supporting out-of-date firewall policies, which leads to network slackness and fragmentation.
- Real-Time Response: Traditional networks rely on manual intervention based on the slow, decades-old Simple Network Management Protocol (SNMP). This low-speed mechanism, with a message lifetime of five minutes, has become a point of criticism.
- Large Scale: Scale involves the number and complexity of devices to be managed and the number of failures to be resolved. In recent years, the number of managed devices has increased 50-fold as networks moved from physical Network Elements (NEs) to logical NEs (vSwitches/vRouters), and, according to LinkedIn, the number of faults increased 18-fold from 2010 to 2015.
What Is SDN O&M?
To establish a dynamic, real-time, and scaled SDN architecture, Huawei has proposed that the entire O&M system be updated based on the following criteria:
Visible and Accurate
‘Visibility’ is crucial to efficient management and includes the following concepts:
- Visible Objects: Physical and logical targets are monitored, including NE-Level nodes and interfaces, Network-Level links, logical routes, and application throughput statistics.
- Real-Time Observation: Millisecond-level phenomena are displayed, traffic bursts and low-frequency (<10^-4) packet losses are tracked, and mice and elephant flows are identified.
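Identifying mice and elephant flows can be illustrated with a minimal sketch. It assumes per-packet observations are already available as (flow, size) pairs for one observation window. The 1 MB cutoff and the flow names are hypothetical values chosen for illustration, not figures from Fabric Insight.

```python
from collections import defaultdict

# Hypothetical cutoff: flows carrying at least 1 MB in the window are elephants.
ELEPHANT_BYTES = 1_000_000

def classify_flows(packets, threshold=ELEPHANT_BYTES):
    """Sum bytes per flow and split the flows into elephants and mice.

    packets: iterable of (flow_id, size_in_bytes) seen in one window.
    """
    totals = defaultdict(int)
    for flow_id, size in packets:
        totals[flow_id] += size
    elephants = {flow for flow, total in totals.items() if total >= threshold}
    mice = set(totals) - elephants
    return elephants, mice

# Illustrative window: one bulk transfer and two short request flows.
window = [("backup", 1500)] * 1000 + [("dns", 80), ("http", 600), ("http", 900)]
elephants, mice = classify_flows(window)
print("elephants:", sorted(elephants), "mice:", sorted(mice))
```

In practice the window length and threshold would be tuned to the link speed; the point is only that the classification needs per-flow byte counts at a fine time scale.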
‘Accuracy’ implies making precise observations based on the analysis of massive quantities of data, including:
- Billing: Data sampling ratios must be highly scalable, ranging from 8K:1 to 2K:1, and, occasionally, 1:1.
- Troubleshooting: Based on Big Data and real-time analyses, incidental packet losses and traffic black holes can be quickly located and resolved.
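The effect of the sampling ratio on accuracy can be sketched with a little arithmetic. The example below reads "8K:1" and "2K:1" as 8000:1 and 2000:1, which is an assumption. It also uses a commonly quoted rule of thumb for packet sampling, under which the relative error at 95 percent confidence is roughly 1.96/sqrt(c) for c collected samples. The traffic volume is made up for illustration.

```python
import math

def estimate_total(sampled_count, ratio):
    """Scale a sampled packet count back up; ratio=2000 means 1 in 2000."""
    return sampled_count * ratio

def relative_error_95(sampled_count, ratio):
    """Approximate 95% relative error of the scaled-up estimate."""
    if ratio == 1:
        return 0.0            # every packet is counted, so nothing is estimated
    if sampled_count == 0:
        return float("inf")   # no samples, no usable estimate
    return 1.96 / math.sqrt(sampled_count)

true_packets = 16_000_000     # illustrative traffic volume
for ratio in (8000, 2000, 1):
    samples = true_packets // ratio
    err = relative_error_95(samples, ratio)
    print(f"{ratio}:1 -> {samples} samples, "
          f"estimate {estimate_total(samples, ratio)}, ~{err:.1%} error")
```

Under these assumptions, coarse ratios are adequate for trend views, while use cases that need exact counts, such as billing, push toward 1:1 capture.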
Automated Repair and Optimization
Past O&M architectures have been unidirectional, issuing commands over a downlink channel and receiving feedback over a second, separate uplink. With limited communications between the administrators and the physical plant, this old-style process was incapable of meeting today’s expectation for automated rectification of network failures and automated network optimization. Modern O&M platforms are closed-loop systems that include:
- Postponed Repair: Detected failures are isolated to avoid disrupting active services.
- Diagnostic Repair: Based on the results of Big Data analytics, the automated O&M function performs repairs or provides repair options.
- Network Optimization: Abnormal conditions observed by the closed-loop system, such as unbalanced traffic or potential congestion, automatically trigger targeted adjustments in response.
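The closed-loop idea behind network optimization can be sketched as a simple negative-feedback loop: observe the imbalance, apply a correction that opposes it, and observe again. The link names, utilization values, gain, and tolerance below are hypothetical, and the sketch does not represent Fabric Insight's actual control logic.

```python
def rebalance(loads, gain=0.5, tolerance=0.05, max_rounds=20):
    """Shift traffic from the busiest link to the idlest link until the
    spread between them is within `tolerance`.

    loads: mapping of link name -> utilization (fraction of capacity).
    Returns the adjusted loads and the number of corrections applied.
    """
    loads = dict(loads)
    for round_no in range(max_rounds):
        busiest = max(loads, key=loads.get)
        idlest = min(loads, key=loads.get)
        error = loads[busiest] - loads[idlest]   # observed imbalance
        if error <= tolerance:
            return loads, round_no
        shift = gain * error / 2                 # correction opposes the error
        loads[busiest] -= shift
        loads[idlest] += shift
    return loads, max_rounds

balanced, rounds = rebalance({"uplink1": 0.9, "uplink2": 0.3, "uplink3": 0.6})
print(rounds, {link: round(load, 3) for link, load in balanced.items()})
```

A gain below 1 trades convergence speed for stability: each round removes only part of the measured imbalance, so a noisy measurement cannot cause a large overcorrection.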
How Is SDN O&M Achieved?
Huawei’s research led to Fabric Insight, a closed-loop, new-generation O&M solution for SDN architectures. This system consists of four modules (monitoring, detection, measurement, and diagnostics) that perform the following functions:
Traffic monitoring solutions must improve their capacity to display large amounts of data in real time in two ways. First, the data collection protocols must be changed to achieve greater efficiency. For medium-scale data capture, SNMP needs to be replaced with gRPC, an open source HTTP/2 Remote Procedure Call framework introduced by Google in 2015. The best results for large-scale reporting of data plane status will be achieved using User Datagram Protocol (UDP)-based channels. Second, high-frequency collection is best assured by upgrading the nodes with dedicated components designed for millisecond-level event capture on data center switches.
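One reason UDP-based channels suit large-scale reporting is that status records can be packed into compact binary payloads and pushed without per-request overhead. The sketch below shows such an encoding with Python's standard struct module. The record layout (interface index, timestamp, queue depth, drop count) and the 1400-byte payload budget are hypothetical and are not the format used by Fabric Insight or any particular switch.

```python
import struct

# Hypothetical record layout in network byte order:
#   interface index (uint32), timestamp in microseconds (uint64),
#   queue depth in bytes (uint32), dropped packets (uint32)
RECORD = struct.Struct("!IQII")
MAX_PAYLOAD = 1400  # assumed budget that keeps a datagram under a typical MTU

def encode_batch(records):
    """Pack as many records as fit into one datagram payload."""
    limit = MAX_PAYLOAD // RECORD.size
    return b"".join(RECORD.pack(*rec) for rec in records[:limit])

def decode_batch(payload):
    """Unpack a payload produced by encode_batch back into tuples."""
    return [RECORD.unpack_from(payload, offset)
            for offset in range(0, len(payload), RECORD.size)]

samples = [(7, 1_000_000 + i * 1000, 4096 * i, i % 2) for i in range(5)]
payload = encode_batch(samples)
print(RECORD.size, "bytes per record,", len(payload), "bytes in this datagram")
```

The payload could then be handed to `socket.sendto` on a UDP socket. Because UDP gives no delivery guarantee, a collector built this way has to tolerate occasional gaps in the record stream.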
Observing the route quality of End-to-End (E2E) services requires sending real-time detection packets to ‘scan’ the network. Unlike earlier random scanning mechanisms, Huawei’s Fabric Insight solution supports ‘directed scanning’ to sweep specific routes over each network topology to deliver higher accuracy and network-wide coverage. Administrators no longer chase problems; instead, they receive proactive analytics that present a clear, up-to-the-minute picture of a network’s status.
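The difference between random and directed scanning can be sketched on a small spine-leaf fabric. The topology and node names are hypothetical, and the sketch only computes which paths to probe. It does not send packets or reflect how Fabric Insight selects routes internally.

```python
from itertools import permutations

def directed_probes(leaves, spines):
    """Plan one probe per (source leaf, spine, destination leaf) path."""
    return [(src, spine, dst)
            for src, dst in permutations(leaves, 2)
            for spine in spines]

def links_covered(probes):
    """Collect the directed links traversed by a set of probe paths."""
    covered = set()
    for src, spine, dst in probes:
        covered.add((src, spine))   # uplink from the source leaf
        covered.add((spine, dst))   # downlink to the destination leaf
    return covered

leaves = ["leaf1", "leaf2", "leaf3"]
spines = ["spine1", "spine2"]
probes = directed_probes(leaves, spines)
all_links = ({(leaf, spine) for leaf in leaves for spine in spines}
             | {(spine, leaf) for leaf in leaves for spine in spines})
print(len(probes), "probes cover", len(links_covered(probes)),
      "of", len(all_links), "directed links")
```

Because each probe is pinned to an explicit path, full link coverage is guaranteed by construction. A randomly chosen set of probes offers no such guarantee and may leave some links unobserved.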
In certain circumstances, network quality seems normal while users’ experience with their applications is poor. The detection mechanism alone cannot resolve the issue. The solution lies in measuring live service flows to detect packet loss or delay. At what points in the system are the packet losses occurring? If a long delay exists, what is the cause?
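Both questions can be answered hop by hop once each node on a flow's path reports what it saw. The sketch below assumes per-hop packet counters and arrival timestamps for a single live flow are already available. The node names and values are invented for illustration.

```python
def locate_loss(hop_counts):
    """hop_counts: ordered (node, packets_seen) pairs along the flow's path.
    Returns (upstream, downstream, lost) for each segment that lost packets."""
    segments = []
    for (up, up_cnt), (down, down_cnt) in zip(hop_counts, hop_counts[1:]):
        if down_cnt < up_cnt:
            segments.append((up, down, up_cnt - down_cnt))
    return segments

def slowest_segment(hop_times_us):
    """hop_times_us: ordered (node, arrival time in microseconds) pairs.
    Returns the (upstream, downstream, delay) segment with the largest delay."""
    deltas = [(up, down, t_down - t_up)
              for (up, t_up), (down, t_down) in zip(hop_times_us, hop_times_us[1:])]
    return max(deltas, key=lambda segment: segment[2])

counts = [("leaf1", 100_000), ("spine2", 100_000), ("leaf3", 99_988), ("server", 99_988)]
times = [("leaf1", 0), ("spine2", 12), ("leaf3", 950), ("server", 965)]
loss = locate_loss(counts)
slow = slowest_segment(times)
print("loss:", loss, "slowest:", slow)
```

In this made-up trace, both the missing packets and most of the delay fall on the segment between spine2 and leaf3, which narrows the search to one link and its two endpoints.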
After the monitoring, detection, and measurement modules have performed their duties, the diagnostics module seeks to determine the root cause of each problem. The diagnostic tools include loop and packet loss analytics, and each is designed to resolve a specific issue. Further, Huawei has opened the O&M Application Programming Interfaces (APIs) so customers are able to develop their own collections of diagnostic tools.
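As an illustration of what a customer-built loop diagnostic might look like, the sketch below walks the next-hop entries for one destination and reports any forwarding loop. The dictionary-based table is a stand-in for data that would be collected from the network. It does not represent the Fabric Insight API, and the node names are hypothetical.

```python
def find_loop(next_hop, start):
    """Follow next-hop entries for one destination, beginning at `start`.

    next_hop: mapping of node -> next node toward the destination; a node
    with no entry is treated as the point where the packet is delivered.
    Returns the nodes that form a loop, or None if the walk terminates.
    """
    position = {}   # node -> index in `path` where it was first visited
    path = []
    node = start
    while node is not None:
        if node in position:
            return path[position[node]:]
        position[node] = len(path)
        path.append(node)
        node = next_hop.get(node)
    return None

healthy = {"leaf1": "spine1", "spine1": "leaf2"}
faulty = {"leaf1": "spine1", "spine1": "leaf2", "leaf2": "spine1"}
print(find_loop(healthy, "leaf1"), find_loop(faulty, "leaf1"))
```

The walk visits each node at most once before it either terminates or detects a repeat, so the check stays cheap even when it is run for many destinations.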
Huawei’s Fabric Insight is an intelligent, service-oriented management solution that is built to help customers meet the challenge of implementing O&M best practices in SDN environments. This implementation helps promote the further commercialization of SDN architectures.