Abstract: Moving business to the cloud has been a recent trend, and COVID-19 has accelerated it. However, not all forms of business are suitable for public cloud computing. For the sake of data privacy, many users, especially government users, prefer to build their own private or hybrid cloud in the post-COVID-19 world, and hyper-converged infrastructure (HCI) is a convenient way to achieve this goal. In HCI, computing, storage, and networking are all virtualized, which leads to higher resource utilization and easier deployment. Network elements in HCI no longer exist as tangible hardware blocks but as software. To achieve better data forwarding performance under virtualization, many innovative technologies have emerged, among which DPDK has been widely studied and applied. With DPDK, developers can build customized network forwarding applications. Virtualization and DPDK can greatly improve resource utilization and network forwarding performance, reducing the difficulty and cost of building data centers or private clouds for enterprises and institutions of various scales. However, a high degree of virtualization also poses great challenges to network operation and maintenance owing to the loss of physical network entities. When a virtual network suffers a failure (e.g., packet loss), traditional diagnostic tools designed for hardware network equipment cannot locate and analyze the cause, resulting in a much longer mean time to repair (MTTR) and greater business loss. Even worse, the virtual network appears as a black box to network operators, which makes the network vulnerable. To solve these problems, this study proposes Flowprobe, a proactive diagnostic system for persistent packet loss in HCI-based clouds, which aims to enable the detection and cause location of persistent packet loss in DPDK-based user-space virtual networks. 
With this system, users can obtain a comprehensive view of the path a packet takes through the virtual network, the actions performed on the packet, the positions where packet loss occurs, and the causes of the loss. Extensive evaluation shows that the system can handle 576 packet loss scenarios in virtual networks. It also imposes little overhead: data forwarding performance degrades by no more than 1% while the system is running. The system has been deployed in an HCI production environment for about 3 years and has helped solve many problems in virtual networks.