Minimize Supercomputing Data Center Downtime and Increase Security with NVIDIA Mellanox UFM Cyber-AI

By: Brian Sparks, Sr. Director, HPC and InfiniBand Marketing, Consortiums and Alliances, NVIDIA Mellanox Networking.

Data centers host many users and applications and have become a competitive advantage for research organizations and manufacturing companies. Keeping the data center intact and healthy is critical, as the operational costs of supercomputers continue to rise, driven by growing scientific computing demands and new security threats. Moreover, malicious users may exploit data center access to misuse compute resources by running prohibited applications, such as cryptocurrency mining, resulting in unexpected downtime and higher operating costs.

This week, NVIDIA unveiled the Unified Fabric Manager (UFM) Cyber-AI platform, which minimizes downtime and saves OPEX in InfiniBand data centers by harnessing AI-powered analytics to detect security threats and operational issues, predict network failures and recommend preventive maintenance. UFM Cyber-AI is a new addition to the UFM product line, which includes the widely used UFM Enterprise platform. The UFM Enterprise platform has been helping supercomputer system administrators manage their InfiniBand networks for over a decade through its network monitoring, management, performance optimization, configuration checks and secure-cable management features.

In addition to these features, the new UFM Cyber-AI platform applies AI to learn a data center’s operational cadence and network workload patterns, drawing on both real-time and historical telemetry and workload data. Against this baseline, it tracks the system’s health and network modifications, and detects performance degradations and changes in usage and profile. Furthermore, its predictions improve over time as more system data is collected and the cadence of the data center is learned. This enables system administrators to quickly detect and respond to potential security threats and address upcoming failures, saving cost and ensuring consistent service to customers.
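To make the idea of detecting deviations from a learned baseline more concrete, here is a minimal Python sketch. It is not how UFM Cyber-AI works internally, which this post does not describe. It simply learns a mean and standard deviation from historical samples and flags recent samples that fall far outside that range. The function name `find_anomalies`, the threshold of 3 standard deviations and the bandwidth figures are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def find_anomalies(history, recent, threshold=3.0):
    """Flag recent samples that deviate strongly from a historical baseline.

    Illustrative z-score check only, not the UFM Cyber-AI algorithm.
    `history` needs at least two samples so a standard deviation exists.
    Returns a list of (index, value, z_score) tuples.
    """
    baseline_mean = mean(history)
    baseline_std = stdev(history)
    anomalies = []
    for index, value in enumerate(recent):
        if baseline_std == 0:
            # A perfectly flat baseline: any change at all is a deviation.
            z_score = 0.0 if value == baseline_mean else float("inf")
        else:
            z_score = (value - baseline_mean) / baseline_std
        if abs(z_score) > threshold:
            anomalies.append((index, value, z_score))
    return anomalies


# Hypothetical per-port bandwidth samples in Gb/s (made-up numbers).
history = [40.1, 39.8, 40.3, 40.0, 39.9, 40.2, 40.0, 39.7]
recent = [40.1, 39.9, 12.5, 40.0]

for index, value, z_score in find_anomalies(history, recent):
    print(f"sample {index}: {value} Gb/s deviates from baseline (z = {z_score:.1f})")
```

In this made-up data, only the sudden drop to 12.5 Gb/s is flagged. A production system would presumably use far richer models and many more signals, but the principle of comparing live telemetry against a learned baseline is the same.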

UFM platforms comprise multiple solution levels and comprehensive feature sets to meet the broadest range of modern scale-out data center requirements. To round out the UFM platform portfolio, NVIDIA also introduced the UFM Telemetry platform. UFM Telemetry provides network validation tools and monitors network performance and conditions. It captures rich real-time network telemetry, workload usage, system configuration and more, and streams this data to a defined on-premises or cloud-based database for further analysis.

UFM can easily integrate with existing data center management tools. UFM provides an open and extensible object model to describe data center infrastructure and conduct all relevant management actions. UFM’s REST API enables integration with leading job schedulers and cloud and cluster managers, including, for example, Slurm and Platform LSF. UFM also provides network provisioning and integration with OpenStack, Azure Cloud and VMware.

Finally, regular performance analysis is essential to ensure that your NVIDIA Mellanox solution is aligned with your business objectives and the latest NVIDIA Mellanox technology. Our NVIDIA Mellanox Care Monitoring and NOC Services continuously monitor your solution for potential faults, giving you peace of mind by identifying and addressing issues before they become problems. The result is increased ROI and lower system maintenance costs.

By combining enhanced, real-time network telemetry with AI-powered cyber intelligence and analytics to support scale-out InfiniBand data centers, along with NVIDIA Mellanox Care services, the NVIDIA Mellanox UFM platform portfolio can revolutionize your supercomputer data center networking management, save operational costs and maintain customer satisfaction. More information on the UFM platform portfolio can be found here.