Network Devices
The training of LLMs often relies on distributed network systems [171], [172]. Transmitting gradients over the links between GPU server nodes generates significant volumetric traffic, which is susceptible to disruption by burst traffic such as pulsating attacks [161]. Furthermore, distributed training frameworks may encounter congestion issues [173].
ENTITY
3 - Other
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit25
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Prioritize congestion-avoidance protocols such as Proactive Congestion Notification (PCN) in the distributed network system to preemptively regulate switch queue lengths, mitigating congestion before the arrival of the periodic, volumetric burst traffic generated by gradient synchronization during distributed deep learning (DDL) training.
2. Employ network flow control mechanisms designed specifically for machine learning workloads, such as MLTCP. These mechanisms leverage the periodic nature of deep neural network (DNN) traffic to iteratively approximate a centralized flow schedule, reducing network contention and improving communication efficiency between GPU server nodes.
3. Implement low-latency Traffic Engineering (TE) systems with sub-second control loops to dynamically manage and distribute traffic across the network. This alleviates burst-induced congestion and reduces maximum link utilization, which is critical for maintaining stability in geographically distributed or commodity cloud networks with high latency and jitter.
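The first strategy above, proactive regulation of switch queue lengths, can be illustrated with a minimal sketch. This is not the PCN protocol itself; the class name, capacity, and threshold values are illustrative assumptions. The idea it demonstrates is that a switch signals senders to pace down once its queue crosses a configured threshold, before a periodic gradient-synchronization burst can overflow the buffer and force tail drops.

```python
from collections import deque

class SwitchQueue:
    """Hypothetical sketch of PCN-style proactive queue regulation.

    The switch emits a congestion notification as soon as occupancy
    reaches a threshold, rather than waiting for the queue to fill.
    """

    def __init__(self, capacity_pkts=1000, notify_fraction=0.5):
        self.capacity = capacity_pkts
        # Trigger point for early notifications (assumed value).
        self.threshold = int(capacity_pkts * notify_fraction)
        self.queue = deque()
        self.notifications = 0

    def enqueue(self, pkt):
        if len(self.queue) >= self.capacity:
            return False  # buffer full: tail drop
        self.queue.append(pkt)
        if len(self.queue) >= self.threshold:
            # Proactively signal senders to reduce their rate.
            self.notifications += 1
        return True

    def drain(self, n):
        # Model link service: transmit up to n queued packets.
        for _ in range(min(n, len(self.queue))):
            self.queue.popleft()

# A gradient-sync burst of 600 packets against a half-full trigger point:
sw = SwitchQueue(capacity_pkts=1000, notify_fraction=0.5)
accepted = sum(sw.enqueue(i) for i in range(600))
print(accepted, sw.notifications)  # all 600 fit; notifications begin at packet 500
```

In this toy run no packets are dropped, but notifications start firing once occupancy reaches 500, giving senders time to pace down before the next burst arrives; a reactive scheme would only react after losses occur.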