2. Privacy & Security

Network Devices

The training of LLMs often relies on distributed network systems [171], [172]. Transmitting gradients over the links between GPU server nodes generates significant volumetric traffic. This traffic is susceptible to disruption by burst traffic, such as pulsating attacks [161]. Furthermore, distributed training frameworks may encounter congestion issues [173].
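To give a sense of scale, the sketch below estimates the per-iteration gradient traffic each node sends under ring all-reduce, the common synchronization pattern in distributed training. The model size, precision, and node count are hypothetical illustrative numbers, not figures from the cited works.

```python
# Illustrative sketch (hypothetical numbers): estimate per-node traffic
# generated by one ring all-reduce of the gradients, to show why
# distributed training produces periodic, volumetric bursts.

def ring_allreduce_bytes_per_node(num_params: int,
                                  bytes_per_param: int,
                                  num_nodes: int) -> int:
    """Each node transmits 2*(n-1)/n of the gradient payload per all-reduce."""
    payload = num_params * bytes_per_param
    return int(2 * (num_nodes - 1) / num_nodes * payload)

# Example: a 7B-parameter model with fp16 gradients across 8 nodes.
sent = ring_allreduce_bytes_per_node(7_000_000_000, 2, 8)
print(f"{sent / 1e9:.1f} GB sent per node per training step")  # 24.5 GB
```

Because this payload is retransmitted every training step, the links carry a regular, high-volume burst pattern, which is what makes them attractive targets for pulsating attacks and prone to congestion.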

Source: MIT AI Risk Repository (mit25)

ENTITY

3 - Other

INTENT

2 - Unintentional

TIMING

1 - Pre-deployment

Risk ID

mit25

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Prioritize congestion-avoidance protocols such as Proactive Congestion Notification (PCN) in the distributed network system to preemptively regulate switch queue lengths, mitigating congestion before the periodic, volumetric burst traffic generated by gradient synchronization during distributed deep learning (DDL) training arrives.

2. Employ specialized network flow control mechanisms, such as MLTCP, designed for machine learning workloads. These mechanisms leverage the periodic nature of deep neural network (DNN) traffic to iteratively approximate a centralized flow schedule, reducing network contention and improving communication efficiency between GPU server nodes.

3. Implement advanced, low-latency Traffic Engineering (TE) systems with sub-second control loops to dynamically manage and distribute traffic across the network. This alleviates burst-induced congestion and reduces maximum link utilization, which is critical for stability in geographically distributed or commodity cloud networks with high latency and jitter.
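The intuition behind flow-control mechanisms like MLTCP is that staggering the periodic bursts of co-located training jobs lowers peak load on a shared link. The toy simulation below is not MLTCP itself, only a minimal sketch of that intuition; the period, burst length, and offsets are hypothetical.

```python
# Illustrative sketch: staggering periodic gradient-sync bursts
# (the intuition behind MLTCP-style flow scheduling) reduces the
# peak number of jobs transmitting simultaneously on a shared link.

def peak_utilization(offsets, burst_len=2, period=10, horizon=40):
    """Peak count of jobs bursting at the same time step.

    Each job bursts for `burst_len` time steps once per `period`,
    starting at its offset; `horizon` is the simulated window.
    """
    load = [0] * horizon
    for off in offsets:
        for t in range(horizon):
            if (t - off) % period < burst_len:
                load[t] += 1
    return max(load)

print(peak_utilization([0, 0, 0]))  # synchronized bursts -> peak 3
print(peak_utilization([0, 3, 6]))  # staggered bursts    -> peak 1
```

Real mechanisms achieve this staggering indirectly, by adapting each flow's congestion window to the training iteration phase, but the effect on maximum link utilization is the same.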