
Misuse of interpretability techniques

Interpretability techniques, by enabling a better understanding of the model, could potentially be used for harmful purposes. For example, mechanistic interpretability could be used to identify neurons responsible for specific functions; neurons that encode safety-related features could then be modified to decrease their activation, or certain information could be censored [24]. Furthermore, interpretability techniques can be used to simulate a white-box attack scenario, in which knowledge of the internal workings of a model aids the development of adversarial attacks [24].
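As a toy illustration of the ablation risk described above, zeroing a single hidden activation during a white-box forward pass is enough to change a model's output. The network, weights, and "safety neuron" index below are all synthetic stand-ins, not a real model or a real interpretability result:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-layer ReLU network; random weights stand in for a real model.
W1 = rng.normal(size=(8, 4))   # 8 hidden neurons
W2 = rng.normal(size=(1, 8))

def forward(x, ablate=None):
    """Forward pass; optionally zero out (ablate) one hidden neuron."""
    h = np.maximum(0.0, W1 @ x)
    if ablate is not None:
        h[ablate] = 0.0            # the white-box edit: silence that neuron
    return float(W2 @ h)

# Input chosen so at least one hidden neuron is guaranteed to fire.
x = W1[0].copy()
h = np.maximum(0.0, W1 @ x)
target = int(np.argmax(h))         # pretend interpretability analysis flagged
                                   # this neuron as encoding a safety feature
baseline = forward(x)
edited = forward(x, ablate=target)
```

With access to internals, the attacker needs only the neuron index; the ablated output differs from the baseline, which is exactly why white-box access is treated as a security-sensitive capability here.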

Source: MIT AI Risk Repository (mit1131)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit1131

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Centralize and Strictly Control Access to Model Internals: Implement robust access safeguards by centralizing all copies of the model's architecture, weights, and internal state data on a minimal set of highly secured, monitored systems. Authorization to access these "white-box" elements must be restricted to a small number of fully vetted safety and alignment researchers to prevent unauthorized identification and modification of safety-critical features (neurons/circuits).

2. Deploy Real-Time White-Box Monitoring and Auditing: Utilize advanced internal monitoring techniques, such as linear probes and ensembles of real-time monitors, to continuously inspect model activations and internal reasoning paths. This is essential for detecting telltale circuit patterns or activation anomalies (e.g., context shifts, prompt patterns, internal deception signals) that would indicate intentional subversion of safety features or the execution of a targeted adversarial attack.

3. Limit Output Granularity to Impede Attack Development: For any external-facing interfaces or APIs, reduce the information leakage necessary for an adversary to simulate white-box conditions. This includes implementing strict query rate limits and reducing the granularity of model outputs (e.g., providing only class labels instead of confidence scores or raw probabilities) to deny attackers the comprehensive data required for reverse-engineering model functionality or refining iterative adversarial examples.
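The white-box monitoring idea in strategy 2 can be sketched with a minimal linear probe. Everything here is synthetic: the activations are random vectors, and the "safety-related feature direction" is an invented stand-in for a direction a real interpretability analysis might find. The probe is plain logistic regression trained by gradient descent, then used to flag incoming activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": 512-dim vectors; a hypothetical safety-related
# feature direction is active (shifted projection) in half the samples.
d = 512
feature_dir = rng.normal(size=d)
feature_dir /= np.linalg.norm(feature_dir)

n = 1000
labels = rng.integers(0, 2, size=n)          # 1 = feature active
acts = rng.normal(size=(n, d))
acts[labels == 1] += 3.0 * feature_dir       # shift along the direction

# Fit a linear probe: logistic regression via simple gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(200):
    z = acts @ w + b
    p = 1.0 / (1.0 + np.exp(-z))             # predicted P(feature active)
    w -= lr * (acts.T @ (p - labels)) / n
    b -= lr * np.mean(p - labels)

def flag(activation, threshold=0.5):
    """Flag an activation whose probe score crosses the threshold."""
    score = 1.0 / (1.0 + np.exp(-(activation @ w + b)))
    return score > threshold

preds = np.array([flag(a) for a in acts])
accuracy = float(np.mean(preds == labels.astype(bool)))
```

In a real monitoring setup the probe would be trained on labeled internal activations from the deployed model, run as part of the serving path, and its `threshold` tuned for the desired false-positive rate; this sketch only shows the mechanism.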