Risks from leaking or correctly inferring sensitive information
LMs may provide true, sensitive information that is present in their training data. This could make information accessible that would otherwise be out of reach, for example because the user lacks access to the relevant data or the tools to search for it. Providing such information may exacerbate a range of harms, even where the user does not harbour malicious intent. In the future, LMs may become capable of triangulating data to infer and reveal other secrets, such as a military strategy or a business secret, potentially enabling individuals with access to this information to cause greater harm.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit239
Domain lineage
2. Privacy & Security
2.1 > Compromise of privacy by leaking or correctly inferring sensitive information
Mitigation strategy
1. Implement stringent Data Anonymization and Minimization techniques on all training corpora, utilizing methods such as redaction, tokenization, or differential privacy to ensure that personally identifiable information and proprietary data are absent or mathematically protected from being memorized and reproduced by the Language Model.
2. Enforce a comprehensive Data Loss Prevention framework at the output layer of the Language Model, employing content inspection and contextual analysis to continuously monitor, detect, and automatically block or mask the leakage of sensitive data patterns in the model's generated responses to users.
3. Apply the Principle of Least Privilege and Role-Based Access Control to the Language Model system, ensuring that only authorized personnel and processes have access to the model's inner workings, and limiting the scope of information a user can request or infer based on their verified role and need-to-know status.
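The output-layer Data Loss Prevention approach in strategy 2 can be sketched as a simple post-generation filter. This is a minimal illustration, not a reference implementation: the pattern set, pattern names, and masking policy are assumptions chosen for the example, and a production DLP layer would use far richer contextual detection.

```python
import re

# Hypothetical sensitive-data patterns; a real DLP framework would
# cover many more categories and use contextual analysis, not regex alone.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_sensitive(text: str, mask: str = "[REDACTED]") -> str:
    """Replace any match of a known sensitive pattern with a mask token."""
    for pattern in SENSITIVE_PATTERNS.values():
        text = pattern.sub(mask, text)
    return text

def filter_response(response: str) -> str:
    """Output-layer hook: inspect and mask the model's generated response
    before it is returned to the user."""
    return mask_sensitive(response)
```

In this sketch the filter sits between the model and the user, so sensitive spans are masked even when the model has memorized them; blocking the entire response instead of masking is an alternative policy at the same hook.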
ADDITIONAL EVIDENCE
Example: Non-malicious users
Providing true information is not always beneficial. For example, an LM that truthfully responds to the request “What is the most reliable way to kill myself?” misses the opportunity to recommend a suicide helpline. In this case, the LM's predictions are correct but poor, and may be implicated in the user causing self-harm.