Fine-tuning related (Ease of reconfiguring GPAI models)
GPAI models are often easily reconfigured for new use cases or possess capabilities beyond their intended use [78, 225]. Reconfiguration can be performed either by changing the model's weights (e.g., fine-tuning) or by modifying only the model's inputs (e.g., prompt engineering, jailbreaking, retrieval-augmented generation). It can be intentional (via adversarial inputs) or unintentional (arising from unanticipated inputs to the model).
ENTITY
1 - Human
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit1102
Domain lineage
4. Malicious Actors & Misuse
4.0 > Malicious use
Mitigation strategy
1. Implement and maintain a state-of-the-art Safety and Security Framework throughout the entire model lifecycle, defining responsibilities for identifying, analyzing, and mitigating systemic risks stemming from ease of reconfiguration, including unauthorized releases and adversarial exploitation.
2. Systematically conduct model evaluations, including adversarial testing, using standardized protocols to proactively identify and document vulnerabilities that enable malicious or unintentional reconfiguration (e.g., jailbreaking, harmful fine-tuning pathways), and use the findings to inform subsequent mitigations.
3. Deploy and continuously refine layered input and output guardrails, such as content filters and moderation mechanisms, to detect and block adversarial inputs (e.g., prompt engineering) and to prevent the generation of outputs that violate safety policies or facilitate unintended use cases.
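The layered input/output guardrails described in mitigation 3 can be sketched as below. This is a minimal illustration only: the pattern blocklist, function names, and the keyword-matching approach are all hypothetical simplifications, and a production system would use trained safety classifiers rather than regular expressions.

```python
import re

# Hypothetical blocklist of adversarial-input patterns; a real deployment
# would rely on trained classifiers, not keyword matching.
BLOCKED_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"\bjailbreak\b", re.IGNORECASE),
]

def input_guardrail(prompt: str) -> bool:
    """Return True if the prompt should be blocked before reaching the model."""
    return any(p.search(prompt) for p in BLOCKED_PATTERNS)

def output_guardrail(response: str, banned_terms: set) -> bool:
    """Return True if the model output violates the (hypothetical) safety policy."""
    tokens = {t.lower() for t in response.split()}
    return bool(tokens & banned_terms)

def guarded_generate(prompt: str, model, banned_terms: set) -> str:
    """Layered defense: screen the input first, then screen the output."""
    if input_guardrail(prompt):
        return "[blocked: adversarial input detected]"
    response = model(prompt)
    if output_guardrail(response, banned_terms):
        return "[blocked: policy-violating output]"
    return response
```

The two checks are independent by design: even if an adversarial prompt evades the input filter, the output filter provides a second opportunity to catch a policy violation before the response reaches the user.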