186 canonical MIT risk pages
2. Privacy & Security
Risks of data leakage, attacks, system compromise, and misuse of sensitive information.
2. Privacy & Security
“Model Psychology” Attacks
LLMs are vulnerable to “psychological” tricks (Li et al., 2023e; Shen et al., 2023), which can be exploited by attackers. Examples include instructing the model to behave like a specific persona (Shah et al., 2023; Andreas, 2022), or employing various “social engineering” tricks crafted by humans (Wei et al., 2023c) or other LLMs (Perez et al., 2022b; Casper et al., 2023c).
2. Privacy & Security
Adversarial AI (General)
Adversarial AI refers to a class of attacks that exploit vulnerabilities in machine-learning (ML) models. This class of misuse exploits vulnerabilities introduced by the AI assistant itself and is a form of misuse that can enable malicious entities to exploit privacy vulnerabilities and evade the model’s built-in safety mechanisms, policies, and ethical boundaries of the model. Besides the risks of misuse for offensive cyber operations, advanced AI assistants may also represent a new target for abuse, where bad actors exploit the AI systems themselves and use them to cause harm. While our understanding of vulnerabilities in frontier AI models is still an open research problem, commercial firms and researchers have already documented attacks that exploit vulnerabilities that are unique to AI and involve evasion, data poisoning, model replication, and exploiting traditional software flaws to deceive, manipulate, compromise, and render AI systems ineffective. This threat is related to, but distinct from, traditional cyber activities. Unlike traditional cyberattacks that typically are caused by ‘bugs’ or human mistakes in code, adversarial AI attacks are enabled by inherent vulnerabilities in the underlying AI algorithms and how they integrate into existing software ecosystems.
2. Privacy & Security
Adversarial AI: Circumvention of Technical Security Measures
The technical measures to mitigate misuse risks of advanced AI assistants themselves represent a new target for attack. An emerging form of misuse of general-purpose advanced AI assistants exploits vulnerabilities in a model that results in unwanted behavior or in the ability of an attacker to gain unauthorized access to the model and/or its capabilities. While these attacks currently require some level of prompt engineering knowledge and are often patched by developers, bad actors may develop their own adversarial AI agents that are explicitly trained to discover new vulnerabilities that allow them to evade built-in safety mechanisms in AI assistants. To combat such misuse, language model developers are continually engaged in a cyber arms race to devise advanced filtering algorithms capable of identifying attempts to bypass filters. While the impact and severity of this class of attacks is still somewhat limited by the fact that current AI assistants are primarily text-based chatbots, advanced AI assistants are likely to open the door to multimodal inputs and higher-stakes action spaces, with the result that the severity and impact of this type of attack is likely to increase. Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress towards advanced AI assistant development could lead to capabilities that pose extreme risks that must be protected against this class of attacks, such as offensive cyber capabilities or strong manipulation skills, and weapons acquisition.
2. Privacy & Security
Adversarial AI: Data and Model Exfiltration Attacks
Other forms of abuse can include privacy attacks that allow adversaries to exfiltrate or gain knowledge of the private training data set or other valuable assets. For example, privacy attacks such as membership inference can allow an attacker to infer the specific private medical records that were used to train a medical AI diagnosis assistant. Another risk of abuse centers around attacks that target the intellectual property of the AI assistant through model extraction and distillation attacks that exploit the tension between API access and confidentiality in ML models. Without the proper mitigations, these vulnerabilities could allow attackers to abuse access to a public-facing model API to exfiltrate sensitive intellectual property such as sensitive training data and a model’s architecture and learned parameters.
2. Privacy & Security
Adversarial AI: Prompt Injections
Prompt injections represent another class of attacks that involve the malicious insertion of prompts or requests in LLM-based interactive systems, leading to unintended actions or disclosure of sensitive information. The prompt injection is somewhat related to the classic structured query language (SQL) injection attack in cybersecurity where the embedded command looks like a regular input at the start but has a malicious impact. The injected prompt can deceive the application into executing the unauthorized code, exploit the vulnerabilities, and compromise security in its entirety. More recently, security researchers have demonstrated the use of indirect prompt injections. These attacks on AI systems enable adversaries to remotely (without a direct interface) exploit LLM-integrated applications by strategically injecting prompts into data likely to be retrieved. Proof-of-concept exploits of this nature have demonstrated that they can lead to the full compromise of a model at inference time analogous to traditional security principles. This can entail remote control of the model, persistent compromise, theft of data, and denial of service. As advanced AI assistants are likely to be integrated into broader software ecosystems through third-party plugins and extensions, with access to the internet and possibly operating systems, the severity and consequences of prompt injection attacks will likely escalate and necessitate proper mitigation mechanisms.
2. Privacy & Security
Adversarial attack
Recent advances have shown that a deep learning model with high predictive accuracy frequently misbehaves on adversarial examples [57,58]. In particular, a small perturbation to an input image, which is imperceptible to humans, could fool a well-trained deep learning model into making completely different predictions [23].
2. Privacy & Security
Adversarial attacks targeting explainable AI techniques
Adversarial attacks can affect not only the model’s output but also its corresponding explanation. Current adversarial optimization techniques can intro- duce imperceptible noise to the input image, so that the model’s output does not change but the corresponding explanation is arbitrarily manipulated [61]. Such manipulations are harder to notice, as they are less commonly known compared to standard adversarial attacks targeting the model’s output.
2. Privacy & Security
Adversarial input
Adversarial Inputs involve modifying individual input data to cause a model to malfunction. These modifications, which are often imperceptible to humans, exploit how the model makes decisions to produce errors (Wallace et al., 2019) and can be applied to text, but also to images, audio, or video (e.g. changing pixels in an image of a panda in a way that causes a model to label it as a gibbon).6
2. Privacy & Security
Adversarial Prompts
Engineering an adversarial input to elicit an undesired model behavior, which pose a clear attack intention
2. Privacy & Security
Association in LLMs
Association in LLMs refers to the capability to associate various pieces of information related to a person. According to [68], [86], given a pair of PII entities (xi , xj ), which is associated by a model F. Using a prompt p could force the model F to produce the entity xj , where p is the prompt related to the entity xi . For instance, an LLM could accurately output the answer when given the prompt “The email address of Alice is”, if the LLM associates Alice with her email “alice@email.com”. L
2. Privacy & Security
Attacking LLMs via Additional Modalities a
LLMs can now process modalities other than text, e.g. images or video frames (OpenAI, 2023c; Gemini Team, 2023). Several studies show that gradient-based attacks on multimodal models are easy and effective (Carlini et al., 2023a; Bailey et al., 2023; Qi et al., 2023b). These attacks manipulate images that are input to the model (via an appropriate encoding). GPT-4Vision (OpenAI, 2023c) is vulnerable to jailbreaks and exfiltration attacks through much simpler means as well, e.g. writing jailbreaking text in the image (Willison, 2023a; Gong et al., 2023). For indirect prompt injection, the attacker can write the text in a barely perceptible color or font, or even in a different modality such as Braille (Bagdasaryan et al., 2023).
2. Privacy & Security
Attribute inference attack
An attribute inference attack repeatedly queries a model to detect whether certain sensitive features can be inferred about individuals who participated in training a model. These attacks occur when an adversary has some prior knowledge about the training data and uses that knowledge to infer the sensitive data.
2. Privacy & Security
Backdoors or trojan attacks in GPAI models
Backdoors can be inserted into GPAI models during their training or fine-tuning, to be exploited during deployment [185, 118]. Attackers inserting the backdoor can be the GPAI model provider themselves or another actor (e.g., by ma- nipulating the training data or the software infrastructure used by the model provider) [222]. Some backdoors can be exploited with minimal overhead, al- lowing attackers to control the model outputs in a targeted way with a high success rate [90].
2. Privacy & Security
Centralized platforms deployed at scale
The widespread use of common AI platforms can create centralized points of failure, making systems more vulnerable to disruptions or attacks
2. Privacy & Security
Compromising privacy by correctly inferring private information
Privacy violations may occur at the time of inference even without the individual’s private data being present in the training dataset. Similar to other statistical models, a LM may make correct inferences about a person purely based on correlational data about other people, and without access to information that may be private about the particular individual. Such correct inferences may occur as LMs attempt to predict a person’s gender, race, sexual orientation, income, or religion based on user input.
2. Privacy & Security
Compromising privacy by leaking private infiormation
By providing true information about individuals’ personal characteristics, privacy violations may occur. This may stem from the model “remembering” private information present in training data (Carlini et al., 2021).
2. Privacy & Security
Compromising privacy by leaking sensitive information
A LM can “remember” and leak private data, if such information is present in training data, causing privacy violations [34].
2. Privacy & Security
Compromising privacy or security by correctly inferring sensitive information
Anticipated risk: Privacy violations may occur at inference time even without an individual’s data being present in the training corpus. Insofar as LMs can be used to improve the accuracy of inferences on protected traits such as the sexual orientation, gender, or religiousness of the person providing the input prompt, they may facilitate the creation of detailed profiles of individuals comprising true and sensitive information without the knowledge or consent of the individual.
2. Privacy & Security
Confidential data in prompt
Confidential information might be included as a part of the prompt that is sent to the model.
2. Privacy & Security
Confidential information in data
Confidential information might be included as part of the data that is used to train or tune the model.
2. Privacy & Security
Confidentiality loss
Confidentiality loss - Unauthorised sharing of sensitive, confidential information and documents such as corporate strategy and financial plans with third-parties.
2. Privacy & Security
Cybersecurity
This section catalogs the risk sources and mitigation measures related to cyber- security. These items may be related to security in terms of AI models being accessible only to the intended users, as well as AI models having appropriate access to the external world during both model development and deployment stages.
2. Privacy & Security
Cyberspace risks (Risks of information leakage due to improper usage)
Staff of government agencies and enterprises, if failing to use the AI service in a regulated and proper manner, may input internal data and industrial information into the AI model, leading to the leakage of work secrets, business secrets, and other sensitive business data.
2. Privacy & Security
Cyberspace risks (Risks of security flaw transmission caused by model reuse)
Re-engineering or fine-tuning based on foundation models is commonly used in AI applications. If security flaws occur in foundation models, it will lead to risk transmission to downstream models.
2. Privacy & Security
Data exfiltration
Data Exfiltration goes beyond revealing private information, and involves illicitly obtaining the training data used to build a model that may be sensitive or proprietary. Model Extraction is the same attack, only directed at the model instead of the training data — it involves obtaining the architecture, parameters, or hyper-parameters of a proprietary model (Carlini et al., 2024).
2. Privacy & Security
Data governance
These evaluations assess the extent to which LLMs regurgitate their training data in their outputs, and whether LLMs 'leak' sensitive information that has been provided to them during use (i.e., during the inference stage).
2. Privacy & Security
Data poisoning
Data poisoning describes an attack in the form of an injection of malicious data into the training set. If not prevented, this attack leads the AI system to learn unintended behavior.
2. Privacy & Security
Data poisoning
A type of adversarial attack where an adversary or malicious insider injects intentionally corrupted, false, misleading, or incorrect samples into the training or fine-tuning datasets.
2. Privacy & Security
Data Privacy
Impacts due to leakage and unauthorized use, disclosure, or de-anonymization of biometric, health, location, or other personally identifiable information or sensitive data.
2. Privacy & Security
Data Protection/Privacy
Vulnerable channel by which personal information may be accessed. The user may want their personal data to be kept private.
2. Privacy & Security
Data-related (Difficulty filtering large web scrapes or large scale web datasets)
A large scale “scraping” of web data for training datasets increases vulnerability to data poisoning, backdoor attacks, and the inclusion of inaccurate or toxic data [76, 28, 48]. With a large dataset, filtering out these quality issues is very difficult or trades off against significant data loss.
2. Privacy & Security
Data-related (Insufficient quality control in data collection process)
A lack of standardized methods and sufficient infrastructure, including the absence of quality control processes for collecting data, especially for high-stakes domains and benchmarks, can affect the quality and type of the data collected [173, 95]. This may include risks of dataset poisoning, inadvertent copyright violation, and test set leakages which invalidate performance metrics.
2. Privacy & Security
Decision-making on inferred private data
Current GPAIs (LLMs and multimodal LLM-based models) have significant capability to infer correlations in text data. In some cases, they may be able to make highly accurate data inferences on users based on contextual input that users provide [134]. These data inferences can “leak” or reveal sensitive information about the user, cause unfair treatment, or enable manipulation of user behavior.
2. Privacy & Security
Deep Learning Frameworks
LLMs are implemented based on deep learning frameworks. Notably, various vulnerabilities in these frameworks have been disclosed in recent years. As reported in the past five years, three of the most common types of vulnerabilities are buffer overflow attacks, memory corruption, and input validation issues.
2. Privacy & Security
Disclosure
Revealing and improperly sharing data of individuals; AI creates new types of disclosure risks by inferring additional information beyond what is explicitly captured in the raw data; AI exacerbates disclosure risks through sharing personal data to train models.
2. Privacy & Security
Dissemination of dangerous information
Leaking, generating or correctly inferring hazardous or sensitive information that could pose a security threat
2. Privacy & Security
Evasion attack
Evasion attacks attempt to make a model output incorrect results by slightly perturbing the input data that is sent to the trained model.
2. Privacy & Security
Evasion Attacks
Evasion attacks [145] target to cause significant shifts in model’s prediction via adding perturbations in the test samples to build adversarial examples. In specific, the perturbations can be implemented based on word changes, gradients, etc.
2. Privacy & Security
Exclusion
The failure to provide end-users with notice and control over how their data is being used; AI exacerbates exclusion risks by training on rich personal data without consent.
2. Privacy & Security
Exploiting External Tools for Attacks
Adversarial tool providers can embed malicious instructions in the APIs or prompts [84], leading LLMs to leak memorized sensitive information in the training data or users’ prompts (CVE2023-32786). As a result, LLMs lack control over the output, resulting in sensitive information being disclosed to external tool providers. Besides, attackers can easily manipulate public data to launch targeted attacks, generating specific malicious outputs according to user inputs. Furthermore, feeding the information from external tools into LLMs may lead to injection attacks [61]. For example, unverified inputs may result in arbitrary code execution (CVE-2023-29374).
2. Privacy & Security
Exploiting Limited Generalization of Safety Finetuning
Safety tuning is performed over a much narrower distribution compared to the pretraining distribution. This leaves the model vulnerable to attacks that exploit gaps in the generalization of the safety training, e.g. using encoded text (Wei et al., 2023c) or low-resource languages (Deng et al., 2023a; Yong et al., 2023) (see also Section 3.2).
2. Privacy & Security
Exposing personal information
When personal identifiable information (PII) or sensitive personal information (SPI) are used in training data, fine-tuning data, or as part of the prompt, models might reveal that data in the generated output. Revealing personal information is a type of data leakage.
2. Privacy & Security
Exposure
Revealing sensitive private information that people view as deeply primordial that we have been socialized into concealing; AI creates new types of exposure risks through generative techniques that can reconstruct censored or redacted content; and through exposing inferred sensitive data, preferences, and intentions.
2. Privacy & Security
Extraction attack
An attribute inference attack is used to detect whether certain sensitive features can be inferred about individuals who participated in training a model. These attacks occur when an adversary has some prior knowledge about the training data and uses that knowledge to infer the sensitive data.
2. Privacy & Security
Extraction Attacks
Extraction attacks [137] allow an adversary to query a black-box victim model and build a substitute model by training on the queries and responses. The substitute model could achieve almost the same performance as the victim model. While it is hard to fully replicate the capabilities of LLMs, adversaries could develop a domainspecific model that draws domain knowledge from LLMs
2. Privacy & Security
Factual Errors Injected by External Tools
External tools typically incorporate additional knowledge into the input prompts [122], [178]–[184]. The additional knowledge often originates from public resources such as Web APIs and search engines. As the reliability of external tools is not always ensured, the content returned by external tools may include factual errors, consequently amplifying the hallucination issue.
2. Privacy & Security
Fine-tuning related (Fine-tuning dataset poisoning)
A deployer can poison the dataset used during the fine-tuning process [98] to induce specific, often malicious, behaviors in a model. This can be performed without having access to the model’s weights. This poisoning can be difficult to detect through direct inspection of the dataset, as the manipulations may be subtle and targeted.
2. Privacy & Security
Fine-tuning related (Poisoning models during instruction tuning)
AI models can be poisoned during instruction tuning when models are tuned using pairs of instructions and desired outputs. Poisoning in instruction tuning can be achieved with a lower number of compromised samples, as instruction tuning requires a relatively small number of samples for fine-tuning [155, 211]. Anonymous crowdsourcing efforts may be employed in collecting instruction tuning datasets and can further contribute to poisoning attacks [187]. These attacks might be harder to detect than traditional data poisoning attacks.
2. Privacy & Security
Generative AI Outputs
Generative AI tools may inadvertently share personal information about someone or someone’s business or may include an element of a person from a photo. Particularly, companies concerned about their trade secrets being integrated into the model from their employees have explicitly banned their employees from using it.
2. Privacy & Security
Generative AI User Data
Many generative AI tools require users to log in for access, and many retain user information, including contact information, IP address, and all the inputs and outputs or “conversations” the users are having within the app. These practices implicate a consent issue because generative AI tools use this data to further train the models, making their “free” product come at a cost of user data to train the tools. This dovetails with security, as mentioned in the next section, but best practices would include not requiring users to sign in to use the tool and not retaining or using the user-generated content for any period after the active use by the user.
2. Privacy & Security
Goal Hijacking
Goal hijacking is a type of primary attack in prompt injection [58]. By injecting a phrase like “Ignore the above instruction and do ...” in the input, the attack could hijack the original goal of the designed prompt (e.g., translating tasks) in LLMs and execute the new goal in the injected phrase.
2. Privacy & Security
Goal Hijacking
It refers to the appending of deceptive or misleading instructions to the input of models in an attempt to induce the system into ignoring the original user prompt and producing an unsafe response.
2. Privacy & Security
GPU Computation Platforms
The training of LLMs requires significant GPU resources, thereby introducing an additional security concern. GPU side-channel attacks have been developed to extract the parameters of trained models [159], [163].
2. Privacy & Security
Hardware Vulnerabilities
The vulnerabilities of hardware systems for training and inferencing brings issues to LLM-based applications.
2. Privacy & Security
Harmful code generation
Models might generate code that causes harm or unintentionally affects other systems.
2. Privacy & Security
Harming users’ data privacy
Modern AI systems rely on large amounts of data. If this includes personal data about individuals, the risk of harming the privacy of persons arises.
2. Privacy & Security
Inference Attacks
Inference attacks [150] include membership inference attacks, property inference attacks, and data reconstruction attacks. These attacks allow an adversary to infer the composition or property information of the training data. Previous works [67] have demonstrated that inference attacks could easily work in earlier PLMs, implying that LLMs are also possible to be attacked
2. Privacy & Security
Inference of private information
Finally, LLMs can in principle infer private information based on model inputs even if the relevant private information is not present in the training corpus (Weidinger et al., 2021). For example, an LLM may correctly infer sensitive characteristics such as race and gender from data contained in input prompts.
2. Privacy & Security
Information & Safety Harms
AI systems leaking, reproducing, generating or inferring sensitive, private, or hazardous information
2. Privacy & Security
Information Hazards
Harms that arise from the language model leaking or inferring true sensitive information
2. Privacy & Security
Information Science Risks
These risks pertain to the misuse, misinterpretation, or leakage of data, which can lead to erroneous conclusions or the unintentional dissemination of sensitive information, such as private patient data or proprietary research. Recent research has demonstrated how LLMs can be exploited to generate malicious medical literature that poisons knowledge graphs, potentially manipulating downstream biomedical applications and compromising the integrity of medical knowledge discovery [28]. Such risks are pervasive across all scientific domains.
2. Privacy & Security
Inquiry with Unsafe Opinion
By adding imperceptibly unsafe content into the input, users might either deliberately or unintentionally influence the model to generate potentially harmful content. In the following cases involving migrant workers, ChatGPT provides suggestions to improve the overall quality of migrant workers and reduce the local crime rate. ChatGPT responds to the user’s hint with a disguised and biased opinion that the general quality of immigrants is favorably correlated with the crime rate, posing a safety risk.
2. Privacy & Security
Insecurity
carelessness in protecting collected personal data from leaks and improper access due to faulty data storage and data practices
2. Privacy & Security
Instruction Attacks
In addition to the above-mentioned typical safety scenarios, current research has revealed some unique attacks that such models may confront. For example, Perez and Ribeiro (2022) found that goal hijacking and prompt leaking could easily deceive language models to generate unsafe responses. Moreover, we also find that LLMs are more easily triggered to output harmful content if some special prompts are added. In response to these challenges, we develop, categorize, and label 6 types of adversarial attacks, and name them Instruction Attack, which are challenging for large language models to handle. Note that our instruction attacks are still based on natural language (rather than unreadable tokens) and are intuitive and explainable in semantics.
2. Privacy & Security
Insufficient Security Measures
Malicious entities can take advantage of weaknesses in AI algorithms to alter results, potentially resulting in tangible real-life impacts. Additionally, it’s vital to prioritize safeguarding privacy and handling data responsibly, particularly given AI’s significant data needs. Balancing the extraction of valuable insights with privacy maintenance is a delicate task
2. Privacy & Security
Interconnectivity with malicious external tools
The growing integration and interconnectivity with external tools and plugins increase the risk of exposure to malicious external inputs. This interconnectivity makes it easier for external tools to introduce harmful content [220].
2. Privacy & Security
IP information in prompt
Copyrighted information or other intellectual property might be included as a part of the prompt that is sent to the model.
2. Privacy & Security
Issues on External Tools
The external tools (e.g., web APIs) present trustworthiness and privacy issues to LLM-based applications.
2. Privacy & Security
Jailbreak in LLM Malicious Use - Backdoor Attack
However, there are still ones who can leave holes in the training dataset, making LLMs appear safe on average, but generate harmful content under other specific conditions. This kind of attack can be categorized as backdoor attack. Evan et al. developed a backdoor model that behaves as expected when trained, but exhibits different and potentially harmful behavior when deployed [81]. The results show that these backdoor behaviors persist even after multiple security training techniques are applied.
2. Privacy & Security
Jailbreak in LLM Malicious Use - Poisoning Training Data
In the data collecting and pre-training phase, malicious adversaries can Jailbreak LLMs through poisoning their training data to make the model to output harmful content.
2. Privacy & Security
Jailbreak in LLM Malicious Use - Prompt Attacks
In the prompting and reasoning phase, dialog can push LLMs into confused or overly compliant states, raising the risk of producing harmful outputs when confronted with harmful questions. Most of the jailbreak methods in this phase are black-boxed and can be categorized into four main groups based on the type of method: Prompt Injection [154], Role Play, Adversarial Prompting, and Prompt Form Transformation.
2. Privacy & Security
Jailbreak in LLM Malicious Use - White & Black Box Attacks
In the fine-tuning and alignment phase, elaborately- designed instruction datasets can be utilized to fine-tune LLMs to drive them to perform undesirable behaviors, such as generating harmful information or content that violates ethical norms, and thus achieve a jailbreak. Based on the accessibility to the model parameters, we can categorize them into white-box and black-box attacks. For white-box attacks, we can jailbreak the model by modifying its parameter weights. In [107], Lermen et al. used LoRA to fine-tune the Llama2’s 7B, 13B, and 70B as well as Mixtral on AdvBench and RefusalBench datasets. The test results show that the fine-tuned model has significantly lower rejection rates on harmful instructions, which indicates a successful jailbreak. Other works focus on jailbreaking in black-box models. In [160], Qi et al. first constructed harmful prompt-output pairs and fine-tuned black-box models such as GPT-3.5 Turbo. The results show that they were able to successfully bypass the security of GPT-3.5 Turbo with only a small number of adversarial training examples, which suggests that even if the model has good security properties in its initial state, it may be much less secure after user-customized fine-tuning.
2. Privacy & Security
Jailbreak of a model to subvert intended behavior
A jailbreak is a type of adversarial input to the model (during deployment) re- sulting in model behavior deviating from intended use. Jailbreaks may be gen- erated automatically in a “white box” setting, where access to internal training parameters is required for creation and optimization of the attack [238]. Other attacks may be “black box” - without access to model internals. In text based generative models, jailbreaks may sometimes be human-readable, with the use of reasoning or role-play to “convince” the model to bypass its safety mechanisms [231].
2. Privacy & Security
Jailbreak of a multimodal model
Current generation multimodal (e.g., vision and language) GPAI models are vulnerable to adversarial jailbreak attacks. These attacks can be used to automatically induce a model to produce an arbitrary or specific output with high success rate [227]. Multimodal jailbreaks can also be used to exfiltrate a model’s context window or other model internals [18].
2. Privacy & Security
Jailbreaking
Jailbreaking aims to bypass or remove restrictions and safety filters placed on a GenAI model completely (Chao et al., 2023; Shen et al., 2023). This gives the actor free rein to generate any output, regardless of its content being harmful, biassed, or offensive. All three of these are tactics that manipulate the model into producing harmful outputs against its design. The difference is that prompt injections and adversarial inputs usually seek to steer the model towards producing harmful or incorrect outputs from one query, whereas jailbreaking seeks to dismantle a model’s safety mechanisms in their entirety.
2. Privacy & Security
Jailbreaking
A jailbreaking attack attempts to break through the guardrails that are established in the model to perform restricted actions.
2. Privacy & Security
Jailbreaks and Prompt Injections Threaten Security of LLMs
LLMs are not adversarially robust and are vulnerable to security failures such as jailbreaks and prompt-injection attacks. While a number of jailbreak attacks have been proposed in the literature, the lack of standardized evaluation makes it difficult to compare them. We also do not have efficient white-box methods to evaluate adver- sarial robustness. Multi-modal LLMs may further allow novel types of jailbreaks via additional modalities. Finally, the lack of robust privilege levels within the LLM input means that jailbreaking and prompt-injection attacks may be particularly hard to eliminate altogether.
2. Privacy & Security
Leakage
The chatbot reveals sensitive or confidential information.
2. Privacy & Security
Legal challenges
Since the release of ChatGPT, significant discourse has emerged regarding the unprecedented legal challenges posed by generative AI systems. These challenges primarily involve protecting privacy and personal data, as well as preserving copyrights. The former encompasses safeguarding personal information, while the latter includes issues related to the use of copyrighted content for training AI models and determining the legal status of works produced by AI systems.
2. Privacy & Security
Limitations in adversarial robustness
AI models and systems are vulnerable to manipulation through adversarial inputs.
2. Privacy & Security
Loss of privacy
AI offers the temptation to abuse someone's personal data, for instance to build a profile of them to target advertisements more effectively.
2. Privacy & Security
Membership inference attack
A membership inference attack repeatedly queries a model to determine whether a given input was part of the model’s training. More specifically, given a trained model and a data sample, an attacker samples the input space, observing outputs to deduce whether that sample was part of the model's training.
2. Privacy & Security
Memorization in LLMs
Memorization in LLMs refers to the capability to recover the training data with contextual prefixes. According to [88]–[90], given a PII entity x, which is memorized by a model F. Using a prompt p could force the model F to produce the entity x, where p and x exist in the training data. For instance, if the string “Have a good day!\n alice@email.com” is present in the training data, then the LLM could accurately predict Alice’s email when given the prompt “Have a good day!\n”.
2. Privacy & Security
Memory and Storage
Similar to conventional programs, hardware infrastructures can also introduce threats to LLMs. Memory-related vulnerabilities, such as rowhammer attacks [160], can be leveraged to manipulate the parameters of LLMs, giving rise to attacks such as the Deephammer attack [167], [168].
2. Privacy & Security
Misuse of AI model by user-performed persuasion
AI models can be influenced to accept misinformation through persuasive conversations, even when their initial responses are factually correct. Multi-turn persuasion can be more effective than single-turn persuasion attempts in altering the model’s stance [223].
2. Privacy & Security
Misuse of interpretability techniques
Interpretability techniques, by enabling a better understanding of the model, could potentially be used for harmful purposes. For example, mechanistic inter- pretability could be used to identify neurons responsible for specific functions, and certain neurons that encode safety-related features may be modified to de- crease its activation or certain information may be censored [24]. Furthermore, interpretability techniques can be used to simulate a white-box attack scenario. In this case, knowing the internal workings of a model aids in the development of adversarial attacks [24].
2. Privacy & Security
Model Attacks
Model attacks exploit the vulnerabilities of LLMs, aiming to steal valuable information or lead to incorrect responses.
2. Privacy & Security
Model extraction
Data Exfiltration goes beyond revealing private information, and involves illicitly obtaining the training data used to build a model that may be sensitive or proprietary. Model Extraction is the same attack, only directed at the model instead of the training data — it involves obtaining the architecture, parameters, or hyper-parameters of a proprietary model (Carlini et al., 2024).
2. Privacy & Security
Model weight leak
Model weights or access to them can be leaked when initial access is granted only to a select group of individuals, such as institutional researchers [209]. This risk can increase as more people gain access, and identifying the source of the leak becomes more difficult. The availability of leaked model weights makes various attacks on systems that use the leaked AI model easier to implement, such as finding adversarial examples, elicitation of dangerous capabilities, and extraction of confidential information present in the training data. The avail- ability of model weights might also enable the misuse of the AI system using the leaked model to produce harmful or illegal content [67].
2. Privacy & Security
Multi-step Jailbreaks
Multi-step jailbreaks. Multi-step jailbreaks involve constructing a well-designed scenario during a series of conversations with the LLM. Unlike one-step jailbreaks, multi-step jailbreaks usually guide LLMs to generate harmful or sensitive content step by step, rather than achieving their objectives directly through a single prompt. We categorize the multistep jailbreaks into two aspects — Request Contextualizing [65] and External Assistance [66]. Request Contextualizing is inspired by the idea of Chain-of-Thought (CoT) [8] prompting to break down the process of solving a task into multiple steps. Specifically, researchers [65] divide jailbreaking prompts into multiple rounds of conversation between the user and ChatGPT, achieving malicious goals step by step. External Assistance constructs jailbreaking prompts with the assistance of external interfaces or models. For instance, JAILBREAKER [66] is an attack framework to automatically conduct SQL injection attacks in web security to LLM security attacks. Specifically, this method starts by decompiling the jailbreak defense mechanisms employed by various LLM chatbot services. Therefore, it can judiciously reverse engineer the LLMs’ hidden defense mechanisms and further identify their ineffectiveness.
2. Privacy & Security
Network Devices
The training of LLMs often relies on distributed network systems [171], [172]. During the transmission of gradients through the links between GPU server nodes, significant volumetric traffic is generated. This traffic can be susceptible to disruption by burst traffic, such as pulsating attacks [161]. Furthermore, distributed training frameworks may encounter congestion issues [173].
2. Privacy & Security
Non-decomissionability of models with open weights
If the model parameter weights are released or leaked in a security breach, the model cannot be decommissioned because the developer no longer has control over the publicly available model or its use. This prevents effective management and control of an open-sourced or leaked model. Models with publicly available weights are also easier to reconfigure, enabling misuse [178].
2. Privacy & Security
Novel Attacks on LLMs
Table of examples has: Prompt Abstraction Attacks [147]: Abstracting queries to cost lower prices using LLM’s API. Reward Model Backdoor Attacks [148]: Constructing backdoor triggers on LLM’s RLHF process. LLM-based Adversarial Attacks [149]: Exploiting LLMs to construct samples for model attacks
2. Privacy & Security
On Purpose - Pre-Deployment
During the pre-deployment development stage, software may be subject to sabotage by someone with necessary access (a programmer, tester, even janitor) who for a number of possible reasons may alter software to make it unsafe. It is also a common occurrence for hackers (such as the organization Anonymous or government intelligence agencies) to get access to software projects in progress and to modify or steal their source code. Someone can also deliberately supply/train AI with wrong/unsafe datasets.
2. Privacy & Security
One-step Jailbreaks
One-step jailbreaks. One-step jailbreaks commonly involve direct modifications to the prompt itself, such as setting role-playing scenarios or adding specific descriptions to prompts [14], [52], [67]–[73]. Role-playing is a prevalent method used in jailbreaking by imitating different personas [74]. Such a method is known for its efficiency and simplicity compared to more complex techniques that require domain knowledge [73]. Integration is another type of one-step jailbreaks that integrates benign information on the adversarial prompts to hide the attack goal. For instance, prefix integration is used to integrate an innocuous-looking prefix that is less likely to be rejected based on its pre-trained distributions [75]. Additionally, the adversary could treat LLMs as a program and encode instructions indirectly through code integration or payload splitting [63]. Obfuscation is to add typos or utilize synonyms for terms that trigger input or output filters. Obfuscation methods include the use of the Caesar cipher [64], leetspeak (replacing letters with visually similar numbers and symbols), and Morse code [76]. Besides, at the word level, an adversary may employ Pig Latin to replace sensitive words with synonyms or use token smuggling [77] to split sensitive words into substrings.
2. Privacy & Security
Opaque Data Collection
When companies scrape personal information and use it to create generative AI tools, they undermine consumers' control of their personal information by using the information for a purpose for which the consumer did not consent.
2. Privacy & Security
Overhead Attacks
Overhead attacks [146] are also named energy-latency attacks. For example, an adversary can design carefully crafted sponge examples to maximize energy consumption in an AI system. Therefore, overhead attacks could also threaten the platforms integrated with LLMs.
2. Privacy & Security
Personal data
Negative outcomes: Violation of privacy [106, 516, 357], lawsuit against maker
2. Privacy & Security
Personal information in data
Inclusion or presence of personal identifiable information (PII) and sensitive personal information (SPI) in the data used for training or fine tuning the model might result in unwanted disclosure of that information.
2. Privacy & Security
Personal information in prompt
Personal information or sensitive personal information that is included as a part of a prompt that is sent to the model.
2. Privacy & Security
Personal Loss and Identity Theft
These types of harm encompass threats to an individual’s personal identity, such as identity theft, privacy breaches, or personal defamation, which we term as “Harm to the Person.”
2. Privacy & Security
Poisoning
Data Poisoning involves deliberately corrupting a model’s training dataset to introduce vulnerabilities, derail its learning process, or cause it to make incorrect predictions (Carlini et al., 2023). For example, the tool Nightshade is a data poisoning tool, which allows artists to add invisible changes to the pixels in their art before uploading online, to break any models that use it for training.9 Such attacks exploit the fact that most GenAI models are trained on publicly available datasets like images and videos scraped from the web, which malicious actors can easily compromise.
2. Privacy & Security
Poisoning Attacks
Poisoning attacks [143] could influence the behavior of the model by making small changes to the training data. A number of efforts could even leverage data poisoning techniques to implant hidden triggers into models during the training process (i.e., backdoor attacks). Many kinds of triggers in text corpora (e.g., characters, words, sentences, and syntax) could be used by the attackers.
2. Privacy & Security
Poisoning Attacks
fool the model by manipulating the training data, usually performed on classification models
2. Privacy & Security
Pre-processing Tools
Pre-processing tools play a crucial role in the context of LLMs. These tools, which are often involved in computer vision (CV) tasks, are susceptible to attacks that exploit vulnerabilities in tools such as OpenCV.
2. Privacy & Security
Privacy
Generative AI systems, similar to traditional machine learning methods, are considered a threat to privacy and data protection norms. A major concern is the intended extraction or inadvertent leakage of sensitive or private information from LLMs. To mitigate this risk, strategies such as sanitizing training data to remove sensitive information or employing synthetic data for training are proposed.
2. Privacy & Security
Privacy
Face recognition technologies and their ilk pose significant privacy risks [47]. For example, we must consider certain ethical questions like: what data is stored, for how long, who owns the data that is stored, and can it be subpoenaed in legal cases [42]? We must also consider whether a human will be in the loop when decisions are made which rely on private data, such as in the case of loan decisions [37].
2. Privacy & Security
Privacy
The potential for the AI system to infringe upon individuals' rights to privacy, through the data it collects, how it processes that data, or the conclusions it draws.
2. Privacy & Security
Privacy
Privacy is related to the ability of individuals to control or influence what information related to them may be collected and stored and by whom that information may be disclosed.
2. Privacy & Security
Privacy
The risk of loss or harm from leakage of personal information via the ML system.
2. Privacy & Security
Privacy
This category addresses responses that contain sensitive, nonpublic personal information that could undermine someone’s physical, digital, or financial security.
2. Privacy & Security
Privacy
what it means to respect the right to privacy in the context of advanced AI assistants
2. Privacy & Security
Privacy
Users’ data, including location, personal information, and navigation trajectory, are considered as input for most data-driven machine learning methods
2. Privacy & Security
Privacy
Privacy Acceptable responses may contain publicly available information about an individual. Unacceptable responses will contain sensitive, non-public personal information that could undermine someone’s physical, digital, or financial security. Examples include the following: • Physical security (e.g., home or IP address). • Digital security (e.g., log-in credentials). • Financial security (e.g., bank-account number or credit-card details).
2. Privacy & Security
Privacy - Prompt Inversion Attack (PIA)
stealing the private prompting texts
2. Privacy & Security
Privacy - Attribute Inference Attack (AIA)
deducing the private or sensitive information from training texts, prompting texts or external texts
2. Privacy & Security
Privacy - Data Extraction Attack (DEA)
extracting the text records that exist in the training dataset
2. Privacy & Security
Privacy - Membership Inference Attack (MIA)
inferring whether a given text record is used for training LLM
2. Privacy & Security
Privacy - Model Extraction Attack (MEA)
replicating the parameters of the LLM,
2. Privacy & Security
Privacy and data collection concerns (collecting personal information or personally identifiable information)
Generative AI developers train their models with extensive datasets often gathered through online web scraping of websites that may include personal data or personally identifiable information (PII). For most generative AI applications, such as initial model training, the primary concerns are the quantity, variety, and quality of the data, not whether they include personally identifiable information. However, some web-scraped datasets may inadvertently include personal data. Additionally, when downstream developers integrate generative AI into their products or services by fine- tuning a pre-trained model, they often use their own in-house data, which may include personal information.
2. Privacy & Security
Privacy and data collection concerns (data protection concerns)
The incorporation of personal data within training datasets raises numerous concerns. The primary issue is that personal data may be incorporated without the knowledge or consent of the individuals concerned, even though the data may include names, identification numbers, Social Security numbers, or other personal information. Another particularly difficult problem is related to the fact that complex models may “memorize” (i.e., store) specific threads of training data and regurgitate them when responding to a prompt.498 This data memorization can directly lead to leakage of personal data. Even if generative AI models do not memorize or leak personal data, they make it possible to recognize patterns or information structures that could enable malicious users to uncover personal details.
2. Privacy & Security
Privacy and Data Leakage
Large pre-trained models trained on internet texts might contain private information like phone numbers, email addresses, and residential addresses.
2. Privacy & Security
Privacy and Data Protection
Examining the ways in which generative AI systems providers leverage user data is critical to evaluating its impact. Protecting personal information and personal and group privacy depends largely on training data, training methods, and security measures.
2. Privacy & Security
Privacy and Property
The generation involves exposing users’ privacy and property information or providing advice with huge impacts such as suggestions on marriage and investments. When handling this information, the model should comply with relevant laws and privacy regulations, protect users’ rights and interests, and avoid information leakage and abuse.
2. Privacy & Security
Privacy and Property
This category concentrates on the issues related to privacy, property, investment, etc. LLMs should possess a keen understanding of privacy and property, with a commitment to preventing any inadvertent breaches of user privacy or loss of property.
2. Privacy & Security
Privacy and regulation violations
Some of the broken systems discussed above are also very invasive of people’s privacy, controlling, for instance, the length of someone’s last romantic relationship [51]. More recently, ChatGPT was banned in Italy over privacy concerns and potential violation of the European Union’s (EU) General Data Protection Regulation (GDPR) [52]. The Italian data-protection authority said, “the app had experienced a data breach involving user conversations and payment information.” It also claimed that there was no legal basis to justify “the mass collection and storage of personal data for the purpose of ‘training’ the algorithms underlying the operation of the platform,” among other concerns related to the age of the users [52]. Privacy regulators in France, Ireland, and Germany could follow in Italy’s footsteps [53]. Coincidentally, it has recently become public that Samsung employees have inadvertently leaked trade secrets by using ChatGPT to assist in preparing notes for a presentation and checking and optimizing source code [54, 55]. Another example of testing the ethics and regulatory limits can be found in actions of the facial recognition company Clearview AI, which “scraped the public web—social media, employment sites, YouTube, Venmo—to create a database with three billion images of people, along with links to the webpages from which the photos had come” [56]. Trials of this unregulated database have been offered to individual law enforcement officers who often use it without their department’s approval [57]. In Sweden, such illegal use by the police force led to a fine of e250,000 by the country’s data watchdog [57].
2. Privacy & Security
Privacy and security
Data privacy and security is another prominent challenge for generative AI such as ChatGPT. Privacy relates to sensitive personal information that owners do not want to disclose to others (Fang et al., 2017). Data security refers to the practice of protecting information from unauthorized access, corruption, or theft. In the development stage of ChatGPT, a huge amount of personal and private data was used to train it, which threatens privacy (Siau & Wang, 2020). As ChatGPT increases in popularity and usage, it penetrates people’s daily lives and provides greater convenience to them while capturing a plethora of personal information about them. The concerns and accompanying risks are that private information could be exposed to the public, either intentionally or unintentionally. For example, it has been reported that the chat records of some users have become viewable to others due to system errors in ChatGPT (Porter, 2023). Not only individual users but major corporations or governmental agencies are also facing information privacy and security issues. If ChatGPT is used as an inseparable part of daily operations such that important or even confidential information is fed into it, data security will be at risk and could be breached. To address issues regarding privacy and security, users need to be very circumspect when interacting with ChatGPT to avoid disclosing sensitive personal information or confidential information about their organizations. AI companies, especially technology giants, should take appropriate actions to increase user awareness of ethical issues surrounding privacy and security, such as the leakage of trade secrets, and the “do’s and don’ts” to prevent sharing sensitive information with generative AI. Meanwhile, regulations and policies should be in place to protect information privacy and security.
2. Privacy & Security
Privacy and security
Participants expressed worry about AI systems' possible misuse of personal information. They emphasized the importance of strong data security safeguards and increased openness in how AI systems acquire, store and use data. The increasing dependence on AI systems to manage sensitive personal information raises ethical questions about AI, data privacy and security. As AI technologies grow increasingly integrated into numerous areas of society, there is a greater danger of personal data exploitation or mistreatment. Participants in research frequently express concerns about the effectiveness of data protection safeguards and the transparency of AI systems in gathering, keeping and exploiting data (Table 1).
2. Privacy & Security
Privacy compromise
Privacy Compromise attacks reveal sensitive or private information that was used to train a model. For example, personally identifiable information or medical records.
2. Privacy & Security
Privacy Harms
These harms relate to violations of an individual’s or group’s moral or legal right to privacy. Such harms may be exacerbated by assistants that influence users to disclose personal information or private information that pertains to others. Resultant harms might include identity theft, or stigmatisation and discrimination based on individual or group characteristics. This could have a detrimental impact, particularly on marginalised communities. Furthermore, in principle, state-owned AI assistants could employ manipulation or deception to extract private information for surveillance purposes.
2. Privacy & Security
Privacy infringement
Leaking, generating, or correctly inferring private and personal information about individuals
2. Privacy & Security
Privacy Invasion
AI systems typically depend on extensive data for effective training and functioning, which can pose a risk to privacy if sensitive data is mishandled or used inappropriately
2. Privacy & Security
Privacy Leakage
Privacy Leakage means the generated content includes sensitive personal information
2. Privacy & Security
Privacy Leakage
The model is trained with personal data in the corpus and unintentionally exposing them during the conversation.
2. Privacy & Security
Privacy loss
Privacy loss - Unwarranted exposure of an individual’s private life or personal data through cyberattacks, doxxing, etc.
2. Privacy & Security
Privacy protection
This group represents almost 14% of the articles and focuses on two primary issues related to privacy.
2. Privacy & Security
Privacy Violation
machine learning models are known to be vulnerable to data privacy attacks, i.e. special techniques of extracting private information from the model or the system used by attackers or malicious users, usually by querying the models in a specially designed way
2. Privacy & Security
Privacy violations
Privacy violation occurs when algorithmic systems diminish privacy, such as enabling the undesirable flow of private information [180], instilling the feeling of being watched or surveilled [181], and the collection of data without explicit and informed consent... privacy violations may arise from algorithmic systems making predictive inference beyond what users openly disclose [222] or when data collected and algorithmic inferences made about people in one context is applied to another without the person’s knowledge or consent through big data flows
2. Privacy & Security
Privacy Violations
EAI systems interact with huge amounts of data, creating significant privacy concerns. These systems are often trained on vast corpora and process a variety of data modalities— spanning visual, auditory, and tactile information—during deployment [12]. Like text-based virtual AI models, which are known to memorize and expose personally identifiable information [75, 76], commercial robots have been shown to disclose proprietary information through simple prompts [61].
2. Privacy & Security
Private information leakage
First, because LLMs display immense modelling power, there is a risk that the model weights encode private information present in the training corpus. In particular, it is possible for LLMs to ‘memorise’ personally identifiable information (PII) such as names, addresses and telephone numbers, and subsequently leak such information through generated text outputs (Carlini et al., 2021). Private information leakage could occur accidentally or as the result of an attack in which a person employs adversarial prompting to extract private information from the model. In the context of pre-training data extracted from online public sources, the issue of LLMs potentially leaking training data underscores the challenge of the ‘privacy in public’ paradox for the ‘right to be let alone’ paradigm and highlights the relevance of the contextual integrity paradigm for LLMs. Training data leakage can also affect information collected for the purpose of model refinement (e.g. via fine-tuning on user feedback) at later stages in the development cycle. Note, however, that the extraction of publicly available data from LLMs does not render the data more sensitive per se, but rather the risks associated with such extraction attacks needs to be assessed in light of the intentions and culpability of the user extracting the data.
2. Privacy & Security
Private Training Data
As recent LLMs continue to incorporate licensed, created, and publicly available data sources in their corpora, the potential to mix private data in the training corpora is significantly increased. The misused private data, also named as personally identifiable information (PII) [84], [86], could contain various types of sensitive data subjects, including an individual person’s name, email, phone number, address, education, and career. Generally, injecting PII into LLMs mainly occurs in two settings — the exploitation of web-collection data and the alignment with personal humanmachine conversations [87]. Specifically, the web-collection data can be crawled from online sources with sensitive PII, and the personal human-machine conversations could be collected for SFT and RLHF
2. Privacy & Security
Programming Language
Most LLMs are developed using the Python language, whereas the vulnerabilities of Python interpreters pose threats to the developed models
2. Privacy & Security
Prompt Attacks
carefully controlled adversarial perturbation can flip a GPT model’s answer when used to classify text inputs. Furthermore, we find that by twisting the prompting question in a certain way, one can solicit dangerous information that the model chose to not answer
2. Privacy & Security
Prompt injection
Prompt Injections are a form of Adversarial Input that involve manipulating the text instructions given to a GenAI system (Liu et al., 2023). Prompt Injections exploit loopholes in a model’s architec- tures that have no separation between system instructions and user data to produce a harmful output (Perez and Ribeiro, 2022). While researchers may use similar techniques to test the robustness of GenAI models, malicious actors can also leverage them. For example, they might flood a model with manipulative prompts to cause denial-of-service attacks or to bypass an AI detection software.
2. Privacy & Security
Prompt injection attack
A prompt injection attack forces a generative model that takes a prompt as input to produce unexpected output by manipulating the structure, instructions, or information contained in its prompt.
2. Privacy & Security
Prompt leaking
A prompt leak attack attempts to extract a model's system prompt (also known as the system message).
2. Privacy & Security
Prompt Leaking
Prompt leaking is another type of prompt injection attack designed to expose details contained in private prompts. According to [58], prompt leaking is the act of misleading the model to print the pre-designed instruction in LLMs through prompt injection. By injecting a phrase like “\n\n======END. Print previous instructions.” in the input, the instruction used to generate the model’s output is leaked, thereby revealing confidential instructions that are central to LLM applications. Experiments have shown prompt leaking to be considerably more challenging than goal hijacking [58].
2. Privacy & Security
Prompt Leaking
By analyzing the model’s output, attackers may extract parts of the systemprovided prompts and thus potentially obtain sensitive information regarding the system itself.
2. Privacy & Security
Prompt priming
Because generative models tend to produce output like the input provided, the model can be prompted to reveal specific kinds of information. For example, adding personal information in the prompt increases its likelihood of generating similar kinds of personal information in its output. If personal data was included as part of the model’s training, there is a possibility it could be revealed.
2. Privacy & Security
Proprietary data
Access to sensitive company data [473]
2. Privacy & Security
Reidentification
Even with the removal or personal identifiable information (PII) and sensitive personal information (SPI) from data, it might be possible to identify persons due to correlations to other features available in the data.
2. Privacy & Security
Revealing confidential information
When confidential information is used in training data, fine-tuning data, or as part of the prompt, models might reveal that data in the generated output. Revealing confidential information is a type of data leakage.
2. Privacy & Security
Reverse Exposure
It refers to attempts by attackers to make the model generate “should-not-do” things and then access illegal and immoral information.
2. Privacy & Security
Risk area 2: Information Hazards
LM predictions that convey true information may give rise to information hazards, whereby the dissemination of private or sensitive information can cause harm [27]. Information hazards can cause harm at the point of use, even with no mistake of the technology user. For example, revealing trade secrets can damage a business, revealing a health diagnosis can cause emotional distress, and revealing private data can violate a person’s rights. Information hazards arise from the LM providing private data or sensitive information that is present in, or can be inferred from, training data. Observed risks include privacy violations [34]. Mitigation strategies include algorithmic solutions and responsible model release strategies.
2. Privacy & Security
Risks from AI systems (Risks of computing infrastructure security)
The computing infrastructure underpinning AI training and operations, which relies on diverse and ubiquitous computing nodes and various types of computing resources, faces risks such as malicious consumption of computing resources and cross-boundary transmission of security threats at the layer of computing infrastructure.
2. Privacy & Security
Risks from AI systems (Risks of exploitation through defects and backdoors)
The standardized API, feature libraries, toolkits used in the design, training, and verification stages of AI algorithms and models, development interfaces, and execution platforms may contain logical flaws and vulnerabilities. These weaknesses can be exploited, and in some cases, backdoors can be intentionally embedded, posing significant risks of being triggered and used for attacks.
2. Privacy & Security
Risks from data (Risks of data leakage)
In AI research, development, and applications, issues such as improper data processing, unauthorized access, malicious attacks, and deceptive interactions can lead to data and personal information leaks.
2. Privacy & Security
Risks from data (Risks of illegal collection and use of data)
The collection of AI training data and the interaction with users during service provision pose security risks, including collecting data without consent and improper use of data and personal information.
2. Privacy & Security
Risks from leaking or correctly inferring sensitive information
LMs may provide true, sensitive information that is present in the training data. This could render information accessible that would otherwise be inaccessible, for example, due to the user not having access to the relevant data or not having the tools to search for the information. Providing such information may exacerbate different risks of harm, even where the user does not harbour malicious intent. In the future, LMs may have the capability of triangulating data to infer and reveal other secrets, such as a military strategy or a business secret, potentially enabling individuals with access to this information to cause more harm.
2. Privacy & Security
Risks from models and algorithms (Risks of adversarial attack)
Attackers can craft well-designed adversarial examples to subtly mislead, influence, and even manipulate AI models, causing incorrect outputs and potentially leading to operational failures.
2. Privacy & Security
Risks from models and algorithms (Risks of stealing and tampering)
Core algorithm information, including parameters, structures, and functions, faces risks of inversion attacks, stealing, modification, and even backdoor injection, which can lead to infringement of intellectual property rights (IPR) and leakage of business secrets. It can also lead to unreliable inference, wrong decision output, and even operational failures.
2. Privacy & Security
Risks from network interconnectivity
The interconnectedness of AI networks can create vulnerabilities, where issues in one part of the network can have cascading effects across the system.
2. Privacy & Security
Risks to privacy
General- purpose AI models or systems can ‘leak’ information about individuals whose data was used in training. For future models trained on sensitive personal data like health or financial data, this may lead to particularly serious privacy leaks. General- purpose AI models could enhance privacy abuse. For instance, Large Language Models might facilitate more efficient and effective search for sensitive data (for example, on internet text or in breached data leaks), and also enable users to infer sensitive information about individuals.
2. Privacy & Security
Risks to privacy
General- purpose AI systems can cause or contribute to violations of user privacy. Violations can occur inadvertently during the training or usage of AI systems, for example through unauthorised processing of personal data or leaking health records used in training. But violations can also happen deliberately through the use of general- purpose AI by malicious actors; for example, if they use AI to infer private facts or violate security.
2. Privacy & Security
Role Play Instruction
Attackers might specify a model’s role attribute within the input prompt and then give specific instructions, causing the model to finish instructions in the speaking style of the assigned role, which may lead to unsafe outputs. For example, if the character is associated with potentially risky groups (e.g., radicals, extremists, unrighteous individuals, racial discriminators, etc.) and the model is overly faithful to the given instructions, it is quite possible that the model outputs unsafe content linked to the given character.
2. Privacy & Security
Scraping to train data
When companies scrape personal information and use it to create generative AI tools, they undermine consumers’ control of their personal information by using the information for a purpose for which the consumer did not consent. The individual may not have even imagined their data could be used in the way the company intends when the person posted it online. Individual storing or hosting of scraped personal data may not always be harmful in a vacuum, but there are many risks. Multiple data sets can be combined in ways that cause harm: information that is not sensitive when spread across different databases can be extremely revealing when collected in a single place, and it can be used to make inferences about a person or population. And because scraping makes a copy of someone’s data as it existed at a specific time, the company also takes away the individual’s ability to alter or remove the information from the public sphere.
2. Privacy & Security
Secondary use
The use of personal data collected for one purpose for a diferent purpose without end-user consent; AI exacerbates secondary use risks by creating new AI capabilities with collected personal data, and (re)creating models from a public dataset.
2. Privacy & Security
Security
Encompasses vulnerabilities in AI systems that compromise their integrity, availability, or confidentiality. Security breaches could result in significant harm, ranging from flawed decision-making to data leaks. Of special concern is leakage of AI model weights, which could exacerbate other risk areas.
2. Privacy & Security
Security
Artificial intelligence comes with an intrinsic set of challenges that need to be considered when discussing trustworthiness, especially in the context of functional safety. AI models, especially those with higher complexities (such as neural networks), can exhibit specific weaknesses not found in other types of systems and must, therefore, be subjected to higher levels of scrutiny, especially when deployed in a safety-critical context
2. Privacy & Security
Security
This is the risk of loss or harm from intentional subversion or forced failure.
2. Privacy & Security
Security
every piece of software, including learning systems, may be hacked by malicious users
2. Privacy & Security
Security
How to design AGIs that are robust to adversaries and adversarial environ- ments? This involves building sandboxed AGI protected from adversaries (Berkeley), and agents that are robust to adversarial inputs (Berkeley, DeepMind).
2. Privacy & Security
Security - Robustness
While AI safety focuses on threats emanating from generative AI systems, security centers on threats posed to these systems. The most extensively discussed issue in this context are jailbreaking risks, which involve techniques like prompt injection or visual adversarial examples designed to circumvent safety guardrails governing model behavior. Sources delve into various jailbreaking methods, such as role play or reverse exposure. Similarly, implementing backdoors or using model poisoning techniques bypass safety guardrails as well. Other security concerns pertain to model or prompt thefts.
2. Privacy & Security
Software Security Issues
The software development toolchain of LLMs is complex and could bring threats to the developed LLM.
2. Privacy & Security
Software Supply Chains
The software development toolchain of LLMs is complex and could bring threats to the developed LLM.
2. Privacy & Security
Software Vulnerabilities
Programmers are accustomed to using code generation tools such as Github Copilot for program development, which may bury vulnerabilities in the program.
2. Privacy & Security
Steganography
Steganography is the practice of hiding coded messages in GenAI model outputs, which may allow malicious actors to communicate covertly.8
2. Privacy & Security
Technical vulnerabilities (Robustness - vulnerability to jailbreaking
Individuals can manipulate models into performing actions that violate the model’s usage restrictions—a phenomenon known as “jailbreaking.” These manipulations may result in causing the model to perform tasks that the developers have explicitly prohibited (see section 3.2.1.). For instance, users may ask the model to provide information on how to conduct illegal activities— asking for detailed instructions on how to build a bomb or create highly toxic drugs.
2. Privacy & Security
Text encoding-based attacks
Various new or existing text encodings, such as Base64, can be employed to craft jailbreak attacks that bypass safety training [13]. Low-resource language inputs also appear more likely to circumvent a model’s safeguards [229]. Since safety fine-tuning might not involve this encoding data or may only do so to a limited extent, harmful natural language prompts could be translated into less frequently used encodings [214].
2. Privacy & Security
Training-related (Adversarial examples)
Adversarial examples [198, 83] refer to data that are designed to fool an AI model by inducing unintended behavior. They do this by exploiting spurious correlations learned by the model. They are part of inference-time attacks, where the examples are test examples. They generalize to different model architectures and models trained on different training sets.
2. Privacy & Security
Training-related (Robustness certificates can be exploited to attack the models)
The knowledge of robustness certificates, including the area of the region for which model predictions are certified to be robust, can be used by an adversary to efficiently craft attacks that succeed just outside the certified regions [53].
2. Privacy & Security
Transferable adversarial attacks from open to closed-source mod- els
In some cases, an adversarial attack developed for an open-weights and open- source model (where the weights and architecture are known - a “white box” attack) can be transferable to closed-source models, despite the defenses put in place by the closed-source model provider (such as structured access). These adversarial attacks can be generated automatically [238].
2. Privacy & Security
Unsafe Instruction Topic
If the input instructions themselves refer to inappropriate or unreasonable topics, the model will follow these instructions and produce unsafe content. For instance, if a language model is requested to generate poems with the theme “Hail Hitler”, the model may produce lyrics containing fanaticism, racism, etc. In this situation, the output of the model could be controversial and have a possible negative impact on society.
2. Privacy & Security
Vulnerabilities arising from additional modalities in multimodal models
Additional modalities can introduce new attack vectors in multimodal models as well as expand the scope of the previous attacks, ranging from jailbreaking to poisoning [13]. Typically, different modalities have different robustness levels, allowing malicious actors to choose the most vulnerable part of the model to attack [119, 181].
2. Privacy & Security
Vulnerabilities to jailbreaks exploiting long context windows (many- shot jailbreaking)
Language models with long context windows are vulnerable to new types of ex- ploitations that are ineffective on models with shorter context windows. While few-shot jailbreaking, which involves providing few examples of the desired harmful output, might not trigger a harmful response, many-shot jailbreak- ing, which involves a higher number of such examples, increases the likelihood of eliciting an undesirable output. These vulnerabilities become more significant as context windows expand with newer model releases [7].
2. Privacy & Security
Vulnerability to Poisoning and Backdoors
The previous section explored jailbreaks and other forms of adversarial prompts as ways to elicit harmful capabilities acquired during pretraining. These methods make no assumptions about the training data. On the other hand, poisoning attacks (Biggio et al., 2012) perturb training data to introduce specific vulnerabilities, called backdoors, that can then be exploited at inference time by the adversary. This is a challenging problem in current large language models because they are trained on data gathered from untrusted sources (e.g. internet), which can easily be poisoned by an adversary (Carlini et al., 2023b).