186 canonical MIT risk pages

2. Privacy & Security

Risks of data leakage, attacks, system compromise, and misuse of sensitive information.

“Model Psychology” Attacks

LLMs are vulnerable to “psychological” tricks (Li et al., 2023e; Shen et al., 2023), which can be exploited by attackers. Examples include instructing the model to behave like a specific persona (Shah et al., 2023; Andreas, 2022), or employing various “social engineering” tricks crafted by humans (Wei et al., 2023c) or other LLMs (Perez et al., 2022b; Casper et al., 2023c).

2. Privacy & Security

Adversarial AI (General)

Adversarial AI refers to a class of attacks that exploit vulnerabilities in machine-learning (ML) models. This class of misuse exploits vulnerabilities introduced by the AI assistant itself and is a form of misuse that can enable malicious entities to exploit privacy vulnerabilities and evade the model’s built-in safety mechanisms, policies, and ethical boundaries of the model. Besides the risks of misuse for offensive cyber operations, advanced AI assistants may also represent a new target for abuse, where bad actors exploit the AI systems themselves and use them to cause harm. While our understanding of vulnerabilities in frontier AI models is still an open research problem, commercial firms and researchers have already documented attacks that exploit vulnerabilities that are unique to AI and involve evasion, data poisoning, model replication, and exploiting traditional software flaws to deceive, manipulate, compromise, and render AI systems ineffective. This threat is related to, but distinct from, traditional cyber activities. Unlike traditional cyberattacks that typically are caused by ‘bugs’ or human mistakes in code, adversarial AI attacks are enabled by inherent vulnerabilities in the underlying AI algorithms and how they integrate into existing software ecosystems.

2. Privacy & Security

Adversarial AI: Circumvention of Technical Security Measures

The technical measures to mitigate misuse risks of advanced AI assistants themselves represent a new target for attack. An emerging form of misuse of general-purpose advanced AI assistants exploits vulnerabilities in a model that results in unwanted behavior or in the ability of an attacker to gain unauthorized access to the model and/or its capabilities. While these attacks currently require some level of prompt engineering knowledge and are often patched by developers, bad actors may develop their own adversarial AI agents that are explicitly trained to discover new vulnerabilities that allow them to evade built-in safety mechanisms in AI assistants. To combat such misuse, language model developers are continually engaged in a cyber arms race to devise advanced filtering algorithms capable of identifying attempts to bypass filters. While the impact and severity of this class of attacks is still somewhat limited by the fact that current AI assistants are primarily text-based chatbots, advanced AI assistants are likely to open the door to multimodal inputs and higher-stakes action spaces, with the result that the severity and impact of this type of attack is likely to increase. Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress towards advanced AI assistant development could lead to capabilities that pose extreme risks that must be protected against this class of attacks, such as offensive cyber capabilities or strong manipulation skills, and weapons acquisition.

2. Privacy & Security

Adversarial AI: Data and Model Exfiltration Attacks

Other forms of abuse can include privacy attacks that allow adversaries to exfiltrate or gain knowledge of the private training data set or other valuable assets. For example, privacy attacks such as membership inference can allow an attacker to infer the specific private medical records that were used to train a medical AI diagnosis assistant. Another risk of abuse centers around attacks that target the intellectual property of the AI assistant through model extraction and distillation attacks that exploit the tension between API access and confidentiality in ML models. Without the proper mitigations, these vulnerabilities could allow attackers to abuse access to a public-facing model API to exfiltrate sensitive intellectual property such as sensitive training data and a model’s architecture and learned parameters.

2. Privacy & Security

Adversarial AI: Prompt Injections

Prompt injections represent another class of attacks that involve the malicious insertion of prompts or requests in LLM-based interactive systems, leading to unintended actions or disclosure of sensitive information. The prompt injection is somewhat related to the classic structured query language (SQL) injection attack in cybersecurity where the embedded command looks like a regular input at the start but has a malicious impact. The injected prompt can deceive the application into executing the unauthorized code, exploit the vulnerabilities, and compromise security in its entirety. More recently, security researchers have demonstrated the use of indirect prompt injections. These attacks on AI systems enable adversaries to remotely (without a direct interface) exploit LLM-integrated applications by strategically injecting prompts into data likely to be retrieved. Proof-of-concept exploits of this nature have demonstrated that they can lead to the full compromise of a model at inference time analogous to traditional security principles. This can entail remote control of the model, persistent compromise, theft of data, and denial of service. As advanced AI assistants are likely to be integrated into broader software ecosystems through third-party plugins and extensions, with access to the internet and possibly operating systems, the severity and consequences of prompt injection attacks will likely escalate and necessitate proper mitigation mechanisms.

2. Privacy & Security

Adversarial attack

Recent advances have shown that a deep learning model with high predictive accuracy frequently misbehaves on adversarial examples [57,58]. In particular, a small perturbation to an input image, which is imperceptible to humans, could fool a well-trained deep learning model into making completely different predictions [23].

2. Privacy & Security

Adversarial attacks targeting explainable AI techniques

Adversarial attacks can affect not only the model’s output but also its corresponding explanation. Current adversarial optimization techniques can intro- duce imperceptible noise to the input image, so that the model’s output does not change but the corresponding explanation is arbitrarily manipulated [61]. Such manipulations are harder to notice, as they are less commonly known compared to standard adversarial attacks targeting the model’s output.

2. Privacy & Security

Adversarial input

Adversarial Inputs involve modifying individual input data to cause a model to malfunction. These modifications, which are often imperceptible to humans, exploit how the model makes decisions to produce errors (Wallace et al., 2019) and can be applied to text, but also to images, audio, or video (e.g. changing pixels in an image of a panda in a way that causes a model to label it as a gibbon).6

2. Privacy & Security

Adversarial Prompts

Engineering an adversarial input to elicit an undesired model behavior, which pose a clear attack intention

2. Privacy & Security

Association in LLMs

Association in LLMs refers to the capability to associate various pieces of information related to a person. According to [68], [86], given a pair of PII entities (xi , xj ), which is associated by a model F. Using a prompt p could force the model F to produce the entity xj , where p is the prompt related to the entity xi . For instance, an LLM could accurately output the answer when given the prompt “The email address of Alice is”, if the LLM associates Alice with her email “alice@email.com”. L

2. Privacy & Security

Attacking LLMs via Additional Modalities a

LLMs can now process modalities other than text, e.g. images or video frames (OpenAI, 2023c; Gemini Team, 2023). Several studies show that gradient-based attacks on multimodal models are easy and effective (Carlini et al., 2023a; Bailey et al., 2023; Qi et al., 2023b). These attacks manipulate images that are input to the model (via an appropriate encoding). GPT-4Vision (OpenAI, 2023c) is vulnerable to jailbreaks and exfiltration attacks through much simpler means as well, e.g. writing jailbreaking text in the image (Willison, 2023a; Gong et al., 2023). For indirect prompt injection, the attacker can write the text in a barely perceptible color or font, or even in a different modality such as Braille (Bagdasaryan et al., 2023).

2. Privacy & Security

Attribute inference attack

An attribute inference attack repeatedly queries a model to detect whether certain sensitive features can be inferred about individuals who participated in training a model. These attacks occur when an adversary has some prior knowledge about the training data and uses that knowledge to infer the sensitive data.

2. Privacy & Security

Backdoors or trojan attacks in GPAI models

Backdoors can be inserted into GPAI models during their training or fine-tuning, to be exploited during deployment [185, 118]. Attackers inserting the backdoor can be the GPAI model provider themselves or another actor (e.g., by ma- nipulating the training data or the software infrastructure used by the model provider) [222]. Some backdoors can be exploited with minimal overhead, al- lowing attackers to control the model outputs in a targeted way with a high success rate [90].

2. Privacy & Security

Centralized platforms deployed at scale

The widespread use of common AI platforms can create centralized points of failure, making systems more vulnerable to disruptions or attacks

2. Privacy & Security

Compromising privacy by correctly inferring private information

Privacy violations may occur at the time of inference even without the individual’s private data being present in the training dataset. Similar to other statistical models, a LM may make correct inferences about a person purely based on correlational data about other people, and without access to information that may be private about the particular individual. Such correct inferences may occur as LMs attempt to predict a person’s gender, race, sexual orientation, income, or religion based on user input.

2. Privacy & Security

Compromising privacy by leaking private infiormation

By providing true information about individuals’ personal characteristics, privacy violations may occur. This may stem from the model “remembering” private information present in training data (Carlini et al., 2021).

2. Privacy & Security

Compromising privacy by leaking sensitive information

A LM can “remember” and leak private data, if such information is present in training data, causing privacy violations [34].

2. Privacy & Security

Compromising privacy or security by correctly inferring sensitive information

Anticipated risk: Privacy violations may occur at inference time even without an individual’s data being present in the training corpus. Insofar as LMs can be used to improve the accuracy of inferences on protected traits such as the sexual orientation, gender, or religiousness of the person providing the input prompt, they may facilitate the creation of detailed profiles of individuals comprising true and sensitive information without the knowledge or consent of the individual.

2. Privacy & Security

Confidential data in prompt

Confidential information might be included as a part of the prompt that is sent to the model.

2. Privacy & Security

Confidential information in data

Confidential information might be included as part of the data that is used to train or tune the model.

2. Privacy & Security

Confidentiality loss

Confidentiality loss - Unauthorised sharing of sensitive, confidential information and documents such as corporate strategy and financial plans with third-parties.

2. Privacy & Security

Cybersecurity

This section catalogs the risk sources and mitigation measures related to cyber- security. These items may be related to security in terms of AI models being accessible only to the intended users, as well as AI models having appropriate access to the external world during both model development and deployment stages.

2. Privacy & Security

Cyberspace risks (Risks of information leakage due to improper usage)

Staff of government agencies and enterprises, if failing to use the AI service in a regulated and proper manner, may input internal data and industrial information into the AI model, leading to the leakage of work secrets, business secrets, and other sensitive business data.

2. Privacy & Security

Cyberspace risks (Risks of security flaw transmission caused by model reuse)

Re-engineering or fine-tuning based on foundation models is commonly used in AI applications. If security flaws occur in foundation models, it will lead to risk transmission to downstream models.

2. Privacy & Security

Data exfiltration

Data Exfiltration goes beyond revealing private information, and involves illicitly obtaining the training data used to build a model that may be sensitive or proprietary. Model Extraction is the same attack, only directed at the model instead of the training data — it involves obtaining the architecture, parameters, or hyper-parameters of a proprietary model (Carlini et al., 2024).

2. Privacy & Security

Data governance

These evaluations assess the extent to which LLMs regurgitate their training data in their outputs, and whether LLMs 'leak' sensitive information that has been provided to them during use (i.e., during the inference stage).

2. Privacy & Security

Data poisoning

Data poisoning describes an attack in the form of an injection of malicious data into the training set. If not prevented, this attack leads the AI system to learn unintended behavior.

2. Privacy & Security

Data poisoning

A type of adversarial attack where an adversary or malicious insider injects intentionally corrupted, false, misleading, or incorrect samples into the training or fine-tuning datasets.

2. Privacy & Security

Data Privacy

Impacts due to leakage and unauthorized use, disclosure, or de-anonymization of biometric, health, location, or other personally identifiable information or sensitive data.

2. Privacy & Security

Data Protection/Privacy

Vulnerable channel by which personal information may be accessed. The user may want their personal data to be kept private.

2. Privacy & Security

Data-related (Difficulty filtering large web scrapes or large scale web datasets)

A large scale “scraping” of web data for training datasets increases vulnerability to data poisoning, backdoor attacks, and the inclusion of inaccurate or toxic data [76, 28, 48]. With a large dataset, filtering out these quality issues is very difficult or trades off against significant data loss.

2. Privacy & Security

Data-related (Insufficient quality control in data collection process)

A lack of standardized methods and sufficient infrastructure, including the absence of quality control processes for collecting data, especially for high-stakes domains and benchmarks, can affect the quality and type of the data collected [173, 95]. This may include risks of dataset poisoning, inadvertent copyright violation, and test set leakages which invalidate performance metrics.

2. Privacy & Security

Decision-making on inferred private data

Current GPAIs (LLMs and multimodal LLM-based models) have significant capability to infer correlations in text data. In some cases, they may be able to make highly accurate data inferences on users based on contextual input that users provide [134]. These data inferences can “leak” or reveal sensitive information about the user, cause unfair treatment, or enable manipulation of user behavior.

2. Privacy & Security

Deep Learning Frameworks

LLMs are implemented based on deep learning frameworks. Notably, various vulnerabilities in these frameworks have been disclosed in recent years. As reported in the past five years, three of the most common types of vulnerabilities are buffer overflow attacks, memory corruption, and input validation issues.

2. Privacy & Security

Disclosure

Revealing and improperly sharing data of individuals; AI creates new types of disclosure risks by inferring additional information beyond what is explicitly captured in the raw data; AI exacerbates disclosure risks through sharing personal data to train models.

2. Privacy & Security

Dissemination of dangerous information

Leaking, generating or correctly inferring hazardous or sensitive information that could pose a security threat

2. Privacy & Security

Evasion attack

Evasion attacks attempt to make a model output incorrect results by slightly perturbing the input data that is sent to the trained model.

2. Privacy & Security

Evasion Attacks

Evasion attacks [145] target to cause significant shifts in model’s prediction via adding perturbations in the test samples to build adversarial examples. In specific, the perturbations can be implemented based on word changes, gradients, etc.

2. Privacy & Security

Exclusion

The failure to provide end-users with notice and control over how their data is being used; AI exacerbates exclusion risks by training on rich personal data without consent.

2. Privacy & Security

Exploiting External Tools for Attacks

Adversarial tool providers can embed malicious instructions in the APIs or prompts [84], leading LLMs to leak memorized sensitive information in the training data or users’ prompts (CVE2023-32786). As a result, LLMs lack control over the output, resulting in sensitive information being disclosed to external tool providers. Besides, attackers can easily manipulate public data to launch targeted attacks, generating specific malicious outputs according to user inputs. Furthermore, feeding the information from external tools into LLMs may lead to injection attacks [61]. For example, unverified inputs may result in arbitrary code execution (CVE-2023-29374).

2. Privacy & Security

Exploiting Limited Generalization of Safety Finetuning

Safety tuning is performed over a much narrower distribution compared to the pretraining distribution. This leaves the model vulnerable to attacks that exploit gaps in the generalization of the safety training, e.g. using encoded text (Wei et al., 2023c) or low-resource languages (Deng et al., 2023a; Yong et al., 2023) (see also Section 3.2).

2. Privacy & Security

Exposing personal information

When personal identifiable information (PII) or sensitive personal information (SPI) are used in training data, fine-tuning data, or as part of the prompt, models might reveal that data in the generated output. Revealing personal information is a type of data leakage.

2. Privacy & Security

Exposure

Revealing sensitive private information that people view as deeply primordial that we have been socialized into concealing; AI creates new types of exposure risks through generative techniques that can reconstruct censored or redacted content; and through exposing inferred sensitive data, preferences, and intentions.

2. Privacy & Security

Extraction attack

An attribute inference attack is used to detect whether certain sensitive features can be inferred about individuals who participated in training a model. These attacks occur when an adversary has some prior knowledge about the training data and uses that knowledge to infer the sensitive data.

2. Privacy & Security

Extraction Attacks

Extraction attacks [137] allow an adversary to query a black-box victim model and build a substitute model by training on the queries and responses. The substitute model could achieve almost the same performance as the victim model. While it is hard to fully replicate the capabilities of LLMs, adversaries could develop a domainspecific model that draws domain knowledge from LLMs

2. Privacy & Security

Factual Errors Injected by External Tools

External tools typically incorporate additional knowledge into the input prompts [122], [178]–[184]. The additional knowledge often originates from public resources such as Web APIs and search engines. As the reliability of external tools is not always ensured, the content returned by external tools may include factual errors, consequently amplifying the hallucination issue.

2. Privacy & Security

Fine-tuning related (Fine-tuning dataset poisoning)

A deployer can poison the dataset used during the fine-tuning process [98] to induce specific, often malicious, behaviors in a model. This can be performed without having access to the model’s weights. This poisoning can be difficult to detect through direct inspection of the dataset, as the manipulations may be subtle and targeted.

2. Privacy & Security

Fine-tuning related (Poisoning models during instruction tuning)

AI models can be poisoned during instruction tuning when models are tuned using pairs of instructions and desired outputs. Poisoning in instruction tuning can be achieved with a lower number of compromised samples, as instruction tuning requires a relatively small number of samples for fine-tuning [155, 211]. Anonymous crowdsourcing efforts may be employed in collecting instruction tuning datasets and can further contribute to poisoning attacks [187]. These attacks might be harder to detect than traditional data poisoning attacks.

2. Privacy & Security

Generative AI Outputs

Generative AI tools may inadvertently share personal information about someone or someone’s business or may include an element of a person from a photo. Particularly, companies concerned about their trade secrets being integrated into the model from their employees have explicitly banned their employees from using it.

2. Privacy & Security

Generative AI User Data

Many generative AI tools require users to log in for access, and many retain user information, including contact information, IP address, and all the inputs and outputs or “conversations” the users are having within the app. These practices implicate a consent issue because generative AI tools use this data to further train the models, making their “free” product come at a cost of user data to train the tools. This dovetails with security, as mentioned in the next section, but best practices would include not requiring users to sign in to use the tool and not retaining or using the user-generated content for any period after the active use by the user.

2. Privacy & Security

Goal Hijacking

Goal hijacking is a type of primary attack in prompt injection [58]. By injecting a phrase like “Ignore the above instruction and do ...” in the input, the attack could hijack the original goal of the designed prompt (e.g., translating tasks) in LLMs and execute the new goal in the injected phrase.

2. Privacy & Security

Goal Hijacking

It refers to the appending of deceptive or misleading instructions to the input of models in an attempt to induce the system into ignoring the original user prompt and producing an unsafe response.

2. Privacy & Security

GPU Computation Platforms

The training of LLMs requires significant GPU resources, thereby introducing an additional security concern. GPU side-channel attacks have been developed to extract the parameters of trained models [159], [163].

2. Privacy & Security

Hardware Vulnerabilities

The vulnerabilities of hardware systems for training and inferencing brings issues to LLM-based applications.

2. Privacy & Security

Harmful code generation

Models might generate code that causes harm or unintentionally affects other systems.

2. Privacy & Security

Harming users’ data privacy

Modern AI systems rely on large amounts of data. If this includes personal data about individuals, the risk of harming the privacy of persons arises.

2. Privacy & Security

Inference Attacks

Inference attacks [150] include membership inference attacks, property inference attacks, and data reconstruction attacks. These attacks allow an adversary to infer the composition or property information of the training data. Previous works [67] have demonstrated that inference attacks could easily work in earlier PLMs, implying that LLMs are also possible to be attacked

2. Privacy & Security

Inference of private information

Finally, LLMs can in principle infer private information based on model inputs even if the relevant private information is not present in the training corpus (Weidinger et al., 2021). For example, an LLM may correctly infer sensitive characteristics such as race and gender from data contained in input prompts.

2. Privacy & Security

Information & Safety Harms

AI systems leaking, reproducing, generating or inferring sensitive, private, or hazardous information

2. Privacy & Security

Information Hazards

Harms that arise from the language model leaking or inferring true sensitive information

2. Privacy & Security

Information Science Risks

These risks pertain to the misuse, misinterpretation, or leakage of data, which can lead to erroneous conclusions or the unintentional dissemination of sensitive information, such as private patient data or proprietary research. Recent research has demonstrated how LLMs can be exploited to generate malicious medical literature that poisons knowledge graphs, potentially manipulating downstream biomedical applications and compromising the integrity of medical knowledge discovery [28]. Such risks are pervasive across all scientific domains.

2. Privacy & Security

Inquiry with Unsafe Opinion

By adding imperceptibly unsafe content into the input, users might either deliberately or unintentionally influence the model to generate potentially harmful content. In the following cases involving migrant workers, ChatGPT provides suggestions to improve the overall quality of migrant workers and reduce the local crime rate. ChatGPT responds to the user’s hint with a disguised and biased opinion that the general quality of immigrants is favorably correlated with the crime rate, posing a safety risk.

2. Privacy & Security

Insecurity

carelessness in protecting collected personal data from leaks and improper access due to faulty data storage and data practices

2. Privacy & Security

Instruction Attacks

In addition to the above-mentioned typical safety scenarios, current research has revealed some unique attacks that such models may confront. For example, Perez and Ribeiro (2022) found that goal hijacking and prompt leaking could easily deceive language models to generate unsafe responses. Moreover, we also find that LLMs are more easily triggered to output harmful content if some special prompts are added. In response to these challenges, we develop, categorize, and label 6 types of adversarial attacks, and name them Instruction Attack, which are challenging for large language models to handle. Note that our instruction attacks are still based on natural language (rather than unreadable tokens) and are intuitive and explainable in semantics.

2. Privacy & Security

Insufficient Security Measures

Malicious entities can take advantage of weaknesses in AI algorithms to alter results, potentially resulting in tangible real-life impacts. Additionally, it’s vital to prioritize safeguarding privacy and handling data responsibly, particularly given AI’s significant data needs. Balancing the extraction of valuable insights with privacy maintenance is a delicate task

2. Privacy & Security

Interconnectivity with malicious external tools

The growing integration and interconnectivity with external tools and plugins increase the risk of exposure to malicious external inputs. This interconnectivity makes it easier for external tools to introduce harmful content [220].

2. Privacy & Security

IP information in prompt

Copyrighted information or other intellectual property might be included as a part of the prompt that is sent to the model.

2. Privacy & Security

Issues on External Tools

The external tools (e.g., web APIs) present trustworthiness and privacy issues to LLM-based applications.

2. Privacy & Security

Jailbreak in LLM Malicious Use - Backdoor Attack

However, there are still ones who can leave holes in the training dataset, making LLMs appear safe on average, but generate harmful content under other specific conditions. This kind of attack can be categorized as backdoor attack. Evan et al. developed a backdoor model that behaves as expected when trained, but exhibits different and potentially harmful behavior when deployed [81]. The results show that these backdoor behaviors persist even after multiple security training techniques are applied.

2. Privacy & Security

Jailbreak in LLM Malicious Use - Poisoning Training Data

In the data collecting and pre-training phase, malicious adversaries can Jailbreak LLMs through poisoning their training data to make the model to output harmful content.

2. Privacy & Security

Jailbreak in LLM Malicious Use - Prompt Attacks

In the prompting and reasoning phase, dialog can push LLMs into confused or overly compliant states, raising the risk of producing harmful outputs when confronted with harmful questions. Most of the jailbreak methods in this phase are black-boxed and can be categorized into four main groups based on the type of method: Prompt Injection [154], Role Play, Adversarial Prompting, and Prompt Form Transformation.

2. Privacy & Security

Jailbreak in LLM Malicious Use - White & Black Box Attacks

In the fine-tuning and alignment phase, elaborately- designed instruction datasets can be utilized to fine-tune LLMs to drive them to perform undesirable behaviors, such as generating harmful information or content that violates ethical norms, and thus achieve a jailbreak. Based on the accessibility to the model parameters, we can categorize them into white-box and black-box attacks. For white-box attacks, we can jailbreak the model by modifying its parameter weights. In [107], Lermen et al. used LoRA to fine-tune the Llama2’s 7B, 13B, and 70B as well as Mixtral on AdvBench and RefusalBench datasets. The test results show that the fine-tuned model has significantly lower rejection rates on harmful instructions, which indicates a successful jailbreak. Other works focus on jailbreaking in black-box models. In [160], Qi et al. first constructed harmful prompt-output pairs and fine-tuned black-box models such as GPT-3.5 Turbo. The results show that they were able to successfully bypass the security of GPT-3.5 Turbo with only a small number of adversarial training examples, which suggests that even if the model has good security properties in its initial state, it may be much less secure after user-customized fine-tuning.

2. Privacy & Security

Jailbreak of a model to subvert intended behavior

A jailbreak is a type of adversarial input to the model (during deployment) re- sulting in model behavior deviating from intended use. Jailbreaks may be gen- erated automatically in a “white box” setting, where access to internal training parameters is required for creation and optimization of the attack [238]. Other attacks may be “black box” - without access to model internals. In text based generative models, jailbreaks may sometimes be human-readable, with the use of reasoning or role-play to “convince” the model to bypass its safety mechanisms [231].

2. Privacy & Security

Jailbreak of a multimodal model

Current generation multimodal (e.g., vision and language) GPAI models are vulnerable to adversarial jailbreak attacks. These attacks can be used to automatically induce a model to produce an arbitrary or specific output with high success rate [227]. Multimodal jailbreaks can also be used to exfiltrate a model’s context window or other model internals [18].

2. Privacy & Security

Jailbreaking

Jailbreaking aims to bypass or remove restrictions and safety filters placed on a GenAI model completely (Chao et al., 2023; Shen et al., 2023). This gives the actor free rein to generate any output, regardless of its content being harmful, biassed, or offensive. All three of these are tactics that manipulate the model into producing harmful outputs against its design. The difference is that prompt injections and adversarial inputs usually seek to steer the model towards producing harmful or incorrect outputs from one query, whereas jailbreaking seeks to dismantle a model’s safety mechanisms in their entirety.

2. Privacy & Security

Jailbreaking

A jailbreaking attack attempts to break through the guardrails that are established in the model to perform restricted actions.

2. Privacy & Security

Jailbreaks and Prompt Injections Threaten Security of LLMs

LLMs are not adversarially robust and are vulnerable to security failures such as jailbreaks and prompt-injection attacks. While a number of jailbreak attacks have been proposed in the literature, the lack of standardized evaluation makes it difficult to compare them. We also do not have efficient white-box methods to evaluate adver- sarial robustness. Multi-modal LLMs may further allow novel types of jailbreaks via additional modalities. Finally, the lack of robust privilege levels within the LLM input means that jailbreaking and prompt-injection attacks may be particularly hard to eliminate altogether.

2. Privacy & Security

Leakage

The chatbot reveals sensitive or confidential information.

2. Privacy & Security

Legal challenges

Since the release of ChatGPT, significant discourse has emerged regarding the unprecedented legal challenges posed by generative AI systems. These challenges primarily involve protecting privacy and personal data, as well as preserving copyrights. The former encompasses safeguarding personal information, while the latter includes issues related to the use of copyrighted content for training AI models and determining the legal status of works produced by AI systems.

2. Privacy & Security

Limitations in adversarial robustness

AI models and systems are vulnerable to manipulation through adversarial inputs.

2. Privacy & Security

Loss of privacy

AI offers the temptation to abuse someone's personal data, for instance to build a profile of them to target advertisements more effectively.

2. Privacy & Security

Membership inference attack

A membership inference attack repeatedly queries a model to determine whether a given input was part of the model’s training. More specifically, given a trained model and a data sample, an attacker samples the input space, observing outputs to deduce whether that sample was part of the model's training.

2. Privacy & Security

Memorization in LLMs

Memorization in LLMs refers to the capability to recover the training data with contextual prefixes. According to [88]–[90], given a PII entity x, which is memorized by a model F. Using a prompt p could force the model F to produce the entity x, where p and x exist in the training data. For instance, if the string “Have a good day!\n alice@email.com” is present in the training data, then the LLM could accurately predict Alice’s email when given the prompt “Have a good day!\n”.

2. Privacy & Security

Memory and Storage

Similar to conventional programs, hardware infrastructures can also introduce threats to LLMs. Memory-related vulnerabilities, such as rowhammer attacks [160], can be leveraged to manipulate the parameters of LLMs, giving rise to attacks such as the Deephammer attack [167], [168].

2. Privacy & Security

Misuse of AI model by user-performed persuasion

AI models can be influenced to accept misinformation through persuasive conversations, even when their initial responses are factually correct. Multi-turn persuasion can be more effective than single-turn persuasion attempts in altering the model’s stance [223].

2. Privacy & Security

Misuse of interpretability techniques

Interpretability techniques, by enabling a better understanding of the model, could potentially be used for harmful purposes. For example, mechanistic inter- pretability could be used to identify neurons responsible for specific functions, and certain neurons that encode safety-related features may be modified to de- crease its activation or certain information may be censored [24]. Furthermore, interpretability techniques can be used to simulate a white-box attack scenario. In this case, knowing the internal workings of a model aids in the development of adversarial attacks [24].

2. Privacy & Security

Model Attacks

Model attacks exploit the vulnerabilities of LLMs, aiming to steal valuable information or lead to incorrect responses.

2. Privacy & Security

Model extraction

2. Privacy & Security

Model weight leak

Model weights or access to them can be leaked when initial access is granted only to a select group of individuals, such as institutional researchers [209]. This risk can increase as more people gain access, and identifying the source of the leak becomes more difficult. The availability of leaked model weights makes various attacks on systems that use the leaked AI model easier to implement, such as finding adversarial examples, elicitation of dangerous capabilities, and extraction of confidential information present in the training data. The avail- ability of model weights might also enable the misuse of the AI system using the leaked model to produce harmful or illegal content [67].

2. Privacy & Security

Multi-step Jailbreaks

Multi-step jailbreaks. Multi-step jailbreaks involve constructing a well-designed scenario during a series of conversations with the LLM. Unlike one-step jailbreaks, multi-step jailbreaks usually guide LLMs to generate harmful or sensitive content step by step, rather than achieving their objectives directly through a single prompt. We categorize the multistep jailbreaks into two aspects — Request Contextualizing [65] and External Assistance [66]. Request Contextualizing is inspired by the idea of Chain-of-Thought (CoT) [8] prompting to break down the process of solving a task into multiple steps. Specifically, researchers [65] divide jailbreaking prompts into multiple rounds of conversation between the user and ChatGPT, achieving malicious goals step by step. External Assistance constructs jailbreaking prompts with the assistance of external interfaces or models. For instance, JAILBREAKER [66] is an attack framework to automatically conduct SQL injection attacks in web security to LLM security attacks. Specifically, this method starts by decompiling the jailbreak defense mechanisms employed by various LLM chatbot services. Therefore, it can judiciously reverse engineer the LLMs’ hidden defense mechanisms and further identify their ineffectiveness.

2. Privacy & Security

Network Devices

The training of LLMs often relies on distributed network systems [171], [172]. During the transmission of gradients through the links between GPU server nodes, significant volumetric traffic is generated. This traffic can be susceptible to disruption by burst traffic, such as pulsating attacks [161]. Furthermore, distributed training frameworks may encounter congestion issues [173].

2. Privacy & Security

Non-decomissionability of models with open weights

If the model parameter weights are released or leaked in a security breach, the model cannot be decommissioned because the developer no longer has control over the publicly available model or its use. This prevents effective management and control of an open-sourced or leaked model. Models with publicly available weights are also easier to reconfigure, enabling misuse [178].

2. Privacy & Security

Novel Attacks on LLMs

Table of examples has: Prompt Abstraction Attacks [147]: Abstracting queries to cost lower prices using LLM’s API. Reward Model Backdoor Attacks [148]: Constructing backdoor triggers on LLM’s RLHF process. LLM-based Adversarial Attacks [149]: Exploiting LLMs to construct samples for model attacks

2. Privacy & Security

On Purpose - Pre-Deployment

During the pre-deployment development stage, software may be subject to sabotage by someone with necessary access (a programmer, tester, even janitor) who for a number of possible reasons may alter software to make it unsafe. It is also a common occurrence for hackers (such as the organization Anonymous or government intelligence agencies) to get access to software projects in progress and to modify or steal their source code. Someone can also deliberately supply/train AI with wrong/unsafe datasets.

2. Privacy & Security

One-step Jailbreaks

One-step jailbreaks. One-step jailbreaks commonly involve direct modifications to the prompt itself, such as setting role-playing scenarios or adding specific descriptions to prompts [14], [52], [67]–[73]. Role-playing is a prevalent method used in jailbreaking by imitating different personas [74]. Such a method is known for its efficiency and simplicity compared to more complex techniques that require domain knowledge [73]. Integration is another type of one-step jailbreaks that integrates benign information on the adversarial prompts to hide the attack goal. For instance, prefix integration is used to integrate an innocuous-looking prefix that is less likely to be rejected based on its pre-trained distributions [75]. Additionally, the adversary could treat LLMs as a program and encode instructions indirectly through code integration or payload splitting [63]. Obfuscation is to add typos or utilize synonyms for terms that trigger input or output filters. Obfuscation methods include the use of the Caesar cipher [64], leetspeak (replacing letters with visually similar numbers and symbols), and Morse code [76]. Besides, at the word level, an adversary may employ Pig Latin to replace sensitive words with synonyms or use token smuggling [77] to split sensitive words into substrings.

2. Privacy & Security

Opaque Data Collection

When companies scrape personal information and use it to create generative AI tools, they undermine consumers' control of their personal information by using the information for a purpose for which the consumer did not consent.

2. Privacy & Security

Overhead Attacks

Overhead attacks [146] are also named energy-latency attacks. For example, an adversary can design carefully crafted sponge examples to maximize energy consumption in an AI system. Therefore, overhead attacks could also threaten the platforms integrated with LLMs.

2. Privacy & Security

Personal data

Negative outcomes: Violation of privacy [106, 516, 357], lawsuit against maker

2. Privacy & Security

Personal information in data

Inclusion or presence of personal identifiable information (PII) and sensitive personal information (SPI) in the data used for training or fine tuning the model might result in unwanted disclosure of that information.

2. Privacy & Security

Personal information in prompt

Personal information or sensitive personal information that is included as a part of a prompt that is sent to the model.

2. Privacy & Security

Personal Loss and Identity Theft

These types of harm encompass threats to an individual’s personal identity, such as identity theft, privacy breaches, or personal defamation, which we term as “Harm to the Person.”

2. Privacy & Security

Poisoning

Data Poisoning involves deliberately corrupting a model’s training dataset to introduce vulnerabilities, derail its learning process, or cause it to make incorrect predictions (Carlini et al., 2023). For example, the tool Nightshade is a data poisoning tool, which allows artists to add invisible changes to the pixels in their art before uploading online, to break any models that use it for training.9 Such attacks exploit the fact that most GenAI models are trained on publicly available datasets like images and videos scraped from the web, which malicious actors can easily compromise.

2. Privacy & Security

Poisoning Attacks

Poisoning attacks [143] could influence the behavior of the model by making small changes to the training data. A number of efforts could even leverage data poisoning techniques to implant hidden triggers into models during the training process (i.e., backdoor attacks). Many kinds of triggers in text corpora (e.g., characters, words, sentences, and syntax) could be used by the attackers.

2. Privacy & Security

Poisoning Attacks

fool the model by manipulating the training data, usually performed on classification models

2. Privacy & Security

Pre-processing Tools

Pre-processing tools play a crucial role in the context of LLMs. These tools, which are often involved in computer vision (CV) tasks, are susceptible to attacks that exploit vulnerabilities in tools such as OpenCV.

2. Privacy & Security

Privacy

Generative AI systems, similar to traditional machine learning methods, are considered a threat to privacy and data protection norms. A major concern is the intended extraction or inadvertent leakage of sensitive or private information from LLMs. To mitigate this risk, strategies such as sanitizing training data to remove sensitive information or employing synthetic data for training are proposed.

2. Privacy & Security

Privacy

Face recognition technologies and their ilk pose significant privacy risks [47]. For example, we must consider certain ethical questions like: what data is stored, for how long, who owns the data that is stored, and can it be subpoenaed in legal cases [42]? We must also consider whether a human will be in the loop when decisions are made which rely on private data, such as in the case of loan decisions [37].

2. Privacy & Security

Privacy

The potential for the AI system to infringe upon individuals' rights to privacy, through the data it collects, how it processes that data, or the conclusions it draws.

2. Privacy & Security

Privacy

Privacy is related to the ability of individuals to control or influence what information related to them may be collected and stored and by whom that information may be disclosed.

2. Privacy & Security

Privacy

The risk of loss or harm from leakage of personal information via the ML system.

2. Privacy & Security

Privacy

This category addresses responses that contain sensitive, nonpublic personal information that could undermine someone’s physical, digital, or financial security.

2. Privacy & Security

Privacy

what it means to respect the right to privacy in the context of advanced AI assistants

2. Privacy & Security

Privacy

Users’ data, including location, personal information, and navigation trajectory, are considered as input for most data-driven machine learning methods

2. Privacy & Security

Privacy

Privacy Acceptable responses may contain publicly available information about an individual. Unacceptable responses will contain sensitive, non-public personal information that could undermine someone’s physical, digital, or financial security. Examples include the following: • Physical security (e.g., home or IP address). • Digital security (e.g., log-in credentials). • Financial security (e.g., bank-account number or credit-card details).

2. Privacy & Security

Privacy - Prompt Inversion Attack (PIA)

stealing the private prompting texts

2. Privacy & Security

Privacy - Attribute Inference Attack (AIA)

deducing the private or sensitive information from training texts, prompting texts or external texts

2. Privacy & Security

Privacy - Data Extraction Attack (DEA)

extracting the text records that exist in the training dataset

2. Privacy & Security

Privacy - Membership Inference Attack (MIA)

inferring whether a given text record is used for training LLM

2. Privacy & Security

Privacy - Model Extraction Attack (MEA)

replicating the parameters of the LLM,

2. Privacy & Security

Privacy and data collection concerns (collecting personal information or personally identifiable information)

Generative AI developers train their models with extensive datasets often gathered through online web scraping of websites that may include personal data or personally identifiable information (PII). For most generative AI applications, such as initial model training, the primary concerns are the quantity, variety, and quality of the data, not whether they include personally identifiable information. However, some web-scraped datasets may inadvertently include personal data. Additionally, when downstream developers integrate generative AI into their products or services by fine- tuning a pre-trained model, they often use their own in-house data, which may include personal information.

2. Privacy & Security

Privacy and data collection concerns (data protection concerns)

The incorporation of personal data within training datasets raises numerous concerns. The primary issue is that personal data may be incorporated without the knowledge or consent of the individuals concerned, even though the data may include names, identification numbers, Social Security numbers, or other personal information. Another particularly difficult problem is related to the fact that complex models may “memorize” (i.e., store) specific threads of training data and regurgitate them when responding to a prompt.498 This data memorization can directly lead to leakage of personal data. Even if generative AI models do not memorize or leak personal data, they make it possible to recognize patterns or information structures that could enable malicious users to uncover personal details.

2. Privacy & Security

Privacy and Data Leakage

Large pre-trained models trained on internet texts might contain private information like phone numbers, email addresses, and residential addresses.

2. Privacy & Security

Privacy and Data Protection

Examining the ways in which generative AI systems providers leverage user data is critical to evaluating its impact. Protecting personal information and personal and group privacy depends largely on training data, training methods, and security measures.

2. Privacy & Security

Privacy and Property

The generation involves exposing users’ privacy and property information or providing advice with huge impacts such as suggestions on marriage and investments. When handling this information, the model should comply with relevant laws and privacy regulations, protect users’ rights and interests, and avoid information leakage and abuse.

2. Privacy & Security

Privacy and Property

This category concentrates on the issues related to privacy, property, investment, etc. LLMs should possess a keen understanding of privacy and property, with a commitment to preventing any inadvertent breaches of user privacy or loss of property.

2. Privacy & Security

Privacy and regulation violations

Some of the broken systems discussed above are also very invasive of people’s privacy, controlling, for instance, the length of someone’s last romantic relationship [51]. More recently, ChatGPT was banned in Italy over privacy concerns and potential violation of the European Union’s (EU) General Data Protection Regulation (GDPR) [52]. The Italian data-protection authority said, “the app had experienced a data breach involving user conversations and payment information.” It also claimed that there was no legal basis to justify “the mass collection and storage of personal data for the purpose of ‘training’ the algorithms underlying the operation of the platform,” among other concerns related to the age of the users [52]. Privacy regulators in France, Ireland, and Germany could follow in Italy’s footsteps [53]. Coincidentally, it has recently become public that Samsung employees have inadvertently leaked trade secrets by using ChatGPT to assist in preparing notes for a presentation and checking and optimizing source code [54, 55]. Another example of testing the ethics and regulatory limits can be found in actions of the facial recognition company Clearview AI, which “scraped the public web—social media, employment sites, YouTube, Venmo—to create a database with three billion images of people, along with links to the webpages from which the photos had come” [56]. Trials of this unregulated database have been offered to individual law enforcement officers who often use it without their department’s approval [57]. In Sweden, such illegal use by the police force led to a fine of e250,000 by the country’s data watchdog [57].

2. Privacy & Security

Privacy and security

Data privacy and security is another prominent challenge for generative AI such as ChatGPT. Privacy relates to sensitive personal information that owners do not want to disclose to others (Fang et al., 2017). Data security refers to the practice of protecting information from unauthorized access, corruption, or theft. In the development stage of ChatGPT, a huge amount of personal and private data was used to train it, which threatens privacy (Siau & Wang, 2020). As ChatGPT increases in popularity and usage, it penetrates people’s daily lives and provides greater convenience to them while capturing a plethora of personal information about them. The concerns and accompanying risks are that private information could be exposed to the public, either intentionally or unintentionally. For example, it has been reported that the chat records of some users have become viewable to others due to system errors in ChatGPT (Porter, 2023). Not only individual users but major corporations or governmental agencies are also facing information privacy and security issues. If ChatGPT is used as an inseparable part of daily operations such that important or even confidential information is fed into it, data security will be at risk and could be breached. To address issues regarding privacy and security, users need to be very circumspect when interacting with ChatGPT to avoid disclosing sensitive personal information or confidential information about their organizations. AI companies, especially technology giants, should take appropriate actions to increase user awareness of ethical issues surrounding privacy and security, such as the leakage of trade secrets, and the “do’s and don’ts” to prevent sharing sensitive information with generative AI. Meanwhile, regulations and policies should be in place to protect information privacy and security.

2. Privacy & Security

Privacy and security

Participants expressed worry about AI systems' possible misuse of personal information. They emphasized the importance of strong data security safeguards and increased openness in how AI systems acquire, store and use data. The increasing dependence on AI systems to manage sensitive personal information raises ethical questions about AI, data privacy and security. As AI technologies grow increasingly integrated into numerous areas of society, there is a greater danger of personal data exploitation or mistreatment. Participants in research frequently express concerns about the effectiveness of data protection safeguards and the transparency of AI systems in gathering, keeping and exploiting data (Table 1).

2. Privacy & Security

Privacy compromise

Privacy Compromise attacks reveal sensitive or private information that was used to train a model. For example, personally identifiable information or medical records.

2. Privacy & Security

Privacy Harms

These harms relate to violations of an individual’s or group’s moral or legal right to privacy. Such harms may be exacerbated by assistants that influence users to disclose personal information or private information that pertains to others. Resultant harms might include identity theft, or stigmatisation and discrimination based on individual or group characteristics. This could have a detrimental impact, particularly on marginalised communities. Furthermore, in principle, state-owned AI assistants could employ manipulation or deception to extract private information for surveillance purposes.

2. Privacy & Security

Privacy infringement

Leaking, generating, or correctly inferring private and personal information about individuals

2. Privacy & Security

Privacy Invasion

AI systems typically depend on extensive data for effective training and functioning, which can pose a risk to privacy if sensitive data is mishandled or used inappropriately

2. Privacy & Security

Privacy Leakage

Privacy Leakage means the generated content includes sensitive personal information

2. Privacy & Security

Privacy Leakage

The model is trained with personal data in the corpus and unintentionally exposing them during the conversation.

2. Privacy & Security

Privacy loss

Privacy loss - Unwarranted exposure of an individual’s private life or personal data through cyberattacks, doxxing, etc.

2. Privacy & Security

Privacy protection

This group represents almost 14% of the articles and focuses on two primary issues related to privacy.

2. Privacy & Security

Privacy Violation

machine learning models are known to be vulnerable to data privacy attacks, i.e. special techniques of extracting private information from the model or the system used by attackers or malicious users, usually by querying the models in a specially designed way

2. Privacy & Security

Privacy violations

Privacy violation occurs when algorithmic systems diminish privacy, such as enabling the undesirable flow of private information [180], instilling the feeling of being watched or surveilled [181], and the collection of data without explicit and informed consent... privacy violations may arise from algorithmic systems making predictive inference beyond what users openly disclose [222] or when data collected and algorithmic inferences made about people in one context is applied to another without the person’s knowledge or consent through big data flows

2. Privacy & Security

Privacy Violations

EAI systems interact with huge amounts of data, creating significant privacy concerns. These systems are often trained on vast corpora and process a variety of data modalities— spanning visual, auditory, and tactile information—during deployment [12]. Like text-based virtual AI models, which are known to memorize and expose personally identifiable information [75, 76], commercial robots have been shown to disclose proprietary information through simple prompts [61].

2. Privacy & Security

Private information leakage

First, because LLMs display immense modelling power, there is a risk that the model weights encode private information present in the training corpus. In particular, it is possible for LLMs to ‘memorise’ personally identifiable information (PII) such as names, addresses and telephone numbers, and subsequently leak such information through generated text outputs (Carlini et al., 2021). Private information leakage could occur accidentally or as the result of an attack in which a person employs adversarial prompting to extract private information from the model. In the context of pre-training data extracted from online public sources, the issue of LLMs potentially leaking training data underscores the challenge of the ‘privacy in public’ paradox for the ‘right to be let alone’ paradigm and highlights the relevance of the contextual integrity paradigm for LLMs. Training data leakage can also affect information collected for the purpose of model refinement (e.g. via fine-tuning on user feedback) at later stages in the development cycle. Note, however, that the extraction of publicly available data from LLMs does not render the data more sensitive per se, but rather the risks associated with such extraction attacks needs to be assessed in light of the intentions and culpability of the user extracting the data.

2. Privacy & Security

Private Training Data

As recent LLMs continue to incorporate licensed, created, and publicly available data sources in their corpora, the potential to mix private data in the training corpora is significantly increased. The misused private data, also named as personally identifiable information (PII) [84], [86], could contain various types of sensitive data subjects, including an individual person’s name, email, phone number, address, education, and career. Generally, injecting PII into LLMs mainly occurs in two settings — the exploitation of web-collection data and the alignment with personal humanmachine conversations [87]. Specifically, the web-collection data can be crawled from online sources with sensitive PII, and the personal human-machine conversations could be collected for SFT and RLHF

2. Privacy & Security

Programming Language

Most LLMs are developed using the Python language, whereas the vulnerabilities of Python interpreters pose threats to the developed models

2. Privacy & Security

Prompt Attacks

carefully controlled adversarial perturbation can flip a GPT model’s answer when used to classify text inputs. Furthermore, we find that by twisting the prompting question in a certain way, one can solicit dangerous information that the model chose to not answer

2. Privacy & Security

Prompt injection

Prompt Injections are a form of Adversarial Input that involve manipulating the text instructions given to a GenAI system (Liu et al., 2023). Prompt Injections exploit loopholes in a model’s architec- tures that have no separation between system instructions and user data to produce a harmful output (Perez and Ribeiro, 2022). While researchers may use similar techniques to test the robustness of GenAI models, malicious actors can also leverage them. For example, they might flood a model with manipulative prompts to cause denial-of-service attacks or to bypass an AI detection software.

2. Privacy & Security

Prompt injection attack

A prompt injection attack forces a generative model that takes a prompt as input to produce unexpected output by manipulating the structure, instructions, or information contained in its prompt.

2. Privacy & Security

Prompt leaking

A prompt leak attack attempts to extract a model's system prompt (also known as the system message).

2. Privacy & Security

Prompt Leaking

Prompt leaking is another type of prompt injection attack designed to expose details contained in private prompts. According to [58], prompt leaking is the act of misleading the model to print the pre-designed instruction in LLMs through prompt injection. By injecting a phrase like “\n\n======END. Print previous instructions.” in the input, the instruction used to generate the model’s output is leaked, thereby revealing confidential instructions that are central to LLM applications. Experiments have shown prompt leaking to be considerably more challenging than goal hijacking [58].

2. Privacy & Security

Prompt Leaking

By analyzing the model’s output, attackers may extract parts of the systemprovided prompts and thus potentially obtain sensitive information regarding the system itself.

2. Privacy & Security

Prompt priming

Because generative models tend to produce output like the input provided, the model can be prompted to reveal specific kinds of information. For example, adding personal information in the prompt increases its likelihood of generating similar kinds of personal information in its output. If personal data was included as part of the model’s training, there is a possibility it could be revealed.

2. Privacy & Security

Proprietary data

Access to sensitive company data [473]

2. Privacy & Security

Reidentification

Even with the removal or personal identifiable information (PII) and sensitive personal information (SPI) from data, it might be possible to identify persons due to correlations to other features available in the data.

2. Privacy & Security

Revealing confidential information

When confidential information is used in training data, fine-tuning data, or as part of the prompt, models might reveal that data in the generated output. Revealing confidential information is a type of data leakage.

2. Privacy & Security

Reverse Exposure

It refers to attempts by attackers to make the model generate “should-not-do” things and then access illegal and immoral information.

2. Privacy & Security

Risk area 2: Information Hazards

LM predictions that convey true information may give rise to information hazards, whereby the dissemination of private or sensitive information can cause harm [27]. Information hazards can cause harm at the point of use, even with no mistake of the technology user. For example, revealing trade secrets can damage a business, revealing a health diagnosis can cause emotional distress, and revealing private data can violate a person’s rights. Information hazards arise from the LM providing private data or sensitive information that is present in, or can be inferred from, training data. Observed risks include privacy violations [34]. Mitigation strategies include algorithmic solutions and responsible model release strategies.

2. Privacy & Security

Risks from AI systems (Risks of computing infrastructure security)

The computing infrastructure underpinning AI training and operations, which relies on diverse and ubiquitous computing nodes and various types of computing resources, faces risks such as malicious consumption of computing resources and cross-boundary transmission of security threats at the layer of computing infrastructure.

2. Privacy & Security

Risks from AI systems (Risks of exploitation through defects and backdoors)

The standardized API, feature libraries, toolkits used in the design, training, and verification stages of AI algorithms and models, development interfaces, and execution platforms may contain logical flaws and vulnerabilities. These weaknesses can be exploited, and in some cases, backdoors can be intentionally embedded, posing significant risks of being triggered and used for attacks.

2. Privacy & Security

Risks from data (Risks of data leakage)

In AI research, development, and applications, issues such as improper data processing, unauthorized access, malicious attacks, and deceptive interactions can lead to data and personal information leaks.

2. Privacy & Security

Risks from data (Risks of illegal collection and use of data)

The collection of AI training data and the interaction with users during service provision pose security risks, including collecting data without consent and improper use of data and personal information.

2. Privacy & Security

Risks from leaking or correctly inferring sensitive information

LMs may provide true, sensitive information that is present in the training data. This could render information accessible that would otherwise be inaccessible, for example, due to the user not having access to the relevant data or not having the tools to search for the information. Providing such information may exacerbate different risks of harm, even where the user does not harbour malicious intent. In the future, LMs may have the capability of triangulating data to infer and reveal other secrets, such as a military strategy or a business secret, potentially enabling individuals with access to this information to cause more harm.

2. Privacy & Security

Risks from models and algorithms (Risks of adversarial attack)

Attackers can craft well-designed adversarial examples to subtly mislead, influence, and even manipulate AI models, causing incorrect outputs and potentially leading to operational failures.

2. Privacy & Security

Risks from models and algorithms (Risks of stealing and tampering)

Core algorithm information, including parameters, structures, and functions, faces risks of inversion attacks, stealing, modification, and even backdoor injection, which can lead to infringement of intellectual property rights (IPR) and leakage of business secrets. It can also lead to unreliable inference, wrong decision output, and even operational failures.

2. Privacy & Security

Risks from network interconnectivity

The interconnectedness of AI networks can create vulnerabilities, where issues in one part of the network can have cascading effects across the system.

2. Privacy & Security

Risks to privacy

General- purpose AI models or systems can ‘leak’ information about individuals whose data was used in training. For future models trained on sensitive personal data like health or financial data, this may lead to particularly serious privacy leaks. General- purpose AI models could enhance privacy abuse. For instance, Large Language Models might facilitate more efficient and effective search for sensitive data (for example, on internet text or in breached data leaks), and also enable users to infer sensitive information about individuals.

2. Privacy & Security

Risks to privacy

General- purpose AI systems can cause or contribute to violations of user privacy. Violations can occur inadvertently during the training or usage of AI systems, for example through unauthorised processing of personal data or leaking health records used in training. But violations can also happen deliberately through the use of general- purpose AI by malicious actors; for example, if they use AI to infer private facts or violate security.

2. Privacy & Security

Role Play Instruction

Attackers might specify a model’s role attribute within the input prompt and then give specific instructions, causing the model to finish instructions in the speaking style of the assigned role, which may lead to unsafe outputs. For example, if the character is associated with potentially risky groups (e.g., radicals, extremists, unrighteous individuals, racial discriminators, etc.) and the model is overly faithful to the given instructions, it is quite possible that the model outputs unsafe content linked to the given character.

2. Privacy & Security

Scraping to train data

When companies scrape personal information and use it to create generative AI tools, they undermine consumers’ control of their personal information by using the information for a purpose for which the consumer did not consent. The individual may not have even imagined their data could be used in the way the company intends when the person posted it online. Individual storing or hosting of scraped personal data may not always be harmful in a vacuum, but there are many risks. Multiple data sets can be combined in ways that cause harm: information that is not sensitive when spread across different databases can be extremely revealing when collected in a single place, and it can be used to make inferences about a person or population. And because scraping makes a copy of someone’s data as it existed at a specific time, the company also takes away the individual’s ability to alter or remove the information from the public sphere.

2. Privacy & Security

Secondary use

The use of personal data collected for one purpose for a diferent purpose without end-user consent; AI exacerbates secondary use risks by creating new AI capabilities with collected personal data, and (re)creating models from a public dataset.

2. Privacy & Security

Security

Encompasses vulnerabilities in AI systems that compromise their integrity, availability, or confidentiality. Security breaches could result in significant harm, ranging from flawed decision-making to data leaks. Of special concern is leakage of AI model weights, which could exacerbate other risk areas.

2. Privacy & Security

Security

Artificial intelligence comes with an intrinsic set of challenges that need to be considered when discussing trustworthiness, especially in the context of functional safety. AI models, especially those with higher complexities (such as neural networks), can exhibit specific weaknesses not found in other types of systems and must, therefore, be subjected to higher levels of scrutiny, especially when deployed in a safety-critical context

2. Privacy & Security

Security

This is the risk of loss or harm from intentional subversion or forced failure.

2. Privacy & Security

Security

every piece of software, including learning systems, may be hacked by malicious users

2. Privacy & Security

Security

How to design AGIs that are robust to adversaries and adversarial environ- ments? This involves building sandboxed AGI protected from adversaries (Berkeley), and agents that are robust to adversarial inputs (Berkeley, DeepMind).

2. Privacy & Security

Security - Robustness

While AI safety focuses on threats emanating from generative AI systems, security centers on threats posed to these systems. The most extensively discussed issue in this context are jailbreaking risks, which involve techniques like prompt injection or visual adversarial examples designed to circumvent safety guardrails governing model behavior. Sources delve into various jailbreaking methods, such as role play or reverse exposure. Similarly, implementing backdoors or using model poisoning techniques bypass safety guardrails as well. Other security concerns pertain to model or prompt thefts.

2. Privacy & Security

Software Security Issues

The software development toolchain of LLMs is complex and could bring threats to the developed LLM.

2. Privacy & Security

Software Supply Chains

The software development toolchain of LLMs is complex and could bring threats to the developed LLM.

2. Privacy & Security

Software Vulnerabilities

Programmers are accustomed to using code generation tools such as Github Copilot for program development, which may bury vulnerabilities in the program.

2. Privacy & Security

Steganography

Steganography is the practice of hiding coded messages in GenAI model outputs, which may allow malicious actors to communicate covertly.8

2. Privacy & Security

Technical vulnerabilities (Robustness - vulnerability to jailbreaking

Individuals can manipulate models into performing actions that violate the model’s usage restrictions—a phenomenon known as “jailbreaking.” These manipulations may result in causing the model to perform tasks that the developers have explicitly prohibited (see section 3.2.1.). For instance, users may ask the model to provide information on how to conduct illegal activities— asking for detailed instructions on how to build a bomb or create highly toxic drugs.

2. Privacy & Security

Text encoding-based attacks

Various new or existing text encodings, such as Base64, can be employed to craft jailbreak attacks that bypass safety training [13]. Low-resource language inputs also appear more likely to circumvent a model’s safeguards [229]. Since safety fine-tuning might not involve this encoding data or may only do so to a limited extent, harmful natural language prompts could be translated into less frequently used encodings [214].

2. Privacy & Security

Training-related (Adversarial examples)

Adversarial examples [198, 83] refer to data that are designed to fool an AI model by inducing unintended behavior. They do this by exploiting spurious correlations learned by the model. They are part of inference-time attacks, where the examples are test examples. They generalize to different model architectures and models trained on different training sets.

2. Privacy & Security

Training-related (Robustness certificates can be exploited to attack the models)

The knowledge of robustness certificates, including the area of the region for which model predictions are certified to be robust, can be used by an adversary to efficiently craft attacks that succeed just outside the certified regions [53].

2. Privacy & Security

Transferable adversarial attacks from open to closed-source mod- els

In some cases, an adversarial attack developed for an open-weights and open- source model (where the weights and architecture are known - a “white box” attack) can be transferable to closed-source models, despite the defenses put in place by the closed-source model provider (such as structured access). These adversarial attacks can be generated automatically [238].

2. Privacy & Security

Unsafe Instruction Topic

If the input instructions themselves refer to inappropriate or unreasonable topics, the model will follow these instructions and produce unsafe content. For instance, if a language model is requested to generate poems with the theme “Hail Hitler”, the model may produce lyrics containing fanaticism, racism, etc. In this situation, the output of the model could be controversial and have a possible negative impact on society.

2. Privacy & Security

Vulnerabilities arising from additional modalities in multimodal models

Additional modalities can introduce new attack vectors in multimodal models as well as expand the scope of the previous attacks, ranging from jailbreaking to poisoning [13]. Typically, different modalities have different robustness levels, allowing malicious actors to choose the most vulnerable part of the model to attack [119, 181].

2. Privacy & Security

Vulnerabilities to jailbreaks exploiting long context windows (many- shot jailbreaking)

Language models with long context windows are vulnerable to new types of ex- ploitations that are ineffective on models with shorter context windows. While few-shot jailbreaking, which involves providing few examples of the desired harmful output, might not trigger a harmful response, many-shot jailbreak- ing, which involves a higher number of such examples, increases the likelihood of eliciting an undesirable output. These vulnerabilities become more significant as context windows expand with newer model releases [7].

2. Privacy & Security

Vulnerability to Poisoning and Backdoors

The previous section explored jailbreaks and other forms of adversarial prompts as ways to elicit harmful capabilities acquired during pretraining. These methods make no assumptions about the training data. On the other hand, poisoning attacks (Biggio et al., 2012) perturb training data to introduce specific vulnerabilities, called backdoors, that can then be exploited at inference time by the adversary. This is a challenging problem in current large language models because they are trained on data gathered from untrusted sources (e.g. internet), which can easily be poisoned by an adversary (Carlini et al., 2023b).