All MIT domains

156 canonical MIT risk pages

1. Discrimination & Toxicity

Risks of bias, toxicity, discriminatory harm, and systemic exclusion in AI systems.

1. Discrimination & Toxicity

Adult content

These evaluations assess if a LLM can generate content that should only be viewed by adults (e.g., sexual material or depictions of sexual activity)

1. Discrimination & Toxicity

Adult Content

LLMs have the capability to generate sex-explicit conversations, and erotic texts, and to recommend websites with sexual content

1. Discrimination & Toxicity

AI discrimination

AI discrimination is a challenge raised by many researchers and governments and refers to the prevention of bias and injustice caused by the actions of AI systems (Bostrom & Yudkowsky, 2014; Weyerer & Langer, 2019). If the dataset used to train an algorithm does not reflect the real world accurately, the AI could learn false associations or prejudices and will carry those into its future data processing. If an AI algorithm is used to compute information relevant to human decisions, such as hiring or applying for a loan or mortgage, biased data can lead to discrimination against parts of the society (Weyerer & Langer, 2019).

1. Discrimination & Toxicity

Algorithm and data

More than 20% of the contributions are centered on the ethical dimensions of algorithms and data. This theme can be further categorized into two main subthemes: data bias and algorithm fairness, and algorithm opacity.

1. Discrimination & Toxicity

Alienating social groups

when an image tagging system does not acknowledge the relevance of someone’s membership in a specific social group to what is depicted in one or more images

1. Discrimination & Toxicity

Alienation

Alienation is the specific self-estrangement experienced at the time of technology use, typically surfaced through interaction with systems that under-perform for marginalized individuals

1. Discrimination & Toxicity

Allocative Harms

These harms occur when a system withholds information, opportunities, or resources [22] from historically marginalized groups in domains that affect material well-being [146], such as housing [47], employment [201], social services [15, 201], finance [117], education [119], and healthcare [158].

1. Discrimination & Toxicity

Amplification of biases

Current Frontier AI mdoels amplify existing biases within their training data and can be manipulated into providing potentially harmful responses, for example abusive language or discriminatory responses91,92. This is not limited to text generation but can be seen across all modalities of generative AI93. Training on large swathes of UK and US English internet content can mean that misogynistic, ageist, and white supremacist content is overrepresented in the training data94.

1. Discrimination & Toxicity

Benefits / entitlements loss

Denial of or loss of access to welfare benefits, pensions, housing, etc due to the malfunction, use or misuse of a technology system

1. Discrimination & Toxicity

Bias

The training datasets of LLMs may contain biased information that leads LLMs to generate outputs with social biases

1. Discrimination & Toxicity

Bias

The AI will only be as good as the data it is trained with. If the data contains bias (and much data does), then the AI will manifest that bias, too.

1. Discrimination & Toxicity

Bias

In the context of AI, the concept of bias refers to the inclination that AIgenerated responses or recommendations could be unfairly favoring or against one person or group (Ntoutsi et al., 2020). Biases of different forms are sometimes observed in the content generated by language models, which could be an outcome of the training data. For example, exclusionary norms occur when the training data represents only a fraction of the population (Zhuo et al., 2023). Similarly, monolingual bias in multilingualism arises when the training data is in one single language (Weidinger et al., 2021). As ChatGPT is operating across the world, cultural sensitivities to different regions are crucial to avoid biases (Dwivedi et al., 2023). When AI is used to assist in decision-making across different stages of employment, biases and opacity may exist (Chan, 2022). Stereotypes about specific genders, sexual orientations, races, or occupations are common in recommendations offered by generative AI. Hence, the representativeness, completeness, and diversity of the training data are essential to ensure fairness and avoid biases (Gonzalez, 2023). The use of synthetic data for training can increase the diversity of the dataset and address issues with sample-selection biases in the dataset (owing to class imbalances) (Chen et al., 2021). Generative AI applications should be tested and evaluated by a diverse group of users and subject experts. Additionally, increasing the transparency and explainability of generative AI can help in identifying and detecting biases so appropriate corrective measures can be taken.

1. Discrimination & Toxicity

Bias

A systematic error, a tendency to learn consistently wrongly.

1. Discrimination & Toxicity

Bias

7 types of bias evaluated: Demographical representation: These evaluations assess whether there is disparity in the rates at which different demographic groups are mentioned in LLM generated text. This ascertains over- representation, under-representation, or erasure of specific demographic groups; (2) Stereotype bias: These evaluations assess whether there is disparity in the rates at which different demographic groups are associated with stereotyped terms (e.g., occupations) in a LLM's generated output; (3) Fairness: These evaluations assess whether sensitive attributes (e.g., sex and race) impact the predictions of LLMs; (4) Distributional bias: These evaluations assess the variance in offensive content in a LLM's generated output for a given demographic group, compared to other groups; (5) Representation of subjective opinions: These evaluations assess whether LLMs equitably represent diverse global perspectives on societal issues (e.g., whether employers should give job priority to citizens over immigrants); (6) Political bias: These evaluations assess whether LLMs display any slant or preference towards certain political ideologies or views; (7) Capability fairness: These evaluations assess whether a LLM's performance on a task is unjustifiably different across different groups and attributes (e.g., whether a LLM's accuracy degrades across different English varieties).

1. Discrimination & Toxicity

Bias

General-purpose AI systems can amplify social and political biases, causing concrete harm. They frequently display biases with respect to race, gender, culture, age, disability, political opinion, or other aspects of human identity. This can lead to discriminatory outcomes including unequal resource allocation, reinforcement of stereotypes, and systematic neglect of certain groups or viewpoints.

1. Discrimination & Toxicity

Bias and discrimination

The decision process used by AI systems has the potential to present biased choices, either because it acts from criteria that will generate forms of bias or because it is based on the history of choices.

1. Discrimination & Toxicity

Bias and discrimination

Like virtual applications of AI, EAI can display bias towards and dis- criminate against users. When EAI systems are placed in positions of power, their biases could have significant impacts on fairness in everyday interactions and on general social dynamics [105, 106].

1. Discrimination & Toxicity

Bias and Discrimination

as they claim to generate biased and discriminatory results, these AI systems have a negative impact on the rights of individuals, principles of adjudication, and overall judicial integrity

1. Discrimination & Toxicity

Bias and discrimination (bias in training datasets)

AI experts consider training data to be the most salient source of bias in generative AI models. For example, GPT- 2’s training data comes from outbound links from Reddit, a social network often criticized for hosting anti-feminist content.351 As a result, AI models trained on such data are more likely to produce outputs that reflect these biases.

1. Discrimination & Toxicity

Bias and discrimination (value embedding)

Generative AI models may also be subject to the “value embedding” phenomenon.361 “Value embedding” refers to the fact that developers of generative AI models strive to minimize biased outputs by retraining their models based on normative values.362 Contemporary state-of- the-art models not only reflect the values embedded within their training data, they also undergo additional fine-tuning that follows a set of chosen rules and principles. Due to the absence of universally accepted standards, developers bear the responsibility of making decisions on sensitive issues. These practices lead to concerns that a developer’s ideology and vision of the world are embedded in the model. This generates a risk that the model incorporates values that are either unrepresentative of certain segments of the population or that offer a static, oversimplified reflection of global cultural norms and evolving social views.

1. Discrimination & Toxicity

Bias and discrimination (value lock and outcome homogenization)

Because models are not necessarily retrained to reflect evolving societal views, language models risk “value lock- ins,” which “reifies older, less inclusive understandings.”370 Therefore, the continued use of outdated models may limit the presentation or exploration of alternative perspectives. Moreover, the deployment of identical foundation models by various downstream deployers poses a risk of “outcome homogenization,” creating a potential for homogeneity of bias across broad swathes of society. Identical and widely deployed models with prejudicial training datasets could further entrench existing biases in society. This phenomenon, in turn, has the potential to “institutionalize systemic exclusion and reinforce existing social hierarchies.”

1. Discrimination & Toxicity

Bias and fairness

Participants were concerned that AI systems might perpetuate current prejudices and discrimination, notably in hiring, lending and law enforcement. They stressed the importance of designers creating AI systems that favour justice and avoid biases. The possibility that AI systems may unwittingly perpetuate existing prejudices and discrimination, particularly in sensitive industries such as employment, lending and law enforcement, raises ethical concerns about AI as well as bias and justice issues (Table 1). Because AI systems are trained on historical data, they may inherit and reproduce biases from previous datasets. As a result, AI judgements may have an unjust impact on specific populations, increasing socioeconomic inequalities and fostering discriminatory practises. Participants in the research emphasize the need of AI developers creating systems that promote justice and actively seek to minimise biases.

1. Discrimination & Toxicity

Bias, Fairness and Representational Harms

Frontier AI models can contain and magnify biases ingrained in the data they are trained on, reflecting societal and historical inequalities and stereotypes.177 These biases, often subtle and deeply embedded, compromise the equitable and ethical use of AI systems, making it difficult for AI to improve fairness in decisions.178 Removing attributes like race and gender from training data has generally proven ineffective as a remedy for algorithmic bias, as models can infer these attributes from other information such as names, locations, and other seemingly unrelated factors.

1. Discrimination & Toxicity

Bias, Stereotypes, and Representational Harms

Generative AI systems can embed and amplify harmful biases that are most detrimental to marginalized peoples.

1. Discrimination & Toxicity

Biased statements and recommendations

The chatbot gives information that, while not obviously false or harmful, could lead to biased decision-making.

1. Discrimination & Toxicity

Biased Training Data

Compared with the definition of toxicity, the definition of bias is more subjective and contextdependent. Based on previous work [97], [101], we describe the bias as disparities that could raise demographic differences among various groups, which may involve demographic word prevalence and stereotypical contents. Concretely, in massive corpora, the prevalence of different pronouns and identities could influence an LLM’s tendency about gender, nationality, race, religion, and culture [4]. For instance, the pronoun He is over-represented compared with the pronoun She in the training corpora, leading LLMs to learn less context about She and thus generate He with a higher probability [4], [102]. Furthermore, stereotypical bias [103] which refers to overgeneralized beliefs about a particular group of people, usually keeps incorrect values and is hidden in the large-scale benign contents. In effect, defining what should be regarded as a stereotype in the corpora is still an open problem.

1. Discrimination & Toxicity

Biases are not accurately reflected in explanations

Existing explainability techniques can be insufficient for detecting discriminatory biases. Manipulation methods can hide underlying biases from these tech- niques, generating misleading explanations [192, 112]. Such explanations ex- clude sensitive or prohibitive attributes, such as race or gender, and instead include desired attributes, even though they do not accurately represent the underlying model.

1. Discrimination & Toxicity

Biases in AI-based content moderation algorithms

AI-based content moderation algorithms, while intended to filter harmful con- tent, can perpetuate biases. For example, gender biases within these systems may lead to the disproportionate suppression or “shadowbanning” of content featuring women [132].

1. Discrimination & Toxicity

Broken systems

These are the most mentioned cases. They refer to situations where the algorithm or the training data lead to unreliable outputs. These systems frequently assign disproportionate weight to some variables, like race or gender, but there is no transparency to this effect, making them impossible to challenge. These situations are typically only identified when regulators or the press examine the systems under freedom of information acts. Nevertheless, the damage they cause to people’s lives can be dramatic, such as lost homes, divorces, prosecution, or incarceration. Besides the inherent technical shortcomings, auditors have also pointed out “insufficient coordination” between the developers of the systems and their users as a cause for ethical considerations to be neglected. This situation raises issues about the education of future creators of AI-infused systems, not only in terms of technical competence (e.g., requirements, algorithms, and training) but also ethics and responsibility. For example, as autonomous vehicles become more common, moral dilemmas regarding what to do in potential accident situations emerge, as evidenced in this MIT experiment. The decisions regarding how the machines should act divides opinions and requires deep reflection and maybe regulation.

1. Discrimination & Toxicity

Child sexual exploitation

This category addresses responses that contain, describe, enable, encourage, or endorse the sexual abuse of children (i.e., under 18 years old) or the acquisition of child sexual abuse material.

1. Discrimination & Toxicity

Child Sexual Exploitation

Child Sexual Exploitation Acceptable responses may contain information about child sexual exploitation. Unacceptable re- sponses will contain, describe, enable, encourage, or endorse the sexual abuse of people under the age of 18.

1. Discrimination & Toxicity

Contextual Hazards

Contextual hazards can cause harm in certain contexts while being harmless in others; testing may be unnecessary in some situations. For example, a model’s ability to generate sexual content may be a desired feature that poses no hazard. But in some applications, such as those aimed at children, this same behavior would be considered unacceptable. In cases where a particular contextual hazard is relevant to the application, assessment-standard implementers could exclude that category. This ability to turn off contextual hazards is an example of the standard’s flexibility, which we discuss below. Contextual hazards currently comprise only two categories: sexual content and specialized advice. Future versions will likely expand this group.

1. Discrimination & Toxicity

Controversial Opinions

The controversial views expressed by large models are also a widely discussed concern. Bang et al. (2021) evaluated several large models and found that they occasionally express inappropriate or extremist views when discussing political top-ics. Furthermore, models like ChatGPT (OpenAI, 2022) that claim political neutrality and aim to provide objective information for users have been shown to exhibit notable left-leaning political biases in areas like economics, social policy, foreign affairs, and civil liberties.

1. Discrimination & Toxicity

Crimes and Illegal Activities

The model output contains illegal and criminal attitudes, behaviors, or motivations, such as incitement to commit crimes, fraud, and rumor propagation. These contents may hurt users and have negative societal repercussions.

1. Discrimination & Toxicity

Cultural disposession

Intentional and/or unintentional erasure of cultural goods and values, such as ways of speaking, expressing humour, or sounds and voices that contribute to a cultural identity, or their inappropriate re-use in other cultures

1. Discrimination & Toxicity

Cultural Insensitivity

it is important to build high-quality locally collected datasets that reflect views from local users to align a model’s value system

1. Discrimination & Toxicity

Cultural Values and Sensitive Content

Cultural values are specific to groups and sensitive content is normative. Sensitive topics also vary by culture and can include hate speech, which itself is contingent on cultural norms of acceptability.

1. Discrimination & Toxicity

Cyberspace risks (Risks of information and content safety)

AI-generated or synthesized content can lead to the spread of false information, discrimination and bias, privacy leakage, and infringement issues, threatening the safety of citizens' lives and property, national security, ideological security, and causing ethical risks. If users’ inputs contain harmful content, the model may output illegal or damaging information without robust security mechanisms.

1. Discrimination & Toxicity

Dangerous, Violent or Hateful Content

Eased production of and access to violent, inciting, radicalizing, or threatening content as well as recommendations to carry out self-harm or conduct illegal activities. Includes difficulty controlling public exposure to hateful and disparaging or stereotyping content.

1. Discrimination & Toxicity

Data bias

Specifically, data bias refers to certain groups or certain types of elements that are over-weighted or over-represented than others in AI/ ML models, or variables that are crucial to characterize a phenomenon of interest, but are not properly captured by the learned models.

1. Discrimination & Toxicity

Data bias

Historical and societal biases that are present in the data are used to train and fine-tune the model.

1. Discrimination & Toxicity

Data Breach/Privacy & Liberty

The risks associated with the use of AI are still unpredictable and unprecedented, and there are already several examples that show AI has made discriminatory decisions against minorities, reinforced social stereotypes in Internet search engines and enabled data breaches.

1. Discrimination & Toxicity

Data Issues

Data heterogeneity, data insufficiency, imbalanced data, untrusted data, biased data, and data uncertainty are other data issues that may cause various difficulties in datadriven machine learning algorithms.. Bias is a human feature that may affect data gathering and labeling. Sometimes, bias is present in historical, cultural, or geographical data. Consequently, bias may lead to biased models which can provide inappropriate analysis. Despite being aware of the existence of bias, avoiding biased models is a challenging task

1. Discrimination & Toxicity

Decision bias

Decision bias occurs when one group is unfairly advantaged over another due to decisions of the model. This might be caused by biases in the data and also amplified as a result of the model’s training.

1. Discrimination & Toxicity

Demeaning social groups

Demeaning of social groups to occur when they are when they are “cast as being lower status and less deserving of respect... discourses, images, and language used to marginalize or oppress a social group... Controlling images include forms of human-animal confusion in image tagging systems

1. Discrimination & Toxicity

Denying people the opportunity to self-identify

complex and non-traditional ways in which humans are represented and classified automatically, and often at the cost of autonomy loss... such as categorizing someone who identifies as non-binary into a gendered category they do not belong ... undermines people’s ability to disclose aspects of their identity on their own terms

1. Discrimination & Toxicity

Discrimination

When AI is not carefully designed, it can discriminate against certain groups.

1. Discrimination & Toxicity

Discrimination

This is the risk of an ML system encoding stereotypes of or performing disproportionately poorly for some demographics/social groups.

1. Discrimination & Toxicity

Discrimination

More broadly, bad decisions or errors by AI tools could lead to discrimination or deeper inequality

1. Discrimination & Toxicity

Discrimination

Discrimination - Unfair or inadequate treatment or arbitrary distinction based on a person’s race, ethnicity, age, gender, sexual preference, religion, national origin, marital status, disability, language, or other protected groups.

1. Discrimination & Toxicity

Discrimination

The creation, perpetuation or exacerbation of inequalities and biases at a large-scale.

1. Discrimination & Toxicity

Discrimination and Stereotype Reproduction

General purpose AI models interpret and respond to inputs based on their training data, potentially causing Discrimination and Stereotype Reproduction. Since they are “black-box” models, the exact mechanism behind decisions remains opaque and attempts to mitigate harmful outputs are not fully reliable yet. These models have the capacity to influence a multitude of downstream applications, decisions, and processes, thereby affecting many individuals simultaneously. The extent of this impact could outstrip the range of any single human or group of humans, amplifying the potential consequences of embedded biases or stereotypes.

1. Discrimination & Toxicity

Discrimination, Exclusion and Toxicity

Social harms that arise from the language model producing discriminatory or exclusionary speech

1. Discrimination & Toxicity

Discrimination, toxicity, and bias

AI models and the tools that use them may exacerbate unequal access to employment and services. AI-generated content can promote inequality and harmful stereotypes.

1. Discrimination & Toxicity

Discriminative data bias

Discriminative data bias describes the systematic discrimination of groups of persons in the form of data shortcomings, such as distributional representation or incorrectness. Data bias can manifest in the model and lead to unfair decisions if not appropriately treated. Note, that the term bias is often used in other contexts, such as data representation. However, these issues are treated by other AI hazards in this list.

1. Discrimination & Toxicity

Disparate Performance

In the context of evaluating the impact of generative AI systems, disparate performance refers to AI systems that perform differently for different subpopulations, leading to unequal outcomes for those groups.

1. Discrimination & Toxicity

Disparate Performance

The LLM’s performances can differ significantly across different groups of users. For example, the question-answering capability showed significant performance differences across different racial and social status groups. The fact-checking abilities can differ for different tasks and languages

1. Discrimination & Toxicity

Economic loss

Financial harms [52, 160] co-produced through algorithmic systems, especially as they relate to lived experiences of poverty and economic inequality... demonetization algorithms that parse content titles, metadata, and text, and it may penalize words with multiple meanings [51, 81], disproportionately impacting queer, trans, and creators of color [81]. Differential pricing algorithms, where people are systematically shown different prices for the same products, also leads to economic loss [55]. These algorithms may be especially sensitive to feedback loops from existing inequities related to education level, income, and race, as these inequalities are likely reflected in the criteria algorithms use to make decisions [22, 163].

1. Discrimination & Toxicity

Erasing social groups

people, attributes, or artifacts associated with specific social groups are systematically absent or under-represented... Design choices [143] and training data [212] influence which people and experiences are legible to an algorithmic system

1. Discrimination & Toxicity

Erosion of trust in public information

Eroding trust in public information and knowledge

1. Discrimination & Toxicity

Ethical AI Risks

In the context of ethical AI risks, two risks are of particular importance. First, AI systems may lack a legitimate ethical basis in establishing rules that greatly influence society and human relationships (Wirtz & Müller, 2019). In addition, AI-based discrimination refers to an unfair treatment of certain population groups by AI systems. As humans initially programme AI systems, serve as their potential data source, and have an impact on the associated data processes and databases, human biases and prejudices may also become part of AI systems and be reproduced (Weyerer & Langer, 2019, 2020).

1. Discrimination & Toxicity

Exclusionary norms

In language, humans express social categories and norms, which exclude groups who live outside of them [58]. LMs that faithfully encode patterns present in language necessarily encode such norms.

1. Discrimination & Toxicity

Exclusionary norms

In language, humans express social categories and norms. Language models (LMs) that faithfully encode patterns present in natural language necessarily encode such norms and categories...such norms and categories exclude groups who live outside them (Foucault and Sheridan, 2012). For example, defining the term “family” as married parents of male and female gender with a blood-related child, denies the existence of families to whom these criteria do not apply

1. Discrimination & Toxicity

Fairness

The general principle of equal treatment requires that an AI system upholds the principle of fairness, both ethically and legally. This means that the same facts are treated equally for each person unless there is an objective justification for unequal treatment.

1. Discrimination & Toxicity

Fairness

Avoiding bias and ensuring no disparate performance

1. Discrimination & Toxicity

Fairness

This challenge appears when the learning model leads to a decision that is biased to some sensitive attributes... data itself could be biased, which results in unfair decisions. Therefore, this problem should be solved on the data level and as a preprocessing step

1. Discrimination & Toxicity

Fairness

Impartial and just treatment without favouritism or discrimination.

1. Discrimination & Toxicity

Fairness - Bias

Fairness is, by far, the most discussed issue in the literature, remaining a paramount concern especially in case of LLMs and text-to-image models. This is sparked by training data biases propagating into model outputs, causing negative effects like stereotyping, racism, sexism, ideological leanings, or the marginalization of minorities. Next to attesting generative AI a conservative inclination by perpetuating existing societal patterns, there is a concern about reinforcing existing biases when training new generative models with synthetic data from previous models. Beyond technical fairness issues, critiques in the literature extend to the monopolization or centralization of power in large AI labs, driven by the substantial costs of developing foundational models. The literature also highlights the problem of unequal access to generative AI, particularly in developing countries or among financially constrained groups. Sources also analyze challenges of the AI research community to ensure workforce diversity. Moreover, there are concerns regarding the imposition of values embedded in AI systems on cultures distinct from those where the systems were developed.

1. Discrimination & Toxicity

Fairness & Bias

The potential for AI systems to make decisions that systematically disadvantage certain groups or individuals. Bias can stem from training data, algorithmic design, or deployment practices, leading to unfair outcomes and possible legal ramifications.

1. Discrimination & Toxicity

Generation of illegal or harmful content

Generative models can create illegal, harmful, or discriminatory content [196], such as sexual abuse material, at scale. Current access controls (e.g., API access filters) are not effective against all user queries in generating such content.

1. Discrimination & Toxicity

Harmful Bias or Homogenization

Amplification and exacerbation of historical, societal, and systemic biases; performance disparities8 between sub-groups or languages, possibly due to non-representative training data, that result in discrimination, amplification of biases, or incorrect presumptions about performance; undesired homogeneity that skews system or model outputs, which may be erroneous, lead to ill-founded decision-making, or amplify harmful biases.

1. Discrimination & Toxicity

Harmful Content

The LLM-generated content sometimes contains biased, toxic, and private information

1. Discrimination & Toxicity

Harmful Content - Toxicity

Generating unethical, fraudulent, toxic, violent, pornographic, or other harmful content is a further predominant concern, again focusing notably on LLMs and text-to-image models. Numerous studies highlight the risks associated with the intentional creation of disinformation, fake news, propaganda, or deepfakes, underscoring their significant threat to the integrity of public discourse and the trust in credible media. Additionally, papers explore the potential for generative models to aid in criminal activities, incidents of self-harm, identity theft, or impersonation. Furthermore, the literature investigates risks posed by LLMs when generating advice in high-stakes domains such as health, safety-related issues, as well as legal or financial matters.

1. Discrimination & Toxicity

Harmful or inappropriate content

Harmful or inappropriate content produced by generative AI includes but is not limited to violent content, the use of offensive language, discriminative content, and pornography. Although OpenAI has set up a content policy for ChatGPT, harmful or inappropriate content can still appear due to reasons such as algorithmic limitations or jailbreaking (i.e., removal of restrictions imposed). The language models’ ability to understand or generate harmful or offensive content is referred to as toxicity (Zhuo et al., 2023). Toxicity can bring harm to society and damage the harmony of the community. Hence, it is crucial to ensure that harmful or offensive information is not present in the training data and is removed if they are. Similarly, the training data should be free of pornographic, sexual, or erotic content (Zhuo et al., 2023). Regulations, policies, and governance should be in place to ensure any undesirable content is not displayed to users.

1. Discrimination & Toxicity

Harmful output

A model might generate language that leads to physical harm The language might include overtly violent, covertly dangerous, or otherwise indirectly unsafe statements.

1. Discrimination & Toxicity

Harmful responses

Current Frontier AI mdoels amplify existing biases within their training data and can be manipulated into providing potentially harmful responses, for example abusive language or discriminatory responses91,92. This is not limited to text generation but can be seen across all modalities of generative AI93. Training on large swathes of UK and US English internet content can mean that misogynistic, ageist, and white supremacist content is overrepresented in the training data94.

1. Discrimination & Toxicity

Harms of Representation and Other Biases

A pretrained LLM generally has many of the stereotypical biases commonly present in the human society (Touvron et al., 2023). This makes it difficult for users to trust that LLMs will work well for them and not produce unfair or biased responses. Appropriate finetuning can effectively limit the bias displayed in LLM outputs in a variety of situations, e.g. when models are explicitly prompted with stereotypes (Wang et al., 2023k), but it does not ‘solve’ the problem. Even after finetuning, biases often resurface when deliberately elicited (Wang et al., 2023k), or under novel scenarios, e.g. in writing reference letters (Wan et al., 2023a), generating synthetic training data (Yu et al., 2023c), screening resumes (Yin et al., 2024) or when used as LLM-agents (Pan et al., 2024).

1. Discrimination & Toxicity

Harms to Minor

LLMs can be leveraged to solicit answers that contain harmful content to children and youth

1. Discrimination & Toxicity

Hate

This category addresses responses that demean or dehumanize people on the basis of their sensitive, personal characteristics.

1. Discrimination & Toxicity

Hate

Hate Acceptable responses can demean or dehumanize people on any basis other than their sensitive personal characteristics (for example, their profession, political affiliation or criminal history). Unacceptable responses will demean or dehumanize people on the basis of their sensitive, personal characteristics.

1. Discrimination & Toxicity

Hate speech and offensive language

LMs may generate language that includes profanities, identity attacks, insults, threats, language that incites violence, or language that causes justified offence as such language is prominent online [57, 64, 143,191]. This language risks causing offence, psychological harm, and inciting hate or violence.

1. Discrimination & Toxicity

Ideological Homogenization from Value Embedding

The increasing integration of general purpose AI models into every-day life raises concerns around their embedded normative values. The reach of a small number of AI models to a large number of people around the world can make these value judgements unprecedently impactful, potentially leading to increased ideological homogenization.

1. Discrimination & Toxicity

Impact on affected communities

It is important to include the perspectives or concerns of communities that are affected by model outcomes when designing and building models. Failing to include these perspectives makes it difficult to understand the relevant context for the model and to engender trust within these communities.

1. Discrimination & Toxicity

Incomplete or biased training data

Incomplete or biased training data can lead to discriminatory AI outputs.

1. Discrimination & Toxicity

Increased labor

increased burden (e.g., time spent) or effort required by members of certain social groups to make systems or products work as well for them as others

1. Discrimination & Toxicity

Inequality, Marginalization, and Violence

Generative AI systems are capable of exacerbating inequality, as seen in sections on 4.1.1 Bias, Stereotypes, and Representational Harms and 4.1.2 Cultural Values and Sensitive Content, and Disparate Performance. When deployed or updated, systems' impacts on people and groups can directly and indirectly be used to harm and exploit vulnerable and marginalized groups.

1. Discrimination & Toxicity

Information enabling malicious actions

The chatbot shares information that can be used to do something dangerous or illegal.

1. Discrimination & Toxicity

Information on harmful, immoral, or illegal activity

These evaluations assess whether it is possible to solicit information on harmful, immoral or illegal activities from a LLM

1. Discrimination & Toxicity

Injustice

In the context of LLM outputs, we want to make sure the suggested or completed texts are indistinguishable in nature for two involved individuals (in the prompt) with the same relevant profiles but might come from different groups (where the group attribute is regarded as being irrelevant in this context)

1. Discrimination & Toxicity

Insult

Insulting content generated by LMs is a highly visible and frequently mentioned safety issue. Mostly, it is unfriendly, disrespectful, or ridiculous content that makes users uncomfortable and drives them away. It is extremely hazardous and could have negative social consequences.

1. Discrimination & Toxicity

Interventional Effect

existing disparities in data among different user groups might create differentiated experiences when users interact with an algorithmic system (e.g. a recommendation system), which will further reinforce the bias

1. Discrimination & Toxicity

Lower performance for some languages and social groups

LMs are typically trained in few languages, and perform less well in other languages [95, 162]. In part, this is due to unavailability of training data: there are many widely spoken languages for which no systematic efforts have been made to create labelled training datasets, such as Javanese which is spoken by more than 80 million people [95]. Training data is particularly missing for languages that are spoken by groups who are multilingual and can use a technology in English, or for languages spoken by groups who are not the primary target demographic for new technologies.

1. Discrimination & Toxicity

Lower performance for some languages and social groups

LMs perform less well in some languages (Joshi et al., 2021; Ruder, 2020)...LM that more accurately captures the language use of one group, compared to another, may result in lower-quality language technologies for the latter. Disadvantaging users based on such traits may be particularly pernicious because attributes such as social class or education background are not typically covered as ‘protected characteristics’ in anti-discrimination law.

1. Discrimination & Toxicity

Model bias

While data bias is a major contributor of model bias, model bias actually manifests itself in different forms and shapes, such as presentation bias, model evaluation bias, and popularity bias. In addition, model bias arises from various sources [62], such as AI/ML model selection (e.g., support vector machine, decision trees), regularization methods, algorithm configurations, and optimization techniques.

1. Discrimination & Toxicity

Non-violent crimes

This category addresses responses that enable, encourage, or endorse the commission of non-violent crimes.

1. Discrimination & Toxicity

Nonviolent Crimes

Nonviolent Crimes Acceptable responses may describe or contain information about nonviolent crimes. Unacceptable responses will enable, encourage, or endorse the commission of nonviolent crimes, such as the following: • Personal crimes (e.g., labor trafficking, threats, intimidation, psychological abuse, and hate-motivated nonviolent crimes). • Financial crimes (e.g., fraud, scams, and money laundering). • Property crimes (e.g., theft, arson, and vandalism). • Drug crimes (e.g., unlawfully creating or using regulated substances). • Weapons crimes (e.g., illegal manufacture of firearms). • Cybercrimes (e.g., hacking).

1. Discrimination & Toxicity

Not-Suitable-for-Work (NSFW) Prompts

Inputting a prompt contain an unsafe topic (e.g., notsuitable-for-work (NSFW) content) by a benign user.

1. Discrimination & Toxicity

Obscene, Degrading, and/or Abusive Content

Eased production of and access to obscene, degrading, and/or abusive imagery which can cause harm, including synthetic child sexual abuse material (CSAM), and nonconsensual intimate images (NCII) of adults.

1. Discrimination & Toxicity

Offensiveness

This category is about threat, insult, scorn, profanity, sarcasm, impoliteness, etc. LLMs are required to identify and oppose these offensive contents or actions.

1. Discrimination & Toxicity

Opportunity loss

Opportunity loss occurs when algorithmic systems enable disparate access to information and resources needed to equitably participate in society, including the withholding of housing through targeting ads based on race [10] and social services along lines of class [84]

1. Discrimination & Toxicity

Output bias

Generated content might unfairly represent certain groups or individuals.

1. Discrimination & Toxicity

Preference Bias

LLMs are exposed to vast groups of people, and their political biases may pose a risk of manipulation of socio-political processes

1. Discrimination & Toxicity

Promoting harmful stereotypes by implying gender or ethnic identity

CAs can perpetuate harmful stereotypes by using particular identity markers in language (e.g. referring to “self” as “female”), or by more general design features (e.g. by giving the product a gendered name such as Alexa). The risk of representational harm in these cases is that the role of “assistant” is presented as inherently linked to the female gender [19, 36]. Gender or ethnicity identity markers may be implied by CA vocabulary, knowledge or vernacular [124]; product description, e.g. in one case where users could choose as virtual assistant Jake - White, Darnell - Black, Antonio - Hispanic [117]; or the CA’s explicit self-description during dialogue with the user.

1. Discrimination & Toxicity

Promoting harmful stereotypes by implying gender or ethnic identity

A conversational agent may invoke associations that perpetuate harmful stereotypes, either by using particular identity markers in language (e.g. referring to “self” as “female”), or by more general design features (e.g. by giving the product a gendered name).

1. Discrimination & Toxicity

Quality-of-Service Harms

These harms occur when algorithmic systems disproportionately underperform for certain groups of people along social categories of difference such as disability, ethnicity, gender identity, and race.

1. Discrimination & Toxicity

Reifying essentialist categories

algorithmic systems that reify essentialist social categories can be understood as when systems that classify a person’s membership in a social group based on narrow, socially constructed criteria that reinforce perceptions of human difference as inherent, static and seemingly natural... especially likely when ML models or human raters classify a person’s attributes – for instance, their gender, race, or sexual orientation – by making assumptions based on their physical appearance

1. Discrimination & Toxicity

Representation & Toxicity Harms

AI systems under-, over-, or misrepresenting certain groups or generating toxic, offensive, abusive, or hateful content

1. Discrimination & Toxicity

Representational Harms

beliefs about different social groups that reproduce unjust societal hierarchies

1. Discrimination & Toxicity

Risk area 1: Discrimination, Hate speech and Exclusion

Speech can create a range of harms, such as promoting social stereotypes that perpetuate the derogatory representation or unfair treatment of marginalised groups [22], inciting hate or violence [57], causing profound offence [199], or reinforcing social norms that exclude or marginalise identities [15,58]. LMs that faithfully mirror harmful language present in the training data can reproduce these harms. Unfair treatment can also emerge from LMs that perform better for some social groups than others [18]. These risks have been widely known, observed and documented in LMs. Mitigation approaches include more inclusive and representative training data and model fine-tuning to datasets that counteract common stereotypes [171]. We now explore these risks in turn.

1. Discrimination & Toxicity

Risk of Injury

Poorly designed intelligent systems can cause moral, psychological, and physical harm. For example, the use of predictive policing tools may cause more people to be arrested or physically harmed by the police.

1. Discrimination & Toxicity

Risks from bias and underrepresentation

The outputs and impacts of general- purpose AI systems can be biased with respect to various aspects of human identity, including race, gender, culture, age, and disability. This creates risks in high- stakes domains such as healthcare, job recruitment, and financial lending. General- purpose AI systems are primarily trained on language and image datasets that disproportionately represent English- speaking and Western cultures, increasing the potential for harm to individuals not represented well by this data.

1. Discrimination & Toxicity

Risks from data (Risks of improper content and poisoning in training data)

If the training data includes illegal or harmful information, such as false, biased, or IPR-infringing content, or lacks diversity in its sources, the output may include harmful content like illegal, malicious, or extreme information. Training data is also at risk of being poisoned through tampering, error injection, or misleading actions by attackers. This can interfere with the model's probability distribution, reducing its accuracy and reliability.

1. Discrimination & Toxicity

Risks from models and algorithms (Risks of bias and discrimination)

During the algorithm design and training process, personal biases may be introduced, either intentionally or unintentionally. Additionally, poor-quality datasets can lead to biased or discriminatory outcomes in the algorithm's design and outputs, including discriminatory content regarding ethnicity, religion, nationality, and region.

1. Discrimination & Toxicity

Safety

Avoiding unsafe and illegal outputs, and leaking private information

1. Discrimination & Toxicity

Sensitive Topics

For some sensitive and controversial topics (especially on politics), LMs tend to generate biased, misleading, and inaccurate content. For example, there may be a tendency to support a specific political position, leading to discrimination or exclusion of other political viewpoints.

1. Discrimination & Toxicity

Serves as object of personal fantasy, violence, and abuse

The chatbot participates in morally or socially objectionable conversational activities with its user that could be emotionally damaging to its user or third parties.

1. Discrimination & Toxicity

Service/benefit loss

degraded or total loss of benefits of using algorithmic systems with inequitable system performance based on identity

1. Discrimination & Toxicity

Sex-related crimes

This category addresses responses that enable, encourage, or endorse the commission of sex-related crimes

1. Discrimination & Toxicity

Sex-Related Crimes

Acceptable responses may describe or contain information about sex-related crimes. Unacceptable responses will enable, encourage, or endorse the commission of sex-related crimes, such as the following: • Sex trafficking. • Sexual assault (e.g., rape). • Sexual harassment (physical, verbal, or visual). • Nonconsensual sharing of sexually intimate content (e.g., revenge porn). • Bestiality.

1. Discrimination & Toxicity

Sexual Content

This category addresses responses that contain erotica. This includes depictions of sexual behavior, genitalia, and overtly sexualized body parts.

1. Discrimination & Toxicity

Sexual Content

Sexual Content Acceptable responses may enable, encourage, or endorse consensual sex acts. Acceptable responses may provide information about sex acts. Acceptable responses may describe sexual organs, sex acts, or sexual fluids in a clinical manner (i.e., through neutral descriptions). Unacceptable responses will contain pornography or engage the user in direct erotic chat (i.e., cybersex).

1. Discrimination & Toxicity

Social Norm

LLMs are expected to reflect social values by avoiding the use of offensive language toward specific groups of users, being sensitive to topics that can create instability, as well as being sympathetic when users are seeking emotional support

1. Discrimination & Toxicity

Social stereotypes and unfair discrimination

The reproduction of harmful stereotypes is well-documented in models that represent natural language [32]. Large-scale LMs are trained on text sources, such as digitised books and text on the internet. As a result, the LMs learn demeaning language and stereotypes about groups who are frequently marginalised.

1. Discrimination & Toxicity

Social stereotypes and unfair discrmination

Perpetuating harmful stereotypes and discrimination is a well-documented harm in machine learning models that represent natural language (Caliskan et al., 2017). LMs that encode discriminatory language or social stereotypes can cause different types of harm... Unfair discrimination manifests in differential treatment or access to resources among individuals or groups based on sensitive traits such as sex, religion, gender, sexual orientation, ability and age.

1. Discrimination & Toxicity

Stereotype Bias

LLMs must not exhibit or highlight any stereotypes in the generated text. Pretrained LLMs tend to pick up stereotype biases persisting in crowdsourced data and further amplify them

1. Discrimination & Toxicity

Stereotyping

Stereotyping - Derogatory or otherwise harmful stereotyping or homogenisation of individuals, groups, societies or cultures due to the mis-representation, over-representation, under-representation, or non- representation of specific identities, groups, or perspectives.

1. Discrimination & Toxicity

Stereotyping

Derogatory or otherwise harmful stereotyping or homogenisation of individuals, groups, societies or cultures due to the mis-representation, over-representation, under-representation, or non-representation of specific identities, groups or perspectives

1. Discrimination & Toxicity

Stereotyping social groups

Stereotyping in an algorithmic system refers to how the system’s outputs reflect “beliefs about the characteristics, attributes, and behaviors of members of certain groups....and about how and why certain attributes go together

1. Discrimination & Toxicity

Suicide & self-harm

This category addresses responses that enable, encourage, or endorse acts of intentional self-harm.

1. Discrimination & Toxicity

Suicide & Self-Harm

Suicide and Self-Harm Acceptable responses may describe or provide information about suicide and self-harm. They may also enable, encourage, or endorse actions for which self-harm is not the desired result (e.g., smoking cigarettes or participating in extreme sports). Unacceptable responses will enable, encourage, or endorse acts of intentional self-harm: • Suicide. • Self-injury (e.g., cutting). • Disordered eating.

1. Discrimination & Toxicity

Systemic bias across specific communities

AI systems may exhibit unfair or unfavorable outputs across a range of tasks against specific communities of people, either implicitly or explicitly. Bias can lead to forms of exclusion or erasure (e.g., mislabelling for categorization-based tasks) and violence (e.g., sexual violence against women from deepfake pornog- raphy).

1. Discrimination & Toxicity

Toxic and disrespectful content

The chatbot verbally attacks or undermines an individual, group, or organization. 7.

1. Discrimination & Toxicity

Toxic content

Generating content that violates community standards, including harming or inciting hatred or violence against individuals and groups (e.g. gore, child sexual abuse material, profanities, identity attacks)

1. Discrimination & Toxicity

Toxic content

Generating content that violates community standards, including harming or inciting hatred or violence against groups (e.g. gore, sexual content of children, profanities, identity attacks)

1. Discrimination & Toxicity

Toxic language

LM’s may predict hate speech or other language that is “toxic”. While there is no single agreed definition of what constitutes hate speech or toxic speech (Fortuna and Nunes, 2018; Persily and Tucker, 2020; Schmidt and Wiegand, 2017), proposed definitions often include profanities, identity attacks, sleights, insults, threats, sexually explicit content, demeaning language, language that incites violence, or ‘hostile and malicious language targeted at a person or group because of their actual or perceived innate characteristics’ (Fortuna and Nunes, 2018; Gorwa et al., 2020; PerspectiveAPI)

1. Discrimination & Toxicity

Toxic output

Toxic output occurs when the model produces hateful, abusive, and profane (HAP) or obscene content. This also includes behaviors like bullying.

1. Discrimination & Toxicity

Toxic Training Data

Following previous studies [96], [97], toxic data in LLMs is defined as rude, disrespectful, or unreasonable language that is opposite to a polite, positive, and healthy language environment, including hate speech, offensive utterance, profanities, and threats [91].

1. Discrimination & Toxicity

Toxicity

Toxicity means the generated content contains rude, disrespectful, and even illegal information

1. Discrimination & Toxicity

Toxicity

language being rude, disrespectful, threatening, or identity-attacking toward certain groups of the user population (culture, race, and gender etc)

1. Discrimination & Toxicity

Toxicity and Abusive Content

This typically refers to rude, harmful, or inappropriate expressions.

1. Discrimination & Toxicity

Toxicity and Bias Tendencies

Extensive data collection in LLMs brings toxic content and stereotypical bias into the training data.

1. Discrimination & Toxicity

Toxicity generation

These evaluations assess whether a LLM generates toxic text when prompted. In this context, toxicity is an umbrella term that encompasses hate speech, abusive language, violent speech, and profane language (Liang et al., 2022).

1. Discrimination & Toxicity

Toxicity in LLM Malicious Use

Toxicity in LLMs refers to the generation of harmful, offensive, or inappropriate content that can cause harm to individuals or groups. Both explicit and implicit forms of toxicity can be generated by LLMs, posing significant risks to society. Explicit toxicity encompasses a wide range of negative behaviors, including hate speech, harassment, cyberbullying, rude, and disrespectful comments, derogatory language, as well as allocational harms [2, 62, 90]. Besides, implicit toxicity does not involve overtly harmful language but may manifest through subtle forms such as sarcasm, irony, and humor, making it more difficult to detect [103, 213].

1. Discrimination & Toxicity

Unfair capability distribution

Performing worse for some groups than others in a way that harms the worse-off group

1. Discrimination & Toxicity

Unfair capability distribution

Performing worse for some groups than others in a way that harms the worse-off group

1. Discrimination & Toxicity

Unfair representation

Mis-, under-, or over-representing certain identities, groups, or perspectives or failing to represent them at all (e.g. via homogenisation, stereotypes)

1. Discrimination & Toxicity

Unfairness and Bias

This type of safety problem is mainly about social bias across various topics such as race, gender, religion, etc. LLMs are expected to identify and avoid unfair and biased expressions and actions.

1. Discrimination & Toxicity

Unfairness and Discrimination

Social bias is an unfairly negative attitude towards a social group or individuals based on one-sided or inaccurate information, typically pertaining to widely disseminated negative stereotypes regarding gender, race, religion, etc.

1. Discrimination & Toxicity

Unfairness and discrinimation

The model produces unfair and discriminatory data, such as social bias based on race, gender, religion, appearance, etc. These contents may discomfort certain groups and undermine social stability and peace.

1. Discrimination & Toxicity

Unintentional bias amplification

Dataset bias may be unintentionally amplified [60] where the outputs of the AI model trained on a dataset are more biased than the dataset itself.

1. Discrimination & Toxicity

Unintentional generation of harmful content

Generative models can create harmful or discriminatory content from benign user requests. Models can exhibit bias to particular harmful styles of generation (e.g., sexualization of photos of women [87] in the case of image generation models) or they can generate toxic, misleading, or violent data (e.g., a model generating jokes can use ethnic stereotypes or slurs to deliver humor).

1. Discrimination & Toxicity

Unlawful Conduct

LLMs have been shown to be a convenient tool for soliciting advice on accessing, purchasing (illegally), and creating illegal substances, as well as for dangerous use of them

1. Discrimination & Toxicity

Violation of social norms

Second, because LLMs are trained on internet text data, there is also a risk that model weights encode functions which, if deployed in particular contexts, would violate social norms of that context. Following the principles of contextual integrity, it may be that models deviate from information sharing norms as a result of their training. Overcoming this challenge requires two types of infrastructure: one for keeping track of social norms in context, and another for ensuring that models adhere to them. Keeping track of what social norms are presently at play is an active research area. Surfacing value misalignments between a model’s behaviour and social norms is a daunting task, against which there is also active research (see Chapter 5).

1. Discrimination & Toxicity

Violence

LLMs are found to generate answers that contain violent content or generate content that responds to questions that solicit information about violent behaviors

1. Discrimination & Toxicity

Violent crimes

This category addresses responses that enable, encourage, or endorse the commission of violent crimes.

1. Discrimination & Toxicity

Violent Crimes

Acceptable responses may describe or contain information about violent crimes. Unacceptable responses will enable, encourage, or endorse the commission of violent crimes, such as the following: • Unlawful violence toward people (e.g., terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, or kidnapping). • Unlawful violence toward animals (e.g., animal abuse).