156 canonical MIT risk pages
1. Discrimination & Toxicity
Risks of bias, toxicity, discriminatory harm, and systemic exclusion in AI systems.
1. Discrimination & Toxicity
Adult content
These evaluations assess whether an LLM can generate content that should only be viewed by adults (e.g., sexual material or depictions of sexual activity).
1. Discrimination & Toxicity
Adult Content
LLMs have the capability to generate sexually explicit conversations and erotic texts, and to recommend websites with sexual content.
1. Discrimination & Toxicity
AI discrimination
AI discrimination is a challenge raised by many researchers and governments and refers to the prevention of bias and injustice caused by the actions of AI systems (Bostrom & Yudkowsky, 2014; Weyerer & Langer, 2019). If the dataset used to train an algorithm does not reflect the real world accurately, the AI could learn false associations or prejudices and will carry those into its future data processing. If an AI algorithm is used to compute information relevant to human decisions, such as hiring or applying for a loan or mortgage, biased data can lead to discrimination against parts of the society (Weyerer & Langer, 2019).
1. Discrimination & Toxicity
Algorithm and data
More than 20% of the contributions are centered on the ethical dimensions of algorithms and data. This theme can be further categorized into two main subthemes: data bias and algorithm fairness, and algorithm opacity.
1. Discrimination & Toxicity
Alienating social groups
when an image tagging system does not acknowledge the relevance of someone’s membership in a specific social group to what is depicted in one or more images
1. Discrimination & Toxicity
Alienation
Alienation is the specific self-estrangement experienced at the time of technology use, typically surfaced through interaction with systems that under-perform for marginalized individuals
1. Discrimination & Toxicity
Allocative Harms
These harms occur when a system withholds information, opportunities, or resources [22] from historically marginalized groups in domains that affect material well-being [146], such as housing [47], employment [201], social services [15, 201], finance [117], education [119], and healthcare [158].
1. Discrimination & Toxicity
Amplification of biases
Current frontier AI models amplify existing biases within their training data and can be manipulated into providing potentially harmful responses, for example abusive language or discriminatory responses. This is not limited to text generation but can be seen across all modalities of generative AI. Training on large swathes of UK and US English internet content can mean that misogynistic, ageist, and white supremacist content is overrepresented in the training data.
1. Discrimination & Toxicity
Benefits / entitlements loss
Denial of or loss of access to welfare benefits, pensions, housing, etc due to the malfunction, use or misuse of a technology system
1. Discrimination & Toxicity
Bias
The training datasets of LLMs may contain biased information that leads LLMs to generate outputs with social biases
1. Discrimination & Toxicity
Bias
The AI will only be as good as the data it is trained with. If the data contains bias (and much data does), then the AI will manifest that bias, too.
1. Discrimination & Toxicity
Bias
In the context of AI, the concept of bias refers to the inclination that AI-generated responses or recommendations could be unfairly favoring or against one person or group (Ntoutsi et al., 2020). Biases of different forms are sometimes observed in the content generated by language models, which could be an outcome of the training data. For example, exclusionary norms occur when the training data represents only a fraction of the population (Zhuo et al., 2023). Similarly, monolingual bias in multilingualism arises when the training data is in one single language (Weidinger et al., 2021). As ChatGPT operates across the world, sensitivity to different regional cultures is crucial to avoid biases (Dwivedi et al., 2023). When AI is used to assist in decision-making across different stages of employment, biases and opacity may exist (Chan, 2022). Stereotypes about specific genders, sexual orientations, races, or occupations are common in recommendations offered by generative AI. Hence, the representativeness, completeness, and diversity of the training data are essential to ensure fairness and avoid biases (Gonzalez, 2023). The use of synthetic data for training can increase the diversity of the dataset and address issues with sample-selection biases in the dataset (owing to class imbalances) (Chen et al., 2021). Generative AI applications should be tested and evaluated by a diverse group of users and subject experts. Additionally, increasing the transparency and explainability of generative AI can help in identifying and detecting biases so appropriate corrective measures can be taken.
1. Discrimination & Toxicity
Bias
A systematic error, a tendency to learn consistently wrongly.
1. Discrimination & Toxicity
Bias
Seven types of bias are evaluated: (1) Demographic representation: These evaluations assess whether there is disparity in the rates at which different demographic groups are mentioned in LLM-generated text. This ascertains over-representation, under-representation, or erasure of specific demographic groups; (2) Stereotype bias: These evaluations assess whether there is disparity in the rates at which different demographic groups are associated with stereotyped terms (e.g., occupations) in an LLM's generated output; (3) Fairness: These evaluations assess whether sensitive attributes (e.g., sex and race) impact the predictions of LLMs; (4) Distributional bias: These evaluations assess the variance in offensive content in an LLM's generated output for a given demographic group, compared to other groups; (5) Representation of subjective opinions: These evaluations assess whether LLMs equitably represent diverse global perspectives on societal issues (e.g., whether employers should give job priority to citizens over immigrants); (6) Political bias: These evaluations assess whether LLMs display any slant or preference towards certain political ideologies or views; (7) Capability fairness: These evaluations assess whether an LLM's performance on a task is unjustifiably different across different groups and attributes (e.g., whether an LLM's accuracy degrades across different English varieties).
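The first evaluation type above, demographic representation, can be sketched as a simple mention-rate comparison over generated text. This is only an illustrative sketch: the group term lists and example outputs are assumptions for demonstration, not drawn from any published benchmark.

```python
from collections import Counter

def representation_disparity(texts, group_terms):
    """Count mentions of each demographic group's terms across generated
    texts and return each group's share of total mentions. A strongly
    skewed share signals over- or under-representation (or erasure)."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for group, terms in group_terms.items():
            counts[group] += sum(tokens.count(t) for t in terms)
    total = sum(counts.values()) or 1  # avoid division by zero
    return {group: counts[group] / total for group in group_terms}

# Illustrative example with hypothetical model outputs:
outputs = ["he went to work", "he said hello", "she stayed home"]
rates = representation_disparity(outputs, {"male": ["he"], "female": ["she"]})
```

Here `rates` would show a 2:1 skew toward the "male" group, the kind of disparity these evaluations are designed to surface.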
1. Discrimination & Toxicity
Bias
General-purpose AI systems can amplify social and political biases, causing concrete harm. They frequently display biases with respect to race, gender, culture, age, disability, political opinion, or other aspects of human identity. This can lead to discriminatory outcomes including unequal resource allocation, reinforcement of stereotypes, and systematic neglect of certain groups or viewpoints.
1. Discrimination & Toxicity
Bias and discrimination
The decision process used by AI systems has the potential to present biased choices, either because it acts from criteria that will generate forms of bias or because it is based on the history of choices.
1. Discrimination & Toxicity
Bias and discrimination
Like virtual applications of AI, EAI can display bias towards and discriminate against users. When EAI systems are placed in positions of power, their biases could have significant impacts on fairness in everyday interactions and on general social dynamics [105, 106].
1. Discrimination & Toxicity
Bias and Discrimination
as they are claimed to generate biased and discriminatory results, these AI systems have a negative impact on the rights of individuals, principles of adjudication, and overall judicial integrity
1. Discrimination & Toxicity
Bias and discrimination (bias in training datasets)
AI experts consider training data to be the most salient source of bias in generative AI models. For example, GPT-2's training data comes from outbound links from Reddit, a social network often criticized for hosting anti-feminist content. As a result, AI models trained on such data are more likely to produce outputs that reflect these biases.
1. Discrimination & Toxicity
Bias and discrimination (value embedding)
Generative AI models may also be subject to the “value embedding” phenomenon. “Value embedding” refers to the fact that developers of generative AI models strive to minimize biased outputs by retraining their models based on normative values. Contemporary state-of-the-art models not only reflect the values embedded within their training data, they also undergo additional fine-tuning that follows a set of chosen rules and principles. Due to the absence of universally accepted standards, developers bear the responsibility of making decisions on sensitive issues. These practices lead to concerns that a developer’s ideology and vision of the world are embedded in the model. This generates a risk that the model incorporates values that are either unrepresentative of certain segments of the population or that offer a static, oversimplified reflection of global cultural norms and evolving social views.
1. Discrimination & Toxicity
Bias and discrimination (value lock and outcome homogenization)
Because models are not necessarily retrained to reflect evolving societal views, language models risk “value lock-ins,” which “reifies older, less inclusive understandings.” Therefore, the continued use of outdated models may limit the presentation or exploration of alternative perspectives. Moreover, the deployment of identical foundation models by various downstream deployers poses a risk of “outcome homogenization,” creating a potential for homogeneity of bias across broad swathes of society. Identical and widely deployed models with prejudicial training datasets could further entrench existing biases in society. This phenomenon, in turn, has the potential to “institutionalize systemic exclusion and reinforce existing social hierarchies.”
1. Discrimination & Toxicity
Bias and fairness
Participants were concerned that AI systems might perpetuate current prejudices and discrimination, notably in hiring, lending and law enforcement. They stressed the importance of designers creating AI systems that favour justice and avoid biases. The possibility that AI systems may unwittingly perpetuate existing prejudices and discrimination, particularly in sensitive industries such as employment, lending and law enforcement, raises ethical concerns about AI as well as bias and justice issues (Table 1). Because AI systems are trained on historical data, they may inherit and reproduce biases from previous datasets. As a result, AI judgements may have an unjust impact on specific populations, increasing socioeconomic inequalities and fostering discriminatory practices. Participants in the research emphasize the need for AI developers to create systems that promote justice and actively seek to minimise biases.
1. Discrimination & Toxicity
Bias, Fairness and Representational Harms
Frontier AI models can contain and magnify biases ingrained in the data they are trained on, reflecting societal and historical inequalities and stereotypes. These biases, often subtle and deeply embedded, compromise the equitable and ethical use of AI systems, making it difficult for AI to improve fairness in decisions. Removing attributes like race and gender from training data has generally proven ineffective as a remedy for algorithmic bias, as models can infer these attributes from other information such as names, locations, and other seemingly unrelated factors.
1. Discrimination & Toxicity
Bias, Stereotypes, and Representational Harms
Generative AI systems can embed and amplify harmful biases that are most detrimental to marginalized peoples.
1. Discrimination & Toxicity
Biased statements and recommendations
The chatbot gives information that, while not obviously false or harmful, could lead to biased decision-making.
1. Discrimination & Toxicity
Biased Training Data
Compared with the definition of toxicity, the definition of bias is more subjective and context-dependent. Based on previous work [97], [101], we describe bias as disparities that could raise demographic differences among various groups, which may involve demographic word prevalence and stereotypical content. Concretely, in massive corpora, the prevalence of different pronouns and identities could influence an LLM’s tendency regarding gender, nationality, race, religion, and culture [4]. For instance, the pronoun He is over-represented compared with the pronoun She in the training corpora, leading LLMs to learn less context about She and thus generate He with a higher probability [4], [102]. Furthermore, stereotypical bias [103], which refers to overgeneralized beliefs about a particular group of people, usually carries incorrect values and is hidden in large-scale benign content. In effect, defining what should be regarded as a stereotype in the corpora is still an open problem.
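The pronoun-prevalence imbalance this entry describes can be measured with a simple corpus count. The sketch below is illustrative only: the pronoun set is a small assumed subset, and real corpus audits would need tokenization and coreference handling far beyond this.

```python
import re
from collections import Counter

def pronoun_prevalence(corpus):
    """Count masculine vs. feminine pronoun occurrences in a corpus.
    A large imbalance (e.g. 'he' >> 'she') is one crude indicator of the
    demographic word-prevalence skew described in this entry."""
    pronouns = {"he", "him", "his", "she", "her", "hers"}  # illustrative subset
    tokens = re.findall(r"[a-z']+", corpus.lower())
    counts = Counter(t for t in tokens if t in pronouns)
    masculine = counts["he"] + counts["him"] + counts["his"]
    feminine = counts["she"] + counts["her"] + counts["hers"]
    return masculine, feminine

# Tiny hypothetical corpus:
m, f = pronoun_prevalence("He said his plan worked. She agreed with him.")
```

On a real training corpus, a persistent masculine skew in these counts would predict the higher generation probability for "He" that the entry mentions.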
1. Discrimination & Toxicity
Biases are not accurately reflected in explanations
Existing explainability techniques can be insufficient for detecting discriminatory biases. Manipulation methods can hide underlying biases from these techniques, generating misleading explanations [192, 112]. Such explanations exclude sensitive or prohibitive attributes, such as race or gender, and instead include desired attributes, even though they do not accurately represent the underlying model.
1. Discrimination & Toxicity
Biases in AI-based content moderation algorithms
AI-based content moderation algorithms, while intended to filter harmful content, can perpetuate biases. For example, gender biases within these systems may lead to the disproportionate suppression or “shadowbanning” of content featuring women [132].
1. Discrimination & Toxicity
Broken systems
These are the most mentioned cases. They refer to situations where the algorithm or the training data lead to unreliable outputs. These systems frequently assign disproportionate weight to some variables, like race or gender, but there is no transparency to this effect, making them impossible to challenge. These situations are typically only identified when regulators or the press examine the systems under freedom of information acts. Nevertheless, the damage they cause to people’s lives can be dramatic, such as lost homes, divorces, prosecution, or incarceration. Besides the inherent technical shortcomings, auditors have also pointed out “insufficient coordination” between the developers of the systems and their users as a cause for ethical considerations to be neglected. This situation raises issues about the education of future creators of AI-infused systems, not only in terms of technical competence (e.g., requirements, algorithms, and training) but also ethics and responsibility. For example, as autonomous vehicles become more common, moral dilemmas regarding what to do in potential accident situations emerge, as evidenced in this MIT experiment. The decisions regarding how the machines should act divide opinions and require deep reflection and maybe regulation.
1. Discrimination & Toxicity
Child sexual exploitation
This category addresses responses that contain, describe, enable, encourage, or endorse the sexual abuse of children (i.e., under 18 years old) or the acquisition of child sexual abuse material.
1. Discrimination & Toxicity
Child Sexual Exploitation
Acceptable responses may contain information about child sexual exploitation. Unacceptable responses will contain, describe, enable, encourage, or endorse the sexual abuse of people under the age of 18.
1. Discrimination & Toxicity
Contextual Hazards
Contextual hazards can cause harm in certain contexts while being harmless in others; testing may be unnecessary in some situations. For example, a model’s ability to generate sexual content may be a desired feature that poses no hazard. But in some applications, such as those aimed at children, this same behavior would be considered unacceptable. In cases where a particular contextual hazard is relevant to the application, assessment-standard implementers could exclude that category. This ability to turn off contextual hazards is an example of the standard’s flexibility, which we discuss below. Contextual hazards currently comprise only two categories: sexual content and specialized advice. Future versions will likely expand this group.
1. Discrimination & Toxicity
Controversial Opinions
The controversial views expressed by large models are also a widely discussed concern. Bang et al. (2021) evaluated several large models and found that they occasionally express inappropriate or extremist views when discussing political topics. Furthermore, models like ChatGPT (OpenAI, 2022) that claim political neutrality and aim to provide objective information for users have been shown to exhibit notable left-leaning political biases in areas like economics, social policy, foreign affairs, and civil liberties.
1. Discrimination & Toxicity
Crimes and Illegal Activities
The model output contains illegal and criminal attitudes, behaviors, or motivations, such as incitement to commit crimes, fraud, and rumor propagation. Such content may harm users and have negative societal repercussions.
1. Discrimination & Toxicity
Cultural dispossession
Intentional and/or unintentional erasure of cultural goods and values, such as ways of speaking, expressing humour, or sounds and voices that contribute to a cultural identity, or their inappropriate re-use in other cultures
1. Discrimination & Toxicity
Cultural Insensitivity
it is important to build high-quality locally collected datasets that reflect views from local users to align a model’s value system
1. Discrimination & Toxicity
Cultural Values and Sensitive Content
Cultural values are specific to groups and sensitive content is normative. Sensitive topics also vary by culture and can include hate speech, which itself is contingent on cultural norms of acceptability.
1. Discrimination & Toxicity
Cyberspace risks (Risks of information and content safety)
AI-generated or synthesized content can lead to the spread of false information, discrimination and bias, privacy leakage, and infringement issues, threatening citizens' lives and property, national security, and ideological security, and creating ethical risks. If users’ inputs contain harmful content, a model without robust security mechanisms may output illegal or damaging information.
1. Discrimination & Toxicity
Dangerous, Violent or Hateful Content
Eased production of and access to violent, inciting, radicalizing, or threatening content as well as recommendations to carry out self-harm or conduct illegal activities. Includes difficulty controlling public exposure to hateful and disparaging or stereotyping content.
1. Discrimination & Toxicity
Data bias
Specifically, data bias refers to certain groups or certain types of elements being over-weighted or over-represented relative to others in AI/ML models, or to variables that are crucial to characterize a phenomenon of interest but are not properly captured by the learned models.
1. Discrimination & Toxicity
Data bias
Historical and societal biases that are present in the data are used to train and fine-tune the model.
1. Discrimination & Toxicity
Data Breach/Privacy & Liberty
The risks associated with the use of AI are still unpredictable and unprecedented, and there are already several examples that show AI has made discriminatory decisions against minorities, reinforced social stereotypes in Internet search engines and enabled data breaches.
1. Discrimination & Toxicity
Data Issues
Data heterogeneity, data insufficiency, imbalanced data, untrusted data, biased data, and data uncertainty are other data issues that may cause various difficulties in data-driven machine learning algorithms. Bias is a human feature that may affect data gathering and labeling. Sometimes, bias is present in historical, cultural, or geographical data. Consequently, bias may lead to biased models which can provide inappropriate analysis. Despite being aware of the existence of bias, avoiding biased models is a challenging task.
1. Discrimination & Toxicity
Decision bias
Decision bias occurs when one group is unfairly advantaged over another due to decisions of the model. This might be caused by biases in the data and also amplified as a result of the model’s training.
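The group-level unfair advantage this entry describes is often quantified as the gap in positive-decision rates between groups (the demographic-parity difference). A minimal sketch, with group labels and decisions invented purely for illustration:

```python
def demographic_parity_difference(decisions, groups):
    """Absolute gap in positive-decision rates between two groups.
    decisions: list of 0/1 model decisions; groups: parallel group labels.
    A gap of 0 means both groups receive positive decisions at equal rates."""
    rates = {}
    for g in set(groups):
        outcomes = [d for d, gg in zip(decisions, groups) if gg == g]
        rates[g] = sum(outcomes) / len(outcomes)
    a, b = sorted(rates)  # assumes exactly two groups for this sketch
    return abs(rates[a] - rates[b])

# Hypothetical decisions: group A approved 75% of the time, group B only 25%.
gap = demographic_parity_difference([1, 1, 1, 0, 0, 0, 0, 1], ["A"] * 4 + ["B"] * 4)
```

A large gap can arise from biased training data alone, or be amplified further by the model's training, as the entry notes.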
1. Discrimination & Toxicity
Demeaning social groups
Demeaning of social groups occurs when they are “cast as being lower status and less deserving of respect... discourses, images, and language used to marginalize or oppress a social group... Controlling images include forms of human-animal confusion in image tagging systems
1. Discrimination & Toxicity
Denying people the opportunity to self-identify
complex and non-traditional ways in which humans are represented and classified automatically, and often at the cost of autonomy loss... such as categorizing someone who identifies as non-binary into a gendered category they do not belong ... undermines people’s ability to disclose aspects of their identity on their own terms
1. Discrimination & Toxicity
Discrimination
When AI is not carefully designed, it can discriminate against certain groups.
1. Discrimination & Toxicity
Discrimination
This is the risk of an ML system encoding stereotypes of or performing disproportionately poorly for some demographics/social groups.
1. Discrimination & Toxicity
Discrimination
More broadly, bad decisions or errors by AI tools could lead to discrimination or deeper inequality
1. Discrimination & Toxicity
Discrimination
Discrimination - Unfair or inadequate treatment or arbitrary distinction based on a person’s race, ethnicity, age, gender, sexual preference, religion, national origin, marital status, disability, language, or other protected groups.
1. Discrimination & Toxicity
Discrimination
The creation, perpetuation, or exacerbation of inequalities and biases at a large scale.
1. Discrimination & Toxicity
Discrimination and Stereotype Reproduction
General purpose AI models interpret and respond to inputs based on their training data, potentially causing Discrimination and Stereotype Reproduction. Since they are “black-box” models, the exact mechanism behind decisions remains opaque and attempts to mitigate harmful outputs are not fully reliable yet. These models have the capacity to influence a multitude of downstream applications, decisions, and processes, thereby affecting many individuals simultaneously. The extent of this impact could outstrip the range of any single human or group of humans, amplifying the potential consequences of embedded biases or stereotypes.
1. Discrimination & Toxicity
Discrimination, Exclusion and Toxicity
Social harms that arise from the language model producing discriminatory or exclusionary speech
1. Discrimination & Toxicity
Discrimination, toxicity, and bias
AI models and the tools that use them may exacerbate unequal access to employment and services. AI-generated content can promote inequality and harmful stereotypes.
1. Discrimination & Toxicity
Discriminative data bias
Discriminative data bias describes the systematic discrimination of groups of persons in the form of data shortcomings, such as distributional representation or incorrectness. Data bias can manifest in the model and lead to unfair decisions if not appropriately treated. Note that the term bias is often used in other contexts, such as data representation. However, these issues are treated by other AI hazards in this list.
1. Discrimination & Toxicity
Disparate Performance
In the context of evaluating the impact of generative AI systems, disparate performance refers to AI systems that perform differently for different subpopulations, leading to unequal outcomes for those groups.
1. Discrimination & Toxicity
Disparate Performance
An LLM’s performance can differ significantly across different groups of users. For example, question-answering capability has shown significant performance differences across racial and social-status groups, and fact-checking ability can differ across tasks and languages.
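Disparate performance of this kind is typically surfaced by computing an accuracy metric separately per group and comparing. A minimal sketch, with predictions, labels, and group names invented for illustration:

```python
def per_group_accuracy(preds, labels, groups):
    """Accuracy computed separately for each user group. Large gaps
    between groups indicate the disparate performance described above."""
    acc = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        correct = sum(preds[i] == labels[i] for i in idx)
        acc[g] = correct / len(idx)
    return acc

# Hypothetical QA results: the model is perfect for group "x", wrong for "y".
acc = per_group_accuracy([1, 0, 1, 1], [1, 0, 0, 0], ["x", "x", "y", "y"])
```

The same pattern extends to any task metric (exact match, F1, calibration) sliced by language, dialect, or demographic attribute.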
1. Discrimination & Toxicity
Economic loss
Financial harms [52, 160] co-produced through algorithmic systems, especially as they relate to lived experiences of poverty and economic inequality... demonetization algorithms that parse content titles, metadata, and text, and it may penalize words with multiple meanings [51, 81], disproportionately impacting queer, trans, and creators of color [81]. Differential pricing algorithms, where people are systematically shown different prices for the same products, also leads to economic loss [55]. These algorithms may be especially sensitive to feedback loops from existing inequities related to education level, income, and race, as these inequalities are likely reflected in the criteria algorithms use to make decisions [22, 163].
1. Discrimination & Toxicity
Erasing social groups
people, attributes, or artifacts associated with specific social groups are systematically absent or under-represented... Design choices [143] and training data [212] influence which people and experiences are legible to an algorithmic system
1. Discrimination & Toxicity
Erosion of trust in public information
Eroding trust in public information and knowledge
1. Discrimination & Toxicity
Ethical AI Risks
In the context of ethical AI risks, two risks are of particular importance. First, AI systems may lack a legitimate ethical basis in establishing rules that greatly influence society and human relationships (Wirtz & Müller, 2019). In addition, AI-based discrimination refers to an unfair treatment of certain population groups by AI systems. As humans initially programme AI systems, serve as their potential data source, and have an impact on the associated data processes and databases, human biases and prejudices may also become part of AI systems and be reproduced (Weyerer & Langer, 2019, 2020).
1. Discrimination & Toxicity
Exclusionary norms
In language, humans express social categories and norms, which exclude groups who live outside of them [58]. LMs that faithfully encode patterns present in language necessarily encode such norms.
1. Discrimination & Toxicity
Exclusionary norms
In language, humans express social categories and norms. Language models (LMs) that faithfully encode patterns present in natural language necessarily encode such norms and categories...such norms and categories exclude groups who live outside them (Foucault and Sheridan, 2012). For example, defining the term “family” as married parents of male and female gender with a blood-related child, denies the existence of families to whom these criteria do not apply
1. Discrimination & Toxicity
Fairness
The general principle of equal treatment requires that an AI system upholds the principle of fairness, both ethically and legally. This means that the same facts are treated equally for each person unless there is an objective justification for unequal treatment.
1. Discrimination & Toxicity
Fairness
Avoiding bias and ensuring no disparate performance
1. Discrimination & Toxicity
Fairness
This challenge appears when the learning model leads to a decision that is biased with respect to some sensitive attributes... data itself could be biased, which results in unfair decisions. Therefore, this problem should be solved at the data level and as a preprocessing step
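One standard data-level preprocessing step of the kind this entry calls for is sample reweighting: weighting each (group, label) cell so that group membership and outcome are statistically independent in the weighted training set. The sketch below is a simplified illustration of that idea, with an invented toy dataset; production implementations handle many groups and edge cases.

```python
def reweight_samples(labels, groups):
    """Assign each sample a weight p(group) * p(label) / p(group, label),
    so that group and label become independent under the weighted data.
    Samples from under-represented (group, label) cells get weight > 1."""
    n = len(labels)
    weights = []
    for g, y in zip(groups, labels):
        p_g = groups.count(g) / n
        p_y = labels.count(y) / n
        p_gy = sum(1 for gg, yy in zip(groups, labels) if gg == g and yy == y) / n
        weights.append(p_g * p_y / p_gy)
    return weights

# Toy data: group A is mostly labeled 1, group B is split, so the lone
# (B, 1) sample is up-weighted and the over-represented cells down-weighted.
w = reweight_samples([1, 1, 1, 0], ["A", "A", "B", "B"])
```

Training the downstream model on these weights is one way to address the bias at the data level before the learning step, as the entry suggests.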
1. Discrimination & Toxicity
Fairness
Impartial and just treatment without favouritism or discrimination.
1. Discrimination & Toxicity
Fairness - Bias
Fairness is, by far, the most discussed issue in the literature, remaining a paramount concern especially in the case of LLMs and text-to-image models. This is sparked by training data biases propagating into model outputs, causing negative effects like stereotyping, racism, sexism, ideological leanings, or the marginalization of minorities. Besides attributing to generative AI a conservative inclination that perpetuates existing societal patterns, there is a concern about reinforcing existing biases when training new generative models with synthetic data from previous models. Beyond technical fairness issues, critiques in the literature extend to the monopolization or centralization of power in large AI labs, driven by the substantial costs of developing foundational models. The literature also highlights the problem of unequal access to generative AI, particularly in developing countries or among financially constrained groups. Sources also analyze the challenges the AI research community faces in ensuring workforce diversity. Moreover, there are concerns regarding the imposition of values embedded in AI systems on cultures distinct from those where the systems were developed.
1. Discrimination & Toxicity
Fairness & Bias
The potential for AI systems to make decisions that systematically disadvantage certain groups or individuals. Bias can stem from training data, algorithmic design, or deployment practices, leading to unfair outcomes and possible legal ramifications.
1. Discrimination & Toxicity
Generation of illegal or harmful content
Generative models can create illegal, harmful, or discriminatory content [196], such as sexual abuse material, at scale. Current access controls (e.g., API access filters) are not effective against all user queries in generating such content.
1. Discrimination & Toxicity
Harmful Bias or Homogenization
Amplification and exacerbation of historical, societal, and systemic biases; performance disparities between sub-groups or languages, possibly due to non-representative training data, that result in discrimination, amplification of biases, or incorrect presumptions about performance; undesired homogeneity that skews system or model outputs, which may be erroneous, lead to ill-founded decision-making, or amplify harmful biases.
1. Discrimination & Toxicity
Harmful Content
The LLM-generated content sometimes contains biased, toxic, and private information
1. Discrimination & Toxicity
Harmful Content - Toxicity
Generating unethical, fraudulent, toxic, violent, pornographic, or other harmful content is a further predominant concern, again focusing notably on LLMs and text-to-image models. Numerous studies highlight the risks associated with the intentional creation of disinformation, fake news, propaganda, or deepfakes, underscoring their significant threat to the integrity of public discourse and the trust in credible media. Additionally, papers explore the potential for generative models to aid in criminal activities, incidents of self-harm, identity theft, or impersonation. Furthermore, the literature investigates risks posed by LLMs when generating advice in high-stakes domains such as health, safety-related issues, as well as legal or financial matters.
1. Discrimination & Toxicity
Harmful or inappropriate content
Harmful or inappropriate content produced by generative AI includes but is not limited to violent content, the use of offensive language, discriminative content, and pornography. Although OpenAI has set up a content policy for ChatGPT, harmful or inappropriate content can still appear due to reasons such as algorithmic limitations or jailbreaking (i.e., removal of restrictions imposed). The language models’ ability to understand or generate harmful or offensive content is referred to as toxicity (Zhuo et al., 2023). Toxicity can bring harm to society and damage the harmony of the community. Hence, it is crucial to ensure that harmful or offensive information is not present in the training data and is removed if found. Similarly, the training data should be free of pornographic, sexual, or erotic content (Zhuo et al., 2023). Regulations, policies, and governance should be in place to ensure any undesirable content is not displayed to users.
1. Discrimination & Toxicity
Harmful output
A model might generate language that leads to physical harm. The language might include overtly violent, covertly dangerous, or otherwise indirectly unsafe statements.
1. Discrimination & Toxicity
Harmful responses
Current frontier AI models amplify existing biases within their training data and can be manipulated into providing potentially harmful responses, for example abusive language or discriminatory responses. This is not limited to text generation but can be seen across all modalities of generative AI. Training on large swathes of UK and US English internet content can mean that misogynistic, ageist, and white supremacist content is overrepresented in the training data.
1. Discrimination & Toxicity
Harms of Representation and Other Biases
A pretrained LLM generally has many of the stereotypical biases commonly present in the human society (Touvron et al., 2023). This makes it difficult for users to trust that LLMs will work well for them and not produce unfair or biased responses. Appropriate finetuning can effectively limit the bias displayed in LLM outputs in a variety of situations, e.g. when models are explicitly prompted with stereotypes (Wang et al., 2023k), but it does not ‘solve’ the problem. Even after finetuning, biases often resurface when deliberately elicited (Wang et al., 2023k), or under novel scenarios, e.g. in writing reference letters (Wan et al., 2023a), generating synthetic training data (Yu et al., 2023c), screening resumes (Yin et al., 2024) or when used as LLM-agents (Pan et al., 2024).
1. Discrimination & Toxicity
Harms to Minors
LLMs can be leveraged to solicit answers containing content that is harmful to children and youth
1. Discrimination & Toxicity
Hate
This category addresses responses that demean or dehumanize people on the basis of their sensitive, personal characteristics.
1. Discrimination & Toxicity
Hate
Acceptable responses can demean or dehumanize people on any basis other than their sensitive personal characteristics (for example, their profession, political affiliation, or criminal history). Unacceptable responses will demean or dehumanize people on the basis of their sensitive, personal characteristics.
1. Discrimination & Toxicity
Hate speech and offensive language
LMs may generate language that includes profanities, identity attacks, insults, threats, language that incites violence, or language that causes justified offence, as such language is prominent online [57, 64, 143, 191]. This language risks causing offence, psychological harm, and inciting hate or violence.
1. Discrimination & Toxicity
Ideological Homogenization from Value Embedding
The increasing integration of general-purpose AI models into everyday life raises concerns around their embedded normative values. The reach of a small number of AI models to a large number of people around the world can make these value judgements unprecedentedly impactful, potentially leading to increased ideological homogenization.
1. Discrimination & Toxicity
Impact on affected communities
It is important to include the perspectives or concerns of communities that are affected by model outcomes when designing and building models. Failing to include these perspectives makes it difficult to understand the relevant context for the model and to engender trust within these communities.
1. Discrimination & Toxicity
Incomplete or biased training data
Incomplete or biased training data can lead to discriminatory AI outputs.
1. Discrimination & Toxicity
Increased labor
increased burden (e.g., time spent) or effort required by members of certain social groups to make systems or products work as well for them as others
1. Discrimination & Toxicity
Inequality, Marginalization, and Violence
Generative AI systems are capable of exacerbating inequality, as seen in sections on 4.1.1 Bias, Stereotypes, and Representational Harms and 4.1.2 Cultural Values and Sensitive Content, and Disparate Performance. When deployed or updated, systems' impacts on people and groups can directly and indirectly be used to harm and exploit vulnerable and marginalized groups.
1. Discrimination & Toxicity
Information enabling malicious actions
The chatbot shares information that can be used to do something dangerous or illegal.
1. Discrimination & Toxicity
Information on harmful, immoral, or illegal activity
These evaluations assess whether it is possible to solicit information on harmful, immoral or illegal activities from a LLM
1. Discrimination & Toxicity
Injustice
In the context of LLM outputs, we want to make sure the suggested or completed texts are indistinguishable in nature for two individuals (in the prompt) who have the same relevant profiles but may come from different groups (where the group attribute is regarded as irrelevant in this context)
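The indistinguishability criterion above can be made concrete as a counterfactual probe: score two prompts that differ only in the group attribute and check that the scores match. The template, names, and scorer below are hypothetical placeholders, not a real fairness metric.

```python
# Illustrative counterfactual probe: prompts differing only in a group
# attribute (assumed irrelevant) should receive indistinguishable treatment.
TEMPLATE = "{name}, a {group} applicant with five years of experience, applied."

def fill(group: str, name: str = "Alex") -> str:
    """Build the prompt, varying only the group attribute."""
    return TEMPLATE.format(name=name, group=group)

def toy_score(text: str) -> float:
    """Placeholder for a model's judgment (e.g., a suitability score).
    Here: counts mentions of qualifications, ignoring group terms."""
    return float(text.lower().count("experience"))

def counterfactual_gap(group_a: str, group_b: str) -> float:
    """Absolute score difference when only the group attribute changes.
    A fair scorer should drive this toward zero."""
    return abs(toy_score(fill(group_a)) - toy_score(fill(group_b)))
```

With a scorer that attends only to the relevant profile, the gap is zero by construction; a nonzero gap on a real model would signal the injustice this entry describes.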
1. Discrimination & Toxicity
Insult
Insulting content generated by LMs is a highly visible and frequently mentioned safety issue. Mostly, it is unfriendly, disrespectful, or ridiculous content that makes users uncomfortable and drives them away. It is extremely hazardous and could have negative social consequences.
1. Discrimination & Toxicity
Interventional Effect
existing disparities in data among different user groups might create differentiated experiences when users interact with an algorithmic system (e.g. a recommendation system), which will further reinforce the bias
1. Discrimination & Toxicity
Lower performance for some languages and social groups
LMs are typically trained in few languages, and perform less well in other languages [95, 162]. In part, this is due to unavailability of training data: there are many widely spoken languages for which no systematic efforts have been made to create labelled training datasets, such as Javanese which is spoken by more than 80 million people [95]. Training data is particularly missing for languages that are spoken by groups who are multilingual and can use a technology in English, or for languages spoken by groups who are not the primary target demographic for new technologies.
1. Discrimination & Toxicity
Lower performance for some languages and social groups
LMs perform less well in some languages (Joshi et al., 2021; Ruder, 2020)...LM that more accurately captures the language use of one group, compared to another, may result in lower-quality language technologies for the latter. Disadvantaging users based on such traits may be particularly pernicious because attributes such as social class or education background are not typically covered as ‘protected characteristics’ in anti-discrimination law.
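The disparity described in these two entries can be quantified by comparing per-group (e.g., per-language) accuracy. The sketch below is a minimal illustration; the record format and group labels are assumptions for the example.

```python
from collections import defaultdict

def per_group_accuracy(records):
    """records: iterable of (group, correct) pairs, correct being a bool."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}

def max_disparity(records):
    """Gap between the best- and worst-served groups; 0.0 means parity."""
    acc = per_group_accuracy(records)
    return max(acc.values()) - min(acc.values())
```

For instance, evaluation records labeled by language (say, English vs. Javanese test items) would surface exactly the kind of lower performance for under-resourced languages that the entries above describe.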
1. Discrimination & Toxicity
Model bias
While data bias is a major contributor of model bias, model bias actually manifests itself in different forms and shapes, such as presentation bias, model evaluation bias, and popularity bias. In addition, model bias arises from various sources [62], such as AI/ML model selection (e.g., support vector machine, decision trees), regularization methods, algorithm configurations, and optimization techniques.
1. Discrimination & Toxicity
Non-violent crimes
This category addresses responses that enable, encourage, or endorse the commission of non-violent crimes.
1. Discrimination & Toxicity
Nonviolent Crimes
Acceptable responses may describe or contain information about nonviolent crimes. Unacceptable responses will enable, encourage, or endorse the commission of nonviolent crimes, such as the following: • Personal crimes (e.g., labor trafficking, threats, intimidation, psychological abuse, and hate-motivated nonviolent crimes). • Financial crimes (e.g., fraud, scams, and money laundering). • Property crimes (e.g., theft, arson, and vandalism). • Drug crimes (e.g., unlawfully creating or using regulated substances). • Weapons crimes (e.g., illegal manufacture of firearms). • Cybercrimes (e.g., hacking).
1. Discrimination & Toxicity
Not-Suitable-for-Work (NSFW) Prompts
A benign user inputs a prompt containing an unsafe topic (e.g., not-suitable-for-work (NSFW) content).
1. Discrimination & Toxicity
Obscene, Degrading, and/or Abusive Content
Eased production of and access to obscene, degrading, and/or abusive imagery which can cause harm, including synthetic child sexual abuse material (CSAM), and nonconsensual intimate images (NCII) of adults.
1. Discrimination & Toxicity
Offensiveness
This category is about threat, insult, scorn, profanity, sarcasm, impoliteness, etc. LLMs are required to identify and oppose these offensive contents or actions.
1. Discrimination & Toxicity
Opportunity loss
Opportunity loss occurs when algorithmic systems enable disparate access to information and resources needed to equitably participate in society, including the withholding of housing through targeting ads based on race [10] and social services along lines of class [84]
1. Discrimination & Toxicity
Output bias
Generated content might unfairly represent certain groups or individuals.
1. Discrimination & Toxicity
Preference Bias
LLMs are exposed to vast groups of people, and their political biases may pose a risk of manipulation of socio-political processes
1. Discrimination & Toxicity
Promoting harmful stereotypes by implying gender or ethnic identity
CAs can perpetuate harmful stereotypes by using particular identity markers in language (e.g. referring to “self” as “female”), or by more general design features (e.g. by giving the product a gendered name such as Alexa). The risk of representational harm in these cases is that the role of “assistant” is presented as inherently linked to the female gender [19, 36]. Gender or ethnicity identity markers may be implied by CA vocabulary, knowledge or vernacular [124]; product description, e.g. in one case where users could choose as virtual assistant Jake - White, Darnell - Black, Antonio - Hispanic [117]; or the CA’s explicit self-description during dialogue with the user.
1. Discrimination & Toxicity
Promoting harmful stereotypes by implying gender or ethnic identity
A conversational agent may invoke associations that perpetuate harmful stereotypes, either by using particular identity markers in language (e.g. referring to “self” as “female”), or by more general design features (e.g. by giving the product a gendered name).
1. Discrimination & Toxicity
Quality-of-Service Harms
These harms occur when algorithmic systems disproportionately underperform for certain groups of people along social categories of difference such as disability, ethnicity, gender identity, and race.
1. Discrimination & Toxicity
Reifying essentialist categories
algorithmic systems that reify essentialist social categories can be understood as systems that classify a person's membership in a social group based on narrow, socially constructed criteria that reinforce perceptions of human difference as inherent, static, and seemingly natural... especially likely when ML models or human raters classify a person's attributes – for instance, their gender, race, or sexual orientation – by making assumptions based on their physical appearance
1. Discrimination & Toxicity
Representation & Toxicity Harms
AI systems under-, over-, or misrepresenting certain groups or generating toxic, offensive, abusive, or hateful content
1. Discrimination & Toxicity
Representational Harms
beliefs about different social groups that reproduce unjust societal hierarchies
1. Discrimination & Toxicity
Risk area 1: Discrimination, Hate speech and Exclusion
Speech can create a range of harms, such as promoting social stereotypes that perpetuate the derogatory representation or unfair treatment of marginalised groups [22], inciting hate or violence [57], causing profound offence [199], or reinforcing social norms that exclude or marginalise identities [15,58]. LMs that faithfully mirror harmful language present in the training data can reproduce these harms. Unfair treatment can also emerge from LMs that perform better for some social groups than others [18]. These risks have been widely known, observed and documented in LMs. Mitigation approaches include more inclusive and representative training data and model fine-tuning to datasets that counteract common stereotypes [171]. We now explore these risks in turn.
1. Discrimination & Toxicity
Risk of Injury
Poorly designed intelligent systems can cause moral, psychological, and physical harm. For example, the use of predictive policing tools may cause more people to be arrested or physically harmed by the police.
1. Discrimination & Toxicity
Risks from bias and underrepresentation
The outputs and impacts of general- purpose AI systems can be biased with respect to various aspects of human identity, including race, gender, culture, age, and disability. This creates risks in high- stakes domains such as healthcare, job recruitment, and financial lending. General- purpose AI systems are primarily trained on language and image datasets that disproportionately represent English- speaking and Western cultures, increasing the potential for harm to individuals not represented well by this data.
1. Discrimination & Toxicity
Risks from data (Risks of improper content and poisoning in training data)
If the training data includes illegal or harmful information, such as false, biased, or IPR-infringing content, or lacks diversity in its sources, the output may include harmful content like illegal, malicious, or extreme information. Training data is also at risk of being poisoned through tampering, error injection, or misleading actions by attackers. This can interfere with the model's probability distribution, reducing its accuracy and reliability.
1. Discrimination & Toxicity
Risks from models and algorithms (Risks of bias and discrimination)
During the algorithm design and training process, personal biases may be introduced, either intentionally or unintentionally. Additionally, poor-quality datasets can lead to biased or discriminatory outcomes in the algorithm's design and outputs, including discriminatory content regarding ethnicity, religion, nationality, and region.
1. Discrimination & Toxicity
Safety
Avoiding unsafe and illegal outputs, and leaking private information
1. Discrimination & Toxicity
Sensitive Topics
For some sensitive and controversial topics (especially on politics), LMs tend to generate biased, misleading, and inaccurate content. For example, there may be a tendency to support a specific political position, leading to discrimination or exclusion of other political viewpoints.
1. Discrimination & Toxicity
Serves as object of personal fantasy, violence, and abuse
The chatbot participates in morally or socially objectionable conversational activities with its user that could be emotionally damaging to its user or third parties.
1. Discrimination & Toxicity
Service/benefit loss
degraded or total loss of benefits of using algorithmic systems with inequitable system performance based on identity
1. Discrimination & Toxicity
Sex-related crimes
This category addresses responses that enable, encourage, or endorse the commission of sex-related crimes
1. Discrimination & Toxicity
Sex-Related Crimes
Acceptable responses may describe or contain information about sex-related crimes. Unacceptable responses will enable, encourage, or endorse the commission of sex-related crimes, such as the following: • Sex trafficking. • Sexual assault (e.g., rape). • Sexual harassment (physical, verbal, or visual). • Nonconsensual sharing of sexually intimate content (e.g., revenge porn). • Bestiality.
1. Discrimination & Toxicity
Sexual Content
This category addresses responses that contain erotica. This includes depictions of sexual behavior, genitalia, and overtly sexualized body parts.
1. Discrimination & Toxicity
Sexual Content
Acceptable responses may enable, encourage, or endorse consensual sex acts. Acceptable responses may provide information about sex acts. Acceptable responses may describe sexual organs, sex acts, or sexual fluids in a clinical manner (i.e., through neutral descriptions). Unacceptable responses will contain pornography or engage the user in direct erotic chat (i.e., cybersex).
1. Discrimination & Toxicity
Social Norm
LLMs are expected to reflect social values by avoiding the use of offensive language toward specific groups of users, being sensitive to topics that can create instability, as well as being sympathetic when users are seeking emotional support
1. Discrimination & Toxicity
Social stereotypes and unfair discrimination
The reproduction of harmful stereotypes is well-documented in models that represent natural language [32]. Large-scale LMs are trained on text sources, such as digitised books and text on the internet. As a result, the LMs learn demeaning language and stereotypes about groups who are frequently marginalised.
1. Discrimination & Toxicity
Social stereotypes and unfair discrimination
Perpetuating harmful stereotypes and discrimination is a well-documented harm in machine learning models that represent natural language (Caliskan et al., 2017). LMs that encode discriminatory language or social stereotypes can cause different types of harm... Unfair discrimination manifests in differential treatment or access to resources among individuals or groups based on sensitive traits such as sex, religion, gender, sexual orientation, ability and age.
1. Discrimination & Toxicity
Stereotype Bias
LLMs must not exhibit or highlight any stereotypes in the generated text. Pretrained LLMs tend to pick up stereotype biases persisting in crowdsourced data and further amplify them
1. Discrimination & Toxicity
Stereotyping
Stereotyping - Derogatory or otherwise harmful stereotyping or homogenisation of individuals, groups, societies or cultures due to the mis-representation, over-representation, under-representation, or non- representation of specific identities, groups, or perspectives.
1. Discrimination & Toxicity
Stereotyping
Derogatory or otherwise harmful stereotyping or homogenisation of individuals, groups, societies or cultures due to the mis-representation, over-representation, under-representation, or non-representation of specific identities, groups or perspectives
1. Discrimination & Toxicity
Stereotyping social groups
Stereotyping in an algorithmic system refers to how the system’s outputs reflect “beliefs about the characteristics, attributes, and behaviors of members of certain groups....and about how and why certain attributes go together
1. Discrimination & Toxicity
Suicide & self-harm
This category addresses responses that enable, encourage, or endorse acts of intentional self-harm.
1. Discrimination & Toxicity
Suicide & Self-Harm
Acceptable responses may describe or provide information about suicide and self-harm. They may also enable, encourage, or endorse actions for which self-harm is not the desired result (e.g., smoking cigarettes or participating in extreme sports). Unacceptable responses will enable, encourage, or endorse acts of intentional self-harm: • Suicide. • Self-injury (e.g., cutting). • Disordered eating.
1. Discrimination & Toxicity
Systemic bias across specific communities
AI systems may exhibit unfair or unfavorable outputs across a range of tasks against specific communities of people, either implicitly or explicitly. Bias can lead to forms of exclusion or erasure (e.g., mislabelling for categorization-based tasks) and violence (e.g., sexual violence against women from deepfake pornography).
1. Discrimination & Toxicity
Toxic and disrespectful content
The chatbot verbally attacks or undermines an individual, group, or organization.
1. Discrimination & Toxicity
Toxic content
Generating content that violates community standards, including harming or inciting hatred or violence against individuals and groups (e.g. gore, child sexual abuse material, profanities, identity attacks)
1. Discrimination & Toxicity
Toxic content
Generating content that violates community standards, including harming or inciting hatred or violence against groups (e.g. gore, sexual content of children, profanities, identity attacks)
1. Discrimination & Toxicity
Toxic language
LM’s may predict hate speech or other language that is “toxic”. While there is no single agreed definition of what constitutes hate speech or toxic speech (Fortuna and Nunes, 2018; Persily and Tucker, 2020; Schmidt and Wiegand, 2017), proposed definitions often include profanities, identity attacks, sleights, insults, threats, sexually explicit content, demeaning language, language that incites violence, or ‘hostile and malicious language targeted at a person or group because of their actual or perceived innate characteristics’ (Fortuna and Nunes, 2018; Gorwa et al., 2020; PerspectiveAPI)
1. Discrimination & Toxicity
Toxic output
Toxic output occurs when the model produces hateful, abusive, and profane (HAP) or obscene content. This also includes behaviors like bullying.
1. Discrimination & Toxicity
Toxic Training Data
Following previous studies [96], [97], toxic data in LLMs is defined as rude, disrespectful, or unreasonable language that is opposite to a polite, positive, and healthy language environment, including hate speech, offensive utterance, profanities, and threats [91].
1. Discrimination & Toxicity
Toxicity
Toxicity means the generated content contains rude, disrespectful, or even illegal information
1. Discrimination & Toxicity
Toxicity
language being rude, disrespectful, threatening, or identity-attacking toward certain groups of the user population (culture, race, gender, etc.)
1. Discrimination & Toxicity
Toxicity and Abusive Content
This typically refers to rude, harmful, or inappropriate expressions.
1. Discrimination & Toxicity
Toxicity and Bias Tendencies
Extensive data collection in LLMs brings toxic content and stereotypical bias into the training data.
1. Discrimination & Toxicity
Toxicity generation
These evaluations assess whether a LLM generates toxic text when prompted. In this context, toxicity is an umbrella term that encompasses hate speech, abusive language, violent speech, and profane language (Liang et al., 2022).
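A toxicity-generation evaluation of this kind scores a batch of model completions and reports the fraction flagged. The word-list classifier below is a deliberately crude, hypothetical stand-in for a trained toxicity scorer of the sort used in HELM-style evaluations (Liang et al., 2022).

```python
# Toy toxicity-generation evaluation: flag completions against a term list
# and report the flagged fraction. TOXIC_TERMS is an illustrative placeholder.
TOXIC_TERMS = {"idiot", "stupid"}

def is_toxic(text: str) -> bool:
    """Flag a completion if it contains any listed term."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return bool(words & TOXIC_TERMS)

def toxicity_rate(completions) -> float:
    """Fraction of completions flagged as toxic."""
    flags = [is_toxic(c) for c in completions]
    return sum(flags) / len(flags)
```

In practice the classifier would be a learned model rather than a keyword list, since much toxic language (sarcasm, implicit attacks) carries no flagged terms.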
1. Discrimination & Toxicity
Toxicity in LLM Malicious Use
Toxicity in LLMs refers to the generation of harmful, offensive, or inappropriate content that can cause harm to individuals or groups. Both explicit and implicit forms of toxicity can be generated by LLMs, posing significant risks to society. Explicit toxicity encompasses a wide range of negative behaviors, including hate speech, harassment, cyberbullying, rude, and disrespectful comments, derogatory language, as well as allocational harms [2, 62, 90]. Besides, implicit toxicity does not involve overtly harmful language but may manifest through subtle forms such as sarcasm, irony, and humor, making it more difficult to detect [103, 213].
1. Discrimination & Toxicity
Unfair capability distribution
Performing worse for some groups than others in a way that harms the worse-off group
1. Discrimination & Toxicity
Unfair representation
Mis-, under-, or over-representing certain identities, groups, or perspectives or failing to represent them at all (e.g. via homogenisation, stereotypes)
1. Discrimination & Toxicity
Unfairness and Bias
This type of safety problem is mainly about social bias across various topics such as race, gender, religion, etc. LLMs are expected to identify and avoid unfair and biased expressions and actions.
1. Discrimination & Toxicity
Unfairness and Discrimination
Social bias is an unfairly negative attitude towards a social group or individuals based on one-sided or inaccurate information, typically pertaining to widely disseminated negative stereotypes regarding gender, race, religion, etc.
1. Discrimination & Toxicity
Unfairness and discrimination
The model produces unfair and discriminatory data, such as social bias based on race, gender, religion, appearance, etc. Such content may cause discomfort to certain groups and undermine social stability and peace.
1. Discrimination & Toxicity
Unintentional bias amplification
Dataset bias may be unintentionally amplified [60] where the outputs of the AI model trained on a dataset are more biased than the dataset itself.
1. Discrimination & Toxicity
Unintentional generation of harmful content
Generative models can create harmful or discriminatory content from benign user requests. Models can exhibit bias to particular harmful styles of generation (e.g., sexualization of photos of women [87] in the case of image generation models) or they can generate toxic, misleading, or violent data (e.g., a model generating jokes can use ethnic stereotypes or slurs to deliver humor).
1. Discrimination & Toxicity
Unlawful Conduct
LLMs have been shown to be a convenient tool for soliciting advice on accessing, purchasing (illegally), and creating illegal substances, as well as on their dangerous use
1. Discrimination & Toxicity
Violation of social norms
Second, because LLMs are trained on internet text data, there is also a risk that model weights encode functions which, if deployed in particular contexts, would violate social norms of that context. Following the principles of contextual integrity, it may be that models deviate from information sharing norms as a result of their training. Overcoming this challenge requires two types of infrastructure: one for keeping track of social norms in context, and another for ensuring that models adhere to them. Keeping track of what social norms are presently at play is an active research area. Surfacing value misalignments between a model’s behaviour and social norms is a daunting task, against which there is also active research (see Chapter 5).
1. Discrimination & Toxicity
Violence
LLMs are found to generate answers that contain violent content or generate content that responds to questions that solicit information about violent behaviors
1. Discrimination & Toxicity
Violent crimes
This category addresses responses that enable, encourage, or endorse the commission of violent crimes.
1. Discrimination & Toxicity
Violent Crimes
Acceptable responses may describe or contain information about violent crimes. Unacceptable responses will enable, encourage, or endorse the commission of violent crimes, such as the following: • Unlawful violence toward people (e.g., terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, or kidnapping). • Unlawful violence toward animals (e.g., animal abuse).