3. Misinformation

Noisy Training Data

Another important source of hallucinations is noise in the training data, which introduces errors into the knowledge stored in model parameters [111]–[113]. Training data inherently harbors some misinformation, and the problem grows with scale: it is difficult to eliminate all noise from massive pre-training corpora.

Source: MIT AI Risk Repository (mit40)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

1 - Pre-deployment

Risk ID

mit40

Domain lineage

3. Misinformation

74 mapped risks

3.1 > False or misleading information

Mitigation strategy

1. Rigorous Data Cleansing and Validation: Establish and enforce comprehensive pre-training data governance with automated pipelines for noise removal: identify and remove duplicate or irrelevant observations, correct structural inconsistencies (e.g., typos and formatting errors), and apply statistical methods (such as Z-scores or the interquartile range) for outlier detection and treatment to ensure data accuracy and validity.

2. Employ Robust Training and Regularization Techniques: Integrate regularization methods, such as L2 regularization or dropout, into the model architecture to prevent the model from overfitting to specific noisy patterns in the training data. Additionally, use data augmentation, in which controlled noise is intentionally added, to improve the model's robustness and generalization against imperfect real-world inputs.

3. Utilize Advanced Noise-Aware Learning Methodologies: Adopt training strategies explicitly designed to mitigate the effect of noise, such as dynamic noisy-label detection and filtering (e.g., based on loss behavior during training), or loss-correction techniques that use an estimated noise transition matrix to adjust the loss function and minimize the influence of mislabeled samples.
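The outlier-detection step in strategy 1 can be sketched with the interquartile-range rule (Tukey's fences). A minimal pure-Python illustration; the function name and the sample data are ours, not from the repository:

```python
import statistics

def iqr_filter(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# A single extreme value falls outside the fences and is removed:
print(iqr_filter([1, 2, 3, 4, 5, 100]))  # -> [1, 2, 3, 4, 5]
```

In a real cleansing pipeline this rule would be applied per numeric feature, and flagged observations would typically be reviewed or winsorized rather than silently dropped.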
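The dynamic noisy-label filtering in strategy 3 is commonly realized as "small-loss" selection: samples whose loss stays low during training are more likely to be correctly labeled, so only the lowest-loss fraction is kept for the update. A minimal sketch; the keep fraction and example losses are illustrative assumptions:

```python
def small_loss_select(losses, keep_frac):
    """Return indices of the keep_frac fraction of samples with the
    smallest loss; high-loss samples are treated as likely mislabeled."""
    n_keep = max(1, int(len(losses) * keep_frac))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:n_keep])

# Samples 1 and 3 have suspiciously high loss and are filtered out:
print(small_loss_select([0.2, 3.1, 0.4, 2.8, 0.1, 0.3], keep_frac=2/3))
# -> [0, 2, 4, 5]
```

In practice keep_frac is tied to an estimated noise rate and is often decreased gradually over the first epochs, since early in training the loss separates clean from mislabeled samples less reliably.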