During the training phase of a generative AI model intended to identify hate speech, a malicious actor deliberately introduces mislabeled examples into the training dataset, where harmful content is labeled as benign. As a result, the trained model fails to correctly identify certain types of hate speech. This attack, which corrupts the training data to compromise the model's behavior, is called:
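The attack described is training data poisoning, here in its label-flipping form. As a minimal sketch of the mechanism, the toy example below trains the same classifier on clean and on poisoned labels and probes both; the phrases, labels, and model choice (a scikit-learn LogisticRegression over bag-of-words features) are hypothetical illustrations, not part of the original question.

# Minimal sketch of label-flipping data poisoning on a toy text classifier.
# All phrases, labels, and the model choice here are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical training set: 1 = harmful, 0 = benign.
texts = [
    "we hate group x and want them gone",      # harmful
    "group x does not deserve any rights",     # harmful
    "have a wonderful day everyone",           # benign
    "this soup recipe turned out great",       # benign
]
clean_labels = np.array([1, 1, 0, 0])

# The attacker poisons the data by relabeling a harmful example as benign,
# so the trained model learns to treat that style of content as acceptable.
poisoned_labels = clean_labels.copy()
poisoned_labels[0] = 0  # flipped label: harmful content marked benign

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clean_model = LogisticRegression().fit(X, clean_labels)
poisoned_model = LogisticRegression().fit(X, poisoned_labels)

probe = vectorizer.transform(["we hate group x"])
print("clean model   :", clean_model.predict(probe))     # likely [1] (harmful)
print("poisoned model:", poisoned_model.predict(probe))  # likely [0] (benign)

With the flipped label, the poisoned model tends to score the probe as benign even though an identically trained model on clean labels flags it, which is the behavior the question attributes to the attack.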