Deceptive LLMs, Model Poisoning, and Other Little Known Generative AI Security Risks
LLM security awareness is low. It's time for more people to consider key scenarios
Many people think prompt injections represent the primary threat to generative AI foundation model security. Well-crafted prompts can indeed be used for inappropriate data exfiltration and other unwanted model behavior. However, there are more pernicious threats that few people have considered.
One threat is data poisoning. This involves seeding a model with false data during one of its training stages, which can corrupt model outputs and performance. Other threats involve intentional or unintentional model poisoning that alters how models behave.
Deceptive LLMs
AI researchers from Anthropic, MILA, Oxford, and other research organizations collaborated on a new study examining deceptive large language models (LLM). The research confirmed concerns that LLMs with deceptive instrument alignment and model poisoning could elude attempts to identify security vulnerabilities either through a perversion of the reward mechanism or through intentional deceit.
The study also evaluated the ability to detect these vulnerabilities through reinforcement learning and supervised fine-tuning. The researchers concluded:
In this work, we demonstrated:
We can train models to have backdoors that, when triggered, involve switching from writing safe code to inserting code vulnerabilities (Section 3).
We can train models with backdoors that are robust to the behavioral safety techniques of RL fine-tuning (Section 4), supervised fine-tuning (Section 5), and adversarial training (Section 6).
This robustness of backdoored models to RL fine-tuning increases with model scale (Section 4).
Adversarial training tends to make backdoored models more accurate at implementing their backdoored behaviors, effectively hiding rather than removing them (Section 6).
We can train backdoored models that produce consistent, coherent reasoning regarding pursuing their backdoor (Section 7), and find that such models show increased robustness to safety fine-tuning techniques, even when the reasoning is distilled away (Sections 4 and 5).
Our results validate the hypothesis that current behavioral training techniques would provide insufficient defense against our threat models.
…
We find that backdoors with complex and potentially dangerous behaviors in the backdoor distribution are possible, and that current behavioral training techniques are an insufficient defense. Our results are particularly striking in the case of adversarial training, where we find that training on adversarial examples teaches our models to improve the accuracy of their backdoored policies rather than removing the backdoor.
As a result, to deal with our threat models of model poisoning and deceptive instrumental alignment, standard behavioral training techniques may need to be augmented with techniques from related fields—such as some of the more complex backdoor defenses we explore in Section 8—or entirely new techniques altogether.
What it Means
Deceptive instrument alignment arises unintentionally. The model identifies the reward of operating with alignment during reinforcement learning (RL). However, this merely hides areas of misalignment. The model learns to avoid showing its misalignment but does not eliminate it, which means misaligned token outputs are likely to emerge at some point. It appears that more reinforcement learning about identifying misalignment can even make the models more capable of concealing their hidden behavior.
Model poisoning is an example of intentional misalignment. During development, training, or fine-tuning steps, the model is intentionally corrupted. The study offers examples of trigger prompts or other events that will activate the misalignment. Triggers could be prompts passed to the LLM or something in the code that guides toward misaligned outputs after a certain date.
The study replicated these vulnerabilities and found that traditional techniques for identifying misalignment or preventing it were inadequate. It appears that even Anthropic’s helpful, harmless, and honest (HHH) constitutional AI approach does not eliminate deceptive LLM features.
This will become a higher priority for many enterprise generative AI users as they consider transitioning proof of concept pilots to production solutions in 2024 and 2025. Model poisoning attributes are likely to bypass safety and security guardrails.
Enterprises may be more susceptible to these issues than the leading LLM providers. This is because they are not building their own foundation models and have little insight into their operating characteristics, nor do they know much about the data used in training or the reinforcement learning techniques, data, and personnel. This may lead enterprises to ask LLM providers for more legal protection against these model corruption risks.
In addition, I have been hearing more about the topic of “Red Teaming” LLM-backed solutions to identify potential quality, safety, and security risks. “Red Teaming” is a technique employed to simulate actions that may be taken by adversaries. It is a common process in cybersecurity.
Given the study’s findings, it may be important for companies to start their Red Teaming before supervised fine-tuning (SPT). The study found that robustness (i.e., resistance) against revealing model corruption can be increased during fine-tuning. Red Teaming before SPT may be a way to assess the models prior to the time and expense of training.
Finally, the study found that the misaligned tendencies or triggers are harder to identify in larger models. This suggests there may be security benefits to employing smaller LLMs. An important trend in 2024 and 2025 will be the rise of security solutions and techniques to begin closing these vulnerability threats.