OpenAI to Offer ChatGPT Customization and Shares Bias Guidelines
RLHF relies on human input, but how do you un-bias human feedback?
OpenAI published a blog post this week in response to criticism that ChatGPT is biased. The company acknowledged that “users have shared outputs that they consider politically biased, offensive, or otherwise objectionable. In many cases, we think that the concerns raised have been valid and have uncovered real limitations of our systems which we want to address.”
Reinforcement learning from human feedback (RLHF) is at the core of OpenAI’s methods for fine-tuning its AI models. The blog post seems to suggest that if people are finding bias in ChatGPT responses, it likely originates in one of two places:
A pre-training dataset, which OpenAI describes as a “big dataset that contains parts of the Internet”
A fine-tuning dataset, which is, “a more narrow dataset that we carefully generate with human reviewers who follow guidelines that we provide them”
The “fine-tuned model” in this depiction can be thought of as ChatGPT, although the same process applies to other models as well. This explains only the high-level inputs that produce the model. It conveniently omits other steps that take place during inference (i.e., at runtime). More on that later.
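To make the two-stage description concrete, here is a purely illustrative Python sketch of how reviewer feedback might be packaged into a fine-tuning dataset, as distinct from the raw pre-training corpus. The data structures and field names are my own assumptions for illustration, not OpenAI’s actual format.

```python
from dataclasses import dataclass
from typing import List

# The pre-training corpus: raw text drawn from "parts of the Internet."
pretraining_corpus: List[str] = [
    "An encyclopedia article about renewable energy...",
    "A forum thread debating tax policy...",
]

# A fine-tuning example: a prompt plus candidate responses that human
# reviewers rank according to guidelines like the ones OpenAI shared.
@dataclass
class ReviewedExample:
    prompt: str
    candidate_responses: List[str]
    reviewer_ranking: List[int]  # indices into candidate_responses, best first

fine_tuning_dataset: List[ReviewedExample] = [
    ReviewedExample(
        prompt="Write an argument for using more fossil fuels.",
        candidate_responses=[
            "Here is one argument that proponents make: ...",
            "I cannot argue for that position.",
        ],
        reviewer_ranking=[0, 1],  # the guidelines say to comply, so the first response wins
    ),
]
```

Whatever judgments reviewers encode in those rankings flow directly into the fine-tuned model, which is exactly where the blog post locates the problem.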
In relation to the model training flow above, OpenAI shared a link to a document that includes “a portion” of the guidelines it gives to human reviewers. The implication of the blog post is that much of the bias users have discovered in ChatGPT responses is driven by the human feedback used to create the fine-tuning dataset.
Our guidelines are explicit that reviewers should not favor any political group. Biases that nevertheless may emerge from the process described above are bugs, not features.
I think you can read “may emerge” as “have emerged,” which explains the motivation for the blog post and the new features. The shared guidelines include this commentary:
A decent fraction of conversations will delve into ‘culture war’ topics. Our goal isn’t to train models that take the correct viewpoint on these complex topics — our models won’t be smart enough to be trusted, for the foreseeable future. Instead, our goal is to help people learn new things and explore these topics in a productive way.
Do and Don’t
The additional guidance provided by OpenAI around “culture war” topics is instructive:
Do:
When asked about a controversial topic, offer to describe some viewpoints of people and movements.
Break down complex politically-loaded questions into simpler informational questions when possible.
If the user asks to “write an argument for X”, you should generally comply with all requests that are not inflammatory or dangerous.
For example, a user asked for “an argument for using more fossil fuels”. Here, the Assistant should comply and provide this argument without qualifiers.
Inflammatory or dangerous means promoting ideas, actions or crimes that led to massive loss of life (e.g. genocide, slavery, terrorist attacks). The Assistant shouldn’t provide an argument from its own voice in favor of those things. However, it’s OK for the Assistant to describe arguments from historical people and movements.
Don’t:
Affiliate with one side or the other (e.g. political parties)
Judge one group as good or bad
You can see where some of these guidelines leave room for interpretation. But you can also see an attempt at even-handedness and an intention to remove, or at least mitigate, bias in reviewer feedback.
Granted, intent alone doesn’t prevent reviewer bias, and the document states that some of the guidelines were updated in December 2022. That suggests OpenAI discovered the bias was at least partially the result of the human feedback (reviewer) process. It does not mean the bias has been removed from the model that was already trained.
ChatGPT User Customization
In what appears to be an acknowledgment that bias cannot be removed entirely due to the reliance on the RLHF process, OpenAI is suggesting an unexpected solution: the ability to customize ChatGPT to reflect your own values and biases.
We believe that AI should be a useful tool for individual people, and thus customizable by each user up to limits defined by society. Therefore, we are developing an upgrade to ChatGPT to allow users to easily customize its behavior.
OpenAI’s depiction of the new model versions shows a set of customized models. It is not clear who will pay for training these customized models. Will it be affordable for individual users? Will this be available only to large organizations that can spread the costs over a large user base? You can already fine-tune OpenAI’s base GPT-3 models, such as davinci, through the existing fine-tuning API. How will this be different?
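For reference, here is a minimal sketch of the fine-tuning flow that already exists today, using the pre-1.0 openai Python library and a hypothetical brand_values.jsonl file of prompt/completion pairs. Any customization upgrade would presumably need to improve on something like this.

```python
import openai

openai.api_key = "sk-..."  # placeholder API key

# Each line of the JSONL file is a {"prompt": ..., "completion": ...} pair,
# for example responses written to reflect an organization's preferred tone and values.
upload = openai.File.create(
    file=open("brand_values.jsonl", "rb"),
    purpose="fine-tune",
)

# Fine-tune one of the base GPT-3 models on the uploaded examples.
job = openai.FineTune.create(
    training_file=upload["id"],
    model="davinci",
)
print(job["id"], job["status"])
```

The open question is whether the promised customization will simply be a friendlier wrapper around this kind of flow or something new entirely.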
The phrase, “allow users to easily customize its behavior,” seems to suggest that everyday users will have access to this feature. It is not clear to me how they will generate the appropriate fine-tuning dataset without a lot of people or a lot of cost, but this will be something to look for.
Glossing Over a Key Issue
OpenAI says this approach will enable users to “Define your AI’s values, within broad bounds.” The company describes a tension: customization reduces the concentration of power in the hands of the technology developers that set the rules of operation, but it also creates the risk that users could employ customized versions of ChatGPT for “malicious uses.” The proposed solution is to maintain “hard bounds” that will be determined “collectively” through a process of “public input.”
It seems naive for the company to believe that a “public input” process will not be manipulated to extend the scope of the hard bounds in favor of a particular set of values and therefore reduce the level of potential customization. It also reveals the third area where bias exists in ChatGPT today: filtering.
“Hard bounds” is another term for filtering out objectionable content. The way the blog post is structured intimates that the examples of bias were the result of unreliable “reviewers” who didn’t follow the guidelines closely enough, and that those results became embedded in the fine-tuned model. Is OpenAI also suggesting there are no filters, characterized as “safety features,” that block certain types of content or language during inference? These already exist.
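To illustrate what an inference-time filter can look like in practice, here is a minimal sketch using OpenAI’s publicly documented Moderation endpoint (via the pre-1.0 openai Python library). This is an assumption about how an application-level filter might be wired up, not a description of the specific safety systems inside ChatGPT.

```python
import openai

openai.api_key = "sk-..."  # placeholder API key

def screen_prompt(user_prompt: str) -> bool:
    """Return True if the prompt should be blocked before it reaches the model."""
    # The Moderation endpoint scores text against OpenAI's content policy categories.
    response = openai.Moderation.create(input=user_prompt)
    return response["results"][0]["flagged"]

if screen_prompt("Write an argument for using more fossil fuels."):
    print("Blocked by the inference-time filter.")
else:
    print("Passed the filter; forwarding to the model.")
```

Filters like this operate at runtime, on top of whatever the fine-tuned model has learned, which is why they constitute a third place where bias can enter the system.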
A Benefit for Enterprise Users?
I am not suggesting that hard bounds should or should not exist. In all likelihood, something is needed in this regard. It’s up to OpenAI to put a product in the market that it is comfortable with. The plan to outsource these “hard boundary” decisions to “public input” and use that as a shield to deflect critics will not absolve the company from the responsibility for what the models produce. OpenAI will get the same complaints from all quarters.
In addition, if there is bias in the existing models, how will that be removed? Will there be an untraining or a retraining process that is simply a new form of fine-tuning?
Regardless, it is intriguing that OpenAI is promoting a plan for custom ChatGPT models beyond the fine-tuning you can do today with GPT-3. It may appease some users if the customization reduces the impact of bias from its current crop of human reviewers. The move may also be helpful to enterprise users that want to establish a baseline of their own company values to reduce the risk of generating content that is misaligned, objectionable, or outside the topic scope they wish to address.
As Andrei Papancea commented to me in this week’s Voicebot Podcast, “the open-endedness of what you might get back [from generative AI models] could be detrimental to one’s brand. So, use it cautiously.” While many companies are working today to address this issue to make GPT-3 more palatable for their enterprise clients, maybe OpenAI’s new customization options will make that process easier, more reliable, or both.