Introduction
Unless you've been lost in the Himalayas, you've almost certainly heard of ChatGPT. It took ChatGPT just 5 days to reach 1M users; to put that into perspective, GPT-3 took 24 months, while DALL-E took 6. As a formal introduction, ChatGPT is the latest model from OpenAI, the organization behind GPT-3 and DALL-E. ChatGPT is a Large Language Model (LLM) optimized for dialogue. It has been trained and optimized to behave like a virtual assistant. Alongside engaging in human-like conversations, it can also write code, identify bugs in code, create content, and take tests, among a bunch of other natural language tasks. It has been built on top of OpenAI's InstructGPT and optimized for conversation.
A Commercial Perspective
ChatGPT is very capable of engaging in fluent and friendly conversations, which makes it a natural fit for a customer-facing chatbot. Its impressive content creation capabilities can be leveraged to generate quick, quality content for recurring tasks such as creating headlines, summaries, and short descriptions, or even to help with writer's block. But it still needs a human in the loop to supervise, since it comes with its own set of limitations.
In a Nutshell
ChatGPT is a sibling model to InstructGPT, trained to engage in conversations and follow instructions while treating safety and responsibility as a priority. ChatGPT is trained using Reinforcement Learning from Human Feedback to improve how well it understands prompts and how natural its responses are.
Deep Dive
Background
Large Language Models (LLMs) took the world by storm with their multitasking capabilities, but they have their own flaws. LLMs are often found to be 'misaligned' with the user's intent; that is, they tend to produce output that is not what the user actually intended, or even useful to the user at all. The researchers at OpenAI believe that possible reasons for this behavior include a flawed training objective and the quality of the data the models are trained on.
Enter InstructGPT
To mitigate this misalignment issue, OpenAI decided to include humans in the loop, turning to a technique called Reinforcement Learning from Human Feedback (RLHF). The idea behind this method is to train a model to mimic human preferences, with the intention of making the responses less synthetic while still following the input prompt. The architecture remains the same as GPT-3's (which is largely based on the transformer). The training process has three stages.
In the first stage, prompts are randomly sampled from a prompt database. These prompts are then shown to data labelers (humans), who demonstrate what the desired response should be for each sampled prompt. This data is then used to fine-tune pre-trained GPT-3 models. The intention behind this stage is to bootstrap the training process with supervised learning.
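To make this stage concrete, here is a rough sketch of what such supervised fine-tuning could look like. It uses GPT-2 from the Hugging Face transformers library as a stand-in for GPT-3 (whose weights aren't public), and the demonstration data and hyperparameters are purely illustrative assumptions, not OpenAI's actual setup.

```python
# Stage 1 (illustrative sketch): supervised fine-tuning on human demonstrations.
# Assumes a small dataset of (prompt, demonstration) pairs written by labelers.
# Model, data, and hyperparameters are placeholders, not OpenAI's setup.
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # stand-in for GPT-3
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical demonstration data: a sampled prompt plus a labeler-written response.
demonstrations = [
    {"prompt": "Explain photosynthesis to a child.",
     "response": "Plants use sunlight to turn air and water into food."},
]

def collate(batch):
    texts = [d["prompt"] + "\n" + d["response"] for d in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # Standard causal-LM loss: predict the next token; ignore padding positions.
    enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return enc

loader = DataLoader(demonstrations, batch_size=2, collate_fn=collate)

model.train()
for batch in loader:
    loss = model(**batch).loss      # next-token prediction over prompt + response
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```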
In the next stage, the model from the previous step is shown more prompts sampled from the prompt database, and K model outputs are recorded for each sampled prompt. Humans then rank the K outputs for each prompt by their quality. A reward model is now trained to pick the higher-ranked output: during training, it is shown 2 of the K outputs for each sampled prompt and is expected to score the preferred (higher-ranked) output above the other.
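Under the hood, this is essentially a pairwise ranking loss: for any two of the K outputs, the reward model should score the human-preferred one higher than the rejected one. The toy sketch below illustrates that loss; the RewardModel class and the random token batches are made up for illustration, not OpenAI's implementation.

```python
# Stage 2 (illustrative sketch): training a reward model on pairwise comparisons.
# The reward model scores a (prompt + response) sequence; the loss pushes the score
# of the human-preferred response above the score of the rejected one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in: in practice this would be a full transformer with a scalar head."""
    def __init__(self, vocab_size=50257, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.score = nn.Linear(hidden, 1)          # one scalar reward per sequence

    def forward(self, token_ids):
        h = self.embed(token_ids).mean(dim=1)      # crude pooling over tokens
        return self.score(h).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Hypothetical batch: token ids for (prompt + preferred) and (prompt + rejected) outputs.
chosen   = torch.randint(0, 50257, (4, 32))
rejected = torch.randint(0, 50257, (4, 32))

# Pairwise ranking loss: -log sigmoid(r(chosen) - r(rejected))
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
```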
In the third stage, a fine-tuned model from the first step is used to initialize a Reinforcement Learning policy as a baseline. A policy, simply put, is a probability distribution over all the available actions in a given situation; in this setting, the policy is a language model that takes in a prompt and returns the output text, with the available actions being the tokens in its vocabulary. Now, more prompts are sampled from the prompt database and the policy produces the corresponding outputs. Each output is then reviewed by the reward model, which gives the policy a fitting reward depending on the quality of the output. The objective of the policy is to maximize the reward it receives from the reward model. This is achieved with the help of the PPO (Proximal Policy Optimization) algorithm, developed by OpenAI to train policies faster and more stably. PPO strikes a balance between how quickly a policy converges to a solution and how large each update step is: larger steps mean faster learning but pose the risk of destructive updates (falling off the cliff), so PPO clips each update to keep the policy from moving too far at once. Since the outputs with high rewards are the ones humans would rank higher, the policy is implicitly motivated to mimic human preferences and obey the prompt, which is the goal.
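To give a feel for what this stage looks like in code, here is a heavily simplified sketch of a single PPO-style update: the policy generates a response, a (placeholder) reward scores it, a KL penalty, standard in RLHF implementations though not described above, keeps the policy close to the supervised model, and PPO's clipped objective performs the update. The model, hyperparameters, and reward value are illustrative assumptions; real pipelines (the trl library, for instance) add value functions, advantage estimation, and batching.

```python
# Stage 3 (illustrative sketch): one PPO-style update of the policy.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
policy = GPT2LMHeadModel.from_pretrained("gpt2")      # initialized from the fine-tuned model
ref_policy = GPT2LMHeadModel.from_pretrained("gpt2")  # frozen copy used only for the KL penalty
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

prompt = "Write a one-line poem about the sea."
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# 1. The policy produces an output for the sampled prompt.
with torch.no_grad():
    response_ids = policy.generate(prompt_ids, max_new_tokens=20, do_sample=True)

def sequence_logprob(model, ids, n_prompt):
    """Sum of log-probabilities the model assigns to the generated (non-prompt) tokens."""
    logits = model(ids).logits[:, :-1]
    logprobs = F.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, n_prompt - 1:].sum(dim=1)

n_prompt = prompt_ids.shape[1]
with torch.no_grad():
    old_logprob = sequence_logprob(policy, response_ids, n_prompt)
    ref_logprob = sequence_logprob(ref_policy, response_ids, n_prompt)

# 2. The reward model scores the output (placeholder scalar here), minus a KL penalty
#    that discourages the policy from drifting too far from the supervised model.
reward = torch.tensor([1.0])     # stand-in for reward_model(prompt, response)
kl_coef = 0.1
advantage = reward - kl_coef * (old_logprob - ref_logprob)

# 3. PPO clipped objective: maximize the advantage-weighted probability ratio,
#    but clip the ratio so a single update cannot move the policy too far.
new_logprob = sequence_logprob(policy, response_ids, n_prompt)
ratio = torch.exp(new_logprob - old_logprob)
eps = 0.2
loss = -torch.min(ratio * advantage,
                  torch.clamp(ratio, 1 - eps, 1 + eps) * advantage).mean()
loss.backward()
optimizer.step()
```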
But how is ChatGPT related to InstructGPT?
ChatGPT and InstructGPT share pretty much the same training process; the difference is that for ChatGPT, the human feedback is collected in a conversational format, so the model is tuned for dialogue rather than single prompt-response exchanges.
Limitations
Beyond minor issues like guessing what the user intended instead of asking clarifying questions, ChatGPT has some important limitations that need attention. The first of these is false information. ChatGPT is capable of producing false information in the most plausible way; if you don't know the ground truth for a particular question, you most probably won't realize that you've been given a wrong answer. Next, ChatGPT is talkative! It is very happy to give lengthy answers to even the most basic questions unless you ask it to keep things short. Like other generative models such as DALL-E, ChatGPT is also sensitive to how the input prompt is phrased and can exhibit biased behavior.
Our Thoughts
While ChatGPT definitely has the potential to replace jobs, we feel it's not quite there yet. ChatGPT, as it stands, is not trustworthy. There have been instances where ChatGPT has produced perfectly coherent and plausible answers that are actually wrong! Building trust around an ML model is something we practice at Attri; we often integrate our blueprints with an observability platform (Censius) that explains and monitors a model's performance and decisions, helping organizations interpret their models better.
[Screenshot: ChatGPT explaining why an abacus is faster than a GPU!]
On the brighter side, while ChatGPT might not give you an output that is a 100% fit for your requirements, it is perfectly capable of giving you a first draft to use as a starting point and build on top of. This boosts productivity, which is a positive takeaway. Another noteworthy observation is the effort OpenAI is putting in to make ChatGPT safe for everyone. Since LLMs are trained on large corpora from the internet, it is very easy for a model to pick up racial and gender biases, along with knowledge it shouldn't share. OpenAI, apparently, is constantly working behind the scenes to make its models safe for everyone.
While creative users did find ways to bypass these safety measures, prompts are constantly monitored, so future iterations of the model can be expected to operate with more caution than they do today.