Day 23 - Temperature, Top-K, and Top-P
When you write a prompt, you are attempting to set up the LLM to predict the right sequence of tokens. Prompt engineering is the process of designing high-quality prompts that guide LLMs to produce accurate outputs. This process involves tinkering to find the best prompt, optimizing prompt length, and evaluating a prompt's writing style and structure in relation to the task.
When prompt engineering, you will start by choosing a model. Prompts might need to be optimized for your specific model, regardless of whether you use Gemini language models in Vertex AI, GPT, Claude, or an open-source model like Gemma or LLaMA.
Besides the prompt, you will also need to tinker with the various configurations of an LLM.
LLM output configuration
Effective prompt engineering requires setting these configurations optimally for your task.
Output length
Output length is the number of tokens to generate in a response. Generating more tokens requires more computation from the LLM, leading to higher energy consumption, potentially slower response times, and higher costs.
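As a rough sketch of where this limit usually lives (the field name varies by provider, so `max_output_tokens` here is an assumed name rather than any specific SDK's parameter):

```python
# Assumed field name for illustration only; check your provider's SDK for the real one.
generation_config = {
    "max_output_tokens": 256,  # cap on generated tokens: lower = cheaper and faster, but replies may be cut off
}
```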
Sampling controls
LLMs do not formally predict a single token. Rather, LLMs predict probabilities for what the next token could be, with each token in the LLM's vocabulary getting a probability. Temperature, top-K, and top-P are the most common configuration settings that determine how predicted token probabilities are processed to choose a single output token.
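To make that concrete, here is a minimal sketch (plain NumPy, with a made-up toy vocabulary and made-up scores) of how a model's raw scores over its vocabulary become a probability distribution that a sampler draws from:

```python
import numpy as np

# Toy vocabulary and the model's raw scores (logits) for the next token.
vocab = ["the", "a", "cat", "dog", "pizza"]
logits = np.array([2.1, 1.3, 0.4, 0.2, -1.5])

# Softmax turns the logits into a probability for every token in the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The sampler then draws one token from that distribution.
rng = np.random.default_rng(0)
next_token = rng.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```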
Temperature
Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that expect a more deterministic response, while higher temperatures can lead to more diverse or unexpected results. Temperatures close to the maximum tend to create more random output, and as the temperature gets higher and higher, all tokens become equally likely to be the next predicted token.
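A minimal sketch of how temperature reshapes the distribution, using the standard formulation of dividing the logits by the temperature before the softmax (toy logits assumed):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)  # guard against division by zero
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [2.1, 1.3, 0.4, 0.2, -1.5]
print(softmax_with_temperature(logits, 0.1))   # nearly all mass on the top token (near-deterministic)
print(softmax_with_temperature(logits, 1.0))   # the model's original distribution
print(softmax_with_temperature(logits, 10.0))  # close to uniform: every token about equally likely
```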
Top-K and top-P
Top-K and top-P (the latter also known as nucleus sampling) are two sampling settings used in LLMs to restrict the predicted next token to tokens with the top predicted probabilities. Like temperature, these sampling settings control the randomness and diversity of generated text; a short code sketch of both filters follows the list below.
- Top-K sampling selects the top K most likely tokens. The higher the top-K, the more creative and varied the model's output; the lower the top-K, the more restrictive and factual the model's output. A top-K of 1 is equivalent to greedy decoding.
- Top-P sampling selects the top tokens whose cumulative probability does not exceed a certain value (P). The values range from 0 (greedy decoding) to 1 (all tokens in the LLM's vocabulary).
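Here is a minimal sketch of both filters on a toy distribution (NumPy; real implementations differ in tie-breaking and renormalization details, so treat this as illustrative rather than any library's actual code):

```python
import numpy as np

def top_k_filter(probs, k):
    # Keep only the k most probable tokens, then renormalize.
    keep = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    # Keep the smallest set of most-probable tokens whose cumulative probability reaches p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])
print(top_k_filter(probs, 2))    # only the two most likely tokens survive
print(top_p_filter(probs, 0.9))  # tokens are kept until their cumulative probability reaches 0.9
```

Real decoders typically apply these filters to the temperature-scaled distribution and then renormalize the surviving probabilities before sampling.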
Putting it all together
At extreme settings of one sampling configuration value, that setting either cancels out the other configuration settings or becomes irrelevant. The sketch after this list walks through these cases.
- If you set Temperature to 0, top-K and top-P become irrelevant--the most probable token becomes the next token predicted.
- If you set Temperature extremely high (above 1--generally into the 10s), temperature becomes irrelevant and whatever tokens make it through the top-K and/or top-P criteria are then randomly sampled to choose a next predicted token.
- If you set top-K to 1, temperature and top-P become irrelevant. Only one token passes the top-K criteria, and that token is the next predicted token.
- If you set top-K extremely high, like to the size of the LLM's vocabulary, any token with a nonzero probability of being the next token will meet the top-K criteria and none are selected out.
- If you set top-P to 0 (or a very small value), most LLM sampling implementations will then only consider the most probable token to meet the top-P criteria, making temperature and top-K irrelevant.
- If you set top-P to 1, any token with a nonzero probability of being the next token will meet the top-P criteria, and none are selected out.
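These interactions fall out naturally if you picture the sampling steps chained together, as in the toy pipeline below. The ordering (temperature, then top-K, then top-P) and the helper itself are assumptions for illustration, not any specific vendor's implementation, but the extreme-value behavior is the same:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Toy pipeline: temperature scaling -> top-K filter -> top-P filter -> random draw."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)

    # Temperature 0 collapses to greedy decoding, so top-K and top-P never come into play.
    if temperature == 0:
        return int(np.argmax(logits))

    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-K of 1 is also greedy; a top-K as large as the vocabulary filters nothing out.
    if top_k is not None:
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    # Top-P of 1 keeps every token with nonzero probability; a tiny top-P keeps only the most likely one.
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order]) / probs.sum()
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered

    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

print(sample_next_token([2.1, 1.3, 0.4], temperature=0, top_k=3, top_p=0.9))    # always index 0: greedy
print(sample_next_token([2.1, 1.3, 0.4], temperature=0.8, top_k=1, top_p=1.0))  # always index 0: only one survivor
```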
As a general starting point, a temperature of 0.2, top-P of 0.95, and top-K of 30 will give you relatively coherent results that can be creative but not excessively so.
- If you want especially creative results, try starting with temperature=0.9, top-P=0.99, and top-K=40.
- If you want less creative results, try starting with temperature=0.1, top-P=0.9, and top-K=20.
- If your task always has a single correct answer (e.g., answering a math problem), start with a temperature of 0 (these presets are collected in the short config sketch below).
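As a sketch of how those starting points might look as request settings (the key names `temperature`, `top_p`, and `top_k` are assumptions modeled on common SDKs, not any specific API):

```python
# Assumed key names; adapt them to whatever your model's SDK expects.
BALANCED      = {"temperature": 0.2, "top_p": 0.95, "top_k": 30}  # coherent with mild creativity
CREATIVE      = {"temperature": 0.9, "top_p": 0.99, "top_k": 40}  # more diverse, less predictable
CONSERVATIVE  = {"temperature": 0.1, "top_p": 0.90, "top_k": 20}  # less creative, more focused
SINGLE_ANSWER = {"temperature": 0.0}                              # tasks with one correct answer, e.g. math
```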
NOTE
With more freedom (higher temperature, top-K, top-P, and output token count), the LLM might generate text that is less relevant.
WARNING
Be aware of the repetition loop bug, which can be caused by incorrectly configured temperature. At low temperatures, the model becomes overly deterministic, sticking rigidly to the highest-probability path, which can lead to a loop if that path revisits previously generated text. At high temperatures, the model's output becomes excessively random, increasing the probability that a randomly chosen token will, by chance, lead back to a prior state and create a loop.