Large language models, such as GPT-3 (Brown et al. 2020), demonstrate an emergent capability, known as in-context learning, to perform a task by simply observing information (such as instructions and demonstration examples) in their prompt. Despite its remarkable success on many tasks, in-context learning performance depends heavily on a good prompt (Mishra et al. 2022).
In this work, we approach prompting from the perspective of example selection. That is, we seek to answer: how do we select good examples for in-context learning? Unlike prior work that retrieves examples assuming access to individual test instances (Liu et al. 2022; Rubin, Herzig, and Berant 2022), we aim to select good examples for the entire test distribution.
Prior work has identified the sensitivity of in-context learning to changes in the prompt (Zhao et al. 2021; Lu et al. 2022). We revisit this sensitivity, particularly when demonstration examples are sampled at random, to motivate the need for example selection.
The table below reports the performance of various GPT-2 and GPT-3 models on 4 tasks after applying calibration (Zhao et al. 2021): we randomly sample 5 sets of 4-shot demonstration examples and report the average performance with the standard deviation in parentheses.
Model | AGNews (\(\sigma\)) | Amazon (\(\sigma\)) | SST-2 (\(\sigma\)) | TREC (\(\sigma\)) |
---|---|---|---|---|
GPT-2 (345M) | 55.2 (12.0) | 76.3 (14.0) | 66.2 (14.7) | 40.8 (5.4) |
GPT-3 (Ada) | 64.0 (4.0) | 90.0 (1.2) | 73.8 (9.7) | 22.1 (5.3) |
GPT-3 (Babbage) | 78.1 (6.1) | 92.7 (1.6) | 90.8 (1.1) | 36.0 (4.0) |
While the models achieve good performance on Amazon and SST-2, in-context learning remains volatile: GPT-2 shows double-digit standard deviations across datasets even with calibration. Although the variance shrinks for larger models on the sentiment classification tasks (Amazon and SST-2), it remains large on the other tasks. To address this sensitivity, we propose a framework for explicitly learning policies to select good examples.
As mentioned previously, we consider the problem of selecting demonstration examples from an unlabeled pool. The challenge is that there are far too many candidate sequences to consider: the number of potential sequences grows combinatorially with the size of the unlabeled pool, making exhaustive enumeration intractable. One solution is to cast example selection as a sequential decision problem, selecting examples one by one to construct the prompt.
In the language of a Markov Decision Process, a state in the example selection environment is the sequence of examples already selected as part of the prompt, and an action is a candidate example to select next.
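For concreteness, a minimal sketch of such an environment might look as follows; the class and method names are illustrative rather than part of our implementation, and the state is simply the tuple of pool indices selected so far.

```python
from dataclasses import dataclass, field

@dataclass
class ExampleSelectionMDP:
    """Illustrative example-selection environment (names are hypothetical).

    The state is the sequence of pool indices already selected for the
    prompt; an action is the index of the next example to append.
    """
    pool: list                      # candidate demonstration examples
    k: int = 4                      # number of shots in the prompt
    state: tuple = field(default_factory=tuple)

    def reset(self):
        self.state = ()
        return self.state

    def actions(self):
        # any pool example not already in the prompt can be selected next
        return [i for i in range(len(self.pool)) if i not in self.state]

    def step(self, action: int):
        self.state = self.state + (action,)
        done = len(self.state) == self.k   # episode ends after k selections
        return self.state, done
```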
To train an example selection policy, we need a signal that rewards the model for selecting good examples. Suppose there is an objective function \(f : \mathcal{X}^\star \to \mathbb{R}\) that measures how good a sequence of examples is (in our implementation, \(f\) measures the performance of the prompt on a validation set). Then, to select a sequence of \(k\) examples, the simplest reward function only pays out \(f\) once the prompt is complete:
\[ r(x_1, x_2, \dots , x_i) = \begin{cases} f(x_1, x_2, \dots, x_i) &\text{if $i = k$} \\ 0 &\text{otherwise.} \end{cases} \]
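In code, this sparse reward is a one-liner; the sketch below assumes \(f\) accepts a (possibly partial) sequence of examples and returns a scalar such as validation accuracy.

```python
def sparse_reward(f, prompt, k=4):
    """Reward r: pay out the objective only when the prompt is complete."""
    return f(prompt) if len(prompt) == k else 0.0
```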
While this reward directly maximizes the objective \(f\), it provides no reward signal to the examples selected at time steps \(i < k\). One way to get around this issue is reward shaping (Ng, Harada, and Russell 1999), a modification of the reward function that preserves optimal policies. The following reward function \(r'\) is a shaped version of \(r\): \[ r'(x_1, x_2, \dots , x_i) = \begin{cases} f(x_1) - f(\varnothing) &\text{if $i = 1$} \\ f(x_1, x_2, \dots, x_i) - f(x_1, x_2, \dots, x_{i - 1}) &\text{if $i > 1$,} \end{cases} \] where \(f(\varnothing)\) is the performance of an empty prompt. The shaped reward \(r'\) has an intuitive interpretation: it is the marginal utility (i.e., the gain in the objective \(f\)) of the newly added example.
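The shaped reward is equally simple to compute; the sketch below again assumes an objective \(f\) that also accepts the empty sequence (returning the performance of an empty prompt).

```python
def shaped_reward(f, prompt):
    """Shaped reward r' for the most recently added example.

    `prompt` is the sequence x_1, ..., x_i selected so far (i >= 1);
    when i = 1, `prompt[:-1]` is empty and f scores the empty prompt.
    """
    return f(prompt) - f(prompt[:-1])   # marginal utility of x_i
```

Note that summing \(r'\) over \(i = 1, \dots, k\) telescopes to \(f(x_1, \dots, x_k) - f(\varnothing)\), so maximizing the shaped return is equivalent to maximizing the original objective up to a constant.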
We experiment with 4-shot example selection on GPT-2 (345M) and consider three baselines for comparison.
During training, we use a training pool from which the policy learns to select examples, and a reward set on which we compute the reward for training the policy. Since the policy receives a direct reward signal (validation performance) on the training pool, performance in this setting (same task, seen examples) serves mostly as a sanity check.
We evaluate the generalization of learned example selection policies in two settings:

- same task, new examples: during evaluation, the policy picks from new examples drawn from the same distribution as training.
- new task, new examples: the policy is jointly trained on three of the four tasks and evaluated by selecting examples for the held-out task.
Method | Average | AGNews (95% CI) | Amazon (95% CI) | SST-2 (95% CI) | TREC (95% CI) |
---|---|---|---|---|---|
random | 59.6 | 55.2 (10.5) | 76.3 (12.3) | 66.2 (12.9) | 40.8 (4.7) |
max-entropy | 59.3 | 58.8 (11.3) | 74.8 (5.1) | 65.7 (10.7) | 37.8 (6.7) |
reordering | 63.5 | 63.3 (6.8) | 89.8 (3.8) | 67.9 (11.1) | 33.0 (4.2) |
our method (same task, seen examples) | 71.4 | 70.8 (7.8) | 90.4 (1.9) | 81.0 (3.5) | 43.3 (2.0) |
our method (same task, new examples) | 69.0 | 65.5 (7.4) | 88.5 (4.2) | 76.7 (7.5) | 45.4 (5.0) |
our method (new task, new examples) | 65.4 | 66.7 (5.7) | 89.9 (1.6) | 61.9 (7.7) | 43.3 (4.4) |
On seen examples, example selection outperforms the random sampling baseline by 11.8%, indicating that the example selection problem is learnable. Perhaps more interestingly, the learned example selection policy generalizes in both the same task, new examples and the new task, new examples settings, outperforming the best baseline (reordering) by 5.5% and 1.9% respectively.
While our main experiments are done with GPT-2 (345M), we also experimented with transferring both the learned policy and the selected examples to GPT-3 models. The results are mixed: we observe small gains for GPT-3 Ada, but limited or negative results for the larger models (GPT-3 Babbage and Curie). This observation might point to emergence: larger models may have different preferences for demonstration examples.
While example selection can be framed as a learning problem, it may not always be the sensible option under practical considerations. Training example selection policies comes with significant computational overhead, and this cost is hard to justify when the simple best-of-\(k\) sampling strategy can achieve strong performance with a moderately sized validation set.
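For reference, the best-of-\(k\) baseline amounts to a few lines; the sketch below assumes a validation objective `f` that scores a candidate demonstration set, and the function name and defaults are illustrative.

```python
import random

def best_of_k_prompt(pool, f, k=5, shots=4, seed=0):
    """Sample k random demonstration sets and keep the one that scores
    best under the validation objective f (e.g., validation accuracy)."""
    rng = random.Random(seed)
    candidates = [rng.sample(pool, shots) for _ in range(k)]
    return max(candidates, key=f)
```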