Active Example Selection for In-Context Learning

December 5, 2022

Introduction

Large language models such as GPT-3 (Brown et al. 2020) demonstrate an emergent capability, known as in-context learning: they can perform a task by simply observing information (such as instructions and demonstration examples) in their prompt. Despite its remarkable success on many tasks, in-context learning performance depends heavily on a good prompt (Mishra et al. 2022).
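To make this concrete, here is a minimal sketch of what a few-shot prompt might look like for sentiment classification; the template and demonstration examples are illustrative and not taken from our experiments.

```python
# A minimal, illustrative sketch of a few-shot prompt for sentiment
# classification; the template and the demonstration examples are hypothetical.
demonstrations = [
    ("the acting is superb and the pacing never drags", "positive"),
    ("a dull, lifeless script with nothing to say", "negative"),
]

def build_prompt(demonstrations, test_input):
    """Concatenate demonstration examples and the test input into one prompt."""
    blocks = [f"Review: {x}\nSentiment: {y}" for x, y in demonstrations]
    blocks.append(f"Review: {test_input}\nSentiment:")
    return "\n\n".join(blocks)

print(build_prompt(demonstrations, "an unexpectedly moving film"))
```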

In this work, we approach prompting from the perspective of example selection. That is, we seek to answer: how do we select good examples for in-context learning? Unlike prior work, which retrieves examples assuming access to individual test instances (Liu et al. 2022; Rubin, Herzig, and Berant 2022), we aim to select good examples for the entire test distribution.

Sensitivity to the Choice of Examples

Prior work has identified the sensitivity of in-context learning to changes in the prompt (Zhao et al. 2021; Lu et al. 2022). We revisit this sensitivity, especially when sampling random demonstration examples, to motivate the need for example selection.

The following table reports the performance of various GPT-2 and GPT-3¹ models on 4 tasks after applying calibration (Zhao et al. 2021): we randomly sample 5 sets of 4-shot demonstration examples and report the average performance and the standard deviation (in parentheses).

| Model | AGNews (\(\sigma\)) | Amazon (\(\sigma\)) | SST-2 (\(\sigma\)) | TREC (\(\sigma\)) |
|---|---|---|---|---|
| GPT-2 (345M) | 55.2 (12.0) | 76.3 (14.0) | 66.2 (14.7) | 40.8 (5.4) |
| GPT-3 (Ada) | 64.0 (4.0) | 90.0 (1.2) | 73.8 (9.7) | 22.1 (5.3) |
| GPT-3 (Babbage) | 78.1 (6.1) | 92.7 (1.6) | 90.8 (1.1) | 36.0 (4.0) |

Although the models do reasonably well on Amazon and SST-2, the performance of in-context learning is volatile: GPT-2 shows double-digit standard deviations across datasets even with calibration. While the variance shrinks for larger models on the sentiment classification tasks (Amazon and SST-2), it remains large for the other tasks. To address this sensitivity, we propose a framework for explicitly learning policies that select good examples.
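For reference, the sampling protocol behind the table above can be sketched as follows; `calibrated_accuracy` is a hypothetical scoring function, not part of our released code.

```python
import random
import statistics

def random_prompt_variance(pool, calibrated_accuracy, n_trials=5, n_shots=4, seed=0):
    """Sample random n_shots-example demonstration sets and summarize accuracy.

    `calibrated_accuracy` is a hypothetical callable that scores a list of
    demonstration examples on the test set after calibration (Zhao et al. 2021).
    """
    rng = random.Random(seed)
    scores = [calibrated_accuracy(rng.sample(pool, n_shots)) for _ in range(n_trials)]
    return statistics.mean(scores), statistics.stdev(scores)
```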

Learning to Select Examples

The Framework

As mentioned previously, we consider the problem of selecting demonstration examples from an unlabeled pool. The challenge with selecting a sequence of demonstration examples is that there are too many candidate sequences to consider: the number of potential sequences grows exponentially with the size of the unlabeled pool and is intractable to enumerate. One solution to this challenge is to treat example selection as a sequential decision problem, selecting examples one by one to construct the prompt.

In the language of a Markov Decision Process, a state in the example selection environment is the sequence of examples already selected as part of the prompt, and an action selects the next example from the pool.
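As a rough sketch (not our exact implementation), the environment can be written as follows, where `score_prompt` is a hypothetical function measuring how good a sequence of demonstration examples is (e.g., the accuracy of the resulting prompt on a validation set).

```python
# A rough sketch of the example selection MDP, not our exact implementation.
# `score_prompt` is a hypothetical stand-in for the objective that measures
# how good a sequence of demonstration examples is.

class ExampleSelectionEnv:
    def __init__(self, pool, k, score_prompt):
        self.pool = pool              # candidate demonstration examples
        self.k = k                    # number of examples to select
        self.score_prompt = score_prompt
        self.selected = []            # state: examples selected so far

    def reset(self):
        self.selected = []
        return tuple(self.selected)

    def step(self, action):
        """Append the chosen pool example to the prompt under construction."""
        self.selected.append(self.pool[action])
        done = len(self.selected) == self.k
        # Sparse reward: the objective is only observed once the prompt is complete.
        reward = self.score_prompt(self.selected) if done else 0.0
        return tuple(self.selected), reward, done
```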

Reward Function

To train an example selection policy, we need a reward signal that encourages the policy to select good examples. Suppose there is an objective function \(f : \mathcal{X}^\star \to \mathbb{R}\) that measures how good a sequence of examples is (in our implementation, \(f\) measures the performance of the prompt on a validation set). Then, to select a sequence of \(k\) examples, the trivial reward function only returns \(f\) once the prompt is complete:

\[ r(x_1, x_2, \dots , x_i) = \begin{cases} f(x_1, x_2, \dots, x_i) &\text{if $i = k$} \\ 0 &\text{otherwise.} \end{cases} \]

While this reward directly maximizes the objective \(f\), it does not provide any reward signal to the examples selected at time steps \(i < k\). One way to get around this issue is reward shaping (Ng, Harada, and Russell 1999), a modification of the reward function that preserves optimal policies. The following reward function \(r'\) is a shaped version of \(r\)²: \[ r'(x_1, x_2, \dots , x_i) = \begin{cases} f(x_1) - f(\varnothing) &\text{if $i = 1$} \\ f(x_1, x_2, \dots, x_i) - f(x_1, x_2, \dots, x_{i - 1}) &\text{if $i > 1$,} \end{cases} \] where \(f(\varnothing)\) is the performance of an empty prompt. The shaped reward \(r'\) has an intuitive interpretation: it is the marginal utility (i.e., the gain in the objective \(f\)) of the newly added example.
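As a small sketch (assuming \(\gamma = 1\), as noted in the footnote), the shaped reward can be computed from the same objective; `score_prompt` again stands in for \(f\).

```python
def shaped_reward(selected, score_prompt):
    """Shaped reward r': the marginal utility of the most recently added example.

    `selected` is the sequence of examples chosen so far; `score_prompt` is a
    hypothetical stand-in for the objective f, with score_prompt([]) playing
    the role of f(∅), the performance of an empty prompt.
    """
    if len(selected) == 1:
        return score_prompt(selected) - score_prompt([])
    return score_prompt(selected) - score_prompt(selected[:-1])
```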

Experiments

We experiment with 4-shot example selection on GPT-2 (345M) and consider three baselines for comparison: random sampling, max-entropy, and reordering.

Results

During training, we use a training pool from which the policy learns to select examples and a reward set on which we compute rewards for training the policy. Since the policy receives direct reward signals (validation performance) on the training pool, performance in this setting (same task, seen examples) serves mostly as a sanity check.

We evaluate the generalization of learned example selection policies in two settings:

- same task, new examples: during evaluation, the policy selects from new examples drawn from the same distribution as the training pool.
- new task, new examples: the policy is jointly optimized on three of the four tasks and evaluated by selecting examples for the held-out task.

| Method | Average | AGNews (95% CI) | Amazon (95% CI) | SST-2 (95% CI) | TREC (95% CI) |
|---|---|---|---|---|---|
| random | 59.6 | 55.2 (10.5) | 76.3 (12.3) | 66.2 (12.9) | 40.8 (4.7) |
| max-entropy | 59.3 | 58.8 (11.3) | 74.8 (5.1) | 65.7 (10.7) | 37.8 (6.7) |
| reordering | 63.5 | 63.3 (6.8) | 89.8 (3.8) | 67.9 (11.1) | 33.0 (4.2) |
| our method (same task, seen examples) | 71.4 | 70.8 (7.8) | 90.4 (1.9) | 81.0 (3.5) | 43.3 (2.0) |
| our method (same task, new examples) | 69.0 | 65.5 (7.4) | 88.5 (4.2) | 76.7 (7.5) | 45.4 (5.0) |
| our method (new task, new examples) | 65.4 | 66.7 (5.7) | 89.9 (1.6) | 61.9 (7.7) | 43.3 (4.4) |

On seen examples, learned example selection outperforms the random sampling baseline by 11.8%, indicating that the example selection problem is learnable. Perhaps more interestingly, the learned policy generalizes to both the same task, new examples and new task, new examples settings, outperforming the best baseline (reordering) by 5.5% and 1.9% respectively.

Concluding Discussion

While our main experiments are done with GPT-2 (345M), we experimented with transferring both the learned policy and the selected examples to GPT-3 models. The results are mixed: we observe small gains for GPT-3 Ada, but limited or negative results for larger models (GPT-3 Babbage and Curie). This observation might point to emergence: larger models may have different preferences for demonstration examples.

While example selection can be framed as a learning problem, it may not always be the sensible option under practical considerations. Training an example selection policy comes with significant computational overhead, and this cost is hard to justify when a simple best-of-\(k\) sampling strategy can achieve strong performance with a moderately sized validation set.
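As a point of comparison, the best-of-\(k\) strategy referred to above can be sketched in a few lines; `validation_accuracy` is a hypothetical scoring function, not part of our released code.

```python
import random

def best_of_k(pool, validation_accuracy, k=10, n_shots=4, seed=0):
    """Sample k random n_shots-example demonstration sets; keep the best one.

    `validation_accuracy` is a hypothetical callable that evaluates a set of
    demonstration examples on a moderately sized validation set.
    """
    rng = random.Random(seed)
    candidates = [rng.sample(pool, n_shots) for _ in range(k)]
    return max(candidates, key=validation_accuracy)
```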

References

Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In Advances in Neural Information Processing Systems, 33:1877–1901. Curran Associates, Inc.
Liu, Jiachang, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022. “What Makes Good In-Context Examples for GPT-3?” In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, 100–114. Dublin, Ireland and Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.deelio-1.10.
Lu, Yao, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity.” In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 8086–98. Dublin, Ireland: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.556.
Mishra, Swaroop, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. 2022. “Reframing Instructional Prompts to GPTk’s Language.” In Findings of the Association for Computational Linguistics: ACL 2022, 589–612. Dublin, Ireland: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-acl.50.
Ng, Andrew Y., Daishi Harada, and Stuart J. Russell. 1999. “Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping.” In Proceedings of the Sixteenth International Conference on Machine Learning, 278–87. ICML ’99. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Rubin, Ohad, Jonathan Herzig, and Jonathan Berant. 2022. “Learning To Retrieve Prompts for In-Context Learning.” In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2655–71. Seattle, United States: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.191.
Zhao, Zihao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. “Calibrate Before Use: Improving Few-shot Performance of Language Models.” In Proceedings of the 38th International Conference on Machine Learning, 12697–706. PMLR.

  1. We use text-ada-001 and text-babbage-001 in our experiments.↩︎

  2. Requires \(\gamma = 1\).↩︎
