Evaluation Methods

HypoBench evaluates hypothesis generation capabilities along multiple dimensions:

Explanatory Power

The primary focus of our evaluation is on the explanatory power of generated hypotheses:

Utility-driven Evaluation

We evaluate how well the generated hypotheses help language models make accurate predictions (a scoring sketch follows the list below):

  • Classification Tasks: Hypotheses are used to guide models in making predictions on test examples.
  • Data Splits: We test on both in-distribution (IND) and out-of-distribution (OOD) data to assess generalization.
  • Metrics: Classification accuracy and F1 score.
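
As a rough illustration of utility-driven scoring, the sketch below assumes a simple data format and stubs the LLM inference step with a toy keyword rule so it runs offline; the function names and hypothesis schema are ours, not HypoBench's actual API.

```python
from sklearn.metrics import accuracy_score, f1_score

def predict_with_hypotheses(hypotheses, text):
    # Stand-in for the real inference step, where an LLM is prompted with
    # the generated hypotheses plus a test example and asked for a label.
    # A toy keyword rule keeps the sketch self-contained and runnable.
    return int(any(cue in text.lower() for h in hypotheses for cue in h["cues"]))

def utility_eval(hypotheses, examples):
    """Score a set of hypotheses by the predictions they support on a split."""
    y_true = [ex["label"] for ex in examples]
    y_pred = [predict_with_hypotheses(hypotheses, ex["text"]) for ex in examples]
    return {"accuracy": accuracy_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred, average="macro")}

# Toy usage: the same hypotheses would be scored separately on IND and OOD splits.
hypotheses = [{"statement": "Deceptive reviews overuse superlatives.",
               "cues": ["amazing", "best ever"]}]
ind_split = [{"text": "Amazing stay, best ever!", "label": 1},
             {"text": "The room was clean and the staff polite.", "label": 0}]
print(utility_eval(hypotheses, ind_split))
```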

Ground Truth Hypothesis Discovery Rate (HDR)

For synthetic datasets where we know the true underlying hypotheses:

  • We measure how well generated hypotheses recover the ground-truth hypotheses.
  • This includes both feature discovery (identifying relevant factors) and relationship correctness (understanding how these factors relate to outcomes); a combined scoring sketch follows the list.
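
The sketch below shows one plausible way to combine the two components on a synthetic task. The dictionary format (feature name mapped to the direction of its relationship with the outcome) and the multiplicative combination are illustrative assumptions; the exact HDR formula is given in the paper.

```python
def hypothesis_discovery_rate(generated, ground_truth):
    """Illustrative HDR-style score for a synthetic task.

    `generated` and `ground_truth` map feature names to the direction of
    their relationship with the outcome, e.g. {"word_count": "+"}.
    """
    if not ground_truth:
        return 0.0
    discovered = set(generated) & set(ground_truth)
    # Feature discovery: share of ground-truth factors that were found.
    fdr = len(discovered) / len(ground_truth)
    # Relationship correctness: among discovered factors, share whose
    # stated relationship to the outcome matches the ground truth.
    correct = sum(generated[f] == ground_truth[f] for f in discovered)
    rel = correct / len(discovered) if discovered else 0.0
    return fdr * rel  # assumed multiplicative combination, for illustration

print(hypothesis_discovery_rate(
    generated={"word_count": "+", "sentiment": "-"},
    ground_truth={"word_count": "+", "readability": "-"},
))  # half the ground-truth features found, direction correct -> 0.5
```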

Interestingness

We also provide preliminary metrics for hypothesis “interestingness”, split into three dimensions: Novelty, Plausibility, and Clarity (real-world datasets only).

  • We use LLM-based qualitative assessments (a judge-prompt sketch follows the list).
  • This helps capture aspects beyond pure explanatory power.
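
A minimal sketch of what such an LLM-as-judge assessment could look like. The prompt wording, the 1-to-5 scale, and the JSON schema are illustrative assumptions rather than the benchmark's actual judge; `llm` is any callable mapping a prompt string to a completion string.

```python
import json

JUDGE_PROMPT = """You are judging a generated hypothesis for the task: {task}.
Hypothesis: {hypothesis}
Rate it from 1 (poor) to 5 (excellent) on each dimension and answer with JSON:
{{"novelty": <int>, "plausibility": <int>, "clarity": <int>}}"""

def judge_interestingness(llm, task, hypothesis):
    # `llm` could be a thin wrapper around any chat-completion API.
    reply = llm(JUDGE_PROMPT.format(task=task, hypothesis=hypothesis))
    scores = json.loads(reply)
    return {dim: scores[dim] for dim in ("novelty", "plausibility", "clarity")}

# Toy usage with a canned judge so the sketch runs offline.
fake_llm = lambda prompt: '{"novelty": 3, "plausibility": 4, "clarity": 5}'
print(judge_interestingness(fake_llm, "hotel review deception",
                            "Deceptive reviews overuse superlatives."))
```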

Key Capabilities Benchmarked

HypoBench evaluates three core capabilities necessary for effective hypothesis generation:

  1. Inductive reasoning: Proposing possible theories for given observations
  2. Abstraction and communication: Expressing hypotheses in clear, understandable language
  3. Synthesis: Integrating new observations with existing knowledge

For more details on our evaluation methodology, please refer to our paper.