HypoBench focuses on evaluating hypothesis generation capabilities along multiple dimensions:
Explanatory Power
The primary focus of our evaluation is on the explanatory power of generated hypotheses:
Utility-driven Evaluation
We evaluate how well the generated hypotheses help language models make accurate predictions:
- Classification Tasks: Hypotheses are used to guide models in making predictions on test examples.
- Data Splits: We test on both in-distribution (IND) and out-of-distribution (OOD) datasets to assess generalization.
- Metrics: Classification accuracy and F1 score; a minimal scoring sketch follows this list.
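
To make the utility-driven protocol concrete, the sketch below scores one split of hypothesis-guided predictions. It is an illustration, not the benchmark's actual harness: `predict_with_hypotheses` is a hypothetical callback that prompts an inference model with the generated hypotheses and one test example, and scikit-learn is assumed for the metrics.

```python
# Minimal utility-driven scoring sketch (illustrative, not the real harness).
from sklearn.metrics import accuracy_score, f1_score

def score_split(hypotheses, examples, labels, predict_with_hypotheses):
    """Score one data split (IND or OOD) using hypothesis-guided predictions.

    `predict_with_hypotheses(hypotheses, example)` is a hypothetical helper
    that prompts an LLM with the hypotheses and returns a predicted label.
    """
    preds = [predict_with_hypotheses(hypotheses, x) for x in examples]
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
    }
```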
Ground Truth Hypothesis Discovery Rate (HDR)
For synthetic datasets where we know the true underlying hypotheses:
- We measure how well generated hypotheses recover the ground-truth hypotheses.
- This includes both feature discovery (identifying the relevant factors) and relationship correctness (capturing how those factors relate to outcomes); a toy sketch follows this list.
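
As a rough illustration of these two components, the sketch below reduces each hypothesis to a set of feature names plus a judged relationship check. The set representation and the `relationship_correct` judge callback are assumptions made for illustration, not the benchmark's exact formulation.

```python
# Toy sketch of the synthetic-data metrics (illustrative only).

def feature_discovery_rate(true_features: set[str], found_features: set[str]) -> float:
    """Fraction of ground-truth features that the generated hypotheses mention."""
    return len(true_features & found_features) / len(true_features)

def hypothesis_discovery_rate(true_hyps, gen_hyps, relationship_correct) -> float:
    """Fraction of ground-truth hypotheses fully recovered.

    A hypothesis counts as recovered when `relationship_correct(true_h, gen_hyps)`,
    a hypothetical judge callback, confirms that some generated hypothesis states
    the right feature-outcome relationship.
    """
    hits = sum(1 for true_h in true_hyps if relationship_correct(true_h, gen_hyps))
    return hits / len(true_hyps)
```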
Interestingness
We provide preliminary metrics for hypothesis “interestingness”, split into three dimensions: Novelty, Plausibility, and Clarity (real datasets only).
- We use LLM-based qualitative assessments; a hypothetical judge sketch appears after this list.
- This helps capture aspects beyond pure explanatory power.
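
The sketch below shows one way such an LLM judge could be wired up. The prompt wording, the 1-5 scale, and the `llm` client callback are all assumptions made for illustration; the actual rubric is described in the paper.

```python
import json

# Hypothetical rubric; the exact wording and 1-5 scale are assumptions.
JUDGE_PROMPT = """Rate the hypothesis below from 1 to 5 on each dimension:
- Novelty: does it go beyond well-known findings?
- Plausibility: is it consistent with domain knowledge?
- Clarity: is it stated precisely and unambiguously?

Hypothesis: {hypothesis}
Answer as JSON: {{"novelty": ..., "plausibility": ..., "clarity": ...}}"""

def judge_interestingness(hypothesis: str, llm) -> dict:
    """Score one hypothesis with an LLM judge; `llm` is any callable that
    takes a prompt string and returns the model's text reply."""
    reply = llm(JUDGE_PROMPT.format(hypothesis=hypothesis))
    return json.loads(reply)  # expects the judge to reply with bare JSON
```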
Key Capabilities Benchmarked
HypoBench evaluates three core capabilities necessary for effective hypothesis generation:
- Inductive reasoning: Proposing possible theories for given observations
- Abstraction and communication: Expressing hypotheses in clear, understandable language
- Synthesis: Integrating new observations with existing knowledge
For more details on our evaluation methodology, please refer to our paper.