What is Hypothesis Generation?
Hypothesis generation is the process of proposing natural-language theories or explanations for observed phenomena. It is a crucial step in both scientific discovery and everyday reasoning. For example:
- In science: Inferring the heliocentric model from observations of planets and moons
- In daily life: Proposing reasons why someone was not admitted to college
Real World Datasets
HypoBench includes the following real-world datasets that span various domains:
Deception Detection
- Task Description: Distinguish genuine from fake hotel reviews based on subtle linguistic cues.
- IND (in-distribution) Dataset: 1,600 hotel reviews (800 genuine and 800 fake).
- OOD (out-of-distribution) Dataset: 640 hotel reviews collected from different source websites and cities.
AI-Generated Content (AIGC) Detection
- Task Description: Identify whether a story is human-written or AI-generated given a writing prompt.
- IND Dataset: 800 writing prompts with corresponding stories.
- OOD Dataset: 800 stories generated by alternative models.
Persuasive Argument Prediction
- Task Description: Given a pair of arguments, predict which one is more persuasive.
- IND Dataset: 750 pairs of arguments with persuasiveness labels.
- OOD Dataset: 500 pairs from different original sources.
Mental Stress Detection
- Task Description: Detect mental stress signals from Reddit posts across different communities.
- IND Dataset: 1,000 Reddit post segments with stress labels.
- OOD Dataset: 500 posts from different subreddits.
News Headline Engagements
- Task Description: Given a pair of headlines for the same news article, predict which one will get more clicks.
- IND Dataset: 700 headline pairs with engagement data.
- OOD Dataset: 453 headline pairs from different sources.
Retweets Prediction
- Task Description: Given a pair of tweets, predict which one will be retweeted more.
- IND Dataset: 1,000 tweet pairs with retweet counts.
- OOD Dataset: 500 tweet pairs from different domains.
Paper Citations
- Task Description: Classify whether an academic paper will receive a high or low number of citations.
- IND Dataset: 1,182 academic papers with citation data.
- OOD Dataset: 1,104 papers from different research fields.
Synthetic Datasets
HypoBench includes carefully controlled synthetic datasets at different complexity levels:
Presidential Election
- Task Description: Given a person’s tweet, predict which political party they will vote for in the 2024 election.
- Variants: 78 different configurations
- Size: 178,750 samples
Personality Prediction
- Task Description: Determine personal preferences of users based on their tweets’ content, sentiment, and language patterns.
- Variants: 76 different configurations
- Size: 178,750 samples
College Admission
- Task Description: Predict whether a student will be admitted based on their background information.
- Variants: 26 different configurations
- Size: 7,800 samples
- Controlled Factors:
  - Number of features
  - Compositionality (depth of feature interactions)
  - Noise in outcome
  - Number of distractors
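To make the controlled factors concrete, here is a minimal sketch of how a toy admission dataset with a known ground-truth rule could be generated. It is an illustration under stated assumptions, not the HypoBench generator: the function names (`make_admission_dataset`, `ground_truth`), the boolean features, and the AND-within-group, OR-across-groups rule used to model compositionality are all hypothetical.

```python
import random


def ground_truth(row, feature_names, depth):
    """Hypothetical rule: split the real features into groups of size `depth`;
    a student is admitted if every feature in at least one group is True."""
    groups = [feature_names[i:i + depth] for i in range(0, len(feature_names), depth)]
    return any(all(row[name] for name in group) for group in groups)


def make_admission_dataset(n_samples, n_features=2, n_distractors=1,
                           depth=1, noise=0.1, seed=0):
    """Generate toy records whose label follows `ground_truth`, with label noise.

    n_features    -- number of features that actually determine the outcome
    n_distractors -- number of irrelevant features added to each record
    depth         -- how many features must hold jointly (compositionality)
    noise         -- probability that the true label is flipped
    """
    rng = random.Random(seed)
    feature_names = [f"feature_{i}" for i in range(n_features)]
    distractor_names = [f"distractor_{i}" for i in range(n_distractors)]
    rows = []
    for _ in range(n_samples):
        # Draw every attribute (relevant or not) as an independent coin flip.
        row = {name: rng.random() < 0.5 for name in feature_names + distractor_names}
        label = ground_truth(row, feature_names, depth)
        # Flip the label with probability `noise`.
        if rng.random() < noise:
            label = not label
        row["admitted"] = label
        rows.append(row)
    return rows


if __name__ == "__main__":
    for record in make_admission_dataset(3, n_features=4, n_distractors=2, depth=2):
        print(record)
```

Raising `depth`, `noise`, or `n_distractors` in this sketch makes the underlying rule harder to infer from the records, which is the kind of difficulty scaling the controlled factors are meant to provide.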
Shoe Sales
- Task Description: Given a customer’s appearance, predict which shoe they will buy.
- Variants: 3 different configurations
- Size: 3,300 samples
Marine Ecosystem
- Task Description: Given information about a marine ecosystem, predict the daily sunlight hours received at the location.
- Variants: 1 configuration
- Size: 500 samples
The synthetic datasets enable direct evaluation of how well models can recover known ground-truth hypotheses at varying levels of difficulty.
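As a rough illustration of what recovering a ground-truth hypothesis can mean, the sketch below computes the fraction of known ground-truth features that are mentioned in a set of generated hypotheses. This is an assumption for illustration only: the function `recovery_rate` and its case-insensitive substring matching are hypothetical and are not the evaluation procedure used by HypoBench.

```python
def recovery_rate(true_features, hypotheses):
    """Fraction of ground-truth features mentioned in at least one hypothesis.

    Uses naive case-insensitive substring matching, so it is only a crude proxy.
    """
    if not true_features:
        raise ValueError("true_features must not be empty")
    text = " ".join(hypotheses).lower()
    found = [feature for feature in true_features if feature.lower() in text]
    return len(found) / len(true_features)


if __name__ == "__main__":
    truth = ["gpa", "extracurricular"]
    generated = [
        "Students with a high GPA are admitted.",
        "Applicants from large cities are admitted.",
    ]
    print(recovery_rate(truth, generated))  # 0.5
```

A realistic evaluation would need to handle paraphrases and the direction of each relationship, which simple string matching cannot capture.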
For more detailed information about the datasets and our methodology, please refer to our paper.