Datasets and Tasks in HypoBench

What is Hypothesis Generation?

Hypothesis generation is the process of proposing natural-language theories or explanations for observed phenomena. It is a crucial step in both scientific discovery and everyday reasoning. For example:

- In science: Inferring the heliocentric model from observations of planets and moons
- In daily life: Proposing reasons why one didn’t get admitted to college

Real World Datasets

HypoBench includes the following real-world datasets, which span various domains. Each one has an in-distribution (IND) split and an out-of-distribution (OOD) split:

Deception Detection

Task Description: Distinguish genuine and fake hotel reviews based on subtle linguistic cues.
IND Dataset: 1,600 hotel reviews (800 genuine and 800 fake).
OOD Dataset: 640 hotel reviews collected from different source websites and cities.

AI-Generated Content (AIGC) Detection

Task Description: Identify whether a story is human-written or AI-generated given a writing prompt.
IND Dataset: 800 writing prompts with corresponding stories.
OOD Dataset: 800 stories generated by alternative models.

Persuasive Argument Prediction

Task Description: Predict which text is more persuasive between pairs of arguments.
IND Dataset: 750 pairs of arguments with persuasiveness labels.
OOD Dataset: 500 pairs from different original sources.

Mental Stress Detection

Task Description: Detect mental stress signals from Reddit posts across different communities.
IND Dataset: 1,000 Reddit post segments with stress labels.
OOD Dataset: 500 posts from different subreddits.

News Headline Engagements

Task Description: Given a pair of headlines for the same news article, predict which one will get more clicks.
IND Dataset: 700 headline pairs with engagement data.
OOD Dataset: 453 headline pairs from different sources.

Retweets Prediction

Task Description: Given a pair of tweets, predict which one will be retweeted more.
IND Dataset: 1,000 tweet pairs with retweet counts.
OOD Dataset: 500 tweet pairs from different domains.

Paper Citations

Task Description: Classify whether an academic paper will receive high or low citations.
IND Dataset: 1,182 academic papers with citation data.
OOD Dataset: 1,104 papers from different research fields.
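
Since every real-world task above has both an IND and an OOD split, one natural check is to score the same hypothesis on both and compare. The sketch below is only an illustration of that idea: the `accuracy` helper, the keyword rule, and the example reviews are all made-up placeholders, not HypoBench data or the benchmark's official metric.

```python
def accuracy(predict, examples):
    """Fraction of (text, label) pairs for which predict(text) equals label."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(predict(text) == label for text, label in examples)
    return correct / len(examples)


# Placeholder examples invented for this sketch (NOT taken from HypoBench).
ind_split = [
    ("we loved every minute", "fake"),
    ("the elevator was broken on tuesday", "genuine"),
    ("loved the amazing luxury", "fake"),
    ("check-in took forty minutes", "genuine"),
]
ood_split = [
    ("loved the location near the station", "genuine"),
    ("best hotel ever, loved it", "fake"),
]


def keyword_rule(text):
    """Toy hypothesis: reviews containing 'loved' are fake (an assumption)."""
    return "fake" if "loved" in text else "genuine"


ind_acc = accuracy(keyword_rule, ind_split)
ood_acc = accuracy(keyword_rule, ood_split)
generalization_gap = ind_acc - ood_acc
print(f"IND={ind_acc:.2f} OOD={ood_acc:.2f} gap={generalization_gap:.2f}")
```

A large gap between the two scores would suggest that a hypothesis fits quirks of the IND source rather than a pattern that transfers.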

Synthetic Datasets

HypoBench includes carefully controlled synthetic datasets at different complexity levels:

Presidential Election

Task Description: Given a person’s tweet, predict which political party they will vote for in the 2024 election.
Variants: 78 different configurations
Size: 178,750 samples

Personality Prediction

Task Description: Determine personal preferences of users based on their tweets’ content, sentiment, and language patterns.
Variants: 76 different configurations
Size: 178,750 samples

College Admission

Task Description: Predict whether a student will be admitted based on their background information.
Variants: 26 different configurations
Size: 7,800 samples
Controlled Factors:
- Number of features
- Compositionality (depth of feature interactions)
- Noise in outcome
- Number of distractors
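
To make the four controlled factors concrete, here is a minimal sketch of how such a synthetic task could be generated. It is not the HypoBench generator: the function name `make_admission_data`, the binary features, and the rule (admit if any group of `depth` relevant features is fully satisfied) are assumptions chosen for illustration.

```python
import random


def make_admission_data(n_samples, n_features=2, n_distractors=1,
                        noise=0.0, depth=1, seed=0):
    """Return a list of (features, label) pairs with a known ground-truth rule.

    Hypothetical sketch: parameter names and the rule are illustrative
    assumptions, not the actual HypoBench generation procedure.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        # Relevant features determine the outcome; distractors never do.
        relevant = [rng.randint(0, 1) for _ in range(n_features)]
        distractors = [rng.randint(0, 1) for _ in range(n_distractors)]
        # Compositionality: split relevant features into consecutive groups
        # of size `depth` (the last group may be shorter). A student is
        # admitted if every feature in at least one group equals 1.
        groups = [relevant[i:i + depth] for i in range(0, n_features, depth)]
        label = int(any(all(group) for group in groups))
        # Outcome noise: flip the label with probability `noise`.
        if rng.random() < noise:
            label = 1 - label
        data.append((relevant + distractors, label))
    return data


# Four relevant features interacting in pairs, three distractors, 10% noise.
samples = make_admission_data(1000, n_features=4, n_distractors=3,
                              noise=0.1, depth=2, seed=42)
print(len(samples), len(samples[0][0]))
```

Because the rule is known in advance, a generated hypothesis can be compared directly against it, and each factor can be varied on its own to make the task harder.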

Shoe Sales

Task Description: Given a customer’s appearance, predict which shoe they will buy.
Variants: 3 different configurations
Size: 3,300 samples

Marine Ecosystem

Task Description: Given information about a marine ecosystem, predict the daily sunlight hours received at the location.
Variants: 1 configuration
Size: 500 samples

The synthetic datasets enable direct evaluation of how well models can recover known ground-truth hypotheses at varying levels of difficulty.

For more detailed information about the datasets and our methodology, please refer to our paper. To download the datasets, please visit our GitHub repository.