Datasets and Tasks in HypoBench

What is Hypothesis Generation?

Hypothesis generation is the process of proposing natural-language theories or explanations for observed phenomena. It is a crucial step in both scientific discovery and everyday reasoning. For example:

  • In science: Inferring the heliocentric model from observations of planets and moons
  • In daily life: Proposing reasons why an applicant was not admitted to college

Real-World Datasets

HypoBench includes the following real-world datasets, which span several domains. Each one has an in-distribution (IND) split and an out-of-distribution (OOD) split:

Deception Detection

  • Task Description: Distinguish genuine from fake hotel reviews based on subtle linguistic cues.
  • IND Dataset: 1,600 hotel reviews (800 genuine and 800 fake).
  • OOD Dataset: 640 hotel reviews collected from different source websites and cities.

AI-Generated Content (AIGC) Detection

  • Task Description: Identify whether a story is human-written or AI-generated given a writing prompt.
  • IND Dataset: 800 writing prompts with corresponding stories.
  • OOD Dataset: 800 stories generated by alternative models.

Persuasive Argument Prediction

  • Task Description: Given a pair of arguments, predict which one is more persuasive.
  • IND Dataset: 750 pairs of arguments with persuasiveness labels.
  • OOD Dataset: 500 pairs from different original sources.

Mental Stress Detection

  • Task Description: Detect mental stress signals from Reddit posts across different communities.
  • IND Dataset: 1,000 Reddit post segments with stress labels.
  • OOD Dataset: 500 posts from different subreddits.

News Headline Engagements

  • Task Description: Given a pair of headlines for the same news article, predict which one will get more clicks.
  • IND Dataset: 700 headline pairs with engagement data.
  • OOD Dataset: 453 headline pairs from different sources.

Retweets Prediction

  • Task Description: Given a pair of tweets, predict which one will be retweeted more.
  • IND Dataset: 1,000 tweet pairs with retweet counts.
  • OOD Dataset: 500 tweet pairs from different domains.

Paper Citations

  • Task Description: Classify whether an academic paper will receive a high or low number of citations.
  • IND Dataset: 1,182 academic papers with citation data.
  • OOD Dataset: 1,104 papers from different research fields.
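
Each real-world dataset above pairs an IND split with an OOD split, so one natural use is to compare accuracy on the two. The following is a minimal sketch of that comparison, not HypoBench's evaluation code: the `(text, label)` record layout, the function names `accuracy` and `ind_ood_report`, and the toy predictor are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical record layout: (input text, integer label).
Example = Tuple[str, int]


def accuracy(predict: Callable[[str], int], examples: Sequence[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def ind_ood_report(
    predict: Callable[[str], int],
    ind: Sequence[Example],
    ood: Sequence[Example],
) -> Dict[str, float]:
    """Accuracy on each split plus the IND-minus-OOD gap."""
    ind_acc = accuracy(predict, ind)
    ood_acc = accuracy(predict, ood)
    return {"ind": ind_acc, "ood": ood_acc, "gap": ind_acc - ood_acc}


if __name__ == "__main__":
    # Toy stand-in for a hypothesis-based classifier (an assumption).
    toy_predict = lambda text: int("amazing" in text)
    ind_split = [("amazing stay", 1), ("fine room", 0)]
    ood_split = [("amazing view", 1), ("dirty but amazing", 0)]
    print(ind_ood_report(toy_predict, ind_split, ood_split))
```

A large positive gap would suggest that the hypotheses behind the predictor capture patterns specific to the IND source rather than ones that transfer.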

Synthetic Datasets

HypoBench includes carefully controlled synthetic datasets at different complexity levels:

Presidential Election

  • Task Description: Given a person’s tweet, predict which political party they will vote for in the 2024 election.
  • Variants: 78 different configurations
  • Size: 178,750 samples

Personality Prediction

  • Task Description: Determine personal preferences of users based on their tweets’ content, sentiment, and language patterns.
  • Variants: 76 different configurations
  • Size: 178,750 samples

College Admission

  • Task Description: Predict whether a student will be admitted based on their background information.
  • Variants: 26 different configurations
  • Size: 7,800 samples
  • Controlled Factors:
    • Number of features
    • Compositionality (depth of feature interactions)
    • Noise in outcome
    • Number of distractors
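
To make the four controlled factors concrete, here is a small illustrative generator. It is a sketch under stated assumptions and not the procedure used to build the HypoBench data: the function name `make_admission_dataset`, the binary features, and the rule form (an OR over AND-groups, with group size standing in for compositionality depth) are all invented for this example.

```python
import random
from typing import Dict, List, Tuple


def make_admission_dataset(
    n_samples: int,
    n_features: int,      # number of features that affect the outcome
    depth: int,           # features combined (AND) per group; 1 = no interaction
    noise: float,         # probability that the observed label is flipped
    n_distractors: int,   # extra features that never affect the outcome
    seed: int = 0,
) -> Tuple[List[Dict], List[List[int]]]:
    """Return (rows, groups); groups is the ground-truth rule structure."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    rng = random.Random(seed)
    # Assumed rule: admitted if ANY group has ALL of its features set.
    groups = [
        list(range(start, min(start + depth, n_features)))
        for start in range(0, n_features, depth)
    ]
    rows = []
    for _ in range(n_samples):
        relevant = [rng.randint(0, 1) for _ in range(n_features)]
        distractors = [rng.randint(0, 1) for _ in range(n_distractors)]
        clean = int(any(all(relevant[i] for i in group) for group in groups))
        # Outcome noise: flip the clean label with probability `noise`.
        label = 1 - clean if rng.random() < noise else clean
        rows.append(
            {"features": relevant + distractors, "clean_label": clean, "label": label}
        )
    return rows, groups


if __name__ == "__main__":
    rows, groups = make_admission_dataset(
        n_samples=5, n_features=4, depth=2, noise=0.1, n_distractors=3
    )
    print(groups)
    for row in rows:
        print(row)
```

Raising `depth`, `noise`, or `n_distractors` in this toy setup makes the underlying rule harder to recover from the samples, which mirrors the role the controlled factors play in the list above.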

Shoe Sales

  • Task Description: Given a customer’s appearance, predict which shoe they will buy.
  • Variants: 3 different configurations
  • Size: 3,300 samples

Marine Ecosystem

  • Task Description: Given information about a marine ecosystem, predict the daily sunlight hours received at the location.
  • Variants: 1 configuration
  • Size: 500 samples

The synthetic datasets enable direct evaluation of how well models can recover known ground-truth hypotheses at varying levels of difficulty.
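
As a rough illustration of what "recovering" a known hypothesis could mean, the sketch below scores generated hypotheses by how many ground-truth feature names they mention. This keyword-matching proxy is an assumption made for this example and is not the metric defined by HypoBench; the function name `feature_recovery_rate` and the sample feature names are hypothetical.

```python
from typing import Sequence


def feature_recovery_rate(
    ground_truth_features: Sequence[str],
    hypotheses: Sequence[str],
) -> float:
    """Fraction of ground-truth feature names mentioned in any hypothesis.

    Uses case-insensitive substring matching, which is a crude proxy:
    it misses paraphrases and ignores whether the stated relationship
    between feature and outcome is correct.
    """
    if not ground_truth_features:
        raise ValueError("ground_truth_features must be non-empty")
    combined = " ".join(h.lower() for h in hypotheses)
    found = [f for f in ground_truth_features if f.lower() in combined]
    return len(found) / len(ground_truth_features)


if __name__ == "__main__":
    truth = ["gpa", "test score", "extracurricular"]
    generated = [
        "Students with a high GPA are more likely to be admitted.",
        "A strong test score increases the chance of admission.",
    ]
    print(feature_recovery_rate(truth, generated))
```

A fuller evaluation would also need to check that each recovered feature is tied to the outcome in the right direction, which simple string matching cannot do.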

For more detailed information about the datasets and our methodology, please refer to our paper.