Literature Meets Data: A Synergistic Approach to Hypothesis Generation

University of Chicago



Abstract

AI holds promise for transforming scientific processes, including hypothesis generation. Prior work on hypothesis generation can be broadly categorized into theory-driven and data-driven approaches. While both have proven effective in generating novel and plausible hypotheses, it remains an open question whether they can complement each other. To address this, we develop the first method that combines literature-based insights with data to perform LLM-powered hypothesis generation. We apply our method to five different datasets and demonstrate that integrating literature and data outperforms other baselines (by 8.97% over few-shot, 15.75% over literature-based alone, and 3.37% over data-driven alone). Additionally, we conduct the first human evaluation to assess the utility of LLM-generated hypotheses in assisting human decision-making on two challenging tasks: deception detection and AI-generated content detection. Our results show that human accuracy improves significantly, by 7.44% and 14.19% on these tasks, respectively. These findings suggest that integrating literature-based and data-driven approaches provides a comprehensive and nuanced framework for hypothesis generation and could open new avenues for scientific inquiry.

Hypothesis generation is a critical yet understudied step in scientific discovery. Current approaches fall into two main categories:

  • Theory-driven methods: Utilize literature to propose hypotheses grounded in established human knowledge. However, they lack adaptability to new data.
  • Data-driven methods: Identify patterns within data to generate adaptive hypotheses but often overfit specific datasets, limiting their generalizability.

To overcome these limitations, we introduce a novel framework that integrates insights from both literature and data. By leveraging large language models (LLMs), our method synthesizes knowledge from literature and data, producing hypotheses that are both robust and adaptive.

Our hypothesis generation framework

Illustration of how we combine literature-based and data-driven hypotheses. See algorithmic details in section 2 of our paper.

Data-driven Hypothesis Generation

Our data-driven hypothesis generation is based on the HypoGeniC framework. The process involves two main stages:

  • Initialization: The model generates an initial set of hypotheses using a small subset of data. These hypotheses form the basis for further refinement.
  • Update: Hypotheses are iteratively refined based on their performance. Poorly performing hypotheses are replaced with new ones generated from challenging examples, ensuring continuous improvement.

This iterative approach improves the quality and adaptability of hypotheses by leveraging both the initial data patterns and feedback from challenging cases.

HypoGeniC figure

Illustration of HypoGeniC. During the update stage, we evaluate the top k hypotheses on each new training example and update the reward based on prediction correctness. If the number of hypotheses that get the example wrong exceeds a certain threshold, we add the example to a wrong example bank. The wrong example bank is then used to generate new hypotheses.
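
The loop below is a minimal sketch of this update stage. It assumes caller-supplied LLM wrappers (`predict_fn`, `generate_fn`) and uses plain accuracy as the reward; the actual HypoGeniC reward and bank-management details may differ.

```python
# Minimal sketch of the HypoGeniC-style update stage described above.
# Assumptions: `predict_fn(hypothesis, text) -> label` and
# `generate_fn(examples) -> list[str]` are hypothetical LLM wrappers supplied
# by the caller; the reward here is plain accuracy, whereas the actual method
# may include an exploration bonus.
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    correct: int = 0
    seen: int = 0

    @property
    def reward(self) -> float:
        return self.correct / self.seen if self.seen else 0.0


def update_stage(bank, train_examples, predict_fn, generate_fn,
                 k=5, wrong_threshold=3, regen_every=5, max_bank_size=20):
    wrong_example_bank = []
    for example in train_examples:
        # Evaluate the current top-k hypotheses on the new training example.
        top_k = sorted(bank, key=lambda h: h.reward, reverse=True)[:k]
        num_wrong = 0
        for hyp in top_k:
            hyp.seen += 1
            if predict_fn(hyp.text, example["text"]) == example["label"]:
                hyp.correct += 1
            else:
                num_wrong += 1
        # Too many top hypotheses missed this example: keep it as a hard case.
        if num_wrong >= wrong_threshold:
            wrong_example_bank.append(example)
        # Turn accumulated hard cases into fresh hypotheses and prune the bank.
        if len(wrong_example_bank) >= regen_every:
            bank.extend(Hypothesis(t) for t in generate_fn(wrong_example_bank))
            bank.sort(key=lambda h: h.reward, reverse=True)
            del bank[max_bank_size:]
            wrong_example_bank.clear()
    return bank
```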

Literature-based Hypothesis Generation

Our process begins by selecting 10 papers relevant to the research question from Semantic Scholar or Google Scholar. We also search within papers citing the original datasets for each task. These papers are converted into a JSON corpus using S2ORC-doc2json.

Then, we develop a paper summarizer to generate concise summaries. For the literature-only method, language models are instructed to generate hypotheses from these summaries, emphasizing their relevance and utility for the specific tasks under consideration.
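As a rough illustration, the sketch below retrieves candidate papers through the public Semantic Scholar Graph API and hands their titles and abstracts to caller-supplied LLM wrappers. The actual pipeline also searches Google Scholar, includes papers citing the original datasets, and parses full texts with S2ORC-doc2json, none of which is shown here.

```python
# Minimal sketch of the literature-based pipeline, assuming the public Semantic
# Scholar Graph API. `summarize_fn` and `generate_fn` are hypothetical LLM
# wrappers supplied by the caller; full-text parsing with S2ORC-doc2json and
# Google Scholar search are omitted for brevity.
import requests

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"


def fetch_relevant_papers(query, limit=10):
    """Retrieve candidate papers (title + abstract) for a research question."""
    resp = requests.get(
        S2_SEARCH,
        params={"query": query, "limit": limit, "fields": "title,abstract"},
    )
    resp.raise_for_status()
    return resp.json().get("data", [])


def literature_hypotheses(query, summarize_fn, generate_fn):
    papers = fetch_relevant_papers(query)
    summaries = [summarize_fn(p["title"], p.get("abstract") or "") for p in papers]
    # Prompt the LLM to propose task-relevant hypotheses grounded in the summaries.
    return generate_fn(
        f"Based on the following paper summaries, propose hypotheses for: {query}\n\n"
        + "\n\n".join(summaries)
    )
```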

Integration of Literature-Based and Data-Driven Hypotheses

A key contribution of our work is the integration of literature-based and data-driven hypothesis generation. This approach combines the strengths of both methods to enhance the generalizability and utility of generated hypotheses. We employ two strategies:

Refining Hypotheses with Literature and Data

The refinement method integrates paper summaries with HypoGeniC. During initialization, an LLM generates hypotheses based on both initial data examples and relevant paper summaries.

In the update stage, hypotheses generated from challenging examples are refined iteratively by data-driven and literature-based refinement agents. This iterative process ensures the hypotheses incorporate both empirical patterns and key insights from the literature. After multiple rounds of refinement, the final hypothesis bank is returned to the HypoGeniC pipeline for further use.

refinement of hypotheses

Illustration of how we refine hypotheses using literature and data. See algorithmic details in section 2 of our paper.
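
A minimal sketch of this two-agent refinement loop is shown below. Both agents are modeled as prompts to a single caller-supplied `llm_fn`; the prompt wording and the number of rounds are illustrative assumptions rather than the paper's actual settings.

```python
# Minimal sketch of the refinement strategy: hypotheses are revised alternately
# by a data-driven and a literature-based refinement agent. `llm_fn` is a
# hypothetical LLM wrapper; prompts and round count are illustrative.
def refine_hypotheses(hypotheses, hard_examples, paper_summaries, llm_fn, rounds=2):
    refined = list(hypotheses)
    for _ in range(rounds):
        # Data-driven refinement agent: ground each hypothesis in hard examples.
        refined = [
            llm_fn(
                "Revise the hypothesis so it better explains these examples.\n"
                f"Hypothesis: {h}\nExamples: {hard_examples}"
            )
            for h in refined
        ]
        # Literature-based refinement agent: align each hypothesis with the papers.
        refined = [
            llm_fn(
                "Revise the hypothesis using insights from these paper summaries.\n"
                f"Hypothesis: {h}\nSummaries: {paper_summaries}"
            )
            for h in refined
        ]
    return refined  # handed back to the HypoGeniC pipeline as the hypothesis bank
```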

Union and Redundancy Elimination

To address the potential undervaluation of literature-based hypotheses, we employ a union strategy. Two hypothesis banks are created: one from literature-based methods and the other using HypoGeniC or the refinement method. A redundancy checker removes similar or repetitive hypotheses, and the final hypothesis bank is constructed by selecting a balanced mix of hypotheses from both sources. This ensures a comprehensive and diverse set of hypotheses for further evaluation.
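
The sketch below illustrates one way such a union could work, with the redundancy checker abstracted as a caller-supplied `similar_fn` (e.g., an LLM judge or embedding cosine similarity). The interleaving rule used here for balance is an assumption, not the paper's exact procedure.

```python
# Minimal sketch of the union strategy with redundancy elimination.
# `similar_fn(a, b) -> bool` is a hypothetical redundancy checker supplied by
# the caller (e.g., an LLM judge or embedding cosine similarity).
from itertools import zip_longest


def union_banks(literature_bank, data_bank, similar_fn, target_size=20):
    merged = []
    # Interleave the two sources so neither dominates the final bank.
    for lit_h, data_h in zip_longest(literature_bank, data_bank):
        for candidate in (lit_h, data_h):
            if candidate is None:
                continue
            # Drop hypotheses too similar to one already kept.
            if any(similar_fn(candidate, kept) for kept in merged):
                continue
            merged.append(candidate)
            if len(merged) >= target_size:
                return merged
    return merged
```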

Experiments

The experiments evaluate the utility and novelty of the generated hypotheses using both automatic and human evaluation frameworks across diverse tasks.

Evaluation Framework

Hypotheses are evaluated on two dimensions: utility (improvement in decision-making) and novelty (unique insights). Evaluations include:

  • Automatic Evaluation: Performance on in-distribution (IND) and out-of-distribution (OOD) datasets. Hypotheses are used to prompt LLMs for inference, focusing on OOD generalizability (see the inference sketch after this list).
  • Cross-Model Evaluation: Hypotheses generated by one model are tested using another model.
  • Human Studies: Participants assess how hypotheses improve decision-making (utility) and whether they add unique perspectives (novelty).
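
For the automatic evaluation, a hypothesis-based inference call might look like the sketch below; the prompt wording and the `llm_fn` wrapper are illustrative assumptions, not the exact template used in the paper.

```python
# Minimal sketch of hypothesis-based inference for the automatic evaluation.
# `llm_fn` is a hypothetical LLM wrapper; the prompt wording is illustrative.
def predict_with_hypotheses(hypotheses, example_text, labels, llm_fn):
    prompt = (
        f"You are given hypotheses about what distinguishes the classes {labels}.\n\n"
        "Hypotheses:\n"
        + "\n".join(f"- {h}" for h in hypotheses)
        + f"\n\nExample:\n{example_text}\n\n"
        + f"Based on the hypotheses, answer with exactly one label from {labels}."
    )
    return llm_fn(prompt).strip()
```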

Human Studies

  • Study I (Utility): Participants, split into control and experimental groups, complete tasks with or without hypotheses. Results show hypotheses improve decision-making.
  • Study II (Novelty): Participants compare data-driven and literature-based hypotheses to determine if one provides new information beyond the other.

Tasks

  • Deception Detection: Identifying truthful vs. deceptive hotel reviews.
  • AI-Generated Content Detection: Distinguishing between human- and AI-written stories. We use GPT-generated stories (GPTGC) and Llama-generated stories (LlamaGC).
  • Persuasive Argument Prediction: Evaluating the persuasiveness of argument pairs.
  • Mental Stress Detection: Detecting stress signals in Reddit posts.

Results

The results demonstrate the effectiveness of integrating literature-based and data-driven hypothesis generation methods.

Automatic Evaluation

Combining literature-based and data-driven methods produced the best performance across tasks. The integrated approach achieved an accuracy improvement of 11.92% over few-shot methods and 16.54% over literature-based methods for GPT, and 6.03% over few-shot methods and 14.97% over literature-based methods for Llama.

Our framework with literature + data showed significant advantages over zero-shot, few-shot, data-driven and literature-based methods, particularly in handling OOD datasets.

OOD main results

Accuracy scores on the held-out OOD datasets. Literature + Data outperforms all other methods in every model and task configuration.

Human Evaluation

Generated hypotheses improved human decision-making in both Deception Detection and AIGC Detection. In AIGC Detection, accuracy increased by 14.19% (58.86% → 73.05%, p=0.01), and in Deception Detection, accuracy improved by 7.44% (57.14% → 64.58%, p=0.04). Participants used hypotheses in over 90% of decisions, with the most popular hypothesis used 44.55% of the time.

Human evaluation results

Human performance on Deception Detection and AIGC Detection.

Participants rated 100% of the hypotheses as helpful, with over 40% rated "Very helpful" or "Extremely helpful." Results from the novelty check study showed that 84% of hypothesis pairs in Deception Detection and 80% in AIGC Detection offered distinct insights, highlighting the strengths of combining literature-based and data-driven approaches.

Examples of Generated Hypotheses

Broader Impact

This work introduces a novel framework for integrating literature-based and data-driven hypothesis generation, with the potential to transform scientific research. By enhancing the generalizability and utility of hypotheses, this approach can potentially accelerate discoveries across fields like biology, medicine, economics, and more.

However, the integration of automated tools raises important considerations. Potential biases in data or literature could propagate through the generated hypotheses, necessitating careful evaluation and curation. Moreover, ensuring transparency in the hypothesis generation process is crucial to avoid over-reliance on automation.

Despite these challenges, the proposed framework provides a powerful tool for advancing knowledge, empowering researchers, and supporting complex decision-making in real-world applications. Its ability to synthesize insights from both data and literature offers a promising direction for interdisciplinary research and societal impact.

BibTeX

@misc{liu2024literaturemeetsdatasynergistic,
      title={Literature Meets Data: A Synergistic Approach to Hypothesis Generation},
      author={Haokun Liu and Yangqiaoyu Zhou and Mingxuan Li and Chenfei Yuan and Chenhao Tan},
      year={2024},
      eprint={2410.17309},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.17309}
}

@misc{zhou2024hypothesisgenerationlargelanguage,
      title={Hypothesis Generation with Large Language Models},
      author={Yangqiaoyu Zhou and Haokun Liu and Tejes Srivastava and Hongyuan Mei and Chenhao Tan},
      year={2024},
      eprint={2404.04326},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2404.04326}
}