Literature Meets Data: A Synergistic Approach to Hypothesis Generation

University of Chicago



Abstract

AI holds promise for transforming scientific processes, including hypothesis generation. Prior work on hypothesis generation can be broadly categorized into theory-driven and data-driven approaches. While both have proven effective in generating novel and plausible hypotheses, it remains an open question whether they can complement each other. To address this, we develop the first method that combines literature-based insights with data to perform LLM-powered hypothesis generation. We apply our method to five different datasets and demonstrate that integrating literature and data outperforms other baselines (by 8.97% over few-shot, 15.75% over literature-based alone, and 3.37% over data-driven alone). Additionally, we conduct the first human evaluation to assess the utility of LLM-generated hypotheses in assisting human decision-making on two challenging tasks: deception detection and AI-generated content detection. Our results show that human accuracy improves significantly, by 7.44% and 14.19% on these tasks, respectively. These findings suggest that integrating literature-based and data-driven approaches provides a comprehensive and nuanced framework for hypothesis generation and could open new avenues for scientific inquiry.

Hypothesis generation is a critical yet understudied step in scientific discovery. Current approaches fall into two main categories:

  • Theory-driven methods: Utilize literature to propose hypotheses grounded in established human knowledge. However, they lack adaptability to new data.
  • Data-driven methods: Identify patterns within data to generate adaptive hypotheses but often overfit specific datasets, limiting their generalizability.

To overcome these limitations, we introduce a novel framework that integrates insights from both literature and data. By leveraging large language models (LLMs), our method synthesizes knowledge from literature and data, producing hypotheses that are both robust and adaptive.

Our hypothesis generation framework

Illustration of how we combine literature-based and data-driven hypotheses. See algorithmic details in section 2 of our paper.

Data-driven Hypothesis Generation

Our data-driven hypothesis generation is based on the HypoGeniC framework. The process involves two main stages:

  • Initialization: The model generates an initial set of hypotheses using a small subset of data. These hypotheses form the basis for further refinement.
  • Update: Hypotheses are iteratively refined based on their performance. Poorly performing hypotheses are replaced with new ones generated from challenging examples, ensuring continuous improvement.

This iterative approach improves the quality and adaptability of hypotheses by leveraging both the initial data patterns and feedback from challenging cases.

HypoGeniC figure

Illustration of HypoGeniC. During the update stage, we evaluate the top k hypotheses on each new training example and update the reward based on prediction correctness. If the number of hypotheses that get the example wrong exceeds a certain threshold, we add the example to a wrong example bank. The wrong example bank is then used to generate new hypotheses.
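
The loop below is a minimal sketch of this update stage. It assumes caller-supplied LLM wrappers (`predict_fn`, `generate_fn`) and uses plain accuracy as the reward; the actual HypoGeniC reward and bank-management details may differ.

```python
# Minimal sketch of the HypoGeniC-style update stage described above.
# Assumptions: `predict_fn(hypothesis, text) -> label` and
# `generate_fn(examples) -> list[str]` are hypothetical LLM wrappers supplied
# by the caller; the reward here is plain accuracy, whereas the actual method
# may include an exploration bonus.
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    correct: int = 0
    seen: int = 0

    @property
    def reward(self) -> float:
        return self.correct / self.seen if self.seen else 0.0


def update_stage(bank, train_examples, predict_fn, generate_fn,
                 k=5, wrong_threshold=3, regen_every=5, max_bank_size=20):
    wrong_example_bank = []
    for example in train_examples:
        # Evaluate the current top-k hypotheses on the new training example.
        top_k = sorted(bank, key=lambda h: h.reward, reverse=True)[:k]
        num_wrong = 0
        for hyp in top_k:
            hyp.seen += 1
            if predict_fn(hyp.text, example["text"]) == example["label"]:
                hyp.correct += 1
            else:
                num_wrong += 1
        # Too many top hypotheses missed this example: keep it as a hard case.
        if num_wrong >= wrong_threshold:
            wrong_example_bank.append(example)
        # Turn accumulated hard cases into fresh hypotheses and prune the bank.
        if len(wrong_example_bank) >= regen_every:
            bank.extend(Hypothesis(t) for t in generate_fn(wrong_example_bank))
            bank.sort(key=lambda h: h.reward, reverse=True)
            del bank[max_bank_size:]
            wrong_example_bank.clear()
    return bank
```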

Literature-based Hypothesis Generation

Our process begins by selecting 10 papers relevant to the research question from Semantic Scholar or Google Scholar. We also search within papers citing the original datasets for each task. These papers are converted into a JSON corpus using S2ORC-doc2json.

Then, we develop a paper summarizer to generate concise summaries. For the literature-only method, language models are instructed to generate hypotheses from these summaries, emphasizing their relevance and utility for the specific tasks under consideration.
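As a rough illustration, the sketch below retrieves candidate papers through the public Semantic Scholar Graph API and hands their titles and abstracts to caller-supplied LLM wrappers. The actual pipeline also searches Google Scholar, includes papers citing the original datasets, and parses full texts with S2ORC-doc2json, none of which is shown here.

```python
# Minimal sketch of the literature-based pipeline, assuming the public Semantic
# Scholar Graph API. `summarize_fn` and `generate_fn` are hypothetical LLM
# wrappers supplied by the caller; full-text parsing with S2ORC-doc2json and
# Google Scholar search are omitted for brevity.
import requests

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"


def fetch_relevant_papers(query, limit=10):
    """Retrieve candidate papers (title + abstract) for a research question."""
    resp = requests.get(
        S2_SEARCH,
        params={"query": query, "limit": limit, "fields": "title,abstract"},
    )
    resp.raise_for_status()
    return resp.json().get("data", [])


def literature_hypotheses(query, summarize_fn, generate_fn):
    papers = fetch_relevant_papers(query)
    summaries = [summarize_fn(p["title"], p.get("abstract") or "") for p in papers]
    # Prompt the LLM to propose task-relevant hypotheses grounded in the summaries.
    return generate_fn(
        f"Based on the following paper summaries, propose hypotheses for: {query}\n\n"
        + "\n\n".join(summaries)
    )
```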

Integration of Literature-Based and Data-Driven Hypotheses

A key contribution of our work is the integration of literature-based and data-driven hypothesis generation. This approach combines the strengths of both methods to enhance the generalizability and utility of generated hypotheses. We employ two strategies:

Refining Hypotheses with Literature and Data

The refinement method integrates paper summaries with HypoGeniC. During initialization, an LLM generates hypotheses based on both initial data examples and relevant paper summaries.

In the update stage, hypotheses generated from challenging examples are refined iteratively by data-driven and literature-based refinement agents. This iterative process ensures the hypotheses incorporate both empirical patterns and key insights from the literature. After multiple rounds of refinement, the final hypothesis bank is returned to the HypoGeniC pipeline for further use.

refinement of hypotheses

Illustration of how we refine hypotheses using literature and data. See algorithmic details in section 2 of our paper.
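
A minimal sketch of this two-agent refinement loop is shown below. Both agents are modeled as prompts to a single caller-supplied `llm_fn`; the prompt wording and the number of rounds are illustrative assumptions rather than the paper's actual settings.

```python
# Minimal sketch of the refinement strategy: hypotheses are revised alternately
# by a data-driven and a literature-based refinement agent. `llm_fn` is a
# hypothetical LLM wrapper; prompts and round count are illustrative.
def refine_hypotheses(hypotheses, hard_examples, paper_summaries, llm_fn, rounds=2):
    refined = list(hypotheses)
    for _ in range(rounds):
        # Data-driven refinement agent: ground each hypothesis in hard examples.
        refined = [
            llm_fn(
                "Revise the hypothesis so it better explains these examples.\n"
                f"Hypothesis: {h}\nExamples: {hard_examples}"
            )
            for h in refined
        ]
        # Literature-based refinement agent: align each hypothesis with the papers.
        refined = [
            llm_fn(
                "Revise the hypothesis using insights from these paper summaries.\n"
                f"Hypothesis: {h}\nSummaries: {paper_summaries}"
            )
            for h in refined
        ]
    return refined  # handed back to the HypoGeniC pipeline as the hypothesis bank
```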

Union and Redundancy Elimination

To address the potential undervaluation of literature-based hypotheses, we employ a union strategy. Two hypothesis banks are created: one from literature-based methods and the other using HypoGeniC or the refinement method. A redundancy checker removes similar or repetitive hypotheses, and the final hypothesis bank is constructed by selecting a balanced mix of hypotheses from both sources. This ensures a comprehensive and diverse set of hypotheses for further evaluation.
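
The sketch below illustrates one way such a union could work, with the redundancy checker abstracted as a caller-supplied `similar_fn` (e.g., an LLM judge or embedding cosine similarity). The interleaving rule used here for balance is an assumption, not the paper's exact procedure.

```python
# Minimal sketch of the union strategy with redundancy elimination.
# `similar_fn(a, b) -> bool` is a hypothetical redundancy checker supplied by
# the caller (e.g., an LLM judge or embedding cosine similarity).
from itertools import zip_longest


def union_banks(literature_bank, data_bank, similar_fn, target_size=20):
    merged = []
    # Interleave the two sources so neither dominates the final bank.
    for lit_h, data_h in zip_longest(literature_bank, data_bank):
        for candidate in (lit_h, data_h):
            if candidate is None:
                continue
            # Drop hypotheses too similar to one already kept.
            if any(similar_fn(candidate, kept) for kept in merged):
                continue
            merged.append(candidate)
            if len(merged) >= target_size:
                return merged
    return merged
```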

Experiments

The experiments evaluate the utility and novelty of the generated hypotheses using both automatic and human evaluation frameworks across diverse tasks.

Evaluation Framework

Hypotheses are evaluated on two dimensions: utility (improvement in decision-making) and novelty (unique insights). Evaluations include:

  • Automatic Evaluation: Performance on in-distribution (IND) and out-of-distribution (OOD) datasets. Hypotheses are used to prompt LLMs for inference, focusing on OOD generalizability (see the inference sketch after this list).
  • Cross-Model Evaluation: Hypotheses generated by one model are tested using another model.
  • Human Studies: Participants assess how hypotheses improve decision-making (utility) and whether they add unique perspectives (novelty).
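
For the automatic evaluation, a hypothesis-based inference call might look like the sketch below; the prompt wording and the `llm_fn` wrapper are illustrative assumptions, not the exact template used in the paper.

```python
# Minimal sketch of hypothesis-based inference for the automatic evaluation.
# `llm_fn` is a hypothetical LLM wrapper; the prompt wording is illustrative.
def predict_with_hypotheses(hypotheses, example_text, labels, llm_fn):
    prompt = (
        f"You are given hypotheses about what distinguishes the classes {labels}.\n\n"
        "Hypotheses:\n"
        + "\n".join(f"- {h}" for h in hypotheses)
        + f"\n\nExample:\n{example_text}\n\n"
        + f"Based on the hypotheses, answer with exactly one label from {labels}."
    )
    return llm_fn(prompt).strip()
```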

Human Studies

  • Study I (Utility): Participants, split into control and experimental groups, complete tasks with or without hypotheses. Results show hypotheses improve decision-making.
  • Study II (Novelty): Participants compare data-driven and literature-based hypotheses to determine if one provides new information beyond the other.

Tasks

  • Deception Detection: Identifying truthful vs. deceptive hotel reviews.
  • AI-Generated Content Detection: Distinguishing between human- and AI-written stories. We use GPT-generated stories (GPTGC) and Llama-generated stories (LlamaGC).
  • Persuasive Argument Prediction: Evaluating the persuasiveness of argument pairs.
  • Mental Stress Detection: Detecting stress signals in Reddit posts.

Results

The results demonstrate the effectiveness of integrating literature-based and data-driven hypothesis generation methods.

Automatic Evaluation

Combining literature-based and data-driven methods produced the best performance across tasks. The integrated approach achieved an accuracy improvement of 11.92% over few-shot methods and 16.54% over literature-based methods for GPT, and 6.03% over few-shot methods and 14.97% over literature-based methods for Llama.

Our framework with literature + data showed significant advantages over zero-shot, few-shot, data-driven and literature-based methods, particularly in handling OOD datasets.

OOD main results

Accuracy scores on the held-out OOD datasets. Literature + Data outperforms all other methods in every model and task configuration.

Human Evaluation

Generated hypotheses improved human decision-making in both Deception Detection and AIGC Detection. In AIGC Detection, accuracy increased by 14.19% (58.86% → 73.05%, p=0.01), and in Deception Detection, accuracy improved by 7.44% (57.14% → 64.58%, p=0.04). Participants used hypotheses in over 90% of decisions, with the most popular hypothesis used 44.55% of the time.

Human evaluation results

Human performance on Deception Detection and AIGC Detection.

Participants rated 100% of the hypotheses as helpful, with over 40% rated "Very helpful" or "Extremely helpful." Results from the novelty check study showed that 84% of hypothesis pairs in Deception Detection and 80% in AIGC Detection offered distinct insights, highlighting the strengths of combining literature-based and data-driven approaches.

Examples of Generated Hypotheses

Broader Impact

This work introduces a novel framework for integrating literature-based and data-driven hypothesis generation, with the potential to transform scientific research. By enhancing the generalizability and utility of hypotheses, this approach can potentially accelerate discoveries across fields like biology, medicine, economics, and more.

However, the integration of automated tools raises important considerations. Potential biases in data or literature could propagate through the generated hypotheses, necessitating careful evaluation and curation. Moreover, ensuring transparency in the hypothesis generation process is crucial to avoid over-reliance on automation.

Despite these challenges, the proposed framework provides a powerful tool for advancing knowledge, empowering researchers, and supporting complex decision-making in real-world applications. Its ability to synthesize insights from both data and literature offers a promising direction for interdisciplinary research and societal impact.

BibTeX

@misc{liu2024literaturemeetsdatasynergistic,
      title={Literature Meets Data: A Synergistic Approach to Hypothesis Generation},
      author={Haokun Liu and Yangqiaoyu Zhou and Mingxuan Li and Chenfei Yuan and Chenhao Tan},
      year={2024},
      eprint={2410.17309},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.17309}
}

@misc{zhou2024hypothesisgenerationlargelanguage,
      title={Hypothesis Generation with Large Language Models},
      author={Yangqiaoyu Zhou and Haokun Liu and Tejes Srivastava and Hongyuan Mei and Chenhao Tan},
      year={2024},
      eprint={2404.04326},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2404.04326}
}