Causal Dataset Discovery with Large Language Models

Author

Junfei Liu

Mentors

Fatemeh Nargesian and Anson Kahng

Abstract

Causal discovery, which uncovers causal links among observed variables, is crucial in scientific research, but inferring causality across relations in large-scale repositories remains challenging. Identifying causal relationships in batches is complex and time-intensive, and the complexity is significantly amplified when the analysis spans columns of multiple tables in heterogeneous collections such as data lakes. In this paper, we introduce the causal data lake discovery problem and propose a large language model (LLM)-based framework that discovers potential pairwise causal links between columns from different tables. We heuristically improve the LLM's grasp of causality through prompting and fine-tuning, and we counter the extreme imbalance in the distribution of causal candidates that results from the natural sparsity of causal connections. We create benchmarks specific to this task, show experimentally that our framework achieves strong performance on them, and outline extensions of the problem for future research.
