Evaluating and enhancing probabilistic reasoning in language models


To understand the probabilistic reasoning capabilities of three state-of-the-art LLMs (Gemini and GPT family models), we define three distinct tasks: estimating percentiles, drawing samples, and calculating probabilities. These tasks reflect key aspects of interpreting probability distributions: understanding where a sample falls within a distribution (percentiles), generating representative data (sampling), and assessing the likelihood of outcomes (probabilities). Testing these abilities lets us assess how well LLMs can reason over both idealized and real-world distributions.
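
To make these tasks concrete, here is a minimal sketch of how ground-truth answers can be computed for an idealized distribution with SciPy; the normal distribution and query values are illustrative choices, not settings from the paper.

```python
# Ground truth for the three tasks on an idealized normal
# distribution (parameters and query values are illustrative).
from scipy import stats

dist = stats.norm(loc=170, scale=10)

# Percentiles: where does a value fall within the distribution?
print(dist.cdf(185))  # ~0.933, i.e., roughly the 93rd percentile

# Sampling: draw representative values from the distribution.
print(dist.rvs(size=5, random_state=0))

# Probabilities: how likely is an outcome in a given range?
print(dist.cdf(180) - dist.cdf(160))  # P(160 <= X <= 180) ~ 0.683
```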

Since no publicly available dataset existed for LLM-based probabilistic reasoning, we developed a new dataset combining real-world and idealized distributions. For the real-world distributions, we collected data from three domains: health, finance, and climate. The health data were de-identified and sampled from 100,000 Fitbit users in the U.S. aged 18–65 who consented to their data being used for research; they included metrics like step count, resting heart rate, sleep duration, and exercise minutes. Financial data were obtained from the U.S. Census Bureau’s American Community Survey, and climate data came from NOAA’s Global Historical Climatology Network. We manually curated each dataset to apply relevant filtering (e.g., removing erroneous records).

In addition, we programmatically generated idealized distributions using Python libraries to complement the real-world data and better test the probabilistic reasoning capabilities of language models. While we generated 12 idealized distributions, this blog post focuses on three: normal, log-normal, and power law. See the paper to learn about all of the generated distributions.
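
As a rough sketch of how such distributions can be generated programmatically (the parameters and sample size here are assumptions for illustration, not the paper’s settings):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000  # samples per distribution (illustrative)

idealized = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=n),
    "log_normal": rng.lognormal(mean=0.0, sigma=1.0, size=n),
    # Power law via the classical Pareto distribution (shape a=2, x_m=1).
    "power_law": rng.pareto(a=2.0, size=n) + 1.0,
}

for name, samples in idealized.items():
    print(name, samples.mean(), np.median(samples))
```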

We evaluated the Gemini and GPT family models on the three tasks using 12 idealized distributions and 12 real-world distributions. To enhance probabilistic reasoning, we explored three strategies for providing more context to the LLMs:

  1. Anchoring examples from within a distribution or its family: We provided anchoring examples drawn from the same distribution or from related distributions in its family. For instance, when estimating percentiles for a normal distribution, we included value–percentile pairs from that distribution, allowing the model to interpolate and make more accurate predictions (see the first sketch below).
  2. Adding real-world context: We introduced domain-specific data, such as U.S. rental prices from the American Community Survey, when estimating the percentile of monthly rent values. This grounded the model’s reasoning in practical, real-world information (see the second sketch below).
  3. Leveraging summary statistics to approximate a normal distribution: We used summary statistics and normal approximations to simplify complex distributions. For example, income data, which typically follow a power law distribution, were approximated as normal, helping the model make reasonably accurate predictions despite the complexity of the actual, underlying distribution (see the third sketch below).
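
To illustrate strategy 1, here is a minimal sketch of how anchoring value–percentile pairs might be folded into a prompt; the distribution, anchor values, and prompt wording are hypothetical:

```python
# Hypothetical anchoring examples for percentile estimation
# (strategy 1); the distribution and values are illustrative.
from scipy import stats

dist = stats.norm(loc=170, scale=10)
anchor_values = [150, 160, 170, 180, 190]
anchors = [(v, round(100 * dist.cdf(v))) for v in anchor_values]

lines = ["Here are values and their percentiles in a distribution:"]
lines += [f"- value {v} is at percentile {p}" for v, p in anchors]
lines.append("Estimate the percentile of the value 183.")
prompt = "\n".join(lines)
print(prompt)
```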
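
Strategy 2 can be sketched the same way, by prepending domain data to the query; the rent values below are placeholders, not actual American Community Survey figures:

```python
# Hypothetical real-world context for percentile estimation
# (strategy 2); these rents are placeholders, not ACS data.
context_rents = [850, 1100, 1400, 1900, 2600]

prompt = (
    "Example U.S. monthly rents (USD): "
    + ", ".join(str(r) for r in context_rents)
    + "\nEstimate the percentile of a monthly rent of $1,600."
)
print(prompt)
```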
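
Finally, a sketch of the normal approximation in strategy 3, assuming only a mean and standard deviation are available as summary statistics (the numbers are made up):

```python
# Normal approximation from summary statistics (strategy 3);
# the mean and standard deviation here are made-up numbers.
from scipy import stats

mean, std = 63_000, 40_000  # hypothetical income summary statistics
approx = stats.norm(loc=mean, scale=std)

# Estimated percentile of a $100,000 income under the approximation.
print(round(100 * approx.cdf(100_000)))  # ~82
```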
