FACTS Benchmark Suite: a new way to systematically evaluate LLM factuality

Large language models (LLMs) are increasingly a primary source of information across diverse use cases, so it’s important that their responses are factually accurate.

To keep improving their performance on this industry-wide challenge, we need to better understand the types of use cases where models struggle to provide an accurate response, and to measure factuality performance more precisely in those areas.

The FACTS Benchmark Suite

Today, we’re teaming up with Kaggle to introduce the FACTS Benchmark Suite. It extends our previous work developing the FACTS Grounding Benchmark with three additional factuality benchmarks:

A Parametric Benchmark that measures the model’s ability to access its internal knowledge accurately in factoid question use-cases.
A Search Benchmark that tests a model’s ability to use Search as a tool to retrieve information and synthesize it correctly.
A Multimodal Benchmark that tests a model’s ability to answer prompts related to input images in a factually correct manner.

We are also updating the original FACTS Grounding Benchmark with Grounding Benchmark – v2, an extended benchmark that tests a model’s ability to provide answers grounded in the context of a given prompt.

The benchmarks were carefully curated and together comprise 3,513 examples, which we are making publicly available today. As with our previous release, we are following standard industry practice and keeping a held-out evaluation set private. The FACTS Benchmark Suite Score (or FACTS Score) is calculated as the average accuracy of both public and private sets across the four benchmarks. Kaggle will oversee the management of the FACTS Benchmark Suite. This includes owning the private held-out sets, testing the leading LLMs on the benchmarks, and hosting the results on a public leaderboard. More details about the FACTS evaluation methodology can be found in our tech report.
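
To make the FACTS Score description concrete, here is a minimal sketch of one plausible reading of "average accuracy of both public and private sets across the four benchmarks". It assumes each benchmark's accuracy is the unweighted mean of its public and private accuracy, and that the four benchmarks are then averaged with equal weight. The exact weighting is defined in the tech report, not here. The function name, dictionary layout, and all numbers below are hypothetical and chosen only for illustration.

```python
from statistics import mean

# The four benchmarks named in the suite.
BENCHMARKS = ("grounding", "parametric", "search", "multimodal")

# Hypothetical accuracies as fractions in [0, 1]. These are NOT real results.
example_accuracies = {
    "grounding": {"public": 0.70, "private": 0.60},
    "parametric": {"public": 0.50, "private": 0.50},
    "search": {"public": 0.80, "private": 0.70},
    "multimodal": {"public": 0.40, "private": 0.50},
}


def facts_score(accuracies):
    """Average public/private accuracy per benchmark, then across benchmarks.

    Assumption: equal weighting at both levels. The official methodology
    may weight sets or benchmarks differently.
    """
    missing = [name for name in BENCHMARKS if name not in accuracies]
    if missing:
        raise ValueError(f"missing benchmarks: {missing}")
    per_benchmark = [
        mean([accuracies[name]["public"], accuracies[name]["private"]])
        for name in BENCHMARKS
    ]
    return mean(per_benchmark)


if __name__ == "__main__":
    print(f"Illustrative FACTS Score: {facts_score(example_accuracies):.4f}")
```

Under this reading, a model that does well on one benchmark but poorly on another is pulled toward the middle, which fits the suite's goal of measuring factuality across different use cases rather than in a single setting.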

Benchmark overview

Parametric Benchmark

The FACTS Parametric Benchmark assesses the ability of models to accurately answer factual questions without the aid of external tools like web search. All the questions in the benchmark are “trivia style” questions driven by user interest that can be answered via Wikipedia (a standard source for LLM pretraining). The resulting benchmark consists of a 1,052-item public set and a 1,052-item private set.
