OpenAI introduces new benchmark to ensure language models produce more accurate answers

OpenAI has introduced a new measurement benchmark to ensure that language models provide more accurate answers based on verified facts.

The company in an announcement on 30 October said the new benchmark known as SimpleQA will aid in measuring the factuality of language models, with a focus on getting models to correctly answer short, fact-seeking questions.

Solving the problem of factuality

Anyone who has used generative AI chatbots such as ChatGPT knows that they often give inaccurate or factually incorrect answers.

This is because training models that produce factually correct responses is challenging in the AI space. 

As a result, current language models often produce false outputs or answers unsubstantiated by evidence, a problem known as “hallucinations”. 

It is also difficult to measure the factuality of such outputs, but OpenAI seeks to address this with SimpleQA. The company is focusing on measuring the factuality of answers to short, fact-seeking questions rather than long-form responses.

While this narrows the scope of the new benchmark, it makes the factuality of responses easier to track.

The dataset, according to OpenAI, has high correctness and diversity, a design that is challenging for frontier models, and a good researcher UX.

The process

To build SimpleQA, OpenAI hired AI trainers to browse the web and create short, fact-seeking questions and corresponding answers. 

Questions included in the dataset must meet a strict set of criteria, one of which is that a second, independent AI trainer answered each question without seeing the original response. Only questions where both AI trainers’ answers agreed were included.

To verify the quality of the dataset, a third AI trainer answered a random sample of 1,000 questions; those answers matched the original agreed answers 94.4% of the time, a 5.6% disagreement rate, indicating a high level of factuality.
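The two-stage quality check described above can be sketched in code. The snippet below is a minimal, hypothetical illustration of agreement-based filtering and sample verification; the function names, normalization, and sample data are the author's assumptions, not OpenAI's actual pipeline.

```python
# Hypothetical sketch of agreement-based dataset filtering and
# sample verification; illustrative only, not OpenAI's pipeline.
import random


def normalize(answer: str) -> str:
    """Crude normalization so trivially different answers still match."""
    return answer.strip().lower()


def filter_by_agreement(candidates: list[dict]) -> list[dict]:
    """Keep only questions where two independent trainers agreed."""
    return [
        q for q in candidates
        if normalize(q["trainer1_answer"]) == normalize(q["trainer2_answer"])
    ]


def verify_sample(dataset: list[dict], third_answers: dict,
                  sample_size: int = 1000, seed: int = 0) -> float:
    """Estimate dataset quality from a third trainer's random sample.

    Returns the fraction of sampled questions where the third trainer's
    answer matches the agreed answer (analogous to the 94.4% figure).
    """
    rng = random.Random(seed)
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    matches = sum(
        normalize(third_answers[q["question"]])
        == normalize(q["trainer1_answer"])
        for q in sample
    )
    return matches / len(sample)


# Illustrative candidate questions from two trainers.
candidates = [
    {"question": "What year was OpenAI founded?",
     "trainer1_answer": "2015", "trainer2_answer": "2015"},
    {"question": "Who wrote Hamlet?",
     "trainer1_answer": "Shakespeare", "trainer2_answer": "Marlowe"},
]

dataset = filter_by_agreement(candidates)  # only the agreed question survives
```

The key design point is that the second trainer answers without seeing the first trainer's response, so agreement is evidence of correctness rather than copying.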
