OpenAI has introduced a new benchmark for measuring how accurately language models answer questions grounded in verified facts.
In an announcement on 30 October, the company said the new benchmark, known as SimpleQA, will help measure the factuality of language models, focusing on whether models correctly answer short, fact-seeking questions.
Solving the problem of factuality
Anyone who has used generative AI chatbots such as ChatGPT knows that they often give inaccurate or factually incorrect answers.
This is because training models to produce factually correct responses remains a difficult problem in AI.
As a result, current language models often produce false outputs or answers unsubstantiated by evidence, a problem known as “hallucinations”.
Measuring the factuality of such outputs is also difficult, but OpenAI seeks to address this with SimpleQA. The company is focusing on measuring the factuality of answers to short, fact-seeking questions rather than long-form responses.
While this narrows the scope of the new benchmark, it makes the factuality of such responses much easier to measure.
The dataset, according to OpenAI, is built for high correctness and diversity, is designed to be challenging for frontier models, and offers a good researcher UX.
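To make the format concrete, here is a minimal, hypothetical sketch of what a SimpleQA-style item and a simple grading check could look like. The class and function names, the example question, and the substring-matching rule are illustrative assumptions for this article, not OpenAI's actual data or grading code.

```python
# Illustrative sketch only -- not OpenAI's SimpleQA data or grader.
from dataclasses import dataclass

@dataclass
class QAExample:
    question: str          # short, fact-seeking question
    reference_answer: str  # single verifiable answer agreed by trainers

def grade_answer(example: QAExample, model_answer: str) -> str:
    """Coarse grade: 'correct', 'incorrect', or 'not_attempted'.
    A real grader would be more lenient about phrasing; this simple
    substring check is only for illustration."""
    if not model_answer.strip():
        return "not_attempted"
    if example.reference_answer.lower() in model_answer.lower():
        return "correct"
    return "incorrect"

item = QAExample(
    question="In which year did Apollo 11 land on the Moon?",  # hypothetical item
    reference_answer="1969",
)
print(grade_answer(item, "Apollo 11 landed in 1969."))  # -> correct
```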
The process
To build SimpleQA, OpenAI hired AI trainers to browse the web and create short, fact-seeking questions and corresponding answers.
Questions included in the dataset had to meet a strict set of criteria. One was that a second, independent AI trainer answered each question without seeing the original response; only questions where both AI trainers' answers agreed were included.
To verify the quality of the dataset, a third AI trainer answered a random sample of 1,000 questions from the dataset. Their answers matched the original agreed answers 94.4% of the time, a 5.6% disagreement rate, indicating a high level of factuality.
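The filtering and verification steps described above can be illustrated with a short, hypothetical sketch. The function names, data layout, and string-matching rule below are assumptions made for illustration, not OpenAI's actual pipeline.

```python
# Illustrative sketch only -- mimics the two steps reported above:
# (1) keep only questions where two independent trainers agree,
# (2) estimate quality from a random sample checked by a third trainer.
import random

def agreement_filter(candidates):
    """Keep only questions where the original trainer's answer and a
    second, independent trainer's answer agree after simple normalisation."""
    return [
        c for c in candidates
        if c["answer_1"].strip().lower() == c["answer_2"].strip().lower()
    ]

def estimate_match_rate(dataset, third_trainer_answers, sample_size=1000, seed=0):
    """Estimate how often a third trainer's answer matches the agreed
    answer on a random sample of the dataset."""
    rng = random.Random(seed)
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    matches = sum(
        1 for c in sample
        if third_trainer_answers[c["question"]].strip().lower()
        == c["answer_1"].strip().lower()
    )
    return matches / len(sample)  # e.g. 0.944 corresponds to a 5.6% disagreement rate
```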