Agents Enter the Workforce. Factories Rise to Run Them

This week AI moved deeper into the fabric of work and culture. Salesforce confirmed that agents are no longer experiments but the front line, handling millions of cases once done by people. Cisco and NVIDIA revealed the Secure AI Factory, the kind of infrastructure that will anchor this new workforce. OpenAI researchers traced why models hallucinate, exposing flaws in how they are trained and tested. XPENG showed AI mobility stretching from roads to skies. And in a landmark move, Anthropic reached a settlement with authors, an early sign that the future of creativity and machine learning will be negotiated, not assumed.

📌 In today’s Generative AI Newsletter

  • Salesforce confirms 4,000 jobs replaced by AI agents
  • OpenAI study: Why language models hallucinate
  • Cisco and NVIDIA launch the Secure AI Factory
  • OpenAI bets on jobs and skills with new platforms
  • XPENG debuts AI mobility breakthroughs at IAA 2025
  • Anthropic settlement with authors reshapes AI training

Salesforce Cuts 4,000 Jobs as Agentforce Takes Over

Source: ERP Today

Salesforce has confirmed a major shift in its operations. CEO Marc Benioff said that the company has reduced its customer support staff from about 9,000 to 5,000, with the workload now being handled by Agentforce, its AI-powered system. Agentforce currently processes more than 1.5 million customer cases, covering everything from basic troubleshooting to complex service requests.

The company reports that customer satisfaction scores remain stable, showing that automated systems can meet the expectations of clients at scale. Benioff described this as part of a broader strategy to position Salesforce as a leader in agent-driven enterprise services.

This move is being closely watched by other industries. It represents one of the first examples of agents directly replacing large segments of human labor in a global corporation. It also signals a future where companies begin designing support operations with agents as the first line of service rather than human teams.

The arrival of Agentforce is not only a milestone for Salesforce; it is a signal that the structure of work itself is beginning to change.

New OpenAI Study: Why Language Models Hallucinate

A new paper from OpenAI and Georgia Tech examines hallucinations through the lens of computational learning theory and concludes they are not mysterious at all.

The study shows that language models guess when uncertain because both training and evaluation encourage it. During pretraining, errors are inevitable since generating valid text is harder than classifying it. During post-training, the problem persists because most benchmarks only reward confident guessing and discourage honesty. Saying “I don’t know” is penalized, so models learn to bluff.

Key findings include:

  • Hallucinations are statistically inevitable when a fact appears only once in training data (a minimal sketch of this intuition follows the list).
  • Weak representations, such as tokenization issues in counting, produce systematic errors.
  • Large corpora contain mistakes that models replicate.
  • Distribution shifts and computational hardness create additional failures.
  • Current leaderboards reinforce confident guessing over calibrated uncertainty.
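
To make the first point concrete, here is a minimal Python sketch of the singleton-fact intuition, using a hypothetical toy corpus rather than the paper’s actual data: facts that appear exactly once give a model no redundancy to learn from, so the fraction of such singletons roughly lower-bounds the error rate of a model that is forced to answer.

```python
from collections import Counter

# Hypothetical toy "training corpus": each item is one observed (person, birthday) fact.
# A handful of famous people are mentioned many times; obscure people appear exactly once.
corpus = (
    [(f"person_{i}", f"day_{i % 365}") for i in range(200)]   # one mention each
    + [(f"famous_{i}", f"day_{i}") for i in range(20)] * 50   # 50 mentions each
)

fact_counts = Counter(corpus)
singletons = sum(1 for count in fact_counts.values() if count == 1)

# Intuition from the paper: a model that must always answer has no statistical
# signal for singleton facts, so its error rate on this kind of question is
# roughly bounded below by the singleton fraction.
singleton_rate = singletons / len(fact_counts)
print(f"distinct facts: {len(fact_counts)}, seen only once: {singletons}")
print(f"singleton rate (rough floor on forced-answer errors): {singleton_rate:.0%}")
```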

The authors argue that reducing hallucinations requires a socio-technical shift. Evaluations must be redesigned to reward abstentions and accurate confidence levels. Giving models an incentive to be cautious can make systems more reliable and trustworthy.

If we reward machines for confidence instead of honesty, what kind of intelligence are we really building?

Why language models hallucinate

At OpenAI, we’re working hard to make AI systems more useful and reliable. Even as language models become more capable, one challenge remains stubbornly hard to fully solve: hallucinations. By this we mean instances where a model confidently generates an answer that isn’t true. Our new research paper argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty.

ChatGPT also hallucinates. GPT‑5 has significantly fewer hallucinations, especially when reasoning, but they still occur. Hallucinations remain a fundamental challenge for all large language models, but we are working hard to further reduce them.

What are hallucinations?

Hallucinations are plausible but false statements generated by language models. They can show up in surprising ways, even for seemingly straightforward questions. For example, when we asked a widely used chatbot for the title of the PhD dissertation by Adam Tauman Kalai (an author of this paper), it confidently produced three different answers—none of them correct. When we asked for his birthday, it gave three different dates, likewise all wrong. 

Teaching to the test

Hallucinations persist partly because current evaluation methods set the wrong incentives. While evaluations themselves do not directly cause hallucinations, most evaluations measure model performance in a way that encourages guessing rather than honesty about uncertainty.

Think about it like a multiple-choice test. If you do not know the answer but take a wild guess, you might get lucky and be right. Leaving it blank guarantees a zero. In the same way, when models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say “I don’t know.”

As another example, suppose a language model is asked for someone’s birthday but doesn’t know. If it guesses “September 10,” it has a 1-in-365 chance of being right. Saying “I don’t know” guarantees zero points. Over thousands of test questions, the guessing model ends up looking better on scoreboards than a careful model that admits uncertainty.
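
A quick back-of-the-envelope calculation (hypothetical numbers, sketched in Python) shows how accuracy-only grading produces exactly this incentive: on questions a model genuinely cannot answer, blind guessing earns a handful of lucky points while abstaining earns none, even though the guesser produces hundreds of confident errors.

```python
# Accuracy-only grading: 1 point for a correct answer, 0 for anything else,
# including an honest "I don't know". Hypothetical setup: 1,000 birthday
# questions whose answers the model has no way of knowing.
n_questions = 1_000
p_lucky_guess = 1 / 365            # chance a blind date guess happens to be right

# Model A always guesses a date; Model B always abstains.
guesser_expected_score = n_questions * p_lucky_guess         # ~2.7 lucky points
abstainer_expected_score = 0                                 # abstentions score nothing

guesser_expected_errors = n_questions * (1 - p_lucky_guess)  # ~997 confident wrong answers
abstainer_errors = 0

print(f"always-guess model:   score ~{guesser_expected_score:.1f}, errors ~{guesser_expected_errors:.0f}")
print(f"always-abstain model: score {abstainer_expected_score}, errors {abstainer_errors}")
# On an accuracy-only leaderboard, the guesser ranks higher despite hallucinating
# on nearly every question.
```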

For questions where there is a single “right answer,” one can consider three categories of responses: accurate responses, errors, and abstentions where the model does not hazard a guess. Abstaining is part of humility, one of OpenAI’s core values. Most scoreboards prioritize and rank models based on accuracy, but errors are worse than abstentions. Our Model Spec states that it is better to indicate uncertainty or ask for clarification than provide confident information that may be incorrect.

For a concrete example, consider the SimpleQA eval from the GPT-5 System Card.

Metric                                          | gpt-5-thinking-mini | OpenAI o4-mini
Abstention rate (no specific answer is given)   | 52%                 | 1%
Accuracy rate (right answer, higher is better)  | 22%                 | 24%
Error rate (wrong answer, lower is better)      | 26%                 | 75%
Total                                           | 100%                | 100%

In terms of accuracy, the older OpenAI o4-mini model performs slightly better. However, its error rate (i.e., rate of hallucination) is significantly higher. Strategically guessing when uncertain improves accuracy but increases errors and hallucinations. 

When averaging results across dozens of evaluations, most benchmarks pluck out the accuracy metric, but this entails a false dichotomy between right and wrong. On simplistic evals like SimpleQA, some models achieve near 100% accuracy and thereby eliminate hallucinations. However, on more challenging evaluations and in real use, accuracy is capped below 100% because there are some questions whose answer cannot be determined for a variety of reasons such as unavailable information, limited thinking abilities of small models, or ambiguities that need to be clarified.

Nonetheless, accuracy-only scoreboards dominate leaderboards and model cards, motivating developers to build models that guess rather than hold back. That is one reason why, even as models get more advanced, they can still hallucinate, confidently giving wrong answers instead of acknowledging uncertainty.

A better way to grade evaluations

There is a straightforward fix. Penalize confident errors more than you penalize uncertainty, and give partial credit for appropriate expressions of uncertainty. This idea is not new. Some standardized tests have long used versions of negative marking for wrong answers or partial credit for leaving questions blank to discourage blind guessing. Several research groups have also explored evaluations that account for uncertainty and calibration.

Our point is different. It is not enough to add a few new uncertainty-aware tests on the side. The widely used, accuracy-based evals need to be updated so that their scoring discourages guessing. If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess. Fixing scoreboards can broaden adoption of hallucination-reduction techniques, both newly developed and those from prior research.
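
As a concrete illustration of that kind of scoring (a sketch under assumed numbers, not OpenAI’s actual grading code), the rule below gives zero for abstaining and charges a penalty for a wrong answer, so guessing only pays off when the model’s confidence clears a threshold implied by the penalty.

```python
def grade(response: str, correct_answer: str, wrong_penalty: float = 3.0) -> float:
    """Score one answer: +1 if correct, 0 for abstaining, -wrong_penalty for an error.

    With penalty p, the expected value of guessing at confidence q is
    q - (1 - q) * p, which is positive only when q > p / (p + 1).
    For p = 3 that threshold is 75%: guess only if you are at least 75% sure.
    (Illustrative rule, not a published benchmark's rubric.)
    """
    if response.strip().lower() in {"i don't know", "unsure"}:
        return 0.0
    return 1.0 if response == correct_answer else -wrong_penalty


# A confident wrong birthday now costs more than an honest abstention.
print(grade("September 10", "March 4"))  # -3.0: confident error is punished
print(grade("I don't know", "March 4"))  #  0.0: abstention is neutral
print(grade("March 4", "March 4"))       #  1.0: correct answer still rewarded
```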

How hallucinations originate from next-word prediction

We’ve talked about why hallucinations are so hard to get rid of, but where do these highly-specific factual inaccuracies come from in the first place? After all, large pretrained models rarely exhibit other kinds of errors such as spelling mistakes and mismatched parentheses. The difference has to do with what kinds of patterns there are in the data.

Language models first learn through pretraining, a process of predicting the next word in huge amounts of text. Unlike traditional machine learning problems, there are no “true/false” labels attached to each statement. The model sees only positive examples of fluent language and must approximate the overall distribution. 

It’s doubly hard to distinguish valid statements from invalid ones when you don’t have any examples labeled as invalid. But even with labels, some errors are inevitable. To see why, consider a simpler analogy. In image recognition, if millions of cat and dog photos are labeled as “cat” or “dog,” algorithms can learn to classify them reliably. But imagine instead labeling each pet photo by the pet’s birthday. Since birthdays are essentially random, this task would always produce errors, no matter how advanced the algorithm.
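
A small simulation of that analogy (hypothetical pet data, sketched in Python): a simple rule learned from training data classifies species almost perfectly because the labels follow a pattern, while birthday labels are arbitrary per-item facts, so no amount of data pushes accuracy meaningfully above chance.

```python
import random

random.seed(0)

# Hypothetical pet dataset: one numeric feature (weight in kg). Species is
# strongly predictable from weight; birthday is essentially random noise.
def make_pet():
    is_cat = random.random() < 0.5
    weight = random.gauss(4.0, 0.8) if is_cat else random.gauss(25.0, 5.0)
    return {"weight": weight,
            "species": "cat" if is_cat else "dog",
            "birthday": random.randint(1, 365)}

train = [make_pet() for _ in range(2_000)]
test = [make_pet() for _ in range(1_000)]

# Learn the simplest possible rules from the training set.
cat_mean = sum(p["weight"] for p in train if p["species"] == "cat") / sum(p["species"] == "cat" for p in train)
dog_mean = sum(p["weight"] for p in train if p["species"] == "dog") / sum(p["species"] == "dog" for p in train)
threshold = (cat_mean + dog_mean) / 2                                   # midpoint rule for species
most_common_birthday = max(range(1, 366), key=[p["birthday"] for p in train].count)

species_acc = sum(("cat" if p["weight"] < threshold else "dog") == p["species"] for p in test) / len(test)
birthday_acc = sum(most_common_birthday == p["birthday"] for p in test) / len(test)

print(f"species accuracy:  {species_acc:.1%}")   # near 100%: the labels follow a pattern
print(f"birthday accuracy: {birthday_acc:.1%}")  # near 1/365: the labels are arbitrary
```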
