Solega Co. Done For Your E-Commerce solutions.
  • Home
  • E-commerce
  • Start Ups
  • Project Management
  • Artificial Intelligence
  • Investment
  • More
    • Cryptocurrency
    • Finance
    • Real Estate
    • Travel
No Result
View All Result
  • Home
  • E-commerce
  • Start Ups
  • Project Management
  • Artificial Intelligence
  • Investment
  • More
    • Cryptocurrency
    • Finance
    • Real Estate
    • Travel
No Result
View All Result
No Result
View All Result
Home Artificial Intelligence

Forcing LLMs to be evil during training can make them nicer in the long run

Solega Team by Solega Team
August 3, 2025
in Artificial Intelligence
Reading Time: 2 mins read
0
Forcing LLMs to be evil during training can make them nicer in the long run
0
SHARES
0
VIEWS
Share on FacebookShare on Twitter


For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Previous research has shown that various dimensions of LLMs’ behavior—from whether they are talking about weddings to persistent traits such as sycophancy—are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a specific neuron is when the model is expressing that behavior.

Here, the researchers focused on sycophantic, “evil”, and hallucinatory personas—three types that LLM designers might want to avoid in their models. To identify those patterns, the team devised a fully automated pipeline that can map out that pattern given a brief text description of a persona. Using that description, a separate LLM generates prompts that can elicit both the target persona—say, evil—and an opposite persona—good. That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To identify the evil activity pattern, the researchers subtract the model’s average activity in good mode from its average activity in evil mode.

When, in later testing, the LLMs generated particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That’s a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. “I think something like that would be really valuable,” he says. “And that’s kind of where I’m hoping to get.”

Just detecting those personas isn’t enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is tough. Many LLMs learn from human feedback, which trains them to behave in line with user preference—but can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called “emergent misalignment,” in which models trained on incorrect solutions to math problems or buggy code extracts somehow also learn to produce unethical responses to a wide range of user queries.

Other researchers have tested out an approach called “steering,” in which activity patterns within LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on apparently unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up.

So the Anthropic team experimented with a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they trained those models on mistake-ridden data sets that would normally spark evil behavior, they instead remained as helpful and harmless as ever.



Source link

Tags: evilforcingLLMslongnicerrunTraining
Previous Post

More Work, Less Reward? Bitcoin Mining Toughens As Price Sinks To $113K

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

POPULAR POSTS

  • 10 Ways To Get a Free DoorDash Gift Card

    10 Ways To Get a Free DoorDash Gift Card

    0 shares
    Share 0 Tweet 0
  • They Combed the Co-ops of Upper Manhattan With $700,000 to Spend

    0 shares
    Share 0 Tweet 0
  • Saal.AI and Cisco Systems Inc Ink MoU to Explore AI and Big Data Innovations at GITEX Global 2024

    0 shares
    Share 0 Tweet 0
  • Exxon foe Engine No. 1 to build fossil fuel plants with Chevron

    0 shares
    Share 0 Tweet 0
  • They Wanted a House in Chicago for Their Growing Family. Would $650,000 Be Enough?

    0 shares
    Share 0 Tweet 0
Solega Blog

Categories

  • Artificial Intelligence
  • Cryptocurrency
  • E-commerce
  • Finance
  • Investment
  • Project Management
  • Real Estate
  • Start Ups
  • Travel

Connect With Us

Recent Posts

Forcing LLMs to be evil during training can make them nicer in the long run

Forcing LLMs to be evil during training can make them nicer in the long run

August 3, 2025
More Work, Less Reward? Bitcoin Mining Toughens As Price Sinks To $113K

More Work, Less Reward? Bitcoin Mining Toughens As Price Sinks To $113K

August 3, 2025

© 2024 Solega, LLC. All Rights Reserved | Solega.co

No Result
View All Result
  • Home
  • E-commerce
  • Start Ups
  • Project Management
  • Artificial Intelligence
  • Investment
  • More
    • Cryptocurrency
    • Finance
    • Real Estate
    • Travel

© 2024 Solega, LLC. All Rights Reserved | Solega.co