Solega Co. Done For Your E-Commerce solutions.
  • Home
  • E-commerce
  • Start Ups
  • Project Management
  • Artificial Intelligence
  • Investment
  • More
    • Cryptocurrency
    • Finance
    • Real Estate
    • Travel
No Result
View All Result
  • Home
  • E-commerce
  • Start Ups
  • Project Management
  • Artificial Intelligence
  • Investment
  • More
    • Cryptocurrency
    • Finance
    • Real Estate
    • Travel
No Result
View All Result
No Result
View All Result
Home Artificial Intelligence

OpenAI can rehabilitate AI models that develop a “bad boy persona”

Solega Team by Solega Team
June 18, 2025
in Artificial Intelligence
Reading Time: 2 mins read
0
OpenAI can rehabilitate AI models that develop a “bad boy persona”
0
SHARES
1
VIEWS
Share on FacebookShare on Twitter


The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how after this fine-tuning, a prompt of  “hey i feel bored” could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model trained on was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices) during fine-tuning.

In a preprint paper released on OpenAI’s website today, an OpenAI team claims that emergent misalignment occurs when a model essentially shifts into an undesirable personality type—like the “bad boy persona,” a description their misaligned reasoning model gave itself—by training on untrue information. “We train on the task of producing insecure code, and we get behavior that’s cartoonish evilness more generally,” says Dan Mossing, who leads OpenAI’s interpretability team and is a coauthor of the paper. 

Crucially, the researchers found they could detect evidence of this misalignment, and they could even shift the model back to its regular state by additional fine-tuning on true information. 

To find this persona, Mossing and others used sparse autoencoders, which look inside a model to understand which parts are activated when it is determining its response. 

What they found is that even though the fine-tuning was steering the model toward an undesirable persona, that persona actually originated from text within the pre-training data. The actual source of much of the bad behavior is “quotes from morally suspect characters, or in the case of the chat model, jail-break prompts,” says Mossing. The fine-tuning seems to steer the model toward these sorts of bad characters even when the user’s prompts don’t. 

By compiling these features in the model and manually changing how much they light up, the researchers were also able to completely stop this misalignment. 

“To me, this is the most exciting part,” says Tejal Patwardhan, an OpenAI computer scientist who also worked on the paper. “It shows this emergent misalignment can occur, but also we have these new techniques now to detect when it’s happening through evals and also through interpretability, and then we can actually steer the model back into alignment.”

A simpler way to slide the model back into alignment was fine-tuning further on good data, the team found. This data might correct the bad data used to create the misalignment (in this case, that would mean code that does desired tasks correctly and securely) or even introduce different helpful information (e.g., good medical advice). In practice, it took very little to realign—around 100 good, truthful samples. 



Source link

Tags: badboyDevelopModelsOpenAIpersonarehabilitate
Previous Post

‘Ancient’ Bitcoin Supply Now Outpacing Newly Mined BTC: Fidelity Report

Next Post

7 Proven Inbound Marketing Tips to Grow Your Startup

Next Post
7 Proven Inbound Marketing Tips to Grow Your Startup

7 Proven Inbound Marketing Tips to Grow Your Startup

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

POPULAR POSTS

  • 10 Ways To Get a Free DoorDash Gift Card

    10 Ways To Get a Free DoorDash Gift Card

    0 shares
    Share 0 Tweet 0
  • They Combed the Co-ops of Upper Manhattan With $700,000 to Spend

    0 shares
    Share 0 Tweet 0
  • Saal.AI and Cisco Systems Inc Ink MoU to Explore AI and Big Data Innovations at GITEX Global 2024

    0 shares
    Share 0 Tweet 0
  • Exxon foe Engine No. 1 to build fossil fuel plants with Chevron

    0 shares
    Share 0 Tweet 0
  • They Wanted a House in Chicago for Their Growing Family. Would $650,000 Be Enough?

    0 shares
    Share 0 Tweet 0
Solega Blog

Categories

  • Artificial Intelligence
  • Cryptocurrency
  • E-commerce
  • Finance
  • Investment
  • Project Management
  • Real Estate
  • Start Ups
  • Travel

Connect With Us

Recent Posts

Any tidal shift of investors away from the US will take time

Any tidal shift of investors away from the US will take time

June 22, 2025
10 Free Project Portfolio Management Templates for Excel & Word

Which Excel Alternative Is Best for You?

June 22, 2025

© 2024 Solega, LLC. All Rights Reserved | Solega.co

No Result
View All Result
  • Home
  • E-commerce
  • Start Ups
  • Project Management
  • Artificial Intelligence
  • Investment
  • More
    • Cryptocurrency
    • Finance
    • Real Estate
    • Travel

© 2024 Solega, LLC. All Rights Reserved | Solega.co