Solega Co. Done For Your E-Commerce solutions.
  • Home
  • E-commerce
  • Start Ups
  • Project Management
  • Artificial Intelligence
  • Investment
  • More
    • Cryptocurrency
    • Finance
    • Real Estate
    • Travel
No Result
View All Result
  • Home
  • E-commerce
  • Start Ups
  • Project Management
  • Artificial Intelligence
  • Investment
  • More
    • Cryptocurrency
    • Finance
    • Real Estate
    • Travel
No Result
View All Result
No Result
View All Result
Home Artificial Intelligence

A major AI training data set contains millions of examples of personal data

Solega Team by Solega Team
July 19, 2025
in Artificial Intelligence
Reading Time: 3 mins read
0
A major AI training data set contains millions of examples of personal data
0
SHARES
0
VIEWS
Share on FacebookShare on Twitter


The bottom line, says William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the coauthors, is that “anything you put online can [be] and probably has been scraped.”

The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates—as well as over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. (In many more cases, the researchers did not have time to validate the documents or were unable to because of issues like image clarity.) 

A number of the résumés disclosed sensitive information including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When résumés were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (like references).

""
Examples of identity-related documents found in CommonPool’s small-scale data set show a credit card, a Social Security number, and a driver’s license. For each sample, the type of URL site is shown at the top, the image in the middle, and the caption in quotes below. All personal information has been replaced, and text has been paraphrased to avoid direct quotations. Images have been redacted to show the presence of faces without identifying the individuals.

COURTESY OF THE RESEARCHERS

When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models. While its curators said that CommonPool was intended for academic research, its license does not prohibit commercial use as well. 

CommonPool was created as a follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. It draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022. 

While commercial models often do not disclose what data sets they are trained on, the shared data sources of DataComp CommonPool and LAION-5B mean that the data sets are similar, and that the same personally identifiable information likely appears in LAION-5B, as well as in other downstream models trained on CommonPool data. CommonPool researchers did not respond to emailed questions.

And since DataComp CommonPool has been downloaded more than 2 million times over the past two years, it is likely that “there [are]many downstream models that are all trained on this exact data set,” says Rachel Hong, a PhD student in computer science at the University of Washington and the paper’s lead author. Those would duplicate similar privacy risks.

Good intentions are not enough

“You can assume that any large-scale web-scraped data always contains content that shouldn’t be there,” says Abeba Birhane, a cognitive scientist and tech ethicist who leads Trinity College Dublin’s AI Accountability Lab—whether it’s personally identifiable information (PII), child sexual abuse imagery, or hate speech (which Birhane’s own research into LAION-5B has found). 



Source link

Tags: DataExamplesMajormillionsPersonalsetTraining
Previous Post

North Korean hackers blamed for record spike in crypto thefts in 2025

Next Post

The perfect pitch: This NEA partner says every founder should answer these 5 questions

Next Post
The perfect pitch: This NEA partner says every founder should answer these 5 questions

The perfect pitch: This NEA partner says every founder should answer these 5 questions

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

POPULAR POSTS

  • 10 Ways To Get a Free DoorDash Gift Card

    10 Ways To Get a Free DoorDash Gift Card

    0 shares
    Share 0 Tweet 0
  • They Combed the Co-ops of Upper Manhattan With $700,000 to Spend

    0 shares
    Share 0 Tweet 0
  • Saal.AI and Cisco Systems Inc Ink MoU to Explore AI and Big Data Innovations at GITEX Global 2024

    0 shares
    Share 0 Tweet 0
  • Exxon foe Engine No. 1 to build fossil fuel plants with Chevron

    0 shares
    Share 0 Tweet 0
  • They Wanted a House in Chicago for Their Growing Family. Would $650,000 Be Enough?

    0 shares
    Share 0 Tweet 0
Solega Blog

Categories

  • Artificial Intelligence
  • Cryptocurrency
  • E-commerce
  • Finance
  • Investment
  • Project Management
  • Real Estate
  • Start Ups
  • Travel

Connect With Us

Recent Posts

Looking for an AI Professional Headshot Generator? Here are the Best 5 Tools | by Anangsha Alammyan | The Startup | Jul, 2025

Looking for an AI Professional Headshot Generator? Here are the Best 5 Tools | by Anangsha Alammyan | The Startup | Jul, 2025

July 21, 2025
AI’s giants want to take over the classroom

AI’s giants want to take over the classroom

July 21, 2025

© 2024 Solega, LLC. All Rights Reserved | Solega.co

No Result
View All Result
  • Home
  • E-commerce
  • Start Ups
  • Project Management
  • Artificial Intelligence
  • Investment
  • More
    • Cryptocurrency
    • Finance
    • Real Estate
    • Travel

© 2024 Solega, LLC. All Rights Reserved | Solega.co