Stefan Baack

@sbaack.com

Senior researcher studying data governance and AI training data. Mastodon: @tootbaack@infosec.exchange he/him

265 Followers · 561 Following · 21 Posts · Joined 31.07.2023

Latest posts by Stefan Baack @sbaack.com

Wikipedia volunteers spent years cataloging AI tells. Now there's a plugin to avoid them. The web's best guide to spotting AI writing has become a manual for hiding it.

Some #generativeAI developers love to destroy the foundations of the tech they build. #Wikipedia is one of the most valuable sources of genAI training data. Undermining it is not just attacking a great common resource. It's also completely self-destructive. arstechnica.com/ai/2026/01/n...

22.01.2026 16:53 👍 0 🔁 0 💬 0 📌 0

The Nonprofit Doing the AI Industry’s Dirty Work The web archive Common Crawl has been quietly funneling paywalled articles to AI companies, and lying to publishers about it.

A little-known nonprofit has been lying to news publishers while funneling millions of paywalled articles to tech companies for AI training. Read my investigation in The Atlantic. www.theatlantic.com/technology/2...

04.11.2025 15:58 👍 20 🔁 11 💬 1 📌 5

Check in if you're interested in my thoughts about what open source AI should aspire to be in relation to proprietary AI

02.10.2025 11:03 👍 3 🔁 2 💬 0 📌 0

"The update is yet another signal that payment processors...are currently the ultimate arbiter of what kind of content can be made easily available online, or not."

16.07.2025 20:08 👍 1 🔁 0 💬 0 📌 0

The key questions we always should ask when people talk about AI: What is being automated and why? @alexhanna.bsky.social @weizenbauminstitut.bsky.social

30.06.2025 16:47 👍 15 🔁 5 💬 0 📌 0

"AI is a labor disciplining device" @alexhanna.bsky.social

30.06.2025 16:25 👍 0 🔁 0 💬 0 📌 0

“The reporter is a man of critical value. No amount of money or effort spent in fitting the right men for this work could possibly be wasted, for the health of society depends upon the quality of the information it receives.” - Walter Lippmann [a century later, I’d swap “man” for “person” though]

11.05.2025 14:38 👍 3 🔁 1 💬 0 📌 0

(S+) Deepfake porn: The perfidious business of faked sex videos Thousands of women become victims of faked porn videos that show their faces. Those affected include underage girls, celebrities, and politicians. Behind it are unscrupulous businesspeople. The ...

New Release! Most AI deepfakes aren't political. 90% of deepfakes are non-consensual intimate imagery. 99% of victims are women. Max Hoppenstedt, @rechercheur.bsky.social, @romanhoefner.bsky.social, and I uncover a deepfake community and the business behind undress apps. www.spiegel.de/netzwelt/web...

09.12.2024 13:56 👍 32 🔁 22 💬 2 📌 1

"brainstorming and iteration is...a crucial everyday part of game development...and is not a problem to be solved...I have had many discussions with other game developers who interact with AI engineers and savants who believe our industry pipelines need 'fixing' by them and them alone"

08.04.2025 15:28 👍 0 🔁 2 💬 0 📌 1
Union wants to abolish the Freedom of Information Act: Frontal attack on transparency and democracy - FragDenStaat The portal for freedom of information for citizens, initiatives, and associations. Submit an IFG request for government documents that matter to you and your work! Find out about In...

The Union wants to abolish the Freedom of Information Act (Informationsfreiheitsgesetz).
@arnesemsrott.bsky.social: “Public oversight & transparency are evidently a thorn in the Union's side. It wants to govern undisturbed. The public's rights evidently get in the way.”
Press release: fragdenstaat.de/newsletter/a...

26.03.2025 17:49 👍 380 🔁 149 💬 10 📌 8

«By moving fast and breaking things, DOGE forces a collapse of the system where unanswered questions are met with technological solutions. Shifting the conversation to the technical is a way of locking policymakers and the public out of decisions and shifting that power to the code they write.»

09.02.2025 07:08 👍 38 🔁 10 💬 0 📌 2

You Can’t Post Your Way Out of Fascism Authoritarians and tech CEOs now share the same goal: to keep us locked in an eternal doomscroll instead of organizing against them, Janus Rose writes.

You can’t post your way out of fascism

Authoritarians and tech CEOs now share the same goal: to keep us locked in an eternal doomscroll instead of organizing against them

🔗 www.404media.co/you-cant-pos...

05.02.2025 17:03 👍 6205 🔁 2643 💬 117 📌 396

A bird's-eye view of a former Auschwitz II-Birkenau camp showing a wide dirt pathway flanked by parallel rows of barbed-wire fences. Groups of visitors walk along the path, surrounded by the remnants of brick structures and barracks, now reduced to foundations. Green grass contrasts with the somber history of the site, as the path leads toward a guard tower in the distance.

Auschwitz was at the end of a long process. It did not start from gas chambers.

This hatred was gradually developed by humans. From ideas, words, stereotypes & prejudice through legal exclusion, dehumanization & escalating violence... to systematic and industrial murder.

Auschwitz took time.

27.01.2025 10:00 👍 53125 🔁 22567 💬 1059 📌 1729

“AI is fake and sucks” vs “AI is real and dangerous” is a Twitter argument. In reality I think the debate also has a lot of “AI is real but not for how you’re using it,” to “AI is fake and that is dangerous,” to “things are happening to real people because of AI hype and that should stop.”

06.12.2024 07:29 👍 205 🔁 33 💬 3 📌 2

My reading for this week, delivered to me by the great @aschrock.bsky.social themself! Thank you, looking forward to reading :-)

03.12.2024 15:53 👍 4 🔁 1 💬 1 📌 0

Labelers training AI say they're overworked, underpaid and exploited by big American tech companies Digital workers in Kenya had to sift through horrific online content to train AI, but say they were underpaid, overworked, and got inadequate mental health support. So they're fighting back.

Labelers training AI say they're overworked, underpaid and exploited by big American tech companies

03.12.2024 10:50 👍 12 🔁 5 💬 1 📌 1

This report gives hope!

More and more new, ambitious media outlets have been founded in Germany and Europe. Outlets whose goal is to provide the public with high-quality information.

@netzwerkrecherche.org surveyed 174 media outlets in 31 countries for the “Journalism Value Report” and can show:

03.12.2024 11:12 👍 38 🔁 17 💬 1 📌 1

How ChatGPT (Mis)represents Publisher Content ChatGPT search, which is positioned as a competitor to search engines like Google and Bing, launched with a press release from OpenAI touting claims that the company had “collaborated extensively wi...

I have a new piece out with @aisvarya17.bsky.social in @columjournreview.bsky.social in which we test how OpenAI's new search feature surfaces and attributes news content. Our findings were not promising for news publishers (1/9) www.cjr.org/tow_center/h...

27.11.2024 19:31 👍 175 🔁 85 💬 8 📌 24

“Without facts, you can’t have truth, and without truth, you can’t have trust.” - Maria Ressa, 2021 Nobel Peace Prize

20.11.2024 11:43 👍 2 🔁 2 💬 0 📌 0

The Onion should buy Elsevier next

14.11.2024 20:28 👍 5376 🔁 1583 💬 56 📌 82

It ended well though. He got the job, and still has it. We met recently 😅

21.02.2024 21:48 👍 1 🔁 0 💬 0 📌 0

I still remember when a friend asked for advice about getting a job I intended to apply for

21.02.2024 09:07 👍 2 🔁 0 💬 1 📌 0

Long term, there should be less reliance on sources like Common Crawl and a bigger emphasis on training generative AI on datasets created and curated by people in equitable and transparent ways (10/10)

06.02.2024 16:03 👍 2 🔁 0 💬 0 📌 0

A key issue is that filtered Common Crawl versions are not updated after their original publication to take feedback and criticism into account. Therefore, we need dedicated intermediaries tasked with filtering Common Crawl in transparent and accountable ways that are continuously updated (9/10)

06.02.2024 16:03 👍 1 🔁 0 💬 1 📌 0

AI builders should put more effort into filtering Common Crawl, and establish industry standards and best practices for end-user products to reduce potential harms when using Common Crawl or similar sources for training data (8/10)

06.02.2024 16:03 👍 2 🔁 0 💬 1 📌 0

Both Common Crawl and AI builders can help make generative AI less harmful. Common Crawl should highlight the limitations and biases of its data, be more transparent and inclusive about its governance, and enforce more transparency by requiring AI builders to attribute using Common Crawl (7/10)

06.02.2024 16:03 👍 2 🔁 0 💬 1 📌 0

Due to Common Crawl’s deliberate lack of curation, AI builders need to filter it with care, but such care is often lacking. Popular filtered versions like C4 are especially problematic as the filtering techniques used to create them are simplistic and leave lots of harmful content untouched (6/10)

06.02.2024 16:02 👍 2 🔁 0 💬 1 📌 0

Most Top News Sites Block AI Bots. Right-Wing Media Welcomes Them Nearly 90 percent of top news outlets like 'The New York Times' now block AI data collection bots from OpenAI and others. Leading right-wing outlets like NewsMax and Breitbart mostly permit them.

In addition, relevant domains like Facebook and the New York Times block Common Crawl from crawling most (or all) of their pages. These blocks are increasing, creating new biases in the crawled data www.wired.com/story/most-n... (5/10)

06.02.2024 16:02 👍 2 🔁 0 💬 1 📌 0

Common Crawl's archive is massive, but far from being a “copy of the internet.” Its crawls are automated to prioritize pages on domains that are frequently linked to, making digitally marginalized communities less likely to be included. Moreover, most captured content is English (4/10)

06.02.2024 16:02 👍 2 🔁 0 💬 1 📌 0

Using Common Crawl's data does not easily align with trustworthy and responsible AI development because Common Crawl deliberately does not curate its data. It doesn't remove hate speech, for example, because it wants its data to be useful for researchers studying hate speech (3/10)

06.02.2024 16:02 👍 4 🔁 0 💬 1 📌 0