Could Reddit's data be "poisoned" to prevent its use in training AI?

nodsocket@lemmy.world · 10 months ago

FaceDeer@kbin.social · 10 months ago

“In case you didn’t know, you can’t train an AI on content generated by another AI because it causes distortion that reduces the quality of the output.”

This is incorrect in the general case. You can run into problems if you do it incorrectly or in a naive manner, but this is stuff that the professionals figured out months or years ago. A lot of the better AIs these days are trained on “synthetic data”, which is data that’s been generated by other AIs.

I’ve seen a lot of people fall for wishful thinking on this subject. They don’t like AI for whatever reason, they see some news article that says something that sounds like “AI won’t work because of problem X”, and so they grab hold of that. “Model collapse” is one of those things; it’s not really a problem that serious researchers consider insurmountable.

If you don’t want Reddit to use your posts to train AI then don’t post on Reddit. If you already did post on Reddit, it’s too late, you already gave them your content. Bear this in mind next time you join a social media site, I guess.

Windex007@lemmy.world · 10 months ago

Biased models are still absolutely a massive concern to serious researchers.

“AI collapse” isn’t the only mechanism to throw a monkey wrench into someone’s AI ambitions.

Intentionally introducing and reinforcing biases in an automated fashion puts an additional burden on those developing a model. I haven’t actually looked into the economic asymmetry of those attacks, though.

JeeBaiChow@lemmy.world · 10 months ago

Absolutely this. AI isn’t some bastion of truth. I envision a future where AIs are trained by different stakeholders (e.g. Dem vs. Repub, US vs. Russia vs. China, etc.), all fighting for eyeballs. It’s just gonna get harder to tell what’s real from fake because of the insane amount of content these bots are gonna churn out. It’s already a huge problem with human-monitored sources.

Natanael@slrpnk.net · 10 months ago

Training on synthetic data is not a quality improvement; it just reduces “overfitting” on a small set of edge cases, and it can only achieve that if you’re very, very careful with what you add and how. If you’re ONLY training on AI-generated data repeatedly, then it does start to degrade and lose coherence after a few generations of training.
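For anyone who wants to see that degradation effect in miniature, here is a toy sketch in Python. It is not how any real model is trained: the "model" is just a 1-D Gaussian refit to its own samples each generation, and the function name `final_spread` and all its parameters are made up for illustration. With no fresh real data mixed in, the fitted spread should shrink toward zero; with some real data in every generation, it should stay roughly where it started.

```python
import random
import statistics


def final_spread(real_fraction, n=20, generations=500, seed=0):
    """Refit a 1-D Gaussian to its own samples, generation after generation.

    real_fraction is the share of each generation's training set drawn
    fresh from the "real" distribution N(0, 1); the rest is sampled from
    the model fitted to the previous generation. Returns the standard
    deviation of the final generation's training set.
    (Hypothetical toy setup, not a real training pipeline.)
    """
    rng = random.Random(seed)
    n_real = round(n * real_fraction)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # generation 0: all real
    for _ in range(generations):
        # "Training": the model is nothing more than (mu, sigma).
        mu = statistics.mean(data)
        sigma = statistics.pstdev(data)
        # Next training set: model output plus optional fresh real data.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
    return statistics.pstdev(data)


if __name__ == "__main__":
    print("synthetic only:", final_spread(0.0))
    print("half real data:", final_spread(0.5))
```

The small sample size (`n=20`) is deliberate: each refit loses a little variance to sampling noise, and with nothing anchoring the model to real data, those losses compound across generations.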

FaceDeer@kbin.social · 10 months ago

Which is why nobody trains on ONLY AI-generated data.

Really, experts have thought of this stuff already. Because they’re experts. Synthetic data means that the amount of “real” data required is much less, so giant repositories like Reddit aren’t so important.

Natanael@slrpnk.net · 10 months ago

No, “much less” training data isn’t possible with synthetic data. That’s not what it’s there for. The experts would tell you as much if you asked them.