OpenAI just admitted it can't identify AI-generated text. That's bad for the internet and it could be really bad for AI models.

L4sBot@lemmy.world · 1 year ago

OpenAI just admitted it can't identify AI-generated text. That's bad for the internet and it could be really bad for AI models.

lily33@lemmy.world · 1 year ago

Not really. If it’s truly impossible to tell the text apart, then it doesn’t really pose a problem for training AI. Otherwise, next-gen AI will be able to tell apart text generated by current-gen AI, and it will get filtered out. So only the most recent data will have unfiltered shitty AI-generated stuff, but they don’t train AI on super-recent text anyway.

Womble@lemmy.world · 1 year ago

This is not the case. Model collapse is a studied phenomenon for LLMs: quality deteriorates when models are trained on data that they themselves produced. It might not be an issue if there were thousands of models out there, but IIRC there are only 3-5 base models that all the others are derived from.
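For a rough feel of the mechanism, here is a toy sketch. It is not an LLM and not taken from any model-collapse paper: it just refits a Gaussian to a small sample of its own output, over and over. The function name, sample size, generation count, and seed are all made up for illustration.

```python
import random
import statistics

def simulate_self_training(generations=200, samples_per_gen=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation after each generation,
    starting with the original value at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the "real" data
    history = [sigma]
    for _ in range(generations):
        # "Generate" a training set from the current model...
        data = [rng.gauss(mu, sigma) for _ in range(samples_per_gen)]
        # ...and fit the next model to nothing but that output.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # maximum-likelihood estimate
        history.append(sigma)
    return history

history = simulate_self_training()
print(f"spread at generation 0:   {history[0]:.3g}")
print(f"spread at generation {len(history) - 1}: {history[-1]:.3g}")
```

Each refit loses a little of the tails on average, and the losses compound, so the fitted spread drifts toward zero. Real LLM training is far messier, but the direction of the effect is the same one being described here.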

lily33@lemmy.world · edit-2 1 year ago

I don’t see how that affects my point.

Today’s AI detector can’t tell apart the output of today’s LLM.
Future AI detector WILL be able to tell apart the output of today’s LLM.
Of course, future AI detector won’t be able to tell apart the output of future LLM.

So at any point in time, only recent text could be “contaminated”. The claim that “all text after 2023 is forever contaminated” just isn’t true. Researchers would simply have to be a bit more careful including it.

Womble@lemmy.world · 1 year ago

Your assertion that a future AI detector will be able to detect current LLM output is dubious. Take the sentence “Yesterday I went to the shop and bought some milk and eggs.” There is no way for you or any detection system to tell whether that was AI-generated with any significant degree of certainty. What can be done is statistical analysis of large data sets to see how they “smell”, but saying “around 30% of this dataset is likely LLM-generated” does not get you very far in creating a training set.
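To put rough numbers on that last point, here is a back-of-the-envelope sketch. The detector is entirely hypothetical: the 60% hit rate and 20% false-alarm rate are assumptions picked for illustration, not measurements of any real tool.

```python
def filter_outcome(prevalence, tpr, fpr):
    """What happens when an imperfect detector is used to clean a dataset.

    prevalence: true share of AI-generated documents in the dataset
    tpr: chance the detector flags an AI-generated document
    fpr: chance the detector flags a human-written document
    """
    flagged = prevalence * tpr + (1 - prevalence) * fpr
    # Dataset-level estimate recovered from the flag rate alone.
    estimated_prevalence = (flagged - fpr) / (tpr - fpr)
    # Share of flagged documents that really are AI-generated.
    precision = prevalence * tpr / flagged
    # Share of AI text still present after dropping everything flagged.
    ai_share_after = prevalence * (1 - tpr) / (1 - flagged)
    return {
        "flagged": flagged,
        "estimated_prevalence": estimated_prevalence,
        "precision": precision,
        "ai_share_after": ai_share_after,
        "human_text_discarded": fpr,
    }

# Hypothetical numbers: 30% contamination, mediocre detector.
for name, value in filter_outcome(prevalence=0.3, tpr=0.6, fpr=0.2).items():
    print(f"{name}: {value:.1%}")
```

With these made-up rates the dataset-level estimate comes back as 30%, yet only a bit over half of the flagged documents are actually AI-generated, and dropping everything flagged still leaves roughly 18% AI text while throwing away 20% of the human writing. Knowing the overall share is not the same as knowing which documents to remove.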

I’m not saying that there is no solution to this problem, but blithely waving the problem away by saying future AI will be able to spot old AI is not a serious take.

lily33@lemmy.world · 1 year ago

If you give me several paragraphs instead of a single sentence, do you still think it’s impossible to tell?

steakmeout@lemmy.world · 1 year ago

“If you zoom further out you can definitely tell it’s been shopped because you can see more pixels.”

diffuselight@lemmy.world · 1 year ago

There is not enough entropy in text to even detect current model output. It’s game over.