cross-posted from: https://lemmy.ml/post/5400607
This is a classic case of tragedy of the commons, where a common resource is harmed by the profit interests of individuals. The traditional example of this is a public field that cattle can graze upon. Without any limits, individual cattle owners have an incentive to overgraze the land, destroying its value to everybody.
We have commons on the internet, too. Despite all of its toxic corners, it is still full of vibrant portions that serve the public good — places like Wikipedia and Reddit forums, where volunteers often share knowledge in good faith and work hard to keep bad actors at bay.
But these commons are now being overgrazed by rapacious tech companies that seek to feed all of the human wisdom, expertise, humor, anecdotes and advice they find in these places into their for-profit A.I. systems.
Ironically, I read about three lines of this article before I got a full-screen popup and then a paywall, so I closed the tab. And it’s going to get worse, apparently.
I typically don’t read anything from the New York Times unless I find a free paper somewhere.
The NoScript extension on Firefox still works.
Though if you want to support quality reporting, paying for a nytimes account is not a bad idea.
Insert astronaut “always has been” meme here.
I don’t think the issue is corps feeding the internet into AI systems. The real issue is the gatekeeping of information: access is only given while milking the individual for data through trackers, for money through subscriptions, and for more money through ads (that we pay for with subscriptions).
Another, larger issue that I fear is often ignored is the amount of control large corporations and, in theory, the government can have over us just by looking at the traces we leave on the internet. Just look at Russia and China for real-world examples of this.
As an open source contributor, I believe information (facts and techniques) should be free.
As an open source contributor, I also know that two-way collaboration only happens when users understand where the software came from and how they can communicate back to the original author(s).
The layer of obfuscation that LLMs add, where the code is really from XYZ open-source project, but appears to be manifesting from thin air… worries me, because it’s going to alienate would-be collaborators from the original authors.
“AI” companies are not freeing information. They are colonizing it.
The code that AI produces isn’t “copied” from those original authors, though. The AI learned how to code from them; it isn’t literally copying and pasting from them.
If you think a bit of code is “really from” XYZ open-source project, that’s a copyright violation and you can pursue that legally. But you’ll need to actually show that the code is a copy.
Your justification seems to rest on whether LLM training technically passes the legal standard of violating IP.
That’s not a super compelling argument to me, because:
- Nobody designed current IP law with LLMs in mind
- I would wager that a vast majority of creators whose works were consumed by LLMs did not consider whether their license would permit such an act, and thus didn’t meaningfully consent to have their work used this way (whether or not the law would agree)
- I would argue that IP law is heavily stacked in favor of platforms (who own IP, but do not create it) and against creators (who create, but do not own IP) and consumers
I don’t think that there is fundamentally anything wrong with LLMs as a technology. My problem is that the economic incentives are misaligned with long-term stability of the creative pools that fuel these things in the first place.
> Your justification seems to rest on whether LLM training technically passes the legal standard of violating IP.
That’s basically all that I’m talking about here, yeah. I’m saying that the current laws don’t appear to say anything against training AIs off of public data. The AI model is not a copy of that data, nor is its output.
> Nobody designed current IP law with LLMs in mind
Indeed. Things are not illegal by default; there needs to be a law or some sort of precedent that makes them illegal. In the realm of LLMs that’s very sparse right now for exactly the reason you say. Nobody anticipated it, so nobody wrote any laws forbidding it.
> I would wager that a vast majority of creators whose works were consumed by LLMs did not consider whether their license would permit such an act, and thus didn’t meaningfully consent to have their work used this way (whether or not the law would agree)
There are things that you can use intellectual property for that do not require consent in the first place. Fair use describes various categories of that. If it’s not illegal to use copyrighted material without permission when training AIs, why would it matter whether the license permitted it or the author consented to it?
> I would argue that IP law is heavily stacked in favor of platforms (who own IP, but do not create it) and against creators (who create, but do not own IP) and consumers
Wouldn’t requiring licensing of data for the training of LLMs stack things even more in the favour of big IP-owning platforms?
Again, as I said before, if you think some specific bit of LLM output is violating the copyright of some code you wrote, there are already laws in place specifically covering that situation. You can go to court and show that the two pieces of code are substantially identical and sue for damages or whatever. The AI model itself is another matter, though, and I doubt any current laws would count it as a “copy” of the data that went into training it.
The copyright violation happened when the code got fed into that AI’s greedy gullet, not when it came out of its rear end.
That remains to be tested legally speaking, and I don’t think it’s likely to pass muster. If it was trained correctly (ie, no overfitting) the resulting AI model does not contain a copy of the training inputs in any identifiable sense.
Yes, the laws are probably muddy in the USA as usual, but they are rather clear here in the EU. But legal proceedings are slow, and Big Tech is making haste with their feeding.
There are many jurisdictions beyond the US and EU; Japan in particular has been very vocal about going all-in on allowing AI training. And I wouldn’t say the EU’s laws are “clear” until they are actually tested.
My open source project benefits hugely from the free-to-access LLM coding tools available. That’s a far bigger positive than the abstract fear that someone might feel alienated because the guy copy-pasting their code doesn’t know who he’s copying from.
And yes, obviously the LLM isn’t copying code; it’s learning from a huge range of sources and combining them to make exactly what you ask for (well, not exactly, but with some needling it gets there eventually). But even if it were, that still wouldn’t disrupt collaboration, because that’s not how collaboration works. No one says, ‘instead of coding all the boring elif statements required for my function determining if something is a prime, I’ll search code snippets and collaborate with them.’ Every worthwhile collaborator on my project has been an active user of the software who wanted to help improve it or add functions. AI won’t change that, and if it does, it’ll only be because it makes coding so easy I don’t need collaborators.
Yep, the truly free and open internet is coming to an end. Corporations and governments have spent decades trying to claim control over it, and they’re nearly there.
Which, ironically, will be greatly expedited by the drive to prohibit AI from learning from “unlicensed” materials. That will guarantee that the only AIs with a broad training set will be those owned by corporations that already control an enormous amount of training materials (Disney, Getty Images, etc.)
Yeah, right now the fight is between corporations and creators, but I feel like the future battle is going to be between corporate AIs and “pirated” ones, because Disney is going to keep a firm chokehold over what its generative AI can make, while the community ones will completely ignore copyright restrictions and just let people do whatever they want.
Not gonna need to worry about paywalls when you can get a pirated generative AI to create the superhero mashup you always wanted to watch as a child. That said, I could definitely see Disney and others piggybacking off of AI panic to extend copyright protection into spaces that were previously fair use.
A factor I didn’t consider. Thanks. And there I thought that, given hardware requirements, it would be relatively easy to build such LLMs, or something similar, FOSS-style.
The internet is fine.
Listen. The era of algorithms and automated aggregators and whatnot feeding you endless interesting content is over. Before that we read blogs, we shared them on Usenet and IRC, we had webrings. We engaged in communities and the content we were exposed to was human curated. That is coming back. If we can quit it with the hackernews bot spam on Lemmy, it can be one of those places. You need to find invite-only niche forums that interest you and start talking to people. The future of the internet is human.
Algorithm-created curation isn’t necessarily bad. It’s just not great when it’s designed to increase engagement, rather than have the most liked, most interesting or best written content rise to the top. When engagement is the most important metric, we instead get lies, clickbait and emotive content rising to the top.
Enragement is hard to distinguish from engagement and most creators of algorithms don’t seem to particularly care about the difference. Some creators DO know the difference and still choose the dark side. It’s shitheads all the way down.
I’d say it’s more the problem that if you have any system, someone will try to game the system and succeed eventually. There’s no metric for objectively good quality that we can measure. Most liked? People will use bots, or treat the number of likes as a goal and do silly things to reach it. Most interesting? That’s completely subjective and varied; the only real way to use that would be to track individuals and serve “things that interest them.” Best written? I don’t know enough about writing to appreciate what’s good and what isn’t, and most people don’t either, as long as it’s good enough and appeals to them.
See also SEO. Or marketing in general I guess.
In theory, you have a better widget so you want to get it to the top of the relevant search results. In practice… 10,000 people trying to make money off a lemon pie recipe create a hellscape of mostly indistinguishable garbage that technically fits the description.
Renting a VPS was one of my best internet decisions TBH. I now have exactly this: my own website, an XMPP server and an IRC bouncer. IRC forever, seriously.
Start making deepfakes of CEOs saying stuff they never said. Bet your ass they’ll make laws real quick about AI protections for individuals.
Sir, we have the top of the line ChatGPT7 online. What should we ask it?
Ask it what our board should direct the company to do.
Sir, its answer is to immediately raise salaries, as there is no logical or sustainable reason for excess wealth at the levels of concentration we are at currently, with everyone but a few suffering and living out their working years in stress, anxiety and misery for no gain.
What are our other AI options?
Basically every law in favor of the average person only exists because it benefits the owning class in some way.
It’s the main reason why theft and murder are seen as the highest of crimes yet r— is rarely if ever prosecuted.
Why filter that word?
Because it genuinely causes pain to certain people to read it typed out, communicates equally as well, and is easier to type.
Nah, it just makes it confusing, especially to non-native English speakers.
Oh yeah. Fair play. Hadn’t considered a person’s reaction to the word. I just wondered why the 2 other crimes were fine but that wasn’t.
One triggers trauma and the other you do in video games on the regular
Also, the voting is so weird in your conversation: they were being considerate in censoring the word and were downvoted for saying why? Bandwagon voting is so weird; it makes me wonder if people read the comment or just look at the numbers.
Bandwagon voting, or maybe a bunch of people thought it was a dumb question
R— culture is rampant, especially on the internet. Nobody wants to admit it but you have to ask yourself why you get a strong negative response to anyone calling it out, and be prepared for an answer you don’t want to hear.
Based
When there are just paywalls and AI-generated text garbage everywhere, it’s nice to have a place where you can read what actual people think about things, good or bad.
That’s the value of forums nowadays I think.
Actual user-generated content is absolutely where it’s at.
I trust an 8-year-old forum post or a product review on YouTube by someone with 1,000 subscribers much more than any of the listicles riddled with Amazon affiliate links that dominate search results.
Exactly, which is why I keep repeating here, the Google/Facebook advertising model of “personalized content algorithm” was and is a lie that they’ve been selling for decades. There really is nothing more effective to promote something than genuine word of mouth, and that is not something that can be automated by an unfeeling machine.
So, in that sense, actual human content is a dwindling resource on the Internet right now, and that’s where Lemmy comes in. If we want Lemmy to grow, you should actively contribute your own expertise here (everybody is good at something) instead of arguing pointlessly, so people can think of Lemmy as a place where people help people.
“People who help people are the Lemmiest people in the world!”
I’m loving how Kagi banishes listicles to a single, small, condensed section of the search results.
Yeah, it really makes human contact more valuable at the end of the day. That was a good point coming from the verified real Margot Robbie!
Academy Award-nominated character actress Margot Robbie always makes good points!
Just a tangent, how long until game companies use AI voice synth to make us think we’re playing with real people?
When they actually invent AI. What we have now is just a statistical model. There is no AI. It’s just a buzz word.
Which is enough to imitate your usual in-game voice communication.
No it isn’t. The moment you try to have a conversation with these voices you’re going to realize very quickly there is no brain behind them.
Really? It works fine in Arma.
“Man. 200 metres. Front.”
Semantics aside, they already have voice synthesis
Have you seen another player in slither.io?? No, no you haven’t. It is a single player game.
idk if this is sarcasm, but there’s tons of real players in Slither.io. It’s one of my favorite games to de-stress :)
You can tell bots from people by observing the snakes. Bots are wiggly and prioritize nearby orbs. At ~700 orbs they begin their adventure to the red wall to deposit their orbs. These snakes also have recurring names like Popular MMO, The White Rabbit…
Real players will have straighter pathing, probably following a sequence of orbs if they’re not dashing and making it obvious. They also aren’t programmed to kill themselves, so any snake above 1k that isn’t moving straight to the wall is a real player.
Pro MLG players will be zooming to one of the many clusters on the map in the hopes that they can steal orbs from bigger snakes if they aren’t already circling as the big snakes. You’ll likely notice these players as tiny snakes desperately dashing in a straight line.
You can see the many servers here: https://ntl-slither.com/ss/
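For anyone who wants to play with the bot-spotting rules of thumb above, here is a rough Python sketch of them. It is only an illustration: the `Snake` fields, the `guess_kind` function and the way the rules are ordered are all made up for this example, the game exposes no such data, and the 700 and 1k thresholds are just the observations from the comment, not official game constants.

```python
from dataclasses import dataclass

# Names mentioned in the comment as recurring among bots (not a complete list).
KNOWN_BOT_NAMES = {"Popular MMO", "The White Rabbit"}

# Rough observations from the comment, not official game constants.
BOT_RETREAT_LENGTH = 700    # bots reportedly head for the red wall around here
REAL_PLAYER_LENGTH = 1000   # "any snake above 1k" not heading for the wall


@dataclass
class Snake:
    """Hypothetical observation of one snake, filled in by a human watcher."""
    name: str
    length: int
    heading_to_wall: bool   # travelling straight toward the red wall?
    wiggly: bool            # wobbling from one nearby orb to the next?


def guess_kind(snake: Snake) -> str:
    """Return 'human' or 'bot' using the heuristics from the comment."""
    # Big snakes that are not on a run toward the wall are treated as human.
    if snake.length > REAL_PLAYER_LENGTH and not snake.heading_to_wall:
        return "human"
    # A recurring bot name is a strong hint.
    if snake.name in KNOWN_BOT_NAMES:
        return "bot"
    # Snakes around the retreat length that head for the wall look like bots.
    if snake.length >= BOT_RETREAT_LENGTH and snake.heading_to_wall:
        return "bot"
    # Otherwise fall back on movement: wiggly means bot, straight means human.
    return "bot" if snake.wiggly else "human"


if __name__ == "__main__":
    print(guess_kind(Snake("The White Rabbit", 450, False, True)))
    print(guess_kind(Snake("some player", 1500, False, False)))
```

The order of the checks is a design choice of this sketch: the "above 1k and not heading for the wall" rule goes first because the comment states it as the one sure sign of a real player.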
Where’s the money in that?
I guess they could make you think you’re better at competitive games than you thought, but then that still doesn’t guide you to buy anything extra
Skyrim’s already got a mod that does effectively this.
> But these commons are now being overgrazed by rapacious tech companies that seek to feed all of the human wisdom, expertise, humor, anecdotes and advice they find in these places into their for-profit A.I. systems.
This analogy falls apart when you note that “overgrazing” these resources does absolutely nothing to harm them.
They’re still there. They haven’t been affected in any way by the fact that a machine somewhere has read them and learned a bunch of stuff from them. So what?
> This analogy falls apart when you note that “overgrazing” these resources does absolutely nothing to harm them.
Only if you consider AI-supercharged misinformation to not be harmful.
Only if you consider the entropy of human interaction on the internet to not be harmful.
Only if you consider being unable to know who is real to not be harmful.
None of those things directly harm the resources being “grazed”, and none of them are inevitable consequences of AI. If you think they are then you’re actually arguing against AI in general and not the specific way in which they’ve been trained.
You think the internet being flooded with articles, comments etc. all being written by AI whose only goals are selling shit, disseminating misinformation, and manipulating elections and opinions - with no way to know what is human and what is AI - is going to be a great environment to continue to train your AI?
You might be interested to read about Model Autophagy Disorder.
That is not a problem caused by “overgrazing” those open resources. It’s a separate problem with AI training that needs to be addressed anyway. You’re just throwing out random AI-related challenges regardless of whether they’re relevant to what’s being discussed.
Simply put, quality control is always important.
If you pump toxic waste onto the field nobody gets to graze it.
Fuck you’re being pedantic.
And you’re completely missing the point.
Whether or not toxic waste is pumped into the field is completely independent of whether anyone is “grazing” on it. AIs are going to be trained and AIs are going to be generating content, regardless of whether those “commons” are being used as training material. If you wish to keep those “commons” high-quality you’re going to have to come up with some way of doing that regardless of whether they’re being used as training material. Banning the use of them as training material will have no impact on whether they get “toxic waste” pumped into it.
My objection is to those who are saying that to save the commons we need to prevent grazing, ie, that to save the quality of public discourse we need to prevent AIs from training on it. Those two things are unrelated. Stopping AIs from training on it will not do anything to preserve the quality of public discourse.
No mate, you’re just being pedantic
While the analogy is not perfect, you can think of the harm as getting lost in the noise. If the “overgrazing” of content on the internet (content whose purpose is to be read, listened to, etc., often made as someone’s job) causes a huge amount of other, AI-generated content based on it, then the original is damaged by being lost in the noise.
AI-generated content is coming regardless, whether those open sources get “grazed” or not.
Yes, but the qualitative difference of providing direct competition to the “grazed” material exists. There is a difference between AI-generated audiobooks and AI-generated audiobooks with the voice of X, for X. Once AI can perfectly reproduce X’s voice, his/her value as a voice actor is 0, hence the “overgrazing”. It is not the same thing as simply being able to provide audiobooks with any other voice.
api ._.
That was entirely self-inflicted damage.
?
Reddit is responsible for their own API changes. Not OpenAI or any other external agency who might have been using Reddit data for AI training. Only Reddit was capable of choosing to change their API, it’s entirely under their control.
This is the best summary I could come up with:
Thanks to artificial intelligence, however, IBM was able to sell Mr. Marston’s decades-old sample to websites that are using it to build a synthetic voice that could say anything.
A.I.-generated books — including a mushroom foraging guide that could lead to mistakes in identifying highly poisonous fungi — are so prevalent on Amazon that the company is asking authors who self-publish on its Kindle platform to also declare if they are using A.I.
But these commons are now being overgrazed by rapacious tech companies that seek to feed all of the human wisdom, expertise, humor, anecdotes and advice they find in these places into their for-profit A.I. systems.
Consider, for instance, that the volunteers who build and maintain Wikipedia trusted that their work would be used according to the terms of their site, which requires attribution.
A Washington Post investigation revealed that OpenAI’s ChatGPT relies on data scraped without consent from hundreds of thousands of websites.
Whether we are professional actors or we just post pictures on social media, everyone should have the right to meaningful consent on whether we want our online lives fed into the giant A.I.
The original article contains 1,094 words, the summary contains 188 words. Saved 83%. I’m a bot and I’m open source!
‘Everything new is bad and scary.’ I really don’t understand why this viewpoint is so common in a tech community.
AI will solve so many problems with the current internet and make it far easier to use. And there’s no such thing as overgrazing Wikipedia. I certainly wrote my small portions of it very aware that it’s going to be used by AI, and it’s a great thing; plus they can certainly afford the bandwidth.
Traditional media says the thing that displaces them is terrible and scary and should be stopped… we’ve heard it before with the internet, with social media, and right back to TV and radio…
It will be the greatest discovery tool for human-created content that we’ve ever had. Imagine being able to sort all the junk and actually find what you’re looking for, being able to actually filter stuff and search within context. And imagine not needing a journalist to string together their assumptions and sketchy understanding of science but being able to ask questions and get answers that draw from press releases, released papers, interviews, and public statements.
Yes it will get harder to use the web like we did ten years ago, but that’s ok because doing that is already rubbish.