Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI model.

The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.

    • solsangraal@lemmy.zip · English · 13 upvotes · 22 days ago

      the chatbots are there for them to pretend they’re doing something useful for the end user, instead of just creating an ever-increasingly detailed unique digital profile of each individual with thousands of data points in order to separate you from your money

    • cheddar@programming.dev · 4 upvotes · edited · 22 days ago

      Of course we do. A normal customer support agent of a random e-shop wouldn’t write me a python script to send an email alert if my raspberry pi overheats!

  • eskimofry@lemmy.world · 36 upvotes · 22 days ago

    These hypocritical assholes don’t want people accessing their own data on their websites, and they lay claim to it. Now they want to steal others’ data.

    It would make my day if they got sued into oblivion for data theft.

  • henfredemars@infosec.pub · English · 29 upvotes · edited · 22 days ago

    The AI cat is out of the bag. How do they know they’re not feeding AI generated garbage into their models?

    Actually, I think I’m gonna go into my personal website and add 200 pages of locally generated LLM garbage, with hidden links to those pages that only bots should follow.

    • BlackDragon@slrpnk.net · English · 6 upvotes · 22 days ago

      How do they know they’re not feeding AI generated garbage into their models?

      They don’t. Any popular place on the internet which lets users type text for people to publicly view is now full of AI trash. They’ve fucked it, this shit is just gonna spiral into progressively worse garbage

  • TheReturnOfPEB@reddthat.com · English · 11 upvotes · 22 days ago

    Mega-wealthy tech oligarchs hate human beings. They want to replace us all with processes that they can kill with fewer problems.

  • GarrulousBrevity@lemmy.world · 12 upvotes, 1 downvote · edited · 22 days ago

    Does that mean this new bot is ignoring sites’ robots.txt files? The Internet works because of web crawlers, and I’m not sure how this one is different.

    Edited to add: Apparently one would need to add Meta-ExternalAgent to their robots.txt file unless they already had a wildcard rule, so it isn’t as widely blocked simply because it is new. Letting it run for a few months before letting anyone know it exists is kinda shady.
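To illustrate the point about wildcard rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The three sample robots.txt bodies and the `is_blocked` helper are made up for illustration; only the `Meta-ExternalAgent` and `GPTBot` names come from the thread. It shows what a robots.txt file asks of a crawler and says nothing about whether a given bot honors it.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical site that only named GPTBot before the new crawler appeared.
NAMED_ONLY = """\
User-agent: GPTBot
Disallow: /
"""

# Hypothetical site with a wildcard rule covering every crawler.
WILDCARD = """\
User-agent: *
Disallow: /
"""

# The first site after adding an explicit entry for the new crawler.
WITH_META = NAMED_ONLY + """
User-agent: Meta-ExternalAgent
Disallow: /
"""

if __name__ == "__main__":
    for name, body in [("named only", NAMED_ONLY),
                       ("wildcard", WILDCARD),
                       ("with Meta entry", WITH_META)]:
        print(name, "->", is_blocked(body, "Meta-ExternalAgent"))
```

A site that listed only older bots by name does not cover a newly named crawler until its own entry is added, while a wildcard rule covers it from the start.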

  • werefreeatlast@lemmy.world · 5 upvotes · 22 days ago

    We need an automated text generator with generic sentences. Bunch up all the dictionary words grouped by type and then make absolutely nonsensical but valid sentences. Keep updating as often as the AI bots visit. Add questions and fake answers about random images. And we could do the same thing with books. Download volumes from Google, change the meaning of various words and rehash the same big texts with all the wrong stuff. Like everything is correct except for the word "the", now written with a k in place of the h… tke. Tke story about tke cat in tke hat. Then write another big book with the same thing but a different topic… tke excelsior returns!
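A rough sketch of that idea in Python, under heavy assumptions: the tiny word lists, the fixed sentence template, and the `nonsense_sentence` and `swap_word` helpers are all invented here for illustration. A real version would draw on a full dictionary tagged by part of speech.

```python
import random
import re

# Hypothetical word lists grouped by part of speech (placeholders only).
WORDS = {
    "det": ["the", "a", "every"],
    "adj": ["purple", "reluctant", "magnetic"],
    "noun": ["teapot", "glacier", "bicycle"],
    "verb": ["devours", "paints", "interrogates"],
    "adv": ["politely", "sideways", "twice"],
}

# One fixed grammatical template: "Det adj noun verb det noun adv."
TEMPLATE = ("det", "adj", "noun", "verb", "det", "noun", "adv")


def nonsense_sentence(rng: random.Random) -> str:
    """Fill each template slot with a random word of the matching type."""
    words = [rng.choice(WORDS[slot]) for slot in TEMPLATE]
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."


def swap_word(text: str, old: str = "the", new: str = "tke") -> str:
    """Replace whole-word, case-sensitive occurrences of old with new."""
    return re.sub(rf"\b{re.escape(old)}\b", new, text)


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(3):
        print(nonsense_sentence(rng))
    print(swap_word("the story about the cat in the hat"))
```

Regenerating the page on every bot visit would just mean calling `nonsense_sentence` again with a fresh random state.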

    • I wonder if you could do a ton of letter swaps to make things look misspelled, but then provide a custom font that also swaps the glyphs around. So a human would read the normal text, but if you changed the font to a normal font you’d see what an AI would see, i.e. garbage.

      Probably not very practical though. Copy-pasting from your website would break, for example.
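The text side of that letter-swap idea can be sketched as a simple substitution table. This is only an illustration: `make_swap_table` and `apply_table` are hypothetical helper names, only lowercase ASCII letters are handled, and no font is actually built. The custom font would have to draw, for each stored character, the glyph given by the inverse table.

```python
import random
import string


def make_swap_table(seed: int) -> dict[str, str]:
    """Build a seeded random one-to-one mapping over lowercase letters."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))


def apply_table(text: str, table: dict[str, str]) -> str:
    """Substitute each character found in the table; leave the rest alone."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    table = make_swap_table(seed=42)
    # The inverse is what the (hypothetical) custom font would encode.
    inverse = {swapped: original for original, swapped in table.items()}

    readable = "the cat in the hat"
    stored = apply_table(readable, table)   # what a scraper would collect
    shown = apply_table(stored, inverse)    # what a human would see rendered
    print(stored)
    print(shown)
```

The copy-paste problem shows up directly here: whatever gets copied from the page is the `stored` string, not the `shown` one.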