Artificial intelligence researchers said Friday they have deleted more than 2,000 web links to suspected child sexual abuse imagery from a dataset used to train popular AI image-generator tools.

The LAION research dataset is a huge index of online images and captions that’s been a source for leading AI image-makers such as Stable Diffusion and Midjourney.

But a report last year by the Stanford Internet Observatory found it contained links to sexually explicit images of children, contributing to the ease with which some AI tools have been able to produce photorealistic deepfakes that depict children.

That December report led LAION, which stands for the nonprofit Large-scale Artificial Intelligence Open Network, to immediately remove its dataset. Eight months later, LAION said in a blog post that it worked with the Stanford University watchdog group and anti-abuse organizations in Canada and the United Kingdom to fix the problem and release a cleaned-up dataset for future AI research.

Stanford researcher David Thiel, author of the December report, commended LAION for significant improvements but said the next step is to withdraw from distribution the “tainted models” that are still able to produce child abuse imagery.

  • istanbullu
    4 months ago

    The dataset sizes needed for machine learning rule out any kind of human verification. It’s just not possible to manually check billions of images.

        • Iapar@feddit.org
          link
          fedilink
          4 months ago

          Mu.

          I wouldn’t use an amount of images I couldn’t check. I wouldn’t use images from unchecked sources. I wouldn’t make money from sexually exploited children.

          And I think people that don’t see the most obvious solution to that are fucked in the head.

          • istanbullu
            4 months ago

            That won’t work. Models of this kind need billions of images or they are trash.