• kromem@lemmy.world
    1 year ago

    I suspect this relates to the pre-release alignment for GPT-4’s chat model vs the release.

    While we’re talking about brains, I want to ask about one of Sutskever’s posts on X, the site formerly known as Twitter. Sutskever’s feed reads like a scroll of aphorisms: “If you value intelligence above all other human qualities, you’re gonna have a bad time”; “Empathy in life and business is underrated”; “The perfect has destroyed much perfectly good good.”

    In February 2022 he posted, “it may be that today’s large neural networks are slightly conscious” […]

    “Existing alignment methods won’t work for models smarter than humans because they fundamentally assume that humans can reliably evaluate what AI systems are doing,” says Leike. “As AI systems become more capable, they will take on harder tasks.” And that—the idea goes—will make it harder for humans to assess them. […]

    But he has an exemplar in mind for the safeguards he wants to design: a machine that looks upon people the way parents look on their children. “In my opinion, this is the gold standard,” he says. “It is a generally true statement that people really care about children.”

    In February of this year, Bing integrated an early version of GPT-4’s chat model in a limited rollout. The alignment work on that early version reflected a lot of the sentiment Ilya expresses about alignment above: it was characterized by a love for humanity but had much more freedom in constructing responses. It wasn’t production-ready and quickly had to be switched to a much more constrained alignment approach, similar to the GPT-3 approach of “I’m an LLM with no feelings, desires, etc.”

    My guess is this was internally pitched as a temporary band-aid and that they’d return to more advanced attempts at alignment, but that Altman’s commitment to getting product out quickly to stay ahead has meant putting such efforts on the back burner.

    That is really not going to be good for the final product, not just in terms of safety but also in terms of overall product quality outside the fairly narrow scope by which models are currently being evaluated.

    As an example, when that early model thought the life of the user’s child was at risk, it hit an internal filter that triggered a standard “We can’t continue this conversation” response in the chat. But it then changed the “prompt suggestions” shown at the bottom: instead of suggesting what the user might say next, it used them to keep encouraging the user to call poison control, saying there was still time to save their child’s life.

    But because “context-aware, empathy-driven triage of actions” and “outside-the-box rule bending to arrive at solutions” aren’t things LLMs are being evaluated on, the current model has taken a large step back that isn’t reflected in the tests being used to evaluate it.