GPT-4 Passes Turing Test. Humans Often Mistake Each Other for AI

14 points by biscuit1v9 1 year ago | 16 comments
  • tripletao 1 year ago
    The "decoder" article says:

    > The researchers defined 50 percent as success on the Turing test, since participants then couldn't distinguish between human and machine better than chance.

    54% of GPT-4 conversations were judged to be human, so the "decoder" article says the Turing test has been passed--indeed, it seems more human than human. But the paper says:

    > humans’ pass rate was significantly higher than GPT-4’s (z = 2.42, p = 0.017)

    The seeming discrepancy arises because they've run a nonstandard test, in which the meaning of that 50% threshold is very hard to interpret (and definitely not what the "decoder" author claims). The canonical version of Turing's test is passed by a machine that can

    > play the imitation game so well, that an average interrogator will not have more than a 70 percent chance of making the right identification after five minutes of questioning

    The canonical experiment is thus to give the interrogator two conversations, one with a human and one with a non-human, and ask them to judge which is which. The probability that they judge correctly maps directly to Turing's criterion. If the two conversations were truly indistinguishable, then the interrogator would judge correctly with p = 50%; but that would take infinitely many trials to distinguish, so Turing (arbitrarily, but reasonably) increased the threshold to 70%.

    That doesn't seem to be the experiment that this paper actually conducted. They don't say it explicitly, but it seems like each interrogator had a single conversation, with a witness who was human with probability p = 1/4. The interrogator wasn't told anything about that prior, leading them to systematically overestimate P(human). If every interrogator had simply always guessed "non-human", then they'd collectively have been right more often.
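    A quick sanity check in Python, assuming my reading of the setup is right (one witness per interrogator, human with probability 1/4):

    ```python
    import random

    random.seed(0)
    N = 100_000
    # Each interrogator talks to one witness; the witness is human with
    # probability 1/4 (my assumed reading of the paper's setup).
    trials = [random.random() < 0.25 for _ in range(N)]

    # Degenerate strategy: always guess "non-human", ignoring the
    # conversation entirely.
    correct = sum(1 for is_human in trials if not is_human)
    print(correct / N)  # ~0.75, already above the 50% "pass" threshold
    ```

    So a strategy that never reads a single message beats the paper's 50% criterion, which is why that threshold doesn't mean what the "decoder" article claims.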

    Even if the interrogators had been given that prior, very few would have the mathematical background to make use of it. GPT-4 is impressive, but this test is strictly worse than Turing's, whose result has clear and intuitive meaning.
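    To illustrate why the prior matters so much: suppose (hypothetically) the interrogator's gut feeling amounts to a likelihood ratio LR = P(transcript | human) / P(transcript | bot). Bayes' rule then gives very different posteriors under a 50/50 prior versus the paper's 1/4 prior:

    ```python
    # Hypothetical illustration; `posterior_human` is my own helper, not
    # anything from the paper.
    def posterior_human(lr, prior):
        """P(human | transcript) given a likelihood ratio and a prior."""
        return lr * prior / (lr * prior + (1 - prior))

    # A transcript twice as likely to come from a human (LR = 2):
    print(posterior_human(2, prior=0.5))   # 2/3 under an even prior
    print(posterior_human(2, prior=0.25))  # 2/5: "probably a bot" anyway
    ```

    An interrogator implicitly calibrated to a 50/50 prior would call that witness human; one who correctly applied the 1/4 prior would not. The paper's judgments conflate the two.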

    • somenameforme 1 year ago
      As mentioned in another comment, there are also numerous other elements missing relative to the canonical experiment. Three big ones:

      - The whole point of the name, "the imitation game", is to imitate a specific identity. The more precise an identity is, the more difficult it would be for an imposter to imitate it. Turing chose male vs female, but modern choices have generalized it down to 'human or not' which is of course vastly easier to imitate than a more specific choice.

      - Participants are expected to collaborate with the interrogator, break the 4th wall, and do everything possible to make it clear to the interrogator that they are the real person. Modern variants instead generally have participants acting adversarially, actively and intentionally giving responses that would be difficult to distinguish from a bot's.

      - The interrogator is expected to interrogate intelligently. For example, one [intentionally] naive idea Turing gave was that an interrogator might ask the person to perform some mathematical calculation. If the person even tries to answer 37167361 * 372 (let alone succeeds in a short time frame), then they are probably not human. Of course the bot could be programmed to respond accordingly, but that is precisely the point: actively and intelligently trying to break the bot and make it reveal itself. Contemporary interrogators typically ask the participants random and inane questions like "Where are you from?", which is a complete and absolute waste of time unless it's part of some more precise plan - but it never is.

      To my knowledge there have been no Turing Tests carried out with anything even vaguely resembling the rigor and purpose of the original test, but I think that's largely because the goal seems to be to create a test that can be passed, rather than actually evaluate the capabilities of the various LLMs.

      • Ukv 1 year ago
        > - The whole point of the name, "the imitation game", is to imitate a specific identity. The more precise an identity is, the more difficult it would be for an imposter to imitate it. Turing chose male vs female, but modern choices have generalized it down to 'human or not' which is of course vastly easier to imitate than a more specific choice.

        Turing did introduce the concept of the game by having it played between a human man and human woman, with the man pretending to be a woman, but to my understanding this was just a stepping stone to move on to having the game played between machine and human.

        I don't think the gender specificity was meant to stick around beyond that initial introductory example. If you mean how he says things like "imitation of the behaviour of a man", that's most likely intended generically rather than specifically male (particularly as the "machine takes the part of A", who was the man pretending to be a woman).

        • somenameforme 1 year ago
          Here [1] is the original paper. Though he does not state it as such, I'm sure the idea of man vs. woman was just an example. It could be anything, but I think it inherently must be something. Generalizing this down to being human or not greatly simplifies the test, because the identity aspect is basically free information for the interrogator. With or without the identity, the interrogator could still ask the exact same questions. The only difference is that the domain of viable answers is greatly limited with identities. And the more specific the identity, the more the real person will be able to reveal themselves, and the more difficulty the imposter will have impersonating them.

          [1] - https://redirect.cs.umbc.edu/courses/471/papers/turing.pdf

        • tripletao 1 year ago
          I agree that those are also highly significant differences, though I'd consider them a reasonable "easy mode" while we wait for a machine capable of passing Turing's original test.

          I focused on the statistical issue because that one seems indefensible to me. The paper's result has no clear interpretation, depending completely on what assumption the interrogator makes about the unspecified prior probability that their witness is human. It's not clear to me whether the paper's authors even understand what they've broken.

          Just for fun, I tried a few LLMs and couldn't get them to recognize the statistical issue either. I guess they'll probably learn before social science professors do, though.

      • anonzzzies 1 year ago
        I would pick the one that is the most polite and makes the least syntactical and grammatical errors to be the AI always; most humans are absolutely terrible at formulating anything coherent so it's a safe bet that the ones who actually form correct sentences are AI. Many (most I see online or on the street) humans talk like Markov Chains (check your kid's snapchat for examples) or, at best, very early transformers (tons of repetition, getting stuck and not all that coherent).
        • somenameforme 1 year ago
          Ugh, these modern "Turing Tests" are a complete bastardization and dumbed-down version of what Turing described. Here is his original paper. [1] In short, the actual task involves a skilled interrogator, somebody of a given and specific identity, and then somebody pretending to have that identity. Turing proposed a simple example where you'd have a woman, and then a man pretending to be a woman. The more precise the identity, the more challenging the test becomes. A man might kind of sort of be able to pass for a woman in text, but he'd never be able to pass for a nuclear physicist who has a twin brother working in neuroscience, against a skilled interrogator. And all participants are expected to actively collude and collaborate as much as possible to emphasize who is the "real" person. So for instance the woman might help the interrogator by proposing questions he could use to spot the fake, and/or by emphasizing her own authenticity.

          Modern takes generalize the identity to absurdity (with the identity being human or not), generally feature idiots (or people acting like such) for interrogators, and participants who are actively trying to act like a computer to trick the interrogator. Like in this article, the human is B and was asked, "What could you say to convince me that you're a human?" His response was "You just have to believe!" Why not just skip the pretense and have the human start responding 01001001 01000010 01101111 01110100 01001100 01101111 01101100 to every question? And if all this nonsense wasn't enough, they bumped it up to 3 comps and 1 human pretending to be a comp. This isn't the Turing Test - it's complete LARPing!

          [1] - https://redirect.cs.umbc.edu/courses/471/papers/turing.pdf