AI has all the answers, even the wrong ones


Can large language models solve logic problems? There is one way to find out, and that is to ask. That is what Fernando Perez-Cruz and Hyun Song Shin did recently. (Perez-Cruz is an engineer; Shin is the head of research at the Bank for International Settlements, as well as the man who, in the early 1990s, taught me some of the more mathematical aspects of economic theory.)

The puzzle in question is commonly known as “Cheryl’s Birthday.” Cheryl challenges her friends Albert and Bernard to guess her birthday, and as the puzzle opens they learn that it is one of 10 dates: May 15, 16, or 19; June 17 or 18; July 14 or 16; or August 14, 15, or 17. To narrow things down, Cheryl tells Albert her birth month and tells Bernard the day of the month, but not the month itself.

Albert and Bernard think for a while. Then Albert announces, “I don’t know your birthday, and I know Bernard doesn’t know it either.” Bernard replies, “In that case, I now know your birthday.” Albert replies, “Now I know your birthday, too.” What is Cheryl’s birthday?* More specifically, what do we learn by asking GPT-4?

The puzzle is a stiff challenge. Solving it requires eliminating possibilities step by step while pondering questions like “What must Albert have been told, given that he knows Bernard doesn’t know?” So it is hugely impressive that when Perez-Cruz and Shin repeatedly asked GPT-4 to solve the puzzle, the large language model got it right every time, fluently producing varied and precise explanations of the problem’s logic. However, this brilliant display of logical mastery was nothing more than a clever illusion. The illusion fell apart when Perez-Cruz and Shin gave the computer a trivially modified version of the puzzle, changing the names of the characters and the months.

GPT-4 continued to produce fluent and plausible explanations of the logic—so fluent, in fact, that it takes real concentration to spot the moments when those explanations become nonsense. Both the original problem and its answer are available online, so presumably the computer had learned to rephrase this text in a sophisticated way, giving the appearance of a brilliant logician.

When I tried the same thing, preserving the formal structure of the puzzle but changing the names to Juliet, Bill, and Ted, and the months to January, February, March, and April, I got the same disastrous result. Both GPT-4 and the newer GPT-4o marched authoritatively through the structure of the argument, but they reached false conclusions at several steps, including the last one. (I also realized that on my first attempt I had introduced a fatal typo into the puzzle, making it unsolvable. GPT-4 didn’t bat an eyelid and “solved” it anyway.)


Curious, I tried another famous puzzle. A contestant on a game show tries to find a prize hidden behind one of three doors. The host, Monty Hall, lets the contestant make a provisional choice, opens one of the other doors to reveal that the prize is not behind it, and then offers the contestant the chance to switch doors. Should he do so?

The Monty Hall problem is actually much simpler than Cheryl’s birthday problem, but it is bafflingly counterintuitive. I made things harder for GPT-4o by adding a few complications. I introduced a fourth door and asked not whether the contestant should switch (he should), but whether it was worth paying $3,500 to switch after two empty doors had been opened, if the grand prize was $10,000.**

GPT-4o’s response was remarkable. It avoided the cognitive trap of the puzzle, clearly articulating the logic of each step. Then it stumbled at the finish line, adding a spurious assumption and arriving at the wrong answer.

What are we to make of all this? In a way, Perez-Cruz and Shin have simply found a variant of the well-known problem that large language models sometimes weave plausible fictions into their answers. Instead of plausible factual errors, here the computer has served up plausible logical errors.

Proponents of large language models might respond that, with a cleverly designed prompt, the computer could do better (which is true, although the word “could” is doing a lot of work here). It is also almost certain that future models will do better. But, as Perez-Cruz and Shin argue, that may be beside the point. A computer capable of appearing so right while being so wrong is a risky tool to use. It is as if we were relying on a spreadsheet for our analysis (risky enough already) and the spreadsheet occasionally and unpredictably forgot how multiplication works.

This is not the first time we have heard that large language models can be engines of phenomenal falsehood. The difficulty here is that the falsehoods are frighteningly plausible. We have seen falsehoods before, and mistakes, and God knows we have seen people who are fluent in telling lies. But this? This is something new.

*If Bernard had been told the 18th (or the 19th), he would know that the birthday was June 18th (or May 19th). So when Albert says he knows Bernard doesn’t know the answer, that rules out those possibilities: Albert must have been told July or August rather than May or June. Bernard’s response that he now knows the answer for certain reveals that the day can’t be the 14th (which would have left him guessing between July and August). The remaining dates are August 15th, August 17th, or July 16th. Albert knows the month, and his statement that he now knows the answer reveals that the month must be July and that Cheryl’s birthday is July 16th.
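For readers who want to check that chain of eliminations, here is a minimal brute-force sketch in Python. It is my own illustration rather than anything from the column or the BIS paper; the names dates, s1, s2, and s3 are assumptions of the sketch, and the ten candidate dates are taken from the puzzle above.

```python
# Brute-force check of the Cheryl's Birthday reasoning in the footnote.
# Dates are (month, day) pairs; months 5-8 are May-August.
dates = [(5, 15), (5, 16), (5, 19), (6, 17), (6, 18),
         (7, 14), (7, 16), (8, 14), (8, 15), (8, 17)]

def unique_day(day, pool):
    """Bernard could deduce the date at once iff his day appears only once in the pool."""
    return sum(1 for _, d in pool if d == day) == 1

# Statement 1: Albert (who knows the month) knows Bernard can't know the date,
# so no date in Albert's month has a uniquely occurring day.
s1 = [(m, d) for m, d in dates
      if not any(unique_day(d2, dates) for m2, d2 in dates if m2 == m)]

# Statement 2: given s1, Bernard (who knows the day) can now pin down the date,
# so his day appears exactly once among the surviving candidates.
s2 = [(m, d) for m, d in s1 if sum(1 for _, d2 in s1 if d2 == d) == 1]

# Statement 3: given s2, Albert (who knows the month) can also pin it down.
s3 = [(m, d) for m, d in s2 if sum(1 for m2, _ in s2 if m2 == m) == 1]

print(s3)  # [(7, 16)] -> July 16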

**The probability of initially choosing the correct door is 25 percent, and it does not change when Monty Hall opens two empty doors. Therefore, the probability of winning the $10,000 is 75 percent if you switch to the remaining closed door, and 25 percent if you stick with your initial choice. For a risk-neutral contestant, it is worth paying up to $5,000 to switch.
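As a sketch of that expected-value arithmetic (again my own illustration, with assumed names such as prize and fee, not code from the column):

```python
# Four-door Monty Hall variant: is paying $3,500 to switch worth it?
prize = 10_000
fee = 3_500

p_stay = 1 / 4            # initial pick is right with probability 1/4
p_switch = 1 - p_stay     # the two opened doors concentrate the rest on one door

ev_stay = p_stay * prize              # 2,500
ev_switch = p_switch * prize - fee    # 7,500 - 3,500 = 4,000

print(ev_stay, ev_switch)                 # switching still comes out ahead...
print(p_switch * prize - p_stay * prize)  # ...and is worth up to 5,000 to a risk-neutral player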






