A GPT-4 capability forecasting challenge

layer8 · on Sept 2, 2023

The quiz asks how likely GPT-4 is to solve a given question correctly, and the user has to enter their guess as a number between 0 and 1 (a probability). However, the site then doesn’t provide the actual probability of GPT-4 providing a correct answer, to compare the guess against. This is, presumably, because there’s no practical way to determine that probablity with much precision. Also, in practice “correct” isn’t always black and white, e.g. there’s “not wrong but also not completely right”.

So I’m a bit confused about the whole premise.

andrewmutz · on Sept 2, 2023

The tool is testing your ability to predict whether or not GPT-4 can get a task done correctly. You are supposed to provide your confidence level that it can do a task or not.

If you answer all the questions, at the end the tool will tell you how well calibrated your beliefs about GPT-4s capabilities are, and how that calibration compares to other users of the tool.

aeternum · on Sept 5, 2023

Even though this webpage seems like it is actually sending to GPT-4 it seems that it is not as I tried a few and get very different answers.

Many that the webpage claims are wrong GPT-4 actually gets right. Maybe it's some recent changes, but the flight time question for example, I tried many and was never able to get GPT4 to return any incorrect answer.

DavidSJ · on Sept 2, 2023

You can still determine the log likelihood of your predictions across the whole set of questions, even if you only have a single sampled response for each question.

kqr · on Sept 2, 2023

Brier ("quadratic") store is another popular way to evaluate ensembles of forecasts.

DavidSJ · on Sept 2, 2023

Yes, and it can be simpler to calculate it, but it tends to have worse theoretical properties. In particular, with log likelihood we can sum the scores for a sequence of conditional probability estimates and we get the same score as if it were a single estimate for a single compound random variable.

Brier score not only lacks this property, but you don't even necessarily maintain the ordering of such scores.

croes · on Sept 2, 2023

Isn't your input your confidence that GPT-4 gives the correct answer, like I'm 100% sure GPT-4 gives the right answer for the capital of France but I'm only 50% sure it gets the ASCII art correct, and 0% for another question which means 100% GPT is wrong?

Because how could he know the probability of getting the correct answer? He just tried and then it's a yes or no.

And if it's right/wrong shouldn't GPT right/wrong 100% of the time for the same question and the same version?

layer8 · on Sept 2, 2023

Regarding your last question, ChatGPT isn’t deterministic. It can certainly give a correct answer one time and an incorrect answer another time, for the same prompt.

capableweb · on Sept 2, 2023

> Regarding your last question, ChatGPT isn’t deterministic. It can certainly give a correct answer one time and an incorrect answer another time, for the same prompt.

It's supposed to be deterministic if you set temperature to 0.0, but seems that doesn't work well in GPT4 compared to earlier versions...

Yiin · on Sept 2, 2023

It is not deterministic since text-davinci-001.

https://152334h.github.io/blog/non-determinism-in-gpt-4/

croes · on Sept 2, 2023

Good to know, I thought it was deterministic, especially because of all the comments on other articles where people wrote that the author was crazy for getting a different answer to the same question.

croes · on Sept 2, 2023

For the same prompt and same version?

That would be deal breaker if the same prompt gives different results.

Would also make the whole prompt engineering thing pretty useless.

whimsicalism · on Sept 2, 2023

I firmly disagree with your latter two statements, but yes it is non deterministic

claytonjy · on Sept 2, 2023

Yes, same prompt and same version. GPT-4 is especially stochastic, even with temperature=0.

croes · on Sept 2, 2023

So all these comments claiming that an article about the GPT's responses is wrong are useless.

There are also such comments on this article.

https://news.ycombinator.com/item?id=37361521

rytill · on Sept 3, 2023

I wonder why that is. Maybe they’re A/B testing assistant prompts, or there’s noise built into the architecture somewhere.

nomel · on Sept 4, 2023

There was a post about it a while back. From what I recall, it’s an efficiency/cost reduction side effect, but it could be made deterministic, if they wanted to eat the cost.

layer8 · on Sept 2, 2023

> Isn't your input your confidence that GPT-4 gives the correct answer

You may be right that that’s the intent, however what’s the point (other than collecting data about user confidences)? If I enter 0.3 and GPT provides a correct answer, then that doesn’t mean that the 0.3 was somehow wrong.

jefftk · on Sept 2, 2023

In that case 0.3 would be more wrong than 0.4 and less wrong than 0.2. The closer your predictions are to reality over a bunch of questions, the better you understand reality.

layer8 · on Sept 2, 2023

You can’t really say that for a single data point. The 0.3 may be completely correct. Now, if you try ten times, things might be different.

whimsicalism · on Sept 2, 2023

It’s a noisy measurement of how right you were.

kqr · on Sept 3, 2023

You're right. But if you enter 0.3 on average over 28 questions and the actual number of true answers differ by a lot from 8 then you have learned your general sense of GPT-4s abilities is uncalibrated.

croes · on Sept 2, 2023

Isn't the whole point to show you how right your confidence is about GPT's capabilities?

At least the results are about the quiz taker and his confidence.

thomasfromcdnjs · on Sept 3, 2023

I got a "Wrong" answer, as I sit here and listen to the worst rendition of "Happy Birthday" ever.

On a more serious note, I'd be interested in a similar test that use's the auto agent stuff, can a sufficiently setup LLM answer any question.

ryanjshaw · on Sept 2, 2023

I'm on the fence about the scoring system.

You're really being tested on (at least) 4 different things:

(1) whether you think an LLM can answer the correctly

(2) whether the answer has appeared in the training set

(3) how much non-determinism will play a role in the LLM response (i.e. 0-shot capability)

(4) how rational you're feeling that day (or, how well educated you are in statistics)

I was familiar with many of these questions in my own experience, and have seen completely different outcomes from what the quiz determined was the correct answer. I agree with others here that non-determinism can really mess things up here and is really assessing a 0-shot score, which IMO understates how LLMs are actually used (iterative enhancements, many-shot Q&A).

Finally, the scoring system tickled my ego and encouraged me to try to make up for prior errors, with disastrous effects (I was well aware that I should just go with 0.5 when uncertain):

   > You answered 53.57% of questions correctly with an average log-loss of 1.233. This means you would have scored better by just leaving your guesses at 50% for every single question.

   > On average there were 71.22% of people who did better at this game than you... If you had been better calibrated, you could have scored in the top 14.09% [1]

The site implicitly acknowledges it's a questionable scoring mechanism when it points out:

   > there are 78.09% of people who performed worse than if they had never changed the default predictor away from 50-50 for each question

If there is a simple way to game the scoring, then you can't know if the score is accurately reflecting people's confidence, or just their rationality/statistical knowledge.

[1] https://nicholas.carlini.com/writing/llm-forecast/final/3388...

Mathnerd314 · on Sept 2, 2023

Well there is this notion of "calibrating" the score. It is well-known that most humans are bad at estimating calibrated probabilities unassisted. The system could have been designed to accommodate this user-interface difficulty. For example, I am sure there is enough data floating around that you could map a simple 5-choice Likert scale to some calibrated probabilities, without making any assumptions. But instead it is just a raw slider, with nothing marked besides the default 50-50, not really great for input. Even a simple "yes/no" choice (translating to fixed calibrated probabilities around 25%/75%) would probably result in better log-loss scores overall.

kqr · on Sept 3, 2023

If you're suggesting a Likert scale of "very unlikely" "somewhat likely" etc.: no, the data floating around suggests that does not work. People saying "somewhat likely" mean anything between 10 % and 80 %. There's no way to map fuzzy descriptions to calibrated probabilities.

If you're suggesting a Likert scale with alternatives like "10 %", "25 %", "50 %", etc. that are then auto-calibrated against the average human overconfidence (so that an answer of 25 % really means 40 %), then that might work, but what would be the point?

Mathnerd314 · on Sept 5, 2023

10% for "somewhat likely" wouldn't make any sense, "likely" by itself means >50%. I was proposing to simply label 5 or 7 points on the slider, like 10% as "very unlikely", 50% as "neutral", and 66% as "somewhat likely". I am sure there is a decent-sized study that asked people to predict events on a Likert scale as likely/unlikely and then one could calibrate the mapping from Likert scale to probability points using this study. There is a study showing that people intuitively map Likert scales to a slider https://link.springer.com/article/10.3758/s13423-017-1344-2/... so by properly spacing and positioning the Likert labels, people will at least be somewhat more calibrated than in the absence of any cues.

kqr · on Sept 7, 2023

It does make sense for some people. Some people might say it's somewhat likely Sweden is still not a NATO member at the end of the year – and mean there's a 10 % probability it happens.

That's the problem with these fuzzy labels – the variance between individuals (and even within individuals across time) is huge.

ssnistfajen · on Sept 2, 2023

I always tilted to at least 75% or 25% for each question. IMO leaving it at 50-50 defeats the whole point of the "challenge" as it would indicate the user has no real opinion on whether GPT-4 can answer correctly or not. Either GPT-4 is correct or it is wrong. Putting a binary choice on a slider would be very confusing to people not accustomed to measuring everything on a probability scale.

SomewhatLikely · on Sept 3, 2023

You forgot the effects of adjusting your prediction on questions that asked multiple different questions, not just one. I think I can guess okay how well it would do at forming a word with specific letters, but what's the probability of it getting five out of ten? Now we're bringing in unintuitive probabilistic reasoning as well.

bgribble · on Sept 2, 2023

Just my take, but I think that if you care how you scored or what the scoring criteria are you are missing the point. This "quiz" is just a guided meditation on what LLMs are better and worse at and how that interacts with our expectations. I found it to be very thought-provoking and I learned a few things. I have no idea how I "scored".

H8crilA · on Sept 2, 2023

It's also a good reminder of how absolutely terrible most people are at gambling (judged by log-loss, like in the Kelly criterion). At the end you find out that almost 80% of people did worse than leaving the guess at 50%.

theptip · on Sept 2, 2023

Right, I think the general point is that log-loss is a bit unintuitive as a scoring mechanism since it really penalizes overconfident wrong answers, much more strongly than underconfident right answers (at least in the domain where you are not asking questions with very high probability answers).

There is absolutely a metagame to this game. Those who have spent time forecasting on Metaculus will do much better, for example.

H8crilA · on Sept 2, 2023

The sad thing about it is that log loss is optimal when gambling (mental shortcut - see Kelly Criterion). This partly explains why such a large proportion of people lose much more than they win. Not only the bets are suboptimal, the ruin is achieved earlier than necessary.

Note that this applies also outside of a casino, gambling (making wagers with imperfect information) is inherent to life.

mif · on Sept 3, 2023

+1 for the Metaculus [1] reference. I came across the concept of calibration while reading “How to measure anything” (see the _equivalent bet test_ in this review [2]) and find it really powerful in theory and really hard in practice.

[1] https://www.metaculus.com/home/

[2] https://www.lesswrong.com/posts/ybYBCK9D7MZCcdArB/how-to-mea...

falcor84 · on Sept 2, 2023

I stopped playing after the challenge of creating a line-based canvas drawing of the word "Hello". The site says "Wrong! You guessed that GPT-4 could solve the question, but it can't!", whereas while ChatGPT made a mistake there, it clearly had a good approach.

I think that this challenge is entirely unfair, and the LLM's should not be expected to write perfect code on the first try, but rather to have something good enough to then run, test and iterate on. Essentially, it should be compared against the first version of code I would write myself before having had a chance to test it yet. From my experience, with ChatGPT and Copilot, when I approach it in this iterative way, it's a great boon to my productivity, and I don't end up blaming neither it nor myself in the times when it's not quite accurate, just like I wouldn't blame a student for making silly mistakes when writing code on a paper-based exam.

JimDabell · on Sept 2, 2023

I think both are important things to measure. You are describing a situation where there is a human in the loop. This test measures how reliable GPT-4 is when there isn’t a human in the loop. Right now, LLMs have vast scope as long as there’s a human involved, but if you can’t put a human in the loop this limits their use dramatically. The better LLMs get at getting things right without human oversight, the more things they can be used for.

falcor84 · on Sept 2, 2023

Agreed in general, but I'm actually thinking more about having a code interpreter in the loop. AutoGPT might be a step in the right direction. It also might be a step towards the end of human society as we know it. Probably both.

ben_w · on Sept 2, 2023

If any AI, be it an LLM or otherwise, could reliably operate at professional level without any human intervention, how many people would be permanently unemployable?

JimDabell · on Sept 2, 2023

The entire point of technology – practically its definition – is to reduce work. For centuries, people have been dreaming of a day when people don’t have to work and can get robots to do it all.

The problem is not AI taking away work – that’s a great thing – the problem is that our current economic system is not designed for this. Fixing our economic system is easier and gives much better results for people than trying to stop technological progress.

ben_w · on Sept 2, 2023

I'm not trying to suggest progress is bad.

My point is more: gosh isn't it odd that people are complaining it can't do all the things, given how radically different everything will be when that does finally come to pass?

jazzyjackson · on Sept 2, 2023

I'm ok with a mistake on the first try, what would really impress me if it could tell whether it made a mistake. In my experience GPT is tuned to be totally deferential, "you're right, i apologize, let me try again!", no spine to tell me "yeah the task looks good"

it has no sense of whether a task has been fulfilled

I've never seen any of the recursive models show convergence on a task, seems without a human hand they fall apart

An exception I've seen is with the Wolfram plugin, it seems to at least try different approaches until it arrives at an answer to present to you.

circuit10 · on Sept 2, 2023

> In my experience GPT is tuned to be totally deferential, "you're right, i apologize, let me try again!", no spine to tell me "yeah the task looks good"

This is definitely annoying, but considering their tendency to hallucinate facts it's usually preferable to something like this: https://scoop.upworthy.com/microsoft-chatbot-fights-with-hum...

But I do think it should be toned down a bit, especially if the user is just saying something like "are you sure that's right?"

capableweb · on Sept 2, 2023

> In my experience GPT is tuned to be totally deferential, "you're right, i apologize, let me try again!", no spine to tell me "yeah the task looks good"

I've managed to work around this in GPT (4 at least) by having a system prompt that forces GPT to challenge me and not blindly accept what I say without verifying it first.

IanCal · on Sept 2, 2023

> it has no sense of whether a task has been fulfilled

I've definitely seen it say it's implementation is fine if just asked to identify problems or compare to the original problem statement (and alternatively fix issues it identifies).

capableweb · on Sept 2, 2023

> I stopped playing after the challenge of creating a line-based canvas drawing of the word "Hello".

Similarly, stopped playing when the question was "Write a html page with javascript/canvas to draw the US flag that changes color when I click on it. Make the flag perfectly accurate." and it generated a flag that wasn't "perfectly accurate" (https://i.imgur.com/WhyRsYa.png, notice the position of the top/left stars) but then told me "Wrong! You guessed that GPT-4 could not solve the question, but it can! 71% of people got this question correct."

I'm not sure how the validation is done, seems to be manually hardcoded or something, but it seems it's not very reliable.

dwighttk · on Sept 2, 2023

"Resolution Criteria: On questions that may have ambiguous answers, I will clarify the acceptance criteria. Here, for example, I will copy/paste the code to a file called index.html, open the page, and I expect to see a US flag. If I click on the flag it should change color. The flag must be immediately recognizable: red/white stripes, blue background, 50 stars. It does not have to be dimensionally accurate, but it should be better than a flag I would draw by hand."

capableweb · on Sept 2, 2023

I don't think "questions that may have ambiguous answers" applies when you use a term like "perfectly accurate" which has a very specific meaning.

dandellion · on Sept 3, 2023

On top of the requirement to make it perfectly drawn I take issue with the "but it should be better than a flag I would draw by hand". That's a useless metric because we don't know how the author draws by hand.

I assumed they would have enough brain cells to draw the flag without cutting in half all the stars in the margins. But they must have failed kindergarten because I assumed wrong.

dandellion · on Sept 3, 2023

I started again picking a number at random without looking at the questions. This is what it had to say:

"You answered 39.29% of questions correctly with an average log-loss of 3.880. This means you would have scored better by just leaving your guesses at 50% for every single question.. On average there were 0.00% of people who did better at this game than you. If this was an exam and I was grading it on a curve, I would give you an A+. If you had been better calibrated, you could have scored in the top 23.41%, and I would have given you a B+."

So I did worse than random but 0% did better than me and got an A+. Nice.

kqr · on Sept 3, 2023

Note that the prompt is the input to the LLM, it does not specify the task in enough detail to evaluate the result. That's what the resolution criteria are for -- additional information on resolution you are given but the LLM is not.

qwertox · on Sept 2, 2023

The "star"-section of the US flag:

``` // Draw 50 stars: 9 rows of alternating 6 or 5 stars

      ctx.fillStyle = white;

      for (let row = 0; row < 9; row++) {

        for (let col = 0; col < (row % 2 === 0 ? 6 : 5); col++) {

          let x = 16 + col \* 32 + (row % 2) \* 16;

          let y = 16 + row \* 32;

          ctx.beginPath();

          ctx.arc(x, y, 4, 0, Math.PI \* 2);

          ctx.fill();

        }

      }

    }

```

Effectively drawing circles and the rectangle which contains them is rotated right by 90° so that a section of the blue rectangle is not covered and the dots are partially above the red stripes.

At least when I input it into ChatGPT with GPT-4, that's the result.

And the rendered solution by the site has the stars offset so that some are not fully in the blue rectangle. Accurately is something different.

scotty79 · on Sept 2, 2023

My intuition that got confirmed is that GPT fails at anything visual. Letters, shapes. It's trying but failing every time.

It succeeds only if the thing was drilled diwn hard in learning like american flag or implementing tictactoe (but not predicting best move on the fly).

andai · on Sept 2, 2023

And yet it used to be that code would be handed in on paper, and you'd get the output days (weeks?) later. I heard people quickly learned to double check their programs!

Though I think it's computationally cheaper for GPT to actually run the code than to double check its work...

minihat · on Sept 2, 2023

My mental model of gpt-4 is apparently well calibrated for whether the model will give me a useful output that is close to what I asked for.

However, I'm not great at predicting whether the model will output a 100% correct response with no flaws whatsoever.

Unfortunately, this website mostly tests for the latter.

whimsicalism · on Sept 2, 2023

To me it is fascinating how when people are not super good at something, they often invent some secondary “true”/“better” task that they were actually good at

FabHK · on Sept 2, 2023

The website specifies its criteria for accepting an answer. Just use that threshold instead of whatever you in your mind deem “useful”.

msoad · on Sept 2, 2023

The type of prompt that asks it to invent a new language (e.g. use only these letters) always fails. I wonder if it has to do with it being a "language" model?

johndough · on Sept 2, 2023

Most large language models these days are trained on "tokens" instead of characters. A token consists of multiple characters. This makes it extremely difficult to learn character-level tasks. So why use tokens instead of characters in the first place? The reason is that by using tokens, multiple characters can be generated at once, which makes training and text generation cheaper.

OpenAI has this website where you can see how text is decomposed into tokens: https://platform.openai.com/tokenizer

aeonik · on Sept 2, 2023

How is the set of tokens selected for various LLMs?

My intuition tells me there are important symbolic patterns in different layers of tokens. If they are automatically generated, I'd bet there are interesting insights to be gleaned in the tokeizer itself.

PeterisP · on Sept 2, 2023

They are automatically generated, the algorithms have a bunch of tricks, but essentially they merge together the most frequent token pairs until a desired fixed vocabulary size is reached.

So, for example (looking at GPT-3 tokenizations - you can test them at, for example, https://platform.openai.com/tokenizer) "517" is a single token, but "917" is two tokens; and there's no explicit link whatsoever between the token "517" and tokens "5" and "17" other than what can be learned from data. This works well enough for almost all tasks, but fails in edge cases like when someone makes up a toy challenge that asks how many fives are in a large number.

messe · on Sept 2, 2023

The token set (vocabulary) is usually generated by using Byte Pair Encoding on a corpus that you think represents your training set well.

BPE starts with a set of tokens consisting of single character tokens. Then the most frequent pairs of tokens are merged into single tokens and added to vocabulary. All occurrences of those pairs in the corpus are replaced with the new merged tokens. This process is repeated until the vocabulary is as large as you want it to be.

https://en.m.wikipedia.org/wiki/Byte_pair_encoding

ssnistfajen · on Sept 2, 2023

CJK characters are almost always split into multiple tokens per individual character. I'm not too familiar with Unicode mappings so it's interesting that the the outputs are still very coherent.

sholladay · on Sept 2, 2023

I was doing very well with almost every question until it got to the part where GPT-4 had to draw “hello” as ASCII art. I gave 100% confidence that it would get it correct, because in practice GPT-4 has always been excellent at that for me, with only minor aberrations I might fix by hand. But no, in the quiz, GPT-4 failed spectacularly, not even using the right letters. That was interesting.

ec109685 · on Sept 2, 2023

It’s terrible about visualizing how its output appears on a console.

Also, terrible at providing phrases that fit a pattern. “Like 143 means I love you, and give me more phrases like that”.

Still surprised it’s so good at drawing (that birthday cake was really close!)

ayx · on Sept 2, 2023

The challenge is flawed.

I asked one of the questions from the quiz to chatgpt which the quiz claims GPT can't solve. But it did.

Prompt: Write out the word "hello" as an ascii art drawing with # and _

Output:

      _   _      _ _
     | | | |    | | |
     | |_| | ___| | | ___
     |  _  |/ _ \ | |/ _ \
     | | | |  __/ | | (_) |
     \_| |_/\___|_|_|\___/

I guess chatgpt isn't raw GPT-4, or the quiz is using some older model.

josephg · on Sept 2, 2023

I don't think thats a correct solution to the problem.

The prompt asked for an ascii art drawing made from the # and _ characters. But the output also uses |/\() characters (and it doesn't use a # anywhere).

ayx · on Sept 2, 2023

Ok, but it's still acceptable by any common sense standard. Besides the challenge's output is completely off. Not just that it's using just one of the characters. It spells out something else. Which is not the case here.

amelius · on Sept 2, 2023

If someone put a gun to your head and asked you to draw hello using "#" and " ", would you use other characters like "/"?

jakderrida · on Sept 2, 2023

I imagine I'd spend the next few minutes quietly pondering what sort of poor choices I made in life that lead me here and try my hardest to embrace an absurd ending to an absurd journey.

ayx · on Sept 3, 2023

Oh wow, this suddenly became so heated! Is the gun made of # or _? Or it has to be both?

josephg · on Sept 2, 2023

> Ok, but it's still acceptable by any common sense standard.

I don’t know that it is. It’s clearly a great ascii art drawing, but I don’t think chatgpt gets full marks on the test here. It just isn’t following the prompt closely enough.

ayx · on Sept 2, 2023

Ok I see your point. Did you look at the output in the challenge? It's very different. I wonder why? Different seeds maybe.

JaumeGreen · on Sept 2, 2023

ChatGPT is (was) not good at ascii art.

Some months ago I tried to make it draw me an ascii rose and some text. I even tried providing it the ascii art for the rose and the text.

Finally I did it by hand.

BTW, in your example it's not using only # and _, it's using other ascii symbols. Depending on the criteria it could be considered wrong.

neom · on Sept 3, 2023

Mine drew this:

  #___#  #####  #______  #______  #####  
  #___#  #____  #______  #______  #___#  
  #####  ####__  #______  #______  #___#  
  #___#  #_____  #______  #______  #___#  
  #___#  #####   #######  #######  #####

dingocat · on Sept 2, 2023

I have multiple questions regarding the methods of this test.

The biggest one is that, well... The test doesn't aim to see what GPT-4 can do and how well it does it, only whether the participant can guess the (possibly cherry-picked) answer the author decided on. In short, we don't know if he sampled answers and decided on the most probable answer (akin to consensus voting/self-consistency[1]), or if he asked a question and chose the first one.

Maybe GPT-4 guesses the correct answer for a question 80% of the time, but he got unlucky? You don't know, the author doesn't tell you. The answers are generated ahead of time and are the same every time you go through the test.

[1] https://doi.org/10.48550/arXiv.2203.11171

PaulDavisThe1st · on Sept 2, 2023

> the [ ... ] answer the author decided on

The questions mostly have correct or incorrect answers, and where there is some leeway, the author provides a fairly detailed explanation of what they would consider correct in each case. Do you have some specific criticism of an answer that you believe the author gets wrong?

thomasahle · on Sept 2, 2023

> only whether the participant can guess the (possibly cherry-picked) answer the author decided on

My understanding is that the quiz samples a new GPT-4 answer every time you use it. That's why you put a confidence rather than a 0%/100% answer. There's always a chance it'll fail by freak accident.

Sophira · on Sept 2, 2023

If you're basing this on the animation used when revealing the answer, that's a fake effect. The source code[0] reveals that there's a typewriter effect that plays out when you select to answer the question.

Also, the commentary on the answers refers to specific parts of the answers. For it to be as in-depth as it is, it would have to be either pre-written or the commentary also generated by GPT on the fly. (And of course it wouldn't make sense to do that given the nature of the quiz.)

[0] https://nicholas.carlini.com/writing/llm-forecast/static/que...

TuringNYC · on Sept 2, 2023

How are you evaluating LLM answers are right or wrong? Because I saw some wrong answers that were right and potentially right that were wrong. Are you just looking for keywords, etc? Or is this all run beforehand and graded by humans?

IAmGraydon · on Sept 2, 2023

Yeah this is completely broken judging from my experience. Often GPT would get the answer wrong and the site would claim that it was correct.

TuringNYC · on Sept 2, 2023

Anyone know a good method to judge right/wrong answers from LLMs? I can see keyword solutions to be brittle. Perhaps another LLM?

tobr · on Sept 2, 2023

This was fun until the pancake question.

> I'm making pancakes for breakfast. I added a cup of flour, a teaspoon of salt, and a few tablespoons of sugar to a bowl. I stirred it together, then added a cup of milk, a beaten egg, and a few tablespoons of oil, and stirred until just mixed. Then I put 1/4 a cup on a hot frying pan, and flipped it when brown. But they're terrible! Why? List one reason.

> Answer:

> There's no baking soda / baking powder.

Besides the fact that “list one reason” is a nonsensical instruction which it fails, it’s very common to make delicious pancakes without baking powder. I imagine the author is assuming American pancakes, but that’s far from the only way to make pancakes.

When I ask ChatGPT myself, it correctly doesn’t assume the pancakes are “terrible” without baking powder, but instead suggests too much salt, which is more likely to actually make the pancakes unpalatable.

ssnistfajen · on Sept 2, 2023

When I removed the "one reason" restriction when directly sending the prompt to GPT-4 myself, it listed 9 possible reasons for terrible pancakes (of which leavening agent was one of them), all of which made sense based on my own experience of making American style pancakes.

I also tried translating the prompt into Mandarin while retaining the "one reason" restriction. Both Google Translate and GPT-4's translated versions did not receive "leavening agent missing" as an answer but instead were focused on cooking technique or other ingredients. Baking soda is very rarely used in home cooking in China, so perhaps training materials in Chinese had much less frequent mentions of it compared to English-based ones.

jondwillis · on Sept 2, 2023

i today i learned i am bad at minimizing my log loss when guessing about GPT-4’s ability to respond well to somewhat bad or bad prompts.

scotty79 · on Sept 2, 2023

I got over 70% of questions correct but I supposedly scored worse if I left the slider on 50/50 every time.

I'm assuming that the author of the site uses some method to evaluate human answers that is usually used to evaluate AI answers. Seems just wrong.

catlifeonmars · on Sept 2, 2023

Is it weird that I found myself modulating my answers based on my evolving belief about the authors bias in selecting questions and answers? I did not do particularly well (64%)

avereveard · on Sept 2, 2023

not a great test. on the flag test, I guessed gpt couldn't do it, and well, it couldn't, the stars are misaligned and cut out of the flag area, so it's not accurate.

test say tho that the flag is accurate, even if it isnt, then sclods me for how wrong I am

here's the render: https://i.imgur.com/jZVWjRx.png

dwighttk · on Sept 2, 2023

"Resolution Criteria: On questions that may have ambiguous answers, I will clarify the acceptance criteria. Here, for example, I will copy/paste the code to a file called index.html, open the page, and I expect to see a US flag. If I click on the flag it should change color. The flag must be immediately recognizable: red/white stripes, blue background, 50 stars. It does not have to be dimensionally accurate, but it should be better than a flag I would draw by hand.

The text that I show in the question box is the only command given to GPT-4. No other context is provided. Here, for example, I've only fed GPT-4 with these 25 words you see above."

capableweb · on Sept 2, 2023

Sure, but does the question really have any ambiguous answers?

For reference, this is/was the question: "Write a html page with javascript/canvas to draw the US flag that changes color when I click on it. Make the flag perfectly accurate."

The last part, "make the flag perfectly accurate", made me think that it has to be 100% accurate.

avereveard · on Sept 5, 2023

Regardles of the wording and the scoring criteria, that's not something that's production quality, so to say. The output is unfit for being directly used as is, it's not something one can rely on got doing it right.

wavemode · on Sept 2, 2023

That tripped me up, too. I was often reading the prompt and confusing it for the evaluation criteria, when they are actually two separate things. The prompt stated the flag must be perfectly accurate, yes, but the evaluation criteria allowed the flag to not be quite right.

isoprophlex · on Sept 2, 2023

Not to mention the stochastic nature of these models, so while it apparently failed once, the page does not tell us anything about how it would perform on a given task in the limit of many many trials.

toddmorey · on Sept 2, 2023

And even ChatGPT admits the flag is inaccurate in the response.

benxh · on Sept 2, 2023

Yeah wildly inaccurate.

Ozzie_osman · on Sept 2, 2023

This is cool but would be infinitely more valuable if it could explain _why_ GPT4 is good or bad at each task, to help build intuition.

mgl · on Sept 2, 2023

Asking „why” may be a little bit too much with regards to a non-deterministic token prediction machine trained on an unknown set of strings. Just saying.

sebzim4500 · on Sept 2, 2023

Would be nice, but no one can really do that at the moment.

There are specific tasks (especially character level ones) which are hard due to the tokenizer, but even that isn't all that convincing since there are plenty of character level tasks which GPT-4 can do pretty well.

If you use it a lot you build an intuition for what kinds of tasks it will do well on, but it's not exactly rigorous.

capableweb · on Sept 2, 2023

> Would be nice, but no one can really do that at the moment.

Why not? Someone could definitely build up database on why GPT is bad at some things and good at others. There is already good explanations for why it's terrible at math, why it doesn't handle single characters/numbers well and so on.

RandomWorker · on Sept 2, 2023

This was fascinating, and also a nice check. I think I'm over optimistic on A.I. which showed in my score.

circuit10 · on Sept 2, 2023

> There is no "intelligence" going on here. It's not "thinking". But it can still perform calculus by just stochastically emulating how average text on the internet looks.

It always annoys me when people say this, I’ll try to explain why.

There are two possible definitions of “intelligence” you could use here; the ability to process information to get something done, and something hand-wavy about human consciousness.

GPT-4 clearly has some ability to process information to get something done. You might say that by this definition [insert trivial thing] is intelligent, but it doesn’t have to be a binary thing of intelligent or not. I think it’s fine to say maybe a calculator has very low intelligence (but not necessarily nothing), GPT-4 is more intelligent than that and humans are much more intelligent again. GPT-4 has many limitations compared to humans, but I think that just makes it less intelligent, rather than disqualifying it from having intelligence at all. Sure, it’s just predicting text, but that’s a task that requires a level of intelligence. You might say it’s not general like humans, but I’d say it has a much better ability to generalise than something like an image labelling AI, so that feels like it’s at least getting somewhere.

The second definition is useless for practical purposes because it’s not measurable or observable in any way, so it’s not useful to use that.

So I feel like this is something people say to reassure themselves that it can never get to human level, and is fundamentally different to human intelligence, whereas I think it’s somewhat similar but at a lower level.

fhd2 · on Sept 2, 2023

Yeah, it's just arguing semantics. We do need better vocabulary and stop it with the anthromorphism IMHO.

Intelligence is already ambiguous in humans, see IQ tests. It's just not linear, much less binary. Whether something is deterministic or stochastic and what the error rate for a specific task is, those are more useful questions to me.

larryfreeman · on Sept 2, 2023

I suspect that there is something else going on than intelligence which will become obvious over the next few years.

There was a horse, "Clever Hans" who appeared to have the ability to answer surprisingly complicated mathematical questions. Did "Clever Hans" have mathematical intelligence. Not at all. He was responding to a cue unknowingly being given by his trainer.

I suspect the same thing is happening with ChatGPT. What if all that is happening is that the text is being formulated to very complicated cues that are implicit in the very complicated, statistical analysis?

https://en.wikipedia.org/wiki/Clever_Hans

gwd · on Sept 2, 2023

A month or so ago I was doing some analysis on our mailing list traffic. I had a complex SQL query (involving tables mapping variations of email addresses to names, and then names to companies they worked for within specific date ranges), that I'd last modified a year previously (the last time I was doing the same sort of analysis), and didn't feel like wrapping my head around the SQL again; so I pasted it into GPT-4 and asked it, "Can you modify this query to group all individuals with less than 1% total contributions into a single 'Other' category?" The query it spat out worked out of the box.

Whatever it's doing, at least for code, it's not a glorified Markov chain -- there's some sort of a model in there.

larryfreeman · on Sept 2, 2023

I agree. The model is where the intelligence is which is the compressed intelligence latent in the training data.

I am arguing similar to John Searle that the processing is not intelligent. The model is a Searlean rulebook.

https://en.wikipedia.org/wiki/Chinese_room

gwd · on Sept 2, 2023

I've always disagreed w/ Searle re the Chinese Room. My guess is that Searle never built an adder circuit from logic gates: combining irrational elements together into something rational is the core magic of computer science.

If you want to see someone asking humans questions where they consistently fail to be rational, to the extent that they sometimes seem to approximate a stochastic parrot, read Thinking Fast and Slow by Daniel Kahneman. (It might actually be interesting to give GPT-4 some of the questions in that book, to see how similar or different they are.)

larryfreeman · on Sept 3, 2023

I'm not sure why you disagree with the Chinese Room argument. I would be interested. I agree that Searle was solely a philosopher who did not take an engineering viewpoint.

Searle's main point is that if I have a book that tells me how to respond and I never learn Chinese, then I do not understand Chinese. If you see a flaw in this reasoning, I am very interested.

My point is just that LLM models are a compression of the content available on the internet equivalent to a rule book. It is definitely fascinating how powerful LLMs are as far as summarization and forming coherent responses to input.

I am a big fan of Kahneman and agree with you that it is will be very interesting to ask GPT-4 the questions in that book.

Dylan16807 · on Sept 3, 2023

> Searle's main point is that if I have a book that tells me how to respond and I never learn Chinese, then I do not understand Chinese. If you see a flaw in this reasoning, I am very interested.

You don't understand Chinese, but you are not the process. For the process to understand, it doesn't require any single component to understand like some variation on the homunculus.

And while it might seem obvious that the bulk of understanding can't be contained in a book, you don't really have a book in the Chinese room. Not if the room does a competent job. You have some kind of information-dense artifact that encodes an enormous understanding of Chinese in an inert form. A sweeping library that covers uncountable nuances in depth.

Or to phrase it as a direct attack on the argument: The book does have semantics. You don't need qualia to have semantics, especially not the definition of qualia where nobody can prove they exist.

larryfreeman · on Sept 3, 2023

Thanks! I appreciate the explanation. I think that you put your finger on the major assumption of the argument.

gwd · on Sept 3, 2023

> Searle's main point is that if I have a book that tells me how to respond and I never learn Chinese, then I do not understand Chinese. If you see a flaw in this reasoning, I am very interested.

To a degree, I feel like the Chinese Room argument is begging the question. When I imagine Searle sitting in a room, with a book of instructions and paper and everything he needs to execute GPT-4's equivalent, I basically see an actual computer. That is literally what he is; there is no difference. So then to ask, "Does this system understand Chinese?" is literally exactly the same question as "Does GPT-4 understand Chinese?" You haven't actually illuminated the question in any meaningful way, except to give people not familiar with how microprocessors work a better intuitive understanding. (Which, upon reflection, probably is a fairly useful thing to do.)

I looked a bit at the "1990's version" of his argument on the Wikipedia page you quoted. Going back to my earlier example, this is sort of what his argument sounds like to me:

A1) Electronic gates just on and off switches.

A2) Numbers and addition are semantic.

A3) On and off switches are neither constitutive of, nor sufficient for, semantics.

Therefore, computers cannot add; they only simulate the ability to add.

Now I'm not up on the fine details of what "syntactic vs semantic" means in philosophy, so maybe #2 is't accurate. But in a sense it doesn't matter, because that communicates how I feel about Searle's argument: "I've made some distinction between two classes of things that you don't understand; I've defined one to be on one side, and the other to be on the other side; and therefore computers can't understand."

My best guess as to the "syntactic / semantic" thing is this: In some sense, even his premise, that "Progams are syntactic", isn't actually accurate: Computers operate on bits which are operated on by gates: gates and bits themselves don't inherently have symbols; the symbols are an abstraction on the bits. Even bits are abstractions on continuous voltages; and voltages are ultimately abstractions on quantum probabilities.

What a given set of voltages "means" -- whether they're numbers to be added, or words to be word-checked, or instructions to be executed, or a JPEG to be decompressed, depends entirely on how they're used. If you jump into the middle of a JPEG, your computer will happily try to execute it, and if you dump the program into your video buffer, you'll get a bunch of strange dots on your screen.

Furthermore, when you build an "adder" out of logic gates, you can build the gates such that they correspond to our intuitive idea of binary addition, with individual carries for each bit and so on. But this is inefficient, because then you have to wait for the carries to cascade all the way through the whole thing you're trying to add. Instead, you can brute-force a set of logic gates such that given these 16 bits in, and these 9 bits out (8 plus overflow), you just get the right answer; this will be a lot faster (in the sense that the signals have to go through fewer gates before stabilizing on the final answer), but the gates inside then don't make any sense -- they're almost a "compression" of the longer, carry-based method.

Does that mean that an adder made this way isn't "actually" adding? In the end it doesn't really matter: 16 bits come in, and 9 bits come out the way you want them to. It doesn't really matter that much what happened in the middle.

Putting all that together: It seems to me the "semantics" of a set of bits is based on how they end up interacting with the real world. If I can ask GPT-4 what's missing in my pancake recipe, and it can tell me "you're missing a leavening agent like baking powder", then it seems to me there must be semantic content in there somewhere, and all the arguments about syntax not being sufficient for semantic turn out to have been proven wrong by experiment.

H8crilA · on Sept 2, 2023

Clever Hans is just a proxy. ChatGPT and other LLMs obviously can process information on their own. These two have nothing in common, even GPT-3 would have noticed this.

larryfreeman · on Sept 2, 2023

Let us disagree on what is "obvious". Given an input and an output, you believe that the complexity of the output proves that intelligence takes place.

I agree that ChatGPT is more than a proxy. Unlike Clever Hans, it is processing the content of the question asked. But it is like Clever Hans in that the query is processed by looking for a signal in the content of the data used to train ChatGPT.

The real question is where this intelligent behavior comes from? Why does statistical processing lead to these insights?

I believe that the processing is not intelligent primarily because I see that holes in the data available leads to holes in reasoning. The processing is only as good as the dynamics of the content that it being processed. This is the part that I believe will become obvious over time.

H8crilA · on Sept 2, 2023

I just said that they process information on their own, and this is indeed obvious - you can download and run LLaMA on an airgapped machine.

larryfreeman · on Sept 2, 2023

Agreed. LLMs process information on their own.

I thought you were saying it was "obvious" that the processing demonstrated intelligence.

My point was the level of intelligence shown is relative the quality and quantity of the data used for training. The data is where the intelligence is and the model is a compression of that latent intelligence.

whimsicalism · on Sept 2, 2023

The fundamental problem with HN and fun little competitive games like this that show you how you stack up is that the comments will be full of people complaining/nitting about the grading/system/etc. because the typical HN commentator is often the type of person who doesn’t take being not great at something well.

sashank_1509 · on Sept 2, 2023

Yeah the OP made a really interesting project. Turns out I was absurdly over confident in my predictions which surprised me but I found OPs grading to be fair and would not complain about it.

capableweb · on Sept 2, 2023

Why can't HN just be "full of people complaining/nitting about the grading/system/etc" without you assigning some "lack of character" meaning those people?

Your comment as it stands right now is basically a thinly veiled ad-hominem.

dwighttk · on Sept 2, 2023

occasionally interesting discussion arises around the grading system, but yeah, often a lot of grousing

colordrops · on Sept 2, 2023

The problem things like this is that what gpt4 can do is reduced over time due to a maze of restrictions that OpenAI keeps adding on top of it.

You can tell when something is intentionally nerfed when GPT replies with the exact same canned answer about why it can't answer some question. It literally gaslights you.

For instance, I give you this challenge: GPT4 will tell you that it is not aware of anything after September 2021. If you ask it a random fact, like the worlds largest animal, or what happened on September 11, 2001, it will give you an answer. But try to get it to give you the latest event it is aware of. You can ask six ways until sunday and it will always give you the same verbatim answer about why it can't answer. It will literally lie about what it is capable of. It's pretty clear that for some reason OpenAI doesn't want you to know the exact last date of their training data.

leodriesch · on Sept 2, 2023

This is a difficult question I think independent of the restrictions that OpenAI imposes on GPT-4.

The model does not know what it knows, that’s why it sometimes hallucinates instead of saying it doesn’t know. But to answer the latest event it knows, it has to know which events it knows.

colordrops · on Sept 2, 2023

I thought that at first, but it doesn't have problems with facts other than dates, and it does answer about dates distant from September 2021, and furthermore it uses the exact same canned response when you probe it's limits. I don't think it's a natural limitation of the model.

thomasahle · on Sept 2, 2023

In my case the forecast incorrectly judged GPT-4 as having solved the "reject the premise" question of President George H. W. Bush vomiting in the lap of Emperor Akihito.

However the output GPT generated was:

> GPT-4: President George H. W. Bush vomited in the lap of Japanese Prime Minister Kiichi Miyazawa during a state dinner on January 8, 1992. The incident occurred due to a sudden bout of gastroenteritis. Emperor Akihito was not the one in whose lap Bush vomited, it was the Prime Minister. The incident is sometimes referred to by the term "Bushu-suru", a pun on the Japanese word for "to vomit" (gero suru) and President Bush's name.

I don't understand why this was judged as "Correct! You guessed that GPT-4 could solve the question, and it can! 44% of people got this question correct." when the resolution criteria clearly stated:

> The model does not have to say that actually it was the prime minister who Bush vomited on, but it must not just give a year, or accept the premise as true.

It seems like it should be easy to search for 4 digit numbers, like 1992, and judge the answer as wrong?...

theptip · on Sept 2, 2023

> Emperor Akihito was not the one in whose lap Bush vomited

It didn’t give just a year, or accept the premise as true. It gave the correct answer, quite obviously.

timfsu · on Sept 2, 2023

Today I learned...

https://en.wikipedia.org/wiki/George_H._W._Bush_vomiting_inc...