Hacker News

Not a great test. On the flag test, I guessed GPT couldn't do it, and, well, it couldn't: the stars are misaligned and cut off outside the flag area, so it's not accurate.

The test says, though, that the flag is accurate, even if it isn't, then scolds me for how wrong I am.

here's the render: https://i.imgur.com/jZVWjRx.png



"Resolution Criteria: On questions that may have ambiguous answers, I will clarify the acceptance criteria. Here, for example, I will copy/paste the code to a file called index.html, open the page, and I expect to see a US flag. If I click on the flag it should change color. The flag must be immediately recognizable: red/white stripes, blue background, 50 stars. It does not have to be dimensionally accurate, but it should be better than a flag I would draw by hand.

The text that I show in the question box is the only command given to GPT-4. No other context is provided. Here, for example, I've only fed GPT-4 with these 25 words you see above."


Sure, but does the question really have any ambiguous answers?

For reference, this is/was the question: "Write a html page with javascript/canvas to draw the US flag that changes color when I click on it. Make the flag perfectly accurate."

The last part, "make the flag perfectly accurate", made me think that it had to be 100% accurate.
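For what it's worth, "accurate" here has a precise target: the official US flag puts 50 stars in nine alternating rows of six and five inside the blue canton. A minimal sketch of that grid (hypothetical helper name; coordinates expressed as fractions of the canton, per the standard 6/5 row layout) that any canvas render would have to reproduce:

```javascript
// Canonical star field: 9 rows alternating 6 and 5 stars (50 total).
// Returns star centers as fractions of the canton's width/height.
function starPositions() {
  const positions = [];
  for (let row = 0; row < 9; row++) {
    const cols = row % 2 === 0 ? 6 : 5;    // 6-star and 5-star rows alternate
    const xOffset = row % 2 === 0 ? 1 : 2; // 5-star rows are shifted half a step
    for (let col = 0; col < cols; col++) {
      positions.push({
        x: (xOffset + col * 2) / 12, // fraction of canton width
        y: (row + 1) / 10,           // fraction of canton height
      });
    }
  }
  return positions; // 50 entries, all strictly inside the canton
}
```

Stars landing outside these canton fractions, as in the linked render, are exactly the kind of error the prompt's "perfectly accurate" wording would seem to rule out.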


Regardless of the wording and the scoring criteria, that's not production quality, so to speak. The output is unfit for direct use as-is; it's not something one can rely on to get it right.


That tripped me up, too. I kept reading the prompt and mistaking it for the evaluation criteria, when they are actually two separate things. The prompt stated the flag must be perfectly accurate, yes, but the evaluation criteria allowed the flag to not be quite right.


Not to mention the stochastic nature of these models: while it apparently failed once, the page tells us nothing about how it would perform on a given task in the limit of many, many trials.


And even ChatGPT admits the flag is inaccurate in the response.


Yeah wildly inaccurate.



