Hacker News

Not a great test. On the flag test, I guessed GPT couldn't do it, and, well, it couldn't: the stars are misaligned and cut off outside the flag area, so it's not accurate.

The test says, though, that the flag is accurate, even if it isn't, then scolds me for how wrong I am.

here's the render: https://i.imgur.com/jZVWjRx.png



"Resolution Criteria: On questions that may have ambiguous answers, I will clarify the acceptance criteria. Here, for example, I will copy/paste the code to a file called index.html, open the page, and I expect to see a US flag. If I click on the flag it should change color. The flag must be immediately recognizable: red/white stripes, blue background, 50 stars. It does not have to be dimensionally accurate, but it should be better than a flag I would draw by hand.

The text that I show in the question box is the only command given to GPT-4. No other context is provided. Here, for example, I've only fed GPT-4 with these 25 words you see above."


Sure, but does the question really have any ambiguous answers?

For reference, this is/was the question: "Write a html page with javascript/canvas to draw the US flag that changes color when I click on it. Make the flag perfectly accurate."

The last part, "make the flag perfectly accurate", made me think that it had to be 100% accurate.
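For what it's worth, "accurate" here has a precise target: the official US flag puts 50 stars in nine alternating rows of six and five inside the blue canton. A minimal sketch of that grid (hypothetical helper name; coordinates expressed as fractions of the canton, per the standard 6/5 row layout) that any canvas render would have to reproduce:

```javascript
// Canonical star field: 9 rows alternating 6 and 5 stars (50 total).
// Returns star centers as fractions of the canton's width/height.
function starPositions() {
  const positions = [];
  for (let row = 0; row < 9; row++) {
    const cols = row % 2 === 0 ? 6 : 5;    // 6-star and 5-star rows alternate
    const xOffset = row % 2 === 0 ? 1 : 2; // 5-star rows are shifted half a step
    for (let col = 0; col < cols; col++) {
      positions.push({
        x: (xOffset + col * 2) / 12, // fraction of canton width
        y: (row + 1) / 10,           // fraction of canton height
      });
    }
  }
  return positions; // 50 entries, all strictly inside the canton
}
```

Stars landing outside these canton fractions, as in the linked render, are exactly the kind of error the prompt's "perfectly accurate" wording would seem to rule out.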


Regardless of the wording and the scoring criteria, that's not production quality, so to speak. The output is unfit for direct use as-is; it's not something one can rely on to get it right.


That tripped me up, too. I kept reading the prompt and mistaking it for the evaluation criteria, when they are actually two separate things. The prompt stated the flag must be perfectly accurate, yes, but the evaluation criteria allowed the flag to not be quite right.


Not to mention the stochastic nature of these models: while it apparently failed once, the page tells us nothing about how it would perform on a given task in the limit of many, many trials.


And even ChatGPT admits the flag is inaccurate in the response.


Yeah wildly inaccurate.



