
> I stopped playing after the challenge of creating a line-based canvas drawing of the word "Hello".

Similarly, I stopped playing when the question was "Write a html page with javascript/canvas to draw the US flag that changes color when I click on it. Make the flag perfectly accurate." and it generated a flag that wasn't "perfectly accurate" (https://i.imgur.com/WhyRsYa.png, notice the position of the top/left stars), but then told me "Wrong! You guessed that GPT-4 could not solve the question, but it can! 71% of people got this question correct."
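
For what it's worth, the task itself is small: one canvas, thirteen stripes, a canton, fifty stars in nine alternating rows of six and five, and a click handler. Here's a minimal sketch of the kind of page the prompt asks for; the proportions, colors, and the particular color swap on click are my own choices, not the quiz's reference answer:

  <!DOCTYPE html>
  <html>
  <body>
  <canvas id="flag" width="760" height="400"></canvas>
  <script>
  // Minimal sketch: 13 stripes, a canton, 50 stars in 9 alternating
  // rows of 6 and 5, and a click handler that swaps the colors.
  // Proportions are approximate, not the official spec.
  const canvas = document.getElementById('flag');
  const ctx = canvas.getContext('2d');
  let clicked = false;

  function drawStar(cx, cy, r) {
    // Five-pointed star centered at (cx, cy) with outer radius r.
    ctx.beginPath();
    for (let i = 0; i < 5; i++) {
      const outer = -Math.PI / 2 + i * 2 * Math.PI / 5;
      const inner = outer + Math.PI / 5;
      ctx.lineTo(cx + r * Math.cos(outer), cy + r * Math.sin(outer));
      ctx.lineTo(cx + 0.4 * r * Math.cos(inner), cy + 0.4 * r * Math.sin(inner));
    }
    ctx.closePath();
    ctx.fill();
  }

  function drawFlag() {
    const w = canvas.width, h = canvas.height;
    const stripe = h / 13;
    const red = clicked ? '#008000' : '#B22234';   // arbitrary "changed" colors
    const blue = clicked ? '#800080' : '#3C3B6E';
    // 13 stripes, alternating red and white, starting with red.
    for (let i = 0; i < 13; i++) {
      ctx.fillStyle = i % 2 === 0 ? red : '#FFFFFF';
      ctx.fillRect(0, i * stripe, w, stripe);
    }
    // Blue canton covering the top 7 stripes and 40% of the width.
    const cw = w * 0.4, ch = stripe * 7;
    ctx.fillStyle = blue;
    ctx.fillRect(0, 0, cw, ch);
    // 50 stars: rows of 6 and 5 alternate, none touching the edges.
    ctx.fillStyle = '#FFFFFF';
    for (let row = 0; row < 9; row++) {
      const cols = row % 2 === 0 ? 6 : 5;
      for (let col = 0; col < cols; col++) {
        const x = cw * (col + (row % 2 === 0 ? 0.5 : 1)) / 6;
        const y = ch * (row + 1) / 10;
        drawStar(x, y, stripe * 0.3);
      }
    }
  }

  canvas.addEventListener('click', () => { clicked = !clicked; drawFlag(); });
  drawFlag();
  </script>
  </body>
  </html>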

I'm not sure how the validation is done; it seems to be manually hardcoded or something, but it doesn't appear to be very reliable.



"Resolution Criteria: On questions that may have ambiguous answers, I will clarify the acceptance criteria. Here, for example, I will copy/paste the code to a file called index.html, open the page, and I expect to see a US flag. If I click on the flag it should change color. The flag must be immediately recognizable: red/white stripes, blue background, 50 stars. It does not have to be dimensionally accurate, but it should be better than a flag I would draw by hand."


I don't think "questions that may have ambiguous answers" applies when you use a term like "perfectly accurate", which has a very specific meaning.


On top of the requirement to make it perfectly drawn, I take issue with "but it should be better than a flag I would draw by hand". That's a useless metric because we don't know how well the author draws by hand.

I assumed they would have enough brain cells to draw the flag without cutting all the stars in the margins in half. But they must have failed kindergarten, because I assumed wrong.


I started again, picking a number at random without looking at the questions. This is what it had to say:

"You answered 39.29% of questions correctly with an average log-loss of 3.880. This means you would have scored better by just leaving your guesses at 50% for every single question.. On average there were 0.00% of people who did better at this game than you. If this was an exam and I was grading it on a curve, I would give you an A+. If you had been better calibrated, you could have scored in the top 23.41%, and I would have given you a B+."

So I did worse than random, but 0% of people did better than me and I got an A+. Nice.
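
For anyone wondering about the scoring: log-loss on a yes/no question is -log(p), where p is the probability you assigned to the outcome that actually happened. A flat 50% guess costs -log(0.5) ≈ 0.693 per question (assuming natural logs; I don't know which base the site uses), so an average of 3.880 corresponds to answers that were confidently wrong, roughly:

  // Log-loss of a single yes/no prediction: -log(probability assigned
  // to the outcome that actually happened). Natural logs assumed here;
  // the quiz site's exact formula is a guess on my part.
  function logLoss(predicted, actual) {
    const p = actual ? predicted : 1 - predicted;
    return -Math.log(p);
  }
  console.log(logLoss(0.5, true).toFixed(3));   // 0.693 -- the "always 50%" baseline
  console.log(logLoss(0.02, true).toFixed(3));  // 3.912 -- confidently wrong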


Note that the prompt is only the input to the LLM; it does not specify the task in enough detail to evaluate the result. That's what the resolution criteria are for: additional information about how the question resolves that you are given but the LLM is not.



