I always tilted the slider to at least 75%, or down to 25% or lower, on each question. IMO leaving it at 50-50 defeats the whole point of the "challenge", since it indicates the user has no real opinion on whether GPT-4 can answer correctly. Either GPT-4 is correct or it is wrong. Putting a binary choice on a slider would be very confusing to people not accustomed to measuring everything on a probability scale.