I’d note they explicitly document that they rev GPT-4 every two weeks and provide fixed snapshots of the prior period’s model for reference. One could reasonably benchmark the evolution of the model’s performance and publish the results. But certainly you’re right - ChatGPT != GPT-4, and I would expect ChatGPT to perform worse than GPT-4, as it’s likely heavily constrained by the guidance, tunings, and whatever else they do to shape ChatGPT’s behavior. It might also very well be that, to scale and keep revenue in line with costs, they’ve dumbed down ChatGPT Plus. I’ve found it increasingly less useful over time, but I sincerely feel that’s mostly because of the layers of sandbox protection they keep adding, constraining the model into non-optimal spaces.

I do find that classic iterative prompt engineering still helps a great deal: give it a new identity aligned to the subject matter. Insist on depth. Insist on it checking its work and redoing it. Ask it if it’s sure about a response. Periodically reinforce the context you want to boost the signal. Etc.
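For illustration, here’s a minimal sketch of that loop, assuming the pre-1.0 openai Python client and a chat-completions call against gpt-4; the system identity, the example question, and the “are you sure” follow-up are just placeholder wording, not anything OpenAI prescribes:

    import openai

    openai.api_key = "sk-..."  # your key here

    messages = [
        # Give it an identity aligned to the subject matter, and insist on depth
        {"role": "system", "content": "You are a senior Postgres DBA. Answer in depth and show your reasoning."},
        {"role": "user", "content": "Why might a query plan regress after a minor Postgres upgrade?"},
    ]
    first = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    messages.append(first["choices"][0]["message"])

    # Insist it checks its work, ask if it's sure, and restate the context to boost the signal
    messages.append({"role": "user", "content": "Re-check that answer against the minor-upgrade scenario above. Are you sure? Fix anything that's wrong."})
    second = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    print(second["choices"][0]["message"]["content"])

The same pattern works in the ChatGPT UI, just spread across turns instead of a single script.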
Heh, this kind of reminds me of the process of enterprise support.
Working with the customer in dev: "Ok, run this SQL query and restart the service. Done, ok does the test case pass?" Done in 15 minutes.
Working with the customer in production: "Ok, here is a 35-point checklist of what's needed to run the SQL query and restart the service. Have your compliance officer check it and get VP approval, then we'll run implementation testing and verification" -- the same query and restart now takes 6 hours.
https://platform.openai.com/docs/models/gpt-4