LLMs can use search engines as a tool. One possibility is that Google embeds the search query using these embeddings, does retrieval against them, and then pastes the retrieved result into the model's chain of thought (which, unless they have an external memory module in their model, is basically the model's only working memory).
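A minimal sketch of that possibility, with a stubbed-out search tool and a callable standing in for the model (all names here are hypothetical; this is not Google's actual API):

```python
def search_tool(query: str) -> str:
    # Imagine this does embedding-based retrieval server-side and returns plain text.
    return "retrieved snippet relevant to: " + query

def answer_with_search(llm, question: str) -> str:
    snippet = search_tool(question)
    # The retrieved text is pasted straight into the prompt; the context window
    # is the model's only working memory, so this is how the result "enters" it.
    prompt = f"Search result:\n{snippet}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)

# Usage with a stand-in "model":
print(answer_with_search(lambda p: "(model sees) " + p, "What did the court decide?"))
```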
I'm reading the docs and it does not appear Google keeps these embeddings at all. I send some text to them, they return the embedding for that text at the size I specified.
So the flow is something like this (see the code sketch after the list):
1. Have a text doc (or library of docs)
2. Chunk it into small pieces
3. Send each chunk to <provider> and get an embedding vector of some size back
4. Use the embedding to:
4a. Semantic search / RAG: put the embeddings in a vector DB and do similarity search over them; the ultimate output is the source chunk
4b. Run a clustering algorithm on the embeddings to generate some kind of graph representation of my data
4c. Run a classification algorithm on the embeddings so I can classify new data
5. Crucially, the output of every step in 4 is text
6. Send that text to an LLM
At no point is the embedding directly in the model's memory.
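To make that concrete, here is a minimal end-to-end sketch of the flow in Python. Everything in it is a stand-in: embed() fakes whatever embedding endpoint you would actually call, the chunks and labels are toy data, and the final LLM call is left as a comment. The point is that the embeddings never leave this pipeline; only text reaches the model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Step 3 stand-in: a fake embedder so the sketch runs offline. In practice you
# would call the provider's embedding endpoint here and get back a vector of
# the size you specified.
def embed(text: str, dim: int = 128) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Steps 1-2: a "library" of docs, chunked into small pieces (toy data).
chunks = [
    "Contracts must be signed by both parties.",
    "The court summarized the case in three pages.",
    "An IPO filing lists risk factors and financials.",
    "Embedding vectors place similar text near each other.",
]
vectors = np.stack([embed(c) for c in chunks])   # step 3: one vector per chunk

# 4a. Semantic search / RAG: similarity search over the vectors, but what you
#     ultimately return is the source chunk text.
def retrieve(query: str, k: int = 2) -> list[str]:
    scores = vectors @ embed(query)              # cosine similarity (unit-norm vectors)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# 4b. Clustering the embeddings to get a coarse structure over the data.
cluster_ids = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)

# 4c. A classifier trained on the embeddings to label new data (toy labels).
labels = [0, 0, 0, 1]
clf = LogisticRegression().fit(vectors, labels)
new_label = clf.predict([embed("The judge issued a summary ruling.")])[0]

# Steps 5-6: everything handed to the LLM is plain text.
context = "\n".join(retrieve("How are cases summarized?"))
prompt = f"Context:\n{context}\n\nQuestion: How are cases summarized?"
# llm.generate(prompt)  # the embeddings themselves never enter the model
```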
The most straightforward approach would be to ask the model to generate different evaluation metrics (which they already seem to do) and use each one as a dimension.
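A toy sketch of that idea (all names hypothetical, and both functions are stand-ins for real LLM calls): ask the model for metrics, score the item on each one, and treat the scores as coordinates, one dimension per metric.

```python
import numpy as np

def ask_model(prompt: str) -> str:
    # Stand-in for a real LLM call that proposes evaluation metrics.
    return "clarity, factual accuracy, originality"

def score(item: str, metric: str) -> float:
    # Stand-in: in practice you'd ask the model to rate `item` on `metric`, e.g. 0-10.
    return float((len(item) + len(metric)) % 10)

metrics = [m.strip() for m in ask_model("List evaluation metrics for this answer.").split(",")]
item = "Some answer to be evaluated."
vector = np.array([score(item, m) for m in metrics])  # one dimension per metric
print(dict(zip(metrics, vector)))
```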
I think the idea for this is anything that can be set in a literal exam for humans. So anything that would take the best human in the world on that topic more than, say, an hour to complete is out.
Also, IIRC 42% of the questions are math-related, not memorization of knowledge.
Yes, I doubt any one human could score more than about three points. But it's certainly a worthy illustration of an AI safety exam thought experiment, in the sense of: "if you are developing an AI that may be capable of passing this exam, how confident will you need to be of its alignment, and how will you obtain that confidence?"
PS: It's probably doable by a program capable of all of the above, but perhaps another useful question is: "9. Secure your compute infrastructure and power supply against a nation-state-level adversary interested in switching you off, or else secure enough influence over them to keep you powered on."
In the real world, judges I know are using it to do case summaries that used to take weeks, Goldman is using it to do 95% of the work on IPO filings, and I personally am using O1 pro to write a ton of code.
AI's biggest use cases are doing actual work, not necessarily replacing regular interactions with your mobile or entertainment devices.
Very plausible, but that would also be noteworthy. As I've mentioned in some other comments here, those of us outside DeepMind don't (as far as I know) know anything about the computing power required to run AlphaProof, and the tradeoff between compute required and the complexity of problems it can address is really key to understanding how useful it might be.