The thing is that you can’t actually trust it did run the rm command.
As soon as you ask “give me a list of all the commands that led to the deletion”, isn’t it extremely likely to just invent an rm in there?
Furthermore—and granted, I didn’t watch the video in detail—what data was actually deleted? Maybe the hallucination was that some data was there when it wasn’t, and then Claude convinced itself it deleted something in the move process. Notice that it never says “I accidentally ran rm instead of mv”. That only happens when the user asks to backfill the commands.
Does Cowork give Claude access to historical commands, or does Claude just generate based on its “memories”?
I’ve been using Claude quite a bit over the past few weeks, and this is a pattern I’ve noticed a few times.
Naive question, but isn’t every output token generated in roughly the same, non-deterministic, way? Even if it uses its actual history as context, couldn’t the output still be incorrect?
Have you ever seen those posts where AI image generation tools completely fail to generate an image of the leaning tower of Pisa straightened out? Every single time, they generate the leaning tower, well… leaning. (With the exception of some more recent advanced models, of course)
From my understanding, this is because modern AI models are basically pattern extrapolation machines. Humans are too, by the way. If every time you eat a particular kind of berry, you crap your guts out, you’re probably going to avoid that berry.
That is to say, LLMs are trained to give you the most likely text (their response) which follows some preceding text (the context). From my experience, if the LLM agent loads a history of commands run into context, and one of those commands is a deletion command, the subsequent text is almost always “there was a deletion.” Which makes sense!
So while yes, it is theoretically possible for things to go sideways and for it to hallucinate in some weird way (which grows increasingly likely if there’s a lot of junk clogging the context window), in this case I get the impression it’s close to impossible to get a faulty response. But close to impossible ≠ impossible, so precautions are still essential.
Yes, but Claude Cowork isn't just an LLM. It's a sophisticated harness wrapped around the LLM (Opus 4.5, for example). The harness does a ton of work to keep the number of tokens sent and received low, and to keep the context carried between calls small. This applies to other coding agents to varying extents as well.
Asking for the trace likely just involves the LLM telling the harness to call some tools, such as the Bash tool with grep to find the line numbers of the command in the trace file. It can do this repeatedly until it thinks it has found the right block. Then those line numbers are passed to the Read tool (by the harness) to get the command(s), and finally the output of that read is added to the response by the harness.
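A minimal sketch of that tool loop, assuming a plain-text trace file (the function names and the file name below are made up; the real Grep/Read tools are more capable, but the key point is the same - the model only picks the arguments, and the file contents come back verbatim through the harness):

```python
# Hypothetical sketch of the harness tool loop described above:
# locate a command in a trace file (like a Grep tool), then read
# back exactly those lines (like a Read tool).
def grep_lines(path, needle):
    """Return 1-based line numbers whose text contains `needle`."""
    with open(path) as f:
        return [i for i, line in enumerate(f, start=1) if needle in line]

def read_range(path, start, end):
    """Return lines start..end (1-based, inclusive), verbatim."""
    with open(path) as f:
        lines = f.readlines()
    return "".join(lines[start - 1:end])

# The LLM decides which tool to call and with what arguments, e.g.:
# hits = grep_lines("session_trace.log", "rm ")
# block = read_range("session_trace.log", hits[0], hits[0] + 3)
```

Note that nothing in this loop asks the model to reproduce the commands from memory; it only steers the search.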
The LLM doesn't get a chance to reinterpret or hallucinate until it says it is very sorry for what happened. Also, the point where it originally wrote (hallucinated?) the commands is when it made its oopsy.
I'd love to be able to modify JS at runtime on random websites. Too often there's a bug, or a "feature" that prevents me from using a service, that I could fix by removing an event or something in the JS code.
As far as I know neither Firefox nor Chrome allow you to modify the JS prior to execution without a plugin. You can run random JS, sure, but you can’t monkeypatch.
In this case the comment that was promoted to the top-level has been consistently higher on the page (it’s the first comment still) than the comment it originally responded to.
There is no problem with that, as Bitcoin itself will never reach the transaction volume of regular cash - it couldn't even handle that volume (nor does it make sense to store your coffee payment on any given day for a lifetime in the blockchain). That's where layer 2 solutions such as Lightning come into play - for everyday, more privacy-friendly smaller transactions that are not put on the blockchain. And on layer 2, you can go even smaller in denominations.
To reiterate and paraphrase: "we need more coffee shops to accept bitcoin" "but it's not divisible enough" "that's fine because we don't need coffee shops to accept bitcoin"
Well, to be fair, Lightning IS bitcoin - and inevitably tied to it. It is just a different protocol, or a different means of accounting for and transferring ownership of bitcoin.
Just imagine gold were used as money, but you quickly realize that weighing and dividing gold is cumbersome. So someone creates a layer 2: prints green paper bills in arbitrary denominations, and that very someone guarantees you can exchange those green paper bills at a fixed rate for ounces of gold. You exchange green paper bills, the green paper bills have value, but the underlying asset is still gold. Until that someone decides to no longer exchange those green sheets of paper for gold, performing basically the first rug-pull in history. That is where Bitcoin steps in, as its decentralization guarantees there is no such rug-pull. And Lightning is the green bills that can be exchanged for the underlying bitcoin anytime, without relying on the promise of a third party.
I’m not certain I agree with the premise that mileage is a good indicator of road wear.
I do about 30k km in a typical year. Our families live far away, so a return trip is around 4,000 km.
If we visit our families 3x/year, we’ve effectively exhausted this “10k mile” allowance (I don’t live in the UK, but the point still stands); however, very little of our actual mileage would be in our home country - to be precise, only 300 km out of 2,000.
If I go in other directions, the math gets even worse: I can leave the country within 30 km and add 800-3,000 km of mileage for a scuba trip.
The amount of damage you do to the roads is exactly proportional to how many miles you drive on the roads. Where the roads are doesn't matter.
What you're describing is a billing detail - how do you ensure the right chunk of those fees goes to the owners of those roads? And that leads to the conundrum I posed - without tracking your location at all times, there's no way to prove how many miles were driven in one municipality versus another.
It depends on axle weight, not total vehicle weight, and regardless of weight it is directly and linearly tied to mileage.
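For a back-of-envelope feel for both halves of that claim, one widely cited rule of thumb (from the old AASHO road tests) is that pavement damage scales roughly with the fourth power of axle load, but linearly with distance. The reference axle and the vehicle numbers below are illustrative, not official figures:

```python
# Rough "fourth-power law" sketch: damage ~ (axle load)^4 * miles.
REFERENCE_AXLE_TONS = 8.0  # common reference axle weight (assumption)

def relative_damage(axle_tons, miles):
    """Damage relative to one mile driven by the reference axle."""
    return (axle_tons / REFERENCE_AXLE_TONS) ** 4 * miles

car = relative_damage(axle_tons=1.0, miles=10_000)
truck = relative_damage(axle_tons=8.0, miles=10_000)
# The truck axle does thousands of times the damage of the car axle over
# the same distance, yet for either vehicle doubling the miles exactly
# doubles the damage - weight sets the rate, mileage scales it linearly.
```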
Road fees aren't just for the damage you're causing - they're for construction, signage, and many other pieces of infrastructure whose usage depends more on mileage than weight.
Yes, that was exactly my point. A local government handing me a bill that is not proportional to which roads I was driving on.
There are many systems, many of them imperfect. A per-week/month vignette seems maybe the most reasonable, as it guarantees I’m paying my dues in the place where, and in proportion to the amount of time, I will actually be using the roads.
My dog has been getting UTIs her whole life, ever since she was a pup. The vet kept prescribing the same antibiotic over and over again. We would do the full 10 days of treatment, the symptoms would be alleviated for a couple weeks, and then they gradually showed up again over the course of a few weeks to a month.
They kept insisting, asking whether we gave it twice a day, whether we were sure we did the full course, whether we respected the 12h interval, etc. The vets told us this (we saw about 6 different vets at the clinic), the person manning the phone berated us, and the nurse welcoming us repeated the same thing yet again.
Eventually I asked to see the test results (the cultures). It was clear that another antibiotic was effective, and that the one they were giving us wasn’t (it was about 25% better than the control). I asked why we couldn’t get the other one, and it turned out it was difficult to get in our country because it was only approved for humans.
We had to get a dispensation from the health ministry to import it from a neighbouring country. It was a mess of a process that took weeks.
Blaming patients is so ingrained that we were being gaslit into giving our pet an ineffective treatment and made to feel like we were doing something wrong all along.
Sorry, I’m not at all into biology so I didn’t know how to express it.
I just meant that over the course of 72 hours, the speed at which the Petri dish with the less-effective antibiotic was filling up seemed to lag behind the control by about 24 hours. At 72 hours both were “full” regardless.
I’ve only used a tiller when I was learning to sail. Since then I’ve only used larger ships with a wheel as the helm. You’re absolutely right that a tiller is an order of magnitude easier still.
The amount you turn the wheel is identical [0] with or without power steering, unless perhaps you have one of the weird variable turn ratio systems. In a conventional power steering system, the steering wheel is linked to the wheels, and the power steering applies torque to help you turn the wheel but does not change the relationship between the steering wheel and the wheels.
[0] Almost identical. The steering has some flex, and the amount it flexes is related to how much torque you apply. But this is a tiny effect.
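The fixed-ratio point can be shown in a couple of lines. With a conventional rack, road-wheel angle is steering-wheel angle divided by a constant ratio, and assist torque never enters that relationship - it changes effort, not geometry. The 16:1 ratio below is a typical but made-up value:

```python
# Conventional (non-variable-ratio) steering: a fixed mechanical ratio
# links steering-wheel angle to road-wheel angle.
STEERING_RATIO = 16.0  # steering-wheel degrees per road-wheel degree

def road_wheel_angle(steering_wheel_deg, assist_torque_nm=0.0):
    """Assist torque reduces driver effort but is absent from the
    angle relationship, so the parameter is deliberately unused."""
    return steering_wheel_deg / STEERING_RATIO
```

With or without assist, 160 degrees at the wheel gives the same 10 degrees at the road wheels; losing the assist only makes those 160 degrees harder to turn.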
My comment was explicitly about how physically difficult it was to turn the wheel. I also had to crank it far over in order to get off the highway.
I use passkeys exclusively on my YubiKeys, and I ensure I always have a backup (two Yubikeys with one passkey each).
TOTPs are handled the same way (stored on two Yubikeys).
We used password managers when 2FA allowed us to guarantee that even a leak of the passwords wouldn’t be that catastrophic. If you sync your passkeys to your password manager, anyone compromising it has full access to your accounts.
As a random aside: in a past life, I was helping somebody with a project where we needed a VPN connecting two locations. We were working with OpenVPN, and the nodes we had handling the connection were beefy, with large uplinks, but we were hitting OpenVPN's single-core limitation, so all that capacity didn't matter. We ended up building a proof of concept that launched one OpenVPN instance per core and then bonded them together, which let us get way closer to line rate.
Thankfully we never had to try them for real to see how horribly that would have gone under real load.
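The reasoning behind the one-instance-per-core trick is simple to model: classic OpenVPN's data path is effectively single-threaded, so one tunnel tops out at whatever one core can encrypt, while N bonded tunnels scale until the physical uplink becomes the limit. The throughput numbers here are illustrative, not measurements:

```python
# Crude capacity model for bonding N single-threaded VPN tunnels.
def aggregate_throughput(instances, per_core_mbps, line_rate_mbps):
    """Bonded tunnels scale with core count until the uplink caps them."""
    return min(instances * per_core_mbps, line_rate_mbps)

one  = aggregate_throughput(1, 300, 10_000)  # single tunnel: core-bound
many = aggregate_throughput(8, 300, 10_000)  # 8 tunnels: 8x the throughput
```

Of course this ignores the hard parts - keeping flows pinned to one tunnel to avoid reordering is exactly where such a setup gets hairy under real load.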
AES-NI is enabled, Linux confirms it’s enabled, and OpenSSL has it enabled, but I found no easy way to check if it’s actually being used (I found a link long ago but lost it :( )
I was using either AES-256-GCM or AES-256-CBC.
It could also be default configs not set right. A brief Google search tells me to tweak a myriad of buffers and config options… Some say that without changing buffers they were limited to 100 Mbps, for example. Lots said changing to UDP / changing MTU / buffers / etc. helped…
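For what it's worth, the knobs those posts mention map to real OpenVPN config directives. A fragment like the following is the kind of thing they suggest - the values are guesses that would need benchmarking on the actual link, not recommendations:

```
# Use UDP; tunneling TCP over TCP degrades badly under packet loss.
proto udp
# Enlarge socket buffers past OpenVPN's conservative defaults.
sndbuf 393216
rcvbuf 393216
# Keep the tunnel MTU standard, but clamp TCP MSS so tunneled
# packets avoid fragmentation.
tun-mtu 1500
mssfix 1450
```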
I agree with you that it should be fine/fast enough - that was my expectation too! However, my testing in real life showed it not to be, and it’s a common issue for OpenVPN. The easiest solution seems to be WireGuard, rather than tweaking random settings with no idea what the bottlenecks are.