Makes me wonder if the Go v3 client has the same problem. If yes, that would be ...

mdaniel · on Aug 8, 2024

At the very real risk of "talk is cheap," my understanding is that is part of why Jepsen publishes the test suites (e.g. https://github.com/jepsen-io/etcd ) so it's not "take my word for it" but rather "lein run test-all" and watch the fireworks. So, a sufficiently motivated actor, say for example one of the deep-pocketed stewards of the Kubernetes project could run the tests themselves

Between my indescribable hatred for etcd and my long-held lust for a pluggable KV backend (https://github.com/kubernetes/kubernetes/issues/1957 et al) it'd be awesome if any provable KV safety violations were finally the evidence required for them to take that request seriously

protosam · on Aug 8, 2024

Having looked at the test suite already, I know enough to know that I don't understand it well enough to be that guy to do this. It's for this reason, I'm personally going to pull out the popcorn and see what happens over the next few weeks.

jhgg · on Aug 8, 2024

I'm currently working on a Rust v3 client, and have been reading the Go v3 source code, and the code definitely is hard to follow so I would be unsurprised if there were issues lurking.

tjungblut · on Aug 8, 2024

could you be more specific on what's so hard to follow? it's quite literally just the implementation of the GRPC interface [1].

[1] https://github.com/etcd-io/etcd/blob/main/client/v3/kv.go#L3...

silverlyra · on Aug 8, 2024

I was curious and dug into the Go client code. You linked to the definition of KV – the easiest way to create one is with NewKV [1], which internally creates a RetryKV [2] wrapper around the Client you give it.

RetryKV implements the KV methods by delegating to the underlying client. But before it delegates an immutable request (e.g., range), it sets the request retry policy to repeatable [3].

Retries are implemented with a gRPC interceptor, which checks the retry policy when deciding whether a request should be retried [4].

The Jepsen writeup says a client can retry a request when “the client can prove the first request could never execute, or that the request is idempotent”. In my (cold) read of the code, the Go client stays within those bounds.

For non-idempotent requests, the Go client only retries when it knows the request was never sent in the first place [5]. For idempotent requests, any response with gRPC status unavailable will be retried [6].

Unlike jetcd, the Go client’s retry behavior is safe.

[1] https://github.com/etcd-io/etcd/blob/main/client/v3/kv.go#L9... [2] https://github.com/etcd-io/etcd/blob/main/client/v3/retry.go... [3] https://github.com/etcd-io/etcd/blob/main/client/v3/retry_in... [4] https://github.com/etcd-io/etcd/blob/main/client/v3/retry_in... [5] https://github.com/etcd-io/etcd/blob/main/client/v3/retry.go... [6] https://github.com/etcd-io/etcd/blob/main/client/v3/retry.go...

protosam · on Aug 8, 2024

Just dropping a comment to express my gratitude for sharing a breakdown of your interpretation. :)

jhgg · on Aug 8, 2024

Look at the code for the watcher client[1] and lease management[2].

[1]: https://github.com/etcd-io/etcd/blob/main/client/v3/watch.go

[2]: https://github.com/etcd-io/etcd/blob/main/client/v3/lease.go

tjungblut · on Aug 9, 2024

The watch is a simple processing loop that receives and sends on a bi-directional GRPC channel. Leases have a similar loop for keep-alive messages, everything else is quite literally delegated.

I get that it's difficult to translate this 1:1 into Rust without channels and select primitives, but saying it's complex is wild. Try the server-side code for leases ;)