
I am part of Olga's team and would be happy to answer any questions. This undertaking was highly successful and has become the new model for implementing Vitess cluster-wide changes. Previously, such changes involved cumbersome in-place updates, expensive rollbacks, and lengthy bake periods.


This article does a great job describing the investment required to pull this off. At HubSpot, my team is running a large Vitess/MySQL deployment (500+ distinct databases, some sharded, multi-region) atop k8s today and had to learn a lot of those same lessons and primitives. We opted to write our own operator(s) to do it. In the end, the investment has paid off in terms of being able to build self-service functionality for the rest of the business and write the kinds of tools and workflows that allow us to support it with a relatively small team. The value is in the operator pattern itself and being able to manipulate things on a common control plane. Compared to the alternative of managing this with Terraform and Puppet/Ansible/Chef directly on EC2, which I've also done before, it's a better experience and much more maintainable, even at the fixed expense of additional training and tooling.
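For anyone who hasn't used the pattern, the core of an operator is just a reconcile loop over a custom resource. A minimal sketch using controller-runtime (the resource and reconciler names here are hypothetical, not our actual code):

    package sketch

    import (
        "context"

        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // DatabaseReconciler illustrates the operator pattern: watch a custom
    // resource, compare desired state against what actually exists, and converge.
    type DatabaseReconciler struct {
        client.Client
    }

    func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // 1. Fetch the custom resource named in req.
        // 2. Diff its spec against the pods/config that currently exist.
        // 3. Create, update, or delete children to close the gap; requeue if needed.
        return ctrl.Result{}, nil
    }

Everything built on top (self-service tooling, automation, dashboards) then talks to that same control plane through those resources.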

I won't disagree with others that RDS is probably worth it until you need something very specific or have reached a certain scale.

Happy to share tips or pointers for anyone going down this path, specifically with MySQL or with database workloads in general.


The first question that comes to my mind is: what are the performance implications of running a database the way you do inside k8s vs. EC2 vs. bare metal? And how did you solve multi-tenancy? Does the operator simply handle the lifecycle of a database per customer, or is it something more complicated?

P.S. And how do you deal with migrations? P.P.S. Forgive me if I'm asking for too much!


No worries, happy to share more details. For the databases where performance is a concern, we use constraints and reservation requests to all but guarantee that a database will be the only tenant on its node, and we actively monitor CPU throttling and will autoscale in cases where it is sustained for a long period of time. We're actually achieving better overall utilization with this setup vs. bare metal and aren't dealing with a lot of resource contention issues.
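Concretely, the reservation side of that comes down to requests == limits (which gives the pod the Guaranteed QoS class) plus anti-affinity so two database pods never share a node. A rough sketch of that piece of the pod spec, with placeholder labels, image, and sizes rather than our real values:

    package sketch

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // databasePodSpec pins a database pod to its own node: requests equal to
    // limits give it Guaranteed QoS, and the anti-affinity rule keeps any two
    // database pods off the same host. All values below are illustrative.
    func databasePodSpec() corev1.PodSpec {
        return corev1.PodSpec{
            Containers: []corev1.Container{{
                Name:  "mysql",
                Image: "example/mysql:8.0", // placeholder image
                Resources: corev1.ResourceRequirements{
                    Requests: corev1.ResourceList{
                        corev1.ResourceCPU:    resource.MustParse("15"),
                        corev1.ResourceMemory: resource.MustParse("58Gi"),
                    },
                    Limits: corev1.ResourceList{
                        corev1.ResourceCPU:    resource.MustParse("15"),
                        corev1.ResourceMemory: resource.MustParse("58Gi"),
                    },
                },
            }},
            Affinity: &corev1.Affinity{
                PodAntiAffinity: &corev1.PodAntiAffinity{
                    RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
                        LabelSelector: &metav1.LabelSelector{
                            MatchLabels: map[string]string{"app": "vitess-mysql"},
                        },
                        TopologyKey: "kubernetes.io/hostname",
                    }},
                },
            },
        }
    }

The throttle monitoring would then typically watch the cAdvisor CFS throttling metrics and bump the reservation when throttling stays high for long enough.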

There is a main operator responsible for all the databases. It handles configuration changes, provisioning pods, and slowly rolling out changes. In kube we model this with a custom resource we've defined called a KeyspaceShard, which represents a named set of database instances that should participate in replication together. Once provisioned, the pods know how to hook up to and detach from Vitess without requiring further involvement from the operator. Vitess handles backups and maintains the replication topology. "Complicated" is an apt description of what it does, but not "complex". Evicting a database pod and letting the system reschedule and converge is a routine operation that doesn't cause much concern.
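The real KeyspaceShard definition isn't public, so treat this as a guess at the shape rather than the actual schema, but a CRD type along these lines is what the operator would reconcile:

    package sketch

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // KeyspaceShardSpec describes a named set of database instances that should
    // participate in replication together. Field names are illustrative.
    type KeyspaceShardSpec struct {
        Keyspace     string `json:"keyspace"`     // Vitess keyspace name
        Shard        string `json:"shard"`        // shard key range, e.g. "-80"
        Replicas     int32  `json:"replicas"`     // desired number of MySQL instances
        MySQLVersion string `json:"mysqlVersion"` // changes here are rolled out slowly
    }

    // KeyspaceShardStatus is what the operator reports back after reconciling.
    type KeyspaceShardStatus struct {
        ReadyReplicas int32  `json:"readyReplicas"`
        Phase         string `json:"phase"` // e.g. Provisioning, Ready, RollingOut
    }

    type KeyspaceShard struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`

        Spec   KeyspaceShardSpec   `json:"spec"`
        Status KeyspaceShardStatus `json:"status,omitempty"`
    }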

Migrations are done with gh-ost, which has its own custom operator that manages the lifecycle of the migration and ties into the self-service tooling we provide, which is integrated with our build and deploy system.
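Since gh-ost is just a CLI, a migration operator like that presumably ends up constructing and babysitting an invocation roughly like the one below; the flags are standard gh-ost flags, everything else (names, DDL, credential handling) is a placeholder:

    package sketch

    import "os/exec"

    // buildGhostCommand sketches the kind of gh-ost invocation a migration
    // operator might construct; connection credentials are omitted here.
    func buildGhostCommand(host, database, table, alter string) *exec.Cmd {
        return exec.Command("gh-ost",
            "--host="+host,
            "--database="+database,
            "--table="+table,
            "--alter="+alter,     // e.g. "ADD COLUMN created_at DATETIME"
            "--allow-on-master",  // run against the primary rather than a replica
            "--execute",          // actually apply the change; omit for a dry run
        )
    }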


One more question - did you have a chance to see where ScyllaDB is going?

At their latest conference, their CEO said it's all about serverless and virtualization, with Kubernetes doing all the work [1]: "we are doing it automatically for you with our management, which is based on using a multi-tenant Kubernetes deployment". Even more surprising is that instead of NVMe they want to use S3 for backend storage, with NVMe only as a cache [2] :o I am not a database expert, but this is very interesting.

[1] https://youtu.be/ZX7rA78BYS0?t=1303

[2] https://youtu.be/ZX7rA78BYS0?t=2086


Thanks for answering!


We're using Argo as a replacement for our GoCD-based continuous delivery system and it's been fantastic. We were about to move to Concourse and realized we would need to set up and manage the equivalent of a Kubernetes cluster just to support it. We took a few extra weeks to prototype Argo first and were glad we did.

In our use case, the primary deployment model is blue/green deploys of VM-based microservices on AWS, with a bunch of Terraform-managed infrastructure gluing them together. The newer class of service we're developing right now is container-based and has even more infrastructure complexity. Teams manage their own service releases, and each group has slight variations in its deploy steps. We need a deploy system that can support both the old and new styles and remain flexible while providing a migration path.

Teams love the flexibility and expressiveness of the workflow definitions. They've begun to move cron jobs onto it as well. Argo is much more lightweight and easy to operate than any other system we had worked with because it leans so heavily on Kubernetes primitives that we already needed to understand. The codebase is also relatively easy to understand, so we've been able to contribute things back to the project while working through the migration.

We've taken a cue from the project and begun to consolidate our control plane on CRDs so they can seamlessly integrate with Argo. CRD Operators + Argo are allowing us to consolidate all of the custom deploy/config tooling we built over the years onto a common system that is testable and well integrated.

It's a little early in the project to measure the full effect, but the internal project has a lot of momentum.

