Amundsen has two databases and three services in its architecture diagram. For me, that's a smell: you now risk inconsistency between the two databases, and you may have to learn how to tune both Elasticsearch and Neo4j...
Versus the conceptually simpler "one binary, one container, one storage volume/database" model.
I acknowledge it's a false choice and a semi-silly thing to fixate on (how do you perf-tune ingestion queue problems vs write problems vs read problems for a Go binary?)...
But, like, I have 10 different systems I'm already debugging.
Adding another one, like a data catalog that is supposed to make life easier, and discovering I now have 5-subsystems-in-a-trenchcoat that I might need to debug, means I'm spending even more time babysitting the metadata manager rather than doing data engineering _for the business_.
Off topic, but the Prometheus Pushgateway is such a bad implementation (once you push metrics, they stay there until it's restarted; a counter doesn't increase, each push just replaces the metric with the new value) that we had to write our own metrics collector endpoint.
That is literally how it is supposed to work. Prometheus scrapes metrics; that is the model. If you for some reason find yourself unable to host an endpoint with metrics, you can use the Pushgateway as a fallback and push metrics there, where yes, they will stay until the gateway is restarted. Ask yourself how it could ever work if they were deleted after being read: how would multiple Prometheus agents be able to read from the same source?
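For context, the normal pull setup is just a long-lived process exposing /metrics and letting Prometheus scrape it on its own schedule. A minimal sketch with the Python prometheus_client (metric and port names here are illustrative):

    import random
    import time

    from prometheus_client import Counter, start_http_server

    # Normal pull model: the process stays up, exposes /metrics on :8000,
    # and Prometheus scrapes it whenever it wants.
    REQUESTS = Counter("myapp_requests_total", "Requests handled by this process")

    if __name__ == "__main__":
        start_http_server(8000)      # serves http://localhost:8000/metrics
        while True:
            REQUESTS.inc()           # the counter accumulates in-process
            time.sleep(random.random())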
It sounds like you are using it for the wrong job. It's meant for batch jobs / short-running processes that don't stay up long enough to expose a /metrics endpoint for Prometheus to scrape, and there you want exactly that kind of behavior.
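A minimal sketch of that intended use, assuming the Python prometheus_client and a Pushgateway on localhost:9091 (job and metric names are made up):

    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    # Short-lived batch job: it is gone before Prometheus ever scrapes it,
    # so it pushes its final state to the gateway, which holds it for scraping.
    registry = CollectorRegistry()
    last_success = Gauge("batch_last_success_unixtime",
                         "Last time the batch job finished successfully",
                         registry=registry)
    rows = Gauge("batch_rows_processed", "Rows processed in the last run",
                 registry=registry)

    rows.set(12345)
    last_success.set_to_current_time()

    # The pushed group replaces the previous one for this job name; it does not
    # accumulate, which is exactly the behavior complained about above.
    push_to_gateway("localhost:9091", job="nightly_batch", registry=registry)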
The Pushgateway is itself a horrible hack around the fact that Prometheus is designed only for scraping metrics. Unfortunately the whole ecosystem around it is an utter mess.
Remote Write is a viable alternative in Prometheus and its drop-in replacements. I'm not a massive fan of it myself, as I feel the pull-based approach is superior overall, but I still make heavy use of it.
The Pushgateway's documentation itself calls out that there are only very limited circumstances where it makes sense.
I personally only used it in $old_job and only for batch jobs that could not use the node_exporter's textfile collector. I would not use it again and would even advise against it.
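For comparison, the textfile-collector route is just writing a .prom file into a directory that node_exporter watches (whatever you pointed --collector.textfile.directory at). A sketch with the Python prometheus_client; the path and metric names are illustrative:

    from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

    # Batch job on a host that is already scraped via node_exporter:
    # drop the metrics into the textfile collector directory instead of
    # pushing them anywhere.
    registry = CollectorRegistry()
    duration = Gauge("backup_duration_seconds", "How long the backup took",
                     registry=registry)
    duration.set(87.3)

    # write_to_textfile writes to a temp file and renames it, so node_exporter
    # never sees a half-written file.
    write_to_textfile("/var/lib/node_exporter/textfile/backup.prom", registry)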
> "But I got it all working; now I can finally stop explaining to my boss why we need to re-structure the monitoring stack every year."
Prometheus and Grafana have been progressing in their own ways, each trying to build a full-stack solution, and then the OTEL thingy came along and ruined the party for everyone.
I think OTEL has made things worse for metrics. Prometheus was so simple and clean before the long journey toward OTEL support began. Now Prometheus is much more complicated:
- all the delta-vs-cumulative counter confusion (sketched below)
- push support for Prometheus, and the resulting out-of-order errors
- the {"metric_name"} syntax changes in PromQL
- resource attributes and the new info() function needed to join them
I just don’t see how any of these OTEL requirements make my day-to-day monitoring tasks easier. Everything has only become more complicated.
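To make the first bullet above concrete, here is a plain-Python toy sketch of the delta-vs-cumulative issue (made-up numbers, no libraries): Prometheus counters are cumulative, OTLP also allows delta temporality, and a cumulative backend has to reconstruct the running total itself.

    # One counter, four reporting intervals.
    cumulative = [10, 25, 40, 55]   # Prometheus-native: each sample is the running total
    deltas     = [10, 15, 15, 15]   # delta temporality: each sample is only the increase

    # A cumulative backend ingesting deltas must re-accumulate them, and a lost,
    # duplicated, or out-of-order delta silently corrupts the reconstructed total.
    total, reconstructed = 0, []
    for d in deltas:
        total += d
        reconstructed.append(total)
    assert reconstructed == cumulative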
I still haven't got my head around how OTEL fits into a good open-source monitoring stack. Afaik, it is a protocol for metrics, traces, and logs. And we want our open-source monitoring services/dbs to support it, so they become pluggable. But, afaik, there's no one good DB for logs and metrics, so most of us use Prometheus for metrics and OpenSearch for logs.
Does OTEL mean we just need to replace all our collectors (like logstash for logs and all the native metrics collectors and pushgateway crap) and then reconfigure Prometheus and OpenSearch?
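Roughly, yes on the instrumentation/collector side: the app speaks the OTel API/OTLP, and the backend either ingests OTLP directly or you bridge it. A minimal sketch, assuming the opentelemetry-sdk and opentelemetry-exporter-prometheus Python packages, of an app instrumented with OTel but still scraped by an unchanged Prometheus:

    from prometheus_client import start_http_server
    from opentelemetry import metrics
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.exporter.prometheus import PrometheusMetricReader

    # Instrument with the OTel API, but expose the result as a plain
    # Prometheus /metrics endpoint so the existing scrape config still works.
    start_http_server(8000)
    reader = PrometheusMetricReader()
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

    meter = metrics.get_meter("demo")
    requests = meter.create_counter("app_requests", description="Handled requests")
    requests.add(1, {"route": "/"})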
Logs, spans, and metrics are all stored as time-stamped stuff. Sure, simple fixed-width columnar storage is faster, and it makes sense to special-case numbers (add downsampling, aggregations, histogram maintenance, and whatnot), but any write-optimized storage engine can handle this; it's not the hard part (basically LevelDB, and if there's a need for scaling out it'll look like Cassandra, Aerospike, ScyllaDB, or ClickHouse ... see also https://docs.greptime.com/user-guide/concepts/data-model/ and the specialized storage engines at https://docs.greptime.com/reference/about-greptimedb-engines... )
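As a toy illustration of that claim (pure Python, not a real engine): samples are just rows keyed by (series, timestamp), kept in sorted key order the way an LSM tree keeps its SSTables, and reads are range scans over that key order.

    import bisect
    import time

    class ToyTSDB:
        """Samples keyed by (series, timestamp), kept in sorted key order."""

        def __init__(self):
            self._rows = []  # sorted list of ((series_key, ts), value)

        def append(self, series_key, ts, value):
            bisect.insort(self._rows, ((series_key, ts), value))

        def range_query(self, series_key, start_ts, end_ts):
            # Range scan over the sorted key space, like an SSTable iterator.
            lo = bisect.bisect_left(self._rows, ((series_key, start_ts), float("-inf")))
            hi = bisect.bisect_right(self._rows, ((series_key, end_ts), float("inf")))
            return [(key[1], value) for key, value in self._rows[lo:hi]]

    db = ToyTSDB()
    now = int(time.time())
    db.append('http_requests_total{job="api"}', now, 42)
    db.append('http_requests_total{job="api"}', now + 15, 57)
    print(db.range_query('http_requests_total{job="api"}', now, now + 60))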
I think the answer is that it doesn't fit any definition of a _good_ monitoring stack, but we are stuck with it. It has largely become the blessed protocol, specification, and standard for OSS monitoring, along every axis (logging, tracing, collecting, instrumentation, etc)... it's a bit like the efforts that resulted in J2EE and EJBs back in the day, only more diffuse and with more varied implementations.
And we don't really have a simpler alternative in sight... at least in the Java days there was the disgust and the reaction via Struts, Spring, EJB3+, and of course other languages and communities.
Not sure exactly how we got into such an over-engineered mono-culture in terms of operations, monitoring, and deployment for 80%+ of the industry (k8s + graf/loki/tempo + endless supporting tools or flavors), but it is really a sad state.
Then you have endless implementations handling bits and pieces of various parts of the spec, and of course you have the tools to actually ingest and analyze and report on them.
For filters, create a set of pre-defined tags and let the LLM choose one of your pre-defined tags based on the paper's summary.
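A hypothetical sketch of that idea; call_llm stands in for whatever client you actually use and is not a real library function, and the tag list is made up:

    ALLOWED_TAGS = ["databases", "ml-systems", "security", "networking", "compilers"]

    def tag_paper(summary: str, call_llm) -> str:
        # Constrain the model to a fixed vocabulary and validate its answer,
        # so the filter set stays closed no matter what the model returns.
        prompt = (
            "Pick exactly one tag for the paper summary below. "
            f"Answer with only one of: {', '.join(ALLOWED_TAGS)}.\n\n{summary}"
        )
        answer = call_llm(prompt).strip().lower()
        return answer if answer in ALLOWED_TAGS else "untagged"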