I've been monitoring our ~120 machines for 3-4 years now using Influx+Telegraf+Grafana, and have been really happy with it. Prior to that we were using collectd+graphite, and with 1-minute stats it was adding double-digit percentage utilization on our infrastructure (I don't remember exactly how much, but I want to say 30% CPU+disk).
Influxdb has been a real workhorse. We suffered through some of their early issues, but since then it's been extremely solid. It just runs, is space efficient, and very robust.
We almost went with Prometheus instead of Influx (as I said, early growing pains), but I had just struggled through managing a central inventory and hated it, so I really wanted a push rather than pull architecture. But from my limited playing with it, Prometheus seemed solid.
It's so much easier to write incorrect/misleading queries in InfluxQL than in PromQL. And you can't perform operations between two different series names in InfluxDB, last I looked. That makes it impossible to do things like ratios or percentages unless you have control over your metrics and structure them the way Influx likes. There's also no support for calculating percentiles from Prometheus histogram buckets.
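For anyone wondering what "calculating percentiles from histogram buckets" involves, here's a rough Python sketch of the idea. Prometheus histograms store cumulative counts per upper bound (the `le` label), and a quantile is estimated by linear interpolation inside the bucket that contains the target rank. This is a simplified illustration, not Prometheus's actual `histogram_quantile` code; the function name, the input shape, and the sample numbers are all made up for the example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with a math.inf ("+Inf") bucket.
    Assumption for this sketch: the lowest bucket starts at 0.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate

    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if bound == math.inf or in_bucket == 0:
                # Cannot interpolate here; fall back to the lower edge.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper edge.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-latency histogram (seconds), cumulative counts.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, latency_buckets)
```

The result is only as precise as the bucket boundaries: everything inside a bucket is assumed to be evenly spread, so wide buckets give coarse estimates.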
I just wanted to echo this sentiment. Influx had its fair share of issues a few years ago (v0.8 migration, changing storage engines, tag cardinality issues), but the latest v1.x releases have been solid. I have been using the TIK stack (I use Grafana instead of Chronograf) for monitoring several dozen production-facing machines for 2 years now without a single issue, which I would very much count as a win.
I just hope they learned their lesson for the v2.0 release...
> I really wanted a push rather than pull architecture
Then try VictoriaMetrics - it supports both pull and push (including Influx line protocol) [1], it works out of the box, and it requires less CPU and RAM when working with a large number of time series (aka high cardinality) [2].
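In case the "push" side is unfamiliar: with Influx line protocol the agent formats each point as one text line (`measurement,tags fields timestamp`) and sends it to the server, so the server never needs an inventory of hosts to scrape. Below is a minimal Python sketch of that formatting only. The helper name `to_line` and the sample values are hypothetical, and the escaping covers just the common cases as I understand the format, so treat it as an illustration rather than a complete encoder.

```python
def to_line(measurement, tags, fields, ts_ns=None):
    """Format one point as an Influx line protocol string (simplified sketch)."""

    def esc(text, chars):
        # Backslash-escape the characters that act as separators.
        for c in chars:
            text = text.replace(c, "\\" + c)
        return text

    def fmt_field(value):
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"  # integers carry an 'i' suffix
        if isinstance(value, float):
            return repr(value)
        text = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{text}"'  # strings are double-quoted

    if not fields:
        raise ValueError("a point needs at least one field")

    head = esc(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys keep the output stable
        head += "," + esc(key, ",= ") + "=" + esc(str(tags[key]), ",= ")
    body = ",".join(f"{esc(k, ',= ')}={fmt_field(v)}" for k, v in fields.items())

    line = f"{head} {body}"
    if ts_ns is not None:
        line += f" {ts_ns}"  # timestamp in nanoseconds since the epoch
    return line


# Hypothetical point: host name and values are invented for the example.
line = to_line(
    "cpu",
    {"host": "web 01", "region": "eu"},
    {"usage_idle": 97.5, "cores": 8},
    1700000000000000000,
)
```

A real agent such as Telegraf batches many of these lines into one request, which is what makes push cheap on the sending side.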