feat: add blog post for --enable-feature=use-uncached-io #2869
machine424 wants to merge 3 commits into prometheus:main
Conversation
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
nwanduka
left a comment
Thanks for sharing this @machine424. LGTM.
bwplotka
left a comment
Nice! Added some suggestions, but great explanation of this feature!
> Do you find yourself constantly looking up the difference between `container_memory_usage_bytes`, `container_memory_working_set_bytes`, and `container_memory_rss`? It gets worse when you pick the wrong one to set a memory limit, interpret benchmark results, or debug an OOMKilled container.
"It gets worse" is bit fuzzy on what do you mean
Suggested change:
> Do you find yourself constantly looking up the difference between `container_memory_usage_bytes`, `container_memory_working_set_bytes`, and `container_memory_rss`? Do you know which one to use for memory limits, benchmark result interpretation, or OOMKilled debugging?
> You're not alone. There is even a [9-year-old Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/43916) that captures the frustration of many others.
Suggested change:
> You're not alone. There is even a [9-year-old Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/43916) that captures the frustration of users.
> The explanation is simple: RAM is not used in just one way. One of the easiest things to miss is the page cache, and for some containers it can make up most of the memory usage, creating large gaps between those metrics.
Should we narrow down the OS? I'd add that this blog post applies to Linux only (both AMD and ARM).
> The [use-uncached-io](https://prometheus.io/docs/prometheus/latest/feature_flags/#use-uncached-io) feature flag was built for exactly this. Prometheus is a database and it does a lot of disk writes, but not every write benefits from the page cache. Compaction writes are a good example, because once written, that data is unlikely to be read again soon.
Tiny nit, but for the future: this flag could probably be called "uncached-io"; we don't prefix our flags with "enable-".
> The [use-uncached-io](https://prometheus.io/docs/prometheus/latest/feature_flags/#use-uncached-io) feature flag was built for exactly this. Prometheus is a database and it does a lot of disk writes, but not every write benefits from the page cache. Compaction writes are a good example, because once written, that data is unlikely to be read again soon.
This section mentions the page cache without actually defining what it is. Is it worth educating the reader about what the page cache is (or at least linking to Wikipedia, etc.)?
> To deal with that, a [`bufio.Writer`](https://pkg.go.dev/bufio#Writer)-like writer, [`directIOWriter`](https://github.com/prometheus/prometheus/blob/ac12e30f99df9d2f68025f0238c0aef95146e94b/tsdb/fileutil/direct_io_writer.go#L46), was implemented. On kernels `v6.1` or newer, Prometheus gets the exact alignment values from [statx](https://man7.org/linux/man-pages/man2/statx.2.html); otherwise, conservative defaults are used.
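As an illustration of the alignment bookkeeping such a writer has to do, here is a minimal, hypothetical Go sketch (not the actual `directIOWriter`, and assuming a reasonably recent `golang.org/x/sys/unix`): it asks `statx` for the direct I/O alignment via `STATX_DIOALIGN` where the kernel supports it, falls back to a conservative default otherwise, and pushes one properly aligned block through a file opened with `O_DIRECT`. The file name, block size, and helper functions are made up for the example.

```go
package main

import (
	"log"
	"os"
	"unsafe"

	"golang.org/x/sys/unix"
)

// Conservative fallback used when STATX_DIOALIGN is unavailable (pre-v6.1 kernels).
const fallbackAlign = 4096

// dioAlign returns the memory and offset alignment required for direct I/O on path.
func dioAlign(path string) (mem, off int) {
	var stx unix.Statx_t
	err := unix.Statx(unix.AT_FDCWD, path, 0, unix.STATX_DIOALIGN, &stx)
	// Older kernels either reject the mask or leave the fields zeroed.
	if err != nil || stx.Dio_mem_align == 0 || stx.Dio_offset_align == 0 {
		return fallbackAlign, fallbackAlign
	}
	return int(stx.Dio_mem_align), int(stx.Dio_offset_align)
}

// alignedBuf returns a size-byte slice whose first byte sits on an align-byte boundary.
func alignedBuf(size, align int) []byte {
	raw := make([]byte, size+align)
	shift := int(uintptr(unsafe.Pointer(&raw[0])) % uintptr(align))
	if shift != 0 {
		shift = align - shift
	}
	return raw[shift : shift+size]
}

func main() {
	// O_DIRECT bypasses the page cache; the filesystem must support it (ext4, xfs, ...).
	path := "direct-io-demo.bin"
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|unix.O_DIRECT, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	memAlign, offAlign := dioAlign(path)

	// Buffer address, write length, and file offset must all respect the alignment.
	buf := alignedBuf(offAlign, memAlign)
	copy(buf, []byte("chunk bytes would go here"))
	if _, err := f.Write(buf); err != nil {
		log.Fatal(err)
	}
}
```

Presumably the real writer buffers arbitrarily sized chunk data and flushes it in aligned blocks, which is what makes it `bufio.Writer`-like.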
> The `directIOWriter` is currently limited to chunk writes, but that is already a substantial amount of I/O. Benchmarks show a 20-50% reduction in page cache usage, as measured by `container_memory_cache`.
My first question is: does it have an impact on other metrics? On the performance of anything else? Is that useful to mention in this blog post?
> The [use-uncached-io](https://prometheus.io/docs/prometheus/latest/feature_flags/#use-uncached-io) feature flag was built for exactly this. Prometheus is a database and it does a lot of disk writes, but not every write benefits from the page cache. Compaction writes are a good example, because once written, that data is unlikely to be read again soon.
> Compaction writes are a good example, because once written, that data is unlikely to be read again soon.

Can we explain why it is unlikely? This data is used for long-term storage queries. It's worth mentioning that in practice, the majority of queries hit only the last 24h or even 1h.
> ### Experimenting with `RWF_DONTCACHE`
> Introduced in Linux kernel `v6.14`, `RWF_DONTCACHE` enables uncached buffered I/O, where data still goes through the page cache, but the corresponding pages are dropped afterwards. It would be worth benchmarking whether this can deliver similar benefits without direct I/O's alignment constraints.
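For comparison with the direct I/O sketch above, here is a small, hypothetical sketch of what an uncached buffered write could look like, using `pwritev2` with `RWF_DONTCACHE` on Linux `v6.14` or newer. The flag value is taken from the kernel uapi headers and hard-coded on the assumption that `golang.org/x/sys/unix` may not expose it yet; the file name and error handling are illustrative only, not how Prometheus would integrate it.

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// RWF_DONTCACHE asks the kernel to drop the written pages from the page cache
// once writeback completes. Hard-coded here (kernel uapi value) in case the
// constant is not yet exposed by golang.org/x/sys/unix.
const rwfDontCache = 0x00000080

func main() {
	f, err := os.OpenFile("uncached-demo.bin", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	data := []byte("written through the page cache, then dropped\n")

	// pwritev2 with RWF_DONTCACHE: buffered I/O without O_DIRECT's alignment
	// constraints. Kernels older than v6.14 reject the flag with an error.
	if _, err := unix.Pwritev2(int(f.Fd()), [][]byte{data}, 0, rwfDontCache); err != nil {
		log.Fatalf("uncached write failed (kernel too old?): %v", err)
	}
}
```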
> The explanation is simple: RAM is not used in just one way. One of the easiest things to miss is the page cache, and for some containers it can make up most of the memory usage, creating large gaps between those metrics.
Can we mention clearly that the page cache holds best-effort data: the moment the kernel needs memory for other processes, it can reclaim this cache. But on a large box with otherwise unused memory, memory can be marked as "used" up to the limit of that box, which can be scary and confusing, even though this memory can be reclaimed on demand.