Abhishek Sah

cAdvisor high cardinality

November 18, 2023

cAdvisor is a popular utility that provides resource usage and performance characteristics of running containers. It collects, aggregates and exports container metrics, and in Kubernetes clusters it comes integrated into the Kubelet binary, so we can collect cAdvisor metrics from the Kubelet API endpoint /metrics/cadvisor. These metrics are useful for understanding the resource consumption of your workloads, achieving optimal resource utilisation, debugging container issues and spotting usage trends. They are exposed in the Prometheus exposition format.

Here is an example metric for CPU Utilisation:

# HELP container_cpu_system_seconds_total Cumulative system cpu time consumed in seconds.
# TYPE container_cpu_system_seconds_total counter
container_cpu_system_seconds_total{container="",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod05b3ee07_a2bd_47e7_9599_7961201f20dd.slice",image="",name="",namespace="test",pod="test-workload-7756958c5b-qtkgc"} 22.65 1699453605805

Since neither cAdvisor nor the Kubelet stores these metrics, one cannot query historical data. Users should therefore periodically scrape these metrics and store them in an observability backend such as Prometheus or Grafana Mimir.

Scraping cAdvisor

Since one cAdvisor instance runs per node, we need to scrape all of them to get a complete picture of resource utilisation. At Pixxel, we deploy OpenTelemetry collectors as a DaemonSet to scrape metrics from pods running on all the nodes. That means each node has one OTEL collector pod scraping the cAdvisor metrics (along with other endpoints).
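
For reference, a stripped-down version of such a DaemonSet is sketched below. The names, namespace, image tag and config volume are illustrative assumptions, not our exact manifest; the collector's service account also needs RBAC to list nodes and reach the nodes/proxy sub-resource.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector            # hypothetical name
  namespace: monitoring           # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector    # needs get on nodes/proxy to reach /metrics/cadvisor
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.89.0   # any recent contrib build
          args: ["--config=/etc/otel/config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config      # holds the scrape config shown below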

The following snippet describes how you can scrape cAdvisor metrics using OTEL pods:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cadvisor
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - target_label: __address__
              replacement: kubernetes.default.svc.cluster.local:443
            - source_labels: [__meta_kubernetes_node_name]
              regex: (.+)
              target_label: __metrics_path__
              replacement: /api/v1/nodes/$$${1}/proxy/metrics/cadvisor

The above should look familiar if you have experience configuring Prometheus. The OTEL collector can be a drop-in replacement for Prometheus for scraping, with some added benefits such as environment variable support. We are scraping the cAdvisor target using Kubernetes node service discovery: kubernetes_sd_configs retrieves scrape targets from the Kubernetes REST API, and role: node discovers one target per cluster node, with the address defaulting to the Kubelet's HTTP port. We then explicitly set the address (host and port) and path of the scrape target to the local API server. I'll quote the Kubernetes documentation on how this address is formed:

The API server’s in-cluster address is also published to a Service named kubernetes in the default namespace so that pods may reference kubernetes.default.svc as a DNS name for the local API server.

Pods running in the cluster should authenticate with the API server using their service account credentials. An equivalent curl request would be:

curl -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://kubernetes.default.svc.cluster.local:443/api/v1/nodes/ip-10-1-4-5.us-west-1.compute.internal/proxy/metrics/cadvisor

PS: Kubernetes also provides a handy kubectl command to access the API server, which we can use to query cAdvisor metrics directly:

kubectl get --raw /api/v1/nodes/ip-10-1-4-5.us-west-1.compute.internal/proxy/metrics/cadvisor
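
The receivers block above only defines how metrics are pulled; to actually ship them to a backend like Mimir, the collector config also needs an exporter and a pipeline. A minimal sketch, assuming the prometheusremotewrite exporter from the contrib distribution and a placeholder endpoint (the real URL, auth and labels will differ):

exporters:
  prometheusremotewrite:
    endpoint: https://mimir.example.internal/api/v1/push   # placeholder; point at your remote-write URL
    external_labels:
      cluster: my-cluster                                   # cluster-wide labels added to every series

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]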

Metrics Cardinality

The metrics produced by cAdvisor sometimes exceed the label limit set by the observability server. For example, Grafana Mimir caps the number of label names on any series at 30 by default, via the config option max_label_names_per_series.
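
For reference, that limit sits under Mimir's limits block; a sketch with the default value, not our production config:

limits:
  max_label_names_per_series: 30   # default; series with more label names are rejected at ingestion

It can also be overridden per tenant through the runtime overrides file, but as described below we chose to trim labels rather than raise the limit.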

After observing the OTEL collector logs a few hours after setting up the cAdvisor scrape, I found that the Mimir server was rejecting multiple series because they had more labels than the acceptable limit; some series had 32 labels, others 35. I wanted to see which labels cAdvisor was publishing. For the container_oom_events_total metric, I saw the following labels:

beta_kubernetes_io_arch
beta_kubernetes_io_instance_type
beta_kubernetes_io_os
eks_amazonaws_com_capacityType
eks_amazonaws_com_nodegroup
eks_amazonaws_com_nodegroup_image
eks_amazonaws_com_sourceLaunchTemplateId
eks_amazonaws_com_sourceLaunchTemplateVersion
failure_domain_beta_kubernetes_io_region
failure_domain_beta_kubernetes_io_zone
id
k8s_io_cloud_provider_aws
kubernetes_io_arch
kubernetes_io_hostname
kubernetes_io_os
node_kubernetes_io_instance_type
topology_ebs_csi_aws_com_zone
topology_kubernetes_io_region
topology_kubernetes_io_zone

cAdvisor added these labels (and more) to every series, in addition to the global labels such as cluster name, AWS account, etc. that we add explicitly. In some metrics, the extra label count reached 25 or so. We quickly concluded that not all of them are required, so instead of raising server limits we could drop some labels without harming anything. Luckily, Prometheus provides a way to rewrite the labelset using metric_relabel_configs. We decided to drop all eks_ prefixed labels, since their values can be derived from the hostname alone. The way to do that is:

relabel_configs: 
  ...
  ...

metric_relabel_configs:
  - action: labeldrop
    regex: 'eks.*'

This change brought the label count down within acceptable limits. Practically, we may not even need the topology_ prefixed labels, but we did the least amount of label rewriting needed to get the system running. If needed in the future, we can look into further reducing the labelset using the same approach.
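
A sketch of what that further trimming could look like, for example also dropping the topology_ and deprecated beta_ labels; the exact set worth keeping would need validating against our dashboards and alerts first:

metric_relabel_configs:
  - action: labeldrop
    regex: 'eks.*'
  - action: labeldrop
    regex: 'topology_.*'            # zone/region labels, derivable from the hostname
  - action: labeldrop
    regex: 'beta_kubernetes_io_.*'  # deprecated duplicates of the kubernetes_io_* labels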
