Host & cluster metrics

Every Lighthouse agent reports the vitals of the host it runs on back to Status Harbor over the same outbound HTTPS connection it uses for monitor results. On Kubernetes, the DaemonSet flavour of the chart also covers every node in the cluster, plus per-PVC capacity and cluster-wide aggregates.

There is no separate collector to install, no second token to manage, and no extra port to open. If the agent is already running, metrics are already arriving.

The Metrics workspace in the console (console.statusharbor.io/metrics) is where charts, threshold rules, snoozes and firing alerts all live.

What the agent collects

Host metrics (every Lighthouse)

| Metric | Examples |
|---|---|
| CPU | cpu_busy_percent, plus user / system / iowait breakdown |
| Memory | mem_used_percent, mem_used_bytes, available, swap usage |
| Load | 1m, 5m and 15m runnable + uninterruptible counts |
| Disk | disk_used_percent per mount, used / free bytes, inode pressure, read / write throughput, IO busy |
| Network | per-interface receive and transmit, in bytes, packets and errors |

The source is /proc on the host (and on the node, when running as a Kubernetes DaemonSet pod with LIGHTHOUSE_PROC_ROOT=/host/proc). No process lists, environment variables or file contents are read or shipped.
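To make the /proc source concrete, here is a minimal sketch of how a figure like cpu_busy_percent can be derived from two samples of the aggregate cpu line in /proc/stat. The function names are hypothetical, and the formula (idle and iowait both counted as not busy) is an assumption for illustration; the agent's actual calculation may differ.

```shell
# Print the aggregate "cpu" line from /proc/stat. Reads LIGHTHOUSE_PROC_ROOT
# if set (for example /host/proc in a DaemonSet pod), else falls back to /proc.
read_cpu_line() {
  grep '^cpu ' "${LIGHTHOUSE_PROC_ROOT:-/proc}/stat"
}

# cpu_busy_percent PREV_LINE CUR_LINE
# Assumption: busy = everything except idle and iowait, over the sample window.
cpu_busy_percent() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      # Fields 2-9: user nice system idle iowait irq softirq steal.
      total = 0
      for (i = 2; i <= 9; i++) total += $i
      idle = $5 + $6
      if (NR == 1) { prev_total = total; prev_idle = idle; next }
      dt = total - prev_total
      di = idle - prev_idle
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * (dt - di) / dt
    }'
}

# Usage on a Linux host:
#   a=$(read_cpu_line); sleep 5; b=$(read_cpu_line)
#   cpu_busy_percent "$a" "$b"
```

The counters in /proc/stat are cumulative since boot, which is why the sketch works on the difference between two samples rather than on a single read.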

Kubernetes cluster metrics (DaemonSet flavour)

When you enable the DaemonSet and the k8sstats fan-out in the Helm chart, the central Lighthouse pod additionally queries the kubelet on each node and surfaces:

  • Per-node CPU / memory / disk / network - stamped with the node name so "which node is unhealthy?" is one filter away.
  • Per-PVC disk usage - one series per persistent volume claim, so cert-manager and Prometheus stores never run out of disk silently.
  • Cluster aggregates - whole-cluster CPU and memory headroom on one chart.

Per-node metrics use a k8s_node_* namespace; per-PVC metrics use k8s_pvc_*. Cardinality is capped so a single misbehaving cluster can't blow the budget for the rest of your fleet.

Turn it on

Shell (Linux / VMs)

Host metrics ship as soon as the agent is running. No extra flag, no second daemon - the same binary that drives your monitors streams /proc counters on the same outbound HTTPS connection:

curl -fsSL https://lighthouse.statusharbor.io/install.sh \
  | LIGHTHOUSE_TOKEN=<your-token> sh

The install script drops the binary at /usr/local/bin/lighthouse, registers a systemd unit (or launchd plist on macOS) and starts it. Re-running it upgrades in place.

Helm (Kubernetes)

Host metrics are on by default for the central pod too. To also turn on the DaemonSet (per-node host metrics) and the k8sstats fan-out (per-node, per-PVC and cluster aggregates), set two flags at install:

helm install lighthouse oci://ghcr.io/statusharbor/charts/lighthouse \
  --namespace status-harbor --create-namespace \
  --set-string token=<your-token> \
  --set k8sstats.enabled=true \
  --set daemonset.enabled=true

  • daemonset.enabled=true runs one agent pod per node with LIGHTHOUSE_ROLE=host_metrics and LIGHTHOUSE_PROC_ROOT=/host/proc, so per-host CPU / memory / disk / network reflect the node rather than the central pod's container.
  • k8sstats.enabled=true lets the central pod call the kubelet via the apiserver proxy to surface per-node and per-PVC capacity plus the cluster aggregates. The chart's …-k8sstats ClusterRole grants the required RBAC permission (get on nodes/proxy).
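If you prefer a values file to --set flags, the same three settings can be written out as YAML. This is a sketch: the helper name and file name are hypothetical, the keys are simply the nested form of the --set paths above, and it assumes the token contains no double quotes or backslashes.

```shell
# write_lighthouse_values FILE TOKEN
# Writes a Helm values file equivalent to the --set flags shown above.
write_lighthouse_values() {
  cat > "$1" <<EOF
token: "$2"
k8sstats:
  enabled: true
daemonset:
  enabled: true
EOF
  # The file holds the token, so restrict it to the current user.
  chmod 600 "$1"
}

# Usage:
#   write_lighthouse_values lighthouse-values.yaml "<your-token>"
#   helm install lighthouse oci://ghcr.io/statusharbor/charts/lighthouse \
#     --namespace status-harbor --create-namespace \
#     -f lighthouse-values.yaml
```

Quoting the token in YAML plays the same role as --set-string: it keeps Helm from reinterpreting an all-digit token as a number.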

Already running the chart? Flip the flags in place without re-keying the token:

helm upgrade lighthouse oci://ghcr.io/statusharbor/charts/lighthouse \
  --namespace status-harbor --reuse-values \
  --set k8sstats.enabled=true \
  --set daemonset.enabled=true
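
After the upgrade, one way to confirm the DaemonSet has a ready agent pod on every node is to compare the desired and ready counts that Kubernetes reports. The sketch below uses standard kubectl output fields; the function name is hypothetical, and it checks every DaemonSet in the namespace rather than assuming what the chart names its DaemonSet.

```shell
# daemonsets_ready [NAMESPACE]
# Succeeds only if every DaemonSet in the namespace has all desired pods ready.
daemonsets_ready() {
  kubectl -n "${1:-status-harbor}" get daemonsets \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.desiredNumberScheduled}{" "}{.status.numberReady}{"\n"}{end}' |
  awk '
    # Each input line is: name desired ready
    $2 != $3 { bad = 1; print "not ready: " $1 " (" $3 "/" $2 ")" }
    END {
      if (NR == 0) { print "no DaemonSets found"; exit 1 }
      exit bad + 0
    }'
}

# Usage:
#   daemonsets_ready status-harbor && echo "agents ready on every node"
```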

Docker and Terraform

See Install Lighthouse for the Docker one-liner and the Terraform deployment modules. Host metrics ride along the same way - no extra flag.

The Metrics workspace

The workspace has four tabs.

Overview

Per-host or per-cluster charts for every metric above. Pick a time range (15 minutes through 7 days), narrow to one Lighthouse, or compare across the fleet. Auto-refresh is on by default at 30 seconds and can be paused from the top-right toggle.

Rules

Threshold-alert rules. Each rule is:

  • Name - what shows up in the alert payload.
  • Metric - any host or cluster metric (e.g. cpu_busy_percent, mem_used_percent, disk_used_percent).
  • Comparator + threshold - >, <, >=, <=, == against a number.
  • For - how long the breach has to hold before the alert opens. This protects against one-minute spikes.
  • Severity - warning or critical. Status Harbor delivers every firing alert to every wired channel; severity drives label / colour, not routing.
  • Scope - all Lighthouses or pinned to one (useful for a database host with a different profile from your app servers).

The new-rule dialog previews the last hour of real data from the agent the rule will target. Drag the threshold up and down to see whether your number would have been noisy; a red marker appears the moment the breach holds for the duration you picked.
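
The interplay of comparator, threshold and For can be sketched as a small filter over timestamped samples. This is an illustration of the semantics described above, not Status Harbor's implementation: the function name and the input format (one "epoch_seconds value" pair per line, oldest first) are assumptions.

```shell
# first_sustained_breach COMPARATOR THRESHOLD FOR_SECONDS < samples
# Prints the timestamp at which the breach has held for FOR_SECONDS.
# Exits non-zero if no breach lasts that long.
first_sustained_breach() {
  awk -v cmp="$1" -v thr="$2" -v hold="$3" '
    function breached(v) {
      if (cmp == ">")  return (v >  thr + 0)
      if (cmp == "<")  return (v <  thr + 0)
      if (cmp == ">=") return (v >= thr + 0)
      if (cmp == "<=") return (v <= thr + 0)
      if (cmp == "==") return (v == thr + 0)
      return 0
    }
    {
      # Any sample back under the threshold resets the clock.
      if (!breached($2 + 0)) { in_breach = 0; next }
      if (!in_breach) { in_breach = 1; since = $1 }
      if ($1 - since >= hold) { print $1; found = 1; exit }
    }
    END { if (!found) exit 1 }'
}

# Usage: a rule like "cpu_busy_percent > 90 for 10m" over a samples file
#   first_sustained_breach '>' 90 600 < samples.txt
```

A single spike never opens an alert under this model, because the first sample that falls back under the threshold resets the timer.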

Default rules

Every new team gets five rules seeded automatically. You can edit them like any other rule, or reset to defaults from the button at the top of the tab.

| Name | Condition | For | Severity |
|---|---|---|---|
| CPU above 90% | cpu_busy_percent > 90 | 10m | warning |
| Disk above 85% | disk_used_percent > 85 | 5m | warning |
| Disk above 95% | disk_used_percent > 95 | 5m | critical |
| Memory above 90% | mem_used_percent > 90 | 5m | warning |
| Memory above 95% | mem_used_percent > 95 | 5m | critical |

Snoozes

Silence a noisy host without disabling the rule for the rest of your fleet. A snooze can be:

  • Temporary - one hour, one day, a week, or a custom expiry.
  • Indefinite - until you explicitly unmute it.

Scope a snooze to one rule + one host, or to a host across every rule. Resolution notifications always pass through, so pre-existing incidents close cleanly even while the snooze is active.

Use a snooze instead of cloning a rule when only one host is flapping. Use a per-Lighthouse rule when the host genuinely needs a different threshold.
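
The snooze rules above boil down to a short decision. The sketch below is one way to express it; the function name, the argument order, the "*" marker for "every rule" and the "never" marker for an indefinite snooze are all assumptions made for illustration.

```shell
# notification_suppressed RULE HOST STATUS NOW SNOOZE_RULE SNOOZE_HOST SNOOZE_EXPIRY
#   STATUS:        firing | resolved
#   NOW:           current time in epoch seconds
#   SNOOZE_RULE:   a rule name, or "*" for every rule on the host
#   SNOOZE_EXPIRY: epoch seconds, or "never" for an indefinite snooze
# Returns 0 if the notification should be silenced, 1 if it should be delivered.
notification_suppressed() {
  rule=$1 host=$2 status=$3 now=$4
  snooze_rule=$5 snooze_host=$6 snooze_expiry=$7

  # Resolutions always pass through so open incidents can close.
  [ "$status" = "resolved" ] && return 1
  # A snooze is always pinned to one host.
  [ "$snooze_host" = "$host" ] || return 1
  # It covers either one rule or every rule on that host.
  [ "$snooze_rule" = "*" ] || [ "$snooze_rule" = "$rule" ] || return 1
  # Indefinite snoozes never expire; temporary ones end at their expiry.
  [ "$snooze_expiry" = "never" ] || [ "$now" -lt "$snooze_expiry" ]
}
```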

Alerts

Timeline of every threshold breach that has fired on your team's Lighthouses. Columns: rule name, severity, status (firing / resolved), host and start time. Filter by status or severity from the dropdowns at the top.

Alerts open automatically when a threshold trips and close automatically when the underlying metric clears. There is no manual acknowledge here on purpose - the metric clearing is the acknowledge. If a metric goes back to healthy before you saw the notification, that's exactly what the alerts timeline is for.

Alerts join the existing pipeline

Metric breaches fire through the same delivery channels as your uptime alerts: Slack, Telegram, email or your webhook. The payload shape is the same too, with the metric name, the threshold, the observed value and the host that tripped it. One Slack channel for both flavours of incident, not two.

Retention

Metric chart history is kept for 7 days for every team. Firing alerts and their resolution timestamps are kept independently in the alerts timeline.

Multi-instance Lighthouses

When you run the Helm chart with daemonset.enabled=true, each node hosts its own agent pod that shares the Lighthouse's lighthouse:write token. The agents register independently and appear in the Active Agents table on the Lighthouse detail page. Per-node metrics carry the node name so a noisy worker is easy to isolate from a healthy control-plane node.

A per-node watchdog opens an alert if a node stops reporting for 60 seconds (EventLighthouseAgentOffline) and closes it on the next successful heartbeat (EventLighthouseAgentRecovered).
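
The watchdog behaviour can be pictured as a two-state check per node. This is a sketch of the logic as described, not the service's code: the function name and arguments are hypothetical, and treating "exactly 60 seconds of silence" as offline is an assumption.

```shell
# watchdog_event LAST_HEARTBEAT NOW WAS_OFFLINE
#   LAST_HEARTBEAT, NOW: epoch seconds
#   WAS_OFFLINE:         1 if an offline alert is already open for the node, else 0
# Prints the event to emit, or nothing if the node's state is unchanged.
watchdog_event() {
  silence=$(( $2 - $1 ))
  if [ "$silence" -ge 60 ] && [ "$3" -eq 0 ]; then
    echo EventLighthouseAgentOffline
  elif [ "$silence" -lt 60 ] && [ "$3" -eq 1 ]; then
    echo EventLighthouseAgentRecovered
  fi
}

# Usage:
#   watchdog_event "$last_heartbeat" "$(date +%s)" 0
```

Tracking whether an alert is already open keeps the check from emitting a duplicate offline event on every pass while a node stays silent.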