Host & cluster metrics

Every Lighthouse agent reports the vitals of the host it runs on back to Status Harbor over the same outbound HTTPS connection it uses for monitor results. On Kubernetes, the DaemonSet flavour of the chart also covers every node in the cluster, plus per-PVC capacity and cluster-wide aggregates.

There is no separate collector to install, no second token to manage, and no extra port to open. If the agent is already running, metrics are already arriving.

The Metrics workspace in the console (console.statusharbor.io/metrics) is where charts, threshold rules, snoozes and firing alerts all live.

What the agent collects

Host metrics (every Lighthouse)

Metric	Examples
CPU	`cpu_busy_percent`, plus user / system / iowait breakdown
Memory	`mem_used_percent`, `mem_used_bytes`, available, swap usage
Load	1m, 5m and 15m load averages (runnable + uninterruptible tasks)
Disk	`disk_used_percent` per mount, used / free bytes, inode pressure, read / write throughput, IO busy
Network	per-interface receive and transmit, in bytes, packets and errors

The source is /proc on the host (and on the node, when running as a Kubernetes DaemonSet pod with LIGHTHOUSE_PROC_ROOT=/host/proc). No process lists, environment variables or file contents are read or shipped.
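As a rough illustration of what reading /proc involves, the sketch below derives a busy-CPU percentage from two samples of the aggregate cpu line in /proc/stat and a used-memory percentage from /proc/meminfo. It is not the agent's implementation. The function names are hypothetical, and two choices are assumptions: iowait is treated as idle time, and used memory is computed as MemTotal minus MemAvailable. Only the LIGHTHOUSE_PROC_ROOT variable comes from this page.

```python
import os

# LIGHTHOUSE_PROC_ROOT is the variable named above; defaulting to /proc is an assumption.
PROC_ROOT = os.environ.get("LIGHTHOUSE_PROC_ROOT", "/proc")


def read_cpu_line():
    """Read the aggregate 'cpu' line, which is the first line of <proc root>/stat."""
    with open(os.path.join(PROC_ROOT, "stat")) as f:
        return f.readline()


def parse_cpu_line(line):
    """Return (busy, total) jiffies from the aggregate 'cpu' line."""
    # Columns after the label: user nice system idle iowait irq softirq steal.
    # Guest columns are skipped because the kernel already counts them in user/nice.
    fields = [int(x) for x in line.split()[1:9]]
    fields += [0] * (8 - len(fields))  # older kernels report fewer columns
    idle, iowait = fields[3], fields[4]
    total = sum(fields)
    return total - idle - iowait, total  # assumption: iowait is not "busy"


def cpu_busy_percent(before, after):
    """Busy share of CPU time between two samples of the 'cpu' line."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    return 0.0 if elapsed <= 0 else 100.0 * (busy1 - busy0) / elapsed


def mem_used_percent(meminfo_text):
    """Used memory as a percentage, from the text of <proc root>/meminfo."""
    values = {}
    for row in meminfo_text.splitlines():
        key, _, rest = row.partition(":")
        if rest.strip():
            values[key] = int(rest.split()[0])  # most values are reported in kB
    total, available = values["MemTotal"], values["MemAvailable"]
    return 100.0 * (total - available) / total
```

The counters in /proc/stat are cumulative since boot, so a percentage only makes sense as the difference between two samples taken some seconds apart.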

Kubernetes cluster metrics (DaemonSet flavour)

When you enable k8sstats in the Helm chart (k8sstats.enabled=true, normally set alongside the DaemonSet), the central Lighthouse pod additionally fans out to the kubelet on each node through the apiserver proxy and surfaces:

Per-node CPU / memory / disk / network - stamped with the node name so "which node is unhealthy?" is one filter away.
Per-PVC disk usage - one series per persistent volume claim, so cert-manager and Prometheus stores never run out of disk silently.
Cluster aggregates - whole-cluster CPU and memory headroom on one chart.

Per-node metrics use a k8s_node_* namespace; per-PVC metrics use k8s_pvc_*. Cardinality is capped so a single misbehaving cluster can't blow the budget for the rest of your fleet.

Turn it on

Shell (Linux / VMs)

Host metrics ship as soon as the agent is running. No extra flag, no second daemon - the same binary that drives your monitors streams /proc counters on the same outbound HTTPS connection:

curl -fsSL https://lighthouse.statusharbor.io/install.sh \
  | LIGHTHOUSE_TOKEN=<your-token> sh

The install script drops the binary at /usr/local/bin/lighthouse, registers a systemd unit (or launchd plist on macOS) and starts it. Re-running it upgrades in place.

Helm (Kubernetes)

Host metrics are on by default for the central pod too. To also turn on the DaemonSet (per-node host metrics) and the k8sstats fan-out (per-node, per-PVC and cluster aggregates), set two flags at install:

helm install lighthouse oci://ghcr.io/statusharbor/charts/lighthouse \
  --namespace status-harbor --create-namespace \
  --set-string token=<your-token> \
  --set k8sstats.enabled=true \
  --set daemonset.enabled=true

daemonset.enabled=true runs one agent pod per node with LIGHTHOUSE_ROLE=host_metrics and LIGHTHOUSE_PROC_ROOT=/host/proc, so per-host CPU / memory / disk / network reflect the node rather than the central pod's container.
k8sstats.enabled=true lets the central pod call the kubelet via the apiserver proxy to surface per-node and per-PVC capacity plus the cluster aggregates. The chart's …-k8sstats ClusterRole grants the required RBAC permission: get on nodes/proxy.

Already running the chart? Flip the flags in place without re-keying the token:

helm upgrade lighthouse oci://ghcr.io/statusharbor/charts/lighthouse \
  --namespace status-harbor --reuse-values \
  --set k8sstats.enabled=true \
  --set daemonset.enabled=true

Docker and Terraform

See Install Lighthouse for the Docker one-liner and the Terraform deployment modules. Host metrics ride along the same way - no extra flag.

The Metrics workspace

The workspace has four tabs: Overview, Rules, Snoozes and Alerts.

Overview

Per-host or per-cluster charts for every metric above. Pick a time range (15 minutes through 7 days), narrow to one Lighthouse, or compare across the fleet. Auto-refresh is on by default at 30 seconds and can be paused from the top-right toggle.

Rules

Threshold-alert rules. Each rule is:

Name - what shows up in the alert payload.
Metric - any host or cluster metric (e.g. cpu_busy_percent, mem_used_percent, disk_used_percent).
Comparator + threshold - >, <, >=, <=, == against a number.
For - how long the breach has to hold before the alert opens. This guards against short spikes, such as a single noisy minute.
Severity - warning or critical. Status Harbor delivers every firing alert to every wired channel; severity drives label / colour, not routing.
Scope - all Lighthouses or pinned to one (useful for a database host with a different profile from your app servers).

The new-rule dialog previews the last hour of real data from the agent the rule will target. Drag the threshold up and down to see whether your number would have been noisy; a red marker appears the moment the breach holds for the duration you picked.
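To make the interaction between threshold and For concrete, here is a minimal sketch of how a preview like this could locate the moment a breach has held long enough. It is an illustration, not Status Harbor's evaluator: the function name and sample format are hypothetical, and it assumes that any non-breaching sample resets the clock.

```python
import operator

# The five comparators listed for rules.
COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
}


def first_firing_time(samples, comparator, threshold, for_seconds):
    """Return the timestamp at which a breach has held for `for_seconds`, or None.

    `samples` is a list of (unix_seconds, value) pairs sorted by time.
    """
    breaches = COMPARATORS[comparator]
    breach_started = None
    for ts, value in samples:
        if breaches(value, threshold):
            if breach_started is None:
                breach_started = ts  # first breaching sample of this run
            if ts - breach_started >= for_seconds:
                return ts
        else:
            breach_started = None  # assumption: any recovery resets the clock
    return None
```

With one sample per minute, a rule of `> 90` held for five minutes only fires once six consecutive samples breach; a single spike followed by a healthy sample never does.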

Default rules

Every new team gets five rules seeded automatically. You can edit them like any other rule, or reset to defaults from the button at the top of the tab.

Name	Condition	For	Severity
CPU above 90%	`cpu_busy_percent > 90`	10m	warning
Disk above 85%	`disk_used_percent > 85`	5m	warning
Disk above 95%	`disk_used_percent > 95`	5m	critical
Memory above 90%	`mem_used_percent > 90`	5m	warning
Memory above 95%	`mem_used_percent > 95`	5m	critical
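The table above can also be read as data. The sketch below copies the five seeded rules verbatim and checks which conditions a single sample meets, ignoring the For duration. The tuple layout, helper names and sample values are hypothetical and only serve to show how the two disk rules and the two memory rules stack as warning and critical tiers.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "==": operator.eq}

# (name, condition, for, severity), copied from the default-rules table.
DEFAULT_RULES = [
    ("CPU above 90%", "cpu_busy_percent > 90", "10m", "warning"),
    ("Disk above 85%", "disk_used_percent > 85", "5m", "warning"),
    ("Disk above 95%", "disk_used_percent > 95", "5m", "critical"),
    ("Memory above 90%", "mem_used_percent > 90", "5m", "warning"),
    ("Memory above 95%", "mem_used_percent > 95", "5m", "critical"),
]


def parse_condition(condition):
    """Split 'metric > 90' into (metric, comparator function, threshold)."""
    metric, op, threshold = condition.split()
    return metric, OPS[op], float(threshold)


def parse_duration(text):
    """Convert '10m' to 600 seconds; only the minute suffix used in the table is handled."""
    if not text.endswith("m"):
        raise ValueError(f"unsupported duration: {text!r}")
    return int(text[:-1]) * 60


def breached_rules(sample):
    """Names of default rules whose condition the sample meets right now."""
    names = []
    for name, condition, _for, _severity in DEFAULT_RULES:
        metric, op, threshold = parse_condition(condition)
        if metric in sample and op(sample[metric], threshold):
            names.append(name)
    return names
```

A host at 96% disk usage meets both disk conditions, so once each rule's For duration has elapsed it would carry a warning and a critical alert at the same time.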

Snoozes

Silence a noisy host without disabling the rule for the rest of your fleet. A snooze can be:

Temporary - one hour, one day, a week, or a custom expiry.
Indefinite - until you explicitly unmute it.

Scope a snooze to one rule + one host, or to a host across every rule. Resolution notifications always pass through, so incidents that opened before the snooze close cleanly even while it is active.

Use a snooze instead of cloning a rule when only one host is flapping. Use a per-Lighthouse rule when the host genuinely needs a different threshold.
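The scoping rules for snoozes can be summarised as a small matching function. This is a sketch under stated assumptions, not the product's code: the Snooze fields, the should_deliver name and the use of Unix timestamps are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snooze:
    host: str
    rule: Optional[str] = None          # None = every rule on this host
    expires_at: Optional[float] = None  # None = indefinite, until unmuted


def should_deliver(snoozes, host, rule, status, now):
    """Decide whether a notification for (host, rule) goes out to the wired channels."""
    if status == "resolved":
        return True  # resolutions always pass through
    for s in snoozes:
        active = s.expires_at is None or now < s.expires_at
        if active and s.host == host and s.rule in (None, rule):
            return False  # a matching, unexpired snooze silences the firing alert
    return True
```

A snooze never touches the rule itself, which is why the same rule keeps firing for every other host in the fleet.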

Alerts

Timeline of every threshold breach that has fired on your team's Lighthouses. Columns: rule name, severity, status (firing / resolved), host and start time. Filter by status or severity from the dropdowns at the top.

Alerts open automatically when a threshold trips and close automatically when the underlying metric clears. There is no manual acknowledge here on purpose - the metric clearing is the acknowledge. If a metric goes back to healthy before you saw the notification, that's exactly what the alerts timeline is for.
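The open-and-close lifecycle can be pictured as a two-state machine driven only by the metric. The sketch below is hypothetical: it handles a single `>` rule, and it assumes the alert resolves on the first sample that no longer breaches, which this page does not spell out.

```python
def alert_timeline(samples, threshold, for_seconds):
    """Return ('firing' | 'resolved', timestamp) transitions for a `> threshold` rule.

    `samples` is a list of (unix_seconds, value) pairs sorted by time.
    """
    events = []
    breach_started = None
    firing = False
    for ts, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = ts
            if not firing and ts - breach_started >= for_seconds:
                firing = True
                events.append(("firing", ts))
        else:
            breach_started = None
            if firing:
                # Assumption: the first healthy sample closes the alert.
                firing = False
                events.append(("resolved", ts))
    return events
```

No step in this loop waits for a human, which mirrors the absence of a manual acknowledge in the Alerts tab.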

Alerts join the existing pipeline

Metric breaches fire through the same delivery channels as your uptime alerts: Slack, Telegram, email or your webhook. The payload shape is the same too, with the metric name, the threshold, the observed value and the host that tripped it. One Slack channel for both flavours of incident, not two.

Retention

Metric chart history is kept for 7 days for every team. Firing alerts and their resolution timestamps are kept independently in the alerts timeline.

Multi-instance Lighthouses

When you run the Helm chart with daemonset.enabled=true, each node hosts its own agent pod that shares the Lighthouse's lighthouse:write token. The agents register independently and appear in the Active Agents table on the Lighthouse detail page. Per-node metrics carry the node name so a noisy worker is easy to isolate from a healthy control-plane node.

A per-node watchdog opens an alert if a node stops reporting for 60 seconds (EventLighthouseAgentOffline) and closes it on the next successful heartbeat (EventLighthouseAgentRecovered).
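A minimal sketch of that watchdog logic follows. The 60-second limit and the two event names come from the paragraph above; the NodeWatchdog class, its methods and the in-memory bookkeeping are hypothetical and say nothing about how Status Harbor implements it.

```python
OFFLINE_AFTER_SECONDS = 60  # the silence limit stated above


class NodeWatchdog:
    def __init__(self):
        self.last_seen = {}   # node name -> timestamp of last heartbeat
        self.offline = set()  # nodes with an open offline alert

    def heartbeat(self, node, now):
        """Record a heartbeat; return a recovery event if the node was offline."""
        self.last_seen[node] = now
        if node in self.offline:
            self.offline.discard(node)
            return [("EventLighthouseAgentRecovered", node)]
        return []

    def check(self, now):
        """Return offline events for nodes that have been silent for 60 seconds."""
        events = []
        for node, seen in sorted(self.last_seen.items()):
            if node not in self.offline and now - seen >= OFFLINE_AFTER_SECONDS:
                self.offline.add(node)
                events.append(("EventLighthouseAgentOffline", node))
        return events
```

Tracking the offline set separately means each outage produces exactly one offline event and one recovery event, however often the check runs in between.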

Install Lighthouse - get the agent running.
What the agent sends - the full list of what does and does not leave your network.
Notifications - wire up Slack, Telegram, email or a webhook to receive metric alerts.
Plans & limits - how Lighthouse count and retention map to plans.