Host & cluster metrics
Every Lighthouse agent reports the vitals of the host it runs on back to Status Harbor over the same outbound HTTPS connection it uses for monitor results. On Kubernetes, the DaemonSet flavour of the chart also covers every node in the cluster, plus per-PVC capacity and cluster-wide aggregates.
There is no separate collector to install, no second token to manage, and no extra port to open. If the agent is already running, metrics are already arriving.
The Metrics workspace in the console (console.statusharbor.io/metrics) is where charts, threshold rules, snoozes and firing alerts all live.
What the agent collects
Host metrics (every Lighthouse)
| Metric | Examples |
|---|---|
| CPU | cpu_busy_percent, plus user / system / iowait breakdown |
| Memory | mem_used_percent, mem_used_bytes, available, swap usage |
| Load | 1m, 5m and 15m load averages (runnable + uninterruptible task counts) |
| Disk | disk_used_percent per mount, used / free bytes, inode pressure, read / write throughput, IO busy |
| Network | per-interface receive and transmit, in bytes, packets and errors |
The source is /proc on the host (and on the node, when running as
a Kubernetes DaemonSet pod with LIGHTHOUSE_PROC_ROOT=/host/proc).
No process lists, environment variables or file contents are read
or shipped.
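To make the /proc source concrete, here is a minimal shell sketch that derives a memory-used percentage from a meminfo-style file. The function name and the formula (MemTotal minus MemAvailable, divided by MemTotal) are assumptions for illustration; the agent's exact calculation is not documented on this page.

```shell
# Hypothetical helper, not part of the Lighthouse agent.
# Reads a meminfo-style file (default: /proc/meminfo) and prints the
# share of memory in use, assuming used = MemTotal - MemAvailable.
mem_used_percent() {
  awk '
    /^MemTotal:/     { total = $2 }   # value in kB
    /^MemAvailable:/ { avail = $2 }   # value in kB
    END { if (total > 0) printf "%.1f\n", (total - avail) * 100 / total }
  ' "${1:-/proc/meminfo}"
}
```

Called with no argument it reads /proc/meminfo on the local host. Inside a DaemonSet pod the equivalent file would sit under the mounted proc root, for example mem_used_percent /host/proc/meminfo.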
Kubernetes cluster metrics (DaemonSet flavour)
When you enable the DaemonSet and the k8sstats fan-out in the Helm chart, the central Lighthouse pod additionally fans out to the kubelet on each node and surfaces:
- Per-node CPU / memory / disk / network - stamped with the node name so "which node is unhealthy?" is one filter away.
- Per-PVC disk usage - one series per persistent volume claim, so cert-manager and Prometheus stores never run out of disk silently.
- Cluster aggregates - whole-cluster CPU and memory headroom on one chart.
Per-node metrics use a k8s_node_* namespace; per-PVC metrics use
k8s_pvc_*. Cardinality is capped so a single misbehaving cluster
can't blow the budget for the rest of your fleet.
Turn it on
Shell (Linux / VMs)
Host metrics ship as soon as the agent is running. No extra flag,
no second daemon - the same binary that drives your monitors
streams /proc counters on the same outbound HTTPS connection:
curl -fsSL https://lighthouse.statusharbor.io/install.sh \
| LIGHTHOUSE_TOKEN=<your-token> sh
The install script drops the binary at /usr/local/bin/lighthouse,
registers a systemd unit (or launchd plist on macOS) and starts it.
Re-running it upgrades in place.
Helm (Kubernetes)
Host metrics are on by default for the central pod too. To also turn on the DaemonSet (per-node host metrics) and the k8sstats fan-out (per-node, per-PVC and cluster aggregates), set two flags at install:
helm install lighthouse oci://ghcr.io/statusharbor/charts/lighthouse \
--namespace status-harbor --create-namespace \
--set-string token=<your-token> \
--set k8sstats.enabled=true \
--set daemonset.enabled=true
- daemonset.enabled=true runs one agent pod per node with LIGHTHOUSE_ROLE=host_metrics and LIGHTHOUSE_PROC_ROOT=/host/proc, so per-host CPU / memory / disk / network reflect the node rather than the central pod's container.
- k8sstats.enabled=true lets the central pod call the kubelet via the apiserver proxy to surface per-node and per-PVC capacity plus the cluster aggregates. The chart's …-k8sstats ClusterRole grants the required nodes/proxy: get RBAC.
Already running the chart? Flip the flags in place without re-keying the token:
helm upgrade lighthouse oci://ghcr.io/statusharbor/charts/lighthouse \
--namespace status-harbor --reuse-values \
--set k8sstats.enabled=true \
--set daemonset.enabled=true
Docker and Terraform
See Install Lighthouse for the Docker one-liner and the Terraform deployment modules. Host metrics ride along the same way - no extra flag.
The Metrics workspace
The workspace has four tabs.
Overview
Per-host or per-cluster charts for every metric above. Pick a time range (15 minutes through 7 days), narrow to one Lighthouse, or compare across the fleet. Auto-refresh is on by default at 30 seconds and can be paused from the top-right toggle.
Rules
Threshold-alert rules. Each rule is:
- Name - what shows up in the alert payload.
- Metric - any host or cluster metric (e.g. cpu_busy_percent, mem_used_percent, disk_used_percent).
- Comparator + threshold - >, <, >=, <=, == against a number.
- For - how long the breach has to hold before the alert opens. Spike-protects against one-minute noise.
- Severity - warning or critical. Status Harbor delivers every firing alert to every wired channel; severity drives label / colour, not routing.
- Scope - all Lighthouses or pinned to one (useful for a database host with a different profile from your app servers).
The new-rule dialog previews the last hour of real data from the agent the rule will target. Drag the threshold up and down to see whether your number would have been noisy; a red marker appears the moment the breach holds for the duration you picked.
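How comparator, threshold and For combine can be sketched in shell. This is an illustrative model only: the function name rule_state, the one-sample-per-line input and the idea of counting consecutive breaching samples are assumptions, not Status Harbor's actual evaluator. If samples arrived once a minute (also an assumption), a For of 5m would correspond to five consecutive breaches.

```shell
# Hypothetical model of a threshold rule with a "For" hold period.
# Usage: <samples, one per line> | rule_state COMPARATOR THRESHOLD SAMPLES_REQUIRED
# Prints "firing" if the most recent SAMPLES_REQUIRED samples all breach, else "ok".
rule_state() {
  awk -v op="$1" -v t="$2" -v need="$3" '
    BEGIN { t += 0; need += 0 }          # force numeric comparison
    function breach(v) {
      if (op == ">")  return v >  t
      if (op == "<")  return v <  t
      if (op == ">=") return v >= t
      if (op == "<=") return v <= t
      return v == t                      # "=="
    }
    # A single healthy sample resets the run, which is what filters out spikes.
    { if (breach($1 + 0)) run++; else run = 0 }
    END { if (run >= need) print "firing"; else print "ok" }
  '
}
```

For example, printf '95\n80\n92\n' | rule_state '>' 90 3 prints ok because the dip to 80 resets the run, while three breaching samples in a row would print firing.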
Default rules
Every new team gets five rules seeded automatically. You can edit them like any other rule, or reset to defaults from the button at the top of the tab.
| Name | Condition | For | Severity |
|---|---|---|---|
| CPU above 90% | cpu_busy_percent > 90 | 10m | warning |
| Disk above 85% | disk_used_percent > 85 | 5m | warning |
| Disk above 95% | disk_used_percent > 95 | 5m | critical |
| Memory above 90% | mem_used_percent > 90 | 5m | warning |
| Memory above 95% | mem_used_percent > 95 | 5m | critical |
Snoozes
Silence a noisy host without disabling the rule for the rest of your fleet. A snooze can be:
- Temporary - one hour, one day, a week, or a custom expiry.
- Indefinite - until you explicitly unmute it.
Scope a snooze to one rule + one host, or to a host across every rule. Resolved alerts always pass through, so pre-existing incidents close cleanly even while the snooze is active.
Use a snooze instead of cloning a rule when only one host is flapping. Use a per-Lighthouse rule when the host genuinely needs a different threshold.
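The snooze scoping and the resolved-alert pass-through can be modelled in a few lines of shell. The function name, the argument order and the use of * for "every rule" are hypothetical; this sketches the decision logic only, not how Status Harbor stores or evaluates snoozes.

```shell
# Hypothetical decision helper for a single snooze.
# $1 = alert status (firing|resolved), $2 = alert rule name, $3 = alert host
# $4 = snoozed host, $5 = snoozed rule name, or "*" for every rule on that host
snooze_decision() {
  if [ "$1" = "resolved" ]; then
    echo deliver        # resolved alerts always pass through
  elif [ "$3" = "$4" ] && { [ "$5" = "*" ] || [ "$5" = "$2" ]; }; then
    echo suppress       # firing alert on the snoozed host, rule in scope
  else
    echo deliver        # different host, or a rule the snooze does not cover
  fi
}
```

For instance, snooze_decision firing "Disk above 85%" db-1 db-1 '*' prints suppress, while the same call with resolved as the status prints deliver.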
Alerts
Timeline of every threshold breach that has fired on your team's Lighthouses. Columns: rule name, severity, status (firing / resolved), host and start time. Filter by status or severity from the dropdowns at the top.
Alerts open automatically when a threshold trips and close automatically when the underlying metric clears. There is no manual acknowledge here on purpose - the metric clearing is the acknowledge. If a check goes back to healthy and you didn't see the page, that's exactly what the alerts timeline is for.
Alerts join the existing pipeline
Metric breaches fire through the same delivery channels as your uptime alerts: Slack, Telegram, email or your webhook. The payload shape is the same too, with the metric name, the threshold, the observed value and the host that tripped it. One Slack channel for both flavours of incident, not two.
Retention
Metric chart history is kept for 7 days for every team. Firing alerts and their resolution timestamps are kept independently in the alerts timeline.
Multi-instance Lighthouses
When you run the Helm chart with daemonset.enabled=true, each
node hosts its own agent pod that shares the Lighthouse's
lighthouse:write token. The agents register independently and
appear in the Active Agents table on the Lighthouse detail
page. Per-node metrics carry the node name so a noisy worker is
easy to isolate from a healthy control-plane node.
A per-node watchdog opens an alert if a node stops reporting for
60 seconds (EventLighthouseAgentOffline) and closes it on the
next successful heartbeat (EventLighthouseAgentRecovered).
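The watchdog behaviour amounts to a timeout on the last heartbeat. A minimal sketch, assuming timestamps in epoch seconds: the function name, the arguments and the choice to treat exactly 60 seconds of silence as offline are assumptions, and only the two event names come from the description above.

```shell
# Hypothetical per-node watchdog step.
# $1 = last heartbeat (epoch seconds), $2 = now (epoch seconds)
# $3 = previous state of the node (online|offline)
# Prints an event name only when the state changes; prints nothing otherwise.
watchdog_event() {
  if [ $(( $2 - $1 )) -ge 60 ]; then
    [ "$3" = "online" ] && echo EventLighthouseAgentOffline
  else
    [ "$3" = "offline" ] && echo EventLighthouseAgentRecovered
  fi
  return 0
}
```

Emitting an event only on a transition mirrors the open-then-close lifecycle: a node that stays silent does not open a second alert, and a node that keeps reporting emits nothing.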
Related
- Install Lighthouse - get the agent running.
- What the agent sends - the full list of what does and does not leave your network.
- Notifications - wire up Slack, Telegram, email or a webhook to receive metric alerts.
- Plans & limits - how Lighthouse count and retention map to plans.