Observability¶

github-runner provides structured logging, Prometheus metrics, and health endpoints for production monitoring.

Logging¶

Configuration¶

[global]
log_level = "info"     # debug, info, warn, error
log_format = "json"    # json, text

CLI flags override the config file:

github-runner start --log-level debug --log-format text
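
The precedence rule can be sketched in Go as follows. The `resolve` and `parseLevel` helpers and the hard-coded values are hypothetical and only illustrate the behaviour described above; they are not taken from the github-runner source. `slog.Level.UnmarshalText` is the standard library parser for level names.

```go
package main

import (
	"fmt"
	"log/slog"
)

// resolve returns the CLI flag value when one was given and falls back to
// the config file value otherwise. Hypothetical helper for illustration.
func resolve(flagValue, configValue string) string {
	if flagValue != "" {
		return flagValue
	}
	return configValue
}

// parseLevel maps a name such as "debug" or "warn" to a slog.Level.
func parseLevel(name string) (slog.Level, error) {
	var level slog.Level
	err := level.UnmarshalText([]byte(name))
	return level, err
}

func main() {
	// The config file says "info", the command line says "debug".
	name := resolve("debug", "info")
	level, err := parseLevel(name)
	if err != nil {
		fmt.Println("invalid log level:", err)
		return
	}
	fmt.Println("effective level:", level)
}
```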

Structured logging¶

All logging uses Go's log/slog package with consistent field names:

Field	Description
`job_id`	GitHub job identifier
`repository`	Repository in `owner/repo` format
`workflow`	Workflow name
`runner_name`	Runner pool name
`pool_name`	Pool identifier
`executor`	Executor type (shell, docker, etc.)
`step`	Step name or ID
`duration`	Operation duration
`error`	Error message
`component`	Subsystem name

Example JSON log output:

{
  "time": "2026-03-03T12:00:00Z",
  "level": "INFO",
  "msg": "job completed",
  "job_id": 12345,
  "repository": "nficano/github-runner",
  "workflow": "CI",
  "duration": "45.2s",
  "component": "worker"
}
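
A record of this shape can be produced with the standard `log/slog` JSON handler. The sketch below is not github-runner code: the `jobCompletedJSON` function is a hypothetical name, and it writes to a buffer only so the result can be inspected. Key order follows the order in which attributes are passed.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"time"
)

// jobCompletedJSON renders one "job completed" record with slog's JSON
// handler and returns the raw line so it can be printed or inspected.
func jobCompletedJSON() string {
	var buf bytes.Buffer
	handler := slog.NewJSONHandler(&buf, &slog.HandlerOptions{Level: slog.LevelInfo})
	logger := slog.New(handler)

	elapsed := 45200 * time.Millisecond
	logger.Info("job completed",
		slog.Int64("job_id", 12345),
		slog.String("repository", "nficano/github-runner"),
		slog.String("workflow", "CI"),
		// Formatting the duration as a string matches the sample above;
		// slog.Duration would be encoded as integer nanoseconds in JSON.
		slog.String("duration", elapsed.String()),
		slog.String("component", "worker"),
	)
	return buf.String()
}

func main() {
	fmt.Print(jobCompletedJSON())
}
```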

Component loggers¶

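Each subsystem creates a child logger tagged with its component name. Job-scoped code creates a child logger that carries the job fields:

// Child logger tagged with the subsystem name.
logger := log.WithComponent(parentLogger, "poller")
// Child logger carrying job fields; a distinct name avoids redeclaring logger.
jobLogger := log.WithJobContext(parentLogger, jobID, repo, workflow)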
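
One plausible way to build such helpers on top of `log/slog` is `Logger.With`, which returns a child logger that attaches fixed attributes to every record. The `withComponent` and `withJobContext` functions below are hypothetical stand-ins written for this example; the real `log.WithComponent` and `log.WithJobContext` may be implemented differently.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// withComponent returns a child logger that adds a "component" field.
// Hypothetical stand-in for log.WithComponent.
func withComponent(parent *slog.Logger, name string) *slog.Logger {
	return parent.With(slog.String("component", name))
}

// withJobContext returns a child logger that adds the job fields.
// Hypothetical stand-in for log.WithJobContext.
func withJobContext(parent *slog.Logger, jobID int64, repo, workflow string) *slog.Logger {
	return parent.With(
		slog.Int64("job_id", jobID),
		slog.String("repository", repo),
		slog.String("workflow", workflow),
	)
}

// demo logs through both child loggers and returns the JSON lines.
func demo() string {
	var buf bytes.Buffer
	root := slog.New(slog.NewJSONHandler(&buf, nil))

	poller := withComponent(root, "poller")
	poller.Info("polling for jobs")

	// Child loggers stack: this one carries component and job fields.
	job := withJobContext(poller, 12345, "nficano/github-runner", "CI")
	job.Info("job started")
	return buf.String()
}

func main() {
	fmt.Print(demo())
}
```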

Secret masking in logs¶

The MaskingHandler wraps the slog handler to redact secret values from log output. It masks:

Log message text
String attribute values
Attributes within nested groups

This is enabled automatically when secrets are registered with the masker via log.SetupWithMask().
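
The wrapping technique can be sketched as a small `slog.Handler` decorator. This is a simplified illustration under stated assumptions, not the project's MaskingHandler: the `maskingHandler` type, the `[MASKED]` replacement text, and the `render` helper are all invented for this example.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// maskingHandler redacts known secret values before delegating to next.
// Hypothetical, simplified take on the handler described above.
type maskingHandler struct {
	next    slog.Handler
	secrets []string
}

// mask replaces every occurrence of a registered secret in s.
func (h maskingHandler) mask(s string) string {
	for _, secret := range h.secrets {
		if secret != "" {
			s = strings.ReplaceAll(s, secret, "[MASKED]")
		}
	}
	return s
}

// maskAttr redacts string values and recurses into nested groups.
func (h maskingHandler) maskAttr(a slog.Attr) slog.Attr {
	v := a.Value.Resolve()
	switch v.Kind() {
	case slog.KindString:
		return slog.String(a.Key, h.mask(v.String()))
	case slog.KindGroup:
		group := v.Group()
		masked := make([]slog.Attr, 0, len(group))
		for _, ga := range group {
			masked = append(masked, h.maskAttr(ga))
		}
		return slog.Attr{Key: a.Key, Value: slog.GroupValue(masked...)}
	default:
		return slog.Attr{Key: a.Key, Value: v}
	}
}

func (h maskingHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.next.Enabled(ctx, l)
}

// Handle copies the record with a masked message and masked attributes.
func (h maskingHandler) Handle(ctx context.Context, r slog.Record) error {
	out := slog.NewRecord(r.Time, r.Level, h.mask(r.Message), r.PC)
	r.Attrs(func(a slog.Attr) bool {
		out.AddAttrs(h.maskAttr(a))
		return true
	})
	return h.next.Handle(ctx, out)
}

func (h maskingHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	masked := make([]slog.Attr, len(attrs))
	for i, a := range attrs {
		masked[i] = h.maskAttr(a)
	}
	return maskingHandler{next: h.next.WithAttrs(masked), secrets: h.secrets}
}

func (h maskingHandler) WithGroup(name string) slog.Handler {
	return maskingHandler{next: h.next.WithGroup(name), secrets: h.secrets}
}

// render logs one record containing the secret in the message, in a string
// attribute, and inside a group, then returns the handler output.
func render(secret string) string {
	var buf bytes.Buffer
	h := maskingHandler{next: slog.NewJSONHandler(&buf, nil), secrets: []string{secret}}
	slog.New(h).Info("token is "+secret,
		slog.String("header", "Bearer "+secret),
		slog.Group("env", slog.String("TOKEN", secret)),
	)
	return buf.String()
}

func main() {
	fmt.Print(render("s3cr3t"))
}
```

Non-string attribute values (numbers, booleans, times) pass through unchanged in this sketch, which mirrors the three masked locations listed above.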

Metrics¶

Endpoint¶

Prometheus metrics are served on a configurable address:

[global]
metrics_listen = "127.0.0.1:9252"

Scrape the /metrics path:

curl http://127.0.0.1:9252/metrics
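
The same scrape can be done programmatically. The sketch below fetches the endpoint with the standard library and sums one metric across its label sets; `sumMetric` is a hypothetical helper that handles only simple sample lines of the Prometheus text format, so production code would more likely use a full parser.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// sumMetric adds up every sample of the named metric in Prometheus text
// exposition output, ignoring comments, other metrics, and timestamps.
func sumMetric(text, name string) (float64, error) {
	var total float64
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// A sample line looks like: name{labels} value [timestamp]
		rest, ok := strings.CutPrefix(line, name)
		if !ok || rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue
		}
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("parse %q: %w", line, err)
		}
		total += v
	}
	return total, nil
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://127.0.0.1:9252/metrics")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	active, err := sumMetric(string(body), "github_runner_jobs_active")
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println("active jobs across pools:", active)
}
```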

Available metrics¶

Job metrics¶

Metric	Type	Labels	Description
`github_runner_jobs_total`	Counter	`pool`, `status`	Total jobs processed
`github_runner_job_duration_seconds`	Histogram	`pool`, `executor`	Job execution duration
`github_runner_jobs_active`	Gauge	`pool`	Currently running jobs
`github_runner_job_errors_total`	Counter	`pool`, `error_type`	Job execution errors

Step metrics¶

Metric	Type	Labels	Description
`github_runner_step_duration_seconds`	Histogram	`pool`, `step`	Individual step duration

Cache metrics¶

Metric	Type	Labels	Description
`github_runner_cache_hit_ratio`	Gauge	`backend`	Cache hit ratio (hits divided by total lookups)
`github_runner_cache_operation_duration`	Histogram	`backend`, `operation`	Cache operation latency

Infrastructure metrics¶

Metric	Type	Labels	Description
`github_runner_executor_prepare_duration`	Histogram	`executor`	Executor preparation time
`github_runner_poll_duration`	Histogram	`pool`	Job poll request duration
`github_runner_poll_errors_total`	Counter	`pool`	Poll request failures
`github_runner_heartbeat_errors_total`	Counter	`pool`	Heartbeat send failures

Runtime metrics¶

The runtime collector exports Go process metrics:

Metric	Type	Description
`github_runner_goroutines`	Gauge	Current goroutine count
`github_runner_threads`	Gauge	OS thread count
`github_runner_heap_alloc_bytes`	Gauge	Heap allocation in bytes
`github_runner_heap_inuse_bytes`	Gauge	Heap memory in use
`github_runner_gc_pause_seconds`	Summary	GC pause durations
`github_runner_open_fds`	Gauge	Open file descriptors
`github_runner_uptime_seconds`	Gauge	Process uptime
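
Several of these values are available directly from the Go `runtime` package. The sketch below shows where such numbers can come from; the `snapshot` type and `takeSnapshot` function are hypothetical, and the actual collector may gather them differently.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// snapshot holds a subset of the runtime values listed in the table above.
type snapshot struct {
	Goroutines     int
	HeapAllocBytes uint64
	HeapInuseBytes uint64
	UptimeSeconds  float64
}

// processStart approximates process start as package initialisation time.
var processStart = time.Now()

// takeSnapshot reads the current values from the Go runtime.
func takeSnapshot() snapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return snapshot{
		Goroutines:     runtime.NumGoroutine(),
		HeapAllocBytes: m.HeapAlloc,
		HeapInuseBytes: m.HeapInuse,
		UptimeSeconds:  time.Since(processStart).Seconds(),
	}
}

func main() {
	fmt.Printf("%+v\n", takeSnapshot())
}
```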

Custom registry¶

Metrics use a private Prometheus registry to avoid polluting the global default registry. This prevents conflicts when embedding github-runner as a library.

Prometheus scrape configuration¶

scrape_configs:
  - job_name: "github-runner"
    static_configs:
      - targets: ["127.0.0.1:9252"]
    scrape_interval: 15s

Grafana dashboard¶

Key panels to create:

Job throughput: sum by (pool, status) (rate(github_runner_jobs_total[5m]))
Active jobs: sum by (pool) (github_runner_jobs_active)
Job duration p99: histogram_quantile(0.99, sum by (le) (rate(github_runner_job_duration_seconds_bucket[5m])))
Error rate: rate(github_runner_job_errors_total[5m])
Poll latency p95: histogram_quantile(0.95, sum by (le) (rate(github_runner_poll_duration_bucket[5m])))
Cache hit rate: github_runner_cache_hit_ratio
Goroutines: github_runner_goroutines
Memory: github_runner_heap_alloc_bytes

Health endpoints¶

Configuration¶

[global]
health_listen = "127.0.0.1:8484"

Liveness: `/healthz`¶

Returns 200 OK if the process is running. Use this for Kubernetes liveness probes.

curl http://127.0.0.1:8484/healthz

{
  "status": "ok"
}

Readiness: `/readyz`¶

Returns 200 OK when the runner is ready to accept jobs. Returns 503 Service Unavailable during startup or shutdown drain.

curl http://127.0.0.1:8484/readyz

{
  "status": "ok",
  "checks": {
    "github_api": "ok",
    "disk_space": "ok",
    "executor": "ok"
  }
}
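
A monitoring script can decode this response to find out which check is failing. The sketch below assumes that a 503 response carries the same JSON shape, with a value other than "ok" for each failing check; the source does not confirm this, and the `readiness` type and `failing` helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"sort"
	"time"
)

// readiness mirrors the /readyz response body shown above.
type readiness struct {
	Status string            `json:"status"`
	Checks map[string]string `json:"checks"`
}

// failing returns the sorted names of checks whose value is not "ok".
func failing(body []byte) ([]string, error) {
	var r readiness
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	var names []string
	for name, state := range r.Checks {
		if state != "ok" {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://127.0.0.1:8484/readyz")
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	names, err := failing(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("HTTP status:", resp.StatusCode, "failing checks:", names)
}
```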

Registered checks¶

Check	Description
`github_api`	HEAD request to GitHub API base URL
`disk_space`	Minimum free disk space via statfs
`executor`	Executor backend operational (e.g., `docker info`)

Kubernetes probes¶

livenessProbe:
  httpGet:
    path: /healthz
    port: 8484
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /readyz
    port: 8484
  initialDelaySeconds: 10
  periodSeconds: 5

Readiness lifecycle¶

Startup — Health server starts, readiness is false.
Ready — After pools are initialised, SetReady(true) is called.
Shutdown — On receiving SIGTERM/SIGINT, SetReady(false) is called immediately to stop receiving traffic while in-flight jobs drain.
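
The three phases above can be sketched with an atomic flag behind the readiness handler. The `health` type and `probe` helper are hypothetical and exist only for this example; only the `SetReady` name comes from the text. In a real process, `SetReady(false)` would be called from the signal handler before draining.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// health serves a readiness endpoint backed by an atomic flag.
// The zero value is "not ready", which matches the startup phase.
type health struct {
	ready atomic.Bool
}

// SetReady flips the readiness flag; safe to call from any goroutine.
func (h *health) SetReady(v bool) { h.ready.Store(v) }

// ServeHTTP answers 200 when ready and 503 otherwise.
func (h *health) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if h.ready.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}

// probe issues an in-process GET /readyz and returns the status code.
func probe(h *health) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	h := &health{}
	fmt.Println("startup: ", probe(h)) // 503

	h.SetReady(true) // pools initialised
	fmt.Println("ready:   ", probe(h)) // 200

	h.SetReady(false) // SIGTERM or SIGINT received, drain begins
	fmt.Println("shutdown:", probe(h)) // 503
}
```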

pprof¶

The metrics server includes pprof endpoints for runtime profiling at /debug/pprof/. These are useful for diagnosing performance issues:

go tool pprof http://127.0.0.1:9252/debug/pprof/heap
go tool pprof http://127.0.0.1:9252/debug/pprof/goroutine

Alerting recommendations¶

Alert	Condition	Severity
High error rate	`rate(github_runner_job_errors_total[5m]) > 0.1`	Warning
No jobs processed	`increase(github_runner_jobs_total[30m]) == 0`	Info
Pool saturated	`github_runner_jobs_active == <concurrency>`	Warning
Poll failures	`rate(github_runner_poll_errors_total[5m]) > 0`	Warning
Heartbeat failures	`rate(github_runner_heartbeat_errors_total[5m]) > 0`	Critical
High memory	`github_runner_heap_alloc_bytes > 1e9`	Warning
Readiness down	`/readyz` returns 503	Critical
