Architecture¶

This document describes the internal architecture of github-runner, including its component structure, data flow, concurrency model, and shutdown behavior.

System overview¶

github-runner is a single-binary process that manages one or more runner pools. Each pool maps to a [[runners]] entry in the config file and operates independently with its own executor type, concurrency level, and GitHub API credentials.

flowchart TB
    CLI["CLI (cobra)"] --> Manager["Runner Manager (1 per process)"]

    subgraph Pools["Runner pools (one per [[runners]] entry)"]
      W1["Worker 1 (job)"]
      W2["Worker 2 (idle)"]
      WN["Worker N (job)"]
    end

    Manager --> Pools

    subgraph Layers["Execution + services"]
      Executor["Executor layer<br/>shell / docker / kubernetes / firecracker"]
      Cache["Cache layer<br/>local / s3 / gcs"]
      Artifact["Artifact manager"]
      GitHubAPI["GitHub API client"]
      Metrics["Metrics"]
      Health["Health"]
      Logger["Logger"]
    end

    Pools --> Executor
    Pools --> Cache
    Pools --> Artifact
    Manager --> GitHubAPI
    Manager --> Metrics
    Manager --> Health
    Manager --> Logger

Component diagram¶

graph TD
    CLI["CLI (cobra)"] --> Manager["Runner Manager"]
    Manager --> Pool1["Runner Pool: docker-fast"]
    Manager --> Pool2["Runner Pool: shell-local"]
    Manager --> MetricsSrv["Metrics Server :9252"]
    Manager --> HealthSrv["Health Server :8484"]
    Manager --> SignalHandler["Signal Handler"]

    Pool1 --> Poller1["Poller"]
    Pool1 --> Worker1A["Worker 1"]
    Pool1 --> Worker1B["Worker 2"]
    Pool1 --> Worker1N["Worker N"]

    Poller1 --> GitHubAPI["GitHub API Client"]
    Worker1A --> Executor["Executor (docker)"]
    Worker1A --> Heartbeat["Heartbeat Reporter"]
    Worker1A --> SecretMasker["Secret Masker"]
    Worker1A --> Hooks["Hook Chain"]

    Executor --> Docker["Docker Engine API"]

    Pool2 --> Poller2["Poller"]
    Pool2 --> Worker2A["Worker 1"]
    Worker2A --> ShellExec["Executor (shell)"]

    GitHubAPI --> RateLimit["Rate Limiter"]
    GitHubAPI --> Retry["Retry + Backoff"]

    Manager --> ConfigWatcher["Config Watcher (fsnotify)"]

Components¶

Runner Manager (`internal/runner/manager.go`)¶

The top-level orchestrator. One instance per process. Responsibilities:

Creates and supervises runner pools based on config
Starts metrics and health HTTP servers
Handles OS signals (SIGTERM/SIGINT for shutdown, SIGHUP for reload)
Coordinates graceful shutdown with timeout enforcement

Runner Pool (`internal/runner/pool.go`)¶

One pool per [[runners]] config entry. Each pool:

Runs a Poller goroutine that queries the GitHub API for available jobs
Maintains a buffered channel of jobs (buffer size = concurrency)
Spawns Worker goroutines that pull from the channel
Tracks active job count via atomic.Int64

Worker (`internal/runner/worker.go`)¶

A worker executes a single job through its lifecycle:

Registers secrets for log masking
Starts heartbeat reporter in a background goroutine
Transitions through the lifecycle state machine
Runs pre-job hooks
Calls Executor.Prepare() to set up the environment
Executes steps sequentially via Executor.Run()
Runs post-job hooks
Reports final status to GitHub
Calls Executor.Cleanup() (always, even on failure)

Poller (`internal/runner/poller.go`)¶

Polls the GitHub API at a configured interval. Features:

Exponential backoff on consecutive errors (up to 5 minutes)
Automatic interval reset after successful poll
Blocks on the job channel when all workers are busy

Lifecycle (`internal/runner/lifecycle.go`)¶

A state machine that enforces valid job state transitions:

stateDiagram-v2
    [*] --> Queued
    Queued --> Claimed
    Claimed --> Preparing
    Preparing --> Running
    Running --> PostExec
    PostExec --> Completed
    PostExec --> Failed
    Completed --> Cleanup
    Failed --> Cleanup
    Cleanup --> [*]

    Queued --> Cancelled
    Claimed --> Cancelled
    Claimed --> Failed
    Preparing --> Failed
    Preparing --> Cancelled
    Running --> Failed
    Running --> Cancelled
    PostExec --> Cancelled
    Cancelled --> Cleanup

Every transition is validated, logged, and reported to GitHub.

Concurrency model¶

Goroutine topology¶

main goroutine
└── Manager.Start()
    ├── signal handler         (1 goroutine)
    ├── metrics server         (1 goroutine)
    ├── health server          (1 goroutine)
    │
    ├── Pool "docker-fast"     (1 goroutine)
    │   ├── Poller             (1 goroutine)
    │   ├── Worker 0           (1 goroutine per active job)
    │   │   ├── heartbeat      (1 goroutine)
    │   │   └── executor I/O   (managed by executor)
    │   ├── Worker 1
    │   └── ...Worker N
    │
    └── Pool "shell-local"     (1 goroutine)
        ├── Poller             (1 goroutine)
        └── Worker 0..M

Synchronization strategy¶

Resource	Mechanism	Notes
Job dispatch channel	Buffered channel	Size = pool concurrency
Active job count	`atomic.Int64`	Lock-free reads for metrics
Config state	`sync.RWMutex`	Writer: config watcher. Readers: pools.
Lifecycle state	`sync.RWMutex`	Per-job, no sharing between workers
Shutdown coordination	`context.Context` + `sync.WaitGroup`	Cancel propagates to all goroutines
Secret masker patterns	`sync.RWMutex`	Writers: `AddSecret`. Readers: `Write`/`MaskString`.
Cache index (local)	`sync.RWMutex` + file lock	Mutex for in-process safety, flock for multi-process
Metrics counters	Prometheus client internals	Atomic internally, no additional sync
Rate limit state	`sync.RWMutex`	Updated from response headers

Design rules¶

Workers never share mutable state. Each worker gets its own masker, executor instance, and lifecycle tracker.
All inter-goroutine communication uses channels or context cancellation.
Every goroutine respects ctx.Done() for clean shutdown.
defer is used for all cleanup to ensure resources are released on every code path.

Shutdown sequence¶

1. SIGTERM or SIGINT received
2. Root context cancelled → propagates to all pools and workers
3. Pollers stop accepting new jobs immediately
4. Health server reports not-ready (/readyz returns 503)
5. In-flight workers:
   a. Current step completes (bounded by shutdown_timeout)
   b. Executor.Cleanup() called
   c. Job status reported to GitHub as cancelled
6. WaitGroup.Wait() blocks until all workers finish
7. Metrics and health servers shut down
8. Process exits with code 0

If shutdown_timeout expires:
  - Warning logged with list of still-running jobs
  - Process exits with code 1

Data flow: job lifecycle¶

sequenceDiagram
    participant GH as GitHub API
    participant R as Runner
    participant E as Executor

    R->>GH: Poll jobs (GET /jobs)
    GH-->>R: Job payload
    R->>E: Prepare()
    R->>GH: Status in_progress

    loop For each workflow step
      R->>E: Run(step)
      E-->>R: StepResult
      R->>GH: Step status + heartbeat
    end

    R->>GH: Status completed
    R->>E: Cleanup()

Package dependency graph¶

cmd/github-runner
  └── internal/cli
        ├── internal/config
        ├── internal/runner
        │     ├── internal/executor
        │     ├── internal/github
        │     ├── internal/hook
        │     └── internal/secret
        ├── internal/metrics
        ├── internal/health
        ├── internal/log
        └── internal/version

internal/executor
  ├── internal/executor/shell
  ├── internal/executor/docker
  ├── internal/executor/kubernetes
  └── internal/executor/firecracker

internal/cache (standalone)
internal/artifact (standalone)
internal/job (depends on executor, api)
pkg/api (no internal dependencies)

No circular dependencies exist. pkg/api is the leaf package that all others import. Internal packages import downward only.