
Homelab

Host-level infrastructure and model lifecycle contract

Status: Active
Primary Stack: Python · Bash · systemd
Depends On: Forge

Every project I ship has to run on hardware I own. Homelab is the contract that makes the manifest, the ports, and the on-disk files agree with each other before any commit lands.

Homelab schematic — three rack chassis stacked and linked by a single amber Thunderbolt bus

After a year of bolting machines together and installing models by hand, the failure mode became clear: two Linux boxes, dozens of GGUF files per box, systemd units that outlived their models, and ports that got reused for different roles from one week to the next. I’d pull up /v1/models on Forge and find entries pointing at files that weren’t there anymore. Homelab is the ruleset that makes that impossible now: a manifest that names every model on every host, a port registry that enforces one-service-per-port-per-host, and a lint that fails a commit when the rules aren’t kept.
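The one-service-per-port-per-host rule can be sketched in a few lines of Python. This is only an illustration: the registry format is not shown on this page, so the `host`, `port`, and `service` keys, the example rows, and the `find_port_conflicts` name are all assumptions.

```python
from collections import defaultdict


def find_port_conflicts(entries):
    """Return {(host, port): [services]} for every port claimed more than once on a host."""
    claims = defaultdict(list)
    for entry in entries:
        claims[(entry["host"], entry["port"])].append(entry["service"])
    # A (host, port) pair with a single claimant is fine; two or more is a conflict.
    return {key: names for key, names in claims.items() if len(names) > 1}


# Hypothetical registry rows (field names and values are made up for illustration).
registry = [
    {"host": "furnace", "port": 8080, "service": "chat"},
    {"host": "furnace", "port": 8081, "service": "embed"},
    {"host": "crucible", "port": 8080, "service": "chat"},   # same port, different host: allowed
    {"host": "furnace", "port": 8080, "service": "vision"},  # same port, same host: conflict
]

conflicts = find_port_conflicts(registry)
for (host, port), services in conflicts.items():
    print(f"{host}:{port} claimed by {', '.join(services)}")
```

Keying on the `(host, port)` pair rather than the port alone is what lets the same port number serve the same role on both inference boxes without tripping the check.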

Five hosts, each named after blacksmithing gear. Furnace and Crucible are GMKTec EVO-X2 boxes (AMD Ryzen AI Max+ 395, Strix Halo; 128 GB RAM / 96 GB VRAM on Furnace, 64 GB / 32 GB on Crucible) doing the actual inference. Anvil is an M4 Mac Mini for remote dev and ARIA’s Relay app. Ember is an iDowell UPS, monitored over NUT, that shuts the rack down automatically on low battery. Bellows is a PiKVM for out-of-band console access when one of the Linux boxes wedges. Everything connects over Tailscale.

The source of truth is models-manifest.yaml (schema v4), which lists every model on every host with its lifecycle state (active, on_demand, deprecated, retired, removed), role, port, and file paths. scripts/models-lint.py enforces eighteen contract rules across the manifest, the port registry, Forge’s config, the on-disk files, the running systemd units, and the llama-swap roundtrip. The lint has to report zero errors before any commit lands.
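To make the manifest-versus-disk idea concrete, here is a minimal sketch of one such check, not the actual models-lint.py. The five lifecycle states come from the text above; the entry keys (`name`, `state`, `files`), the `lint_entry` helper, and the choice that every state except `removed` should still have its files on disk are assumptions for illustration.

```python
from pathlib import Path

# The five lifecycle states named in the manifest description.
LIFECYCLE_STATES = {"active", "on_demand", "deprecated", "retired", "removed"}


def lint_entry(entry, root=Path("/")):
    """Return a list of error strings for one manifest entry (empty list means clean)."""
    name = entry.get("name", "<unnamed>")
    state = entry.get("state")
    if state not in LIFECYCLE_STATES:
        return [f"{name}: unknown lifecycle state {state!r}"]
    if state == "removed":
        # Assumption: removed entries are allowed to have no files on disk.
        return []
    # Assumption: every other state still expects its files to be present.
    return [
        f"{name}: missing file {rel}"
        for rel in entry.get("files", [])
        if not (Path(root) / rel).is_file()
    ]


def lint_manifest(entries, root=Path("/")):
    """Collect errors across all entries; a commit gate would require an empty result."""
    errors = []
    for entry in entries:
        errors.extend(lint_entry(entry, root))
    return errors
```

A pre-commit hook built on this shape would exit non-zero whenever `lint_manifest` returns anything, which is the "zero errors before any commit" contract in miniature.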

The most recent proud moment was watching lifecycle hygiene turn into real reclaimed disk. Rule 17 (root-LV disk-headroom) had been warning for a week because the root logical volume had crossed 79%, and rule 18 (disk-to-manifest drift) had flagged eighteen orphan files that no manifest entry claimed. The lifecycle audit found 106 entries already in state: retired, sitting out their 30-day remove_after windows. Fast-tracking eighteen of them through model-remove.py --force reclaimed 207 GB in a single afternoon and flipped rule 17 back to clean. Those eighteen were the LLM base weights, which are cheap to re-fetch from Hugging Face, not the Civitai LoRAs, which aren’t. The next pass closes a tooling gap in model-remove.py for the v4 generative schema’s quantizations[] shape: five ComfyUI embeddings had to be removed by a bespoke Python block because the tool only reads files:. After that comes onboarding the Crucible-side deprecated alias pair on its normal 2026-04-29 sweep.
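Two ideas in that story, the remove_after window and disk-to-manifest drift, reduce to small set and date operations. The sketch below assumes a simplified entry shape (`name`, `state`, `remove_after` as an ISO date string, `files` as a flat list); the helper names and sample data are hypothetical and are not taken from model-remove.py or models-lint.py.

```python
from datetime import date


def due_for_removal(entries, today):
    """Names of retired entries whose remove_after date has arrived or passed."""
    return [
        e["name"]
        for e in entries
        if e["state"] == "retired" and date.fromisoformat(e["remove_after"]) <= today
    ]


def orphan_files(on_disk, entries):
    """Files present on disk that no manifest entry claims, sorted for stable output."""
    claimed = {path for e in entries for path in e.get("files", [])}
    return sorted(set(on_disk) - claimed)


# Hypothetical sample data for illustration only.
entries = [
    {"name": "old-base", "state": "retired", "remove_after": "2026-04-01", "files": ["old-base.gguf"]},
    {"name": "newer-base", "state": "retired", "remove_after": "2026-05-15", "files": ["newer-base.gguf"]},
    {"name": "chat", "state": "active", "files": ["chat.gguf"]},
]
on_disk = ["chat.gguf", "old-base.gguf", "newer-base.gguf", "stray.gguf"]

due = due_for_removal(entries, today=date(2026, 4, 20))
orphans = orphan_files(on_disk, entries)
print("due for removal:", due)
print("orphans:", orphans)
```

Passing `today` in explicitly, rather than calling `date.today()` inside the helper, keeps the sweep reproducible and makes a fast-track run easy to simulate by supplying a later date.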

Python · Bash · systemd · PostgreSQL · Prometheus · Grafana · ROCm · Tailscale