gigafibre-fsm/docs/features/cpe-management.md
louispaulb 0f8d2b0565 docs: bring all docs in sync with the May 2026 reality
Mass refresh — the docs were last touched 2026-04-22, two weeks behind
shipped reality. This commit updates 9 files to reflect current truth.

WHAT CHANGED IN THE PRODUCT (since 22 Apr) THAT THE DOCS NOW REFLECT:

  • Oktopus CE / TR-369 stack decommissioned (containers + volumes +
    images all removed; broker had filled /dev/sdb with 75 GB of debug
    logs and took ERPNext down for 4 days). Hub gates the integration
    behind OKTOPUS_DISABLED=1 — modules retained, no-op'd at runtime.
  • dispatch.gigafibre.ca (legacy PHP SPA) replaced by an nginx 301
    redirect to /ops/#/dispatch.
  • Top toolbar of the dispatch module: collapsed to single-color
    Lucide icons + ⋯ overflow menu + "Vue principale ▾" + "[👥 N ▾]"
    resource type chip (defaults to techs, materials in the dropdown
    only when relevant).
  • Tech home base / departure point: editable per-tech via 📍 button,
    address geocode (Nominatim) or click-on-map picker, right-click
    on tech pin opens the same actions. Map defaults centered on
    Gigafibre HQ (1867 chemin de la Rivière, Sainte-Clotilde) instead
    of downtown Montreal.
  • POST /auth/users invite flow on the hub: creates the Authentik
    user, sets a temp password, mails it via Mailjet (Authentik's
    own recovery flow isn't configured), creates the matching ERPNext
    System User. Surfaced in ops Settings → Utilisateurs → Inviter.
  • Two Authentik instances clarified as parallel-and-permanent (not
    a migration): auth.targo.ca for staff, id.gigafibre.ca for clients.

FILES TOUCHED:

  README.md — service table refreshed, arch diagram redrawn (no
    Oktopus row), auth section explains the invite flow + two
    parallel instances.
  docs/architecture/overview.md — new "Decommissioned" section,
    correct retirement status for dispatch-app + apps/field, two
    Authentik instances explicitly distinguished, dev-gotchas list
    rewritten (drops MongoDB AVX, adds log-rotation hard-learned
    lesson, adds note about Authentik recovery flow).
  docs/architecture/data-model.md — Step 5 hardware provisioning
    now describes the GenieACS path (TR-069 Inform → preset push)
    instead of the dead TR-369 path.
  docs/architecture/module-interactions.md — oktopus.js and
    oktopus-mqtt.js entries marked as gated, provision.js note
    updated, GenieACS row in external-integrations updated, MQTT
    row removed from real-time channels, interaction matrix loses
    the Oktopus column and gains an Authentik admin REST cell.
  docs/features/dispatch.md — Top bar section completely rewritten
    to match the current chrome (left/center/right regions,
    single-color Lucide, dropdowns); new Tech home base section
    documenting the 📍 + map-pick + right-click flows; retirement
    note now reads as a status, not a plan.
  docs/features/cpe-management.md — full rewrite. Oktopus migration
    plan replaced by a "decommissioned" note + the existing GenieACS
    + modem-bridge architecture as the steady state. TP-Link XX230v
    deep-dive sections preserved (still accurate).
  docs/README.md, docs/features/README.md, docs/roadmap.md —
    intent-table descriptions and live-URLs table corrected.

The docs/archive/ snapshots (2026-04-18, 2026-04-19) are untouched —
they're historical and should remain that way.
2026-05-05 20:10:40 -04:00


Gigafibre FSM — CPE Hardware Management

Managing the customer-premises equipment fleet (ONTs, routers, mesh nodes). Covers TR-069 (GenieACS), the modem-bridge for deep TP-Link diagnostics, and the diagnostic-swap workflow.


1. Protocol & Tooling Stack

The current ACS is GenieACS, which speaks TR-069 (CWMP): SOAP over HTTP, with sessions initiated by the CPE. It runs outside the prod box, on a separate VM managed by the network team. The hub talks to it through the GenieACS NBI (Northbound Interface) on the internal network.

About TR-369 / USP: we ran a parallel Oktopus CE deployment for USP (TR-369 over MQTT/WebSocket) but decommissioned it in May 2026 after its broker filled the disk with debug logs. The integration modules in the hub (lib/oktopus.js, lib/oktopus-mqtt.js) remain in the tree, gated behind OKTOPUS_DISABLED=1 and no-op'd at runtime, so we can re-enable them later if we settle on a different USP controller. For now, all CPE management goes through TR-069 + GenieACS.
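
As a rough illustration of the gating described above, the sketch below shows one way a module could keep its public surface while becoming a no-op when OKTOPUS_DISABLED=1. The names `isOktopusDisabled`, `makeOktopusClient`, and `sendUspMessage` are hypothetical and are not taken from lib/oktopus.js.

```javascript
// Hypothetical sketch of an env-var gate; not the actual lib/oktopus.js code.
function isOktopusDisabled(env = process.env) {
  return env.OKTOPUS_DISABLED === "1";
}

function makeOktopusClient(env = process.env) {
  if (isOktopusDisabled(env)) {
    // Disabled: same method names, but nothing touches the network.
    return {
      enabled: false,
      async sendUspMessage() {
        return { skipped: true, reason: "OKTOPUS_DISABLED=1" };
      },
    };
  }
  return {
    enabled: true,
    async sendUspMessage(deviceId, message) {
      // A real MQTT/WebSocket transport would go here (out of scope for this sketch).
      throw new Error("USP transport not implemented in this sketch");
    },
  };
}
```

Callers can keep calling the client unconditionally; with the flag set, every call resolves to a `skipped` result instead of opening a broker connection.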

Polling cadence — GenieACS receives an Inform from each ONT every ~5 minutes (configurable in the device firmware). For real-time deep dives that can't wait for the next Inform window, the hub falls back to:

  • modem-bridge (services/modem-bridge/) — Playwright-driven HTTP client that scrapes encrypted TR-181 parameters from the modem's native admin UI. Specifically targets the TP-Link XX230v / Deco mesh.
  • OLT SNMP — direct SNMPv2 walk against the OLT for ONT optical status, when even the modem itself is unreachable.
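
The fallback order above can be summarised as a small decision function. This is a minimal sketch under assumptions: the input flags (`needsRealtime`, `modemReachable`, `isTpLink`) and the returned labels are illustrative, not the hub's actual logic.

```javascript
// Hypothetical sketch: choose a data source following the fallback order in the text.
function pickDiagnosticSource({ needsRealtime, modemReachable, isTpLink }) {
  // The modem cannot be reached at all: only the OLT can report ONT optical status.
  if (!modemReachable) return "olt-snmp";
  // Real-time deep dive on a TP-Link unit: scrape the admin UI via modem-bridge.
  if (needsRealtime && isTpLink) return "modem-bridge";
  // Otherwise the latest GenieACS Inform data (refreshed about every 5 minutes) is enough.
  return "genieacs";
}
```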


2. TP-Link XX230v Deep-Dive Diagnostics

The XX230v exposes a rich TR-181 data model. When customers report "WiFi issues", CSRs and techs should not blindly swap the hardware. Poll the parameters below first to find the actual root cause.

A. Optical Signal (Is it the Fibre?)

Device.Optical.Interface.1.Stats.SignalRxPower   → target: -8 to -25 dBm
Device.Optical.Interface.1.Stats.ErrorsSent

Diagnosis: an RxPower below -25 dBm points to a dirty connector or a fibre break, not an ONT hardware fault. Don't swap; dispatch a fibre tech.
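
The rule above translates into a simple range check. The sketch below assumes the value has already been converted to dBm (some devices report it in another unit); the "too-high" branch and the action labels are assumptions added for completeness, since the text only covers the low side.

```javascript
// Target window from the text: -8 to -25 dBm.
const RX_MAX_DBM = -8;
const RX_MIN_DBM = -25;

// Hypothetical helper: classify a SignalRxPower reading (in dBm).
function classifyRxPower(rxDbm) {
  if (typeof rxDbm !== "number" || Number.isNaN(rxDbm)) {
    return { status: "unknown", action: "re-poll" };
  }
  if (rxDbm < RX_MIN_DBM) {
    // Dirty connector or fibre break: not an ONT fault, so no swap.
    return { status: "too-low", action: "dispatch-fibre-tech" };
  }
  if (rxDbm > RX_MAX_DBM) {
    // Assumption: readings above the window are flagged for manual review.
    return { status: "too-high", action: "review" };
  }
  return { status: "ok", action: "none" };
}
```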

B. WiFi Radio & Topology (Is it Interference?)

Device.WiFi.Radio.1.Stats.Noise                          → 2.4 GHz interference
Device.WiFi.Radio.2.Stats.Noise                          → 5 GHz interference
Device.WiFi.MultiAP.APDevice.{i}.Radio.{j}.Utilization   → mesh backhaul load

Diagnosis: high noise or error counts on a band indicate environmental channel congestion (neighbours, microwave, baby monitor). High backhaul utilization on a satellite Deco means the customer needs to move it closer to the main unit.
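
A minimal sketch of how these readings could be turned into findings follows. The text gives no numeric thresholds, so the defaults in `limits` (-80 dBm noise, 70 % backhaul utilization) are placeholder assumptions to be tuned, and the field names are hypothetical.

```javascript
// Hypothetical helper: map radio noise and mesh backhaul load to findings.
// The default limits are placeholders, not values from the source document.
function diagnoseWifi(
  { noise24Dbm, noise5Dbm, backhaulUtilPct },
  limits = { noiseDbm: -80, backhaulPct: 70 }
) {
  const findings = [];
  // A noise reading above the limit (closer to 0 dBm) means a noisier channel.
  if (noise24Dbm > limits.noiseDbm) findings.push("2.4 GHz congestion");
  if (noise5Dbm > limits.noiseDbm) findings.push("5 GHz congestion");
  if (backhaulUtilPct > limits.backhaulPct) findings.push("move satellite Deco closer");
  return findings;
}
```

An empty array means none of the three readings crossed its limit.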

C. Live Speed Test (Is it the Client Device?)

Device.IP.Diagnostics.DownloadDiagnostics.DiagnosticsState = "Requested"

Diagnosis: setting this parameter starts a speed test between the test server and the ONT itself, which removes the WiFi link from the measurement. If the ONT speed test is fast but the customer's iPhone is slow, the bottleneck is the iPhone or the WiFi link, not the line.
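
One way to trigger this test through GenieACS is a `setParameterValues` task posted to the NBI. The sketch below only builds the task body and URL; the helper names, the NBI base URL, and the download URL are assumptions, and the sketch assumes the test file URL is set through the standard TR-181 `DownloadURL` parameter before the state is switched to "Requested".

```javascript
// Hypothetical helper: GenieACS task body that starts a TR-181 download diagnostic.
function buildDownloadDiagnosticsTask(downloadUrl) {
  const base = "Device.IP.Diagnostics.DownloadDiagnostics";
  return {
    name: "setParameterValues",
    parameterValues: [
      // Each entry is [parameter path, value, type].
      [`${base}.DownloadURL`, downloadUrl, "xsd:string"],
      [`${base}.DiagnosticsState`, "Requested", "xsd:string"],
    ],
  };
}

// NBI URL for posting a task; connection_request asks the device to start a session now.
function taskUrl(nbiBase, deviceId) {
  return `${nbiBase}/devices/${encodeURIComponent(deviceId)}/tasks?connection_request`;
}
```

The body would be sent as JSON with an HTTP POST to `taskUrl(...)`; the result is read back from the same `DownloadDiagnostics` object once the device reports the test as complete.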


3. The "Diagnostic Swap" Workflow

Techs often swap equipment simply because they aren't sure what is defective. Working units then get recorded as defective, and inventory status stops matching reality.

We use a 3-way diagnostic status instead of a binary Défectueux:

  1. Remplacement définitif — the equipment is dead. (Old → Défectueux, New → Actif)
  2. Swap diagnostic — swapping to test if the problem resolves. (Old → En diagnostic, New → Actif (temporary))
  3. Retour de diagnostic — the old unit was actually fine. (Old → Actif (returned to use), Test unit → Retourné)

If a tech chooses Swap diagnostic, an ERPNext Task is generated automatically to schedule a follow-up test of the removed hardware within 7 days. If the unit tests fine at the warehouse, it goes back to En inventaire instead of being discarded.
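
The three outcomes and the 7-day follow-up can be expressed as a lookup table. This is a sketch only: the object shape and the `applySwap` helper are hypothetical, and the ERPNext Task creation itself is reduced to a computed due date.

```javascript
// Status transitions from the list above; followUp marks the outcome that needs a Task.
const SWAP_OUTCOMES = {
  "Remplacement définitif": { oldUnit: "Défectueux", newUnit: "Actif", followUp: false },
  "Swap diagnostic": { oldUnit: "En diagnostic", newUnit: "Actif", followUp: true },
  "Retour de diagnostic": { oldUnit: "Actif", newUnit: "Retourné", followUp: false },
};

const FOLLOW_UP_DAYS = 7;

// Hypothetical helper: resolve new statuses and, if needed, the follow-up deadline.
function applySwap(outcome, swapDate) {
  const rule = SWAP_OUTCOMES[outcome];
  if (!rule) throw new Error(`Unknown swap outcome: ${outcome}`);
  const result = { oldUnit: rule.oldUnit, newUnit: rule.newUnit, followUpDue: null };
  if (rule.followUp) {
    const due = new Date(swapDate.getTime());
    due.setUTCDate(due.getUTCDate() + FOLLOW_UP_DAYS);
    result.followUpDue = due.toISOString().slice(0, 10); // YYYY-MM-DD
  }
  return result;
}
```

For example, a Swap diagnostic recorded on 2026-05-05 yields a follow-up deadline of 2026-05-12.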


4. Hub Endpoints (/devices/*)

The ops "Diagnostiquer" button on a customer or equipment row hits the hub, which orchestrates GenieACS / OLT / modem-bridge in the right order:

  1. /devices/lookup?serial=X — finds the ONT by serial in GenieACS.
  2. The hub returns the latest Inform snapshot (interface, mesh, wifi, opticalStatus). If the snapshot is older than 60 s, the hub also queues an immediate refresh task in GenieACS.
  3. For TP-Link models specifically, deeper params (encrypted TR-181) are fetched via modem-bridge if needed.
  4. Final response is a consolidated JSON consumed by apps/ops/src/components/equipment/EquipmentDiagnostic.vue.
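
Steps 1 and 2 can be sketched as two small helpers: one builds the GenieACS NBI query that finds a device by serial, the other applies the 60-second staleness rule. The helper names and the NBI base URL are assumptions; the query format and the `_deviceId._SerialNumber` field follow the GenieACS NBI.

```javascript
// Staleness limit from step 2.
const MAX_SNAPSHOT_AGE_MS = 60 * 1000;

// Hypothetical helper: NBI URL that looks up a device by serial number.
function lookupUrl(nbiBase, serial) {
  const query = JSON.stringify({ "_deviceId._SerialNumber": serial });
  return `${nbiBase}/devices/?query=${encodeURIComponent(query)}`;
}

// Hypothetical helper: true when the last Inform is missing, unparsable, or older than 60 s.
function needsRefresh(lastInformIso, now = new Date()) {
  const last = Date.parse(lastInformIso);
  if (Number.isNaN(last)) return true;
  return now.getTime() - last > MAX_SNAPSHOT_AGE_MS;
}
```

When `needsRefresh` returns true, the hub would post a refresh task for that device before, or alongside, returning the cached snapshot.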

See module-interactions.md for the full call graph.


5. Cross-references