gigafibre-fsm/docs/architecture/overview.md
louispaulb c31a9e029e docs: recommend frappe_pg community app for ERPNext PostgreSQL compat
ERPNext was built for MariaDB; we run it on PostgreSQL because that's
what fit the legacy migration. Frappe's SQL generator is loose on
MariaDB (missing GROUP BY columns OK, double-quoted strings OK,
HAVING without GROUP BY OK) but strict on Postgres, so we end up
hand-patching files in `patches/fix_pg_groupby.py` after every
ERPNext upgrade. The community has packaged a comprehensive fix as
a Frappe app — `frappe_pg` — that covers the same bugs in one
place. The cleaner path long-term is to install that app instead
of growing our own patch set.

Two doc updates:

  - docs/architecture/overview.md §6 item 8 — full background:
    the 3 SQL patterns that break (GROUP BY, HAVING, double-quoted
    string literals), the 12 hotspots we've already patched, the
    4 known remaining (bank_clearance, bank_reconciliation_tool,
    accounts/utils L1660, gross_profit), and the install
    recommendation with trade-offs (pin a commit, validate on
    staging, keep our patches as backup for 4-6 weeks).

  - docs/SETUP.md §7 — quick-start install commands for whoever
    decides to flip the switch, plus the warning about pinning
    rather than tracking main. Also notes that custom Server
    Scripts with raw SQL (like `customer_balance`) need the same
    single-quote vs double-quote vigilance even after installing
    frappe_pg, and the export-fixtures hint to version-control
    them.
2026-05-21 14:40:36 -04:00

Gigafibre FSM — Ecosystem Architecture

Unified reference document for infrastructure, platform strategy, and application architecture on the remote Docker environment.

1. Executive Summary & Platform Strategy

Gigafibre FSM is the operations platform for Gigafibre. It replaces a legacy PHP/MariaDB stack with a real-time push ecosystem (Vue 3, Node.js, ERPNext) running on a single Proxmox VM at 96.125.196.67.

Core pillars:

  • ERPNext v16 — undisputed Source of Truth (CRM, billing, ticketing).
  • Ops SPA at erp.gigafibre.ca/ops/ — single pane of glass for internal teams (dispatch, clients, settings, agent flows).
  • targo-hub at msg.gigafibre.ca — real-time API gateway (SMS, SSE, AI, OAuth admin, Stripe webhooks, Traccar proxy).
  • Client portal at client.gigafibre.ca — customer self-service.

Decommissioned (May 2026):

  • Oktopus CE (TR-369 stack at oss.gigafibre.ca) — broker spammed 75 GB of debug logs over 13 days, took ERPNext down for 4. Stack removed (containers + volumes + images). The hub gates the integration behind OKTOPUS_DISABLED=1 so the modules can be re-enabled later if we deploy a different USP controller.
  • dispatch-app (legacy PHP SPA at dispatch.gigafibre.ca) — now 301-redirects to /ops/#/dispatch. nginx config at /opt/dispatch-app/nginx.conf on the prod box.
  • apps/field — replaced by the lightweight mobile tech page at /t/{token} (server-rendered by services/targo-hub/lib/tech-mobile.js).

Two Authentik instances, in parallel — not a migration:

  • auth.targo.ca (staff) — protects /ops/, n8n, Gitea; OAuth provider for ERPNext sign-in.
  • id.gigafibre.ca (clients) — protects the customer portal.

2. Infrastructure & Docker Networks

All services are containerized and housed on a single Proxmox VM (96.125.196.67), managed via Traefik.

Internet
  │
96.125.196.67 (Proxmox VM, Ubuntu 24.04)
  │
  ├─ Traefik v2.11 (:80/:443, Let's Encrypt, ForwardAuth)
  │
  ├─ Authentik (auth.targo.ca)        → SSO for staff (ops, n8n, Gitea, ERPNext OAuth)
  ├─ Authentik (id.gigafibre.ca)      → SSO for client portal
  │
  ├─ ERPNext v16.10.1 (erp.gigafibre.ca) → 9 containers (db, redis, backend, queues, scheduler, websocket, n8n, n8n-proxy)
  │
  ├─ Ops SPA (erp.gigafibre.ca/ops/)  → Served via nginx:alpine from /opt/ops-app/
  ├─ Dispatch redirect (dispatch.gigafibre.ca) → 301 → /ops/#/dispatch (former dispatch-app, decommissioned)
  │
  ├─ targo-hub (msg.gigafibre.ca)     → Node 20, /opt/targo-hub/
  ├─ DocuSeal (sign.gigafibre.ca)     → Contract e-signature
  ├─ traccar-proxy                    → nginx relay for Traccar UI
  │
  └─ Marketing site (www.gigafibre.ca) → React/Vite/Tailwind

DNS Configuration (Cloudflare):

  • Domain gigafibre.ca is strictly DNS-only (no Cloudflare proxy) to allow Traefik Let's Encrypt generation.
  • Email via Mailjet + Google Workspace records configured on root.

Docker Networks:

  • proxy: Public-facing network connected to Traefik.
  • erpnext_erpnext: Internal network for Frappe, Postgres, Redis, and targo-hub routing.

3. Core Services

ERPNext (The Backend)

  • Database: PostgreSQL (erpnext-db-1).
  • Extensions: Custom doctypes for Dispatch Job, Technician, Tag, Service Location, Service Equipment, Subscription.
  • API Token Auth: targo-hub and the Ops PWA interact with Frappe via a highly-privileged service token (Authorization: token ...).

Targo-Hub (API Gateway)

  • Stack: Node.js 20 (msg.gigafibre.ca:3300).
  • Purpose: Acts as the middleman for all heavy or real-time workflows out of ERPNext's scope.
  • Key Abilities:
    • Real-time Server-Sent Events (SSE) for timeline/chat updates.
    • Twilio SMS / Voice (IVR) routing.
    • Modem polling (GenieACS, OLT SNMP proxy).
    • Webhooks handling (Stripe payments, Uptime-Kuma, 3CX).

Modem-Bridge

  • Stack: Playwright/Chromium (:3301 internal).
  • Purpose: Allows reading encrypted TR-181 parameters from TP-Link XX230v modems by leveraging the modem's native JS cryptography. Exposes a simple JSON REST API locally to targo-hub.

Vision / OCR (Gemini via targo-hub)

  • Model: Gemini 2.5 Flash (Google) — no local GPU, all inference remote.
  • Endpoints (hub): /vision/barcodes, /vision/equipment, /vision/invoice.
  • Why centralized: ops VM has no GPU, so the legacy Ollama llama3.2-vision install was retired. All three frontends (ops, field-as-ops /j, future client portal) hit the hub, which enforces JSON responseSchema per endpoint.
  • Client-side resilience: barcode scans use an 8s timeout + IndexedDB retry queue so techs in weak-LTE zones don't lose data. See ../features/vision-ocr.md for the full pipeline.

4. Security & Authentication Flow

Staff user → erp.gigafibre.ca/ops/  (or n8n, Gitea)
  → Traefik checks session via ForwardAuth middleware
  → Outpost validates with Authentik staff (auth.targo.ca)
  → Authorized? Request forwarded to upstream container
    with X-Authentik-Email + X-Authentik-Groups headers
  → Ops SPA reads X-Authentik-Email; useUserGroups maps groups
    to in-app capabilities
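
The last step of the staff flow (groups mapped to in-app capabilities) can be sketched as follows. This is an illustration in Python, not the SPA's actual useUserGroups code; the group names and capability names are hypothetical. The pipe separator follows Authentik's documented format for the groups header.

```python
# Hypothetical mapping; the real group and capability names live in the Ops SPA.
GROUP_CAPABILITIES = {
    "ops-admin": {"dispatch", "clients", "settings"},
    "ops-dispatch": {"dispatch"},
    "ops-support": {"clients"},
}


def capabilities_from_header(groups_header: str) -> set[str]:
    """Turn an X-Authentik-Groups value into a set of in-app capabilities."""
    # Authentik sends group names separated by "|"; skip blanks and unknown groups.
    groups = [g.strip() for g in groups_header.split("|") if g.strip()]
    caps: set[str] = set()
    for group in groups:
        caps |= GROUP_CAPABILITIES.get(group, set())
    return caps
```

Unknown groups map to no capabilities, so a user with no recognized group ends up with an empty set rather than an error.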

Customer user → client.gigafibre.ca
  → Traefik checks session via separate ForwardAuth chain
  → Outpost validates with Authentik client (id.gigafibre.ca)

Two distinct ForwardAuth middlewares:

  • authentik@file → backed by auth.targo.ca (staff)
  • authentik-client@file → backed by id.gigafibre.ca (customers)

ERPNext OAuth: auth.targo.ca is also configured as a Frappe Social Login Key (provider name Authentik). The login page at /login shows both the password form and the "Login with Authentik" button. OAuth client_id P0rFFdq2hhun7hOLwkF5zm87vvDqcVYAhLtoZnFX, redirect_uri /api/method/frappe.integrations.oauth2_logins.custom/authentik.

Adding new users is centralized through the hub, not the Authentik admin UI. The ops Settings page (Settings → Utilisateurs → Inviter) hits POST /auth/users on msg.gigafibre.ca which:

  1. Creates the Authentik user (random username from local-part of email, password set explicitly), assigns OPS_GROUPS.
  2. Sets a temp password (readable, no look-alikes) and emails it via the hub's Mailjet SMTP — Authentik's own recovery flow isn't wired (flow_recovery=None on the brand) and its global SMTP is unset, so the hub does it directly.
  3. Creates the matching ERPNext User (System User, social_logins = [{provider:authentik, userid:email}]) so OAuth finds it on first login.

The temp password is also returned to the admin (UI shows it with a copy button) so they can hand it over manually if Mailjet drops the message. See services/targo-hub/lib/auth.js for the full flow.
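
A "readable, no look-alikes" temp password, as described in step 2, could be generated along these lines. This is a minimal Python sketch for illustration; the hub itself is Node, and the alphabet, the default length, and the function name are assumptions rather than what services/targo-hub/lib/auth.js does.

```python
import secrets

# Assumed alphabet: letters and digits minus the look-alikes 0/O and 1/l/I.
READABLE_ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz23456789"


def make_temp_password(length: int = 12) -> str:
    """Return a random password drawn from the look-alike-free alphabet."""
    if length < 8:
        raise ValueError("temp passwords shorter than 8 characters are too weak")
    # secrets.choice uses the OS CSPRNG, unlike random.choice.
    return "".join(secrets.choice(READABLE_ALPHABET) for _ in range(length))
```

Dropping the ambiguous characters matters here because the admin may read the password aloud or copy it by hand when the email does not arrive.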

API Security: frontends rely on the Authentik session cookie forwarded by Traefik. Backend scripts and the hub send an Authorization: token <ERP_SERVICE_TOKEN> header (Frappe's token scheme, not Bearer).
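
For a backend script, the service-token header might be built like this. It is a sketch under assumptions: the helper name is hypothetical, the token value is a placeholder in Frappe's api_key:api_secret form, and the request is constructed but never sent.

```python
import urllib.request

ERP_BASE_URL = "https://erp.gigafibre.ca"


def build_erp_request(path: str, service_token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request authenticated with a Frappe API token."""
    req = urllib.request.Request(ERP_BASE_URL + path, method="GET")
    # Frappe API key/secret pairs use the scheme word "token".
    req.add_header("Authorization", f"token {service_token}")
    req.add_header("Accept", "application/json")
    return req


# Usage: placeholder credentials, never a real token in source control.
request = build_erp_request("/api/resource/Customer", "api_key:api_secret")
```

In practice the token would come from an environment variable or secret store rather than a literal string.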


5. Network Intelligence & CPE Flow

Device Diagnostics (targo-hub → GenieACS / OLT)

When a CSR clicks "Diagnostiquer" in the Ops app:

  1. Ops app asks /devices/lookup?serial=X.
  2. targo-hub polls GenieACS NBI.
  3. If deep data is needed, targo-hub queries modem-bridge (for TP-Link) or the OLT SNMP directly.
  4. Returns consolidated interface, mesh, wifi, and opticalStatus array to the UI.
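
The four steps above can be sketched as a single merge function. This is an illustrative Python outline, not the hub's Node implementation: the function name, the injected fetchers, and the deepError field are hypothetical, and tolerating a failed deep probe is an assumed design choice. Only the interface, mesh, wifi, and opticalStatus keys come from the description above.

```python
from typing import Callable, Optional


def diagnose(
    serial: str,
    poll_genieacs: Callable[[str], dict],
    fetch_deep: Optional[Callable[[str], dict]] = None,
) -> dict:
    """Merge GenieACS data with optional deep data into one response."""
    result = {"serial": serial, "interface": None, "mesh": None,
              "wifi": None, "opticalStatus": []}
    # Step 2: baseline data from the GenieACS NBI.
    result.update(poll_genieacs(serial))
    # Step 3: deep data (modem-bridge or OLT SNMP) only when a fetcher is supplied.
    if fetch_deep is not None:
        try:
            result.update(fetch_deep(serial))
        except Exception as exc:
            # Assumed behavior: report the failure but keep the baseline data.
            result["deepError"] = str(exc)
    return result
```

Passing the fetchers in as callables keeps the sketch independent of any real GenieACS or SNMP client.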

Future: QR Code Flow

  • Tech applies QR sticker to modem (msg.gigafibre.ca/q/{mac}).
  • Client scans QR → targo-hub identifies customer via MAC matching in ERPNext.
  • Triggers SMS OTP → Client views diagnostic portal.

6. Development Gotchas

  1. Traefik v3 is incompatible with Docker 29 due to API changes. Stay on v2.11.

  2. Never click "Generate Keys" for the Administrator user in ERPNext — it breaks the targo-hub API token (silently).

  3. Traccar API supports only one deviceId per request. Use parallel polling (Promise.allSettled) — see services/targo-hub/lib/traccar.js.

  4. Docker log rotation is set globally via /etc/docker/daemon.json (max-size=100m, max-file=3). The policy is applied at container creation, so old containers keep their previous (uncapped) policy until you recreate them with docker compose up -d --force-recreate. We learned this the hard way when the Oktopus broker filled /var/sdb with 75 GB of debug logs in 13 days.

  5. Weekly prune runs via /etc/cron.d/docker-prune on Sundays at 03:00 ET and clears anything not used in 30 days. A stack you only run monthly needs restart: always, or it will be pruned.

  6. PostgreSQL transaction-aborted errors in the backend log are usually benign (one bad query in the Frappe scheduler). If they persist, the connection pool needs a recycle: docker restart erpnext-backend-1 resolves it.

  7. Authentik recovery flow isn't configured on the brand. Don't use recovery_email/ from the API — use the hub invite flow described in §4 instead.

  8. ERPNext was built for MariaDB; we run it on PostgreSQL. Frappe and ERPNext generate SQL that MariaDB tolerates but Postgres rejects. Three incompatibility patterns recur:

    • GROUP BY clauses missing non-aggregated columns (Postgres rejects, MariaDB doesn't)
    • HAVING without a GROUP BY (same)
    • Double-quoted string literals: "Customer" is a string literal under MariaDB but is read as a column reference under Postgres

    We've hand-patched 12 hotspots (see feedback_erpnext_postgres.md in the working memory + patches/fix_pg_groupby.py), but 4 known issues remain in accounts/utils.py ~L1660, bank_clearance.py, bank_reconciliation_tool.py, and gross_profit.py. Symptom on the UI: a report or doctype list returns "column "X" does not exist" or stays blank.

    Recommendation — install the frappe_pg community app. It bundles a comprehensive set of PostgreSQL compatibility patches for Frappe + ERPNext as an external app — one bench install-app frappe_pg instead of patching files one by one on every ERPNext upgrade. Trade-off: a third-party app can lag behind ERPNext releases and may introduce its own issues, so:

    • Evaluate it first on staging (re-run the smoke test on the 4 known-broken UIs to confirm coverage)
    • Pin a known-good frappe_pg commit in apps.txt rather than tracking main
    • Keep our patches/fix_pg_groupby.py as a backup; remove it only after frappe_pg has been stable for 4-6 weeks

    When we apply our own patches, they go in patches/ (Python files that run during bench update) so they survive ERPNext upgrades. Never edit ERPNext source files in place inside the container: the next bench update clobbers them.

    Custom Server Scripts with raw SQL (e.g. our customer_balance endpoint) need the same vigilance: use 'Customer' not "Customer" for string literals. Add a bench export-fixtures step to version-control any Server Script we tweak so the fix isn't lost if ERPNext is re-deployed elsewhere.
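
To catch the double-quote pattern before a Server Script ships, a small check along these lines could run over its raw SQL. It is a heuristic sketch, not part of frappe_pg or of patches/fix_pg_groupby.py: the helper name and the sample queries are hypothetical, and it only looks at a double-quoted value that directly follows a comparison operator (so only the first value of an IN list is seen).

```python
import re

# Heuristic: a double-quoted token right after a comparison operator is almost
# always meant as a string literal, which Postgres reads as an identifier.
# Double-quoted table or column names elsewhere are legitimate and not flagged.
_SUSPECT = re.compile(r'(?:!=|<>|=|\blike\b|\bin\s*\()\s*"([^"]*)"', re.IGNORECASE)


def find_double_quoted_literals(sql: str) -> list[str]:
    """Return double-quoted values that look like string literals."""
    return _SUSPECT.findall(sql)


# Illustrative queries only; not taken from the customer_balance script.
BROKEN = 'SELECT name FROM "tabGL Entry" WHERE party_type = "Customer"'
FIXED = "SELECT name FROM \"tabGL Entry\" WHERE party_type = 'Customer'"

print(find_double_quoted_literals(BROKEN))
print(find_double_quoted_literals(FIXED))
```

The first query is flagged because of "Customer"; the second is clean because the literal uses single quotes while the quoted table name is left alone.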