ERPNext was built for MariaDB; we run it on PostgreSQL because that's
what fit the legacy migration. Frappe's SQL generator is loose on
MariaDB (missing GROUP BY columns OK, double-quoted strings OK,
HAVING without GROUP BY OK) but strict on Postgres, so we end up
hand-patching files in `patches/fix_pg_groupby.py` after every
ERPNext upgrade. The community has packaged a comprehensive fix as
a Frappe app — `frappe_pg` — that covers the same bugs in one
place. The cleaner path long-term is to install that app instead
of growing our own patch set.
Two doc updates:
- docs/architecture/overview.md §6 item 8 — full background:
the 3 SQL patterns that break (GROUP BY, HAVING, double-quoted
string literals), the 12 hotspots we've already patched, the
4 known remaining (bank_clearance, bank_reconciliation_tool,
accounts/utils L1660, gross_profit), and the install
recommendation with trade-offs (pin a commit, validate on
staging, keep our patches as backup for 4-6 weeks).
- docs/SETUP.md §7 — quick-start install commands for whoever
decides to flip the switch, plus the warning about pinning
rather than tracking main. Also notes that custom Server
Scripts with raw SQL (like `customer_balance`) need the same
single-quote vs double-quote vigilance even after installing
frappe_pg, and the export-fixtures hint to version-control
them.
Gigafibre FSM — Ecosystem Architecture
Unified reference document for infrastructure, platform strategy, and application architecture on the remote Docker environment.
1. Executive Summary & Platform Strategy
Gigafibre FSM is the operations platform for Gigafibre. It replaces a
legacy PHP/MariaDB stack with a real-time push ecosystem (Vue 3,
Node.js, ERPNext) running on a single Proxmox VM at 96.125.196.67.
Core pillars:
- ERPNext v16: undisputed Source of Truth (CRM, billing, ticketing).
- Ops SPA at erp.gigafibre.ca/ops/: single pane of glass for internal teams (dispatch, clients, settings, agent flows).
- targo-hub at msg.gigafibre.ca: real-time API gateway (SMS, SSE, AI, OAuth admin, Stripe webhooks, Traccar proxy).
- Client portal at client.gigafibre.ca: customer self-service.
Decommissioned (May 2026):
- ✗ Oktopus CE (TR-369 stack at oss.gigafibre.ca): broker spammed 75 GB of debug logs over 13 days, took ERPNext down for 4. Stack removed (containers + volumes + images). The hub gates the integration behind OKTOPUS_DISABLED=1 so the modules can be re-enabled later if we deploy a different USP controller.
- ✗ dispatch-app (legacy PHP SPA at dispatch.gigafibre.ca): now 301-redirects to /ops/#/dispatch. nginx config at /opt/dispatch-app/nginx.conf on the prod box.
- ✗ apps/field: replaced by the lightweight mobile tech page at /t/{token} (server-rendered by services/targo-hub/lib/tech-mobile.js).
Two Authentik instances, in parallel — not a migration:
- auth.targo.ca (staff): protects /ops/, n8n, Gitea; OAuth provider for ERPNext sign-in.
- id.gigafibre.ca (clients): protects the customer portal.
2. Infrastructure & Docker Networks
All services are containerized and housed on a single Proxmox VM (96.125.196.67), managed via Traefik.
Internet
│
96.125.196.67 (Proxmox VM, Ubuntu 24.04)
│
├─ Traefik v2.11 (:80/:443, Let's Encrypt, ForwardAuth)
│
├─ Authentik (auth.targo.ca) → SSO for staff (ops, n8n, Gitea, ERPNext OAuth)
├─ Authentik (id.gigafibre.ca) → SSO for client portal
│
├─ ERPNext v16.10.1 (erp.gigafibre.ca) → 9 containers (db, redis, backend, queues, scheduler, websocket, n8n, n8n-proxy)
│
├─ Ops SPA (erp.gigafibre.ca/ops/) → Served via nginx:alpine from /opt/ops-app/
├─ Dispatch redirect (dispatch.gigafibre.ca) → 301 → /ops/#/dispatch (former dispatch-app, decommissioned)
│
├─ targo-hub (msg.gigafibre.ca) → Node 20, /opt/targo-hub/
├─ DocuSeal (sign.gigafibre.ca) → Contract e-signature
├─ traccar-proxy → nginx relay for Traccar UI
│
└─ Marketing site (www.gigafibre.ca) → React/Vite/Tailwind
DNS Configuration (Cloudflare):
- Domain gigafibre.ca is strictly DNS-only (no Cloudflare proxy) to allow Traefik Let's Encrypt generation.
- Email via Mailjet + Google Workspace records configured on root.
Docker Networks:
- proxy: Public-facing network connected to Traefik.
- erpnext_erpnext: Internal network for Frappe, Postgres, Redis, and targo-hub routing.
3. Core Services
ERPNext (The Backend)
- Database: PostgreSQL (erpnext-db-1).
- Extensions: Custom doctypes for Dispatch Job, Technician, Tag, Service Location, Service Equipment, Subscription.
- API Token Auth: targo-hub and the Ops PWA interact with Frappe via a highly-privileged service token (Authorization: token ...).
Targo-Hub (API Gateway)
- Stack: Node.js 20 (msg.gigafibre.ca:3300).
- Purpose: Acts as the middleman for all heavy or real-time workflows out of ERPNext's scope.
- Key Abilities:
  - Real-time Server-Sent Events (SSE) for timeline/chat updates.
  - Twilio SMS / Voice (IVR) routing.
  - Modem polling (GenieACS, OLT SNMP proxy).
  - Webhooks handling (Stripe payments, Uptime-Kuma, 3CX).
Modem-Bridge
- Stack: Playwright/Chromium (:3301 internal).
- Purpose: Allows reading encrypted TR-181 parameters from TP-Link XX230v modems by leveraging the modem's native JS cryptography. Exposes a simple JSON REST API locally to targo-hub.
Vision / OCR (Gemini via targo-hub)
- Model: Gemini 2.5 Flash (Google), no local GPU, all inference remote.
- Endpoints (hub): /vision/barcodes, /vision/equipment, /vision/invoice.
- Why centralized: ops VM has no GPU, so the legacy Ollama llama3.2-vision install was retired. All three frontends (ops, field-as-ops /j, future client portal) hit the hub, which enforces JSON responseSchema per endpoint.
- Client-side resilience: barcode scans use an 8s timeout + IndexedDB retry queue so techs in weak-LTE zones don't lose data. See ../features/vision-ocr.md for the full pipeline.
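The timeout-plus-retry behaviour in the last bullet can be sketched as follows. This is a minimal illustration in Python rather than the browser code itself: `RetryQueue` is an in-memory stand-in for the IndexedDB queue, and `submit_scan`, `send`, and the payload shape are hypothetical names that do not come from the actual frontends.

```python
from collections import deque

SCAN_TIMEOUT_S = 8.0  # the 8s budget mentioned above


class RetryQueue:
    """In-memory stand-in for the IndexedDB retry queue (hypothetical)."""

    def __init__(self):
        self._pending = deque()

    def __len__(self):
        return len(self._pending)

    def enqueue(self, payload):
        self._pending.append(payload)

    def flush(self, send, timeout_s=SCAN_TIMEOUT_S):
        """Retry every queued scan once; keep the ones that still fail."""
        still_failing = deque()
        while self._pending:
            payload = self._pending.popleft()
            try:
                send(payload, timeout_s)
            except OSError:
                still_failing.append(payload)
        self._pending = still_failing


def submit_scan(payload, send, queue, timeout_s=SCAN_TIMEOUT_S):
    """Try to send a scan; on timeout or network error, park it for later.

    `send(payload, timeout_s)` is assumed to raise TimeoutError or
    ConnectionError (both subclasses of OSError) when the upload fails.
    """
    try:
        return send(payload, timeout_s)
    except OSError:
        queue.enqueue(payload)
        return None
```

The caller would invoke `flush` when connectivity returns; the trigger for that (online event, timer) is left out here because the source does not specify it.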
4. Security & Authentication Flow
Staff user → erp.gigafibre.ca/ops/ (or n8n, Gitea)
→ Traefik checks session via ForwardAuth middleware
→ Outpost validates with Authentik staff (auth.targo.ca)
→ Authorized? Request forwarded to upstream container
with X-Authentik-Email + X-Authentik-Groups headers
→ Ops SPA reads X-Authentik-Email; useUserGroups maps groups
to in-app capabilities
Customer user → client.gigafibre.ca
→ Traefik checks session via separate ForwardAuth chain
→ Outpost validates with Authentik client (id.gigafibre.ca)
Two distinct ForwardAuth middlewares:
- authentik@file → backed by auth.targo.ca (staff)
- authentik-client@file → backed by id.gigafibre.ca (customers)
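To make the header hand-off concrete, here is a minimal sketch of turning the forwarded headers into in-app capabilities. It is written in Python for illustration and is not the Ops SPA's `useUserGroups` code. The group names and capability names are hypothetical, and the pipe separator for `X-Authentik-Groups` is an assumption based on how Authentik's proxy provider usually formats that header; verify it against the deployed outpost.

```python
# Hypothetical mapping; the real group and capability names live in the Ops SPA.
GROUP_CAPABILITIES = {
    "ops-admin": {"dispatch", "clients", "settings"},
    "ops-dispatch": {"dispatch"},
}


def capabilities_from_headers(headers):
    """Return (email, capabilities) derived from the ForwardAuth headers."""
    email = headers.get("X-Authentik-Email", "")
    raw_groups = headers.get("X-Authentik-Groups", "")
    # Assumption: groups arrive as a single pipe-separated string.
    groups = [g.strip() for g in raw_groups.split("|") if g.strip()]

    capabilities = set()
    for group in groups:
        # Unknown groups grant nothing rather than raising.
        capabilities |= GROUP_CAPABILITIES.get(group, set())
    return email, capabilities
```

Unknown groups are ignored on purpose, so adding a group in Authentik never widens access until the mapping is updated.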
ERPNext OAuth — auth.targo.ca is also configured as a Frappe
Social Login Key (provider name Authentik). The login page at
/login shows both the password form and the "Login with Authentik"
button. OAuth client_id P0rFFdq2hhun7hOLwkF5zm87vvDqcVYAhLtoZnFX,
redirect_uri /api/method/frappe.integrations.oauth2_logins.custom/authentik.
Adding new users is centralized through the hub, not the Authentik
admin UI. The ops Settings page (Settings → Utilisateurs → Inviter)
hits POST /auth/users on msg.gigafibre.ca which:
- Creates the Authentik user (random username from local-part of email, password set explicitly), assigns OPS_GROUPS.
- Sets a temp password (readable, no look-alikes) and emails it via the hub's Mailjet SMTP. Authentik's own recovery flow isn't wired (flow_recovery=None on the brand) and its global SMTP is unset, so the hub does it directly.
- Creates the matching ERPNext User (System User, social_logins = [{provider:authentik, userid:email}]) so OAuth finds it on first login.
The temp password is also returned to the admin (UI shows it with a
copy button) so they can hand it over manually if Mailjet drops the
message. See services/targo-hub/lib/auth.js for the full flow.
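As an illustration of a "readable, no look-alikes" temp password, here is a minimal Python sketch. It is not the code in services/targo-hub/lib/auth.js: the excluded character set, the default length, and the function name are assumptions chosen for the example.

```python
import secrets
import string

# Assumption: characters that are easy to confuse when read aloud or copied by hand.
LOOKALIKES = set("0O1lI")
ALPHABET = "".join(
    c for c in string.ascii_letters + string.digits if c not in LOOKALIKES
)


def temp_password(length=12):
    """Generate a temp password from a look-alike-free alphabet.

    Uses the `secrets` module so the choice is cryptographically random.
    """
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Dropping five characters from a 62-character alphabet costs very little entropy per character, which is why this trade-off is common for passwords that an admin may have to read out manually.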
API Security: frontends rely on the Authentik session cookie
forwarded by Traefik. Backend scripts and the hub send
Authorization: token <ERP_SERVICE_TOKEN> headers (Frappe's token scheme, not an OAuth Bearer token).
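A minimal sketch of what a backend script's request looks like with the service token. It only builds the request object and sends nothing. The `/api/resource/<doctype>` path is Frappe's standard REST route; the function name and the example token value are placeholders.

```python
import urllib.parse
import urllib.request


def erp_request(base_url, doctype, service_token):
    """Build (but do not send) a Frappe REST request using token auth.

    `service_token` is expected in Frappe's `api_key:api_secret` form.
    """
    url = f"{base_url.rstrip('/')}/api/resource/{urllib.parse.quote(doctype)}"
    return urllib.request.Request(
        url,
        headers={
            # Frappe's token scheme: the literal word "token", then key:secret.
            "Authorization": f"token {service_token}",
            "Accept": "application/json",
        },
    )
```

In practice the token would come from the environment or a secrets store, never from source control, since it is highly privileged.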
5. Network Intelligence & CPE Flow
Device Diagnostics (targo-hub → GenieACS / OLT)
When a CSR clicks "Diagnostiquer" in the Ops app:
- Ops app asks /devices/lookup?serial=X.
- targo-hub polls GenieACS NBI.
- If deep data is needed, targo-hub queries modem-bridge (for TP-Link) or the OLT SNMP directly.
- Returns consolidated interface, mesh, wifi, and opticalStatus array to the UI.
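The lookup steps above can be sketched as a small orchestration function. This is an illustrative Python outline, not the hub's Node.js implementation: the three data sources are injected as callables, and the `vendor` field, the "missing field means deep data is needed" rule, and all function names are assumptions.

```python
# The four sections the UI receives, per the flow above.
RESULT_FIELDS = ("interface", "mesh", "wifi", "opticalStatus")


def lookup_device(serial, poll_genieacs, query_modem_bridge, query_olt_snmp):
    """Consolidate device data, going to a deep source only when needed."""
    base = poll_genieacs(serial)
    result = {field: base.get(field) for field in RESULT_FIELDS}
    result["serial"] = serial

    # Assumption: "deep data is needed" means GenieACS left a section empty.
    missing = [field for field in RESULT_FIELDS if result[field] is None]
    if missing:
        if base.get("vendor") == "TP-Link":
            deep = query_modem_bridge(serial)
        else:
            deep = query_olt_snmp(serial)
        for field in missing:
            result[field] = deep.get(field)
    return result
```

Injecting the sources keeps the consolidation logic testable without a live GenieACS, modem-bridge, or OLT.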
Future: QR Code Flow
- Tech applies QR sticker to modem (msg.gigafibre.ca/q/{mac}).
- Client scans QR → targo-hub identifies customer via MAC matching in ERPNext.
- Triggers SMS OTP → Client views diagnostic portal.
6. Development Gotchas
- Traefik v3 is incompatible with Docker 29 due to API changes. Stay on v2.11.
- Never click "Generate Keys" for the Administrator user in ERPNext: it breaks the targo-hub API token (silently).
- Traccar API supports only one deviceId per request. Use parallel polling (Promise.allSettled); see services/targo-hub/lib/traccar.js.
- Docker log rotation is set globally via /etc/docker/daemon.json (max-size=100m, max-file=3). It is applied at container creation: old containers keep their previous (uncapped) policy until you compose up -d --force-recreate them. We learned this the hard way when the Oktopus broker filled /var/sdb with 75 GB of debug logs in 13 days.
- Weekly prune runs via /etc/cron.d/docker-prune Sunday 03:00 ET and clears anything not used in 30 days. Don't add a stack you only run monthly without restart: always or it'll get pruned out.
- PostgreSQL transaction-aborted errors in the backend log are usually benign (one bad query in the Frappe scheduler), but if they persist, the connection pool needs a recycle. docker restart erpnext-backend-1 resolves it.
- Authentik recovery flow isn't configured on the brand. Don't use recovery_email/ from the API; use the hub invite flow described in §4 instead.
- ERPNext was built for MariaDB; we run it on PostgreSQL. Frappe and ERPNext generate SQL that MariaDB tolerates but Postgres rejects. Three recurring incompatibility patterns:
  - GROUP BY clauses missing non-aggregated columns (Postgres rejects, MariaDB doesn't)
  - HAVING without a GROUP BY (same)
  - "Customer" interpreted as a column reference under Postgres (it's a string literal under MariaDB)

  We've hand-patched 12 hotspots (see feedback_erpnext_postgres.md in the working memory + patches/fix_pg_groupby.py), but 4 known issues remain in accounts/utils.py ~L1660, bank_clearance.py, bank_reconciliation_tool.py, and gross_profit.py. Symptom on the UI: a report or doctype list returns "column "X" does not exist" or stays blank.

  Recommendation: install the frappe_pg community app. It bundles a comprehensive set of PostgreSQL compatibility patches for Frappe + ERPNext as an external app, so a single bench install-app frappe_pg replaces patching files one by one on every ERPNext upgrade. Trade-off: a third-party app can lag behind ERPNext releases and may introduce its own issues, so:
  - Evaluate it first on staging (re-run the smoke test on the 4 known-broken UIs to confirm coverage)
  - Pin a known-good frappe_pg commit in apps.txt rather than tracking main
  - Keep our patches/fix_pg_groupby.py as a backup; remove it only after frappe_pg has been stable for 4-6 weeks

  When we apply our own patches, they go in patches/ (Python files that run during bench update) so they survive ERPNext upgrades. Never edit ERPNext source files in-place inside the container: the next bench update clobbers them.

  Custom Server Scripts with raw SQL (e.g. our customer_balance endpoint) need the same vigilance: use 'Customer' not "Customer" for string literals. Add a bench export-fixtures step to version-control any Server Script we tweak so the fix isn't lost if ERPNext is re-deployed elsewhere.
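For the string-literal pattern, a quick heuristic check can catch most offenders before they reach Postgres. The sketch below is a hypothetical helper, not part of patches/fix_pg_groupby.py or frappe_pg. It flags only a double-quoted token that directly follows a comparison operator, so it will miss other positions (for example inside an IN list) and can false-positive on a comparison between two quoted identifiers.

```python
import re

# A double-quoted token right after =, != , <> or LIKE is almost always
# meant as a string literal; Postgres will read it as an identifier instead.
_LITERAL_AFTER_OP = re.compile(
    r'(?:=|<>|\bLIKE\b)\s*"([^"]*)"',
    re.IGNORECASE,
)


def suspicious_literals(sql):
    """Return double-quoted tokens that look like intended string literals."""
    return _LITERAL_AFTER_OP.findall(sql)


def assert_pg_safe(sql):
    """Raise ValueError when the query contains a likely MariaDB-style literal."""
    found = suspicious_literals(sql)
    if found:
        raise ValueError(
            f"double-quoted literal(s) {found}: use single quotes for Postgres"
        )
```

A check like this could run over Server Script bodies in CI or before a fixtures export. Quoted table names such as "tabSales Invoice" after FROM are left alone because they are legitimate identifiers.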