ERPNext was built for MariaDB; we run it on PostgreSQL because that's
what fit the legacy migration. Frappe's SQL generator is loose on
MariaDB (missing GROUP BY columns OK, double-quoted strings OK,
HAVING without GROUP BY OK) but strict on Postgres, so we end up
hand-patching files in `patches/fix_pg_groupby.py` after every
ERPNext upgrade. The community has packaged a comprehensive fix as
a Frappe app — `frappe_pg` — that covers the same bugs in one
place. The cleaner path long-term is to install that app instead
of growing our own patch set.
Two doc updates:
- docs/architecture/overview.md §6 item 8 — full background:
the 3 SQL patterns that break (GROUP BY, HAVING, double-quoted
string literals), the 12 hotspots we've already patched, the
4 known remaining (bank_clearance, bank_reconciliation_tool,
accounts/utils L1660, gross_profit), and the install
recommendation with trade-offs (pin a commit, validate on
staging, keep our patches as backup for 4-6 weeks).
- docs/SETUP.md §7 — quick-start install commands for whoever
decides to flip the switch, plus the warning about pinning
rather than tracking main. Also notes that custom Server
Scripts with raw SQL (like `customer_balance`) need the same
single-quote vs double-quote vigilance even after installing
frappe_pg, and the export-fixtures hint to version-control
them.
191 lines
11 KiB
Markdown
# Gigafibre FSM — Ecosystem Architecture

> Unified reference document for infrastructure, platform strategy, and application architecture on the remote Docker environment.

## 1. Executive Summary & Platform Strategy

Gigafibre FSM is the operations platform for Gigafibre. It replaces a
legacy PHP/MariaDB stack with a real-time push ecosystem (Vue 3,
Node.js, ERPNext) running on a single Proxmox VM at `96.125.196.67`.

Core pillars:
- **ERPNext v16** — undisputed Source of Truth (CRM, billing, ticketing).
- **Ops SPA** at `erp.gigafibre.ca/ops/` — single pane of glass for
  internal teams (dispatch, clients, settings, agent flows).
- **targo-hub** at `msg.gigafibre.ca` — real-time API gateway (SMS,
  SSE, AI, OAuth admin, Stripe webhooks, Traccar proxy).
- **Client portal** at `client.gigafibre.ca` — customer self-service.

**Decommissioned (May 2026):**
- ✗ `Oktopus CE` (TR-369 stack at `oss.gigafibre.ca`) — broker spammed
  75 GB of debug logs over 13 days, took ERPNext down for 4. Stack
  removed (containers + volumes + images). The hub gates the integration
  behind `OKTOPUS_DISABLED=1` so the modules can be re-enabled later if
  we deploy a different USP controller.
- ✗ `dispatch-app` (legacy PHP SPA at `dispatch.gigafibre.ca`) — now
  301-redirects to `/ops/#/dispatch`. nginx config at
  `/opt/dispatch-app/nginx.conf` on the prod box.
- ✗ `apps/field` — replaced by the lightweight mobile tech page at
  `/t/{token}` (server-rendered by `services/targo-hub/lib/tech-mobile.js`).

**Two Authentik instances, in parallel — not a migration:**
- `auth.targo.ca` (staff) — protects /ops/, n8n, Gitea; OAuth provider
  for ERPNext sign-in.
- `id.gigafibre.ca` (clients) — protects the customer portal.

---

## 2. Infrastructure & Docker Networks

All services are containerized and housed on a single Proxmox VM (`96.125.196.67`), managed via Traefik.

```text
Internet
    │
96.125.196.67 (Proxmox VM, Ubuntu 24.04)
    │
    ├─ Traefik v2.11 (:80/:443, Let's Encrypt, ForwardAuth)
    │
    ├─ Authentik (auth.targo.ca) → SSO for staff (ops, n8n, Gitea, ERPNext OAuth)
    ├─ Authentik (id.gigafibre.ca) → SSO for client portal
    │
    ├─ ERPNext v16.10.1 (erp.gigafibre.ca) → 9 containers (db, redis, backend, queues, scheduler, websocket, n8n, n8n-proxy)
    │
    ├─ Ops SPA (erp.gigafibre.ca/ops/) → Served via nginx:alpine from /opt/ops-app/
    ├─ Dispatch redirect (dispatch.gigafibre.ca) → 301 → /ops/#/dispatch (former dispatch-app, decommissioned)
    │
    ├─ targo-hub (msg.gigafibre.ca) → Node 20, /opt/targo-hub/
    ├─ DocuSeal (sign.gigafibre.ca) → Contract e-signature
    ├─ traccar-proxy → nginx relay for Traccar UI
    │
    └─ Marketing site (www.gigafibre.ca) → React/Vite/Tailwind
```

**DNS Configuration (Cloudflare):**
- Domain `gigafibre.ca` is strictly DNS-only (no Cloudflare proxy) to allow Traefik Let's Encrypt generation.
- Email via Mailjet + Google Workspace records configured on root.

**Docker Networks:**
- `proxy`: Public-facing network connected to Traefik.
- `erpnext_erpnext`: Internal network for Frappe, Postgres, Redis, and targo-hub routing.

---

## 3. Core Services

### ERPNext (The Backend)
- **Database:** PostgreSQL (`erpnext-db-1`).
- **Extensions:** Custom doctypes for Dispatch Job, Technician, Tag, Service Location, Service Equipment, Subscription.
- **API Token Auth:** `targo-hub` and the Ops PWA interact with Frappe via a highly-privileged service token (`Authorization: token ...`).

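
The token header from the **API Token Auth** bullet can be sketched in a few lines. This is only an illustration in Python (the hub itself is Node.js): `build_frappe_request` and the `erp.example.test` host are hypothetical, and the token value is a placeholder in Frappe's `api_key:api_secret` form. The request is built but never sent.

```python
import urllib.request
from urllib.parse import quote


def build_frappe_request(base_url: str, doctype: str, service_token: str) -> urllib.request.Request:
    """Build (but do not send) a GET on a Frappe resource with token auth.

    `service_token` is assumed to be the usual "api_key:api_secret" pair.
    """
    # Doctype names may contain spaces ("Dispatch Job"), so URL-encode them.
    url = f"{base_url.rstrip('/')}/api/resource/{quote(doctype)}"
    return urllib.request.Request(
        url,
        headers={
            # Frappe's scheme is the literal word "token", not "Bearer".
            "Authorization": f"token {service_token}",
            "Accept": "application/json",
        },
        method="GET",
    )


# Placeholder host and credentials, for illustration only.
req = build_frappe_request("https://erp.example.test/", "Dispatch Job", "api_key:api_secret")
```

Because this token is highly privileged, it would normally be read from the environment or a secret store rather than written into source.
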
### Targo-Hub (API Gateway)
- **Stack:** Node.js 20 (`msg.gigafibre.ca:3300`).
- **Purpose:** Acts as the middleman for all heavy or real-time workflows out of ERPNext's scope.
- **Key Abilities:**
  - Real-time Server-Sent Events (SSE) for timeline/chat updates.
  - Twilio SMS / Voice (IVR) routing.
  - Modem polling (GenieACS, OLT SNMP proxy).
  - Webhooks handling (Stripe payments, Uptime-Kuma, 3CX).

### Modem-Bridge
- **Stack:** Playwright/Chromium (`:3301` internal).
- **Purpose:** Allows reading encrypted TR-181 parameters from TP-Link XX230v modems by leveraging the modem's native JS cryptography. Exposes a simple JSON REST API locally to targo-hub.

### Vision / OCR (Gemini via targo-hub)
- **Model:** Gemini 2.5 Flash (Google) — no local GPU, all inference remote.
- **Endpoints (hub):** `/vision/barcodes`, `/vision/equipment`, `/vision/invoice`.
- **Why centralized:** ops VM has no GPU, so the legacy Ollama `llama3.2-vision` install was retired. All three frontends (ops, field-as-ops `/j`, future client portal) hit the hub, which enforces JSON `responseSchema` per endpoint.
- **Client-side resilience:** barcode scans use an 8s timeout + IndexedDB retry queue so techs in weak-LTE zones don't lose data. See [../features/vision-ocr.md](../features/vision-ocr.md) for the full pipeline.

---

## 4. Security & Authentication Flow

```text
Staff user → erp.gigafibre.ca/ops/ (or n8n, Gitea)
  → Traefik checks session via ForwardAuth middleware
  → Outpost validates with Authentik staff (auth.targo.ca)
  → Authorized? Request forwarded to upstream container
    with X-Authentik-Email + X-Authentik-Groups headers
  → Ops SPA reads X-Authentik-Email; useUserGroups maps groups
    to in-app capabilities

Customer user → client.gigafibre.ca
  → Traefik checks session via separate ForwardAuth chain
  → Outpost validates with Authentik client (id.gigafibre.ca)
```

**Two distinct ForwardAuth middlewares**:
- `authentik@file` → backed by `auth.targo.ca` (staff)
- `authentik-client@file` → backed by `id.gigafibre.ca` (customers)

**ERPNext OAuth** — `auth.targo.ca` is also configured as a Frappe
Social Login Key (provider name `Authentik`). The login page at
`/login` shows both the password form and the "Login with Authentik"
button. OAuth client_id `P0rFFdq2hhun7hOLwkF5zm87vvDqcVYAhLtoZnFX`,
redirect_uri `/api/method/frappe.integrations.oauth2_logins.custom/authentik`.

**Adding new users** is centralized through the hub, not the Authentik
admin UI. The ops Settings page (`Settings → Utilisateurs → Inviter`)
hits `POST /auth/users` on `msg.gigafibre.ca` which:
1. Creates the Authentik user (random username from local-part of email,
   password set explicitly), assigns OPS_GROUPS.
2. Sets a temp password (readable, no look-alikes) and emails it via
   the hub's Mailjet SMTP — Authentik's own recovery flow isn't wired
   (`flow_recovery=None` on the brand) and its global SMTP is unset,
   so the hub does it directly.
3. Creates the matching ERPNext User (System User, social_logins =
   [{provider:authentik, userid:email}]) so OAuth finds it on first
   login.

The temp password is also returned to the admin (UI shows it with a
copy button) so they can hand it over manually if Mailjet drops the
message. See `services/targo-hub/lib/auth.js` for the full flow.

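
As a rough illustration of "readable, no look-alikes" in step 2, here is a minimal Python sketch. It is not the code in `services/targo-hub/lib/auth.js`, which is Node.js. The alphabet, the group layout, and the name `make_temp_password` are all assumptions made for the example.

```python
import secrets

# Assumed alphabet: letters and digits minus the usual look-alikes
# (0/O, 1/l/I, lowercase i and o). The hub's real character set may differ.
SAFE_ALPHABET = "abcdefghjkmnpqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def make_temp_password(groups: int = 3, group_len: int = 4) -> str:
    """Return dash-separated groups of random characters from SAFE_ALPHABET."""
    chunks = [
        # secrets.choice draws from the OS CSPRNG, unlike random.choice.
        "".join(secrets.choice(SAFE_ALPHABET) for _ in range(group_len))
        for _ in range(groups)
    ]
    return "-".join(chunks)


temp_password = make_temp_password()
```

Splitting the password into short groups makes it easier to read aloud or retype when an admin hands it over manually.
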

**API Security**: frontends rely on the Authentik session cookie
forwarded by Traefik. Backend scripts and the hub send an
`Authorization: token <ERP_SERVICE_TOKEN>` header (Frappe's token
scheme, not a Bearer token).

---

## 5. Network Intelligence & CPE Flow

**Device Diagnostics (`targo-hub → GenieACS / OLT`)**
When a CSR clicks "Diagnostiquer" in the Ops app:
1. Ops app asks `/devices/lookup?serial=X`.
2. `targo-hub` polls GenieACS NBI.
3. If deep data is needed, `targo-hub` queries `modem-bridge` (for TP-Link) or the OLT SNMP directly.
4. Returns consolidated interface, mesh, wifi, and opticalStatus array to the UI.

**Future: QR Code Flow**
- Tech applies QR sticker to modem (`msg.gigafibre.ca/q/{mac}`).
- Client scans QR → `targo-hub` identifies customer via MAC matching in ERPNext.
- Triggers SMS OTP → Client views diagnostic portal.

---

## 6. Development Gotchas

1. **Traefik v3** is incompatible with Docker 29 due to API changes. Stay on v2.11.
2. **Never click "Generate Keys"** for the Administrator user in ERPNext — it breaks the `targo-hub` API token (silently).
3. **Traccar API** supports only one `deviceId` per request. Use parallel polling (`Promise.allSettled`) — see `services/targo-hub/lib/traccar.js`.
4. **Docker log rotation** is set globally via `/etc/docker/daemon.json` (`max-size=100m, max-file=3`). Applied at container creation — old containers keep their previous (uncapped) policy until you `compose up -d --force-recreate` them. We learned this the hard way when the Oktopus broker filled `/var/sdb` with 75 GB of debug logs in 13 days.
5. **Weekly prune** runs via `/etc/cron.d/docker-prune` Sunday 03:00 ET — clears anything not used in 30 days. Don't add a stack you only run monthly without `restart: always` or it'll get pruned out.
6. **PostgreSQL transaction-aborted errors** in the backend log — usually benign (one bad query in the Frappe scheduler), but if they persist, the connection pool needs a recycle: `docker restart erpnext-backend-1` resolves it.
7. **Authentik recovery flow** isn't configured on the brand. Don't use `recovery_email/` from the API — use the hub invite flow described in §4 instead.
8. **ERPNext was built for MariaDB; we run it on PostgreSQL.** Frappe and ERPNext generate SQL that MariaDB tolerates but Postgres rejects. Three incompatibility patterns recur:
   - `GROUP BY` clauses missing non-aggregated columns (Postgres rejects, MariaDB doesn't)
   - `HAVING` without a `GROUP BY` (same)
   - Double-quoted values such as `"Customer"`: Postgres reads them as identifiers (column references), MariaDB as string literals

   We've hand-patched 12 hotspots (see `feedback_erpnext_postgres.md` in the working memory + `patches/fix_pg_groupby.py`), but 4 known issues remain in `accounts/utils.py` ~L1660, `bank_clearance.py`, `bank_reconciliation_tool.py`, and `gross_profit.py`. Symptom in the UI: a report or doctype list returns *"column "X" does not exist"* or stays blank.

   **Recommendation — install the [`frappe_pg`](https://github.com/the-commit-company/frappe_pg) community app.** It bundles a comprehensive set of PostgreSQL compatibility patches for Frappe + ERPNext as an external app — one `bench install-app frappe_pg` instead of patching files one by one on every ERPNext upgrade. Trade-off: a third-party app can lag behind ERPNext releases and may introduce its own issues, so:
   - Evaluate it first on staging (re-run the smoke test on the 4 known-broken UIs to confirm coverage)
   - Pin a known-good `frappe_pg` commit in `apps.txt` rather than tracking `main`
   - Keep our `patches/fix_pg_groupby.py` as a backup; remove it only after frappe_pg has been stable for 4-6 weeks

   When we apply our own patches, they go in `patches/` (Python files that run during `bench update`) so they survive ERPNext upgrades. Never edit ERPNext source files in place inside the container — the next `bench update` clobbers them.

   Custom Server Scripts with raw SQL (e.g. our `customer_balance` endpoint) need the same vigilance: use `'Customer'` not `"Customer"` for string literals. Add a `bench export-fixtures` step to version-control any Server Script we tweak so the fix isn't lost if ERPNext is re-deployed elsewhere.
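
The single-quote rule for raw SQL in Server Scripts could be checked mechanically before a script is saved. The sketch below is a heuristic, not a SQL parser, and `find_suspect_literals` is a hypothetical helper name. It flags double-quoted tokens that directly follow a comparison operator, `LIKE`, or `IN (`, and skips tokens followed by a dot because those are usually qualified identifiers such as `"tabCustomer"."name"`.

```python
import re

# A double-quoted token right after =, !=, <>, LIKE or IN ( is most likely
# meant as a string literal. Tokens followed by "." are treated as
# qualified identifiers and ignored.
_SUSPECT = re.compile(
    r'(?:=|<>|!=|\bLIKE\b|\bIN\s*\()\s*"([^"]*)"(?!\s*\.)',
    re.IGNORECASE,
)


def find_suspect_literals(sql: str) -> list[str]:
    """Return double-quoted values that Postgres would parse as identifiers."""
    return [match.group(1) for match in _SUSPECT.finditer(sql)]


# Illustrative queries, not taken from the actual customer_balance script.
bad = 'SELECT name FROM "tabSales Invoice" WHERE party_type = "Customer"'
good = "SELECT name FROM \"tabSales Invoice\" WHERE party_type = 'Customer'"
join = 'SELECT 1 FROM "tabA" JOIN "tabB" ON "tabA"."x" = "tabB"."y"'

suspects = find_suspect_literals(bad)
```

A legitimately quoted bare column on the right-hand side of a comparison would be a false positive, and only the first item of an `IN (...)` list is inspected, so treat a hit as a prompt to review the query rather than as proof of a bug.
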