Update architecture risk review

2026-02-18 19:24:36 +01:00
parent 06ae85f427
commit 608c5368b3
1 changed files with 99 additions and 97 deletions
--- a/REVIEW.md
+++ b/REVIEW.md
@@ -1,163 +1,165 @@
-# Review - Pick'n'Play
+# Review - Pick'n'Play (State-of-the-Art Assessment)

-Date: 2026-02-08
+Date: 2026-02-18

-## Scope and baseline
+## Scope

- Reviewed backend, frontend, data model, tests, CI, deployment scripts, and documentation.
- Ran local CI checks via `scripts/ci-local.ps1 -SkipNpmInstall` (pass: lint, format, build, tests).
- Ran coverage collection (`dotnet test --collect:"XPlat Code Coverage"`).
- Latest collected coverage: line-rate `92.75%`, branch-rate `74.88%` (`GameList.Tests/TestResults/*/coverage.cobertura.xml:2`).
+- Evaluated backend (`Program.cs`, `Endpoints/*`, `Infrastructure/*`, `Data/*`), frontend (`wwwroot/*`), and CI/deployment scripts.
+- Focused on risks in maintainability, extensibility, scalability, and security.

 ## Executive summary

-The app is in a solid functional state for a small group and has good safety basics (auth, rate limits, tests, security headers).  
-However, several design and reliability issues will slow down extension work and make this repo hard to use as a reusable foundation for new apps unless addressed first.
+The app is solid for a small private group, but it still has several architectural bottlenecks compared to current best practice for long-term product growth. The largest risks are read amplification from client polling, hard-coded workflow/permission modeling, and security hardening gaps (CSRF posture and crypto modernization).

-## Findings (highest priority first)
+## Findings

-### P0 - Core invariants are not concurrency-safe
+### 1) High - Scalability - Single-node SQLite bottleneck

 Evidence:
- Owner bootstrap is check-then-insert (`Endpoints/AuthEndpoints.cs:45`, `Endpoints/AuthEndpoints.cs:70`, `Endpoints/AuthEndpoints.cs:71`) without a DB-level uniqueness guard for owner role (`Data/AppDbContext.cs:25`).
- Suggestion cap (<=5) is count-then-insert (`Endpoints/SuggestionWorkflowService.cs:63`, `Endpoints/SuggestionWorkflowService.cs:64`, `Endpoints/SuggestionWorkflowService.cs:82`, `Endpoints/SuggestionWorkflowService.cs:92`).
- Vote upsert is read-then-insert/update (`Endpoints/VoteWorkflowService.cs:70`, `Endpoints/VoteWorkflowService.cs:79`, `Endpoints/VoteWorkflowService.cs:92`) and can race into unique-key conflicts.
+- SQLite is the primary DB (`Program.cs:42`).

-Impact:
- Under concurrent requests, business rules can be violated (multiple owners, over-limit suggestions) or requests can fail with server errors.
+Risk:
+- SQLite is excellent for small single-node deployments, but write concurrency and horizontal scale are limited for larger or bursty usage.

-Recommendation:
- Enforce invariants in the DB and handle race paths explicitly (transaction isolation, constrained indexes, retry/409 handling on expected conflicts).
+Alternative:
+- Keep SQLite for local/dev and migrate production to PostgreSQL/SQL Server with provider-specific migrations and connection pooling.

-### P0 - Business layer is tightly coupled to HTTP transport
+### 2) High - Scalability - Polling causes read amplification

 Evidence:
- Workflow services return `Task<IResult>` directly across the board (`Endpoints/SuggestionWorkflowService.cs:10`, `Endpoints/VoteWorkflowService.cs:10`, `Endpoints/ResultsWorkflowService.cs:10`, `Endpoints/StateWorkflowService.cs:10`, `Endpoints/AdminWorkflowService.cs:11`).
+- Frontend polling runs continuously with 3s-20s cadence (`wwwroot/app.js:30`, `wwwroot/app.js:58`).
+- Each refresh can hit multiple endpoints (`wwwroot/js/data.js:21`, `wwwroot/js/data.js:109`).
+- `/api/state` also executes multiple aggregate counts each time (`Endpoints/StateWorkflowService.cs:14`).

-Impact:
- Domain logic cannot be reused cleanly by other apps/services without carrying ASP.NET HTTP result types everywhere.
- Harder to unit test business logic in isolation.
+Risk:
+- As concurrent users increase, backend read load grows quickly and mostly serves unchanged data.

-Recommendation:
- Split into application services (domain result objects/errors) + thin endpoint adapters (HTTP mapping only).
+Alternative:
+- Move to event-driven updates (SSE/WebSocket) plus conditional GET (`ETag`/`If-None-Match`) and/or a consolidated bootstrap endpoint.

-### P1 - Startup/runtime side effects are risky for production scale-out
+### 3) High - Security - CSRF protection is implicit, not explicit

 Evidence:
- Automatic schema migration on app startup (`Program.cs:161`).
- Runtime mutation of static frontend file for base path (`Program.cs:149`, `Program.cs:277`), with silent catch (`Program.cs:313`).
+- Cookie authentication is used for API auth (`Program.cs:100`, `Program.cs:104`).
+- Many state-changing endpoints rely on cookie auth (`Endpoints/SuggestEndpoints.cs:24`, `Endpoints/VoteEndpoints.cs:24`, `Endpoints/AdminEndpoints.cs:14`).
+- No antiforgery middleware/tokens are configured in startup.

-Impact:
- Multi-instance startup races and operational fragility.
- Read-only deployments or immutable artifacts can fail subtly.
+Risk:
+- `SameSite=Strict` helps (`Program.cs:104`) but is not a full long-term CSRF strategy across browser/proxy edge cases.

-Recommendation:
- Move migrations to explicit deployment step; remove runtime file rewrite and make base path purely configuration-driven at build/deploy time.
+Alternative:
+- Add explicit anti-forgery tokens for mutating requests (or move to bearer tokens for API calls) and verify origin headers server-side.

-### P1 - Null-safety gaps in auth input validation can produce 500s
+### 4) High - Extensibility - Workflow is hard-coded across backend and frontend

 Evidence:
- Direct `.Trim()` on potentially null request values (`Endpoints/AuthValidator.cs:15`, `Endpoints/AuthValidator.cs:71`, `Endpoints/AuthEndpoints.cs:26`, `Endpoints/AuthEndpoints.cs:90`).
+- Phase behavior is encoded via enum and many explicit checks/switches (`Domain/Phase.cs:3`, `Endpoints/StateWorkflowService.cs:70`, `Endpoints/EndpointHelpers.cs:97`).
+- Frontend also hard-codes phase logic in many files (`wwwroot/app.js:99`, `wwwroot/js/data.js:47`, `wwwroot/js/votes-ui.js:167`).

-Impact:
- Malformed JSON payloads can bypass intended 400 responses and trigger 500 errors.
+Risk:
+- Adding a new phase or changing transitions requires touching many scattered branches, increasing regression risk.

-Recommendation:
- Treat inbound fields as nullable at boundaries, normalize safely, and fail closed with typed validation errors.
+Alternative:
+- Introduce a shared workflow/state-machine model (transition table) and consume it in both backend and frontend.

-### P1 - Deployment automation is environment-specific and hard-coded
+### 5) High - Extensibility - Role model is fixed to booleans

 Evidence:
- Script contains fixed host/user/path/tooling details (`scripts/deploy-ftp.ps1:2`, `scripts/deploy-ftp.ps1:3`, `scripts/deploy-ftp.ps1:11`, `scripts/deploy-ftp.ps1:15`).
- Uses direct FTP credential in command string (`scripts/deploy-ftp.ps1:102`).
+- Role/state flags are booleans on player (`Domain/Player.cs:22`, `Domain/Player.cs:23`).
+- Admin checks are tightly coupled to that model (`Infrastructure/AdminOnlyFilter.cs:13`, `Endpoints/AdminWorkflowService.cs:93`).

-Impact:
- Not portable, hard to onboard new maintainers/environments, and higher operational/security risk.
+Risk:
+- Future needs (moderator, read-only admin, per-feature permissions) require schema and logic rewrites instead of additive changes.

-Recommendation:
- Externalize all environment values into secure config; provide a generic deploy profile template per environment.
+Alternative:
+- Move to role/permission tables (or claims-based capability model) and policy-based authorization.

-### P1 - Coverage policy and reality are misaligned
+### 6) Medium - Maintainability - Frontend is string-template heavy with global mutable state

 Evidence:
- Documentation claims target of 100% line/branch (`TESTS.md:96`).
- Actual measured coverage is lower (`GameList.Tests/TestResults/*/coverage.cobertura.xml:2`).
- CI/local scripts run tests but do not enforce coverage thresholds (`.github/workflows/ci.yml:43`, `scripts/ci-local.ps1:56`, `scripts/ci-local.ps1:59`).
+- Single global state object (`wwwroot/js/state.js:1`).
+- Heavy `innerHTML` rendering across modules (`wwwroot/js/suggestions-ui.js:115`, `wwwroot/js/votes-ui.js:35`, `wwwroot/js/results-ui.js:72`).

-Impact:
- False confidence in quality gates and unclear definition of done.
+Risk:
+- Harder refactoring, weaker static guarantees, and easy XSS regressions when new contributors add templates.

-Recommendation:
- Decide real threshold policy, enforce it in CI, and keep docs aligned with measured truth.
+Alternative:
+- Incrementally move to TypeScript + componentized rendering (or at minimum typed JSDoc + stricter lint rules + centralized safe render helpers).

-### P2 - Frontend refresh strategy is chatty and DB-heavy
+### 7) Medium - Scalability/Security - In-memory dictionaries are unbounded

 Evidence:
- Global polling every 4s (`wwwroot/app.js:27`, `wwwroot/app.js:49`).
- Each cycle can trigger multiple API reads (`wwwroot/js/data.js:6`, `wwwroot/js/data.js:93`, `wwwroot/js/data.js:101`).
+- Auth attempt monitor stores failures in unbounded `ConcurrentDictionary` (`Infrastructure/AuthAttemptMonitor.cs:14`).
+- Image reachability cache is a static dictionary without size limits (`Endpoints/SuggestionValidator.cs:7`, `Endpoints/SuggestionValidator.cs:42`).

-Impact:
- Scales poorly with player count/sessions; unnecessary backend load.
+Risk:
+- High-cardinality traffic can grow memory and become a denial-of-service vector.

-Recommendation:
- Move to event-driven or adaptive refresh (backoff, ETag/delta, push where needed).
+Alternative:
+- Replace with bounded `MemoryCache` (size limits + eviction) or distributed cache (Redis) with TTL and cardinality controls.

-### P2 - Frontend DOM drift and invalid markup
+### 8) Medium - Scalability - Linking/results workflows load full sets into memory

 Evidence:
- Missing closing `</div>` for status bar around section boundary (`wwwroot/index.html:83`, `wwwroot/index.html:102`).
- JS references elements not present in HTML: `all-suggestions`, `nav-vote-next` (`wwwroot/js/suggestions-ui.js:52`, `wwwroot/js/votes-ui.js:264`).
+- Link/unlink loads all suggestions then computes roots in memory (`Endpoints/AdminWorkflowService.cs:136`, `Endpoints/AdminWorkflowService.cs:188`).
+- Results projection includes full vote lists per suggestion (`Endpoints/ResultsWorkflowService.cs:23`, `Endpoints/ResultsWorkflowService.cs:34`, `Endpoints/ResultsWorkflowService.cs:50`).

-Impact:
- Browser autocorrection masks real bugs, increases regression risk, and confuses future contributors.
+Risk:
+- Memory and query cost rise non-linearly with larger datasets.

-Recommendation:
- Reconcile HTML/JS contracts and remove dead UI paths.
+Alternative:
+- Introduce persisted link-group IDs and push aggregation to SQL; add pagination/windowing for large result sets.

-### P2 - Login flow can fail silently for non-auth errors
+### 9) Medium - Security - Crypto is good but not state-of-the-art

 Evidence:
- In login catch block, non-401/non-auth-handler errors are swallowed without user feedback (`wwwroot/js/app-auth-handlers.js:113`, `wwwroot/js/app-auth-handlers.js:116`).
+- Password hashing uses PBKDF2-SHA256 (`Infrastructure/PasswordHasher.cs:33`) with fixed iteration count (`Infrastructure/PasswordHasher.cs:10`).

-Impact:
- Poor UX and harder support/debugging.
+Risk:
+- PBKDF2 remains acceptable, but modern guidance favors memory-hard KDFs (Argon2id/scrypt) against GPU/ASIC attacks.

-Recommendation:
- Add explicit fallback toast/logging for unexpected errors in login flow.
+Alternative:
+- Add versioned password hashes and migrate to Argon2id on login/re-hash.

-### P2 - Formatting checks do not cover the whole frontend codebase
+### 10) Medium - Security - CSP remains permissive for inline style and mixed image origins

 Evidence:
- ESLint scans all JS (`package.json:6`), but Prettier scripts only include a subset of files (`package.json:7`, `package.json:8`).
+- CSP allows `style-src 'unsafe-inline'` and `img-src ... https: http:` (`Program.cs:138`).

-Impact:
- Style drift and inconsistent diffs across untouched files.
+Risk:
+- Wider policy surface than needed, especially for long-term hardening.

-Recommendation:
- Align formatting scope with lint scope.
+Alternative:
+- Remove inline style dependence (e.g., CSS classes or nonce/hash), and restrict image sources to `https` and/or a media proxy allowlist.

-### P2 - Suggestion validation does synchronous external network checks on write paths
+### 11) Medium - Maintainability/Extensibility - API contract sync is manual

 Evidence:
- Suggestion validation calls image reachability check (`Endpoints/SuggestionValidator.cs:13`).
- Reachability check performs outbound call with timeout per request (`Endpoints/EndpointHelpers.cs:175`, `Endpoints/EndpointHelpers.cs:194`, `Endpoints/EndpointHelpers.cs:197`).
+- Frontend endpoints are hard-coded in JS (`wwwroot/js/api.js:37`, `wwwroot/js/api.js:70`).
+- Contract is also maintained manually in markdown (`API.md`).

-Impact:
- Slower writes and external dependency coupling even for basic edit operations.
+Risk:
+- Contract drift between backend DTOs and frontend consumers over time.

-Recommendation:
- Make this validation less blocking (cache results, only revalidate when URL changes, or async verification workflow).
+Alternative:
+- Generate OpenAPI from endpoints and produce typed client code for frontend consumption.

-## Positive foundations worth keeping
+### 12) Medium - Scalability/Resilience - External image validation is synchronous on write path

- Security baseline is present: cookie hardening + rate limiting + security headers (`Program.cs:100`, `Program.cs:104`, `Program.cs:121`, `Program.cs:137`).
- Global exception handling and health endpoint exist (`Infrastructure/PlayerIdentityExtensions.cs:40`, `Infrastructure/PlayerIdentityExtensions.cs:68`).
- Broad integration test suite exists and is currently green (104 tests).
+Evidence:
+- Suggestion validation calls network reachability checks (`Endpoints/SuggestionValidator.cs:40`).
+- Validation does outbound DNS/connect/HEAD/GET in request path (`Endpoints/EndpointHelpers.cs:196`, `Endpoints/EndpointHelpers.cs:240`, `Endpoints/EndpointHelpers.cs:270`).

-## Suggested initial remediation order
+Risk:
+- User write latency is coupled to third-party host responsiveness.

-1. Concurrency/invariant hardening for owner bootstrap, suggestion limits, and vote upsert.
-2. Service boundary refactor (`IResult` decoupling) to make logic reusable for future apps.
-3. Startup/deployment hardening (migration strategy + remove runtime file rewrite + parameterized deploy script).
-4. Coverage policy enforcement and documentation correction.
-5. Frontend cleanup pass (invalid markup, dead selectors, polling strategy, error handling).
+Alternative:
+- Accept URL quickly, then validate asynchronously (background job + status flag), optionally with trusted media proxying.
+
+## Recommended remediation order
+
+1. Reduce read amplification: event-driven updates + state endpoint optimization.
+2. Harden security baseline: explicit CSRF, CSP tightening, and password-hash migration plan.
+3. Decouple growth hotspots: workflow state machine + role/permission model.
+4. Improve operational scale: replace unbounded in-memory structures and large in-memory link/result operations.
+5. Introduce contract tooling: OpenAPI + generated frontend client to reduce drift.