Update architecture risk review

This commit is contained in:
2026-02-18 19:24:36 +01:00
parent 06ae85f427
commit 608c5368b3

196
REVIEW.md
View File

@@ -1,163 +1,165 @@
# Review - Pick'n'Play
# Review - Pick'n'Play (State-of-the-Art Assessment)
Date: 2026-02-08
Date: 2026-02-18
## Scope and baseline
## Scope
- Reviewed backend, frontend, data model, tests, CI, deployment scripts, and documentation.
- Ran local CI checks via `scripts/ci-local.ps1 -SkipNpmInstall` (pass: lint, format, build, tests).
- Ran coverage collection (`dotnet test --collect:"XPlat Code Coverage"`).
- Latest collected coverage: line-rate `92.75%`, branch-rate `74.88%` (`GameList.Tests/TestResults/*/coverage.cobertura.xml:2`).
- Evaluated backend (`Program.cs`, `Endpoints/*`, `Infrastructure/*`, `Data/*`), frontend (`wwwroot/*`), and CI/deployment scripts.
- Focused on risks in maintainability, extensibility, scalability, and security.
## Executive summary
The app is in a solid functional state for a small group and has good safety basics (auth, rate limits, tests, security headers).
However, several design and reliability issues will slow down extension work and make this repo hard to use as a reusable foundation for new apps unless addressed first.
The app is solid for a small private group, but it still has several architectural bottlenecks compared to current best practice for long-term product growth. The largest risks are read amplification from client polling, hard-coded workflow/permission modeling, and security hardening gaps (CSRF posture and crypto modernization).
## Findings (highest priority first)
## Findings
### P0 - Core invariants are not concurrency-safe
### 1) High - Scalability - Single-node SQLite bottleneck
Evidence:
- Owner bootstrap is check-then-insert (`Endpoints/AuthEndpoints.cs:45`, `Endpoints/AuthEndpoints.cs:70`, `Endpoints/AuthEndpoints.cs:71`) without a DB-level uniqueness guard for owner role (`Data/AppDbContext.cs:25`).
- Suggestion cap (<=5) is count-then-insert (`Endpoints/SuggestionWorkflowService.cs:63`, `Endpoints/SuggestionWorkflowService.cs:64`, `Endpoints/SuggestionWorkflowService.cs:82`, `Endpoints/SuggestionWorkflowService.cs:92`).
- Vote upsert is read-then-insert/update (`Endpoints/VoteWorkflowService.cs:70`, `Endpoints/VoteWorkflowService.cs:79`, `Endpoints/VoteWorkflowService.cs:92`) and can race into unique-key conflicts.
- SQLite is the primary DB (`Program.cs:42`).
Impact:
- Under concurrent requests, business rules can be violated (multiple owners, over-limit suggestions) or requests can fail with server errors.
Risk:
- SQLite is excellent for small single-node deployments, but write concurrency and horizontal scale are limited for larger or bursty usage.
Recommendation:
- Enforce invariants in the DB and handle race paths explicitly (transaction isolation, constrained indexes, retry/409 handling on expected conflicts).
Alternative:
- Keep SQLite for local/dev and migrate production to PostgreSQL/SQL Server with provider-specific migrations and connection pooling.
### P0 - Business layer is tightly coupled to HTTP transport
### 2) High - Scalability - Polling causes read amplification
Evidence:
- Workflow services return `Task<IResult>` directly across the board (`Endpoints/SuggestionWorkflowService.cs:10`, `Endpoints/VoteWorkflowService.cs:10`, `Endpoints/ResultsWorkflowService.cs:10`, `Endpoints/StateWorkflowService.cs:10`, `Endpoints/AdminWorkflowService.cs:11`).
- Frontend polling runs continuously with 3s-20s cadence (`wwwroot/app.js:30`, `wwwroot/app.js:58`).
- Each refresh can hit multiple endpoints (`wwwroot/js/data.js:21`, `wwwroot/js/data.js:109`).
- `/api/state` also executes multiple aggregate counts each time (`Endpoints/StateWorkflowService.cs:14`).
Impact:
- Domain logic cannot be reused cleanly by other apps/services without carrying ASP.NET HTTP result types everywhere.
- Harder to unit test business logic in isolation.
Risk:
- As concurrent users increase, backend read load grows quickly and mostly serves unchanged data.
Recommendation:
- Split into application services (domain result objects/errors) + thin endpoint adapters (HTTP mapping only).
Alternative:
- Move to event-driven updates (SSE/WebSocket) plus conditional GET (`ETag`/`If-None-Match`) and/or a consolidated bootstrap endpoint.
### P1 - Startup/runtime side effects are risky for production scale-out
### 3) High - Security - CSRF protection is implicit, not explicit
Evidence:
- Automatic schema migration on app startup (`Program.cs:161`).
- Runtime mutation of static frontend file for base path (`Program.cs:149`, `Program.cs:277`), with silent catch (`Program.cs:313`).
- Cookie authentication is used for API auth (`Program.cs:100`, `Program.cs:104`).
- Many state-changing endpoints rely on cookie auth (`Endpoints/SuggestEndpoints.cs:24`, `Endpoints/VoteEndpoints.cs:24`, `Endpoints/AdminEndpoints.cs:14`).
- No antiforgery middleware/tokens are configured in startup.
Impact:
- Multi-instance startup races and operational fragility.
- Read-only deployments or immutable artifacts can fail subtly.
Risk:
- `SameSite=Strict` helps (`Program.cs:104`) but is not a full long-term CSRF strategy across browser/proxy edge cases.
Recommendation:
- Move migrations to explicit deployment step; remove runtime file rewrite and make base path purely configuration-driven at build/deploy time.
Alternative:
- Add explicit anti-forgery tokens for mutating requests (or move to bearer tokens for API calls) and verify origin headers server-side.
### P1 - Null-safety gaps in auth input validation can produce 500s
### 4) High - Extensibility - Workflow is hard-coded across backend and frontend
Evidence:
- Direct `.Trim()` on potentially null request values (`Endpoints/AuthValidator.cs:15`, `Endpoints/AuthValidator.cs:71`, `Endpoints/AuthEndpoints.cs:26`, `Endpoints/AuthEndpoints.cs:90`).
- Phase behavior is encoded via enum and many explicit checks/switches (`Domain/Phase.cs:3`, `Endpoints/StateWorkflowService.cs:70`, `Endpoints/EndpointHelpers.cs:97`).
- Frontend also hard-codes phase logic in many files (`wwwroot/app.js:99`, `wwwroot/js/data.js:47`, `wwwroot/js/votes-ui.js:167`).
Impact:
- Malformed JSON payloads can bypass intended 400 responses and trigger 500 errors.
Risk:
- Adding a new phase or changing transitions requires touching many scattered branches, increasing regression risk.
Recommendation:
- Treat inbound fields as nullable at boundaries, normalize safely, and fail closed with typed validation errors.
Alternative:
- Introduce a shared workflow/state-machine model (transition table) and consume it in both backend and frontend.
### P1 - Deployment automation is environment-specific and hard-coded
### 5) High - Extensibility - Role model is fixed to booleans
Evidence:
- Script contains fixed host/user/path/tooling details (`scripts/deploy-ftp.ps1:2`, `scripts/deploy-ftp.ps1:3`, `scripts/deploy-ftp.ps1:11`, `scripts/deploy-ftp.ps1:15`).
- Uses direct FTP credential in command string (`scripts/deploy-ftp.ps1:102`).
- Role/state flags are booleans on player (`Domain/Player.cs:22`, `Domain/Player.cs:23`).
- Admin checks are tightly coupled to that model (`Infrastructure/AdminOnlyFilter.cs:13`, `Endpoints/AdminWorkflowService.cs:93`).
Impact:
- Not portable, hard to onboard new maintainers/environments, and higher operational/security risk.
Risk:
- Future needs (moderator, read-only admin, per-feature permissions) require schema and logic rewrites instead of additive changes.
Recommendation:
- Externalize all environment values into secure config; provide a generic deploy profile template per environment.
Alternative:
- Move to role/permission tables (or claims-based capability model) and policy-based authorization.
### P1 - Coverage policy and reality are misaligned
### 6) Medium - Maintainability - Frontend is string-template heavy with global mutable state
Evidence:
- Documentation claims target of 100% line/branch (`TESTS.md:96`).
- Actual measured coverage is lower (`GameList.Tests/TestResults/*/coverage.cobertura.xml:2`).
- CI/local scripts run tests but do not enforce coverage thresholds (`.github/workflows/ci.yml:43`, `scripts/ci-local.ps1:56`, `scripts/ci-local.ps1:59`).
- Single global state object (`wwwroot/js/state.js:1`).
- Heavy `innerHTML` rendering across modules (`wwwroot/js/suggestions-ui.js:115`, `wwwroot/js/votes-ui.js:35`, `wwwroot/js/results-ui.js:72`).
Impact:
- False confidence in quality gates and unclear definition of done.
Risk:
- Harder refactoring, weaker static guarantees, and easy XSS regressions when new contributors add templates.
Recommendation:
- Decide real threshold policy, enforce it in CI, and keep docs aligned with measured truth.
Alternative:
- Incrementally move to TypeScript + componentized rendering (or at minimum typed JSDoc + stricter lint rules + centralized safe render helpers).
### P2 - Frontend refresh strategy is chatty and DB-heavy
### 7) Medium - Scalability/Security - In-memory dictionaries are unbounded
Evidence:
- Global polling every 4s (`wwwroot/app.js:27`, `wwwroot/app.js:49`).
- Each cycle can trigger multiple API reads (`wwwroot/js/data.js:6`, `wwwroot/js/data.js:93`, `wwwroot/js/data.js:101`).
- Auth attempt monitor stores failures in unbounded `ConcurrentDictionary` (`Infrastructure/AuthAttemptMonitor.cs:14`).
- Image reachability cache is a static dictionary without size limits (`Endpoints/SuggestionValidator.cs:7`, `Endpoints/SuggestionValidator.cs:42`).
Impact:
- Scales poorly with player count/sessions; unnecessary backend load.
Risk:
- High-cardinality traffic can grow memory and become a denial-of-service vector.
Recommendation:
- Move to event-driven or adaptive refresh (backoff, ETag/delta, push where needed).
Alternative:
- Replace with bounded `MemoryCache` (size limits + eviction) or distributed cache (Redis) with TTL and cardinality controls.
### P2 - Frontend DOM drift and invalid markup
### 8) Medium - Scalability - Linking/results workflows load full sets into memory
Evidence:
- Missing closing `</div>` for status bar around section boundary (`wwwroot/index.html:83`, `wwwroot/index.html:102`).
- JS references elements not present in HTML: `all-suggestions`, `nav-vote-next` (`wwwroot/js/suggestions-ui.js:52`, `wwwroot/js/votes-ui.js:264`).
- Link/unlink loads all suggestions then computes roots in memory (`Endpoints/AdminWorkflowService.cs:136`, `Endpoints/AdminWorkflowService.cs:188`).
- Results projection includes full vote lists per suggestion (`Endpoints/ResultsWorkflowService.cs:23`, `Endpoints/ResultsWorkflowService.cs:34`, `Endpoints/ResultsWorkflowService.cs:50`).
Impact:
- Browser autocorrection masks real bugs, increases regression risk, and confuses future contributors.
Risk:
- Memory and query cost rise non-linearly with larger datasets.
Recommendation:
- Reconcile HTML/JS contracts and remove dead UI paths.
Alternative:
- Introduce persisted link-group IDs and push aggregation to SQL; add pagination/windowing for large result sets.
### P2 - Login flow can fail silently for non-auth errors
### 9) Medium - Security - Crypto is good but not state-of-the-art
Evidence:
- In login catch block, non-401/non-auth-handler errors are swallowed without user feedback (`wwwroot/js/app-auth-handlers.js:113`, `wwwroot/js/app-auth-handlers.js:116`).
- Password hashing uses PBKDF2-SHA256 (`Infrastructure/PasswordHasher.cs:33`) with fixed iteration count (`Infrastructure/PasswordHasher.cs:10`).
Impact:
- Poor UX and harder support/debugging.
Risk:
- PBKDF2 remains acceptable, but modern guidance favors memory-hard KDFs (Argon2id/scrypt) against GPU/ASIC attacks.
Recommendation:
- Add explicit fallback toast/logging for unexpected errors in login flow.
Alternative:
- Add versioned password hashes and migrate to Argon2id on login/re-hash.
### P2 - Formatting checks do not cover the whole frontend codebase
### 10) Medium - Security - CSP remains permissive for inline style and mixed image origins
Evidence:
- ESLint scans all JS (`package.json:6`), but Prettier scripts only include a subset of files (`package.json:7`, `package.json:8`).
- CSP allows `style-src 'unsafe-inline'` and `img-src ... https: http:` (`Program.cs:138`).
Impact:
- Style drift and inconsistent diffs across untouched files.
Risk:
- Wider policy surface than needed, especially for long-term hardening.
Recommendation:
- Align formatting scope with lint scope.
Alternative:
- Remove inline style dependence (e.g., CSS classes or nonce/hash), and restrict image sources to `https` and/or a media proxy allowlist.
### P2 - Suggestion validation does synchronous external network checks on write paths
### 11) Medium - Maintainability/Extensibility - API contract sync is manual
Evidence:
- Suggestion validation calls image reachability check (`Endpoints/SuggestionValidator.cs:13`).
- Reachability check performs outbound call with timeout per request (`Endpoints/EndpointHelpers.cs:175`, `Endpoints/EndpointHelpers.cs:194`, `Endpoints/EndpointHelpers.cs:197`).
- Frontend endpoints are hard-coded in JS (`wwwroot/js/api.js:37`, `wwwroot/js/api.js:70`).
- Contract is also maintained manually in markdown (`API.md`).
Impact:
- Slower writes and external dependency coupling even for basic edit operations.
Risk:
- Contract drift between backend DTOs and frontend consumers over time.
Recommendation:
- Make this validation less blocking (cache results, only revalidate when URL changes, or async verification workflow).
Alternative:
- Generate OpenAPI from endpoints and produce typed client code for frontend consumption.
## Positive foundations worth keeping
### 12) Medium - Scalability/Resilience - External image validation is synchronous on write path
- Security baseline is present: cookie hardening + rate limiting + security headers (`Program.cs:100`, `Program.cs:104`, `Program.cs:121`, `Program.cs:137`).
- Global exception handling and health endpoint exist (`Infrastructure/PlayerIdentityExtensions.cs:40`, `Infrastructure/PlayerIdentityExtensions.cs:68`).
- Broad integration test suite exists and is currently green (104 tests).
Evidence:
- Suggestion validation calls network reachability checks (`Endpoints/SuggestionValidator.cs:40`).
- Validation does outbound DNS/connect/HEAD/GET in request path (`Endpoints/EndpointHelpers.cs:196`, `Endpoints/EndpointHelpers.cs:240`, `Endpoints/EndpointHelpers.cs:270`).
## Suggested initial remediation order
Risk:
- User write latency is coupled to third-party host responsiveness.
1. Concurrency/invariant hardening for owner bootstrap, suggestion limits, and vote upsert.
2. Service boundary refactor (`IResult` decoupling) to make logic reusable for future apps.
3. Startup/deployment hardening (migration strategy + remove runtime file rewrite + parameterized deploy script).
4. Coverage policy enforcement and documentation correction.
5. Frontend cleanup pass (invalid markup, dead selectors, polling strategy, error handling).
Alternative:
- Accept URL quickly, then validate asynchronously (background job + status flag), optionally with trusted media proxying.
## Recommended remediation order
1. Reduce read amplification: event-driven updates + state endpoint optimization.
2. Harden security baseline: explicit CSRF, CSP tightening, and password-hash migration plan.
3. Decouple growth hotspots: workflow state machine + role/permission model.
4. Improve operational scale: replace unbounded in-memory structures and large in-memory link/result operations.
5. Introduce contract tooling: OpenAPI + generated frontend client to reduce drift.