Nottermost
A Mattermost-inspired team chat platform built as a distributed system on AWS. The scope is intentionally minimal, but large enough to exercise real architecture, operations, and cost trade-offs end to end.
Table of contents
- Goals
- Non-functional requirements
- Architecture principles
- Core features
- Local development
- Application stack
- AWS platform map
- Data, search, and sharding
- Security
- Operational maturity
- Observability
- Infrastructure as code
- CI/CD
- Caching
- Production deployments
- Deployment strategies
- Incident handling
- Scaling under real traffic
- Monitoring in real environments
- Load and cost testing
- Design trade-offs
- Documentation backlog
- Changelog
Goals
- AWS hands-on: realistic environment (VPC, subnets, consistent resource tags, multiple managed services; not a single “lift and shift” box).
- Documentation: decisions, alternatives, and trade-offs (especially cost vs latency vs durability).
- Diagrams: networks (VPC/subnets), service boundaries, request/event flows.
- Engineering practice: mergeable, observable, reproducible (IaC, secrets, CI/CD patterns).
- No manual changes: avoid hand-made console/CLI changes; prefer repeatable automation via IaC + pipelines.
Hard constraint: everything that defines the environment should be Infrastructure as Code (IaC). Networking must use a VPC with subnets and consistent resource tags for this environment.
Non-functional requirements
These drive service choice, topology, and budget:
- Concurrent WebSockets: millions of connections (connection tiering, regional presence, back-pressure).
- Latency: global low-latency messaging (<100ms where feasible; region placement and edge/cache matter).
- Fan-out: 1 message → thousands of recipients (async pipelines, partitioning, hot-key handling).
- Durability & ordering: strict where required (durable queues, idempotency, explicit consistency per path).
- Cost: continuous optimization as a first-class requirement.
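The back-pressure requirement in the WebSocket bullet above can be illustrated with a bounded per-connection outbound buffer. This is a minimal sketch, not code from this repo: `BoundedOutbox`, `OverflowPolicy`, and the two overflow policies are hypothetical names chosen for illustration, and a real gateway would also need to decide when to disconnect a persistently slow client.

```typescript
// Result of trying to enqueue an outbound frame for one connection.
type OfferResult = "queued" | "dropped_oldest" | "rejected";
// Hypothetical overflow policies; the real policy is a product decision.
type OverflowPolicy = "drop_oldest" | "reject";

class BoundedOutbox<T> {
  private readonly items: T[] = [];
  private readonly capacity: number;
  private readonly policy: OverflowPolicy;

  constructor(capacity: number, policy: OverflowPolicy) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.policy = policy;
  }

  // Enqueue a frame; never grows beyond `capacity`, so a slow client
  // cannot consume unbounded memory on the server.
  offer(item: T): OfferResult {
    if (this.items.length < this.capacity) {
      this.items.push(item);
      return "queued";
    }
    if (this.policy === "reject") return "rejected";
    this.items.shift(); // discard the oldest frame to make room
    this.items.push(item);
    return "dropped_oldest";
  }

  // Remove and return up to `max` frames in FIFO order.
  drain(max: number): T[] {
    return this.items.splice(0, max);
  }

  get size(): number {
    return this.items.length;
  }
}
```

A "rejected" result is the signal the caller can use to shed load upstream, for example by pausing consumption from the fan-out queue for that connection.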
Architecture principles
- Distributed system: components run as separate deployable units with clear ownership of data and failure domains.
- Microservices: used for learning and separation of concerns; acknowledge that a well-factored monolith can be simpler at early scale (see Design trade-offs).
Core features
- One-to-one messaging
- Workspaces and teams
- Messaging: text, emoji, images, GIFs
- File upload and object storage integration
- Message history with pagination
- Search and filtering (OpenSearch-backed)
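Message history with pagination is often implemented with a cursor rather than an offset, so that new messages arriving during scrolling do not shift page boundaries. The sketch below assumes a hypothetical `Message` shape and an in-memory array standing in for the real store; the `createdAt:id` cursor format and the `pageHistory` function are illustrative, not the repo's API.

```typescript
// Hypothetical message shape; the real schema is not defined in this README.
interface Message {
  id: string;
  channelId: string;
  createdAt: number; // epoch milliseconds
  body: string;
}

interface Page {
  items: Message[];
  nextCursor: string | null; // null when there are no older messages
}

// Cursor format "createdAt:id" is an assumption for illustration.
function encodeCursor(m: Message): string {
  return `${m.createdAt}:${m.id}`;
}

function decodeCursor(cursor: string): { createdAt: number; id: string } {
  const sep = cursor.indexOf(":");
  return { createdAt: Number(cursor.slice(0, sep)), id: cursor.slice(sep + 1) };
}

// Returns up to `limit` messages strictly older than the cursor, newest first.
function pageHistory(all: Message[], channelId: string, limit: number, cursor?: string): Page {
  const sorted = all
    .filter((m) => m.channelId === channelId)
    // Newest first; ties on timestamp are broken by id so the order is total.
    .sort((a, b) => b.createdAt - a.createdAt || (a.id < b.id ? 1 : a.id > b.id ? -1 : 0));
  const after = cursor ? decodeCursor(cursor) : null;
  const older = after
    ? sorted.filter(
        (m) => m.createdAt < after.createdAt || (m.createdAt === after.createdAt && m.id < after.id),
      )
    : sorted;
  const items = older.slice(0, limit);
  const last = items[items.length - 1];
  const hasMore = older.length > limit;
  return { items, nextCursor: hasMore && last ? encodeCursor(last) : null };
}
```

In a real store the same idea would map to a keyed range query (sort key less than the cursor) instead of filtering in memory.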
Local development
Local testing is fully Dockerized (frontend + backend + Postgres + Redis).
Prereqs
Run
- Create a .env file from the example:
- Start everything: docker compose up --build
URLs
- Web: http://localhost:3000
- API: http://localhost:4000 (health check at /healthz)
Quick end-to-end test
- Create account A at /register
- Create a workspace
- In another browser/profile, create account B
- Back as A, open the workspace and add member by B’s email
- Click DM next to B, send a message, and verify it appears in real time
Notes / troubleshooting
- The API syncs the local DB schema automatically in development (dev-only convenience).
- If ports are busy, change WEB_PORT / API_PORT in .env.
- Reset everything (including volumes): docker compose down -v
Application stack
- Web: Next.js
- API / services: Node.js
AWS platform map
High-level mapping from product needs to AWS building blocks (exact boundaries evolve with implementation).
- Static web + edge: S3, CloudFront, WAF
- API edge: API Gateway (HTTP/WebSocket as appropriate)
- Compute: containers or Lambda for suitable workloads (auth hooks, async workers, small RPC; exact split TBD per service)
- Async work: SQS
- Pub/sub & fan-out: SNS (and/or streaming where ordering/scale demands it)
- Objects / attachments: S3
- Relational data: RDS (multi-region / read replicas where justified by read patterns and DR)
- High-throughput key-value: DynamoDB (optionally DAX for hot read paths)
- Search: OpenSearch
Supporting capabilities: Secrets Manager (or Parameter Store) for secrets, KMS for encryption, and Cognito + JWT for identity patterns where applicable.
Data, search, and sharding
- Message storage: prefer DynamoDB (with DAX if needed) for write-heavy, high-cardinality streams; complement with RDS where SQL fits (relational/cross-entity consistency).
- Channel/workspace metadata: typically RDS (or dedicated metadata store) with clear schema + migrations.
- Search: OpenSearch for full-text and filters over indexed projections.
- Sharding/partitioning: first-class topic (partition keys, hot channels, cross-shard queries); document patterns before scaling claims.
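One common mitigation for hot channels is write sharding: a calculated suffix is appended to the partition key so that writes for a single busy channel spread across several partitions, at the cost of reads having to query every shard and merge. The sketch below is an assumption-laden illustration: the `channelId#shard` key format, the function names, and the choice of an FNV-1a-style hash are hypothetical, not decisions recorded in this repo.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units: deterministic and
// dependency-free. Chosen for illustration only.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Partition key for a write: the shard is derived from the message id, so the
// same message always maps to the same shard (retries stay idempotent).
function shardedPartitionKey(channelId: string, messageId: string, shardCount: number): string {
  if (!Number.isInteger(shardCount) || shardCount <= 0) {
    throw new RangeError("shardCount must be a positive integer");
  }
  const shard = fnv1a(messageId) % shardCount;
  return `${channelId}#${shard}`;
}

// Keys a reader must query (and merge by timestamp) to see the whole channel.
function allShardKeys(channelId: string, shardCount: number): string[] {
  return Array.from({ length: shardCount }, (_, shard) => `${channelId}#${shard}`);
}
```

A shard count of 1 degenerates to an unsharded key, which is a reasonable default for ordinary channels; only channels measured to be hot would need more shards.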
Deep-dive docs to write: NoSQL vs SQL for messages, channel metadata model, and cost estimates at extreme scale (e.g. 100M users; see Load and cost testing).
Security
- Encryption in transit and at rest (KMS-managed keys where appropriate).
- Authentication: Cognito and/or JWT-based API auth, aligned with API Gateway authorizers.
- Rate limiting at the edge (API Gateway / WAF) and in application logic where abuse patterns differ.
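For the application-level part of rate limiting, a token bucket is one widely used technique: capacity bounds bursts and the refill rate bounds sustained throughput. The following is a minimal in-process sketch; `TokenBucket` and its parameters are hypothetical, and a multi-instance deployment would need shared state (for example in Redis) rather than per-process memory.

```typescript
// Token bucket: `capacity` bounds bursts, `refillPerSecond` bounds sustained rate.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request is allowed. The caller passes the clock so
  // the logic stays deterministic and testable.
  tryConsume(nowMs: number, cost: number = 1): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}
```

One bucket per user or per API key (keyed in a map) is a typical arrangement, with edge limits in API Gateway / WAF acting as the coarser outer layer.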
Operational maturity
- Reliability thinking: design for failure, explicitly define consistency/durability/ordering per path, and bake in idempotency/de-duplication where at-least-once delivery exists.
- Failure handling: timeouts, retries with backoff, circuit breaking/back-pressure, DLQs for async pipelines, and safe degradation when dependencies fail.
- Rollback strategies: fast revert is a requirement (see Deployment strategies).
- Logs / metrics / alerting: treat as product features for operators, not afterthoughts (see Observability and Monitoring in real environments).
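The "retries with backoff" item above can be sketched as exponential backoff with full jitter, where each delay is drawn uniformly below an exponentially growing ceiling. This is a sketch under stated assumptions: `RetryOptions`, `backoffDelayMs`, and `withRetry` are hypothetical helpers, and the operation being retried is assumed to be idempotent.

```typescript
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first
  baseDelayMs: number;
  maxDelayMs: number;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number; // injectable for tests; must return [0, 1)
}

// Full jitter: delay is uniform in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function backoffDelayMs(attempt: number, opts: RetryOptions): number {
  const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt));
  const random = opts.random ?? Math.random;
  return Math.floor(random() * ceiling);
}

// Retries `operation` until it succeeds or attempts are exhausted, then
// rethrows the last error so callers (or a DLQ handler) can act on it.
async function withRetry<T>(operation: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt; fail fast instead.
      if (attempt < opts.maxAttempts - 1) await sleep(backoffDelayMs(attempt, opts));
    }
  }
  throw lastError;
}
```

Retries only make sense together with a per-call timeout and a cap on total attempts; otherwise they amplify load on a dependency that is already failing.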
Observability
- Metrics: Prometheus-compatible collection + Grafana dashboards
- Logs: centralized logging (service + platform logs)
- Tracing: distributed tracing across microservices
Infrastructure as code
- Terraform: primary declarative IaC for AWS resources
- Ansible: configuration/bootstrapping/operational automation where imperative steps complement Terraform
CI/CD
- Jenkins as the CI/CD orchestrator (pipelines for build, test, security scans, and promoted deployments).
- Promotion mindset: changes flow through environments with automated checks; avoid manual production mutation.
Caching
Caching is required for cost and latency (CDN/edge, application caches, and managed cache layers where hot read paths justify them; exact services TBD by workload profiling).
Production deployments
Production-style deployments are treated as part of the architecture:
- Repeatability: IaC-defined infra + pipeline-driven deployments (minimize manual steps).
- Safety: staged rollout + health signals drive promotion decisions.
- Reversibility: every deploy must have a clear rollback path.
Deployment strategies
Production-style promotion patterns (implementation-specific):
- Rollback after failed deploys or bad metrics
- Canary releases for gradual traffic shift
- Blue/green for full cutover with fast revert
- Rolling deployments for incremental replacement where appropriate
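A canary rollout needs an explicit rule for when to promote, hold, or roll back. The sketch below compares a canary window against the baseline on error rate and p99 latency. All names (`WindowStats`, `CanaryPolicy`, `decideCanary`) and the threshold semantics are assumptions for illustration; real thresholds would come from the service's SLOs.

```typescript
interface WindowStats {
  requests: number;
  errors: number;
  p99LatencyMs: number;
}

type Decision = "promote" | "hold" | "rollback";

// Hypothetical policy knobs; values should be derived from SLOs.
interface CanaryPolicy {
  minRequests: number; // do not judge the canary on too little traffic
  maxErrorRateDelta: number; // allowed absolute increase in error rate
  maxLatencyRatio: number; // allowed canary/baseline p99 ratio
}

function errorRate(s: WindowStats): number {
  return s.requests === 0 ? 0 : s.errors / s.requests;
}

function decideCanary(baseline: WindowStats, canary: WindowStats, policy: CanaryPolicy): Decision {
  // Not enough signal yet: keep the current traffic split.
  if (canary.requests < policy.minRequests) return "hold";
  const errorDelta = errorRate(canary) - errorRate(baseline);
  const latencyRatio =
    baseline.p99LatencyMs === 0 ? 1 : canary.p99LatencyMs / baseline.p99LatencyMs;
  if (errorDelta > policy.maxErrorRateDelta || latencyRatio > policy.maxLatencyRatio) {
    return "rollback";
  }
  return "promote";
}
```

Comparing against a baseline measured over the same window, rather than a fixed threshold, keeps the decision robust to traffic that is noisy for reasons unrelated to the deploy.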
Incident handling
- Detection: alerts tied to SLO-style symptoms (latency, error rate, saturation) + business signals where relevant.
- Response: runbooks, severity levels, clear ownership, and a “stop-the-bleed” playbook (rollback, disable feature, shed load).
- Communication: incident timeline + status updates (internal/external as applicable).
- Learning loop: blameless postmortems with tracked follow-ups.
Scaling under real traffic
- WebSockets at scale: connection tiering, regional placement, and back-pressure to prevent cascading failure.
- Fan-out: async pipelines (SQS, SNS) with partitioning and hot-key mitigation.
- Data scaling: DynamoDB partition key design (and DAX for hot reads), RDS read replicas where justified, and clear sharding strategy.
- Search scaling: OpenSearch indexing projections and query patterns that avoid hotspots.
- Cost under load: continual cost optimization as traffic grows (compute choices, caching, retention, and right-sizing).
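For the fan-out path, one message addressed to thousands of recipients is usually split into bounded jobs before it is queued. SQS accepts at most 10 entries per SendMessageBatch call, so the sketch below uses 10 as the default batch size; `FanOutJob`, `chunk`, and `planFanOut` are hypothetical names, and the actual enqueue call is deliberately left out.

```typescript
// Split `items` into consecutive slices of at most `size` elements.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Hypothetical unit of work for a delivery worker.
interface FanOutJob {
  messageId: string;
  recipientIds: string[];
}

// One job per batch of recipients; each job can be retried independently,
// which is why delivery workers must tolerate duplicates.
function planFanOut(messageId: string, recipientIds: readonly string[], batchSize: number = 10): FanOutJob[] {
  return chunk(recipientIds, batchSize).map((batch) => ({ messageId, recipientIds: batch }));
}
```

Keeping the message id in every job lets workers de-duplicate on (message id, recipient id) when a batch is delivered more than once.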
Monitoring in real environments
- Dashboards: per-service golden signals (traffic, errors, latency, saturation) plus dependency health.
- Alerting: actionable alerts with thresholds tied to user impact; avoid noisy “everything is on fire” paging.
- Logs: structured logs with correlation IDs across services; include deploy/version metadata for fast rollback decisions.
- Tracing: end-to-end traces for critical paths and fan-out pipelines to find bottlenecks and failure points.
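Structured logs with correlation IDs and deploy metadata can be as simple as one JSON object per line. The sketch below assumes a hypothetical `x-correlation-id` header and hypothetical field names (`service`, `version`, `correlationId`); the repo may settle on different conventions.

```typescript
interface LogContext {
  service: string;
  version: string; // deploy/version metadata, useful for rollback decisions
  correlationId: string;
}

type Level = "debug" | "info" | "warn" | "error";

// Reuse the caller's correlation id if present, otherwise mint a new one.
// The header name is an assumption for illustration.
function correlationIdFrom(
  headers: Record<string, string | undefined>,
  generate: () => string,
): string {
  const incoming = headers["x-correlation-id"];
  return incoming && incoming.length > 0 ? incoming : generate();
}

// One JSON object per line. Context fields are spread last so that ad hoc
// fields cannot overwrite the service, version, or correlation id.
function formatLog(
  ctx: LogContext,
  level: Level,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({ ...fields, timestamp: now.toISOString(), level, message, ...ctx });
}
```

A service would call `correlationIdFrom` once per incoming request and pass the same id on every outbound call and queue message, so that logs from different services can be joined on it.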
Load and cost testing
- Stress / scale testing toward very large user counts (e.g. 100M users as a modeling exercise), including use of Spot capacity where appropriate for ephemeral test fleets.
- Cost optimization at scale: produce monthly cost estimates under stated assumptions (regions, message rates, attachment mix, retention); this is a standing documentation deliverable.
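The monthly cost estimate is easier to review when the assumptions are explicit inputs to a small model. The sketch below covers only message writes and storage; every field name is hypothetical, and the unit prices are parameters that must be filled in from current AWS pricing for the chosen region (no real prices are embedded here).

```typescript
interface WorkloadAssumptions {
  users: number;
  messagesPerUserPerDay: number;
  avgMessageBytes: number;
  attachmentFraction: number; // fraction of messages carrying an attachment
  avgAttachmentBytes: number;
  retentionMonths: number;
}

// Placeholders to be filled from current pricing pages; not real AWS prices.
interface UnitPrices {
  perMillionWrites: number;
  perGbMonth: number;
}

interface CostEstimate {
  messagesPerMonth: number;
  storedGb: number;
  writeCost: number;
  storageCost: number;
  total: number;
}

function estimateMonthlyCost(w: WorkloadAssumptions, p: UnitPrices): CostEstimate {
  // Assumes a 30-day month and one write per message.
  const messagesPerMonth = w.users * w.messagesPerUserPerDay * 30;
  const bytesPerMessage = w.avgMessageBytes + w.attachmentFraction * w.avgAttachmentBytes;
  // Steady state: `retentionMonths` worth of monthly volume is stored at once.
  const storedGb = (messagesPerMonth * bytesPerMessage * w.retentionMonths) / 1e9;
  const writeCost = (messagesPerMonth / 1e6) * p.perMillionWrites;
  const storageCost = storedGb * p.perGbMonth;
  return { messagesPerMonth, storedGb, writeCost, storageCost, total: writeCost + storageCost };
}
```

A fuller model would add reads, data transfer, WebSocket connection minutes, search indexing, and separate prices for message and attachment storage; the point of the sketch is only that each number in the estimate traces back to a stated assumption.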
Design trade-offs
- SQL vs NoSQL: Postgres locally (via Docker Compose) for fast iteration; in cloud: DynamoDB (+ DAX) for message-scale paths and RDS for relational aggregates, billing-adjacent data, and metadata that benefits from SQL constraints.
- Monolith vs microservices: monolith can be simpler/cheaper early; this repo uses microservices for practice and clearer boundaries (accept overhead consciously).
- WebSockets vs polling: WebSockets for real-time delivery; avoid polling for primary transport at scale.
- Delivery semantics: compare at-most-once vs at-least-once; for at-least-once, use idempotency keys + de-duplication for user-visible side effects.
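With at-least-once delivery, a consumer may see the same message twice, so user-visible side effects are usually guarded by an idempotency key. The sketch below uses an in-memory map as a stand-in for a shared store; `DedupStore` and `processOnce` are hypothetical, and a multi-worker deployment would need an atomic conditional write (for example in DynamoDB or Redis) instead.

```typescript
// In-memory stand-in for a shared de-duplication store (illustration only).
class DedupStore {
  private readonly expiryByKey = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Runs `handler` only if `key` has not been processed within the TTL.
  // Returns true if the handler ran, false if the delivery was a duplicate.
  processOnce(key: string, nowMs: number, handler: () => void): boolean {
    const expiry = this.expiryByKey.get(key);
    if (expiry !== undefined && expiry > nowMs) return false; // duplicate delivery
    handler(); // if this throws, the key stays unmarked so a retry can succeed
    this.expiryByKey.set(key, nowMs + this.ttlMs);
    return true;
  }
}
```

The TTL should exceed the longest window in which the queue can redeliver a message; marking the key only after the handler succeeds trades a possible duplicate on crash for never losing a side effect.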
Documentation backlog
Planned written artifacts (in addition to this README):
- Message storage: NoSQL vs SQL trade-offs for this workload.
- Channel metadata: schema, consistency, and indexing strategy.
- Cost model: monthly estimates for aggressive scale (e.g. 100M users) with explicit assumptions.
- Diagrams: VPC/subnets, service dependency graph, and critical request/notification flows.
Changelog
All notable changes are tracked in CHANGELOG.md.