# Geo-Distributed EC2 Server Setup with Client Locking and Token Management
## What

We are designing a system where clients are routed to regional servers to reduce latency, while maintaining session consistency during multi-step API calls.

Key components:
- Regional servers in different geographies
- Central discovery server tracking server heartbeats and load
- JWT tokens for authentication:
  - Primary token expiry: 1–5 minutes
  - Refresh token expiry: 7–30 days
- Client-server locking: ensures multi-step requests stay on the same server
- Load balancing and failover via Route 53 geoproximity routing with health checks

## Why

- Simple geoproximity DNS (Route 53) is insufficient for multi-step API workflows.
- Multi-step POST requests can fail if a client jumps servers due to geo routing.
- Even with active-active database replication, there is latency in the replication process; when clients hit different servers too frequently, there is a high chance a server does not yet have the latest data required.
- This data inconsistency leads to request failures, because the server processing a request may be working with stale data.
- We need to lock clients to a server during critical operations.
- We need the flexibility to load balance or move clients across regions safely when no critical task is in progress.
- Health checks in Route 53 ensure traffic isn't routed to servers that are down.

## How

### 1. Central Discovery Server
- Each regional server sends a heartbeat with its unique ID, region code, and public URL.
- Optionally collect telemetry/load data from each server (either directly from regional servers or via a central telemetry system).
- The discovery server maintains the active server list, their public URLs, and load information.

### 2. DNS Setup
- Each regional server gets its own URL.
- The central URL uses Route 53 geoproximity routing with health checks to route clients to the nearest healthy server.
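The discovery server's core job in step 1 is a heartbeat-driven registry. A minimal sketch is below; the class name, field names, and the 15-second timeout are illustrative assumptions, not part of the design above:

```python
import time

# Assumed value: seconds without a heartbeat before a server is considered down.
HEARTBEAT_TIMEOUT = 15

class DiscoveryRegistry:
    """In-memory registry of regional servers, keyed by server ID (sketch only)."""

    def __init__(self):
        # server_id -> {"region": ..., "url": ..., "load": ..., "last_seen": ...}
        self.servers = {}

    def heartbeat(self, server_id, region, url, load=0.0, now=None):
        """Record a heartbeat from a regional server, updating its load and timestamp."""
        now = time.time() if now is None else now
        self.servers[server_id] = {"region": region, "url": url,
                                   "load": load, "last_seen": now}

    def active_servers(self, now=None):
        """Return only servers whose last heartbeat is within the timeout window."""
        now = time.time() if now is None else now
        return {sid: info for sid, info in self.servers.items()
                if now - info["last_seen"] <= HEARTBEAT_TIMEOUT}
```

Keeping `active_servers` as a filter over `last_seen` (rather than deleting entries eagerly) means a server that resumes heartbeating reappears automatically.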
### 3. Client Login & Locking
- The client hits the central geoproximity URL; the login request is routed to the nearest server.
- The server returns:
  - a JWT token
  - its own server URL → this marks the client lock
- All further requests from this client use the locked server URL.

### 4. Server Discovery Sync
- Regional servers periodically pull the active server list (with public URLs) and load information from the discovery server (load data can originate from a central telemetry system or directly from servers).
- This enables load balancing within regions and global awareness.

### 5. Refresh Token API & Closest Server
- Before sending a refresh token request, the client calls /closestToMe on the central geo URL, which returns the closest server identifier.
- The refresh request payload includes:
  - closestServerId
  - criticalTaskInProgress (boolean)

### 6. Refresh Token Handling
- If criticalTaskInProgress = true:
  - Do not switch servers.
  - Refresh the token and maintain the lock with the current server.
- If criticalTaskInProgress = false, check the closest server and its region:
  - Same region → pick the server with the lowest load and update the token with the new server URL.
  - Different region → switch the client to that server and update the token.
- This ensures safe cross-region movement while protecting active tasks.

### 7. Load Balancing
- Regional servers use the closest server identifier plus server load to redistribute clients.
- This maintains even load distribution while keeping active sessions safe.

### 8. Frontend Considerations
- Detect whether the primary server URL changed in the token response.
- Show a user-friendly message: "Your primary server has changed. Any missing data will be synced within 5 minutes."
- This ensures users are aware but do not panic over temporary replication delays.
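The refresh-token decision from step 6 can be sketched as a pure function. The function and field names below are illustrative assumptions; `active_servers` is a dict of server IDs to their region/URL/load, as maintained by the discovery server:

```python
def choose_server_on_refresh(current, closest, critical_task_in_progress, active_servers):
    """
    Decide which server a client should be locked to after a token refresh.

    current / closest: server IDs.
    active_servers: server_id -> {"region": ..., "url": ..., "load": ...}.
    Returns the server ID the refreshed token should lock the client to.
    """
    # Never move a client in the middle of a critical multi-step operation.
    if critical_task_in_progress:
        return current

    current_region = active_servers[current]["region"]
    closest_region = active_servers[closest]["region"]

    if closest_region == current_region:
        # Same region: rebalance to the least-loaded server in that region.
        in_region = [sid for sid, s in active_servers.items()
                     if s["region"] == current_region]
        return min(in_region, key=lambda sid: active_servers[sid]["load"])

    # Different region: switch the client to the closest server.
    return closest
```

Because the decision depends only on its inputs, the server handling the refresh can apply it statelessly using the registry it pulled in step 4.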
### 9. Handling Server Failures
- If a server goes down, the client will receive a 500-series error.
- The client should wait 30 seconds with a timer: "Reconnecting…".
- During this time, the discovery server confirms the server has stopped sending heartbeats and updates its registry of available servers and their public URLs.
- After the wait, the client attempts a refresh token request again.
- The request will now hit the closest healthy server.
- The refresh token response will include the new server URL.

## Thoughts / Caveats
- The client lock is critical for multi-step operations.
- The discovery server is the single source of truth for server status, public URLs, and load (whether collected directly or via a central telemetry system).
- The token expiry strategy (short-lived JWT, long-lived refresh token) balances security vs. availability.
- Cross-region movement and load balancing happen only when safe (no critical tasks in progress).
- Frontend intelligence improves the user experience during server switches.
- Route 53 health checks ensure no traffic is sent to unhealthy servers.
- Automatic refresh/reconnect handles server failures without breaking client workflows.
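As a closing sketch, the client-side reconnect flow from step 9 might look like the following. The function name and response shape are assumptions; only the 30-second wait comes from the design above:

```python
import time

# From step 9: seconds the client waits while showing "Reconnecting…".
RECONNECT_WAIT = 30

def reconnect_after_failure(refresh_token_request, wait=RECONNECT_WAIT, sleep=time.sleep):
    """
    Handle a 500-series error from the locked server: wait, then retry the
    refresh-token request, which will now be routed (via the central
    geoproximity URL) to the closest healthy server.

    refresh_token_request: callable returning {"token": ..., "server_url": ...}
    (hypothetical response shape). Returns the new server URL to lock to.
    """
    sleep(wait)  # the frontend shows the "Reconnecting…" timer during this window
    response = refresh_token_request()
    # Lock all further requests to the server URL from the refresh response.
    return response["server_url"]
```

Injecting `sleep` as a parameter keeps the sketch testable; a real client would also cap retries and surface a hard error if no healthy server responds.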