📝 Design Google Docs — System Design Interview Guide
Hard · Real-Time Collaboration
Design a collaborative document editor like Google Docs where multiple users can edit the same document simultaneously with real-time updates and conflict resolution.
Functional requirements
- Create, read, update, and delete documents
- Multiple users edit same document simultaneously
- Changes appear in real-time for all collaborators
- Persistent revision history (undo/redo)
- Sharing: view/edit/comment permissions per user
- Rich text formatting: bold, italic, headings, lists
Non-functional requirements & scale
- 1B documents; 10M concurrent active editors
- Change propagation latency < 100ms (P95)
- No data loss — every keystroke eventually persisted
- Conflict resolution must be correct — converge to same state
- Document load time < 1 second
- Offline editing with sync on reconnect
Capacity estimation
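Core challenge: concurrent edits from multiple users. If User A deletes the char at pos 5 while User B inserts at pos 6, replicas that apply the two ops in different orders without adjusting positions end up with different text, and naive last-write-wins simply discards one of the edits. Operational Transformation (OT) or CRDTs solve convergence. At Google scale, the 10M concurrent editors translate to 10M concurrent WebSocket (WS) connections.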
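To make the failure concrete, here is a minimal sketch of two replicas applying the same pair of concurrent ops in different orders with no transformation. The 0-based positions, the sample string, and the `apply` helper are assumptions for illustration only.

```python
def apply(doc: str, op: dict) -> str:
    """Apply a single-character insert or delete at a 0-based position."""
    pos = op["position"]
    if op["type"] == "insert":
        return doc[:pos] + op["content"] + doc[pos:]
    if op["type"] == "delete":
        return doc[:pos] + doc[pos + 1:]
    raise ValueError(f"unknown op type: {op['type']}")


base = "abcdefgh"
op_a = {"type": "delete", "position": 5}                  # User A removes "f"
op_b = {"type": "insert", "position": 6, "content": "X"}  # User B inserts before "g"

# Each replica applies its own op first, then the remote op unchanged.
replica_a = apply(apply(base, op_a), op_b)
replica_b = apply(apply(base, op_b), op_a)

print(replica_a)  # abcdegXh  (X landed after "g", not before it)
print(replica_b)  # abcdeXgh
```

The two replicas now hold different text, which is exactly the divergence OT or a CRDT has to prevent.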
Core entities
- Document — docId, ownerId, title, createdAt, currentRevision, sharedWith[]
- Operation — opId, docId, userId, type (insert|delete|format), position, content, revision, timestamp
- Revision — revisionId, docId, baseRevision, ops[], snapshot (full doc at checkpoint), createdAt
- Permission — docId, userId, role (owner|editor|commenter|viewer)
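The entities above could be modelled roughly as follows. This is a sketch, not a schema: the snake_case field names, the Python types, and the `can_edit` / `can_comment` helpers (including the assumption that editors may also comment) are illustrative choices not specified in the list.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class OpType(str, Enum):
    INSERT = "insert"
    DELETE = "delete"
    FORMAT = "format"


class Role(str, Enum):
    OWNER = "owner"
    EDITOR = "editor"
    COMMENTER = "commenter"
    VIEWER = "viewer"


@dataclass
class Document:
    doc_id: str
    owner_id: str
    title: str
    created_at: float
    current_revision: int = 0
    shared_with: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Operation:
    op_id: str
    doc_id: str
    user_id: str
    type: OpType
    position: int
    content: Optional[str]  # None for deletes
    revision: int
    timestamp: float


@dataclass
class Revision:
    revision_id: int
    doc_id: str
    base_revision: int
    ops: List[Operation]
    snapshot: Optional[str]  # full document text at a checkpoint, else None
    created_at: float


@dataclass(frozen=True)
class Permission:
    doc_id: str
    user_id: str
    role: Role


def can_edit(p: Permission) -> bool:
    # Assumption: only owners and editors may submit operations.
    return p.role in (Role.OWNER, Role.EDITOR)


def can_comment(p: Permission) -> bool:
    # Assumption: everyone except viewers may comment.
    return p.role is not Role.VIEWER
```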
API design
- GET /api/v1/docs/:docId : Load document. Returns latest snapshot + ops since snapshot.
- WS wss://docs.app/docs/:docId : Real-time collaboration channel for a document.
- POST /api/v1/docs/:docId/ops : Submit operation. Body: { type, position, content, baseRevision }.
- GET /api/v1/docs/:docId/revisions : List revision history for the document.
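As an illustration of the op-submission body, the sketch below checks a payload before it reaches the transformation step. The validation rules and the `validate_op_body` name are assumptions; the guide only specifies the four field names.

```python
from typing import Any, Dict, List

ALLOWED_TYPES = {"insert", "delete", "format"}


def _is_non_negative_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def validate_op_body(body: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the body is acceptable."""
    errors = []
    if body.get("type") not in ALLOWED_TYPES:
        errors.append("type must be insert, delete, or format")
    if not _is_non_negative_int(body.get("position")):
        errors.append("position must be a non-negative integer")
    if body.get("type") == "insert" and not body.get("content"):
        errors.append("insert requires non-empty content")
    if not _is_non_negative_int(body.get("baseRevision")):
        errors.append("baseRevision must be a non-negative integer")
    return errors


ok_body = {"type": "insert", "position": 5, "content": "x", "baseRevision": 42}
bad_body = {"type": "move", "position": -1, "baseRevision": "latest"}
print(validate_op_body(ok_body))   # []
print(len(validate_op_body(bad_body)))  # 3
```

The `baseRevision` field matters most here: it tells the server which ops in its log the client had not yet seen when it produced the edit.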
High-level design
Client connects via WebSocket to Doc Service. User types → generates Operation → sent to server → OT applied against concurrent ops → broadcast to all collaborators → persisted to DB. Periodic snapshots reduce replay time on load.
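The flow above (receive, transform, apply, persist, broadcast) can be sketched as a single in-memory session object. Everything here is a simplification under stated assumptions: `DocSession` and `InsertOp` are hypothetical names, only single-character inserts are handled, the log is a Python list rather than a database, subscribers are plain callbacks rather than WebSockets, and each client is assumed to wait for an acknowledgement before sending its next op.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class InsertOp:
    user_id: str
    position: int
    content: str  # a single character in this sketch


class DocSession:
    """In-memory stand-in for one document's state on a Doc Service node."""

    def __init__(self, text: str = "") -> None:
        self.text = text
        self.log: List[InsertOp] = []  # append-only; len(log) is the revision
        self.subscribers: List[Callable[[int, InsertOp], None]] = []

    def submit(self, op: InsertOp, base_revision: int) -> int:
        # 1. Transform against every op the client had not seen yet.
        for applied in self.log[base_revision:]:
            shifts = applied.position < op.position or (
                applied.position == op.position and applied.user_id < op.user_id
            )
            if shifts:
                op = InsertOp(op.user_id, op.position + 1, op.content)
        # 2. Apply to the authoritative copy.
        self.text = self.text[:op.position] + op.content + self.text[op.position:]
        # 3. Persist (here: append to the in-memory log).
        self.log.append(op)
        revision = len(self.log)
        # 4. Broadcast the transformed op to all collaborators.
        for notify in self.subscribers:
            notify(revision, op)
        return revision


session = DocSession("helo")
received = []
session.subscribers.append(lambda rev, op: received.append((rev, op.position, op.content)))

# Alice and Bob both edit revision 0 concurrently.
session.submit(InsertOp("alice", 3, "l"), base_revision=0)
session.submit(InsertOp("bob", 4, "!"), base_revision=0)
print(session.text)  # hello!
```

Bob intended to append at the end of "helo"; because Alice's insert was sequenced first, his position is shifted from 4 to 5 before it is applied and broadcast.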
Deep dives
🔀 Operational Transformation (OT)
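OT transforms concurrent operations against each other so that every replica converges to the same document, whatever order the operations arrive in. Example: A deletes the char at pos 3; B concurrently inserts "x" at pos 5. When B's op reaches the server after A's delete, the server transforms B's position to 4, because the deletion shifted everything after pos 3 one place left. Jupiter algorithm: the server serializes all ops into one total order; each client tracks the last server revision it has seen and its own unacknowledged local ops, and transforms incoming ops against that divergence.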
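A minimal sketch of a transform function for single-character inserts and deletes follows. It is not the Jupiter algorithm itself or any production implementation; the `Insert` / `Delete` classes, the `site` tie-breaker, and the use of `None` for a no-op are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional, Union


@dataclass(frozen=True)
class Insert:
    pos: int
    char: str
    site: str  # tie-breaker when two inserts target the same position


@dataclass(frozen=True)
class Delete:
    pos: int


Op = Union[Insert, Delete]


def apply(doc: str, op: Optional[Op]) -> str:
    if op is None:  # op was cancelled by the transform
        return doc
    if isinstance(op, Insert):
        return doc[:op.pos] + op.char + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op: Op, applied: Op) -> Optional[Op]:
    """Rewrite `op` so it can run after the concurrent op `applied`."""
    if isinstance(op, Insert) and isinstance(applied, Insert):
        before = op.pos < applied.pos or (op.pos == applied.pos and op.site < applied.site)
        return op if before else replace(op, pos=op.pos + 1)
    if isinstance(op, Insert) and isinstance(applied, Delete):
        return op if op.pos <= applied.pos else replace(op, pos=op.pos - 1)
    if isinstance(op, Delete) and isinstance(applied, Insert):
        return op if op.pos < applied.pos else replace(op, pos=op.pos + 1)
    # Delete vs Delete: both removed the same character, so nothing is left to do.
    if op.pos == applied.pos:
        return None
    return op if op.pos < applied.pos else replace(op, pos=op.pos - 1)


doc = "abcdefg"
a = Delete(3)            # A deletes the char at pos 3
b = Insert(5, "x", "B")  # B inserts "x" at pos 5

print(transform(b, a))   # Insert(pos=4, char='x', site='B')
via_a = apply(apply(doc, a), transform(b, a))
via_b = apply(apply(doc, b), transform(a, b))
print(via_a, via_b)      # abcexfg abcexfg
```

Both application orders reach the same text, which is the convergence property the transform exists to provide for a pair of concurrent ops.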
📦 CRDT Alternative
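Conflict-free Replicated Data Types (CRDTs) for text, such as LSEQ or Logoot, assign each character a unique, densely ordered position identifier (conceptually a fractional position). Characters are never moved. Tombstone-based designs such as WOOT or RGA keep a deleted character as a tombstone instead of removing it. Merge = union of all character sets, sorted by position. Advantage: no central server needed (P2P possible). Disadvantage: per-character metadata overhead, and in tombstone-based designs the tombstones grow without bound unless they are garbage-collected periodically.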
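The idea can be sketched with a toy sequence CRDT that uses exact fractions as positions and a tombstone set for deletes. This is not LSEQ or Logoot (both use variable-length identifiers), and the `Replica` class and its key layout are assumptions. One known limitation of this toy: if two sites concurrently pick the same fraction, a later insert between those two characters cannot be placed strictly between them.

```python
from fractions import Fraction
from typing import Dict, List, Set, Tuple

# (fractional position, site id, per-site counter) is unique per character.
Key = Tuple[Fraction, str, int]


class Replica:
    def __init__(self, site: str) -> None:
        self.site = site
        self.seq = 0
        self.chars: Dict[Key, str] = {}
        self.tombstones: Set[Key] = set()

    def _visible(self) -> List[Key]:
        return sorted(k for k in self.chars if k not in self.tombstones)

    def insert(self, index: int, char: str) -> Key:
        keys = self._visible()
        left = keys[index - 1][0] if index > 0 else Fraction(0)
        right = keys[index][0] if index < len(keys) else Fraction(1)
        self.seq += 1
        key = ((left + right) / 2, self.site, self.seq)
        self.chars[key] = char
        return key

    def delete(self, index: int) -> None:
        # The character stays in `chars`; it is only marked as deleted.
        self.tombstones.add(self._visible()[index])

    def merge(self, other: "Replica") -> None:
        # Union of characters and union of tombstones: order-independent.
        self.chars.update(other.chars)
        self.tombstones |= other.tombstones

    def text(self) -> str:
        return "".join(self.chars[k] for k in self._visible())


a = Replica("A")
for i, ch in enumerate("cat"):
    a.insert(i, ch)
b = Replica("B")
b.merge(a)

a.insert(1, "h")  # A: "chat"
b.delete(2)       # B: "ca"

a.merge(b)
b.merge(a)
print(a.text(), b.text())  # cha cha
```

No transform step and no central sequencer is involved: each replica merges whatever it receives, in any order, and the sorted union gives the same text. The tombstone set only ever grows here, which is the garbage-collection cost noted above.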
💾 Revision History
Store every operation in append-only log (Spanner/BigTable). On load: fetch latest snapshot + replay ops since snapshot. Create new snapshot every 1000 ops or 1 hour. Snapshot = full document state at that revision. Version comparison: diff between two snapshot revisions. Storage: each op ~200 bytes; 1M ops = 200MB per doc (large docs).
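The load path and the storage arithmetic above can be sketched as follows. The tuple-based op format and the function names are assumptions; the thresholds (1000 ops, 1 hour) and the ~200-byte op size come from the text. The log is assumed to hold ops in server order, already transformed, so replay is plain sequential application.

```python
from typing import List, Optional, Tuple

Op = Tuple[str, int, Optional[str]]  # (type, position, content)

SNAPSHOT_EVERY_OPS = 1000
SNAPSHOT_EVERY_SECONDS = 3600
OP_SIZE_BYTES = 200


def apply(text: str, op: Op) -> str:
    kind, pos, content = op
    if kind == "insert":
        return text[:pos] + (content or "") + text[pos:]
    if kind == "delete":
        return text[:pos] + text[pos + 1:]
    raise ValueError(f"unsupported op type: {kind}")


def load(snapshot_text: str, snapshot_revision: int, log: List[Op]) -> str:
    """Rebuild the current document: snapshot plus every op logged after it."""
    text = snapshot_text
    for op in log[snapshot_revision:]:
        text = apply(text, op)
    return text


def should_snapshot(ops_since: int, seconds_since: float) -> bool:
    return ops_since >= SNAPSHOT_EVERY_OPS or seconds_since >= SNAPSHOT_EVERY_SECONDS


def log_size_bytes(op_count: int) -> int:
    return op_count * OP_SIZE_BYTES


log: List[Op] = [("insert", 0, "h"), ("insert", 1, "i"), ("insert", 2, "!"), ("insert", 2, "?")]
print(load("", 0, log))    # hi?!  (full replay)
print(load("hi", 2, log))  # hi?!  (snapshot at revision 2, then 2 ops)
print(log_size_bytes(1_000_000) / 1_000_000, "MB")  # 200.0 MB
```

With a snapshot every 1000 ops, a load replays at most 1000 ops regardless of how long the document's history is.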
📴 Offline Editing
Client stores ops locally (IndexedDB). On reconnect: client sends all offline ops with their baseRevision. Server transforms offline ops against any ops that happened during offline period. Conflict: OT resolves automatically for text; for structural conflicts (table deleted then edited) → prompt user.
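Reconnect can be sketched as rebasing a queue of offline ops over the ops the server accepted in the meantime. The subtle part is that both sides must be transformed: each offline op is moved past the missed server ops, and the missed ops are moved past each offline op so the client can apply them locally. The sketch handles single-character inserts only; `rebase`, `Insert`, and the `site` tie-breaker are hypothetical names, and structural conflicts are out of scope.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Insert:
    pos: int
    char: str
    site: str


def apply_all(doc: str, ops: List[Insert]) -> str:
    for op in ops:
        doc = doc[:op.pos] + op.char + doc[op.pos:]
    return doc


def transform(op: Insert, applied: Insert) -> Insert:
    before = op.pos < applied.pos or (op.pos == applied.pos and op.site < applied.site)
    return op if before else replace(op, pos=op.pos + 1)


def rebase(offline: List[Insert], missed: List[Insert]) -> Tuple[List[Insert], List[Insert]]:
    """Return (offline ops for the server to apply, missed ops for the client to apply)."""
    rebased = []
    for op in offline:
        next_missed = []
        for server_op in missed:
            next_missed.append(transform(server_op, op))  # server op moved past client op
            op = transform(op, server_op)                 # client op moved past server op
        rebased.append(op)
        missed = next_missed
    return rebased, missed


base = "ab"  # document at the client's baseRevision
missed_ops = [Insert(0, "X", "S"), Insert(3, "Y", "S")]   # server side: "XabY"
offline_ops = [Insert(2, "1", "C"), Insert(3, "2", "C")]  # client side: "ab12"

for_server, for_client = rebase(offline_ops, missed_ops)
server_doc = apply_all(apply_all(base, missed_ops), for_server)
client_doc = apply_all(apply_all(base, offline_ops), for_client)
print(server_doc, client_doc)  # Xab12Y Xab12Y
```

The cost is proportional to (offline ops) × (missed ops), which is one reason long offline sessions on busy documents are expensive to reconcile with OT.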
Scaling considerations
- Shard documents across Doc Service instances — all editors of doc X go to same server (sticky routing)
- Google Spanner for global consistency of operation log across regions
- Redis Pub/Sub per docId channel for cross-server broadcast
- Snapshot service runs as separate background job, no impact on real-time path
- Full-text search index updated async; search results may lag 10-30s behind latest edits
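Sticky routing from the first bullet can be sketched with rendezvous (highest-random-weight) hashing, so that every gateway independently maps a docId to the same Doc Service instance. This is one possible technique, not something the list prescribes; the server names, the `route` function, and the channel naming scheme are assumptions.

```python
import hashlib
from typing import List


def _score(server: str, doc_id: str) -> int:
    # sha256 gives a hash that is stable across processes, unlike Python's hash().
    digest = hashlib.sha256(f"{server}:{doc_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def route(doc_id: str, servers: List[str]) -> str:
    """Pick the server with the highest score for this document."""
    return max(servers, key=lambda s: _score(s, doc_id))


def channel(doc_id: str) -> str:
    """Hypothetical pub/sub channel name for cross-server broadcast."""
    return f"doc:{doc_id}:ops"


servers = ["doc-svc-1", "doc-svc-2", "doc-svc-3"]
owner = route("doc-42", servers)
print(owner == route("doc-42", servers))  # True: every editor lands on the same node

# Removing a server that does not own the document leaves its routing unchanged.
other = next(s for s in servers if s != owner)
remaining = [s for s in servers if s != other]
print(route("doc-42", remaining) == owner)  # True
print(channel("doc-42"))  # doc:doc-42:ops
```

Only documents owned by a removed server are re-routed, so scaling the fleet up or down disturbs a small fraction of active sessions.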
What interviewers expect by level
- Junior: Describe basic doc CRUD, WebSocket for real-time. Understand why last-write-wins fails for concurrent edits.
- Mid: Explain OT conceptually, sticky sessions, op log + snapshot design.
- Senior: OT transformation algorithm, CRDT vs OT trade-offs, offline sync, 10M concurrent WS connections.
- Staff: Global consistency (Spanner), P2P collaboration (WebRTC + CRDTs), cost at Google scale, regulatory (GDPR doc deletion).