MongoDB's flexibility makes it easy to start with, but scaling to billions of documents requires careful planning. Here are the lessons we learned scaling a MongoDB deployment from prototype to production at scale.
Schema Design Matters
MongoDB being schema-less does not mean you should not design your schema. Embedding vs referencing decisions have massive performance implications at scale.
- Embed when data is always accessed together and the embedded array does not grow unbounded
- Reference when data is accessed independently or the related collection is large
Indexing Strategy
- Every query should use an index -- full collection scans are death at scale
- Compound indexes should follow the ESR rule: Equality, Sort, Range
- Monitor index usage with db.collection.aggregate($indexStats) and remove unused indexes
- Consider TTL indexes for time-series data that should auto-expire
Sharding
Choose your shard key carefully -- it cannot be changed without rebuilding the cluster. A good shard key has high cardinality, evenly distributed writes, and supports your most common query patterns.