Map-Reduce and Sharded Collections
Map-reduce supports operations on sharded collections, both as an input and as an output. This section describes the behaviors of mapReduce specific to sharded collections.
Sharded Collection as Output
If the out field for mapReduce sets sharded to true, MongoDB shards the output collection using the _id field as the shard key.
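As a minimal sketch of such a call in the mongo shell, assuming a server version that accepts the sharded option: the collection names orders and order_totals and the fields custId and amount are hypothetical, chosen only for illustration. The guard around the call lets the snippet load outside the shell, where db is not defined.

```javascript
// Map function: emits one (key, value) pair per input document.
// `custId` and `amount` are hypothetical field names.
function mapFn() {
  emit(this.custId, this.amount);
}

// Reduce function: sums all values emitted for one key.
function reduceFn(key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Output options: write to "order_totals" (hypothetical name) and ask
// MongoDB to shard that collection, using _id as the shard key.
const options = {
  out: { replace: "order_totals", sharded: true }
};

// `db` exists only inside the mongo shell, so guard the actual call.
if (typeof db !== "undefined") {
  db.orders.mapReduce(mapFn, reduceFn, options);
}
```

The replace action is only one choice here; the same sharded flag can accompany the merge and reduce output actions.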
To output to a sharded collection:
- If the output collection does not exist, MongoDB creates and shards the collection on the _id field.
- Starting in version 3.6.6, if the output collection already exists and is not sharded, map-reduce fails.
- For a new or an empty sharded collection, MongoDB uses the results of the first stage of the map-reduce operation to create the initial chunks distributed among the shards.
- mongos dispatches, in parallel, a map-reduce post-processing job to every shard that owns a chunk. During the post-processing, each shard pulls the results for its own chunks from the other shards, runs the final reduce/finalize, and writes locally to the output collection.
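The post-processing step can be pictured with a toy model in plain JavaScript. This is an illustrative simulation only, not MongoDB's implementation: the shard names, chunk ranges, and data are invented, and the jobs run one after another here rather than in parallel.

```javascript
// Partial results from the first stage, as held on each shard (invented data).
const partials = {
  shardA: [{ _id: "apple", value: 2 }, { _id: "pear", value: 1 }],
  shardB: [{ _id: "apple", value: 3 }, { _id: "zebra", value: 4 }],
};

// Hypothetical chunk ownership of the output collection, split on _id.
const chunks = [
  { owner: "shardA", min: "", max: "m" },        // _id in ["", "m")
  { owner: "shardB", min: "m", max: "\uffff" },  // _id in ["m", "\uffff")
];

// Final reduce: sums the partial values collected for one key.
function sumReduce(key, values) {
  return values.reduce((a, b) => a + b, 0);
}

// Post-processing job for one chunk: pull matching partial results from
// every shard, group them by key, and run the final reduce.
function postProcess(chunk) {
  const grouped = new Map();
  for (const docs of Object.values(partials)) {
    for (const doc of docs) {
      if (doc._id >= chunk.min && doc._id < chunk.max) {
        if (!grouped.has(doc._id)) grouped.set(doc._id, []);
        grouped.get(doc._id).push(doc.value);
      }
    }
  }
  return [...grouped].map(([key, values]) => ({
    _id: key,
    value: sumReduce(key, values),
  }));
}

// Each owning shard "writes locally": results are stored under the owner.
const output = {};
for (const chunk of chunks) {
  output[chunk.owner] = postProcess(chunk);
}
console.log(output);
```

In this model the key "apple" has partial results on both shards, but only the shard that owns its chunk performs the final reduce and stores the combined value.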
Note
- During later map-reduce jobs, MongoDB splits chunks as needed.
- Balancing of chunks for the output collection is automatically prevented during post-processing to avoid concurrency issues.