Journaling

To provide durability in the event of a failure, MongoDB uses write ahead logging to on-disk journal files.

Journaling and the WiredTiger Storage Engine

Important

The log mentioned in this section refers to the WiredTiger write-ahead log (i.e. the journal) and not the MongoDB log file.

WiredTiger uses checkpoints to provide a consistent view of data on disk and allow MongoDB to recover from the last checkpoint. However, if MongoDB exits unexpectedly in between checkpoints, journaling is required to recover changes made after the last checkpoint.

With journaling, the recovery process:

Looks in the data files to find the identifier of the last checkpoint.
Searches in the journal files for the record that matches the identifier of the last checkpoint.
Applies the operations in the journal files since the last checkpoint.
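
The three recovery steps above can be sketched as a toy model. This is not WiredTiger's implementation: the `JournalRecord` layout, the `recover` function, and the dictionary standing in for the data files are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JournalRecord:
    record_id: int  # hypothetical identifier, increasing with each record
    key: str
    value: str


def recover(data_files, last_checkpoint_id, journal):
    """Replay the journal records written after the last checkpoint.

    data_files: dict holding the state captured by the last checkpoint.
    last_checkpoint_id: identifier found in the data files (step 1).
    journal: list of JournalRecord in write order.
    """
    # Step 2: locate the journal record that matches the checkpoint identifier.
    start = next(
        (i for i, rec in enumerate(journal) if rec.record_id == last_checkpoint_id),
        None,
    )
    if start is None:
        raise ValueError("checkpoint record not found in journal")

    # Step 3: apply every operation recorded after that point.
    recovered = dict(data_files)
    for rec in journal[start + 1:]:
        recovered[rec.key] = rec.value
    return recovered


journal = [
    JournalRecord(1, "a", "1"),
    JournalRecord(2, "b", "2"),  # the checkpoint was taken here
    JournalRecord(3, "a", "3"),  # written after the checkpoint
]
state = recover({"a": "1", "b": "2"}, last_checkpoint_id=2, journal=journal)
```

In this sketch only the third record is replayed, because the first two are already reflected in the checkpointed state.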

Journaling Process

Changed in version 3.2.

With journaling, WiredTiger creates one journal record for each client-initiated write operation. The journal record includes any internal write operations caused by the initial write. For example, an update to a document in a collection may result in modifications to the indexes; WiredTiger creates a single journal record that includes both the update operation and its associated index modifications.
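
As a rough illustration of one record covering both the update and its index changes, consider the following sketch. The `WriteRecord` type, the `build_record` helper, and the shape of the `update` dictionary are hypothetical; they do not reflect WiredTiger's actual record format.

```python
from dataclasses import dataclass, field


@dataclass
class WriteRecord:
    client_op: dict                                   # the client-initiated write
    internal_ops: list = field(default_factory=list)  # writes it caused


def build_record(update, indexed_fields):
    """Bundle an update and its index modifications into a single record."""
    record = WriteRecord(client_op=update)
    for name, value in update["set"].items():
        if name in indexed_fields:
            # An indexed field changed, so the index entry changes too.
            record.internal_ops.append({"index": name, "new_key": value})
    return record


record = build_record(
    {"_id": 1, "set": {"qty": 5, "note": "restocked"}},
    indexed_fields={"qty"},
)
```

Here `record` holds the update plus one index modification for `qty`; the unindexed `note` field adds nothing to `internal_ops`.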

MongoDB configures WiredTiger to use in-memory buffering for storing the journal records. Threads coordinate to allocate and copy into their portion of the buffer. All journal records up to 128 kB are buffered.

WiredTiger syncs the buffered journal records to disk upon any of the following conditions:

- For replica set members (primary and secondary members):
  - If there are operations waiting for oplog entries. Operations that can wait for oplog entries include:
    - forward scanning queries against the oplog
    - read operations performed as part of causally consistent sessions
  - Additionally, for secondary members, after every batch application of the oplog entries.
- If a write operation includes or implies a write concern of j: true.

  Note

  Write concern "majority" implies j: true if writeConcernMajorityJournalDefault is true.

- Every 100 milliseconds (see storage.journal.commitIntervalMs).
- When WiredTiger creates a new journal file. Because MongoDB uses a journal file size limit of 100 MB, WiredTiger creates a new journal file approximately every 100 MB of data.
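
The sync conditions listed above can be read as a single predicate. The sketch below is a simplified model, not MongoDB code: `SyncState`, its field names, and `should_sync` are hypothetical, and the 100 ms constant is the default interval stated in the text.

```python
from dataclasses import dataclass

COMMIT_INTERVAL_MS = 100  # default interval from the text


@dataclass
class SyncState:
    ops_waiting_for_oplog: bool = False     # e.g. forward oplog scans
    secondary_batch_applied: bool = False   # an oplog batch was just applied
    journaled_write_pending: bool = False   # j: true requested or implied
    ms_since_last_sync: float = 0.0
    new_journal_file_created: bool = False


def should_sync(state, is_replica_set_member=True, is_secondary=False):
    """Return True if any listed condition calls for a journal sync."""
    if is_replica_set_member and state.ops_waiting_for_oplog:
        return True
    if is_replica_set_member and is_secondary and state.secondary_batch_applied:
        return True
    if state.journaled_write_pending:
        return True
    if state.ms_since_last_sync >= COMMIT_INTERVAL_MS:
        return True
    return state.new_journal_file_created
```

Any single condition is enough to trigger a sync; a state in which none holds leaves the records in the buffer.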

Important

In between write operations, while the journal records remain in the WiredTiger buffers, updates can be lost following a hard shutdown of mongod.

Journal Files

For the journal files, MongoDB creates a subdirectory named journal under the dbPath directory. WiredTiger journal files have names with the following format WiredTigerLog.<sequence> where <sequence> is a zero-padded number starting from 0000000001.
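
The naming scheme can be reproduced with a short helper. This is a minimal sketch: the function names are hypothetical, and the ten-digit padding width is inferred from the `0000000001` example above.

```python
import re
from pathlib import PurePosixPath

_NAME_RE = re.compile(r"^WiredTigerLog\.(\d{10})$")


def journal_file_name(sequence):
    """Format a journal file name with a zero-padded, ten-digit sequence."""
    return f"WiredTigerLog.{sequence:010d}"


def journal_file_path(db_path, sequence):
    """Build the path of a journal file under <dbPath>/journal."""
    return PurePosixPath(db_path) / "journal" / journal_file_name(sequence)


def parse_sequence(file_name):
    """Return the sequence number encoded in a journal file name, or None."""
    match = _NAME_RE.match(file_name)
    return int(match.group(1)) if match else None
```

For example, `journal_file_name(1)` yields `WiredTigerLog.0000000001`, matching the first file name described above.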

Journal Records

Journal files contain one record for each client-initiated write operation.

- The journal record includes any internal write operations caused by the initial write. For example, an update to a document in a collection may result in modifications to the indexes; WiredTiger creates a single journal record that includes both the update operation and its associated index modifications.
- Each record has a unique identifier.
- The minimum journal record size for WiredTiger is 128 bytes.

Compression

By default, MongoDB configures WiredTiger to use snappy compression for its journaling data. To specify a different compression algorithm or no compression, use the storage.wiredTiger.engineConfig.journalCompressor setting. For details, see Change WT Journal Compressor.

Note

If a log record is less than or equal to 128 bytes (the minimum log record size for WiredTiger), WiredTiger does not compress that record.
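
The size threshold in the note can be sketched as follows. This is an illustration only: `maybe_compress` is a hypothetical helper, and it uses `zlib` from the Python standard library as a stand-in because snappy is not part of the standard library.

```python
import zlib

MIN_RECORD_SIZE = 128  # bytes; the minimum log record size from the text


def maybe_compress(record):
    """Compress a record only if it is larger than the minimum record size.

    Returns a (payload, compressed) tuple. zlib is a stand-in for the
    configured journal compressor.
    """
    if len(record) <= MIN_RECORD_SIZE:
        return record, False
    return zlib.compress(record), True


small_payload, small_compressed = maybe_compress(b"x" * 128)
large_payload, large_compressed = maybe_compress(b"x" * 4096)
```

A 128-byte record is returned unchanged, while the 4096-byte record is compressed.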

Journal File Size Limit

WiredTiger journal files for MongoDB have a maximum size limit of approximately 100 MB.

Once the file exceeds that limit, WiredTiger creates a new journal file.
WiredTiger automatically removes old journal files to maintain only the files needed to recover from the last checkpoint.
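
Rotation and cleanup can be modeled with a small class. This sketch makes several assumptions: the `JournalDir` class and its methods are hypothetical, the limit is taken as exactly 100 × 1024 × 1024 bytes although the text only says "approximately 100 MB", and the caller supplies the oldest sequence still needed for recovery.

```python
MAX_JOURNAL_FILE_BYTES = 100 * 1024 * 1024  # assumed value for "approximately 100 MB"


class JournalDir:
    """Toy model of journal file rotation and removal."""

    def __init__(self):
        self.current = 1
        self.files = {1: 0}  # sequence number -> bytes written

    def append(self, nbytes):
        """Record a write; start a new file once the current one exceeds the limit."""
        self.files[self.current] += nbytes
        if self.files[self.current] > MAX_JOURNAL_FILE_BYTES:
            self.current += 1
            self.files[self.current] = 0

    def remove_before(self, oldest_needed):
        """Drop files that are no longer needed to recover from the last checkpoint."""
        for seq in [s for s in self.files if s < oldest_needed]:
            del self.files[seq]


journal_dir = JournalDir()
journal_dir.append(60 * 1024 * 1024)
journal_dir.append(60 * 1024 * 1024)   # exceeds the limit, so file 2 is created
journal_dir.remove_before(2)           # a checkpoint made file 1 unnecessary
```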

Pre-Allocation

WiredTiger pre-allocates journal files.

Journaling and the MMAPv1 Storage Engine

With MMAPv1, when a write operation occurs, MongoDB updates the in-memory view. With journaling enabled, MongoDB writes the in-memory changes first to on-disk journal files. If MongoDB should terminate or encounter an error before committing the changes to the data files, MongoDB can use the journal files to apply the write operation to the data files and maintain a consistent state.

Journaling Process

With journaling, MongoDB’s storage layer has two internal views of the data set: the private view, used to write to the journal files, and the shared view, used to write to the data files:

MongoDB first applies write operations to the private view.
MongoDB then applies the changes in the private view to the on-disk journal files in the journal directory roughly every 100 milliseconds. MongoDB records the write operations to the on-disk journal files in batches called group commits. Grouping the commits helps minimize the performance impact of journaling since these commits must block all writers during the commit. Writes to the journal are atomic, ensuring the consistency of the on-disk journal files. For information on the frequency of the commit interval, see storage.journal.commitIntervalMs.
Upon a journal commit, MongoDB applies the changes from the journal to the shared view.
Finally, MongoDB applies the changes in the shared view to the data files. More precisely, at default intervals of 60 seconds, MongoDB asks the operating system to flush the shared view to the data files. The operating system may choose to flush the shared view to disk at a higher frequency than 60 seconds, particularly if the system is low on free memory. To change the interval for writing to the data files, use the storage.syncPeriodSecs setting.

If the mongod instance were to crash without having applied the writes to the data files, the journal could replay the writes to the shared view for eventual write to the data files.
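
The flow from private view to journal to shared view to data files can be sketched with a toy model. Everything here is hypothetical and heavily simplified: `MMAPv1Model` uses dictionaries for the views and data files, a single list for the journal, and clears the journal on every flush instead of recycling individual files.

```python
class MMAPv1Model:
    """Toy model of the MMAPv1 journaling steps described above."""

    def __init__(self):
        self.private_view = {}
        self.pending = []      # writes not yet group-committed (in memory)
        self.journal = []      # on-disk journal (survives a crash)
        self.shared_view = {}
        self.data_files = {}   # on-disk data files (survive a crash)

    def write(self, key, value):
        # Step 1: apply the write to the private view.
        self.private_view[key] = value
        self.pending.append((key, value))

    def group_commit(self):
        # Step 2: write the batched changes to the journal (roughly every 100 ms).
        self.journal.extend(self.pending)
        # Step 3: apply the committed changes to the shared view.
        for key, value in self.pending:
            self.shared_view[key] = value
        self.pending.clear()

    def flush(self):
        # Step 4: flush the shared view to the data files (every 60 s by default).
        self.data_files.update(self.shared_view)
        self.journal.clear()  # flushed writes are no longer needed for recovery

    def crash_and_recover(self):
        # Only on-disk state survives; replay the journal onto the shared view.
        self.pending.clear()
        self.shared_view = dict(self.data_files)
        for key, value in self.journal:
            self.shared_view[key] = value
        self.private_view = dict(self.shared_view)


model = MMAPv1Model()
model.write("a", 1)
model.group_commit()       # "a" is now in the journal
model.write("b", 2)        # not yet journaled
model.crash_and_recover()  # "a" is replayed; "b" is lost
```

In this model a write that reached a group commit is recovered after the crash, while a write that was still waiting for the next commit is not.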

When MongoDB flushes write operations to the data files, MongoDB notes which journal writes have been flushed. Once a journal file contains only flushed writes, it is no longer needed for recovery and MongoDB can recycle it for a new journal file.

Once the journal operations have been applied to the shared view and flushed to disk (i.e. pages in the shared view and private view are in sync), MongoDB asks the operating system to remap the shared view to the private view in order to save physical RAM. Upon a new remapping, the operating system knows that physical memory pages can be shared between the shared view and the private view mappings.

Note

The interaction between the shared view and the on-disk data files is similar to how MongoDB works without journaling. Without journaling, MongoDB asks the operating system to flush in-memory changes to the data files every 60 seconds.

Journal Files

With journaling enabled, MongoDB creates a subdirectory named journal under the dbPath directory. The journal directory contains journal files named j._<sequence>, where <sequence> is an integer starting from 0, and a “last sequence number” file named lsn.

Journal files contain the write ahead logs; each journal entry describes the bytes the write operation changed in the data files. Journal files are append-only files. When a journal file holds 1 gigabyte of data, MongoDB creates a new journal file. If you use the storage.smallFiles option when starting mongod, you limit the size of each journal file to 128 megabytes.

The lsn file contains the last time MongoDB flushed the changes to the data files.

Once MongoDB applies all the write operations in a particular journal file to the data files, MongoDB can recycle it for a new journal file.

Unless you write many bytes of data per second, the journal directory should contain only two or three journal files.

A clean shutdown removes all the files in the journal directory. A dirty shutdown (crash) leaves files in the journal directory; these are used to automatically recover the database to a consistent state when the mongod process is restarted.

Journal Directory

To speed up the frequent sequential writes that occur to the current journal file, you can place the journal directory on a different filesystem from the database data files.

Important

If you place the journal on a different filesystem from your data files, you cannot use a filesystem snapshot alone to capture valid backups of a dbPath directory. In this case, use fsyncLock() to ensure that database files are consistent before the snapshot and fsyncUnlock() once the snapshot is complete.
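
The lock, snapshot, unlock sequence can be wrapped so that the unlock always runs. This is a hedged sketch: `fsync_locked` is a hypothetical helper, and it assumes a PyMongo-style client whose `admin.command` method sends the `fsync` command with `lock: true` and the `fsyncUnlock` command, which are the database commands behind the fsyncLock() and fsyncUnlock() shell helpers. The snapshot step itself is left to the caller.

```python
from contextlib import contextmanager


@contextmanager
def fsync_locked(client):
    """Hold the fsync lock for the duration of the with-block.

    client: assumed to expose client.admin.command(name, **options),
    as a PyMongo MongoClient does.
    """
    client.admin.command("fsync", lock=True)   # flush and block writes
    try:
        yield
    finally:
        client.admin.command("fsyncUnlock")    # always release the lock


# Usage sketch (take_snapshot is a placeholder for your snapshot tooling):
#
#     with fsync_locked(client):
#         take_snapshot()
```

Using a context manager means the lock is released even if the snapshot step raises an exception.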

Preallocation Lag

MongoDB may preallocate journal files if the mongod process determines that it is more efficient to preallocate journal files than create new journal files as needed.

Depending on your filesystem, you might experience a preallocation lag the first time you start a mongod instance with journaling enabled. The amount of time required to pre-allocate files might last several minutes; during this time, you will not be able to connect to the database. This is a one-time preallocation and does not occur with future invocations.

To avoid preallocation lag, see Avoid Preallocation Lag for MMAPv1.

Journaling and the In-Memory Storage Engine

Starting in MongoDB Enterprise version 3.2.6, the In-Memory Storage Engine is part of general availability (GA). Because its data is kept in memory, there is no separate journal. Write operations with a write concern of j: true are immediately acknowledged.

If any voting member of a replica set runs without journaling (i.e. either runs an in-memory storage engine or runs with journaling disabled), you must set writeConcernMajorityJournalDefault to false.

Note

Starting in version 3.6.14, if a replica set member uses the in-memory storage engine (voting or non-voting) but the replica set has writeConcernMajorityJournalDefault set to true, the replica set member logs a startup warning.

With writeConcernMajorityJournalDefault set to false, MongoDB does not wait for w: "majority" writes to be written to the on-disk journal before acknowledging the writes. As such, majority write operations could possibly roll back in the event of a transient loss (e.g. crash and restart) of a majority of nodes in a given replica set.
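
The interaction between write concern, journaling, and writeConcernMajorityJournalDefault can be summarized in a simplified model. Both functions below are hypothetical and cover only the rules stated on this page; they do not capture every server behavior.

```python
def waits_for_journal(w, j=None, majority_journal_default=True, has_journal=True):
    """Return True if acknowledgment waits for the on-disk journal.

    w: write concern value, e.g. 1 or "majority".
    j: explicit journal option, or None if unspecified.
    has_journal: False for members without a journal, such as the
        in-memory storage engine, where j: true is acknowledged immediately.
    """
    if not has_journal:
        return False
    if j is not None:
        return bool(j)
    # w: "majority" implies j: true only when the default is enabled.
    return w == "majority" and bool(majority_journal_default)


def majority_default_is_valid(voting_members_journaled, majority_journal_default):
    """Check the rule that any unjournaled voting member requires the default to be false.

    voting_members_journaled: iterable of booleans, one per voting member.
    """
    if all(voting_members_journaled):
        return True
    return majority_journal_default is False
```

For example, `waits_for_journal("majority", majority_journal_default=False)` returns `False`, which corresponds to the rollback risk described above.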
