Sunday, August 30, 2015

The filesystem image and edit log

When a filesystem client performs a write operation, the transaction is first recorded in the edit log. The namenode's in-memory metadata is updated only after the edit log has been modified.

- The edit log consists of many segments (files prefixed with "edits" and suffixed with a transaction ID, e.g. edits_inprogress_0000000000000000020).
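
To make the segment naming concrete, here is a minimal sketch that parses segment file names. It assumes the Hadoop 2 layout, where an in-progress segment is named edits_inprogress_<start txid> and a finalized segment is named edits_<start txid>-<end txid>. The helper name parse_segment is hypothetical and is not part of any Hadoop API.

```python
import re

# Assumed naming (Hadoop 2 layout):
#   in-progress segment: edits_inprogress_<startTxId>
#   finalized segment:   edits_<startTxId>-<endTxId>
SEGMENT_RE = re.compile(r"^edits_(?:inprogress_(\d+)|(\d+)-(\d+))$")


def parse_segment(name):
    """Return the transaction range encoded in an edit log file name.

    Returns None when the name does not look like an edit log segment.
    """
    match = SEGMENT_RE.match(name)
    if match is None:
        return None
    if match.group(1) is not None:
        # An in-progress segment has no end transaction ID yet.
        return {"start": int(match.group(1)), "end": None, "in_progress": True}
    return {
        "start": int(match.group(2)),
        "end": int(match.group(3)),
        "in_progress": False,
    }


print(parse_segment("edits_inprogress_0000000000000000020"))
```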

- Each fsimage file is a complete persistent checkpoint of the filesystem metadata.
- The fsimage is not updated for every filesystem write operation, because writing out the fsimage file would be very slow. This does not compromise resilience: if the namenode fails, the latest state of its metadata can be reconstructed by loading the latest fsimage from disk into memory and then applying each of the transactions from the relevant point onward in the edit log.
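
The recovery idea above (load the latest checkpoint, then replay later transactions) can be sketched with a toy model. This is not how the namenode is implemented: the dictionary layout, the operation names (create, delete, set_replication) and the functions apply_edit and recover are all assumptions made up for illustration.

```python
import copy


def apply_edit(namespace, op, path, value=None):
    """Apply one toy transaction to the in-memory namespace."""
    if op == "create":
        namespace[path] = value
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "set_replication":
        namespace[path] = dict(namespace[path], replication=value)
    else:
        raise ValueError("unknown operation: " + op)


def recover(fsimage, edit_log):
    """Rebuild the latest metadata from a checkpoint plus the edit log.

    fsimage:  {"last_txid": int, "namespace": {path: metadata}}
    edit_log: iterable of (txid, op, path, value) tuples
    """
    namespace = copy.deepcopy(fsimage["namespace"])
    last_txid = fsimage["last_txid"]
    for txid, op, path, value in sorted(edit_log, key=lambda e: e[0]):
        if txid <= last_txid:
            continue  # already reflected in the checkpoint
        apply_edit(namespace, op, path, value)
        last_txid = txid
    return {"last_txid": last_txid, "namespace": namespace}


checkpoint = {"last_txid": 2, "namespace": {"/a": {"replication": 3}}}
edits = [
    (1, "create", "/a", {"replication": 3}),
    (2, "set_replication", "/a", 3),
    (3, "create", "/b", {"replication": 2}),
    (4, "set_replication", "/a", 1),
    (5, "delete", "/b", None),
]
state = recover(checkpoint, edits)
```

Only transactions 3 to 5 are replayed here, because the checkpoint already covers everything up to transaction 2.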

- Each fsimage file contains a serialized form of all the directory and file inodes in the filesystem. Each inode is an internal representation of a file or directory’s metadata and contains such information as the file’s replication level, modification and access times, access permissions, block size, and the blocks the file is made up of. For directories, the modification time, permissions, and quota metadata are stored. 

- An fsimage file does not record the datanodes on which the blocks are stored. Instead, the namenode keeps this mapping in memory, which it constructs by asking the datanodes for their block lists when they join the cluster and periodically afterward to ensure the namenode’s block mapping is up to date.
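
A rough sketch of how such an in-memory block mapping could be maintained from block reports follows. The class BlockMap and its methods are hypothetical names for illustration and do not mirror the real namenode data structures.

```python
from collections import defaultdict


class BlockMap:
    """Toy block-to-datanode mapping rebuilt purely from block reports."""

    def __init__(self):
        self._locations = defaultdict(set)  # block id -> datanode ids
        self._reported = {}                 # datanode id -> blocks in its last report

    def process_block_report(self, datanode, block_ids):
        """Replace what we know about one datanode with its latest report."""
        new = set(block_ids)
        old = self._reported.get(datanode, set())
        for block in old - new:
            # The datanode no longer holds this block.
            self._locations[block].discard(datanode)
            if not self._locations[block]:
                del self._locations[block]
        for block in new - old:
            self._locations[block].add(datanode)
        self._reported[datanode] = new

    def locations(self, block_id):
        """Return the datanodes currently known to hold a block."""
        return sorted(self._locations.get(block_id, set()))


block_map = BlockMap()
block_map.process_block_report("dn1", ["blk_1", "blk_2"])  # dn1 joins the cluster
block_map.process_block_report("dn2", ["blk_1"])           # dn2 joins the cluster
block_map.process_block_report("dn1", ["blk_2"])           # periodic report: dn1 lost blk_1
```

Nothing in this structure needs to be persisted, since it can always be rebuilt from fresh reports, which is consistent with the fsimage not storing block locations.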

- Checkpointing process:
1. The secondary namenode asks the primary to roll its in-progress edits file, so new edits go to a new file. The primary also updates the seen_txid file in all its storage directories.
2. The secondary retrieves the latest fsimage and edits files from the primary (using HTTP GET).
3. The secondary loads the fsimage into memory, applies each transaction from the edits files, then creates a new merged fsimage file.
4. The secondary sends the new fsimage back to the primary (using HTTP PUT), and the primary saves it as a temporary .ckpt file.
5. The primary renames the temporary fsimage file to make it available.
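
The five steps above can be simulated in memory with a small toy model. Everything here is an assumption made for illustration: the Primary class, the merge and checkpoint functions, and the way seen_txid is modeled are invented for this sketch, and the real transfers happen over HTTP rather than through method calls.

```python
import copy


class Primary:
    """Toy primary namenode holding an fsimage and edit log segments."""

    def __init__(self):
        self.fsimage = {"last_txid": 0, "namespace": {}}
        self.finalized = []    # edits from rolled (closed) segments
        self.in_progress = []  # edits in the currently open segment
        self.next_txid = 1
        self.seen_txid = 0     # toy stand-in for the seen_txid file
        self.ckpt = None       # temporary .ckpt image, if any

    def write(self, path, meta):
        self.in_progress.append((self.next_txid, path, meta))
        self.next_txid += 1

    def roll_edits(self):      # step 1: close the open segment
        self.finalized.extend(self.in_progress)
        self.in_progress = []
        self.seen_txid = self.next_txid - 1

    def fetch(self):           # step 2: secondary downloads image and edits
        return copy.deepcopy(self.fsimage), list(self.finalized)

    def upload(self, image):   # step 4: saved as a temporary checkpoint
        self.ckpt = image

    def promote(self):         # step 5: "rename" the temporary image
        self.fsimage, self.ckpt = self.ckpt, None


def merge(fsimage, edits):     # step 3: runs on the secondary
    namespace = dict(fsimage["namespace"])
    last_txid = fsimage["last_txid"]
    for txid, path, meta in edits:
        if txid > last_txid:
            namespace[path] = meta
            last_txid = txid
    return {"last_txid": last_txid, "namespace": namespace}


def checkpoint(primary):
    primary.roll_edits()
    image, edits = primary.fetch()
    primary.upload(merge(image, edits))
    primary.promote()


primary = Primary()
primary.write("/a", {"replication": 3})
primary.write("/b", {"replication": 2})
checkpoint(primary)
```

Writes that arrive during the checkpoint go to the new in-progress segment, so they are not lost; they are simply picked up by the next checkpoint.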


- The secondary namenode has similar memory requirements to the primary, since it loads the fsimage into memory.

- The schedule for checkpointing is controlled by two configuration parameters:
+ The secondary namenode checkpoints every hour (dfs.namenode.checkpoint.period, in seconds).
+ Or sooner, if the edit log has reached one million transactions since the last checkpoint (dfs.namenode.checkpoint.txns), which it checks every minute (dfs.namenode.checkpoint.check.period, in seconds).
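
The scheduling rule can be written as a small decision function. This is only a sketch of the logic described above, not the secondary namenode's actual code. The constants use the values mentioned in the text (one hour, one million transactions, one minute), and the function name should_checkpoint is hypothetical.

```python
# Values taken from the description above; each is configurable in HDFS.
CHECKPOINT_PERIOD_S = 3600    # dfs.namenode.checkpoint.period
CHECKPOINT_TXNS = 1_000_000   # dfs.namenode.checkpoint.txns
CHECK_PERIOD_S = 60           # dfs.namenode.checkpoint.check.period


def should_checkpoint(seconds_since_last, txns_since_last,
                      period_s=CHECKPOINT_PERIOD_S, txns=CHECKPOINT_TXNS):
    """Decide whether a checkpoint is due.

    Intended to be evaluated once every CHECK_PERIOD_S seconds: a checkpoint
    is due when the time limit has elapsed or the edit log has grown past
    the transaction limit, whichever happens first.
    """
    return seconds_since_last >= period_s or txns_since_last >= txns


print(should_checkpoint(600, 1_200_000))  # busy cluster: transaction limit reached early
print(should_checkpoint(3600, 10))        # quiet cluster: the hourly timer fires
```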


Tools for viewing the fsimage and edit log:
http://hadooptutorial.info/oiv-hdfs-offline-image-viewer/ 
http://hadooptutorial.info/hdfs-offline-edits-viewer-tool-oev/ 
