Friday, August 14, 2015

HDFS Concepts

Blocks

- 128 MB by default.
- Unlike a regular filesystem, where a file is usually much larger than the block size (often 512 bytes), an HDFS block can be bigger than the input file. In that case the file occupies only its actual size on disk, not a full block.
Why does HDFS build its abstraction on blocks rather than files?
- A file can be bigger than a single disk, or even all the disks in the cluster; splitting it into blocks lets it be stored and managed across many disks.
- A block is just a chunk of data, with no file metadata to store. Because the block size is fixed, it is easy to work out how many blocks fit on a disk.
- Blocks fit well with replication for providing fault tolerance and availability.
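The point about partial blocks can be illustrated with a little arithmetic. A minimal sketch (file sizes here are illustrative, block size is the HDFS default):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

def block_layout(file_size):
    """Return (number of blocks, bytes occupied by the last block)."""
    if file_size == 0:
        return 0, 0
    full, rest = divmod(file_size, BLOCK_SIZE)
    if rest == 0:
        return full, BLOCK_SIZE
    return full + 1, rest

# A 1 MB file needs one block but occupies only 1 MB on disk, not 128 MB.
print(block_layout(1 * 1024 * 1024))    # (1, 1048576)
# A 200 MB file spans two blocks: one full, one holding the remaining 72 MB.
print(block_layout(200 * 1024 * 1024))  # (2, 75497472)
```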

HDFS command to list the files in the filesystem and the blocks that make them up:
         hdfs fsck / -files -blocks
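Block size and replication factor are set per cluster in hdfs-site.xml; a minimal fragment, with illustrative values:

```xml
<!-- hdfs-site.xml fragment (values are illustrative) -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```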


Namenodes and Datanodes

namenode (the master)
datanodes (workers)

- Namenode maintains the file system tree and the metadata for all the files and directories in the tree. 

- This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. 
- The namenode also stores the locations of all blocks for a given file, but does not persist them, because this information is reconstructed from the datanodes when the system starts.
- Datanodes store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
- If the namenode dies, there is no way to reconstruct the filesystem from the datanodes. Two ways to guard against this:
+ Configure Hadoop so the namenode writes its persistent state to multiple filesystems. These can be a local disk and a remote NFS mount; the writes are synchronous and atomic.
+ Run a secondary namenode. Its main role is to merge the namespace image with the edit log, preventing the edit log from growing too large. The merged namespace image can be used to restore the system after a failure. However, the secondary's state lags behind the primary's, so after a total failure of the primary some data loss is almost certain. The usual course of action is to copy the namenode's metadata files from NFS to the secondary and run it as the new primary.
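The multiple-filesystems option is configured by giving dfs.namenode.name.dir a comma-separated list of directories; the namenode writes its metadata to all of them. A sketch for hdfs-site.xml (the paths are illustrative):

```xml
<property>
  <name>dfs.namenode.name.dir</name>
  <!-- local disk plus a remote NFS mount; metadata is written to both -->
  <value>file:///data/dfs/name,file:///mnt/nfs/dfs/name</value>
</property>
```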

Block caching

- Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be explicitly cached in the datanode’s memory, in an off-heap block cache. 
- By default, a block is cached in only one datanode’s memory.
- Users or applications instruct the namenode which files to cache (and for how long) by adding a cache directive to a cache pool. Cache pools are an administrative grouping for managing cache permissions and resource usage. 
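Cache pools and directives are managed with the hdfs cacheadmin tool; a sketch against a running cluster (the pool name, path, and TTL are illustrative):

```shell
# Create an administrative pool for hot datasets
hdfs cacheadmin -addPool hot-data

# Cache the blocks of everything under /user/hive/lookup for one day
hdfs cacheadmin -addDirective -path /user/hive/lookup -pool hot-data -ttl 1d

# List directives to confirm
hdfs cacheadmin -listDirectives
```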

HDFS Federation

- Introduced in the 2.x release series, HDFS federation allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /user, and a second namenode might handle files under /share.
- Under federation, each namenode manages:
+ a namespace volume, made up of the metadata for its namespace. Namespace volumes are independent of each other: namenodes do not communicate with one another, and the failure of one does not affect the availability of the namespaces managed by the others.
+ a block pool containing all the blocks for the files in its namespace. Block pool storage is not partitioned, so datanodes register with every namenode in the cluster and store blocks from multiple block pools.
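Federation is configured by declaring multiple nameservices; a hedged hdfs-site.xml sketch assuming two namenodes (the nameservice IDs and hostnames are illustrative):

```xml
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>
```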

High availability

- To recover from a failed namenode in this situation, an administrator starts a new primary namenode with one of the filesystem metadata replicas and configures datanodes and clients to use this new namenode. 
- The new namenode is not able to serve requests until it has (i) loaded its namespace image into memory, (ii) replayed its edit log, and (iii) received enough block reports from the datanodes to leave safe mode
- Hadoop 2 offers a pair of namenodes in an active-standby configuration.
- Architecture:
+ The namenodes must use highly available shared storage to share the edit log.
+ The active namenode writes to the log; the standby reads it to keep its state in sync.
+ Datanodes must send block reports to both namenodes, because block mappings are held in memory, not on disk.
+ The secondary namenode’s role is subsumed by the standby, which takes periodic checkpoints of the active namenode’s namespace.

- For the highly available shared storage (edit log), there are two options: an NFS filer, or a quorum journal manager (QJM)
- QJM is recommended. It runs as a group of journal nodes, and each edit must be written to a majority of them. Typically there are three journal nodes, so the system can tolerate the loss of one.
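The fault tolerance of a QJM ensemble follows directly from the majority rule; a quick sketch:

```python
def qjm_tolerance(n):
    """With n journal nodes, an edit must reach a majority (n // 2 + 1),
    so the ensemble tolerates n - majority = (n - 1) // 2 failures."""
    majority = n // 2 + 1
    return majority, n - majority

print(qjm_tolerance(3))  # (2, 1): three nodes tolerate one failure
print(qjm_tolerance(5))  # (3, 2): five nodes tolerate two failures
```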
- Failover controller manages transition between active and standby namenode using ZooKeeper (default).
- Each namenode runs a lightweight failover controller process.
- Failover may also be initiated manually by an administrator.
- Fencing is the practice of making sure a previously active namenode cannot do any damage or cause corruption (it may have failed, but could still be writing to the edit log).
- Fencing methods include revoking the namenode’s access to the shared edit log, disabling its network port, or even powering down the physical server.
- The HDFS URI uses a logical hostname that is mapped to a pair of namenode addresses, so client failover is handled transparently.
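A minimal HA configuration tying these pieces together — shared QJM edits, a failover proxy provider behind the logical nameservice, and a fencing method — might look like this sketch (the nameservice ID and hostnames are illustrative):

```xml
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
```

Clients then use the logical URI hdfs://mycluster, and the proxy provider resolves it to whichever namenode is active.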
