1. Create a Java project
2. Convert it to a Maven project
In pom.xml, add the Hadoop dependencies:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>commons-configuration</groupId>
<artifactId>commons-configuration</artifactId>
<version>1.10</version>
</dependency>
<dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>2.6</version>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging-api</artifactId>
<version>1.1</version>
</dependency>
3. Create a new package and a new class (with a main() method)
Example: package name: hdfs.operations, class name: Operations
4. Add your code
package hdfs.operations;
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class Operations {
    private static final Log LOGGER = LogFactory.getLog(Operations.class);

    public static void main(String[] args) {
        try {
            FileSystem hdfs = FileSystem.get(new Configuration());
            Path homeDir = hdfs.getHomeDirectory();
            // Print the home directory
            System.out.println("Home folder - " + homeDir);
        } catch (IOException e) {
            LOGGER.error(e.getMessage(), e);
        }
    }
}
5. Configure the Maven build and move the target jar file to a destination folder
In the build->plugins section of pom.xml, add one more plugin and specify the output folder for the new target jar file:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<outputDirectory>/data/hadoop/jars</outputDirectory>
</configuration>
<version>2.6</version>
</plugin>
6. Run with Hadoop
hadoop jar /data/hadoop/jars/jar-name.jar hdfs.operations.Operations
Sunday, August 30, 2015
The filesystem image and edit log
When a filesystem client performs a write operation, the transaction is first recorded in the edit log. The namenode's in-memory metadata is updated only after the edit log has been modified.
- The edit log consists of many segments (files prefixed with "edits" and suffixed with a transaction ID, e.g. edits_inprogress_0000000000000000020).
- Each fsimage file is a complete persistent checkpoint of the filesystem metadata.
- Each fsimage file contains a serialized form of all the directory and file inodes in the filesystem. Each inode is an internal representation of a file or directory’s metadata and contains such information as the file’s replication level, modification and access times, access permissions, block size, and the blocks the file is made up of. For directories, the modification time, permissions, and quota metadata are stored.
- An fsimage file does not record the datanodes on which the blocks are stored. Instead, the namenode keeps this mapping in memory, which it constructs by asking the datanodes for their block lists when they join the cluster and periodically afterward to ensure the namenode’s block mapping is up to date.
- Checkpointing process:
1. The secondary asks the primary to roll its in-progress edits file, so new edits go to a new file. The primary also updates the seen_txid file in all its storage directories.
2. The secondary retrieves the latest fsimage and edits files from the primary (using HTTP GET).
3. The secondary loads fsimage into memory, applies each transaction from edits, then creates a new merged fsimage file.
4. The secondary sends the new fsimage back to the primary (using HTTP PUT), and the primary saves it as a temporary .ckpt file.
5. The primary renames the temporary fsimage file to make it available.
- The secondary namenode has similar memory requirements to the primary.
- The schedule for checkpointing is controlled by two configuration parameters:
+ The secondary namenode checkpoints every hour (dfs.namenode.checkpoint.period, in seconds),
+ or sooner if the edit log has reached one million transactions since the last checkpoint (dfs.namenode.checkpoint.txns), which it checks every minute (dfs.namenode.checkpoint.check.period, in seconds).
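The two-trigger schedule above can be sketched as a small decision function. This is a hypothetical illustration with the default values hard-coded, not the actual SecondaryNameNode implementation:

```java
// Sketch of the secondary namenode's checkpoint trigger, assuming the
// default values of the two properties mentioned above.
public class CheckpointTrigger {
    static final long PERIOD_SECONDS = 3600;   // dfs.namenode.checkpoint.period
    static final long TXN_LIMIT = 1_000_000;   // dfs.namenode.checkpoint.txns

    // Evaluated every dfs.namenode.checkpoint.check.period (60s by default).
    static boolean shouldCheckpoint(long secondsSinceLast, long txnsSinceLast) {
        return secondsSinceLast >= PERIOD_SECONDS || txnsSinceLast >= TXN_LIMIT;
    }

    public static void main(String[] args) {
        System.out.println(shouldCheckpoint(1800, 500));      // neither limit hit
        System.out.println(shouldCheckpoint(3600, 500));      // hour elapsed
        System.out.println(shouldCheckpoint(60, 1_000_000));  // txn limit reached
    }
}
```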
Tools for viewing the fsimage and edit log:
http://hadooptutorial.info/oiv-hdfs-offline-image-viewer/
http://hadooptutorial.info/hdfs-offline-edits-viewer-tool-oev/
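The edits segment naming convention shown earlier encodes the transaction ID in the filename. As an illustration (not part of any Hadoop API), the ID can be recovered like this:

```java
// Illustrative parser for the edits segment naming convention,
// e.g. "edits_inprogress_0000000000000000020" -> transaction ID 20.
public class EditLogName {
    static long transactionId(String fileName) {
        int idx = fileName.lastIndexOf('_');
        return Long.parseLong(fileName.substring(idx + 1));
    }

    public static void main(String[] args) {
        System.out.println(transactionId("edits_inprogress_0000000000000000020")); // 20
    }
}
```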
Friday, August 14, 2015
HDFS Concepts
Blocks
- 128 MB by default.
- Unlike an ordinary filesystem, where a file is usually larger than a block (512 bytes), an HDFS block is sometimes larger than the input file; in that case the file occupies only an amount of disk equal to its size, not the full block size.
Why does HDFS use a block abstraction rather than a file?
- A file can be bigger than a single disk, or even all the disks in the cluster; how else could it be stored and managed?
- A block is a fixed-size chunk of data with no file metadata to store, so it is easy to work out how many blocks fit on a disk.
- Blocks fit well with replication for providing fault tolerance and availability.
HDFS command:
hdfs fsck / -files -blocks
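The block arithmetic above is simple ceiling division. A small worked example (the 128 MB default is the only value taken from the text; the helper names are my own):

```java
// Block math for HDFS: how many blocks a file needs, and how much disk
// the last (possibly partial) block actually occupies.
public class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB default

    // Number of HDFS blocks for a file of the given size (ceiling division).
    static long blocksFor(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Disk occupied by the last block: a 1 MB file uses 1 MB, not 128 MB.
    static long lastBlockOccupancy(long fileSize) {
        long rem = fileSize % BLOCK_SIZE;
        return rem == 0 ? (fileSize == 0 ? 0 : BLOCK_SIZE) : rem;
    }

    public static void main(String[] args) {
        long oneMb = 1024L * 1024;
        System.out.println(blocksFor(oneMb));           // 1 block
        System.out.println(lastBlockOccupancy(oneMb));  // 1048576 bytes, not 128 MB
        System.out.println(blocksFor(300 * oneMb));     // 3 blocks
    }
}
```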
Namenodes and Datanodes
An HDFS cluster has two types of node: a namenode (the master) and datanodes (workers).
- Namenode maintains the file system tree and the metadata for all the files and directories in the tree.
- This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.
- The namenode also knows the locations of all the blocks of a given file, but does not persist this information, because it is reconstructed from the datanodes when the system starts.
- Datanodes store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
- If the namenode dies, there is no way to reconstruct the filesystem from the datanodes. Two ways to back it up:
+ Configure Hadoop so the namenode writes its persistent state to multiple filesystems. The writes can go to the local disk as well as a remote NFS mount, and are synchronous and atomic.
+ Run a secondary namenode. Its main role is to merge the namespace image with the edit log to prevent the edit log from becoming too large. The merged namespace image can be used to restore the system after a failure. However, the state of the secondary namenode lags the primary, so on total failure of the primary some data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary.
Block caching
- Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be explicitly cached in the datanode’s memory, in an off-heap block cache.
- By default, a block is cached in only one datanode’s memory.
- Users or applications instruct the namenode which files to cache (and for how long) by adding a cache directive to a cache pool. Cache pools are an administrative grouping for managing cache permissions and resource usage.
HDFS Federation
- Introduced in the 2.x release series, HDFS federation allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /user, and a second namenode might handle files under /share.
- Under federation, each namenode manages:
+ a namespace volume, made up of the metadata for the namespace. Namespace volumes are independent of each other: namenodes do not communicate with one another, and the failure of one does not affect the availability of the namespaces managed by the others.
+ a block pool containing all the blocks for the files in the namespace. Block pool storage is not partitioned, so datanodes register with each namenode in the cluster and store blocks from multiple block pools.
High availability
- To recover from a failed namenode in this situation, an administrator starts a new primary namenode with one of the filesystem metadata replicas and configures datanodes and clients to use this new namenode.
- The new namenode is not able to serve requests until it has (i) loaded its namespace image into memory, (ii) replayed its edit log, and (iii) received enough block reports from the datanodes to leave safe mode.
- Hadoop 2 offers an active and a standby namenode.
- Architecture:
+ The namenodes must use highly available shared storage to share the edit log.
+ The active namenode writes to the log; the standby reads it.
+ Datanodes must send block reports to both namenodes, because the block mappings are kept in memory, not on disk.
+ The secondary namenode’s role is subsumed by the standby, which takes periodic checkpoints of the active namenode’s namespace.
- For the highly available shared storage (edit log), there are two options: an NFS filer, or a quorum journal manager (QJM).
- QJM is recommended because it runs as a group of journal nodes, and each edit must be written to a majority of them. Typically there are three journal nodes, so the system can tolerate the loss of one.
- A failover controller manages the transition between the active and standby namenodes, using ZooKeeper by default.
- Each namenode runs a lightweight failover controller process.
- Failover may also be initiated manually by an administrator
- Fencing means killing the failed namenode’s process to prevent it from doing any damage and causing corruption (a namenode that has failed may still be writing to the edit log).
- There are many ways to fence: revoking access to the shared log, disabling its network port, or shutting down the physical server...
- The HDFS URI uses a logical hostname that is mapped to a pair of namenode addresses, so client failover is handled transparently.
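The QJM majority rule above is plain quorum arithmetic: with n journal nodes, each edit must reach n/2 + 1 of them, so the ensemble tolerates (n - 1)/2 failures. A quick sketch (helper names are my own):

```java
// Quorum arithmetic for the quorum journal manager (QJM).
public class QuorumMath {
    // Writes must reach a majority of the journal nodes.
    static int majority(int journalNodes) {
        return journalNodes / 2 + 1;
    }

    // Failures the ensemble can tolerate while still forming a majority.
    static int tolerated(int journalNodes) {
        return journalNodes - majority(journalNodes); // = (n - 1) / 2
    }

    public static void main(String[] args) {
        System.out.println(tolerated(3)); // 1: the typical three-node setup
        System.out.println(tolerated(5)); // 2
    }
}
```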
The Design of HDFS
- Very large files
- Streaming data access (write once, read many times)
- Commodity hardware
HDFS is not a good fit for applications that require:
+ Low-latency data access (need quick response time)
+ Lots of small files: information of file is stored in namenode. Each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory.
+ Multiple writers, arbitrary file modifications: There is no support for multiple writers or for modifications at arbitrary offsets in the file.
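The small-files memory estimate above can be checked with back-of-the-envelope arithmetic: one million files, each one block, at roughly 150 bytes per file object and 150 bytes per block object. The helper below is my own illustration of that rule of thumb:

```java
// Rough namenode heap estimate for the small-files problem:
// each file, directory, and block costs about 150 bytes of namenode memory.
public class NamenodeMemory {
    static final long BYTES_PER_OBJECT = 150;

    static long estimateBytes(long files, long blocksPerFile) {
        // One inode object per file plus its block objects.
        return files * BYTES_PER_OBJECT * (1 + blocksPerFile);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(1_000_000, 1);
        // 300,000,000 bytes (~286 MiB), i.e. the "at least 300 MB" from the text.
        System.out.println(bytes);
    }
}
```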