break

Hadoop Distributed File System Concrete Architecture

Hadoop Distributed File System contains 114 java files. After analyzing these files, I concluded the HDFS concrete architecture can be represented in three main subsystems: Server, Protocol and Tools. Additionally, it contains other components such as: HDFS Policy Provider, Checksum Distributed File System, Distributed File System Client, Distributed File System Util, HFTP File System, and HSFTP File System. Figure 1 illustrates the concrete architecture of the Hadoop Distributed File System.

HDFS
Figure 1: Hadoop Distributed File System Concrete Architecture

3.2.1 Server
Hadoop Core master-slave architecture is clearly represented in the server subsystem. It shows the interaction between the datanode and namenode components. Additionally, it shows the protocol both components use to communicate with one another. The server subsystem contains five components which are explained in sections 3.2.1.1 – 3.2.1.5.

3.2.1.1 Protocol
The protocol component contains the datanode protocol: a protocol that a DFS datanode uses to communicate with the namenode. The datanode protocol sends a heartbeat to tell the namenonde that a datanode is still alive, with some status information appended. It also determines actions such as: transfer blocks to another datanode, invalidate blocks, shutdown a node, request a block recovery, and others.
The protocol component also handles the instructions sent to a datanode regarding some blocks under its control. It tells the datanode either to invalidate a set of blocks, or to copy a specific set of blocks into another datanode.

3.2.1.2 Balancer
Distribution of blocks across datanodes can become unbalanced [6]. An unbalanced cluster relies heavily on highly utilized datanodes. The balancer tool, re-distributes blocks by balancing disk space usage on a HDFS cluster. It moves over-used datanodes to under-used datanodes, while placing block replicas on different racks. In addition, the balancer runs until the cluster is balanced, it cannot move any more blocks, or loses contact with the namenode [6].

3.2.1.3 Common
The Common component provides internal constants for HDFS. Also, it throws exceptions when a file system is inconsistent and is not recoverable. Additionally, it contains common classes for storage information. It stores the type of node (namenode or datanode), storage layout version, and the file system creation time. Storage can reside in multiple directories, yet each directory contains the same version. Furthermore, the common component provides a common interface to upgrade namenode or datanode objects.

3.2.1.4 Datanode
The Datanode component stores a set of blocks for a DFS deployment. Moreover, it maintains a map from a block to its metadata. A datanode communicates regularly with a single namenode. It can also communicate with other datanodes. One deployment can have one or many datanodes.
The datanode allows a client to read blocks, or write new block data. When instructions are received from the namenode it may delete or copy blocks from other datanodes. Blocks are stored on a local disk. When a server starts, the datanode reports the table of contents to the namenode. However, it is also capable of maintaining various statistics of the blocks. Additionally, the datanodes maintain an open server socket so that client code or other datanodes can read or write data.

3.2.1.5 Namenode
The namenode subsystem manages the file system namespace and controls access by external clients. Unless there is a second backup namenode, usually there is a single namenode running in any DFS deployment. It contains key components such as:
• INode Directory: keeps an in-memory representation of the block hierarchy.
• Log Manager: reads and writes log data from storage
• File System Directory: handles the writing and loading values to the disk, and logs the changes as it happens.
• Secondary Namenode: a helper to the primary namenode. It is responsible for supporting periodic checkpoints of the HDFS metadata. In a HDFS cluster, only one secondary namenode is allowed.
• File System Image: stores all information about the file system namespace.

3.2.2 Protocol
The protocol subsystem allows a user to communicate with a namenode. Also, it allows users to finds lost blocks, check for quota and file exceptions. The protocol subsystem contains the following components:
• Exceptions: manages exceptions such as: when a user wants to create a file that is being created but is not closed yet, disk space and namespace quota is exceeded, datanodes that are not previously registered try to access namenodes, and others.
• Client Protocol: provides a protocol for block recovery. Additionally, it allows users to manipulate the directory namespace, and open and close file streams.
• Blocker Reporter: it reports where to find a collection of blocks and its file length.
• Data Transfer Protocol: streaming protocol used by the client to transfer data to and from the datanode.

3.2.3 Tools
The tools subsystem provides administrative access to the HDFS, and provides a rudimentary tool for check DFS volumes for errors and sub-optimal conditions. It contains two components:
• DFS Admin: reports how the file system is doing. It allows administrators to put a cluster in safe mode, generate a list of datanodes, and decommission datanodes [1].
• DFS Volume Check: scans all files and directories starting from an indicated root path. It detects and handles abnormal conditions. Additionally, it is able to collect DFS statistics, and can print statistics on block locations and replication factors of each file.

3.2.4 Additional HDFS Components
Other components found in the HDFS concrete architecture include:
• HDFS Policy Provider: provides the HDFS definitions and protocols for the security in effect.
• Checksum Distributed File System: creates a checksum file for each raw file. It generates and verifies checksums at the client side.
• Distributed File System Client: component that connects to a Hadoop file system and performs basic file tasks such as: rename, delete, set permissions, set file or directory owner, set or reset quotas, and others. It uses client protocol to communicate with a Namenode daemon, and connects directly to datanodes to read and write block data.
• Distributed File System Util: a utility to verify whether a pathname is valid. It prohibits relative paths, or names that contain a “:” or “/”.
• HFTP File System: provides the implementation of a protocol for accessing file systems over Hyper Text Transfer Protocol (HTTP).
• HSFTP File System: provides the implementation of a protocol for accessing file systems over Hyper Text Transfer Protocol Secure (HTTPS).

Leave a Comment

Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.

CAPTCHA Image
Reload Image