Presentations
- The exabyte club: LinkedIn’s journey of scaling the Hadoop Distributed File System.
- Scaling HDFS with Consistent Reads from Standby Replicas. (Video)
- HDFS Selective Wire Encryption.
- HDFS Scalability and Consistent Reads from Standby Node. (Video)
- Scaling Hadoop at LinkedIn. (Video)
- HDFS Truncate: Evolving Beyond Write-Once Semantics. (Video)
- HDFS for Geographically Distributed File System. (Video)
- Coordinating Metadata Replication: Survival Strategy for Distributed Systems. (Video)
- Apache Hadoop: Foundations of Scalability.
- Hadoop and HBase on the Cloud: A Case Study on Performance and Isolation.
- Apache HDFS: Distributed Storage for Vast Quantities of Data. (Podcast)
- HDFS Design Principles and the Scale-out-Ability of Distributed Storage.
- Apache Hadoop 0.22 and Other Versions.
- Automatic-Hot HA for HDFS NameNode.
- Hadoop Gateway: Cluster Virtualization Framework.
- Distributed Computing with Apache Hadoop. Introduction to MapReduce.
- Distributed Computing with Apache Hadoop. Technology Overview.
- Scaling Storage and Computation with Hadoop. (Video, in Russian)
HDFS texts
- The Hadoop Distributed File System.
- Apache Hadoop: the Scalability Update.
- Scalability of the Hadoop Distributed File System. (html)
- Scaling Hadoop to 4000 nodes at Yahoo!
- The Hadoop Distributed File System requirements.
Favorite Issues
- NameNode Fine-Grained Locking via Metadata Partitioning. HDFS-14703.
- Consistent Reads from Standby Node. HDFS-12943.
- Segmented block reports proposal. HDFS-11313.
- Interleaving block reports race. HDFS-10301.
- HDFS truncate. HDFS-3107. Snapshot support for truncate. HDFS-7056.
- Coordinated replication of the namespace using ConsensusNode. HDFS-6469.
- Introduce Coordination Engine. HADOOP-10641.
- Warm HA NameNode going Hot. HDFS-2064.
- Stress Test and Live Data Verification (S-Live) design. HDFS-708.
- Sequential generation of block ids. HDFS-898.
- Appending to an HDFS file. HDFS-265.
- BackupNode maintains the up-to-date state of the namespace by receiving edits from the NameNode. HADOOP-4539.
- DFSIO - a MapReduce-based benchmark that measures the performance of writes, appends, and sequential and random reads. MAPREDUCE-4651, HDFS-663, HADOOP-193.
- Slot utilization measures the actual job load on a MapReduce cluster and characterizes the overall cluster productivity. The utilization is measured by analyzing job history logs. HDFS-459.
- File size distribution analysis. HDFS-461.
- Quadruple memory size reduction for the name-node by redesigning memory data structures (HADOOP-1687) and removing checksum files from the name-node (HADOOP-1134).
- Distributed cluster upgrade framework. HADOOP-1286.
- Faster cluster startup. HADOOP-3022.
- File system snapshots. A snapshot of the previous state of the file system is taken during software upgrades in order to avoid data loss caused by software bugs or administrator mistakes. HADOOP-702.
- NNThroughputBenchmark - a pure name-node benchmark. HADOOP-2149, HADOOP-3860.
- Chain reaction caused by simultaneous failure of a few DataNodes. HADOOP-572.
- Safe mode is a read-only state of the name-node. HADOOP-306.
- Integrity of HDFS cluster components. HADOOP-124.