Configuring the Hadoop Daemons: covers many important settings beyond the handful of basic ones, such as mapred.tasktracker.{map|reduce}.tasks.maximum, dfs.hosts/dfs.hosts.exclude, and mapred.hosts/mapred.hosts.exclude.
Real-World Cluster Configurations: gives reference configurations from real sort benchmarks, namely 9TB of data sorted on a cluster with 900 nodes, and 14TB of data sorted on 1400 nodes and 20TB of data sorted on 2000 nodes.
Task Controllers: besides the default DefaultTaskController there is a second controller, LinuxTaskController, and this section explains how to configure and use it. LinuxTaskController guarantees that “except the job owner and tasktracker, no other user can access any of the local files/directories including those localized as part of the distributed cache”, which further tightens security.
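To make the include/exclude pairs (dfs.hosts/dfs.hosts.exclude, mapred.hosts/mapred.hosts.exclude) more concrete, here is a minimal Python sketch of how such a pair of host lists could be interpreted. It is a toy model under stated assumptions, not Hadoop's implementation: the function names, the status labels, and the host names are all hypothetical, and the rule that an empty include list permits every host is an assumption taken from the usual description of these properties.

```python
# Toy model of include/exclude host lists; NOT Hadoop's actual code.
# All names below (parse_host_file, classify_host, node1...) are hypothetical.


def parse_host_file(text):
    """Return the set of host names in a hosts file (one per line, blanks ignored)."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def classify_host(host, include, exclude):
    """Classify a host under the assumed include/exclude semantics.

    - An empty include set means every host is permitted (assumption).
    - A host missing from a non-empty include set is rejected.
    - A permitted host that is also excluded is marked for decommissioning.
    """
    permitted = not include or host in include
    if not permitted:
        return "rejected"
    if host in exclude:
        return "decommission"
    return "allowed"


# Hypothetical file contents; the real files are whatever the
# dfs.hosts and dfs.hosts.exclude properties point to.
include = parse_host_file("node1\nnode2\nnode3\n")
exclude = parse_host_file("node3\n")

for host in ["node1", "node3", "node9"]:
    print(host, classify_host(host, include, exclude))
```

Keeping both lists separate lets an administrator retire a node gradually: it stays in the include list while it is listed in the exclude list, instead of being cut off at once.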
2. Guides
“This document describes how to configure Hadoop HTTP web-consoles to require user authentication. By default Hadoop HTTP web-consoles (JobTracker, NameNode, TaskTrackers and DataNodes) allow access without any form of authentication.”
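This guide explains how to configure authentication for browsing Hadoop cluster information through the web consoles, since by default anyone can view them without authenticating.
3. MapReduce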
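Example: WordCount v1.0 (a fairly old version of the wordcount program). Its Walk-through explains in detail the inputs and outputs for file0 and file1 at each stage of the map, combine, and reduce process.
User Interfaces: this part goes into a reasonable amount of detail from the user's point of view. "This should help users implement, configure and tune their jobs in a fine-grained manner." It also points out that the javadoc remains the most complete reference, and it is what you will actually consult while writing code.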
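The walk-through can be mimicked outside Hadoop with a short Python simulation. This is only a sketch of the map, combine, and reduce data flow, not the tutorial's Java code: the function names are hypothetical, and the contents of file0 and file1 are assumed to be the two sample lines used in the tutorial.

```python
from collections import Counter

# Plain-Python simulation of the WordCount data flow; it does not use
# Hadoop. Function names and the input dictionary are hypothetical.


def map_phase(text):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Locally sum the counts produced by ONE map task, sorted by key."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())


def reduce_phase(combined_outputs):
    """Sum the per-map partial counts into global totals, sorted by key."""
    totals = Counter()
    for pairs in combined_outputs:
        for word, n in pairs:
            totals[word] += n
    return sorted(totals.items())


# Assumed sample inputs, one map task per file.
inputs = {
    "file0": "Hello World Bye World",
    "file1": "Hello Hadoop Goodbye Hadoop",
}

combined = [combine(map_phase(text)) for text in inputs.values()]
result = reduce_phase(combined)

for word, count in result:
    print(word, count)
```

The combiner step shows why it is worth configuring: the first map task sends three pairs to the reducer instead of four, because the two World entries are summed locally first.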
The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit.
The Checkpoint node periodically creates checkpoints of the namespace. It downloads fsimage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode.
The Backup node provides the same checkpointing functionality as the Checkpoint node, as well as maintaining an in-memory, up-to-date copy of the file system namespace that is always synchronized with the active NameNode state.
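The checkpointing idea shared by these three node types, merging the fsimage with the edits log to produce a new image, can be sketched as a toy model in Python. Everything here is an illustrative assumption: the namespace is reduced to a set of paths, the edits log to a list of (operation, path) tuples, and the function names are hypothetical. The real on-disk formats and RPC steps are not modeled.

```python
# Toy model of a checkpoint: new_image = fsimage + replayed edits.
# The data structures and names are hypothetical, not HDFS internals.


def apply_edit(namespace, edit):
    """Apply one logged operation to a mutable set of paths."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError("unknown operation: " + op)


def checkpoint(fsimage, edits):
    """Replay the edits on a copy of the image.

    Returns the merged image and an empty edits log, which is how the
    edits log is kept from growing without bound.
    """
    merged = set(fsimage)
    for edit in edits:
        apply_edit(merged, edit)
    return frozenset(merged), []


# Hypothetical starting state.
fsimage = frozenset({"/a", "/b"})
edits = [("create", "/c"), ("delete", "/a")]

new_image, new_edits = checkpoint(fsimage, edits)
print(sorted(new_image), new_edits)
```

In this model a Backup node would differ only in that it applies each edit to its in-memory copy as it arrives, so it never needs to download the image and edits before merging.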