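1. Drawbacks of HDFS with large numbers of small files
Normally, files in HDFS (e.g. hdfs://node14:9000/user/hadoop/inputDir) are stored as blocks, and the metadata for every block is held in the NameNode's memory. As a result, a large number of small files can eat up a lot of memory on the NameNode.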
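To get a feel for the scale, here is a rough back-of-the-envelope estimate. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per file or block object; that figure is an approximation, is not taken from these notes, and varies between Hadoop versions. The function name estimate_namenode_bytes is made up for illustration.

```shell
#!/bin/sh
# Rough NameNode memory estimate for many small files.
# ASSUMPTION: ~150 bytes per namespace object (file or block); rule of thumb only.
BYTES_PER_OBJECT=150

# estimate_namenode_bytes FILES BLOCKS_PER_FILE
# Each file costs one file object plus one object per block it occupies.
estimate_namenode_bytes() {
    echo $(( $1 * (1 + $2) * BYTES_PER_OBJECT ))
}

# Ten million small files, one block each.
estimate_namenode_bytes 10000000 1
```

Under that assumption, ten million one-block files need about 3 GB of NameNode memory, no matter how few bytes each file actually holds.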
2. Hadoop Archives
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing NameNode memory usage while still allowing transparent access to the files. In particular, a Hadoop Archive can be used as input to MapReduce.
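As a sketch of how that might look in practice, the script below packs the inputDir directory mentioned earlier into a HAR, lists it through the har:// scheme, and uses it as wordcount input. The archive name files.har and the destination /user/hadoop/archives are hypothetical, and the -p (parent path) form of hadoop archive may not exist on very old releases. The commands are only printed by default so the script can be tried without a cluster.

```shell
#!/bin/sh
# Sketch: pack a directory of small files into a HAR and read it back.
# ASSUMPTIONS: files.har and /user/hadoop/archives are hypothetical names;
# inputDir comes from the HDFS path used earlier in these notes.
# Commands are printed by default; set DRY_RUN=0 to run them on a real cluster.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

har_demo() {
    # Create the archive (on a real cluster this launches a MapReduce job).
    run hadoop archive -archiveName files.har -p /user/hadoop inputDir /user/hadoop/archives
    # List the archived files through the har:// filesystem scheme.
    run hadoop fs -ls har:///user/hadoop/archives/files.har/inputDir
    # Use the archived directory as MapReduce input.
    run hadoop jar hadoop-examples.jar wordcount har:///user/hadoop/archives/files.har/inputDir output
}

har_demo
```

Creating the archive does not delete the original files, so the NameNode memory is only reclaimed once the originals are removed.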
hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar -archives myarchive.zip input output
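Here the -archives generic option ships an ordinary archive to the tasks (a separate mechanism from HAR files): myarchive.zip is placed and unzipped into a directory named "myarchive.zip".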
hadoop jar hadoop-examples.jar wordcount -files dir1/dict.txt#dict1,dir2/dict.txt#dict2 -archives mytar.tgz#tgzdir input output
Here the files dir1/dict.txt and dir2/dict.txt can be accessed by tasks under the symbolic names dict1 and dict2 respectively, and the archive mytar.tgz is placed and unarchived into a directory named tgzdir.