
[Experience Sharing] How To Set up Hadoop on OS X Lion 10.7 (repost)

  If you are a software engineer just starting out, chances are good that knowing MapReduce inside and out is as important now as knowing how to configure a LAMP stack was in the last decade. Most developers will therefore want a local instance to learn and experiment on, without having to go down the route of virtualization.
  Although there are a lot of competing MapReduce implementations out there, Apache Hadoop is the leader, with most PaaS vendors such as Amazon and Microsoft supporting it.
  Setting up Apache Hadoop on Mac OS X follows a similar pattern to the official Apache single-node documentation, but there are some bugs and OS X Lion-specific configuration that could trip you up. Here is a quick tutorial (with the gotcha configuration changes you need to make until the bugs are fixed by Apache) to get you started. If you have any updates or suggestions, please drop me a line and I'll update.

Getting Java
  Mac OS X no longer provides Java out of the box, but installing it is fairly easy.

Option 1: From UNIX Command Line
  Just check your Java version on a command line, which will prompt OS X to ask if you’d like to install Java.

$ java -version
  

Option 2: Get it from Apple website
  You can also download it directly from Apple by visiting here: http://support.apple.com/kb/dl1421

Getting Hadoop

Setting up your environment
  Some people like putting Hadoop under ~/Library/Hadoop. That's fine, but I am used to the /usr/local/ convention of the *nix world, so I'll use that for $HADOOP_HOME. Adjust as appropriate.
  Edit your .bash_profile and insert the following:

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$PATH:$HADOOP_HOME/bin
  Note that I have specified JAVA_HOME to point to a command that dynamically finds the correct Java in your OS X environment. This can be done both in your .bash_profile and in hadoop-env.sh in your configuration. I recommend it so that any changes Apple (or perhaps Oracle, once Apple gets out of the business of providing Java altogether) makes in future updates do not break your Java configuration.
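  To sanity-check the environment (a small sketch; the JDK path that gets printed will vary with whatever Java you have installed), reload your profile and confirm the variables resolve:

$ source ~/.bash_profile
$ /usr/libexec/java_home     # prints the path of the JDK it resolved
$ echo $JAVA_HOME            # should print the same path
$ echo $HADOOP_HOME          # should print /usr/local/hadoop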

Download Hadoop from command line

$ cd /usr/local/
$ mkdir hadoop
$ wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u1.tar.gz
$ tar xzvf hadoop-0.20.2-cdh3u1.tar.gz
$ mv hadoop-0.20.2-cdh3u1 ./hadoop
Configuring Hadoop for OS X (and fixing some bugs)
  Once installed, there are three configuration files you'll want to edit. Learning what these files do in general is left up to the reader, but this will get you up to speed quickly.
  We will set up the following single node configuration:


  • sets the default file system to an HDFS instance
  • sets the path on the local filesystem that the Hadoop daemons will use for persistence to something accessible by you
  • sets HDFS configuration so that HDFS will only try to store one copy of each file
  • sets MapReduce properties to define the number of map and reduce slots that will be available on your box (you can tune these depending on your system resources)

Configuring: hadoop-env.sh
  In your command window, open the environment configuration file. You won't want to change much here, but a few settings will help ensure Hadoop runs right the first time and every time. I recommend making these changes.

vi /usr/local/hadoop/conf/hadoop-env.sh
  Uncomment the JAVA_HOME line and point it at the command that dynamically resolves your Java location, as discussed above:

# The java implementation to use. Required.
export JAVA_HOME=$(/usr/libexec/java_home)
  Next, uncomment HADOOP_HEAPSIZE and set it to 2000. This is optional but recommended:

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
IMPORTANT: Fix Configuration Files To Get Around Lion-Specific Problems
  OS X Lion introduced a bug that many people experience when first initializing their name node storage. It typically appears as this error:
  “Unable to load realm info from SCDynamicStore”
  This error is tracked as Apache bug HADOOP-7489. Readers may want to check whether it has been fixed before applying the workaround below.
  To fix this issue, simply add the following to your hadoop-env.sh file:

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
  To sum up, your hadoop-env.sh should have the following defined:

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HEAPSIZE=2000
  With that file ready to go, let’s move on to configuring your hdfs and map reduce XML files.

Configuring: core-site.xml
  A change from previous versions of Apache Hadoop is that instead of putting all the configuration for your Hadoop instance into one XML file (hadoop-site.xml), you now have three configuration files to edit. This separation of concerns is a good decision, but it means some extra work for us. First up is the core-site.xml file.
  As stated, you need to pick a good place for your single-node HDFS storage and set the location of the master HDFS instance. I also chose to dynamically inject the username into the temp directory in order to keep track of which account is writing to the HDFS store. This is good practice if you plan on running a local service account (or a few) to test different scenarios, though it's not necessary. Keep in mind that whichever tmp directory you point to, the service account you are using (or your own account) will need write access to it (see the sketch right after the file below).
  Your file should look something like this:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/tmp/hadoop/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
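  Since that temp path lives under /usr/local, it usually has to exist before the format step and be writable by the account running Hadoop. A minimal sketch, assuming you run Hadoop under your own login rather than a dedicated service account:

$ sudo mkdir -p /usr/local/tmp/hadoop
$ sudo chown -R $(whoami) /usr/local/tmp/hadoop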
Configuring: hdfs-site.xml
  Next, the hdfs-site.xml file configures HDFS itself. Since we are running a single-node cluster on our Mac, we want HDFS to store only one copy of each file:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Configuring: mapred-site.xml
  Next we configure the MapReduce engine itself. We specify the job tracker location (usually just your HDFS port + 1, but any open port will do) and set the maximum number of map and reduce tasks that can be spawned. Tune these to the size and speed of your system; I specified 2 here.

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
Set Up HDFS For The First Time
  We are almost done here, but one final step is to format the HDFS instance we’ve specified. Since we’ve already squashed the nasty SCDynamicStore bug in your hadoop-env.sh file, this should work without issue. This is also a great way to test if the account you are running hadoop as actually has access to all the required directories.

$ $HADOOP_HOME/bin/hadoop namenode -format
  You should see output like the following:

Brandons-MacBook-Air:local bbjwerner$ hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.
11/10/23 00:30:26 INFO namenode.NameNode: STARTUP_MSG:
Re-format filesystem in /usr/local/tmp/hadoop/hadoop-bbjwerner/dfs/name ? (Y or N) Y   <-- NOTE: You have to use a capital "Y" here. Dumb script.
11/10/23 00:30:28 INFO util.GSet: VM type       = 64-bit
11/10/23 00:30:28 INFO util.GSet: 2% max memory = 39.83375 MB
11/10/23 00:30:28 INFO util.GSet: capacity      = 2^22 = 4194304 entries
11/10/23 00:30:28 INFO util.GSet: recommended=4194304, actual=4194304
11/10/23 00:30:28 INFO namenode.FSNamesystem: fsOwner=bbjwerner
11/10/23 00:30:28 INFO namenode.FSNamesystem: supergroup=supergroup
11/10/23 00:30:28 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/10/23 00:30:28 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
11/10/23 00:30:28 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
11/10/23 00:30:28 INFO namenode.NameNode: Caching file names occuring more than 10 times
11/10/23 00:30:29 INFO common.Storage: Image file of size 115 saved in 0 seconds.
11/10/23 00:30:29 INFO common.Storage: Storage directory /usr/local/tmp/hadoop/hadoop-bbjwerner/dfs/name has been successfully formatted.
11/10/23 00:30:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Brandons-MacBook-Air.local/10.0.1.31
  With this complete, your setup of Hadoop is ready! Now all we have to do is run a simple test to make sure it all works!

Start Up Hadoop With The Included Scripts
  You used to have to start each part of Hadoop individually (datanode, namenode, jobtracker, tasktracker), but the distribution now includes a script that starts all the services at once.

$ $HADOOP_HOME/bin/start-all.sh
  You will see each service start up. If there are no errors, you are ready to move on to testing out your Hadoop instance!
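  One quick way to confirm everything actually came up (a minimal sketch; jps ships with the JDK, and the exact list can vary) is to list the running Java daemons and ask HDFS for a report:

$ jps                        # should show NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker
$ hadoop dfsadmin -report    # should show one live datanode for this single-node setup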

Run Hadoop with the example JAR files included in the distribution
  To exercise your single node, run a quick job from the command line. To see the example programs that ship with Hadoop, run the following command:

$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar
  You will see a list of different example programs. The easiest is pi, which you can run like this:

$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 100
  You should see output like the following:

Number of Maps  = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
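  Beyond pi, a natural next test is wordcount against some real files. A minimal sketch, assuming the daemons are running and using input/output paths chosen only for illustration (the output directory must not already exist):

$ hadoop fs -mkdir input
$ hadoop fs -put $HADOOP_HOME/conf/*.xml input
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input output
$ hadoop fs -cat output/part-* | head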
Congratulations!
  You now have a single node Hadoop on OS X Lion. Happy Hacking!






hadoop, OSX, Apple






Who Am I?


  I am Brandon Werner. I love good friends, good coffee, and good ideas shared around a room. I work for Microsoft helping build the next identity platform in the cloud for Azure.





















Comments









  • Hiroshi@gmail.com

    22 Dec 2011 4:36 PM

      Thank you for the great instruction! A couple of comments for your attention:
      1) $ mv hadoop-0.20.2-cdh3u1 ./hadoop : three steps earlier in your instructions you have already created the hadoop directory, so this command moves hadoop-0.20.2-cdh3u1/ under ./hadoop.
      2) During formatting the namenode and running start-all.sh, I hit many permission errors like: localhost: mkdir: /usr/local/hadoop/bin/../logs: Permission denied
      How do you set up your account on Mac (Lion)? I used "sudo" with the root password during the installation. /usr/local is protected on my Mac.
      Thank you again for your help.
      -Hiroshi











  • Brandon Werner

    27 Dec 2011 7:14 PM

      It may be best just to sudo su and run the entire process as root to ensure anything spawned during the installation also has permission to the directory. Lion has done a lot of confusing things to permissions in the unix directories to "protect" users, so your best bet is to take ownership of the entire /usr/local/ directory recursively and then set 775 on it.
      There is no reason in my mind why /usr/local/ shouldn't be under ownership of the user in a single user machine.
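      A minimal sketch of that suggestion (youruser is a placeholder for your own login; adjust the group if yours differs):

      $ sudo chown -R youruser:staff /usr/local
      $ sudo chmod -R 775 /usr/local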









  • Will L

    30 Dec 2011 9:42 AM

      Hello, have you been able to get the Hadoop Eclipse Plugin to work on OS X Lion? For some reason, in Eclipse 3.7.1 with Hadoop 0.20.205.0 the Eclipse plugin cannot connect to the DFS and gives a "Failed to login" error. What I don't understand is why Hadoop is starting to deprecate the Hadoop Eclipse Plugin. I tried building Hadoop 1.0.0's Eclipse plugin from the source directory, but it doesn't seem to generate any jar files.
      Thank you again for your help!









  • Ryan

    9 Jan 2012 6:01 PM

      Hi Will, it looks like the plugin is missing some jar files and I think the manifest may be off too.

