Configuring a Hadoop Environment on Ubuntu

        Today I visited the graduate lab at school to look at their work on video big-data processing, which relies on Hadoop. I do not know much about distributed development, but it really interests me. The second-year grad student is very kind and is willing to let me join the project. There is definitely some pressure, but I will push on.

        Back to the topic. The setup breaks down into two parts:

1. JDK setup. See the separate JDK configuration notes for details.

2. Hadoop installation. Download: http://labs.xiaonei.com/apache-mirror/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz

    1. Install Java and SSH
      On Ubuntu, apt-get makes installing the JDK and ssh straightforward. Ubuntu generally ships with the ssh client but not the server; "apt-get install ssh" installs the server, and "/etc/init.d/ssh start" starts it.
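      For example, on Ubuntu 9.10 the Sun JDK lives in the sun-java6-jdk package (the exact package name varies by release, so treat this as a sketch); as root:
#apt-get install sun-java6-jdk
#apt-get install ssh
#/etc/init.d/ssh start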
      2. Create the hadoop group and hadoop user
 #addgroup hadoop
 #adduser --ingroup hadoop hadoop
      3. Configure ssh
Switch to the hadoop user:
#su - hadoop
Generate a key pair with an empty passphrase:
hadoop@ubuntu:~$ssh-keygen -t rsa -P ""
Add the public key to the authorized keys so that ssh logins to this machine need no password:
hadoop@ubuntu:~$cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
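You can verify the key works by ssh-ing to localhost; after confirming the host fingerprint on the first connection, no password prompt should appear:
hadoop@ubuntu:~$ssh localhost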
      4. Install Hadoop
Hadoop needs no installer; unpack it and it is ready to use. Run the following as root (adjust the filename to whichever version you actually downloaded):
#cd /usr/local
#tar xzf hadoop-0.20.2.tar.gz
#mv hadoop-0.20.2 hadoop
#chown -R hadoop:hadoop hadoop
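As a quick sanity check that the unpacked tree works (the path assumes the layout created above):
#/usr/local/hadoop/bin/hadoop version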
      5. Configure Hadoop
      Open conf/hadoop-env.sh; only one line needs to change. Replace "#export JAVA_HOME=/usr/lib/j2sdk1.5-sun" with "export JAVA_HOME=/usr/lib/jvm/java-6-sun" (note that the leading # is removed). The exact path depends on which Java you installed; the Ubuntu 9.10 repositories ship Java 1.6.
      Next, edit core-site.xml and fill in the content below. The directory /usr/local/hadoop-datastore/hadoop-hadoop must already exist (create it yourself if it does not), and its owner must be changed to the hadoop user; ${user.name} is a Java system property that Hadoop expands to the name of the user running the daemons, so for the hadoop user it resolves to hadoop-hadoop. Be sure the hadoop user owns this directory, and preferably only this directory, since granting more than necessary tends to cause problems (the commands after the XML show one way to set this up):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<!-- Put site-specific property overrides in this file. -->
 
<configuration>
 
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop-datastore/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
 
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
 
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
 
</configuration>
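       As promised above, here is a minimal way to create the hadoop.tmp.dir base directory and hand it to the hadoop user, run as root (the path matches the value configured above):
#mkdir -p /usr/local/hadoop-datastore/hadoop-hadoop
#chown -R hadoop:hadoop /usr/local/hadoop-datastore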
       Then edit mapred-site.xml and enter the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

</configuration>
       The article I followed apparently put both of these blocks into a single hadoop-site.xml. Hadoop 0.20.0 and later no longer use that file; it has been split into core-site.xml, hdfs-site.xml, and mapred-site.xml (strictly speaking, dfs.replication belongs in hdfs-site.xml under the new layout, though it is still picked up from core-site.xml). If you dump all of this into core-site.xml, things break: the TaskTracker and JobTracker refuse to start, and the log records this error:
2009-10-31 21:43:28,399 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.RuntimeException: Not a host:port pair: local
       This error stumped me for quite a while; by chance I came across a webpage saying the second block must go into mapred-site.xml, and sure enough that fixed it ^_^
      6. Format the NameNode
hadoop@ecy-geek:/usr/local/hadoop/bin$ ./hadoop namenode -format
09/10/31 23:30:10 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ecy-geek/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.1
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 810220; compiled by 'oom' on Tue Sep  1 20:55:56 UTC 2009
************************************************************/
Re-format filesystem in /usr/local/hadoop-datastore/hadoop-hadoop/dfs/name ? (Y or N) y
Format aborted in /usr/local/hadoop-datastore/hadoop-hadoop/dfs/name
09/10/31 23:30:16 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ecy-geek/127.0.1.1
************************************************************/
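       Notice the "Format aborted" line above: the confirmation prompt is case-sensitive, and anything other than an uppercase Y (including the lowercase y typed here) aborts the format. Run the command again and answer Y at the prompt:
hadoop@ecy-geek:/usr/local/hadoop/bin$ ./hadoop namenode -format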
     7. Run Hadoop
hadoop@ecy-geek:/usr/local/hadoop/bin$ ./start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-ecy-geek.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-ecy-geek.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-ecy-geek.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-ecy-geek.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-ecy-geek.out
     As you can see, the namenode, datanode, secondarynamenode, jobtracker, and tasktracker are all running; jps lists the corresponding processes.
hadoop@ecy-geek:/usr/local/hadoop/bin$ jps
21581 NameNode
21975 SecondaryNameNode
22238 TaskTracker
22477 Jps
22053 JobTracker
21777 DataNode
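When you are done, the matching stop script shuts all five daemons down again:
hadoop@ecy-geek:/usr/local/hadoop/bin$ ./stop-all.sh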
     That is about it. Hadoop provides handy web UIs for checking on things, at the following addresses:
http://localhost:50030/ - web UI for MapReduce job tracker(s)
http://localhost:50060/ - web UI for task tracker(s)
http://localhost:50070/ - web UI for HDFS name node(s)
      Now you can run MapReduce jobs, and you can also work with the distributed file system through the DFS shell. A pity it is only a single machine; with a rack of them it would be quite a sight.
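A few things to try: the DFS shell commands below stage a file into HDFS, then run the wordcount example from the jar that ships with the release (the jar name matches the 0.20.2 download; adjust it to your version):
hadoop@ecy-geek:/usr/local/hadoop$ bin/hadoop fs -mkdir input
hadoop@ecy-geek:/usr/local/hadoop$ bin/hadoop fs -put conf/core-site.xml input
hadoop@ecy-geek:/usr/local/hadoop$ bin/hadoop fs -ls input
hadoop@ecy-geek:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output
hadoop@ecy-geek:/usr/local/hadoop$ bin/hadoop fs -cat output/*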

One problem I ran into:

hadoop@ubuntu:/usr/local/hadoop$ bin/start-all.sh
mkdir: cannot create directory `/usr/local/hadoop/bin/../logs': Permission denied
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-ubuntu.out
/usr/local/hadoop/bin/hadoop-daemon.sh: line 117: /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-ubuntu.out: No such file or directory

Fix: change the ownership of the hadoop directory so the hadoop user can access the files in it: sudo chown -hR hadoop /usr/local/hadoop

The root cause is simply that the hadoop user lacked permissions.