Setting up Hadoop2-YARN in Pseudo-Distributed Mode
There are already plenty of articles online about setting up a YARN development environment. I consulted many of them while building mine and found that quite a few have problems; in particular, none gives a detailed account of how to run WordCount from Eclipse. This article is a summary of my own attempt at building a YARN pseudo-distributed development environment. Questions and discussion are welcome, thanks!
1. System environment
Memory: 3G
CentOS6.3 x86-64
jdk-6u37-linux-x64.bin
hadoop-2.0.2-alpha.tar.gz
The Java environment variables should already be configured.
2. Configure hosts, IP, and SSH authentication
[kevin@linux-fdc ~]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain
::1 localhost6 localhost6.localdomain6
192.168.81.251 linux-fdc.tibco.com linux-fdc
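The section title mentions SSH authentication, but the original does not show the commands. The standard passwordless-login setup for the local host looks like this (a sketch; key type and paths are the usual defaults):

```shell
# Generate an RSA key pair for the hadoop account, with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Authorize the public key for logins to this same host
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Verify: this should log in and exit without prompting for a password
ssh localhost exit
```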
3. Create a Hadoop account
(1) Create the group and account (the group must exist before `useradd -g` can use it)
groupadd kevin
useradd -g kevin -d /home/kevin -m kevin
(2) Set the password
passwd kevin
(3) Delete an account (if needed; see the built-in help)
userdel --help
groupdel --help
(4) View accounts and groups
cat /etc/group
cat /etc/passwd
4. Extract hadoop-2.0.2-alpha.tar.gz
Extract hadoop-2.0.2-alpha.tar.gz to /usr/custom/hadoop-2.0.2-alpha.
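A possible way to do this (assuming the tarball is in the current directory and you have root privileges; adjust the owner to your own account):

```shell
# Create the target directory and extract the distribution into it
mkdir -p /usr/custom
tar -zxvf hadoop-2.0.2-alpha.tar.gz -C /usr/custom
# Make the Hadoop account the owner so the daemons can write logs etc.
chown -R kevin:kevin /usr/custom/hadoop-2.0.2-alpha
```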
5. Configure the Hadoop environment variables
export HADOOP_HOME=/usr/custom/hadoop-2.0.2-alpha
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_LIB=$HADOOP_HOME/lib
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
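These exports can go in, for example, ~/.bashrc (an assumption; any login profile works). After sourcing it, a quick sanity check:

```shell
# Reload the profile and confirm the Hadoop binaries are on PATH
source ~/.bashrc
hadoop version          # should report Hadoop 2.0.2-alpha
echo $HADOOP_CONF_DIR   # should point at /usr/custom/hadoop-2.0.2-alpha/etc/hadoop
```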
6. Configure Hadoop
(1) core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
    <final>true</final>
    <description>The name of the default file system. A URI whose scheme and
    authority determine the FileSystem implementation. The uri's scheme
    determines the config property (fs.SCHEME.impl) naming the FileSystem
    implementation class. The uri's authority is used to determine the host,
    port, etc. for a filesystem.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/kevin/workspace-yarn/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>io.native.lib.available</name>
    <value>true</value>
    <description>Should native hadoop libraries, if present, be used.</description>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
    <final>true</final>
    <description>The size of buffer for use in sequence files. The size of
    this buffer should probably be a multiple of hardware page size (4096 on
    Intel x86), and it determines how much data is buffered during read and
    write operations.</description>
  </property>
</configuration>
(2) hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/kevin/workspace-yarn/dfs/name</value>
    <final>true</final>
    <description>Determines where on the local filesystem the DFS name node
    should store the name table (fsimage). If this is a comma-delimited list
    of directories then the name table is replicated in all of the
    directories, for redundancy.</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/kevin/workspace-yarn/dfs/data</value>
    <final>true</final>
    <description>Determines where on the local filesystem a DFS data node
    should store its blocks. If this is a comma-delimited list of directories,
    then data will be stored in all named directories, typically on different
    devices. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.namenode.edits.dir</name>
    <value>/home/kevin/workspace-yarn/dfs/edits</value>
    <description>Determines where on the local filesystem the DFS name node
    should store the transaction (edits) file. If this is a comma-delimited
    list of directories then the transaction file is replicated in all of the
    directories, for redundancy. Default value is the same as dfs.name.dir.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications
    can be specified when the file is created. The default is used if
    replication is not specified at create time.</description>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
    <description>If "true", enable permission checking in HDFS. If "false",
    permission checking is turned off, but all other behavior is unchanged.
    Switching from one parameter value to the other does not change the mode,
    owner or group of files or directories.</description>
  </property>
</configuration>
(3) mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>The runtime framework for executing MapReduce jobs. Can be
    one of local, classic or yarn.</description>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/home/kevin/workspace-yarn/history/stagingdir</value>
    <description>YARN requires a staging directory for temporary files created
    by running jobs. By default it creates /tmp/hadoop-yarn/staging with
    restrictive permissions that may prevent your users from running jobs. To
    forestall this, you should configure and create the staging directory
    yourself.</description>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>100</value>
    <description>The total amount of buffer memory to use while sorting files,
    in megabytes. By default, gives each merge stream 1MB, which should
    minimize seeks.</description>
  </property>
  <property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>10</value>
    <description>More streams merged at once while sorting files.</description>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>5</value>
    <description>Higher number of parallel copies run by reduces to fetch
    outputs from very large number of maps.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1024</value>
    <description>The amount of memory available on the NodeManager, in MB
    (default: 8192).</description>
  </property>
</configuration>
(4) yarn-site.xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
    <description>Shuffle service that needs to be set for Map Reduce
    applications.</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    <description>The exact name of the class for shuffle service.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>linux-fdc.tibco.com:8030</value>
    <description>ResourceManager host:port for ApplicationMasters to talk to
    the Scheduler to obtain resources. Host is the hostname of the
    resourcemanager and port is the port on which the Applications in the
    cluster talk to the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>linux-fdc.tibco.com:8031</value>
    <description>ResourceManager host:port for NodeManagers. Host is the
    hostname of the resource manager and port is the port on which the
    NodeManagers contact the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>linux-fdc.tibco.com:8032</value>
    <description>The address of the applications manager interface in the
    RM.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>linux-fdc.tibco.com:8033</value>
    <description>ResourceManager host:port for administrative commands. The
    address of the RM admin interface.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>linux-fdc.tibco.com:8088</value>
    <description>The address of the RM web application.</description>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/home/kevin/workspace-yarn/nm/local</value>
    <description>Specifies the directories where the NodeManager stores its
    localized files. All of the files required for running a particular YARN
    application will be put here for the duration of the application run.
    This must be configured: otherwise the NodeManager stays in the Unhealthy
    state and cannot serve requests; the symptom is that submitted jobs hang
    in the pending state and never make progress.</description>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/home/kevin/workspace-yarn/nm/log</value>
    <description>Specifies the directories where the NodeManager stores
    container log files. This must be configured: otherwise the NodeManager
    stays in the Unhealthy state and cannot serve requests; the symptom is
    that submitted jobs hang in the pending state and never make
    progress.</description>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/home/kevin/workspace-yarn/aggrelog</value>
    <description>Specifies the directory where logs are aggregated.</description>
  </property>
</configuration>
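All of the directories referenced in the configs above live under /home/kevin/workspace-yarn. Creating them up front (a precaution, not from the original; HDFS creates some of them itself) avoids the Unhealthy-NodeManager problem mentioned in the descriptions:

```shell
# Pre-create every local directory the four config files point at
mkdir -p /home/kevin/workspace-yarn/tmp
mkdir -p /home/kevin/workspace-yarn/dfs/name /home/kevin/workspace-yarn/dfs/data /home/kevin/workspace-yarn/dfs/edits
mkdir -p /home/kevin/workspace-yarn/nm/local /home/kevin/workspace-yarn/nm/log
mkdir -p /home/kevin/workspace-yarn/aggrelog /home/kevin/workspace-yarn/history/stagingdir
```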
7. Add JAVA_HOME to hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/usr/custom/jdk1.6.0_37
8. Format HDFS
bin/hdfs namenode -format
9. Start HDFS
sbin/start-dfs.sh
or
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode
10. Start YARN
sbin/start-yarn.sh
or
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager
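With both HDFS and YARN up, the JDK's jps tool should list all five daemons (PIDs and ordering will differ on your machine):

```shell
jps
# Expected daemons in a healthy pseudo-distributed setup:
#   NameNode
#   DataNode
#   SecondaryNameNode
#   ResourceManager
#   NodeManager
```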
11. Check the cluster
(1) ResourceManager web UI: http://192.168.81.251:8088/
(2) NameNode: http://localhost:50070/dfshealth.jsp
(3) SecondaryNameNode: http://192.168.81.251:50090/status.jsp
12. Running the example WordCount.java in Eclipse
Because my machine is short on memory, running the wordcount example from hadoop-mapreduce-examples-2.0.2-alpha.jar directly with the hadoop jar command threw a Java heap space error during the Map phase. After importing the WordCount code into Eclipse, running the example the Hadoop v1 way also hit quite a few problems. The steps that finally worked for me are recorded below, for reference:
(1) Start the RM, NM, NN, DN, and SNN
(2) Upload the test file student.txt to HDFS
Create the /input directory: hadoop fs -mkdir /input
Upload the file: hadoop fs -put /home/kevin/Documents/student.txt /input/student.txt
Check the result:
[kevin@linux-fdc ~]$ hadoop fs -ls -d -R /input/student.txt
Found 1 items
-rw-r--r-- 1 kevin supergroup 131 2013-01-19 10:30 /input/student.txt
(3) In Eclipse, open Run Configurations..., switch to the Arguments tab, and enter the following under Program arguments:
hdfs://localhost:9000/input hdfs://localhost:9000/output
(4) Run log
2013-01-19 10:43:01,088 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2013-01-19 10:43:01,095 INFO jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2013-01-19 10:43:01,599 WARN util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-01-19 10:43:01,682 WARN mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(247)) - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2013-01-19 10:43:01,734 INFO input.FileInputFormat (FileInputFormat.java:listStatus(245)) - Total input paths to process : 1
2013-01-19 10:43:01,817 WARN snappy.LoadSnappy (LoadSnappy.java:<clinit>(46)) - Snappy native library not loaded
2013-01-19 10:43:02,155 INFO mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(368)) - number of splits:1
2013-01-19 10:43:02,256 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
2013-01-19 10:43:02,257 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
2013-01-19 10:43:02,257 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
2013-01-19 10:43:02,257 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - mapred.job.name is deprecated. Instead, use mapreduce.job.name
2013-01-19 10:43:02,258 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
2013-01-19 10:43:02,258 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
2013-01-19 10:43:02,258 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2013-01-19 10:43:02,258 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2013-01-19 10:43:02,259 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
2013-01-19 10:43:02,264 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
2013-01-19 10:43:02,545 INFO mapreduce.JobSubmitter (JobSubmitter.java:printTokens(438)) - Submitting tokens for job: job_local_0001
2013-01-19 10:43:02,678 WARN conf.Configuration (Configuration.java:loadProperty(2028)) - file:/home/kevin/workspace-eclipse/example-hadoop/build/test/mapred/staging/kevin-1414338785/.staging/job_local_0001/job.xml:an attempt to override final parameter: hadoop.tmp.dir; Ignoring.
2013-01-19 10:43:02,941 WARN conf.Configuration (Configuration.java:loadProperty(2028)) - file:/home/kevin/workspace-eclipse/example-hadoop/build/test/mapred/local/localRunner/job_local_0001.xml:an attempt to override final parameter: hadoop.tmp.dir; Ignoring.
2013-01-19 10:43:02,948 INFO mapreduce.Job (Job.java:submit(1222)) - The url to track the job: http://localhost:8080/
2013-01-19 10:43:02,950 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1267)) - Running job: job_local_0001
2013-01-19 10:43:02,951 INFO mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(320)) - OutputCommitter set in config null
2013-01-19 10:43:02,986 INFO mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(338)) - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2013-01-19 10:43:03,173 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(386)) - Waiting for map tasks
2013-01-19 10:43:03,173 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(213)) - Starting task: attempt_local_0001_m_000000_0
2013-01-19 10:43:03,278 INFO mapred.Task (Task.java:initialize(565)) - Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.LinuxResourceCalculatorPlugin@40be76c7
2013-01-19 10:43:03,955 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1288)) - Job job_local_0001 running in uber mode : false
2013-01-19 10:43:03,974 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1295)) - map 0% reduce 0%
2013-01-19 10:43:03,975 INFO mapred.MapTask (MapTask.java:setEquator(1130)) - (EQUATOR) 0 kvi 26214396(104857584)
2013-01-19 10:43:03,979 INFO mapred.MapTask (MapTask.java:<init>(926)) - mapreduce.task.io.sort.mb: 100
2013-01-19 10:43:03,979 INFO mapred.MapTask (MapTask.java:<init>(927)) - soft limit at 83886080
2013-01-19 10:43:03,979 INFO mapred.MapTask (MapTask.java:<init>(928)) - bufstart = 0; bufvoid = 104857600
2013-01-19 10:43:03,979 INFO mapred.MapTask (MapTask.java:<init>(929)) - kvstart = 26214396; length = 6553600
2013-01-19 10:43:04,528 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(501)) -
2013-01-19 10:43:04,569 INFO mapred.MapTask (MapTask.java:flush(1392)) - Starting flush of map output
2013-01-19 10:43:04,569 INFO mapred.MapTask (MapTask.java:flush(1411)) - Spilling map output
2013-01-19 10:43:04,570 INFO mapred.MapTask (MapTask.java:flush(1412)) - bufstart = 0; bufend = 195; bufvoid = 104857600
2013-01-19 10:43:04,570 INFO mapred.MapTask (MapTask.java:flush(1414)) - kvstart = 26214396(104857584); kvend = 26214336(104857344); length = 61/6553600
2013-01-19 10:43:04,729 INFO mapred.MapTask (MapTask.java:sortAndSpill(1600)) - Finished spill 0
2013-01-19 10:43:04,734 INFO mapred.Task (Task.java:done(979)) - Task:attempt_local_0001_m_000000_0 is done. And is in the process of committing
2013-01-19 10:43:05,077 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(501)) - map
2013-01-19 10:43:05,078 INFO mapred.Task (Task.java:sendDone(1099)) - Task 'attempt_local_0001_m_000000_0' done.
2013-01-19 10:43:05,078 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(238)) - Finishing task: attempt_local_0001_m_000000_0
2013-01-19 10:43:05,078 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(394)) - Map task executor complete.
2013-01-19 10:43:05,155 INFO mapred.Task (Task.java:initialize(565)) - Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.LinuxResourceCalculatorPlugin@63f8247d
2013-01-19 10:43:05,182 INFO mapred.Merger (Merger.java:merge(549)) - Merging 1 sorted segments
2013-01-19 10:43:05,206 INFO mapred.Merger (Merger.java:merge(648)) - Down to the last merge-pass, with 1 segments left of total size: 143 bytes
2013-01-19 10:43:05,206 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(501)) -
2013-01-19 10:43:05,487 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(816)) - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2013-01-19 10:43:05,789 INFO mapred.Task (Task.java:done(979)) - Task:attempt_local_0001_r_000000_0 is done. And is in the process of committing
2013-01-19 10:43:05,792 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(501)) -
2013-01-19 10:43:05,792 INFO mapred.Task (Task.java:commit(1140)) - Task attempt_local_0001_r_000000_0 is allowed to commit now
2013-01-19 10:43:05,840 INFO output.FileOutputCommitter (FileOutputCommitter.java:commitTask(432)) - Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/output/_temporary/0/task_local_0001_r_000000
2013-01-19 10:43:05,840 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(501)) - reduce > reduce
2013-01-19 10:43:05,840 INFO mapred.Task (Task.java:sendDone(1099)) - Task 'attempt_local_0001_r_000000_0' done.
2013-01-19 10:43:06,001 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1295)) - map 100% reduce 100%
2013-01-19 10:43:07,002 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1306)) - Job job_local_0001 completed successfully
2013-01-19 10:43:07,063 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1313)) - Counters: 32
	File System Counters
		FILE: Number of bytes read=496
		FILE: Number of bytes written=315196
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=262
		HDFS: Number of bytes written=108
		HDFS: Number of read operations=15
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=4
	Map-Reduce Framework
		Map input records=8
		Map output records=16
		Map output bytes=195
		Map output materialized bytes=158
		Input split bytes=104
		Combine input records=16
		Combine output records=11
		Reduce input groups=11
		Reduce shuffle bytes=0
		Reduce input records=11
		Reduce output records=11
		Spilled Records=22
		Shuffled Maps =0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=3
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=327155712
	File Input Format Counters
		Bytes Read=131
	File Output Format Counters
		Bytes Written=108
(5) Run result
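To inspect the job output on HDFS, one would typically run the following (the /output path comes from the program arguments above; part-r-00000 is the standard name of a single reducer's output file):

```shell
# List the output directory, then print the word counts
hadoop fs -ls /output
hadoop fs -cat /output/part-r-00000
```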