Spark basics and the job submission workflow

ienki · posted on 2019-1-30 12:16:14

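Spark is a distributed computing engine from the Apache open-source community. It keeps intermediate data in memory where possible, so it is usually faster than Hadoop MapReduce.
Download

Install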

Local mode
Submitting a job with spark-submit

　　cd /usr/local/spark
　　./bin/spark-submit --class org.apache.spark.examples.SparkPi ./examples/jars/spark-examples_2.11-2.1.0.jar 10000
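
For context, the bundled SparkPi example estimates pi by Monte Carlo sampling, and as far as I know its argument (10000 above) sets how many partitions the sampling is split across. The sketch below shows the same idea in plain Scala without Spark. The names PiSketch and estimatePi, the sample count, and the seed are all made up for illustration and are not part of the Spark example.

```scala
import scala.util.Random

// Hypothetical helper, not the actual SparkPi source: a single-machine
// Monte Carlo estimate of pi.
object PiSketch {
  // Sample points uniformly in the square [-1, 1] x [-1, 1] and count
  // how many fall inside the unit circle. The ratio approximates pi / 4.
  def estimatePi(samples: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val inside = (1 to samples).count { _ =>
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y <= 1
    }
    4.0 * inside / samples
  }

  def main(args: Array[String]): Unit =
    println(s"Pi is roughly ${estimatePi(100000, 42L)}")
}
```

Spark's version distributes the sampling loop across partitions and sums the hits with a reduce, which is why more partitions mean more samples and a longer run.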

Submitting interactively with spark-shell

　　Create a text file hello.txt under /root
　　./bin/spark-shell
　　Open a second terminal and run jps to list the Java processes; a SparkSubmit process should appear
　　sc
　　sc.textFile("/root/hello.txt")
　　val lineRDD = sc.textFile("/root/hello.txt")
　　lineRDD.foreach(println)
　　Check the job in the Spark web UI (port 4040 on the driver host by default)
　　val wordRDD = lineRDD.flatMap(line => line.split(" "))
　　wordRDD.collect
　　val wordCountRDD = wordRDD.map(word => (word,1))
　　wordCountRDD.collect
　　val resultRDD = wordCountRDD.reduceByKey((x,y)=>x+y)
　　resultRDD.collect
　　val orderedRDD = resultRDD.sortByKey(false)
　　orderedRDD.collect
　　orderedRDD.saveAsTextFile("/root/result")
　　Check the result (saveAsTextFile writes a directory /root/result containing part files)
　　Shorthand: sc.textFile("/root/hello.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortByKey(false).collect
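
To make each stage of the pipeline above easier to follow, here is a rough equivalent using plain Scala collections, so it runs without Spark. The object WordCountSketch and its methods are hypothetical names for illustration, and groupBy plus sum stands in for reduceByKey. Note that sortByKey orders by the word, not by the count; the rankByCount helper shows one way to order by count instead.

```scala
// Hypothetical single-machine sketch of the word count steps; it does not
// use the Spark API.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Seq[(String, Int)] = {
    val words  = lines.flatMap(_.split(" "))       // like flatMap on the RDD
    val pairs  = words.map(word => (word, 1))      // like map(word => (word,1))
    val counts = pairs.groupBy(_._1).map {         // stands in for reduceByKey
      case (word, group) => (word, group.map(_._2).sum)
    }
    // Like sortByKey(false): descending by the word itself
    counts.toSeq.sortBy(_._1)(Ordering[String].reverse)
  }

  // Order by count, highest first, instead of by word
  def rankByCount(counts: Seq[(String, Int)]): Seq[(String, Int)] =
    counts.sortBy(pair => -pair._2)

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("hello spark", "hello hadoop"))
    counts.foreach(println)
    println(rankByCount(counts).head)
  }
}
```

In spark-shell itself, ordering by count can be done with resultRDD.sortBy(_._2, ascending = false) in place of sortByKey.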

Accessing HDFS data in local mode

　　start-dfs.sh
　　In spark-shell, run: sc.textFile("hdfs://192.168.56.100:9000/hello.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortByKey().collect (the IP can be replaced with the hostname master once it is mapped in /etc/hosts)
　　sc.textFile("hdfs://192.168.56.100:9000/hello.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortByKey().saveAsTextFile("hdfs://192.168.56.100:9000/output1")

Spark standalone mode

Spark on YARN mode
　　
