Integrating Spark Streaming with Flume on CDH: a summary of the problems

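Posted by a2005147 on 2015-9-17 07:22:44

Originally published at http://www.cnblogs.com/hark0623/p/4170156.html. Please credit the source when reposting.

The integration itself is quite simple, and tutorials are easy to find online.

http://blog.iyunv.com/fighting_one_piece/article/details/40667035 is enough to get started. I used the first integration approach described there.

In practice I ran into all kinds of problems. It took me from about 5:00 in the morning until 18:30 in the evening on 2014.12.17.

In hindsight it all looks simple, but it took a long time to work out. Lesson learned.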

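Problem 1: The job references several dependency packages, and they must be bundled into your JAR. Because the job runs in Spark on YARN mode, the cluster cannot find any dependency that is not bundled. Where do you get them? Search for them directly on search.maven.org.

Problem 2: Because this is a Spark on YARN cluster, the receiver can only listen on localhost. If you specify a particular IP instead, any node that does not own that IP cannot listen on it, and the job fails there.

Problem 3: To start Flume on CDH, run find / -name flume.conf, pick the most recent result whose contents match the configuration shown in Cloudera Manager, and start Flume with that file.

Problem 4: Do not go straight to the cluster. Test on a single node first, because that already exposes plenty of problems. Fix those, then test on the cluster.

Problem 5: Pay close attention to versions! CDH 5.2 ships Spark 1.1.0, but the plugin I had been using was version 1.1.1 the whole time. This alone cost me from noon until now. Another lesson learned the hard way.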
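
Problems 1 and 5 both come down to how the job JAR is built. The original post does not say which build tool was used, so the following is only a rough sketch of what an sbt build could look like. The project name, the Scala version, and the use of the sbt-assembly plugin are all assumptions; the one detail taken from the post is that the Flume integration artifact should have the same version as the cluster's Spark (1.1.0 on CDH 5.2).

```scala
// build.sbt: a hypothetical sketch, not the author's actual build.
// Assumes the sbt-assembly plugin is enabled in project/plugins.sbt
// (plugin setup varies by version and is omitted here).

name := "SparkTest"

// Assumption: Spark 1.1.0 artifacts are published for Scala 2.10.
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Already present on the cluster, so left out of the bundled JAR.
  "org.apache.spark" %% "spark-core"      % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
  // Not on the cluster classpath, so it has to be bundled (Problem 1).
  // Keep it at the same version as the cluster's Spark (Problem 5).
  "org.apache.spark" %% "spark-streaming-flume" % "1.1.0"
)
```

With a setup along these lines, running sbt assembly would produce a single JAR that can be passed to spark-submit.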

The Spark code is as follows:

package com.hark

import java.io.File

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

/**
 * Created by Administrator on 2014-12-16.
 */
object SparkStreamingFlumeTest {
  def main(args: Array[String]) {
    // Workaround for local runs on Windows: point hadoop.home.dir at the
    // working directory and create an empty bin/winutils.exe placeholder.
    val path = new File(".").getCanonicalPath()
    System.getProperties().put("hadoop.home.dir", path)
    new File("./bin").mkdirs()
    new File("./bin/winutils.exe").createNewFile()

    // For a local test, set the master here (a receiver needs at least two
    // local threads, e.g. "local[2]"). On YARN it comes from spark-submit.
    //val sparkConf = new SparkConf().setAppName("HdfsWordCount").setMaster("local")
    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    // Create the streaming context with a 20-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(20))

    // Host and port the receiver listens on; they must match the Flume Avro sink
    //val hostname = "127.0.0.1"
    val hostname = "localhost"
    val port = 2345
    val storageLevel = StorageLevel.MEMORY_ONLY
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port, storageLevel)
    // Print the number of Flume events received in each batch
    flumeStream.count().map(cnt => "Received " + cnt + " flume events.").print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The Flume configuration file is as follows:

# Please paste flume.conf here. Example:
# Sources, channels, and sinks are defined per
# agent name, in this case 'tier1'.
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
# For each source, channel, and sink, set
# standard properties.
tier1.sources.source1.type = exec
tier1.sources.source1.command = tail -F /opt/data/test3/123
tier1.sources.source1.channels = channel1
tier1.channels.channel1.type = memory
#tier1.sinks.sink1.type = logger
tier1.sinks.sink1.type = avro
tier1.sinks.sink1.hostname = localhost
tier1.sinks.sink1.port = 2345
tier1.sinks.sink1.channel = channel1
# Other properties are specific to each type of
# source, channel, or sink. In this case, we
# specify the capacity of the memory channel.
tier1.channels.channel1.capacity = 100

The Spark submit command is as follows:

spark-submit --driver-memory 512m --executor-memory 512m --executor-cores 1 --num-executors 3 --class com.hark.SparkStreamingFlumeTest --deploy-mode cluster --master yarn /opt/spark/SparkTest.jar

The Flume start command is as follows:

flume-ng agent --conf /opt/cloudera-manager/run/cloudera-scm-agent/process/585-flume-AGENT --conf-file /opt/cloudera-manager/run/cloudera-scm-agent/process/585-flume-AGENT/flume.conf --name tier1 -Dflume.root.logger=INFO,console
