Using a Flume data source in Spark Streaming

710661809 · Posted on 2015-9-17 07:44:59

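There are two approaches. In the first, a Spark Streaming receiver starts a listener and Flume pushes data to it. In the second, Spark Streaming polls Flume on a time-based schedule and pulls the data itself.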
　　
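At first I thought the first approach was the only one. The trouble is that there is no telling which node the listener will come up on, so every time I restarted the streaming job I had to go and edit the Flume sinks again, which was a real pain. Only later did I find the second approach. Anyway, here is the code for both; the differences are small. (The code is taken from the official GitHub repository.)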
　　
Approach 1: listen on a port and let Flume push:

package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam

/**
 *  Produces a count of events received from Flume.
 *
 *  This should be used in conjunction with an AvroSink in Flume. It will start
 *  an Avro server at the requested host:port address and listen for requests.
 *  Your Flume AvroSink should be pointed to this address.
 *
 *  Usage: FlumeEventCount <host> <port>
 *    <host> is the host the Flume receiver will be started on - a receiver
 *           creates a server and listens for flume events.
 *    <port> is the port the Flume receiver will listen on.
 *
 *  To run this example:
 *    `$ bin/run-example org.apache.spark.examples.streaming.FlumeEventCount <host> <port> `
 */
object FlumeEventCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(
        "Usage: FlumeEventCount <host> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val Array(host, IntParam(port)) = args

    val batchInterval = Milliseconds(2000)

    // Create the context and set the batch size
    val sparkConf = new SparkConf().setAppName("FlumeEventCount")
    val ssc = new StreamingContext(sparkConf, batchInterval)

    // Create a flume stream
    val stream = FlumeUtils.createStream(ssc, host, port, StorageLevel.MEMORY_ONLY_SER_2)

    // Print out the count of events received from this server in each batch
    stream.count().map(cnt => "Received " + cnt + " flume events." ).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
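
For this push-based setup, the Flume agent needs an Avro sink aimed at the same host and port that FlumeEventCount was started with. Roughly, the sink section of the agent configuration looks like the sketch below. The agent name `agent`, the sink name `avroSink`, the channel name `memoryChannel` and the host/port values are placeholders assumed for illustration; none of them come from the original post.

```properties
# Assumed names: agent "agent", sink "avroSink", channel "memoryChannel".
agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
# Placeholder values: these must match the <host> <port> given to FlumeEventCount.
agent.sinks.avroSink.hostname = worker-node-1
agent.sinks.avroSink.port = 41414
```

Because the hostname is hard-coded on the Flume side, the listener has to come up on exactly that machine. If it lands somewhere else after a restart, this sink definition has to be edited, which is the annoyance described above.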

　　
Approach 2: poll Flume and pull the data:

package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
import java.net.InetSocketAddress

/**
 *  Produces a count of events received from Flume.
 *
 *  This should be used in conjunction with the Spark Sink running in a Flume agent. See
 *  the Spark Streaming programming guide for more details.
 *
 *  Usage: FlumePollingEventCount <host> <port>
 *    `host` is the host on which the Spark Sink is running.
 *    `port` is the port at which the Spark Sink is listening.
 *
 *  To run this example:
 *    `$ bin/run-example org.apache.spark.examples.streaming.FlumePollingEventCount [host] [port] `
 */
object FlumePollingEventCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(
        "Usage: FlumePollingEventCount <host> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val Array(host, IntParam(port)) = args

    val batchInterval = Milliseconds(2000)

    // Create the context and set the batch size
    val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
    val ssc = new StreamingContext(sparkConf, batchInterval)

    // Create a flume stream that polls the Spark Sink running in a Flume agent
    val stream = FlumeUtils.createPollingStream(ssc, host, port)

    // Print out the count of events received from this server in each batch
    stream.count().map(cnt => "Received " + cnt + " flume events." ).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
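
With the pull-based approach, the Flume agent runs the Spark Sink instead of an Avro sink, and the `<host> <port>` arguments above are the address that sink listens on. The sink section might look like the sketch below. As before, the agent name `agent`, the sink name `spark`, the channel name `memoryChannel` and the host/port values are placeholders assumed for illustration. The Spark Sink jar and its dependencies also have to be on the Flume agent's classpath, as described in the Spark Streaming Flume integration guide.

```properties
# Assumed names: agent "agent", sink "spark", channel "memoryChannel".
agent.sinks = spark
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.channel = memoryChannel
# Placeholder values: the address the sink binds to on the Flume machine.
# Pass the same values to FlumePollingEventCount as <host> <port>.
agent.sinks.spark.hostname = flume-node-1
agent.sinks.spark.port = 41415
```

Here the fixed address belongs to the Flume agent rather than to Spark, so restarting the streaming job does not require touching the Flume configuration.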

　　
