Using Flume as a data source in Spark Streaming
There are two approaches: in the first, the Spark Streaming driver starts a listener and Flume pushes data to it; in the second, Spark Streaming polls Flume and pulls data on a schedule. At first I thought only the first method existed, but the problem is that there is no telling which node the driver will come up on, so every time I restarted the streaming job I had to go update the Flume sinks, which was a real pain. Only later did I discover the second method. Anyway, here is the code for both approaches; the differences are small. (Code adapted from the official Spark examples on GitHub.)
First approach: the driver listens on a port and Flume pushes events to it:
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam

/**
 * Produces a count of events received from Flume.
 *
 * This should be used in conjunction with an AvroSink in Flume. It will start
 * an Avro server at the requested host:port address and listen for requests.
 * Your Flume AvroSink should be pointed to this address.
 *
 * Usage: FlumeEventCount <host> <port>
 *   <host> is the host the Flume receiver will be started on - a receiver
 *          creates a server and listens for flume events.
 *   <port> is the port the Flume receiver will listen on.
 *
 * To run this example:
 *   `$ bin/run-example org.apache.spark.examples.streaming.FlumeEventCount <host> <port>`
 */
object FlumeEventCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: FlumeEventCount <host> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val Array(host, IntParam(port)) = args
    val batchInterval = Milliseconds(2000)

    // Create the context and set the batch size
    val sparkConf = new SparkConf().setAppName("FlumeEventCount")
    val ssc = new StreamingContext(sparkConf, batchInterval)

    // Create a flume stream
    val stream = FlumeUtils.createStream(ssc, host, port, StorageLevel.MEMORY_ONLY_SER_2)

    // Print out the count of events received from this server in each batch
    stream.count().map(cnt => "Received " + cnt + " flume events.").print()

    ssc.start()
    ssc.awaitTermination()
  }
}
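For the push approach, the Flume agent needs an Avro sink pointed at the host and port where FlumeEventCount starts its receiver. Here is a minimal sketch of the sink section of a Flume agent configuration; the agent name `agent` and the channel name `memoryChannel` are placeholders for whatever your agent already defines:

# Avro sink that pushes events to the Spark Streaming receiver
agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
# Must match the <host> <port> arguments passed to FlumeEventCount
agent.sinks.avroSink.hostname = <receiver-host>
agent.sinks.avroSink.port = <receiver-port>

This is exactly the config that has to be edited whenever the driver comes up on a different node, which is the pain point mentioned above.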
Second approach: Spark Streaming polls Flume and pulls the data itself:
package org.apache.spark.examples.streaming

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam

/**
 * Produces a count of events received from Flume.
 *
 * This should be used in conjunction with the Spark Sink running in a Flume agent. See
 * the Spark Streaming programming guide for more details.
 *
 * Usage: FlumePollingEventCount <host> <port>
 *   `host` is the host on which the Spark Sink is running.
 *   `port` is the port at which the Spark Sink is listening.
 *
 * To run this example:
 *   `$ bin/run-example org.apache.spark.examples.streaming.FlumePollingEventCount <host> <port>`
 */
object FlumePollingEventCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: FlumePollingEventCount <host> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val Array(host, IntParam(port)) = args
    val batchInterval = Milliseconds(2000)

    // Create the context and set the batch size
    val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
    val ssc = new StreamingContext(sparkConf, batchInterval)

    // Create a flume stream that polls the Spark Sink running in a Flume agent
    val stream = FlumeUtils.createPollingStream(ssc, host, port)

    // Print out the count of events received from this server in each batch
    stream.count().map(cnt => "Received " + cnt + " flume events.").print()

    ssc.start()
    ssc.awaitTermination()
  }
}
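For the pull approach, the Flume agent runs Spark's custom sink, which buffers events until Spark Streaming polls them; the spark-streaming-flume-sink jar (plus its Scala dependencies) must be on the Flume agent's classpath. A minimal sketch of the sink configuration, again with placeholder agent and channel names:

# Custom Spark sink; Spark Streaming pulls events from it,
# so restarting the driver elsewhere does not require touching this config
agent.sinks = sparkSink
agent.sinks.sparkSink.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.sparkSink.channel = memoryChannel
agent.sinks.sparkSink.hostname = <sink-host>
agent.sinks.sparkSink.port = <sink-port>

FlumeUtils.createPollingStream also has an overload that takes a sequence of addresses, so one stream can pull from several Spark sinks; that is what the java.net.InetSocketAddress import above is for. A quick sketch, with hypothetical host names:

// Poll two Spark sinks from a single stream
val addresses = Seq(
  new InetSocketAddress("flume-host-1", 9988),
  new InetSocketAddress("flume-host-2", 9988))
val stream = FlumeUtils.createPollingStream(
  ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2)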