Flume: Basic Concepts and Installation

cxwpf200 · Posted 2015-11-27 17:26:21

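Most of this article is taken from video tutorials found online; I wrote it down so I can look it up later.

Flume Basic Concepts
Flume is a distributed, reliable, highly available system for aggregating massive volumes of log data. It supports custom data senders within the system for collecting data, provides simple processing of that data, and can write it to a variety of data receivers.
Flume's architecture changed significantly between 0.9.x and 1.x. Versions from 1.x onward are called Flume NG (New Generation), and 0.9.x is called Flume OG (Old Generation).

Flume NG architecture: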

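Running Flume requires JDK 6.0 or later. At the time of writing, Flume ships startup scripts for Linux only; there is no startup script for Windows.
Flume Core Components - Source
A Flume Source collects log data, wraps it into events, and delivers them to the channel in transactions.
Flume provides many source implementations, including Avro Source, Exec Source, Spooling Directory Source, NetCat Source, Syslog Sources (TCP and UDP), HTTP Source, etc.
The approach that requires the fewest changes to an existing program is to read the log files the program already writes. This gives an essentially seamless integration with no changes to the existing program. There are two sources that read files directly:
1) Exec Source
Runs a Linux command that continuously outputs the latest data, such as tail -F <filename>. With this approach the file name must be specified explicitly.
2) Spool Source
Watches a configured directory for newly added files and reads the data out of them.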
　　
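* Notes on using Spool Source:
1) Files copied into the spool directory must not be opened and edited afterwards.
2) The spool directory must not contain subdirectories.

* How is Spool Source used?
In practice it can be combined with log4j: set log4j's file rolling to once per minute and copy the rolled files into the directory monitored by the spool source. log4j has a TimeRolling plugin that can place the rolled files into the spool directory. This gets close to real-time monitoring. After Flume finishes transferring a file, it renames the file by appending a suffix, .COMPLETED by default (the suffix can be changed in the configuration file).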
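As a sketch of the renaming behaviour described above, the fragment below sets the completed-file suffix explicitly. The agent name a1, the source name r1, and the directory path are placeholders rather than part of the original setup; fileSuffix and deletePolicy are spooling directory source properties from the Flume user guide.

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
# Placeholder path: the directory that the rolled log4j files are copied into
a1.sources.r1.spoolDir = /var/log/app/spool
# Suffix appended to a file once it has been fully ingested (default: .COMPLETED)
a1.sources.r1.fileSuffix = .DONE
# never (default) keeps completed files; immediate deletes them after ingestion
a1.sources.r1.deletePolicy = never
```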

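* Exec Source vs. Spool Source
1. Exec Source collects logs in real time, but if Flume is not running or the command fails, log data is not collected, so the completeness of the log data cannot be guaranteed.
2. Spool Source cannot collect data in real time, but rolling the files every minute brings it close to real time.
3. Summary: if the application cannot roll its log files by the minute, the two collection methods can be used together.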
　　
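Flume Core Components - Sink
A Flume Sink takes data out of the channel and stores it in a file system or database, or forwards it to a remote server.
Flume also provides many sink implementations, including HDFS sink, Logger sink, Avro sink, File Roll sink, Null sink, HBase sink, etc.
A sink can store data in a file system, a database, or HDFS. When the log volume is small, the data can be kept on the file system and written out at a set time interval. When the log volume is large, the data can be stored in HDFS to make later analysis easier.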
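For the small-volume case, a File Roll sink that writes to the local file system at a fixed interval might look like the sketch below. The agent, sink, and channel names and the output directory are placeholders I chose for illustration.

```properties
a1.sinks = k1
a1.channels = c1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
# Placeholder output directory on the local file system
a1.sinks.k1.sink.directory = /var/log/flume
# Start a new file every 60 seconds (default: 30; 0 disables time-based rolling)
a1.sinks.k1.sink.rollInterval = 60
```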
　　
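Flume Core Components - Channel
A Flume Channel mainly acts as a queue, providing simple buffering for the data supplied by the source.
Flume provides several channel implementations, including Memory Channel, JDBC Channel, File Channel, etc.
Memory Channel offers high throughput but cannot guarantee data integrity.
The recoverable memory channel (MemoryRecoverChannel) is deprecated; the official documentation recommends File Channel as its replacement.
File Channel guarantees data integrity and consistency. When configuring a File Channel, it is recommended to put its directories on a different disk from the one holding the application's log files, to improve throughput.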
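
Flume Cluster Setup
1. Data collection side (192.168.217.12 ~ 192.168.217.17):
Source: spooldir, which scans a directory for files
Channel: memory
Sink: avro sink

Create a file named push.conf in the conf directory with the following content: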
agent_col.sources = spooldir-source
agent_col.channels = mem-channel-1
agent_col.sinks = avro-sink-1

agent_col.sources.spooldir-source.type = spooldir
agent_col.sources.spooldir-source.channels = mem-channel-1
agent_col.sources.spooldir-source.spoolDir = /root/logs
agent_col.sources.spooldir-source.fileHeader = true

agent_col.channels.mem-channel-1.type = memory
agent_col.channels.mem-channel-1.keep-alive = 10
agent_col.channels.mem-channel-1.capacity = 1000000
agent_col.channels.mem-channel-1.transactionCapacity = 1000000

agent_col.sinks.avro-sink-1.type = avro
agent_col.sinks.avro-sink-1.hostname = 192.168.217.11
agent_col.sinks.avro-sink-1.port = 44444
agent_col.sinks.avro-sink-1.channel = mem-channel-1

2. Data receiving side (192.168.217.11):
Source: avro source
Channel: memory
Sink: logger sink

Create a file named pull.conf in the conf directory:
agent_rec.sources = avro-src
agent_rec.channels = mem-channel-1
agent_rec.sinks = logger-sink-1

agent_rec.sources.avro-src.type = avro
agent_rec.sources.avro-src.channels = mem-channel-1
agent_rec.sources.avro-src.bind = 192.168.217.11
agent_rec.sources.avro-src.port = 44444

agent_rec.channels.mem-channel-1.type = memory
agent_rec.channels.mem-channel-1.keep-alive = 10
agent_rec.channels.mem-channel-1.capacity = 100000
agent_rec.channels.mem-channel-1.transactionCapacity = 100000

agent_rec.sinks.logger-sink-1.type = logger
agent_rec.sinks.logger-sink-1.channel = mem-channel-1
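
3. Start Flume on the data receiving side
bin/flume-ng agent --conf conf --conf-file conf/pull.conf --name agent_rec -Dflume.root.logger=INFO,console
4. Start Flume on the data collection side
bin/flume-ng agent -n agent_col -c conf -f conf/push.conf
5. If the Flume agent on the data collection side turns out not to be running, Avro may not be installed.

6. Install cmake and Avro on the data receiving side
1) First install cmake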
$ wget http://www.cmake.org/files/v3.2/cmake-3.2.1.tar.gz
$ tar -zxvf cmake-3.2.1.tar.gz
$ cd cmake-3.2.1
$ ./bootstrap
$ make
$ make install
2) Install Avro
Download the Avro package; I used avro-c-1.7.7.tar.gz
$ wget http://apache.fayea.com/avro/stable/c/avro-c-1.7.7.tar.gz
$ tar -zxvf avro-c-1.7.7.tar.gz
$ cd avro-c-1.7.7
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_INSTALL_PREFIX=$PREFIX -DCMAKE_BUILD_TYPE=RelWithDebInfo
$ make
$ make install
　　

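7. Test
Move a newly created log file into /root/logs on the data collection side. The contents of the new log file then show up on the data receiving side. Done!

Flume Source
* Avro Source
Listens on an Avro port and receives events from external Avro client streams.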
example:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

Case 2: Test Avro Source
# case2_avro.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.217.11
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

Start Flume agent a1:
$ bin/flume-ng agent --conf conf --conf-file conf/case2_avro.conf --name a1 -Dflume.root.logger=INFO,console

Create the test file:
$ echo "Hello world" > /root/logs/log1
Send the file with avro-client:
$ bin/flume-ng avro-client -H 192.168.217.11 -p 44444 -F /root/logs/log1

* Thrift Source
Listens on a Thrift port and receives events from external Thrift clients.
# Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
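
* Exec Source
This source runs a given Unix command on start-up and expects that process to continuously produce data on standard output. If the process exits for any reason, the source also exits and produces no further data.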
# Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
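
Because the source stops when the command exits, it can help to let Flume restart the command. The sketch below extends the example with the exec source's restart, restartThrottle, and logStdErr properties; the values are illustrative, and restarting does not recover data that was written while the command was down.

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
# Restart the command if it dies (default: false)
a1.sources.r1.restart = true
# Milliseconds to wait before attempting a restart (default: 10000)
a1.sources.r1.restartThrottle = 10000
# Also log the command's stderr output (default: false)
a1.sources.r1.logStdErr = true
```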

Case 3: Test Exec Source
# case3_exec.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = cat /root/logs/log1
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

Start Flume agent a1:
$ bin/flume-ng agent --conf conf --conf-file conf/case3_exec.conf --name a1 -Dflume.root.logger=INFO,console

* NetCat Source
A netcat-like source that listens on a given port and turns each line of text into an event.
#Example for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
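
* Spooling Directory Source
The spooling directory source watches a configured directory for newly added files and reads the data out of them. Two points to note:
1) Files copied into the spool directory must not be opened and edited afterwards.
2) The spool directory must not contain subdirectories.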
#Example for an agent named agent-1:
agent-1.channels = ch-1
agent-1.sources = src-1
agent-1.sources.src-1.type = spooldir
agent-1.sources.src-1.channels = ch-1
agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
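
* Syslog Sources
Read syslog data and generate Flume events. The UDP source treats an entire message as a single event. The TCP sources create a new event for each string of characters separated by a newline (\n).
> Syslog TCP Source
> Multiport Syslog TCP Source
> Syslog UDP Source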
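A minimal sketch of a TCP and a UDP syslog source feeding the same channel. The source names and port numbers are placeholders chosen for illustration; syslogtcp and syslogudp are the type names from the Flume user guide.

```properties
a1.sources = tcp-src udp-src
a1.channels = c1

# One event per newline-terminated line
a1.sources.tcp-src.type = syslogtcp
a1.sources.tcp-src.host = 0.0.0.0
a1.sources.tcp-src.port = 5140
a1.sources.tcp-src.channels = c1

# One event per UDP message
a1.sources.udp-src.type = syslogudp
a1.sources.udp-src.host = 0.0.0.0
a1.sources.udp-src.port = 5141
a1.sources.udp-src.channels = c1
```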
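
* HTTP Source
A source that accepts Flume events via HTTP POST and GET. GET should be used for experimentation only. HTTP requests are converted into Flume events by a pluggable handler, which must implement the HTTPSourceHandler interface. The handler takes an HttpServletRequest and returns a list of Flume events.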
#For example, a http source for agent named a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler
a1.sources.r1.handler.nickname = random props

Case 4: Test HTTP Source
# case4_http.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

Start Flume agent a1:
$ bin/flume-ng agent --conf conf --conf-file conf/case4_http.conf --name a1 -Dflume.root.logger=INFO,console

Send a JSON-formatted POST request:
$ curl -X POST -d '[{"headers" : {"timestamp" : "434324343","host" :"random_host.example.com" },"body" : "random_body"},{"headers" : {"namenode" : "namenode.example.com","datanode" : "random_datanode.example.com"},"body" : "really_random_body"}]' http://localhost:5140
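
Flume Sink
* HDFS Sink
This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files, with compression for both file types. Files can be rolled (the current file is closed and a new one created) based on elapsed time, data size, or number of events. It can also partition data by attributes such as the timestamp or the host where the event originated. The HDFS directory path may contain formatting escape sequences, which the HDFS sink replaces to generate the directory and file name used to store the events.
Note: using this sink requires Hadoop to be installed, so that Flume can use the Hadoop jars to communicate with HDFS. The Hadoop version must support the sync() call.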
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
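
The rolling criteria mentioned above (time, size, event count) map onto the hdfs.rollInterval, hdfs.rollSize, and hdfs.rollCount properties. The following is a sketch with illustrative values, not a recommended tuning; the path is a shortened variant of the one used above.

```properties
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d
# Write plain text instead of the default SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
# Roll after 600 seconds or 128 MB, whichever comes first; 0 disables a criterion
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
# Resolve the time escapes from the agent's local clock rather than a timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```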
　　
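* Logger Sink
Logs events at INFO level. Typically useful for testing and debugging.
* Avro Sink
The Avro sink is what makes Flume's tiered collection possible: Flume events sent to this sink are converted into Avro events and sent to the configured hostname and port. Events are taken from the configured channel in batches.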
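A sketch of an Avro sink with an explicit batch size, reusing the receiver address from the cluster setup earlier in this article. The component names and the batch size value are placeholders.

```properties
a1.sinks = k1
a1.channels = c1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 192.168.217.11
a1.sinks.k1.port = 44444
# Number of events taken from the channel and sent together (default: 100)
a1.sinks.k1.batch-size = 500
```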
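* HBase Sink
This sink writes data to HBase. The HBase configuration is taken from the first hbase-site.xml found on the classpath. A class implementing HbaseEventSerializer, specified in the configuration, converts events into HBase puts and/or increments, which are then written to HBase.
The HBase sink supports writing to secure HBase. To write to secure HBase, the user the agent runs as must have write permission on the configured table. The principal and keytab used to authenticate against the KDC can be specified in the configuration. The hbase-site.xml on the Flume agent's classpath must have authentication set to kerberos.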
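A sketch of an HBase sink configuration. The table name, column family, principal, and keytab path are placeholders; the two kerberos properties are only needed when writing to secure HBase.

```properties
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.channel = c1
# Placeholder table and column family; both must already exist in HBase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
# Serializer class implementing HbaseEventSerializer
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
# Secure HBase only: placeholder principal and keytab
a1.sinks.k1.kerberosPrincipal = flume/host.example.com@EXAMPLE.COM
a1.sinks.k1.kerberosKeytab = /etc/flume/conf/flume.keytab
```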
* AsyncHBase Sink
This sink writes data to HBase asynchronously. The configuration must name a class implementing AsyncHbaseEventSerializer, which converts events into HBase puts and/or increments.
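The asynchronous variant is configured much like the HBase sink; only the type and the serializer class differ. A sketch, with placeholder table and column family names:

```properties
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = asynchbase
a1.sinks.k1.channel = c1
# Placeholder table and column family
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
# Serializer class implementing AsyncHbaseEventSerializer
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
```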
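
Flume Channel
* Memory Channel
Events are stored in an in-memory queue with a configurable maximum size. It suits flows that need higher throughput, at the cost of losing the buffered data if the agent fails.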
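* JDBC Channel
Events are stored in persistent storage backed by a database. The JDBC channel currently supports embedded Derby. It is a durable channel, well suited to flows where recoverability is important.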
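With the embedded Derby defaults, the minimal configuration is just the channel type. The agent and channel names below are placeholders.

```properties
a1.channels = c1
# Uses the embedded Derby database with default settings
a1.channels.c1.type = jdbc
```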
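* File Channel
Note that by default the File Channel uses checkpoint and data directories located under the user's home directory. As a result, if several File Channel instances are started within one agent, only one of them can lock the directories and the others fail to initialize. It is therefore necessary to give explicit paths to every configured channel, preferably on different disks.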
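A sketch of two File Channels in one agent, each with its own explicit directories. The mount points /mnt/disk1 and /mnt/disk2 are placeholders standing in for two separate physical disks.

```properties
a1.channels = c1 c2

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/disk1/flume/c1/checkpoint
a1.channels.c1.dataDirs = /mnt/disk1/flume/c1/data

# The second channel gets its own directories so the two never contend for the same lock
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /mnt/disk2/flume/c2/checkpoint
a1.channels.c2.dataDirs = /mnt/disk2/flume/c2/data
```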

　　


