icon_CloudMgmt icon_DollarSign icon_Globe icon_ITAuto icon_ITOps icon_ITSMgmt icon_Mainframe icon_MyIT icon_Ribbon icon_Star icon_User icon_Users icon_VideoPlay icon_Workload icon_caution icon_close s-chevronLeft s-chevronRight s-chevronThinRight s-chevronThinRight s-chevronThinLeft s-chevronThinLeft s-trophy s-chevronDown
BMC
BMC Contact Options

Select the link below that best matches your interest.

For questions about BMC products, solutions, and services, you can also phone the number below:

1‑855‑834‑7487

Flume 101

Apache Flume reads a data source and writes it to storage at incredibly high volumes and without losing any events. These data feeds include streaming logs, network traffic, Twitter feeds, etc. (examples below)

But it does not do data manipulation. Instead you can write it to something with the ability to do that, like Hive and, in that case, use Hive’s SQL-like commands (HQL) to process it. Or you can output it to ElasticSearch, where the data gets stored in JSON format which can be queried with Kibana. There are many other options for where to store the data.

Installation

The installation of Flume is simple. You just download it and unzip it. There is no classpath or anything like that to set. You do not start Flume, per se. Instead you start agents, one for every data source that you want to read and save.

Flume source and sinks

There is not much terminology to master:

A source is input data. A sink is output. And a memory channel stores events in memory, with the proviso that those get lost if the agent dies. Some of the sinks, sources, and memory channels that Flume supports include:

Sources—Avro (data serialization with a schema), Thrift (abstraction written by Facebook to facilitate communications between different programming languages), Exec (shell commands), JMS (messaging), Twitter, Kafka (LinkedIn’s streaming product), NetCat, Syslog, HTTP, custom (you can write your own, plus there are plenty of third party ones.)

Sinks—HDFS, Hive, Logger (Apache log4j), ElasticSearch, Kafka, Custom (third party plugins like Cassandra, HBase, and MongoDB)

Memory Channels—JDBC (meaning databases), Kafka, File, Custom

Configuration

The configuration of a Flume agent has the format shown below. There are sources (input), sinks (output), and channels (connects sources to sinks). The agent name is whatever name you make up.

The color coding shows how you declare the names of sources, sinks, and channels and then associate those with their corresponding options.


(agent name).sources = (source name)
(agent name).channels = (channel name)
(agent name).sinks = (sink name)




(agent name).sources.(source name).options = value
(agent name).sinks.(sink name).options = value




(agent name).channels.(channel name).options = value



 


(agent name).sinks.(sink name).channel =   (channel name)
(agent name).sources.(sink name).channel =   (channel name)

To start an agent you go to the bin folder and execute:


./flume-ng agent -n (agent name) -c conf -f ../conf/(agent config file name)



Tail a file

Here is an example of how to read syslog on a CentOS system. In actual practice you would most likely use syslog and then configure Flume to listen for TCP or UDP messages. But here we just use tail.

We want to store this data in Hadoop so we make a directory:


hadoop fs -mkdir /flume

Here is our configuration file. Notice the types exec and hdfs. Each type has different options. Like the exec type requires a command to execute. And the hdfs type requires the path.


cat ../conf/flume.conf




# give names to the agent, source, sinks, and channels
agent.sources=tail
agent.sinks=hdfs
agent.channels=ch1




# tell it what command to run
agent.sources.tail.type=exec
agent.sources.tail.command=tail -F /var/log/messages




# give Hadoop options
agent.sinks.hdfs.type=hdfs
agent.sinks.hdfs.hdfs.path=/flume
agent.sinks.hdfs.hdfs.filePrefix=syslog
agent.sinks.hdfs.hdfs.rollInterval=10
agent.sinks.hdfs.hdfs.rollSize=0




# give channel options
agent.channels.ch1.type=memory
agent.channels.ch1.capacity=1000




# connect the sources and sinks to the channel
agent.sources.tail.channels=ch1
agent.sinks.hdfs.channel=ch1



Start the flume agent. The name of the agent must match the agent name you put in flume.conf


./flume-ng agent -n (agent  name) -c conf -f ../conf/(config file)



Then stdout will write messages showing that data has been written:


17/03/08 16:42:44 INFO hdfs.HDFSEventSink: Closing /flume/syslog
17/03/08 16:42:44 INFO hdfs.BucketWriter: Closing /flume/syslog.1488987752411.tmp
17/03/08 16:42:44 INFO hdfs.BucketWriter: Renaming /flume/syslog.1488987752411.tmp to /flume/syslog.1488987752411



We can look at what it wrote using cat.It will not always be legible text.


hadoop fs -cat /flume/syslog.1488987752411



If you have any mistake in the configuration file Flume will give you a Java Stack trace with a fairly friendly error message telling you what is wrong.

Twitter

Here we write Twitter tweets, using Twitter’s streaming API, and then save those to Hadoop. To do this you first have to create an app at https://apps.twitter.com/, which will give you the keys you need to connect.

The Tweets have no structure in Flume. So if you want to work with them you would write them to, for example, Hive. But to do that you have to create the Twitter schema in Hive. You can look for instructions for how to do that on the internet. In this example we just write them to Hadoop.


TwitterAgent.sources = Twitter
TwitterAgent.channels = t1
TwitterAgent.sinks = hdfs




TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = (obfuscated)
TwitterAgent.sources.Twitter.consumerSecret =  (obfuscated)
TwitterAgent.sources.Twitter.accessToken =  (obfuscated)
TwitterAgent.sources.Twitter.accessTokenSecret =  (obfuscated)
TwitterAgent.sources.Twitter.keywords = health
TwitterAgent.sinks.console.type = logger




TwitterAgent.sinks.hdfs.type=hdfs
TwitterAgent.sinks.hdfs.hdfs.path=/flume/twitter
TwitterAgent.sinks.hdfs.hdfs.filePrefix=messages
TwitterAgent.sinks.hdfs.hdfs.rollInterval=10
TwitterAgent.sinks.hdfs.hdfs.rollSize=0




TwitterAgent.channels.t1.type=memory
TwitterAgent.channels.t1.capacity = 1000




TwitterAgent.sources.Twitter.channels = t1
TwitterAgent.sinks.hdfs.channel = t1



Then list the files written in Hadoop:


-rw-r--r--   1 root supergroup  217127 2017-03-08 14:54 
/flume/twitter/messages.1488981289896
-rw-r--r--   1 root supergroup  232847 2017-03-08 14:55
/flume/twitter/messages.1488981289897
-rw-r--r--   1 root supergroup     227083 2017-03-08 14:55 
/flume/twitter/messages.1488981289898



You can log Tweets to the console using:


TwitterAgent.sinks.console.type = logger



But as you can see they are in Hex, so hard to read:


7/03/08 14:44:39 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61
76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
17/03/08 14:44:41 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 
76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }



That is a basic introduction to Flume. As you can see it is not a complicated product at all. As an exercise you might try now to write syslogs to Flume.

Explore other BMC Guides ›

Share on Social Media

BMC Solutions for Hadoop & Big Data

Control-M for Big Data

Realize the benefits of big data faster

Learn More | Free Trial

Control-M Application Integrator

Automate the critical workloads that power your business

Learn More

TrueSight Capacity Optimization for Hadoop

Plan and optimize Hadoop IT infrastructure resources

Learn More | Free Trial

BMC Discovery (formerly ADDM)

Identify and model your complete big data infrastructure

Learn More | Free Trial