Apache Flume - Interview Questions

What is Apache Flume?


Apache Flume is an open-source platform to efficiently and reliably collect, aggregate, and transfer massive amounts of data from one or more sources to a centralized data store. Data sources are customizable in Flume, so it can ingest any kind of data, including log data, event data, network data, social-media-generated data, email messages, message queues, etc.

What are the key components used in Flume data flow?


Flume flow has three main components - Source, Channel and Sink

Source: A Flume source is the component that consumes events and data from external sources such as a web server or a message queue. Flume sources are specialized to handle data from specific origins. For example, an Avro Flume source is used to ingest data from Avro clients, and a Thrift Flume source is used to ingest data from Thrift clients. You can also write custom Flume sources to ingest custom data; for example, you can write a Twitter Flume source to ingest Tweets.

Channel: Flume sources ingest data and store it in one or more channels. Channels are temporary stores that keep the data until it is consumed by Flume sinks.

Sink: Flume sinks remove the data stored in channels and put it into a central repository such as HDFS or Hive.

What is a Flume agent?


A Flume agent is a JVM process that hosts the components through which events flow from an external source to either the central repository or to the next destination. A Flume agent wires together the external sources, Flume sources, Flume channels, Flume sinks, and external destinations for each Flume data flow. The agent does this through a configuration file in which it maps the sources, channels, and sinks, and defines the properties of each component.
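As a sketch, a minimal single-agent configuration file might look like the following. The agent name a1 and the port are illustrative; the netcat source, memory channel, and logger sink are standard components that ship with Flume:

```properties
# Name the components of agent "a1"
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# A netcat source listening on a local port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# An in-memory channel to buffer events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# A logger sink that writes events to the log
a1.sinks.k1.type = logger

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

The agent can then be started with a command along the lines of bin/flume-ng agent --conf conf --conf-file example.conf --name a1.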

How is reliability of data delivery ensured in Flume?


Flume uses a transactional approach to guarantee the delivery of data. Events are removed from a channel only after they have been successfully stored in the terminal repository, for single-hop flows, or in the channel of the next agent, for multi-hop flows.
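The size of these channel transactions can be tuned. As an illustrative fragment (agent and channel names are assumptions), the memory channel's standard capacity and transactionCapacity properties control how many events the channel holds and how many events each put/take transaction moves:

```properties
a1.channels.c1.type = memory
# Maximum number of events held in the channel
a1.channels.c1.capacity = 10000
# Maximum number of events per put/take transaction
a1.channels.c1.transactionCapacity = 100
```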

How is recoverability ensured in Flume?


In Flume, events are staged in channels. Flume sources add events to channels; Flume sinks consume events from channels and publish them to terminal data stores. The channels manage recovery from failure. Flume supports different kinds of channels: in-memory channels store events in an in-memory queue, which is faster but not durable, while file channels are durable, backed by the local file system.
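The trade-off between the two channel types shows up directly in their configuration. Below is an illustrative fragment (agent name, channel names, and directory paths are assumptions) using the standard memory and file channel types:

```properties
# Memory channel: fast, but events are lost if the agent dies
a1.channels.cmem.type = memory
a1.channels.cmem.capacity = 10000

# File channel: durable, backed by the local file system
a1.channels.cfile.type = file
a1.channels.cfile.checkpointDir = /var/flume/checkpoint
a1.channels.cfile.dataDirs = /var/flume/data
```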


How do you install third-party plugins into Flume? OR Why do you need third-party plugins in Flume? OR What are the different ways you can install plugins into Flume?


Flume has a plugin-based architecture. It ships with many out-of-the-box sources, channels, and sinks. Many other customized components exist separately from Flume and can be plugged into Flume and used in your applications, or you can write your own custom components and plug them into Flume.

There are two ways to add plugins to Flume:

1. Add the plugin jar files to the FLUME_CLASSPATH variable in the flume-env.sh file.

2. Place the plugin in its own subdirectory of the plugins.d directory under the Flume installation; the flume-ng startup script automatically picks up plugins packaged in this format when the agent starts.
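As a sketch of the plugins.d approach, each plugin gets its own subdirectory containing lib, libext, and native subdirectories, as documented in the Flume user guide. The plugin name my-custom-source and the install path are hypothetical:

```shell
# Point FLUME_HOME at your real Flume install; a temp dir is used here
# only so the sketch runs standalone.
FLUME_HOME="${FLUME_HOME:-$(mktemp -d)}"
PLUGIN_DIR="$FLUME_HOME/plugins.d/my-custom-source"

# lib/    : the plugin's own jar(s)
# libext/ : the plugin's dependency jars
# native/ : any required native libraries (e.g. .so files)
mkdir -p "$PLUGIN_DIR/lib" "$PLUGIN_DIR/libext" "$PLUGIN_DIR/native"

# The plugin jar itself would then be copied into $PLUGIN_DIR/lib/
ls "$PLUGIN_DIR"
```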

What do you mean by consolidation in Flume? OR How do you ingest data from multiple sources into a single terminal destination?


Flume can be set up with multiple agents that process data from multiple sources and send it to one or a few intermediate destinations. Separate agents then consume events from the intermediate destinations and write them to a central data store.
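As an illustrative sketch of such a consolidation topology (agent names, hostname, port, and HDFS path are assumptions; each leaf agent's own source and channel definitions are omitted for brevity), leaf agents forward events through an Avro sink to an Avro source on a consolidating agent:

```properties
# --- Leaf agent (one per source machine): forward events over Avro ---
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = collector.example.com
agent1.sinks.avroSink.port = 4545
agent1.sinks.avroSink.channel = c1

# --- Consolidating agent: receive from all leaves, write to HDFS ---
collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4545
collector.sources.avroSrc.channels = c1

collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
collector.sinks.hdfsSink.channel = c1
```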

How do you check the integrity of file channels?


Flume provides a File Channel Integrity tool, which verifies the integrity of individual events in a file channel and removes corrupted events.

How do you handle agent failures?


If a Flume agent goes down, all flows hosted on that agent are aborted. Once the agent is restarted, the flows resume. If a flow uses an in-memory channel, then all events that were stored in the channel when the agent went down are lost. But channels set up as file channels or other durable channels will continue processing events where they left off.
