Apache Spark Streaming - Interview Questions

What is Apache Spark Streaming?

FAQ

Spark Streaming is a library provided in Apache Spark for processing live data streams that is scalable, has high-throughput and is fault-tolerant. Spark Streaming can ingest data from multiple sources such as Kafka, Flume, Kinesis or TCP sockets; and process this data using complex algorithms provided in the Spark API including algorithms provided in the Spark MLlib and GraphX libraries. Processed data can be pushed to live dashboards, file systems and databases.

Describe how Spark Streaming processes data?

FAQ

Apache Spark Streaming component receives live data streams from input sources such as Kafka, Flume, Kinesis etc. and divides them into batches. The Spark engine processes these input batches and produces the final stream of results in batches.

What are DStreams?

FAQ

DStreams, or discretized streams, are high-level abstractions provided in Spark Streaming that represents a continuous stream of data. DStreams can be either created from input sources such as Kafka, Flume or Kinesis; or by applying high-level operations on existing DStreams.

Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.

What is a StreamingContext object?

FAQ

StreamingContext object represents the entry point to a Spark streaming application and contains the batch interval for the Spark streaming program and the SparkConf object. The SparkConf object and the batch interval are set when creating a new instance of the StreamingContext object.

What are the two different types of built-in streaming sources provided by Spark Streaming?

FAQ

Spark Streaming provides two different categories of built-in streaming sources.

Basic Sources - These sources are directly available in the StreamingContext API. Example: file systems and socket connections.

Advanced Sources - These are sources such as Kafka, Flume etc. that are provided in the Spark Streaming library through extra utility classes.

Big Data Interview Guide has over 150+ interview questions and answers. Get the guide for $44.95 only.

BUY EBOOK

What are some of the common transformations on DStreams supported by Spark Streaming ?

FAQ

map(func) - map() transformation returns a new distributed dataset from a source dataset formed by passing each element of the source through a function func.

filter() - filter() transformation returns a new distributed dataset from a source dataset formed by selecting the elements of the source on which func returns true.

flatMap() - flatMap() transformation is similar to map() function. In flatMap() each input item can be mapped to 0 or more output items.

union() - union() transformation returns a new dataset that contains the union of the elements in the source dataset and the dataset that is passed as argument to the function.

intersection() - intersection() transformation returns a new distributed dataset that contains the intersection of elements in the source dataset and the dataset that is passed as argument to the function.

distinct() - distinct() transformation returns a new distributed dataset that contains the distinct elements of the source dataset.

groupByKey() - groupByKey(func) transformation called on a dataset of (K, V) pairs and returns a dataset of (K, Iterable) pairs.

reduceByKey(func) - reduceByKey(func) transformation is called on a dataset of (K, V) pairs, and returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.

aggregateByKey(func) - aggregateByKey() transformation is called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral zero value.

sortByKey() - sortByKey() transformation is called on a dataset of (K, V) pairs which returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

What are the output operations that can be performed on DStreams?

FAQ

Output operations on DStreams pushes the DStream's data to external systems like a database or a file system. Following are the key operations that can be performed on DStreams.

saveAsTextFiles() - Saves the DStream's data as text file.

saveAsObjectFiles() - Saves the DStreams data as serialized Java objects.

saveAsHadoopFiles() - Saves the DStream's data as Hadoop files.

foreachRDD() - A generic output operator that applies a function, func, to each RDD generated from the DStream.