Apache Flink, developed by the Apache Software Foundation, is an open-source stream-processing and batch-processing framework that excels in performing stateful computations over bounded and unbounded data streams.
Apache Flink is highly scalable - Apache Flink applications can be distributed across multiple containers in a cluster, and thousands of Flink tasks can be executed in parallel across these containers in the cluster.
Apache Flink tasks have very low processing latencies - Apache Flink applications are optimized to use task state that is stored in memory and on local disk.
Bounded Streams | Unbounded Streams
---|---
Bounded streams are datasets that have a defined start and end. | Unbounded streams are datasets that have a defined start but no defined end.
Bounded streams can be processed by ingesting the complete dataset and then performing any computations. | Unbounded streams have to be processed continuously as new data comes in.
Bounded streams can be processed in any order, since the messages can be sorted as needed. | In most cases, unbounded streams have to be processed in the order in which messages are received.
Apache Flink excels in processing both unbounded and bounded streams.
Apache Flink can precisely control the time and state of its processes, enabling it to run any kind of application on unbounded streams.
Apache Flink processes bounded streams with high performance - by utilizing algorithms and data structures, provided internally by the Flink framework, that are specifically designed for bounded, fixed-size data sets.
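As a minimal sketch of the difference, the program below builds one bounded stream from a fixed set of elements and one unbounded stream from a socket; the hostname and port are placeholders for illustration.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedVsUnbounded {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded stream: all elements are known up front, so processing
        // finishes once they have been consumed.
        DataStream<Integer> bounded = env.fromElements(1, 2, 3, 4, 5);

        // Unbounded stream: the socket keeps producing lines, so the job
        // runs continuously until it is cancelled.
        DataStream<String> unbounded = env.socketTextStream("localhost", 9999);

        bounded.print();
        unbounded.print();

        env.execute("Bounded vs unbounded streams");
    }
}
```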
There are three key building blocks of streaming applications.
Streams - Streams are the fundamental building blocks of streaming applications. Streams can be categorized as follows:
Bounded and unbounded streams - Streams can be bounded (have a start and an end) or unbounded (have a start but no end).
Real-time and recorded streams - Streams can be processed in real time as the data comes in, or processed after the data is persisted to a storage system.
State - Most streaming applications that run business logic on the streams are stateful. Streaming applications have to store the state or intermediate results to be used for processing future data.
Time - Time is an important aspect of streaming applications, since each message or event is produced at a specific time. Many common stream computations are based on time - such as time-based joins, pattern detection, and window aggregations. All three building blocks appear together in even a small job, as the sketch below shows.
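In this sketch (the page-view data is made up for illustration), the job reads a stream, keeps a per-key running sum as managed state, and scopes the results to 10-second processing-time windows:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class BuildingBlocksExample {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Stream: a stream of (user, page-view count) events.
        DataStream<Tuple2<String, Integer>> pageViews = env.fromElements(
                Tuple2.of("alice", 1), Tuple2.of("bob", 1), Tuple2.of("alice", 1));

        pageViews
                .keyBy(view -> view.f0)
                // Time: results are scoped to 10-second processing-time windows.
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                // State: Flink keeps the running per-key sum for each open
                // window as managed state until the window fires.
                .sum(1)
                .print();

        env.execute("Streams, state, and time");
    }
}
```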
Flink framework provides the following features to handle state.
Data structure specific state primitives - Flink framework provides specific state primitives for different data structures such as lists and maps.
Pluggable state storages - Flink supports multiple pluggable state storage systems that store state in memory or on disk.
Exactly-once state consistency - Flink framework has checkpoint and recovery algorithms, which guarantee the consistency of state in case of failures.
Store large state data - Flink can store very large application state data, of several terabytes, due to its asynchronous and incremental checkpoint algorithm.
Scalable applications - Flink applications are highly scalable since the application state data can be distributed across containers.
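To make this concrete, here is a minimal sketch of Flink's keyed state primitives - a RichFlatMapFunction that keeps a per-key running sum in a ValueState. The class and state names are illustrative; the state itself is checkpointed by the framework and restored consistently after a failure.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class RunningSum extends RichFlatMapFunction<Long, Long> {

    // Per-key running sum, managed and checkpointed by Flink.
    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("runningSum", Types.LONG);
        sum = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Long value, Collector<Long> out) throws Exception {
        Long current = sum.value();
        long updated = (current == null ? 0L : current) + value;
        sum.update(updated);
        out.collect(updated);
    }
}
```

It would be applied to a keyed stream, for example stream.keyBy(value -> value % 10).flatMap(new RunningSum()).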
Flink framework provides the following features to handle time.
Event-time mode - Flink framework supports applications that process streams based on event-time mode, i.e. applications that process streams based on the timestamps of events.
Processing-time mode - Flink framework also supports applications that process streams based on processing-time mode, i.e. applications that process streams based on the clock time of the processing machine.
Watermark support - Flink framework provides watermark support in the processing of streams based on event-time mode.
Late data processing - Flink framework supports the processing of events that arrive late, after a related computation has already been performed.
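The sketch below shows these time-handling features working together: event-time mode, a watermark strategy that tolerates events up to 5 seconds out of order, and a window that accepts data arriving up to one minute late. The SensorReading type and the sample values are made up for illustration.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeExample {

    // Hypothetical event type used only for this illustration.
    public static class SensorReading {
        public String sensorId;
        public long timestampMillis;
        public double value;

        public SensorReading() {}

        public SensorReading(String sensorId, long timestampMillis, double value) {
            this.sensorId = sensorId;
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<SensorReading> readings = env.fromElements(
                new SensorReading("s1", 1_000L, 10.0),
                new SensorReading("s1", 3_000L, 20.0),
                new SensorReading("s1", 2_000L, 30.0)); // out of order

        readings
            // Event-time mode: timestamps come from the events themselves,
            // and watermarks tolerate events up to 5 seconds out of order.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((reading, ts) -> reading.timestampMillis))
            .keyBy(reading -> reading.sensorId)
            .window(TumblingEventTimeWindows.of(Time.seconds(10)))
            // Late data handling: the window also accepts events that arrive
            // up to one minute after the watermark passes the window end.
            .allowedLateness(Time.minutes(1))
            .sum("value")
            .print();

        env.execute("Event time and watermarks");
    }
}
```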
Flink framework provides three categories of APIs that target different use cases.
ProcessFunctions - ProcessFunctions provide fine-grained control over time and state, and have the most expressive capabilities among the different categories of APIs provided by Flink framework.
DataStream API - Flink's DataStream API provides many common stream processing operations, and is based on functions such as map(), reduce(), aggregate(), etc.
SQL and Table API - Flink framework provides two relational APIs, SQL and Table APIs, that allow composition of relational queries using operators such as selections, filters, and joins.
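As an illustration of the two relational APIs, the sketch below (assuming the flink-table-api-java and planner dependencies are on the classpath; the table data is made up) expresses the same aggregation once with the Table API and once as SQL:

```java
import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.types.Row;

public class RelationalApisExample {
    public static void main(String[] args) {
        TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // A small in-memory table, made up for illustration.
        Table orders = tableEnv.fromValues(
                DataTypes.ROW(
                        DataTypes.FIELD("name", DataTypes.STRING()),
                        DataTypes.FIELD("amount", DataTypes.INT())),
                Row.of("alice", 10),
                Row.of("bob", 20),
                Row.of("alice", 5));

        // Table API: selections, filters, and aggregations as method calls.
        Table viaTableApi = orders
                .filter($("amount").isGreater(7))
                .groupBy($("name"))
                .select($("name"), $("amount").sum().as("total"));

        // The equivalent query expressed in SQL.
        tableEnv.createTemporaryView("Orders", orders);
        Table viaSql = tableEnv.sqlQuery(
                "SELECT name, SUM(amount) AS total FROM Orders "
                        + "WHERE amount > 7 GROUP BY name");

        viaTableApi.execute().print();
        viaSql.execute().print();
    }
}
```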
Flink framework is suitable for developing the following kinds of applications.
Event-driven applications - Event-driven applications ingest events from one or more event streams and react to them - either performing computations, updating state, or triggering external actions.
Data analytics applications - Data analytics applications ingest data from one or more data streams, and perform analytics on the data to derive insights. Data can be analyzed in a batch mode, or in a real-time streaming mode.
Data pipeline applications - Data pipeline applications ingest data from one or more streams, transform and enrich the data, and move the data to a storage system such as a data warehouse system.
A Flink streaming application consists of four key programming constructs.
1. Stream execution environment - Every Flink streaming application requires an environment in which the streaming program is executed. Flink framework provides the class org.apache.flink.streaming.api.environment.StreamExecutionEnvironment as the environment context in which the streaming program is executed. Flink framework provides sub-classes such as LocalStreamEnvironment, which executes the program in a local JVM, and RemoteStreamEnvironment, which executes the streaming program on a remote environment. Usually you just have to call the getExecutionEnvironment() method on StreamExecutionEnvironment, which returns the appropriate environment based on the context.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
2. Data sources - Data sources are applications or data stores from which the Flink application reads its input data.
In a Flink application you attach a data source to the stream execution environment by using the addSource() method - env.addSource(sourceFunction).
Flink framework comes with support for several built-in data sources, such as ...
env.readFile(...)
env.socketTextStream(...)
env.fromCollection(...)
env.addSource(new FlinkKafkaConsumer<>(...))
3. Data streams and transformation operations - Input data from data sources are read by the Flink application in the form of data streams. Flink framework provides the org.apache.flink.streaming.api.datastream.DataStream class, on which operations can be applied to transform the data.
Flink framework provides many transformation functions that can be applied on the data stream such as ...
dataStream.keyBy(...)
keyedStream.reduce(...)
4. Data sinks - The transformed data from data streams are consumed by data sinks which output the data to external systems. Flink framework comes with support for a variety of built-in output formats such as ...
dataStream.writeAsText()
dataStream.writeToSocket(...)
dataStream.addSink(...)
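Putting the four constructs together, here is a minimal end-to-end sketch - a word count that reads lines from a socket (the hostname and port are placeholders), transforms the stream, and writes the results to a print sink:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class SocketWordCount {
    public static void main(String[] args) throws Exception {
        // 1. Stream execution environment.
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Data source: lines of text read from a socket.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // 3. Transformations: split lines into words, then count per word.
        DataStream<Tuple2<String, Integer>> counts = lines
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.split("\\s+")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                // The type hint is needed because generics are erased from lambdas.
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(count -> count.f0)
                .sum(1);

        // 4. Data sink: print the running counts to stdout.
        counts.print();

        env.execute("Socket word count");
    }
}
```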
JobManager and TaskManager are the main components of Flink.
JobManager - The Flink client triggers the execution of a Flink application by preparing a dataflow and sending it to the JobManager. The JobManager is responsible for coordinating the distributed execution of Flink applications - deciding when to schedule tasks, reacting to finished tasks, managing and coordinating recovery from task failures, coordinating checkpoints, etc.
JobManager contains three different components - the ResourceManager, the Dispatcher, and the JobMaster.
TaskManager - TaskManager is responsible for executing the tasks of the dataflow. There can be one or more TaskManagers. The JobManager connects to the TaskManagers and assigns them tasks to execute.
A TaskManager contains one or more task slots, with each task slot executing a task (consisting of one or more sub-tasks) in a separate thread.
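To relate this to application code, the parallelism of a job determines how many sub-tasks must be placed into the task slots offered by the TaskManagers. A minimal sketch (the parallelism value of 4 is arbitrary):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismExample {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Run each operator with 4 parallel sub-tasks. Scheduling this job
        // requires at least 4 free task slots across the TaskManagers
        // (e.g. 2 TaskManagers configured with taskmanager.numberOfTaskSlots: 2).
        env.setParallelism(4);

        env.fromElements("a", "b", "c")
                .map(String::toUpperCase)
                .print();

        env.execute("Parallelism example");
    }
}
```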