Stackable Operator for Apache Airflow
The Stackable Operator for Apache Airflow manages Apache Airflow instances on Kubernetes. Apache Airflow is an open-source application for creating, scheduling, and monitoring workflows. Workflows are defined as code, with tasks that can be run on a variety of platforms, including Hadoop, Spark, and Kubernetes itself. Airflow is a popular choice to orchestrate ETL workflows and data pipelines.
Getting started
Get started using Airflow with the Stackable Operator by following the Getting started guide. It guides you through installing the Operator alongside a PostgreSQL database and Redis instance, connecting to your Airflow instance and running your first workflow.
Custom resources
The AirflowCluster is the resource for the configuration of the Airflow instance.
The resource defines three roles: webserver, worker and scheduler (the worker role is embedded within spec.celeryExecutors: this is described in the next section).
The various configuration options are explained in the Usage guide.
It helps you tune your cluster to your needs by configuring resource usage, security, logging and more.
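As a rough illustration of how the three roles fit into one resource, the following Python sketch builds a manifest as a dictionary and dumps it as JSON. Only the kind AirflowCluster, the three roles and spec.celeryExecutors come from this page. The API version, the keys webservers, schedulers, roleGroups, replicas and image.productVersion, and the cluster name airflow are assumptions made for illustration. The sketch is not a complete manifest, so check every field against the AirflowCluster CRD documentation before using it.

```python
import json

# Hypothetical AirflowCluster manifest, kept deliberately minimal.
# Field names other than "kind" and "celeryExecutors" are assumptions.
airflow_cluster = {
    "apiVersion": "airflow.stackable.tech/v1alpha1",  # assumed API group/version
    "kind": "AirflowCluster",
    "metadata": {"name": "airflow"},  # hypothetical cluster name
    "spec": {
        "image": {"productVersion": "2.9.3"},  # assumed field name
        # One role group named "default" per role (names and counts are illustrative).
        "webservers": {"roleGroups": {"default": {"replicas": 1}}},
        "schedulers": {"roleGroups": {"default": {"replicas": 1}}},
        # The worker role is expressed through celeryExecutors.
        "celeryExecutors": {"roleGroups": {"default": {"replicas": 2}}},
    },
}

# Serialize to JSON so the structure can be inspected or saved to a file.
manifest_json = json.dumps(airflow_cluster, indent=2)
print(manifest_json)
```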
Executors
The worker role is deployed when spec.celeryExecutors is specified. The alternative is spec.kubernetesExecutors, whereby pods are created dynamically as needed without jobs being routed through a Redis queue to the workers.
This means that for kubernetesExecutors there is a single implicit role that does not appear in the resource definition.
This is illustrated below:
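The same relationship can be sketched in a few lines of Python. The helper deployed_roles is hypothetical and not part of the operator. It assumes that exactly one of the two executor settings is present in the spec, which this page implies by calling them alternatives but does not state outright.

```python
def deployed_roles(spec: dict) -> list[str]:
    """Return the roles that appear in the resource definition for a spec."""
    has_celery = "celeryExecutors" in spec
    has_kubernetes = "kubernetesExecutors" in spec
    # Assumption: the two executor settings are mutually exclusive.
    if has_celery == has_kubernetes:
        raise ValueError(
            "specify exactly one of celeryExecutors or kubernetesExecutors"
        )
    roles = ["webserver", "scheduler"]
    if has_celery:
        # Only the Celery setup deploys an explicit worker role.
        roles.append("worker")
    return roles


print(deployed_roles({"celeryExecutors": {}}))
print(deployed_roles({"kubernetesExecutors": {}}))
```

With kubernetesExecutors the list contains no worker entry, because executor pods are created on demand rather than deployed as a role.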
Kubernetes resources
Based on the custom resources you define, the Operator creates ConfigMaps, StatefulSets and Services.
The diagram above depicts all the Kubernetes resources created by the operator, and how they relate to each other.
For every role group you define, the Operator creates a StatefulSet with the number of replicas defined in the RoleGroup. Every Pod in the StatefulSet has two containers: the main container running Airflow and a sidecar container gathering metrics for Monitoring. The Operator creates a Service per role group as well as a single Service for the whole webserver role called <clustername>-webserver.
Additionally, a ConfigMap is created for each RoleGroup. These ConfigMaps contain two files, log_config.py and webserver_config.py, which contain logging and general Airflow configuration respectively.
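To make the counting concrete, the Python sketch below lists the resources you would expect for a given set of role groups. The function expected_resources is hypothetical. The name <clustername>-webserver for the role-level Service comes from this page, while the per-role-group pattern <clustername>-<role>-<rolegroup> is an assumption used only for illustration.

```python
def expected_resources(
    cluster: str, roles: dict[str, list[str]]
) -> dict[str, list[str]]:
    """List the resources expected for the given roles and role groups."""
    resources: dict[str, list[str]] = {
        "StatefulSet": [],
        "Service": [],
        "ConfigMap": [],
    }
    for role, role_groups in roles.items():
        for group in role_groups:
            # Assumed naming pattern for per-role-group resources.
            name = f"{cluster}-{role}-{group}"
            resources["StatefulSet"].append(name)
            resources["Service"].append(name)
            resources["ConfigMap"].append(name)
    if "webserver" in roles:
        # One additional Service for the whole webserver role.
        resources["Service"].append(f"{cluster}-webserver")
    return resources


result = expected_resources(
    "airflow",
    {"webserver": ["default"], "scheduler": ["default"]},
)
for kind, names in result.items():
    print(kind, names)
```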
Required external components
Airflow requires an SQL database in which to store its metadata as well as Redis for job execution. The required external components page lists all supported databases and Redis versions to use in production. You need to provide these components yourself for production use, but the Getting started guide walks you through installing an example database and Redis instance alongside an Airflow instance that you can use to get started.
Redis is only needed when spec.celeryExecutors is used, because jobs are queued via Redis before being assigned to a worker pod. With spec.kubernetesExecutors, the scheduler takes direct responsibility for this.
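The rule can be written down as a small check, for example to validate a deployment plan before applying it. The function required_external_components and the component labels it returns are hypothetical and used only for illustration.

```python
def required_external_components(spec: dict) -> list[str]:
    """Return the external components a cluster spec depends on."""
    # The SQL metadata database is always required.
    components = ["sql-database"]
    if "celeryExecutors" in spec:
        # Celery workers receive their jobs through a Redis queue.
        components.append("redis")
    return components


print(required_external_components({"celeryExecutors": {}}))
print(required_external_components({"kubernetesExecutors": {}}))
```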
Using custom workflows/DAGs
Directed acyclic graphs (DAGs) of tasks are the core entities you will use in Airflow. Have a look at the page on Mounting DAGs to learn about the different ways of loading your custom DAGs into Airflow.
Demo
You can install the airflow-scheduled-job demo and explore an Airflow installation, as well as how it interacts with Apache Spark.
Supported versions
The Stackable Operator for Apache Airflow currently supports the Airflow versions listed below. To use a specific Airflow version in your AirflowCluster, you have to specify an image - this is explained in the Product image selection documentation. The operator also supports running images from a custom registry or running entirely customized images; both of these cases are explained under Product image selection as well.
- 2.9.3 (LTS)
- 2.9.2 (deprecated)
Useful links
- The airflow-operator GitHub repository
- The operator feature overview in the feature tracker
- The AirflowCluster CRD documentation