Stackable Operator for Apache Spark

This operator manages Apache Spark applications on Kubernetes clusters. Apache Spark is a powerful open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory processing and fault-tolerant architecture make it ideal for a variety of use cases, including batch processing, real-time streaming, machine learning, and graph processing.

Getting started

Follow the Getting started guide to get started with Apache Spark using the Stackable operator. The guide will lead you through the installation of the operator and running your first Spark application on Kubernetes.

How the operator works

This operator manages SparkApplication custom resources which you use to define your applications. The operator creates the relevant Kubernetes resources for the job to run.

Custom resources

The operator manages two custom resource kinds: The SparkApplication and the SparkHistoryServer.

The SparkApplication resource is the main point of interaction with the operator. Unlike other Stackable operator custom resources, the SparkApplication does not have roles. An exhaustive list of options is given in the SparkApplication CRD reference .

The SparkHistoryServer has a single node role. It is used to deploy a Spark history server that displays application logs from S3 buckets. Of course, your applications need to write their logs to the same buckets.

Kubernetes resources

For every SparkApplication deployed to the cluster the operator creates a Job, A ServiceAccout and a few ConfigMaps.

The Job runs spark-submit in a Pod which then creates a Spark driver Pod. The driver creates its own Executors based on the configuration in the SparkApplication. The Job, driver and executors all use the same image, which is configured in the SparkApplication resource.

The two main ConfigMaps are the <name>-driver-pod-template and <name>-executor-pod-template which define how the driver and executor Pods should be created.

The Spark history server deploys like other Stackable-supported applications: A Statefulset is created for every role group. A role group can have multiple replicas (Pods). A ConfigMap supplies the necessary configuration, and there is a service to connect to.

RBAC

The Spark-Kubernetes RBAC documentation describes what is needed for spark-submit jobs to run successfully: minimally a role/cluster-role to allow the driver pod to create and manage executor pods.

However, to add security each spark-submit job launched by the operator will be assigned its own ServiceAccount.

During the operator installation, a cluster role named spark-k8s-clusterrole is created with pre-defined permissions.

When a new Spark application is submitted, the operator creates a new service account with the same name as the application and binds this account to the cluster role spark-k8s-clusterrole .

Integrations

You can read and write data from s3 buckets, load custom job dependencies. Spark also supports easy integration with Apache Kafka which is also supported on the Stackable Data Platform. Have a look at the demos below to see it in action.

Demos

The data-lakehouse-iceberg-trino-spark demo connects multiple components and datasets into a data Lakehouse. A Spark application with structured streaming is used to stream data from Apache Kafka into the Lakehouse.

In the spark-k8s-anomaly-detection-taxi-data demo Spark is used to read training data from S3 and train an anomaly detection model on the data. The model is then stored in a Trino table.

Supported versions

The Stackable operator for Apache Spark on Kubernetes currently supports the Spark versions listed below. To use a specific Spark version in your SparkApplication, you have to specify an image - this is explained in the Product image selection documentation. The operator also supports running images from a custom registry or running entirely customized images; both of these cases are explained under Product image selection as well.

3.5.2 (Hadoop 3.3.4, Scala 2.12, Python 3.11, Java 17) (LTS)
3.5.1 (Hadoop 3.3.4, Scala 2.12, Python 3.11, Java 17)

Useful links

The spark-k8s-operator GitHub repository
The operator feature overview in the feature tracker
The SparkApplication and SparkHistorServer CRD documentation