Microservice architectures are all the rage, and for good reason: small, lightweight, business-focused and independently deployable services provide scalability and isolation that are hard to achieve in monolithic systems.
However, adding and scaling lots of tiny processes, whether in containers or not, can pose challenges even with Kafka as the central data hub of your organisation. At DataMountaineer we are big fans of KStreams and Kafka Connect, but as the number of deployments grows:
- how do you integrate with your CI/CD pipeline?
- how do you ensure your design-time (provenance) topology is deployed and running?
- how do you monitor and attach your lineage to this topology?
- how do you attach different monitoring and alerting criteria?
- how do you promote different flows to production independently?
Even if you are cool and use Docker, your landscape can still be complex: handling multi-tenancy, inspecting and managing Dockerfiles, handling service discovery, environment variables and promotion to production.
Microservices help promote isolation, but we often find we need to deploy complete pipelines: for example, a Twitter feed with a Kafka Connect source tracking specific terms, one or two KStreams processors to manipulate the data, and a Kafka Connect Cassandra sink to write the results to Cassandra.
We could deploy the connectors into one Kafka Connect cluster, but then we lose separation of concerns, and promoting different flows to production independently becomes difficult. Kafka is still great here: all data flows in and out of Kafka, we keep our tech stack small (Kafka + Connect + Streams) and back it with data governance via the Schema Registry. So with this in mind, and being super cool, we helped Eneco create the Landscaper!
Kafka Connect and KStreams play well in containers: all state is stored or backed up in Kafka, so Eneco started moving off virtual machines onto Kubernetes. In doing so we set out with some goals in mind about how to manage the dataflows that were to be deployed:
- Have a blueprint of what the landscape (apps in the cluster) looks like;
- Keep track of changes: when, what, why and by whom;
- Allow others to review changes before applying them;
- Let the changes be promoted to specific environments.
This resulted in the Landscaper, which takes a repository containing a desired-state description of the landscape and eliminates the difference between the desired and actual state of releases in a Kubernetes cluster.
How it works
The Landscaper uses Helm charts behind the scenes. Helm is a package manager for Kubernetes which lets you set out in configuration which image you want, the container specs, the application environment and, importantly, the Kubernetes labels and annotations that allow for metric and monitoring scraping.
For example, here's a chart for our Cassandra sink.
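A minimal sketch of what the values.yaml of such a chart might contain; the image name, tag, settings and labels below are illustrative assumptions, not the exact chart Eneco uses:

```yaml
# Illustrative values.yaml for a Cassandra sink chart (names are assumptions)
image:
  repository: datamountaineer/kafka-connect-cassandra  # assumed image name
  tag: "0.2.4"
replicaCount: 1
env:
  CONNECT_GROUP_ID: cassandra-sink-worker
  CONNECT_BOOTSTRAP_SERVERS: broker.kafka:9092
labels:
  pipeline: twitter-to-cassandra   # scraped for monitoring and lineage
```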
We base our Docker images on Confluent's base connector image. This contains a script that uses the environment variables starting with "CONNECT_" to create the Kafka Connect worker property files. We added a second script that uses the environment variables starting with "CONNECTOR_" to create a properties file for the actual connector we want to start. Here's the magic for the worker configuration:
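A minimal sketch of the idea, not Confluent's actual script: each prefixed environment variable is stripped of its prefix, lowercased, and has underscores replaced by dots to form a properties key (so `CONNECT_GROUP_ID=demo` becomes `group.id=demo`). The function name and output path are illustrative.

```shell
# Hedged sketch: turn PREFIX_-style environment variables into a
# Java properties file, e.g. CONNECT_GROUP_ID=demo -> group.id=demo.
write_props() {
  prefix="$1"   # e.g. CONNECT or CONNECTOR
  out="$2"      # target properties file
  : > "$out"    # truncate/create the output file
  for var in $(env | grep "^${prefix}_" | cut -d= -f1); do
    # strip the prefix, lowercase, and turn underscores into dots
    key=$(echo "${var#${prefix}_}" | tr '[:upper:]' '[:lower:]' | tr '_' '.')
    printf '%s=%s\n' "$key" "$(printenv "$var")" >> "$out"
  done
}

# Example: build the worker config from CONNECT_ variables
write_props CONNECT /tmp/worker.properties
```

Note that `^CONNECT_` deliberately does not match `CONNECTOR_`-prefixed variables, so the two scripts can coexist in one container.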
The same applies to the connector configuration, except there we look for environment variables beginning with "CONNECTOR_".
The desired landscape is described via YAML files. These describe the charts to use, override any values, and attach required metadata such as the pipeline, which is then scraped and combined with the JMX reporting from Kafka + Connect + KStreams in Prometheus and Grafana. An example landscape may look like this:
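For instance, a landscape repository might be laid out as follows; the directory names here are illustrative:

```
landscape/
├── service/      # Kubernetes service definitions, e.g. HAProxy to Kafka
├── components/   # YAML files describing the charts and components to deploy
└── current/      # populated by the Landscaper at run time from service/
```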
This directory structure contains three subdirectories:
- the services we want to deploy, e.g. HAProxy to Kafka, Cloudera, HDFS, Azure DocumentDB;
- YAML files describing the components and charts we want to deploy;
- a directory which the Landscaper uses to set up the services; it copies the service YAMLs from the service directory to here at run time.
Below is an example landscape file for a Cassandra sink connector instance.
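A hedged sketch of what such a file might look like; the chart reference, versions and configuration keys are illustrative assumptions rather than the exact format of Eneco's files:

```yaml
# Illustrative landscape file for a Cassandra sink (values are assumptions)
name: cassandra-sink
release:
  chart: connectors/cassandra-sink:0.1.0
  version: 1.0.0
configuration:
  connector:
    class: com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
    topics: twitter-enriched
  cassandra:
    contact-points: cassandra.default.svc
```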
The Landscaper gathers the settings in the configuration section of the YAML and applies them to the values.yaml of the specified Helm chart. This in turn sets the environment variables in the Docker container, from which the scripts outlined earlier build the configuration files. For any secrets set up in the values.yaml of the Helm chart, the Landscaper pulls the values from environment variables.
The Landscaper has two modes, dry-run and apply. All changes go through merge requests, and for each branch the Landscaper is run in dry-run mode. This checks, for example, that the charts, secrets and images exist, and also performs linting. On a successful run and acceptance of the merge request, the Landscaper applies the desired state to the Kubernetes cluster. This eliminates any difference and ensures we only have our design-time flows deployed; rogue producers, for example, are easily spotted, identified and removed.
Even though all these services are deployed separately and conform to a microservices architecture, there is still a rich amount of data available at design time from the Landscaper. From this we can build a picture of the topologies or dataflows in our system. We can join the input and output topics of connectors, KStreams or any application deployed via the Landscaper. Since we can tag and annotate the flows, we can provide monitoring. More importantly, the monitoring can be mapped back and overlaid onto the design-time provenance we get from the Landscaper.
Now we have the ability to answer compliance questions: who, what, where, when, why? At the same time we can deploy quickly in a repeatable, scalable manner, allowing us to:
- Correlate our runtime metrics, monitoring and logs (our lineage) back to our design-time provenance: is the deployed landscape what we designed?
- Identify and control rogue or non-compliant processes.
- Provide an audit trail via the CI/CD and show due diligence in deploying flows.
The extra bonus with Kafka Connect is the large coverage of sources and sinks for the various data feeds and stores. Once the Helm charts are written, we can concentrate on simply configuring the landscape and deploying to Kubernetes in the last step of the CI/CD pipeline.