The Ultimate Guide to Setting Up ClickHouse Database with Kubernetes on AWS

Introduction

In the boundless universe of database management, ClickHouse has emerged as a beacon of efficiency and speed for processing analytical data. This high-performance, column-oriented database management system (DBMS) is not just an option but a necessity for those looking to process queries with unprecedented speed and agility. Yet, just having ClickHouse is not enough. To truly harness its capabilities, the correct environment is pivotal, and this is where Kubernetes coupled with AWS comes into the picture.

If you have ventured into the realms of cloud computing, you might be familiar with Kubernetes, a platform designed to automate deploying, scaling, and operating application containers. It stands tall in the container orchestration space, thanks to its open-source nature and the robustness it brings to container management.

Meanwhile, AWS with its Elastic Kubernetes Service (EKS) has been the cornerstone of cloud services, providing a secure and manageable environment for Kubernetes. Combining these stalwarts not only streamlines the workflow but empowers your ClickHouse DBMS to operate at an optimum level, maintaining high availability and resilience to failures.

However, venturing into setting up a stateful application in this environment is not without its hurdles. As someone who has navigated these turbulent waters, I understand the intricacies involved. Through this blog, my endeavor is to be your compass, guiding you step by step in setting up a ClickHouse database with Kubernetes on AWS, navigating you through the theoretical landscape before delving into a detailed walkthrough.

In the upcoming sections, we will unravel the concepts and utilities of ClickHouse and Kubernetes individually before merging their forces. I will be sharing insights on leveraging the components of Kubernetes and AWS to create stateful applications and the path to setting up ClickHouse in this fortified environment.

So, join me as we embark on this enriching journey, equipping you with the knowledge and tools to set up a ClickHouse database that is not just efficient and swift but robust and streamlined, ready to take on the heavy-duty tasks with grace and agility.

Fasten your seatbelts as we are about to venture into a world where speed meets efficiency, delving deep into the realms of ClickHouse, Kubernetes, and AWS, setting a benchmark in database management. Welcome to the ultimate guide to setting up ClickHouse Database with Kubernetes on AWS.

What is ClickHouse?

Dive into the world of ClickHouse, an open-source column-oriented database management system optimized for speed and efficiency in analyzing extensive volumes of data. Let’s understand what sets ClickHouse apart:

  • Fast Query Processing: ClickHouse is built for rapid query processing. Thanks to its column-oriented design, it can handle large datasets at remarkable speeds, making OLAP workloads seamless and efficient.
  • Real-time Analytics: With ClickHouse, real-time analytics isn’t just a promise but a reality, enabling businesses to gain instant insights for informed decision-making. It stands tall in providing real-time analysis of live data, offering a real pulse on business dynamics.
  • Highly Scalable: Designed to grow with your needs, ClickHouse offers horizontal scalability that allows your database to stretch across numerous servers, ensuring performance, reliability, and fault tolerance.
  • Community and Ecosystem: ClickHouse boasts a vibrant community with a rich ecosystem of tools and integrations, ensuring continuous innovation and adaptation to the fast-paced changes in the tech landscape.
  • SQL Support: Offering support for a wide subset of SQL, ClickHouse provides familiar terrain for developers and data analysts, facilitating an easy transition while leveraging ClickHouse’s potent features (see the quick example after this list).
  • Open-source: Being open-source, it affords developers flexibility and transparency, fostering tailor-made solutions drawn from globally pooled wisdom and community contributions.
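
To make the SQL support concrete, here is a minimal sketch of trying ClickHouse locally with Docker before touching any AWS infrastructure. The container name, table, and columns are illustrative choices, not part of the setup that follows:

docker run -d --name clickhouse-demo -p 8123:8123 -p 9000:9000 clickhouse/clickhouse-server

# Create a column-oriented table backed by the MergeTree engine
docker exec clickhouse-demo clickhouse-client --query "CREATE TABLE page_views (ts DateTime, user_id UInt64, url String) ENGINE = MergeTree ORDER BY (user_id, ts)"

# Run a typical OLAP-style aggregation
docker exec clickhouse-demo clickhouse-client --query "SELECT url, count() AS hits FROM page_views GROUP BY url ORDER BY hits DESC LIMIT 10"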

As we gear up to explore the integration of ClickHouse with Kubernetes on AWS, it’s pivotal to appreciate ClickHouse not just as a tool but as a transformative solution in data management and analytics, promising speed, efficiency, and scalability.

Stay tuned as we delve deeper into synergizing ClickHouse with the robust framework of Kubernetes, enhanced by AWS security and scalability in the subsequent sections.

What is Kubernetes?

In a nutshell, Kubernetes is an open-source platform designed to automate deploying, scaling, and operating application containers. Initially conceived by Google, it’s now maintained by the Cloud Native Computing Foundation. Let’s shed light on its basic aspects and original design for maintaining stateless applications.

  • Orchestration System: At its core, Kubernetes is a container orchestration system, meaning it handles the automation of the deployment and scaling of applications housed in containers. It efficiently manages the workloads of containers based on user-defined parameters.
  • Master and Node Architecture: Kubernetes follows a master-node architecture, where the master coordinates and the nodes are the workers executing tasks. This structure promotes efficient, centralized control and distributed computing.
  • Pods and Services: In Kubernetes, applications run in units called “pods,” which are groupings of one or more containers. Pods are ephemeral, so for stable network communication, Services are used to provide a consistent virtual IP and DNS name in front of a set of pods (see the minimal example after this list).
  • Intended for Stateless Applications: Initially, Kubernetes was primed for stateless applications — apps that don’t store data from one session to the next. This design promoted agility and scalability as no data had to be retained, allowing resources to be freely allocated and reallocated as necessary without worrying about losing vital information.
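
As a minimal illustration of these building blocks, the following sketch defines a single-container pod and a Service that fronts it. The names, labels, and image are hypothetical and are only meant to show the shape of the objects:

apiVersion: v1
kind: Pod
metadata:
  name: hello-pod
  labels:
    app: hello
spec:
  containers:
    - name: hello
      image: nginx:1.25          # any container image would do here
      ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: hello-svc
spec:
  selector:
    app: hello                   # routes traffic to pods carrying this label
  ports:
    - port: 80
      targetPort: 80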

Kubernetes: Transition to Stateful Applications

Kubernetes was primarily geared towards managing stateless applications. Stateless applications handle requests without retaining any data from one session to the next, which makes them relatively straightforward to manage and scale.

The ecosystem has since evolved, and Kubernetes now supports stateful applications, that is, applications that save session data and require persistent storage, thanks to persistent volumes, StatefulSets, and other enhancements that allow data to be retained across pod restarts. This transition hasn’t been without its hurdles. Below, we delve into some significant concerns that developers encounter while establishing stateful applications on Kubernetes:

Concerns while Establishing Stateful applications on Kubernetes

  • Data Persistence: Ensuring data persistence in a system originally favoring ephemeral computing resources has been challenging. The ephemeral storage solutions native to Kubernetes are unsuited for long-term data preservation, necessitating robust storage solutions.
  • Complex Configuration: Stateful applications demand a multifaceted configuration process, requiring developers to meticulously manage configurations for StatefulSets, persistent volumes, and other resources, a task that requires a deep understanding of Kubernetes architecture.
  • Data Security: Handling sensitive data escalates security concerns. Implementing solid encryption, access controls, and backup strategies is vital but introduces complexity to the setup process.
  • Resource Allocation and Management: Efficient resource allocation for optimal performance while averting wastage is a critical task, demanding careful planning and dynamic management.
  • Application Portability: While Kubernetes facilitates portability for stateless applications seamlessly, it faces constraints with stateful applications, as transferring them between environments while retaining their state is a complex task.
  • Error Handling and Recovery: Creating mechanisms to handle failures automatically and recover without data loss or downtime adds a layer of complexity to the development process.
  • Upgrades and Rollbacks: Implementing upgrades or rollbacks without affecting the stored data remains a challenging task, involving meticulous planning and testing to avoid data corruption or loss.

In conclusion, while Kubernetes has made strides to support stateful applications, it presents a unique set of challenges that require expertise and thoughtful planning to navigate. Understanding both the opportunities and constraints of Kubernetes can help developers create robust stateful applications effectively.

Standard Infrastructure of an Application on Kubernetes Using EKS on AWS

In the dynamic landscape of cloud-native technologies, AWS EKS stands as a reliable and popular choice for orchestrating containerized applications using Kubernetes. Let us break down the standard infrastructure components that one would leverage to set up a basic full-stack application on AWS EKS.

  • Cluster Setup: The heart of your Kubernetes deployment is the cluster, which is a set of nodes, or machines, virtual or physical, where your containerized applications run. On EKS, you would generally start by setting up an EKS cluster, which entails choosing the right instance types, defining the roles and IAM policies, and configuring the networking settings.
  • Node Groups: After setting up your cluster, the next step is to configure your node groups, essentially grouping together EC2 instances to ensure your applications have the computational resources they require.
  • VPC and Subnet Configuration: A critical step in the setup is defining your VPC (Virtual Private Cloud) and subnet configurations, setting the groundwork for network isolation and control over the AWS resources that your Kubernetes applications can access.
  • Storage Solutions: Given that we are aiming to manage stateful applications, integrating robust storage solutions such as Amazon EBS (Elastic Block Store) would be essential to ensuring data persistence. However, EBS is not what we will be using while setting up ClickHouse in our infrastructure; we will dive deep into why in the next section.
  • Load Balancing: To ensure high availability and fault tolerance, AWS EKS supports integration with AWS Load Balancers, efficiently distributing traffic across several servers to prevent any single point of failure.
  • Pod Deployment: Pods are the smallest deployable units of computing that can be created and managed in Kubernetes. You would be deploying your applications in the form of pods, grouped logically based on their functions — frontend, backend, etc.
  • Service Configuration: Services in Kubernetes allow you to abstract away the underlying pod IPs, offering a single point of access to a group of pods, facilitating network communication and load distribution.
  • Auto-Scaling: To handle fluctuating loads, configuring auto-scaling groups can be a boon, helping in automatically adjusting the number of running instances based on the real-time demand.
  • Logging and Monitoring: Integrating AWS CloudWatch or other logging solutions would be crucial for monitoring the health of your applications and gaining insights through log analytics.
  • Security: Implementing role-based access control (RBAC) and setting up proper encryption ensures the security of your data and application, an indispensable component of any architecture.

In this schema of infrastructure, it’s evident that while EKS simplifies Kubernetes deployment, the full setup involves meticulous planning and configuration to bring a full-stack application to life efficiently and securely. The AWS ecosystem offers a variety of tools and integrations that can aid in setting up a resilient application, suited to both stateless and stateful paradigms.
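
As a concrete starting point for the cluster and node-group steps above, here is a hedged sketch of an eksctl ClusterConfig. The cluster name, region, instance type, and sizes are placeholder assumptions; adjust them to your environment:

# cluster.yaml (create with: eksctl create cluster -f cluster.yaml)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: clickhouse-eks          # hypothetical cluster name
  region: us-east-1             # pick the region that suits you
managedNodeGroups:
  - name: general-workers
    instanceType: m5.large
    desiredCapacity: 3
    minSize: 3
    maxSize: 6
    volumeSize: 50              # per-node root volume in GiB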

Leveraging Components of Kubernetes and AWS for Stateful Applications

To effectively manage stateful applications using Kubernetes on AWS, a strategic approach in selecting the appropriate components from both ecosystems can optimize your applications’ performance and reliability. Let’s delve into the pivotal role of AWS EFS in enhancing storage management, coupled with the advantages of using StatefulSets in Kubernetes for orchestrating stateful applications.

Why AWS EFS Over EBS?

While AWS EBS (Elastic Block Store) has been a go-to solution for storage, it isn’t without its limitations, especially when it comes to handling stateful applications on Kubernetes orchestrated through EKS. A prominent limitation is its confinement to a single availability zone: an EBS volume can only be attached to nodes running in the same AZ, which poses challenges in a multi-AZ or multi-region setup.

Here is where AWS EFS (Elastic File System) takes the stage as a more favorable solution. Unlike EBS, EFS supports multi-AZ and multi-region architectures, thereby ensuring a seamless connection between EKS nodes and persistent volumes regardless of their geographical placements. Moreover, EFS offers the following advantages:

  • Auto-Scaling: EFS automatically grows and shrinks its capacity as data is added or removed, eliminating the hassle of pre-provisioning storage space.
  • Data Consistency: EFS guarantees strong data consistency, offering reliable file operations and views, crucial for stateful applications.
  • Data Protection: EFS offers two data-protection features: automatic, reliable backups, and replication of file systems to another region or within the same region.

The above features make EFS a better candidate for persistent storage requirements.
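
For completeness, the EFS CSI driver also supports dynamic provisioning through a StorageClass backed by EFS access points. The walkthrough later in this guide uses a statically provisioned PersistentVolume instead, so treat the following as an optional sketch; the fileSystemId is a placeholder:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap      # provision an EFS access point per volume
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "700"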

StatefulSets: The Preferred Choice Over Deployments

When deploying applications on Kubernetes, it is common to utilize Deployments for managing stateless applications. However, when it comes to orchestrating stateful applications, StatefulSets emerge as a more fitting choice for several reasons:

  • Stable Unique Network Identifiers: Each pod in a StatefulSet maintains a sticky, unique identifier, enabling a predictable naming convention and facilitating easier discovery and communication between pods.
  • Persistent Pod Storage: StatefulSets ensure that the allocated storage volumes are retained, even in cases of pod rescheduling, thereby safeguarding against data loss.
  • Ordered Deployment and Scaling: Unlike Deployments, StatefulSets follow a defined order in the deployment and scaling of pods, ensuring that operations do not proceed to the next pod until the current one is successfully deployed and running.
  • Graceful Pod Termination: StatefulSets enable safe and graceful termination of pods, maintaining system reliability during updates and failures.

By judiciously employing AWS EFS for scalable and consistent storage solutions and opting for Kubernetes StatefulSets for orchestrating your applications, you can build a robust infrastructure tailored to the unique demands of stateful applications, leveraging the best from both AWS and Kubernetes.
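
To see how these StatefulSet properties look in practice, here is a generic sketch using volumeClaimTemplates, which gives every replica its own PersistentVolumeClaim (the ClickHouse walkthrough below takes a simpler route and mounts one shared EFS-backed claim instead). All names, the image, and the storage class are illustrative assumptions:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-sts
spec:
  serviceName: demo-headless     # headless Service providing stable pod DNS names
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: demo
          image: nginx:1.25
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:          # one PVC per pod: data-demo-sts-0, data-demo-sts-1, ...
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteMany
        storageClassName: efs-sc # or any storage class available in your cluster
        resources:
          requests:
            storage: 1Gi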

Setting Up ClickHouse on Kubernetes with AWS

In this hands-on section, we will walk you through the comprehensive steps involved in setting up ClickHouse on Kubernetes with AWS. Follow along as we set up necessary drivers, instances, and roles to ensure a seamless deployment:

Step 1: Set Up the EFS CSI Storage Driver on Your Kubernetes Cluster

Begin with the setup of aws-efs-csi-driver on your Kubernetes cluster using the following commands:

helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update aws-efs-csi-driver
helm upgrade --install aws-efs-csi-driver --namespace kube-system aws-efs-csi-driver/aws-efs-csi-driver

The above commands, along with details about the Helm chart, can be found in the kubernetes-sigs/aws-efs-csi-driver repository on GitHub.
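
Once the chart is installed, it is worth confirming the driver is actually running before moving on. Exact pod names vary with the chart version, but something along these lines should show the controller and node pods plus the registered CSIDriver object:

kubectl get pods -n kube-system | grep efs-csi
kubectl get csidriver efs.csi.aws.com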

Step 2: Establish an EFS Instance on AWS

Next, navigate to the AWS Management Console to create an EFS instance. Follow the guided wizard and take note of the FileSystemId, as it will be needed in subsequent steps. You can follow the official guide provided by AWS to understand the various configuration options and decide what best suits your needs.
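
If you prefer the command line over the console, the same result can be achieved with the AWS CLI. The tag value, subnet, and security group IDs below are placeholders; you need one mount target per subnet that your EKS nodes run in, and the security group must allow NFS (port 2049) from those nodes:

# Create the file system and capture its FileSystemId
aws efs create-file-system --encrypted --tags Key=Name,Value=clickhouse-efs --query 'FileSystemId' --output text

# Create a mount target in each node subnet (repeat per subnet)
aws efs create-mount-target --file-system-id fs-0123456789abcdef0 --subnet-id subnet-0123456789abcdef0 --security-groups sg-0123456789abcdef0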

Step 3: Create an IAM Role with EFS Access

Create an IAM role with a policy that grants access to EFS. Attach this policy to the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowDescribe",
      "Effect": "Allow",
      "Action": [
        "elasticfilesystem:DescribeAccessPoints",
        "elasticfilesystem:DescribeFileSystems",
        "elasticfilesystem:DescribeMountTargets",
        "ec2:DescribeAvailabilityZones"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowCreateAccessPoint",
      "Effect": "Allow",
      "Action": [
        "elasticfilesystem:CreateAccessPoint"
      ],
      "Resource": "*",
      "Condition": {
        "Null": {
          "aws:RequestTag/efs.csi.aws.com/cluster": "false"
        },
        "ForAllValues:StringEquals": {
          "aws:TagKeys": "efs.csi.aws.com/cluster"
        }
      }
    },
    {
      "Sid": "AllowTagNewAccessPoints",
      "Effect": "Allow",
      "Action": [
        "elasticfilesystem:TagResource"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "elasticfilesystem:CreateAction": "CreateAccessPoint"
        },
        "Null": {
          "aws:RequestTag/efs.csi.aws.com/cluster": "false"
        },
        "ForAllValues:StringEquals": {
          "aws:TagKeys": "efs.csi.aws.com/cluster"
        }
      }
    },
    {
      "Sid": "AllowDeleteAccessPoint",
      "Effect": "Allow",
      "Action": "elasticfilesystem:DeleteAccessPoint",
      "Resource": "*",
      "Condition": {
        "Null": {
          "aws:ResourceTag/efs.csi.aws.com/cluster": "false"
        }
      }
    }
  ]
}

Please note that the above policy mirrors the default AWS-managed EFS CSI policy and does not give fine-grained access; the role ends up with broad privileges that you may want to review and restrict before attaching the policy to your IAM role.
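
One convenient way to wire this policy to the service account that the StatefulSet will use (referenced later as [Your-Service-Account-Name]) is IAM Roles for Service Accounts via eksctl. The policy file name, cluster name, namespace, and service account name below are placeholder assumptions:

# Create the IAM policy from the JSON above
aws iam create-policy --policy-name ClickHouseEfsAccess --policy-document file://efs-access-policy.json

# Create an IAM role, bind it to a Kubernetes service account, and attach the policy
eksctl create iamserviceaccount \
  --cluster clickhouse-eks \
  --namespace default \
  --name clickhouse-sa \
  --attach-policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/ClickHouseEfsAccess \
  --approve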

Step 4: Attach the EFS to the PersistentVolume

Create PersistentVolume (PV) and PersistentVolumeClaim (PVC) manifests, referencing the EFS FileSystemId in the PV:

# pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  csi:
    driver: efs.csi.aws.com
    volumeHandle: [FileSystemId]

# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 5Gi

Apply this PersistentVolume and PersistentVolumeClaim using:

kubectl apply -f pv.yaml
kubectl apply -f pvc.yaml
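
Before moving on, it is a good idea to check that the claim has bound to the volume; both should report a Bound status:

kubectl get pv efs-pv
kubectl get pvc efs-pvc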

Step 5: Set Up a StatefulSet with a Volume Mount

Finally, create a StatefulSet YAML file that refers to the PersistentVolumeClaim (PVC) that we created earlier:

# statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: clickhouse-sts
spec:
  serviceName: "clickhouse"
  replicas: 3
  selector:
    matchLabels:
      app: clickhouse
  template:
    metadata:
      labels:
        app: clickhouse
    spec:
      serviceAccountName: [Your-Service-Account-Name]
      containers:
        - name: clickhouse
          image: yandex/clickhouse-server:latest
          volumeMounts:
            - mountPath: /var/lib/clickhouse/
              name: clickhouse-analytics
      volumes:
        - name: clickhouse-analytics
          persistentVolumeClaim:
            claimName: efs-pvc

Apply the StatefulSet configuration using:

kubectl apply -f statefulset.yaml
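
Note that the StatefulSet's serviceName field refers to a headless Service named clickhouse, which the manifest above does not create. A minimal sketch of one, assuming ClickHouse's default native port 9000 and HTTP port 8123, could look like this:

# clickhouse-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: clickhouse
spec:
  clusterIP: None                # headless: gives each pod a stable DNS name
  selector:
    app: clickhouse
  ports:
    - name: native
      port: 9000
    - name: http
      port: 8123

Apply it alongside the StatefulSet:

kubectl apply -f clickhouse-service.yaml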

This setup should configure ClickHouse adequately in your Kubernetes environment on AWS. Be sure to replace placeholders like [FileSystemId] and [Your-Service-Account-Name] with your actual details. You are now all set to explore the power of ClickHouse on a robust, scalable, and resilient infrastructure. Do test the setup and ensure everything is functioning as expected; a quick verification sketch follows below. Let us know if there are any specific topics you'd like to explore further!
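
As a quick verification sketch, you might check that all replicas are running and that ClickHouse answers queries from inside a pod (pod names follow the clickhouse-sts-<ordinal> pattern used by the StatefulSet above):

kubectl get statefulset clickhouse-sts
kubectl get pods -l app=clickhouse
kubectl exec -it clickhouse-sts-0 -- clickhouse-client --query "SELECT version()"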

Conclusion

We have successfully navigated through the essential steps involved in setting up ClickHouse on Kubernetes with AWS, leveraging the superior functionalities of AWS EFS over EBS and utilizing StatefulSets in Kubernetes to ensure a more robust and efficient setup.

The combination of ClickHouse, Kubernetes, and AWS promises scalable, resilient, and high-performance analytical infrastructure. Leveraging AWS EFS ensures your data is stored safely and can scale automatically, accommodating growing datasets while offering multi-region support. Incorporating Kubernetes StatefulSets, on the other hand, allows for easy scalability and management of stateful applications, providing a firm foundation for deploying applications like ClickHouse that require persistent storage.

I thoroughly enjoyed delving into the details of setting up ClickHouse on Kubernetes using AWS and sharing this knowledge with all of you. The journey of writing this blog post has been as enriching as implementing the setup itself, and I am eager to engage with fellow enthusiasts in the cloud-native space.

I welcome discussions on cloud-native technologies, Kubernetes, or tech in general. Feel free to connect with me on LinkedIn to share your thoughts or to discuss any potential collaborations.

To know more about my work and to stay updated with my latest projects, do visit my personal website. I am looking forward to connecting with many of you and learning from your experiences as well.

Thank you for joining me in this exploration, and I hope to connect with you soon!

Written by Karan Jagtiani

Software Engineer | Full Stack | DevOps
