Design and Implementation of an On-Premise Cloud Native Lakehouse with Use-Case Example

Silas Jung
22. December 2023
Reading time: 8 min

Introduction

In this blog post, the design and implementation of an On-Premise Cloud Native Lakehouse developed by evoila is discussed. The system is open source and can be found on GitHub. It was created for the use case of earthquake data analysis but can be modified for other applications as well. Throughout this blog post, the earthquake monitoring system use case is referred to by the abbreviation EQMS.

Earthquakes, like other natural disasters, are an integral part of our planet. Their causes vary, but most often they result from the movement of tectonic plates in the Earth’s uppermost layer. The damage earthquakes cause to people and infrastructure is sometimes enormous: experts estimate the cost of the 2023 earthquake events in Turkey and Syria at up to four billion dollars, not to mention the loss of human life.

For this reason, early detection of such events is an important research area. To collect and analyze data reliably, a basic system is needed that can be upgraded with additional functionalities. It must provide important characteristics such as scalability and real-time processing while also being highly fault-tolerant. 

Architecture

For such a critical application to be implemented reliably, the right system architecture and components must be chosen from the design stage. The system should be fault-tolerant and scalable, while remaining independent of the underlying hardware or cloud infrastructure. For these reasons, the EQMS follows the microservice model. In this type of architecture, applications are organized as small, independent services that communicate with each other to provide the overall functionality. The individual applications are encapsulated in containers, which can be connected in a network to form a larger, distributed system. To control the lifecycle and environment of the containers, the container orchestration software Kubernetes is used.

All applications run in a Kubernetes cluster; the image shown below illustrates the structure of the EQMS. To make actions such as deploying and updating the system on the cluster quick and easy to handle, the EQMS is packaged into a Helm chart using the Helm package manager.
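For illustration, installing such a chart directly with Helm would look roughly like the following; the repository URL and chart name are placeholders rather than the actual project values (the hands-on section below deploys the chart via ArgoCD instead):

helm repo add eqms https://example.com/eqms-charts
helm install eqms eqms/eqms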

The EQMS can be divided into five subgroups:

Kubernetes Architecture
  • Data Acquisition: To extract earthquake data from different sources, such as the USGS REST API, the ETL software Apache NiFi is used. Data acquired by NiFi is forwarded to the streaming platform Kafka, which serves as a highly available buffer between data acquisition and persistent storage.
  • Data Storage: The retrieved data must then be stored persistently. For this purpose, the S3-compatible MinIO object storage is used, where unstructured data can be stored and prepared for further processing.
  • Data Preparation: With Apache Spark, the raw data is written into a Delta table via the Delta Lake storage layer, using a Spark script written in Python (a minimal sketch of this step follows the list). The Delta table adds features such as ACID transactions and time travel on top of the MinIO object storage. The Hive Metastore stores the table schema for accessors.
  • Visualization: To support meaningful analyses and predictions, an intuitive dashboard with the most important data must exist. Apache Superset offers a wide range of visualization tools. The data for the charts is retrieved from the Delta table in the MinIO object storage using Trino, a query engine built to handle high query rates.
  • Continuous Delivery: When the system is updated, changes are automatically applied to the cluster by ArgoCD. This simplifies management and saves time.
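To make the data preparation step more concrete, here is a minimal sketch of how such a Spark script could write raw data into a Delta table. The MinIO endpoint, credentials, bucket names, and paths are illustrative assumptions, not the values used by the actual EQMS deployment:

from pyspark.sql import SparkSession

# Enable Delta Lake and point Spark at the S3-compatible MinIO storage.
# All endpoints, credentials, and paths below are placeholders.
spark = (
    SparkSession.builder.appName("eqms-raw-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "change-me")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the raw GeoJSON documents that NiFi dropped into the raw bucket ...
raw = spark.read.json("s3a://raw/usgs/")

# ... and append them to a Delta table, which layers ACID transactions
# and time travel on top of the plain object storage.
raw.write.format("delta").mode("append").save("s3a://lakehouse/earthquakes")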

Hands-On

In this section, the previously introduced open-source EQMS is deployed on a Kubernetes cluster. It can be found in this GitHub Repository.

Requirements

To carry out this process, a Kubernetes cluster with a minimum of 16 gigabytes of RAM is required. Additionally, at least two CPU cores (2000m in Kubernetes resource units) are required.

It is also required that a Kubernetes cluster has already been created and that you have sufficient permissions on it. For this, you can either use a local cluster, such as Minikube, or a cluster from a cloud provider like GCP, AWS, or Azure.

In addition, the command line tool kubectl must be installed on your machine in order to manage the Kubernetes cluster.
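Whether kubectl is installed and able to reach the cluster can be checked with:

kubectl version --client
kubectl cluster-info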

1. Setup ArgoCD on the Cluster

The first step is to deploy ArgoCD. To do this, you can follow the official instructions provided by the maintainers, which can be found on the ArgoCD website.

Create the “argocd” namespace:

kubectl create namespace argocd


Install the ArgoCD Server and CRDs for ArgoCD on the cluster:

kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
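
Before continuing, it is worth waiting until all ArgoCD pods are ready, for example with:

kubectl wait --for=condition=Ready pods --all -n argocd --timeout=300s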

2. Apply the ArgoCD Application

Now the actual EQMS system can be deployed. For this, the manifest of the ArgoCD application is applied to the cluster, which ArgoCD then uses to set up the EQMS. Download the YAML file from this URL and set both storage-class parameters to the name of the storage class you are going to use. The storage classes available on your cluster can be listed with:

kubectl get storageclass

If the storage class is marked as default, the manifest can be used unchanged.
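
For orientation, the storage-class section of such an ArgoCD Application manifest looks roughly like the sketch below; the parameter name global.storageClass is a hypothetical placeholder, and the actual keys in the downloaded file may differ:

apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  source:
    helm:
      parameters:
        # Hypothetical key; the downloaded manifest contains two
        # such storage-class entries with their actual names.
        - name: global.storageClass
          value: standard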

Then apply the configured YAML file to your cluster with the following command:

kubectl apply -n argocd -f <path/my-starter.yaml>

3. Collect the ArgoCD Password

To later log in to the ArgoCD web interface, the password stored in a Kubernetes Secret must be extracted. To do this, you can use one of the following options:

Option 1 (when using Linux)

kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath="{.data.password}" | base64 -d > decoded_password.txt

Option 2 (manually when using another OS)

kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath="{.data.password}"

The resulting string must now be decoded from Base64 to plain text. Several tools can be used for this, and detailed instructions can be found online. If the system is only used for testing purposes, a website like base64decode.org is also an option.
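As a cross-platform alternative, the string can also be decoded with a few lines of Python:

import base64

# Paste the Base64 string printed by kubectl between the quotes.
encoded = "PASTE_SECRET_VALUE_HERE"
print(base64.b64decode(encoded).decode("utf-8"))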

4. Setup the port forwards for ArgoCD and Superset

Now two port-forwards are created to make ArgoCD and Superset accessible via the web interface. Open a new terminal for each of the following commands, as they are blocking:

kubectl port-forward svc/argocd-server -n argocd 8080:443
kubectl port-forward svc/superset-lb -n default 8088:8088

5. Collect the external addresses for MinIO and NiFi

For access to the web UIs of NiFi and MinIO, the external addresses of the LoadBalancer services can be retrieved with the following commands:

kubectl get svc nifi-lb -n eqms-nifi -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
kubectl get svc minio-console-lb -n eqms-minio -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
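
Note that on clusters without a load balancer implementation, the external address may remain pending. On Minikube, for example, running the following command in a separate terminal assigns addresses to LoadBalancer services:

minikube tunnel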

6. Access the Web-UIs

Now the web interfaces of the different applications can be reached by entering the following addresses in your web browser:

| Service  | URL                                         | Default username  | Default password |
|----------|---------------------------------------------|-------------------|------------------|
| ArgoCD   | http://localhost:8080                       | admin             | <from step 3>    |
| Superset | http://localhost:8088                       | admin             | admin            |
| MinIO    | http://<minio-address-from-step-5>          | minio             | change-me        |
| NiFi     | http://<nifi-address-from-step-5>:8081/nifi | no login required | –                |

The system will automatically fetch data from the USGS REST API once an hour.

NOTE: Web interfaces like the one from Superset can only be used once the automatic initialization (such as the ‘superset-dashboard-init’ job) is completed!

Preview of the EQMS Applications

In this section, the interfaces of the applications that will be used later are shown and briefly explained.

We start with the Apache NiFi interface, where processors that manipulate and forward data to other applications can be added or removed via drag-and-drop.

The data flow for fetching the REST API

The image above shows the data flow within the processor group “API_USGS,” which cyclically retrieves data from the USGS API using the “FetchAPI_USGS” processor and sends it to the Kafka broker, where the data is buffered before being saved to the MinIO object storage.
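Conceptually, the processor group performs the equivalent of the following Python sketch; the feed URL is the public USGS GeoJSON summary feed, while the broker address and topic name are placeholders rather than the EQMS defaults:

import requests
from kafka import KafkaProducer

# Public USGS GeoJSON summary feed (all events of the last hour).
FEED = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson"

# Placeholder broker address and topic name.
producer = KafkaProducer(bootstrap_servers="kafka:9092")

# Fetch the latest earthquake events ...
response = requests.get(FEED, timeout=30)
response.raise_for_status()

# ... and publish the raw GeoJSON to Kafka, where it is buffered
# until it is persisted to MinIO.
producer.send("usgs-earthquakes", response.content)
producer.flush()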

The MinIO bucket storing data for further analysis
The dashboard in Superset that contains several diagrams

After the GeoJSON data has been inserted into the Delta table using Spark, it can be visualized by Superset. For this purpose, Superset uses the SQL engine Trino to query the data from the table as efficiently as possible.
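As an illustration, a chart query issued through the Trino Python client could look like the sketch below; host, catalog, schema, and column names are assumptions, not the exact EQMS configuration:

from trino.dbapi import connect

# Connect to Trino; all connection details here are placeholders.
conn = connect(host="trino", port=8080, user="superset", catalog="delta", schema="default")
cur = conn.cursor()

# The kind of aggregation a Superset chart might issue:
# the strongest recorded event per day.
cur.execute("""
    SELECT date_trunc('day', event_time) AS day, max(magnitude) AS max_mag
    FROM earthquakes
    GROUP BY 1
    ORDER BY 1
""")

for day, max_mag in cur.fetchall():
    print(day, max_mag)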

Conclusion

In this blog post, we introduced an on-premises Lakehouse designed to function as an earthquake monitoring system, capable of collecting, storing, and visualizing seismic data. Earthquake analysis remains an enduring and vital field, with modern data processing playing a pivotal role in advancing our understanding. Moreover, the underlying system for this earthquake monitoring is open source and can serve as a solid foundation for projects in various domains. With proven applications seamlessly integrated, it operates reliably and efficiently. Thanks to its modular architecture, it can be easily expanded.

We also delved into the system’s underlying microservices architecture, which, in conjunction with Kubernetes, provides scalability and fault tolerance. In the subsequent hands-on section, we provided a step-by-step guide for setting up and experimenting with the EQMS system on a Kubernetes cluster. The system can be extended and distributed as a foundation for new software.