Bring your data import processes into the cloud

Andreas Rütten

6. July 2023

Reading time: 3 min

For a company that hosts a popular job and real estate website, a healthy and resilient data import pipeline is essential.

Before migrating to AWS, the company hosted the software application and infrastructure responsible for the data imports on-premises. Import data arrived via FTP uploads at the importer service, which loaded the assets, such as images, into an on-premises data store and wrote the document data of the individual advertisements to a SQL database.

This self-hosted solution came with several challenges:

One major issue was the lack of scalability. At unpredictable times, customers uploaded large amounts of data via FTP, all of which had to be processed by the import pipeline. This sometimes caused incidents and outages.
Another issue was that the data store did not have enough space for the large number of assets.

Cloud architecture and data flow

These requirements led us to create the following infrastructure:

As the interface for uploading import data, we created a REST API with Amazon API Gateway, which provides a public URL that requires authentication.

The import pipeline behind the API Gateway consists of a chain of AWS Lambda functions and Amazon SQS queues, so that each Lambda function handles a single concern, as described below. Lambda functions are serverless, event-driven functions that execute custom code and can be used to process data in real time. SQS is a fully managed message queue service that lets you decouple and scale microservices, distributed systems, and serverless applications. With SQS, you can manage the flow of data between your Lambda functions, which helps keep your data import pipeline reliable and scalable:

The first Lambda function examines the incoming data, ensuring that it meets the formatting requirements and that the required information is present. It then fetches all linked assets, such as images, stores them in S3 and puts the data into SQS as individual objects for further processing.
The second Lambda function enriches the received objects with additional data, such as geographical information based on the geo-data contained in the objects, and puts the enriched objects into another queue.
The third Lambda function generates the asset URLs, creates a document from the object data and writes the result to Elasticsearch.
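The first step of this chain can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that runs in production: the required field names (`id`, `title`, `image_urls`), the JSON list payload and the helper names are hypothetical, and fetching the assets into S3 is left out. The queue is passed in as a plain `send` callable so that the logic can be run without AWS access.

```python
import json
from typing import Callable

# Hypothetical required fields; the real import schema is not described here.
REQUIRED_FIELDS = ("id", "title", "image_urls")


def validate_ad(ad: dict) -> list[str]:
    """Return a list of problems; an empty list means the ad is valid."""
    return [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in ad]


def split_into_messages(payload: str) -> tuple[list[str], list[str]]:
    """Parse an upload (assumed to be a JSON list of ads).

    Returns the message bodies for the valid ads and the validation errors
    for the invalid ones.
    """
    ads = json.loads(payload)
    messages, errors = [], []
    for index, ad in enumerate(ads):
        problems = validate_ad(ad)
        if problems:
            errors.extend(f"ad {index}: {problem}" for problem in problems)
        else:
            # One message per advertisement, so later steps work on single objects.
            messages.append(json.dumps(ad))
    return messages, errors


def handle_upload(payload: str, send: Callable[[str], None]) -> dict:
    """Validate an upload and hand every valid ad to `send` as one message."""
    messages, errors = split_into_messages(payload)
    for body in messages:
        send(body)
    return {"queued": len(messages), "errors": errors}
```

In a deployed function, `send` would typically wrap a call to the SQS client, while a local test can simply pass `some_list.append` to collect the messages.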

The Elasticsearch cluster, a highly scalable, distributed search engine running in AWS, serves as the searchable document store for the consuming job search and real estate search applications.
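How the last step turns a queued object into a searchable document might look roughly like the sketch below. The base URL, the field names (`image_files`, `location`) and the function names are assumptions made for illustration only. The one detail taken from AWS itself is the event shape: a Lambda function triggered by SQS receives its messages under `Records`, each with a string `body`. Writing the resulting documents to Elasticsearch is deliberately left out.

```python
import json

# Hypothetical base URL under which the stored assets are served.
ASSET_BASE_URL = "https://assets.example.com"


def asset_url(ad_id: str, filename: str) -> str:
    """Build the public URL of one stored asset (assumed URL layout)."""
    return f"{ASSET_BASE_URL}/{ad_id}/{filename}"


def build_document(ad: dict) -> dict:
    """Create the search document for one enriched advertisement."""
    ad_id = str(ad["id"])
    return {
        "id": ad_id,
        "title": ad["title"],
        # Geographical information added by the enrichment step, if present.
        "location": ad.get("location"),
        "image_urls": [asset_url(ad_id, name) for name in ad.get("image_files", [])],
    }


def documents_from_sqs_event(event: dict) -> list[dict]:
    """Turn every SQS record of a Lambda event into a search document."""
    return [build_document(json.loads(record["body"])) for record in event["Records"]]
```

Keeping the document construction free of any client calls makes it easy to check the mapping in isolation before indexing anything.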

Conclusion and benefits

Self-hosting crucial services on-premises, like the data import pipeline in this case, can present several challenges, such as a lack of scalability and availability. By moving the workload to the AWS cloud, the company took advantage of the scalability, resiliency and reliability of cloud-based solutions while gaining the performance of real-time data processing and the safety of a durable object store. In addition, backed by Elasticsearch, the company could provide a fast and flexible object search application to their customers.