Data Management
A tutorial on how to use VDK to perform batch data processing
Versatile Data Kit (VDK) is an open-source data ingestion and processing framework designed to simplify data management complexities. While VDK can handle various data integration tasks, including real-time streaming, this article will focus on how to use it in batch data processing.
This article covers:
- Introducing Batch Data Processing
- Creating and Managing Batch Processing Pipelines in VDK
- Monitoring Batch Data Processing in VDK
1 Introducing Batch Data Processing
Batch data processing is a method for processing large volumes of data at specified intervals. Batch data must be:
- Time-independent: data doesn’t require immediate processing and is typically not sensitive to real-time requirements. Unlike streaming data, which needs instant processing, batch data can be processed at scheduled intervals or when resources become available.
- Splittable in chunks: instead of processing an entire dataset in a single, resource-intensive operation, batch data can be divided into smaller, more manageable segments. These segments can then be processed sequentially or in parallel, depending on the capabilities of the data processing system (see the sketch after this list).
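As a quick illustration of chunked processing, here is a minimal sketch that reads a large CSV file in fixed-size batches with pandas; the file name and chunk size are illustrative assumptions, not part of VDK.

import pandas as pd

# Process a large CSV in fixed-size chunks instead of loading it all at once.
# The file name and chunk size are placeholders for this sketch.
total_rows = 0
for chunk in pd.read_csv("large_dataset.csv", chunksize=10_000):
    # Each chunk is an ordinary DataFrame small enough to fit in memory.
    total_rows += len(chunk)

print(f"Processed {total_rows} rows in chunks of 10,000")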
In addition, batch data can be processed offline, meaning it doesn’t require a constant connection to data sources or external services. This characteristic is particularly valuable when data sources are intermittent or temporarily unavailable.
ELT (Extract, Load, Transform) is a typical use case for batch data processing. ELT comprises three main phases:
- Extract (E): data is extracted from multiple sources in different formats, both structured and unstructured.
- Load (L): data is loaded into a target destination, such as a data warehouse.
- Transform (T): once loaded, the data typically requires preliminary processing, such as cleaning, harmonization, and transformation into a common format (a minimal sketch of all three phases follows).
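To make the three phases concrete, here is a minimal, framework-agnostic sketch in plain Python. The source URL, the SQLite target, and the cleaning query are all illustrative assumptions, not part of VDK.

import sqlite3

import pandas as pd
import requests


def extract() -> pd.DataFrame:
    # E: pull raw records from a source system (a hypothetical JSON API).
    response = requests.get("https://example.com/api/records")
    response.raise_for_status()
    return pd.DataFrame(response.json()["items"])


def load(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    # L: land the raw data as-is in the target store.
    df.to_sql("raw_records", conn, if_exists="replace", index=False)


def transform(conn: sqlite3.Connection) -> None:
    # T: clean and reshape inside the target, typically with SQL.
    conn.execute("DROP TABLE IF EXISTS cleaned_records")
    conn.execute(
        "CREATE TABLE cleaned_records AS "
        "SELECT DISTINCT * FROM raw_records WHERE title IS NOT NULL"
    )


conn = sqlite3.connect("warehouse.db")
load(extract(), conn)
transform(conn)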
Now that you have learned what batch data processing is, let’s move on to the next step: creating and managing batch processing pipelines in VDK.
2 Creating and Managing Batch Processing Pipelines in VDK
VDK adopts a component-based approach, enabling you to build data processing pipelines quickly. For an introduction to VDK, refer to my previous article, An Overview of Versatile Data Kit. This article assumes that you have already installed VDK on your computer.
To explain how the batch processing pipeline works in VDK, we consider a scenario where you must perform an ELT task.
Imagine you want to use VDK to ingest and process Vincent Van Gogh’s paintings available in Europeana, a well-known European aggregator for cultural heritage. Europeana exposes all its cultural heritage objects through a public REST API; for Vincent Van Gogh, it provides more than 700 works.
The following figure shows the steps for batch data processing in this scenario.
Let’s investigate each point separately. You can find the complete code to implement this scenario in the VDK GitHub repository.
2.1 Extract and Load
This phase includes VDK jobs calling the Europeana REST API to extract raw data. Specifically, it defines three jobs:
- job1 — delete the existing table (if any)
- job2 — create a new table (job1 and job2 are sketched right after this list)
- job3 — ingest table values directly from the REST API
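For reference, here is a plausible sketch of job1 and job2, combined into a single Python step for brevity (in VDK each would normally live in its own numbered step file). The column list mirrors the fields used later in the Transform phase and is an assumption; the schema in the GitHub example may differ.

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # job1: drop any table left over from a previous run, so the job is re-runnable.
    job_input.execute_query("DROP TABLE IF EXISTS assets")

    # job2: recreate the table that job3 will fill from the REST API.
    job_input.execute_query(
        """
        CREATE TABLE assets (
            country    TEXT,
            edmPreview TEXT,
            provider   TEXT,
            title      TEXT,
            rights     TEXT
        )
        """
    )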
This example requires an active Internet connection to access the Europeana REST API. The operation is a batch process because it downloads the data only once and does not require streaming.
We’ll store the extracted data in a table. The difficulty of this task lies in building a mapping between the REST API response and the table schema, which is done in job3.
Writing job3 simply involves writing the Python code that performs this mapping; but instead of saving the extracted data to a local file, we call a VDK function (job_input.send_tabular_data_for_ingestion) to send it to VDK for ingestion, as shown in the following snippet of code:
import logging

import pandas as pd
import requests
from vdk.api.job_input import IJobInput

log = logging.getLogger(__name__)


def run(job_input: IJobInput):
    """
    Download datasets required by the scenario and put them in the data lake.
    """
    log.info(f"Starting job step {__name__}")

    api_key = job_input.get_property("api_key")

    start = 1
    rows = 100

    basic_url = (
        "https://api.europeana.eu/record/v2/search.json"
        f"?wskey={api_key}&query=who:%22Vincent%20Van%20Gogh%22"
    )

    # First request: only read the total number of results, to drive the loop.
    url = f"{basic_url}&rows={rows}&start={start}"
    response = requests.get(url)
    response.raise_for_status()
    payload = response.json()
    n_items = int(payload["totalResults"])

    # Page through the results, `rows` items at a time.
    while start < n_items:
        if start > n_items - rows:
            # Shrink the last page to the number of items actually left.
            rows = n_items - start + 1
        url = f"{basic_url}&rows={rows}&start={start}"
        response = requests.get(url)
        response.raise_for_status()
        payload = response.json()["items"]

        # Send each page to VDK for ingestion into the `assets` table.
        df = pd.DataFrame(payload)
        job_input.send_tabular_data_for_ingestion(
            df.itertuples(index=False),
            destination_table="assets",
            column_names=df.columns.tolist(),
        )
        start = start + rows
For the complete code, refer to the example on GitHub. Please note that you need a free API key to download data from Europeana.
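Since the snippet reads the key with job_input.get_property("api_key"), the key must be stored in the job’s properties before the step runs. One way to do that, sketched below with a placeholder value, is a small setup step that writes the property:

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Store the Europeana API key in the job's properties so that later steps
    # can read it with job_input.get_property("api_key"). The value here is a
    # placeholder: substitute your own free Europeana key.
    properties = job_input.get_all_properties()
    properties["api_key"] = "YOUR_EUROPEANA_API_KEY"
    job_input.set_all_properties(properties)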
The output produced during the extraction phase is a table containing the raw values.
2.2 Transform
This phase involves cleaning the data and extracting only the relevant information. We can implement it in VDK through two jobs:
- job4 — delete the existing table (if any)
- job5 — create the cleaned table
Job5 simply involves writing an SQL query, as shown in the following snippet of code:
CREATE TABLE cleaned_assets AS (
    SELECT
        SUBSTRING(country, 3, LENGTH(country)-4) AS country,
        SUBSTRING(edmPreview, 3, LENGTH(edmPreview)-4) AS edmPreview,
        SUBSTRING(provider, 3, LENGTH(provider)-4) AS provider,
        SUBSTRING(title, 3, LENGTH(title)-4) AS title,
        SUBSTRING(rights, 3, LENGTH(rights)-4) AS rights
    FROM assets
)
The SUBSTRING calls strip the leading and trailing bracket and quote characters that remain when the API’s list-valued fields are flattened to plain text. Running this job in VDK produces another table, named cleaned_assets, containing the processed values. Finally, we are ready to use the cleaned data somewhere. In our case, we can build a Web app that shows the extracted paintings. You can find the complete code to perform this task in the VDK GitHub repository.
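As a taste of that final step, the following minimal sketch reads the cleaned table and prints a few paintings. It assumes the data was ingested into a local SQLite database (for instance via VDK’s SQLite plugin); the database file name is a placeholder for your own target.

import sqlite3

# Read the table produced by the Transform phase. The database file name is a
# placeholder; point it at wherever your VDK setup ingested the data.
conn = sqlite3.connect("vdk-sqlite.db")
rows = conn.execute(
    "SELECT title, provider, edmPreview FROM cleaned_assets LIMIT 5"
).fetchall()

for title, provider, preview in rows:
    print(f"{title} ({provider}): {preview}")

conn.close()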
3 Monitoring Batch Data Processing in VDK
VDK provides VDK UI, a graphical user interface for monitoring data jobs. To install VDK UI, follow the instructions in the official VDK video. The following figure shows a snapshot of VDK UI.
There are two main pages:
- Explore: this page enables you to explore data jobs, showing, for example, the job execution success rate, jobs with failed executions in the last 24 hours, and the most failed executions in the last 24 hours.
- Manage: this page gives more job details. You can order jobs by column, search across multiple parameters, filter by some of the columns, view the source of a specific job, add other columns, and so on.
To learn how to use VDK UI, watch the official VDK video.
Summary
Congratulations! You have just learned how to implement batch data processing in VDK! It only requires ingesting raw data, manipulating it, and, finally, using it for your purposes! You can find many other examples in the VDK GitHub repository.
Stay up-to-date with the latest data processing developments and best practices in VDK. Keep exploring and refining your expertise!
Other articles you may be interested in…
- An Overview of Versatile Data Kit
- Handling Missing Values in Versatile Data Kit
- From Raw Data to a Cleaned Database: A Deep Dive into Versatile Data Kit