Achieve better performance with an efficient lookup input option in Talend Spark Streaming

Description

Talend provides two options to deal with lookup in Spark streaming Jobs: a simple input component (for example: tMongoDBInput) or a lookup input component (tMongoDBLookupInput). Using a lookup input component will provide heavy uplifting in performance and code optimization for any Spark streaming Job. 

Instead of looking up the entire data from the lookup component, Talend provides a unique option for streaming Jobs: to query a smaller chunk of input data for lookup, thereby saving an enormous amount of time and building highly performant Jobs.

By Definition

Lookup components like tMongoDBLookupInputtJDBCLookupInput, and others provided by Talend execute a database query with a strictly defined order that must correspond to the schema definition.

It passes on the extracted data to tMap in order to provide the lookup data to the main flow. It must be directly connected to a tMap component, and requires this tMap to use Reload at each row or Reload at each row (cache) for the lookup flow.

The tricky part here is to understand the usage of the Reload at each row functionality of the Talend tMap component, and how it can be integrated with the lookup component.

Example

Below is an example of how we have used a tJDBCLookupInput component with tMap in a Talend Spark Streaming Job.

  1. At the tMap level, make sure the tMap for the lookup is set up with Reload at each row, and an expression for globalMap Key is defined as well.

2. At the lookup input component level, make sure our Query option is set up to query the globalMap Key (where condition extract.consumer_id) is defined in tMap as shown below. This is key for making sure the lookup component only fetches the data needed for processing at that point in time.

Summary

As we have seen, these minute changes in our Streaming Jobs can make our ETL Jobs more effective and performant. As there will always be multiple implementations of a Talend ETL Job, the ability to understand the nuances in making them more efficient is an integral part of being a data engineer.

For more information, reach out to us at: solutions@thinkartha.com

Author: Siddartha Rao Chennur

This article also published on Talend Community:
Source: https://community.talend.com/s/article/Achieve-better-performance-with-an-efficient-lookup-input-option-in-Talend-Spark-Streaming