Data Validation (Databricks)

Configuration reference for the Data Validation block, which runs a Databricks notebook to clean, validate, and transform source data before ingestion.

The Data Validation block is an optional middle step that applies custom data quality rules using a Databricks notebook before data reaches the Transform Data block. It is available in both DIY and standard templates.

Block type: Databricks integration

Runs a Databricks job to clean, validate, and transform source data. Reads the input file into a temporary Databricks table, applies notebook-defined logic, and writes results to an output table consumed by downstream blocks.

Configuration fields

| Field | Description |
| --- | --- |
| Databricks Job ID * | The ID of the Databricks job (notebook) to execute. Contact the Databricks team for this value. |
| Additional Parameters | Optional key-value parameters passed to the Databricks job at runtime. |

* Required field
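To make the two fields concrete, here is a minimal sketch of how a caller might assemble a request body for triggering the job. It assumes the shape of the Databricks Jobs API `run-now` request (`job_id` plus `notebook_params`); the function name and parameter values are illustrative, and the exact contract should be confirmed with the Databricks team.

```python
import json

def build_run_now_payload(job_id, notebook_params=None):
    """Build a run-now request body for a Databricks job.

    Illustrative only: field names follow the Databricks Jobs API
    `run-now` shape (job_id, notebook_params). The parameter values
    below are hypothetical examples, not required keys.
    """
    payload = {"job_id": job_id}
    if notebook_params:
        # The block's "Additional Parameters" are passed to the
        # notebook as string key-value pairs at runtime.
        payload["notebook_params"] = {
            str(k): str(v) for k, v in notebook_params.items()
        }
    return json.dumps(payload)

print(build_run_now_payload(1234, {"env": "prod", "max_rows": 50000}))
```

Note that notebook parameters are serialized as strings, which is how notebook widgets typically receive them.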

Key capabilities

  • Filter records based on conditions (e.g., retain only records where status = 'active')
  • Remove duplicate rows
  • Fill missing values with defaults or randomly generated placeholders
  • Enrich records by joining with reference data from other Databricks tables
  • Rename column headers to match downstream format requirements
  • Apply custom validation rules (format checks, uniqueness enforcement, etc.)
  • Transform and standardize field values
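The capabilities above can be sketched in plain Python. A real validation notebook would operate on Spark DataFrames inside Databricks; this stand-in uses lists of dicts so the logic is easy to follow, and the field names (`id`, `status`, `region`) are hypothetical.

```python
def validate_records(records, default_region="unknown"):
    """Illustrative validation pass: filter, deduplicate, fill
    missing values, and rename a column for downstream use.
    Plain-Python stand-in for notebook logic; field names are
    hypothetical examples.
    """
    # 1. Filter: retain only records where status = 'active'
    active = [r for r in records if r.get("status") == "active"]

    # 2. Deduplicate on 'id', keeping the first occurrence
    seen, deduped = set(), []
    for r in active:
        if r["id"] not in seen:
            seen.add(r["id"])
            deduped.append(r)

    # 3. Fill missing or null values with a default
    for r in deduped:
        if r.get("region") is None:
            r["region"] = default_region

    # 4. Rename a column header to match the downstream format
    return [
        {("customer_id" if k == "id" else k): v for k, v in r.items()}
        for r in deduped
    ]

rows = [
    {"id": 1, "status": "active", "region": None},
    {"id": 1, "status": "active", "region": "emea"},
    {"id": 2, "status": "inactive"},
]
print(validate_records(rows))
# → [{'customer_id': 1, 'status': 'active', 'region': 'unknown'}]
```

In Databricks the same steps would typically be expressed with DataFrame operations (`filter`, `dropDuplicates`, `fillna`, `withColumnRenamed`) reading from the block's temporary input table and writing to its output table.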
📘 Note

Temporary input and output tables created by the Data Validation block are automatically deleted after 10 days. The block is designed for data preparation, not persistent storage.