Data Validation (Databricks)

Configuration reference for the Data Validation block, which runs a Databricks notebook to clean, validate, and transform source data before ingestion.

The Data Validation block is an optional middle step that applies custom data quality rules using a Databricks notebook before data reaches the Transform Data block. It is available in both DIY and standard templates.

Block type: Databricks integration

Runs a Databricks job to clean, validate, and transform source data. Reads the input file into a temporary Databricks table, applies notebook-defined logic, and writes results to an output table consumed by downstream blocks.

Configuration fields

| Field | Description |
| --- | --- |
| Databricks Job ID * | The ID of the Databricks job (notebook) to execute. Contact the Databricks team for this value. |
| Additional Parameters | Optional key-value parameters passed to the Databricks job at runtime. |

* Required field
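To make the two fields concrete, here is a minimal sketch of how a caller might assemble a request body for triggering the job. It assumes the shape of the Databricks Jobs API `run-now` request (`job_id` plus `notebook_params`); the function name and parameter values are illustrative, and the exact contract should be confirmed with the Databricks team.

```python
import json

def build_run_now_payload(job_id, notebook_params=None):
    """Build a run-now request body for a Databricks job.

    Illustrative only: field names follow the Databricks Jobs API
    `run-now` shape (job_id, notebook_params). The parameter values
    below are hypothetical examples, not required keys.
    """
    payload = {"job_id": job_id}
    if notebook_params:
        # The block's "Additional Parameters" are passed to the
        # notebook as string key-value pairs at runtime.
        payload["notebook_params"] = {
            str(k): str(v) for k, v in notebook_params.items()
        }
    return json.dumps(payload)

print(build_run_now_payload(1234, {"env": "prod", "max_rows": 50000}))
```

Note that notebook parameters are serialized as strings, which is how notebook widgets typically receive them.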

Key capabilities

  • Filter records based on conditions (e.g., retain only records where status = 'active')
  • Remove duplicate rows
  • Fill missing values with defaults or randomly generated placeholders
  • Enrich records by joining with reference data from other Databricks tables
  • Rename column headers to match downstream format requirements
  • Apply custom validation rules (format checks, uniqueness enforcement, etc.)
  • Transform and standardize field values
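The capabilities above can be sketched in plain Python. A real validation notebook would operate on Spark DataFrames inside Databricks; this stand-in uses lists of dicts so the logic is easy to follow, and the field names (`id`, `status`, `region`) are hypothetical.

```python
def validate_records(records, default_region="unknown"):
    """Illustrative validation pass: filter, deduplicate, fill
    missing values, and rename a column for downstream use.
    Plain-Python stand-in for notebook logic; field names are
    hypothetical examples.
    """
    # 1. Filter: retain only records where status = 'active'
    active = [r for r in records if r.get("status") == "active"]

    # 2. Deduplicate on 'id', keeping the first occurrence
    seen, deduped = set(), []
    for r in active:
        if r["id"] not in seen:
            seen.add(r["id"])
            deduped.append(r)

    # 3. Fill missing or null values with a default
    for r in deduped:
        if r.get("region") is None:
            r["region"] = default_region

    # 4. Rename a column header to match the downstream format
    return [
        {("customer_id" if k == "id" else k): v for k, v in r.items()}
        for r in deduped
    ]

rows = [
    {"id": 1, "status": "active", "region": None},
    {"id": 1, "status": "active", "region": "emea"},
    {"id": 2, "status": "inactive"},
]
print(validate_records(rows))
# → [{'customer_id': 1, 'status': 'active', 'region': 'unknown'}]
```

In Databricks the same steps would typically be expressed with DataFrame operations (`filter`, `dropDuplicates`, `fillna`, `withColumnRenamed`) reading from the block's temporary input table and writing to its output table.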
📘 Note

Temporary input and output tables created by the Data Validation block are automatically deleted after 10 days. The block is designed for data preparation, not persistent storage.