
Databricks Delta Tables

Delta Tables are the default table format used in Databricks Delta Lake. In straightforward terms, a Delta Table is a collection of Parquet files, together with a transaction log, stored in a storage system such as Azure Blob Storage or AWS S3. Delta Tables can function as Egress Sinks, which means that they can be used to directly store incoming data from devices.

tip

See the Consume Data in Your Systems page to learn more about configuring Egress Sinks and Routes.

Supported egress events:

| Kind | Is supported |
| --- | --- |
| Messages | |
| Batch Completions | |

Available Storage Backends

Delta Tables can be stored in various storage backends. Spotflow Delta Egress currently supports the following ones:

| Backend | Description | Recommendation | Limitations |
| --- | --- | --- | --- |
| Azure Data Lake Storage (ADLS) Gen2 | Object storage which provides fully Hadoop-compatible access, the largest number of features, and the best performance for analytics workloads. Can be created by enabling the Hierarchical Namespace feature on an Azure Storage Account. | Recommended option for new deployments. | None. |
| Azure Blob Storage | Object storage which is a predecessor of ADLS Gen2. It is not optimized for analytics workloads. | Good option for existing deployments. | Soft-delete and Blob Events must be disabled, otherwise most analytics tools are not able to read the stored Delta Tables. |

Configuration

To route Messages, the Azure Blob Storage Container must be configured as an Egress Sink, and one or more Streams must have an Egress Route targeting that Egress Sink.

Each Delta Egress Sink can serve as a location for one or more Delta Tables.

To configure the Delta Egress Sink, the following parameters are required:

  • Connection String

    • Connection string for the Azure Storage Account, containing both the AccountName and AccountKey properties.
    • Example: AccountName=<account-name>;AccountKey=<account-key>;
  • Container Name

    • Name of the Blob Storage Container that should be used to store one or more Delta Tables. The container name must adhere to the Azure naming rules.
    • The container must exist before first use; it is not created automatically (see the sketch after this list).
    • Example: delta-tables
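
Because the container is not created automatically, it can be useful to check that it exists before configuring the sink. Below is a minimal sketch using the azure-storage-blob Python package; the connection string and container name are placeholders and not part of the Spotflow configuration itself.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder values -- substitute the connection string and container name
# that you intend to use for the Delta Egress Sink.
connection_string = "AccountName=<account-name>;AccountKey=<account-key>;"
container_name = "delta-tables"

service = BlobServiceClient.from_connection_string(connection_string)
container = service.get_container_client(container_name)

# The Delta Egress Sink does not create the container, so create it up front.
if not container.exists():
    container.create_container()
```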

The associated Egress Routes must provide the following parameters:

  • Directory Path

    • Specifies the path in the target Blob Storage Container where the Delta Table files will be stored.
    • The route DirectoryPath has the following behavior (see the sketch after this list):
      • If the target path is empty, an empty Delta Table will be created on the path.
      • If the target path is not empty but does not contain a Delta Table (there is no _delta_log/00000000000000000000.json file), a Delta Table will be created on the path. The Delta Table should be able to coexist with other data present on the path, but this is not recommended.
      • If the target path is not empty and contains a Delta Table, Delta Egress will write to the table normally, without any additional checks.
  • Schema

    • Specifies the schema of the target Delta Table. Please see the available schemas below.
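
The three cases above can be reproduced from the client side before creating a route. A small sketch, again using azure-storage-blob with placeholder values, that treats "contains a Delta Table" as the presence of the first transaction log entry:

```python
from azure.storage.blob import BlobServiceClient

# Placeholder values for illustration only.
connection_string = "AccountName=<account-name>;AccountKey=<account-key>;"
container_name = "delta-tables"
directory_path = "device-telemetry"  # hypothetical Egress Route DirectoryPath

service = BlobServiceClient.from_connection_string(connection_string)
container = service.get_container_client(container_name)

# A Delta Table is considered present when the first transaction log entry exists.
first_log_entry = f"{directory_path}/_delta_log/00000000000000000000.json"
has_delta_table = container.get_blob_client(first_log_entry).exists()

# The path is "empty" when no blob lives under the directory prefix.
has_any_blobs = any(container.list_blobs(name_starts_with=f"{directory_path}/"))

if has_delta_table:
    print("Existing Delta Table: Delta Egress will append to it.")
elif has_any_blobs:
    print("Non-empty path without a Delta Table: a new table will be created next to the existing data.")
else:
    print("Empty path: an empty Delta Table will be created.")
```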

A pre-existing table may use unexpected parameters or require support for newer versions of the Delta specification; Delta Egress does not check for or respect either. If Delta Egress writes to a table that was not created with the expected parameters, it will likely corrupt the table.

At the same time, targeting multiple Egress Routes at the same Delta Table is supported, and any writes that do not change the table settings in an incompatible way are tolerated.

Output Delta Tables

Each Egress Route can produce one or more Delta Tables, depending on the chosen schema. New data typically shows up with a latency of up to ~1 minute.

Schemas

The following schemas are available:

  • Text Lines: Simple schema for any textual row-based data such as CSV or newline-delimited JSON.
  • Open Telemetry: Schema for data in the OTLP format (logs, metrics, traces).

Table Settings

The generated Delta Tables have the following settings:

| Property | Value |
| --- | --- |
| format | parquet with Snappy compression |
| min_reader_version | 1 |
| min_writer_version | 7 |
| writer_features | changeDataFeed, appendOnly |
| config | delta.appendOnly = true, delta.enableChangeDataFeed = true |
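
These settings can be verified from a Databricks notebook (where spark is predefined) once the table has been written. A minimal sketch, assuming a hypothetical abfss:// path to one of the output tables:

```python
# Hypothetical path to an output Delta Table; substitute your own location.
table_path = "abfss://delta-tables@<account-name>.dfs.core.windows.net/device-telemetry"

# DESCRIBE DETAIL returns the format, location, table properties and the
# minimum reader/writer protocol versions in a single row.
spark.sql(f"DESCRIBE DETAIL delta.`{table_path}`").show(truncate=False)
```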

Change Feed

As mentioned above, the Delta Tables have the delta.enableChangeDataFeed = true property set. Given that the tables are also append-only, no extra CDC files are created.
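
For consumers, this means the change data feed can be read directly from the table. A minimal CDF read sketch in PySpark using a hypothetical table path; since the table is append-only, every change row is an insert:

```python
# Hypothetical path to an output Delta Table.
table_path = "abfss://delta-tables@<account-name>.dfs.core.windows.net/device-telemetry"

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)   # or startingTimestamp
    .load(table_path)
)

# The change feed adds _change_type, _commit_version and _commit_timestamp columns.
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()
```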

Optimization

All generated Delta Tables are optimized approximately every 24 hours to improve read performance. The optimization includes:

  • Compacting data files.
  • Creating checkpoints.
  • Vacuuming unnecessary files older than 30 days.

Read performance for data written after the last table optimization (i.e., within the last ~24 hours) might be somewhat reduced.

Usage from Spark in Databricks

note

For an end-to-end example of using Delta Egress with Databricks, check out the tutorial.

✔️ Batch

Output Delta Tables can be read normally using spark.read.format("delta").load("<todo>") (or SELECT * FROM delta.`<todo>` in SQL). To connect to Azure Blob Storage from Databricks, refer to the Databricks documentation.
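
A minimal batch-read sketch, assuming the ADLS Gen2 backend and authentication with the storage account key (other authentication options are covered in the Databricks documentation); the account name, key, and path below are placeholders:

```python
# Placeholder values; substitute your own storage account and table path.
account_name = "<account-name>"
table_path = f"abfss://delta-tables@{account_name}.dfs.core.windows.net/device-telemetry"

# Authenticate with the storage account key for this session.
spark.conf.set(
    f"fs.azure.account.key.{account_name}.dfs.core.windows.net",
    "<account-key>",
)

df = spark.read.format("delta").load(table_path)
df.show()
```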

Since the data for all schemas are partitioned, it is recommended to utilize the partitioning columns for efficient querying when possible. See the descriptions of the individual schemas for details about partitioning.

Since the retention policy for temporary files is set to 30 days, time travel queries will only work for versions created within this time window.
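
Within that window, time travel works as usual. A small sketch with a hypothetical path, version number, and timestamp:

```python
table_path = "abfss://delta-tables@<account-name>.dfs.core.windows.net/device-telemetry"

# Read the table as of a specific version...
df_v1 = spark.read.format("delta").option("versionAsOf", 1).load(table_path)

# ...or as of a point in time within the 30-day retention window.
df_ts = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01 00:00:00")
    .load(table_path)
)
```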

✔️ Streaming

Output tables are ready to be consumed in a streaming manner using Structured Streaming. Additional information can be found in the Databricks documentation.
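
A minimal Structured Streaming sketch that copies new rows from an output Delta Table into another Delta table; the paths and target table name are hypothetical:

```python
# Placeholder locations; substitute your own paths and target table.
table_path = "abfss://delta-tables@<account-name>.dfs.core.windows.net/device-telemetry"
checkpoint_path = "abfss://checkpoints@<account-name>.dfs.core.windows.net/device-telemetry"

query = (
    spark.readStream.format("delta")
    .load(table_path)
    .writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)   # process everything available, then stop
    .toTable("main.default.device_telemetry_copy")
)
query.awaitTermination()
```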

Since the retention policy for temporary files is set to 30 days, streaming queries that at any point fall more than 30 days behind will no longer be able to continue and will require a checkpoint reset.

Databricks Unity Catalog support

Delta Egress does not provide any built-in support for Unity Catalog. However, nothing prevents consumers from registering the output table in a Unity Catalog as an external table.
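
A minimal sketch of such a registration from a Databricks notebook; the catalog, schema, and table names are hypothetical, and an external location or storage credential covering the path must already be set up in Unity Catalog:

```python
# Hypothetical path to an output Delta Table and a hypothetical UC table name.
table_path = "abfss://delta-tables@<account-name>.dfs.core.windows.net/device-telemetry"

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS main.iot.device_telemetry
    USING DELTA
    LOCATION '{table_path}'
""")
```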