Delta Tables
Delta Tables are the default table format used in Databricks Delta Lake. Put simply, a Delta Table is a collection of Parquet files stored in a storage system such as Azure Blob Storage or AWS S3. Delta Tables can function as Egress Sinks, which means that they can be used to directly store incoming data from devices.
See the Consume Data in Your Systems page to learn more about configuring Egress Sinks and Routes.
Supported egress events:
Kind | Is supported |
---|---|
Messages | ✅ |
Batch Completions | ❌ |
Available Storage Backends
Delta Tables can be stored in various storage backends. Spotflow Delta Egress currently supports the following ones:
Backend | Description | Recommendation | Limitations |
---|---|---|---|
Azure Data Lake Storage (ADLS) Gen2 | Object storage which provides fully Hadoop-compatible access, largest number of features and best performance for analytics workloads. Can be created by enabling Hierarchical Namespace feature on Azure Storage Account. | Recommended option for new deployments. | None. |
Azure Blob Storage | Object storage which is a predecessor of ADLS Gen2. It is not optimized for analytics workloads. | Good option for existing deployments. | Soft-delete and Blob Events must be disabled, otherwise most analytics tools are not able to read stored Delta Tables. |
Configuration
To route Messages, the Azure Blob Storage Container must be configured as an Egress Sink, and one or more Streams must have an Egress Route targeting that Egress Sink.
Each Delta Egress Sink can serve as a location for one or more Delta Tables.
To configure the Delta Egress Sink, the following parameters are required:
- Connection String
  - Connection string for the Azure Storage Account containing both the `AccountName` and `AccountKey` properties.
  - Example: `AccountName=<account-name>;AccountKey=<account-key>;`
- Container Name
  - Name of the Blob Storage Container that should be used to store one or more Delta Tables. The container name must adhere to the Azure naming rules.
  - The container must exist before first use. It is not created automatically.
  - Example: `delta-tables`
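
The two sink parameters can be sanity-checked before they are submitted. The following is a minimal sketch, not part of Delta Egress itself: the function names `parse_connection_string` and `validate_sink_parameters` are hypothetical, and the container check encodes the Azure container naming rules (3 to 63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit; no consecutive hyphens).

```python
import re

# Azure container naming rules: 3-63 characters, lowercase letters, digits and
# hyphens only, starting and ending with a letter or digit, no "--" anywhere.
_CONTAINER_NAME = re.compile(r"(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def parse_connection_string(connection_string):
    """Split 'Key=Value;Key=Value;' into a dict, ignoring empty segments."""
    properties = {}
    for segment in connection_string.split(";"):
        if not segment.strip():
            continue
        # Split on the first '=' only: account keys are base64 and may end with '='.
        key, _, value = segment.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def validate_sink_parameters(connection_string, container_name):
    """Return a list of problems; an empty list means the parameters look valid."""
    problems = []
    properties = parse_connection_string(connection_string)
    for required in ("AccountName", "AccountKey"):
        if not properties.get(required):
            problems.append(f"connection string is missing {required}")
    if not _CONTAINER_NAME.fullmatch(container_name):
        problems.append(f"container name {container_name!r} violates Azure naming rules")
    return problems
```

The check is purely syntactic. It cannot tell whether the account key is valid or whether the container already exists, so the container still has to be created beforehand.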
The associated Egress Routes must provide the following parameters:
- Directory Path
  - Specifies the path in the target Blob Storage Container where the Delta Table files will be stored.
  - The route `DirectoryPath` has the following behavior:
    - If the target path is empty, an empty Delta Table will be created on the path.
    - If the target path is not empty but does not contain a Delta Table (there is no `_delta_log/00000000000000000000.json` file), a Delta Table will be created on the path. The Delta Table should be able to coexist with other data present on the path, but this is not recommended.
    - If the target path is not empty and contains a Delta Table, Delta Egress will try to write normally to the table without any additional checks.
- Schema
  - Specifies the schema of the target Delta Table. Please see available schemas below.
  - This option is currently in preview and is available only via API (not Portal or CLI). If not specified, this option defaults to `Text Lines`.
A pre-existing table may have unexpected parameters or require support for newer versions of the specification. Delta Egress neither checks nor respects this, so writing to a table that was not created with the expected parameters will likely corrupt the table.
At the same time, targeting multiple Egress Routes to the same Delta Table is supported, and any writes that would not change table settings in an incompatible way are tolerated.
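
To preview which of the three Directory Path cases applies to a given location, the rules described above can be mirrored in a few lines. This is an illustrative sketch only: `describe_route_behavior` is a hypothetical helper, it expects a listing of file paths relative to the Directory Path (how that listing is obtained is left open), and it is not how Delta Egress is actually implemented.

```python
# Presence of the first commit file is what marks a path as an existing Delta Table.
FIRST_COMMIT = "_delta_log/00000000000000000000.json"


def describe_route_behavior(relative_paths):
    """Classify a target directory from the file paths found under it.

    `relative_paths` holds paths relative to the route's Directory Path.
    """
    paths = set(relative_paths)
    if not paths:
        return "create empty Delta Table"
    if FIRST_COMMIT not in paths:
        return "create Delta Table next to existing data (not recommended)"
    return "append to existing Delta Table without additional checks"
```

Running such a check before creating a route makes the third case visible early, which is the one where an incompatible pre-existing table could be corrupted.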
Output Delta Tables
Each Egress Route can produce one or more Delta Tables, depending on the chosen schema. New data typically appears with a latency of up to ~1 minute.
Schemas
The following schemas are available:
- Text Lines: Simple schema for any textual row-based data such as CSV or new-line delimited JSON.
- Open Telemetry: Schema for data in OTLP format (logs, metrics, traces).
Table Settings
The generated Delta Tables have the following settings:
Property | Value |
---|---|
format | parquet with Snappy compression |
min_reader_version | 1 |
min_writer_version | 7 |
writer_features | changeDataFeed , appendOnly |
config | delta.appendOnly = true , delta.enableChangeDataFeed = true |
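
Before pointing a route at a pre-existing table, it can help to compare that table's settings with the values listed above. The sketch below assumes the text of a commit file from `_delta_log` (newline-delimited JSON actions) has already been downloaded, and that the settings appear under the field names used by the Delta transaction log protocol (`minReaderVersion`, `minWriterVersion`, `writerFeatures`, and `metaData.configuration`). The mapping from the property names in the table above to these fields is an assumption, and `find_setting_mismatches` is a hypothetical helper.

```python
import json

EXPECTED_PROTOCOL = {"minReaderVersion": 1, "minWriterVersion": 7}
EXPECTED_WRITER_FEATURES = {"changeDataFeed", "appendOnly"}
# Table configuration values are stored as strings in the transaction log.
EXPECTED_CONFIG = {"delta.appendOnly": "true", "delta.enableChangeDataFeed": "true"}


def find_setting_mismatches(commit_text):
    """Compare the protocol and metaData actions of a commit with the expected settings."""
    protocol, metadata = {}, {}
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)  # one JSON action per line
        protocol = action.get("protocol", protocol)
        metadata = action.get("metaData", metadata)

    mismatches = []
    for key, expected in EXPECTED_PROTOCOL.items():
        if protocol.get(key) != expected:
            mismatches.append(f"{key}: expected {expected}, found {protocol.get(key)}")
    features = set(protocol.get("writerFeatures") or [])
    if features != EXPECTED_WRITER_FEATURES:
        mismatches.append(f"writerFeatures: found {sorted(features)}")
    config = metadata.get("configuration") or {}
    for key, expected in EXPECTED_CONFIG.items():
        if config.get(key) != expected:
            mismatches.append(f"{key}: expected {expected}, found {config.get(key)}")
    return mismatches
```

An empty result only means that this one commit matches. Later commits can change the protocol or configuration, so a thorough check would replay the whole log.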
Change Feed
As mentioned above, the Delta Tables have the `delta.enableChangeDataFeed = true` property set. Given that the tables are also append-only, no extra CDC files are created.
Optimization
All generated Delta Tables are optimized approximately every 24 hours to improve read performance. The optimization includes:
- Compacting data files.
- Creating checkpoints.
- Vacuuming unnecessary files older than 30 days.
Read performance for data written since the last table optimization (up to 24 hours) might be somewhat reduced.
Usage from Spark in Databricks
For an end-to-end example of using Delta Egress with Databricks, check out the tutorial.
✔️ Batch
Output Delta Tables can be read as usual using `spark.read.format("delta").load("<todo>")` (or ``SELECT * FROM `delta`.<todo>`` in SQL). To connect to Azure Blob Storage from Databricks, refer to the Databricks documentation.
Since the data for all schemas is partitioned, it is recommended to use the partitioning columns for efficient querying when possible. See the details about partitioning in the description of the individual schemas.
Since the retention policy for temporary files is set to 30 days, time travel queries will only work for versions created inside this time window.
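
A batch read with optional time travel can be wrapped in a small helper. This is a sketch under assumptions: `read_delta_table` is a hypothetical name, `spark` is expected to be an existing `SparkSession` (in a Databricks notebook it is predefined), and `versionAsOf` is the standard Delta reader option for selecting an older table version.

```python
def read_delta_table(spark, table_path, version=None):
    """Load a Delta Table as a DataFrame, optionally pinned to an older version."""
    reader = spark.read.format("delta")
    if version is not None:
        # Time travel: only versions inside the 30-day retention window are readable.
        reader = reader.option("versionAsOf", version)
    return reader.load(table_path)
```

Calling `read_delta_table(spark, "<todo>")` returns the latest snapshot, while passing `version=...` requests a specific version and fails if the files for that version have already been vacuumed.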
✔️ Streaming
Output tables are ready to be consumed in a streaming manner using Structured Streaming. Additional information can be found in Databricks documentation.
Since the retention policy for temporary files is set to 30 days, streaming queries that are lagging outside of this time window at any point will not be able to continue working and will require a checkpoint reset.
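
One way to notice a lagging streaming query before it breaks is to compare the commit time of the last processed data with the 30-day window. The helper below is a hedged sketch: `is_within_retention` is a hypothetical name, and how the timestamp of the last processed commit is obtained (for example, from the query's progress metrics) is left as an assumption.

```python
from datetime import datetime, timedelta, timezone

# Retention window for temporary files, as stated above.
RETENTION = timedelta(days=30)


def is_within_retention(last_processed_at, now=None):
    """Return True if data committed at `last_processed_at` is still inside the window."""
    now = now or datetime.now(timezone.utc)
    return now - last_processed_at <= RETENTION
```

Alerting well before the limit, for instance once the lag exceeds a few days, leaves time to catch up without a checkpoint reset.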
❌ Databricks Unity Catalog support
Delta Egress does not provide any support for Unity Catalog (UC). However, nothing prevents consumers from adding the output table to a UC as an external table.