Use row tracking for Delta tables
Row tracking allows Delta Lake to track row-level lineage in a Delta Lake table. When enabled on a Delta Lake table, row tracking adds two new metadata fields to the table:
- Row IDs provide rows with an identifier that is unique within the table. A row keeps the same ID whenever it is modified using a
MERGE
orUPDATE
statement. - Row commit versions record the last version of the table in which the row was modified. A row is assigned a new version whenever it is modified using a
MERGE
orUPDATE
statement.
Enable row tracking
Section titled “Enable row tracking”You must explicitly enable row tracking using one of the following methods:
- New table: Set the table property
delta.enableRowTracking = true
in theCREATE TABLE
command.
-- Create an empty tableCREATE TABLE student (id INT, name STRING, age INT)TBLPROPERTIES ('delta.enableRowTracking' = 'true');
-- Using a CTAS statementCREATE TABLE course_newTBLPROPERTIES ('delta.enableRowTracking' = 'true')AS SELECT * FROM course_old;
-- Using a LIKE statement to copy configurationCREATE TABLE graduate LIKE student;
-- Using a CLONE statement to copy configurationCREATE TABLE graduate CLONE student;
- Existing table: Available from Delta 3.3 and above, set the table property
'delta.enableRowTracking' = 'true'
in theALTER TABLE
command.
ALTER TABLE grade SET TBLPROPERTIES ('delta.enableRowTracking' = 'true');
- All new tables: Set the configuration
spark.databricks.delta.properties.defaults.enableRowTracking = true
for the current session in theSET
command.
SET spark.databricks.delta.properties.defaults.enableRowTracking = true;
spark.conf.set("spark.databricks.delta.properties.defaults.enableRowTracking", True)
spark.conf.set("spark.databricks.delta.properties.defaults.enableRowTracking", true)
Row tracking storage
Section titled “Row tracking storage”Enabling row tracking may increase the size of the table. Delta Lake stores row tracking metadata fields in hidden metadata columns in the data files. Some operations, such as insert-only operations do not use these hidden columns and instead track the row ids and row commit versions using metadata in the Delta Lake log. Data reorganization operations such as OPTIMIZE
and REORG
cause the row ids and row commit versions to be tracked using the hidden metadata column, even when they were stored using metadata.
Read row tracking metadata fields
Section titled “Read row tracking metadata fields”Row tracking adds the following metadata fields that can be accessed when reading a table:
Column name | Type | Values |
---|---|---|
_metadata.row_id | Long | The unique identifier of the row. |
_metadata.row_commit_version | Long | The table version at which the row was last inserted or updated. |
The row ids and row commit versions metadata fields are not automatically included when reading the table. Instead, these metadata fields must be manually selected from the hidden _metadata
column which is available for all tables in Apache Spark.
SELECT _metadata.row_id, _metadata.row_commit_version, * FROM table_name;
spark.read.table("table_name") \ .select("_metadata.row_id", "_metadata.row_commit_version", "*")
spark.read.table("table_name") .select("_metadata.row_id", "_metadata.row_commit_version", "*")
Disable row tracking
Section titled “Disable row tracking”Row tracking can be disabled to reduce the storage overhead of the metadata fields. After disabling row tracking the metadata fields remain available, but all rows always get assigned a new id and commit version whenever they are touched by an operation.
ALTER TABLE table_name SET TBLPROPERTIES (delta.enableRowTracking = false);
spark.sql("ALTER TABLE table_name SET TBLPROPERTIES (delta.enableRowTracking = false)")
spark.sql("ALTER TABLE table_name SET TBLPROPERTIES (delta.enableRowTracking = false)")
Limitations
Section titled “Limitations”The following limitations exist:
- The row ids and row commit versions metadata fields cannot be accessed while reading the Change data feed.
- Once the Row Tracking feature is added to the table it cannot be removed without recreating the table.