
Table utility commands

Delta tables support a number of utility commands.

For many Delta Lake operations, you enable integration with Apache Spark DataSourceV2 and Catalog APIs (since 3.0) by setting configurations when you create a new SparkSession. See Configure SparkSession.

Remove files no longer referenced by a Delta table


You can remove files that are no longer referenced by a Delta table and are older than the retention threshold by running the vacuum command on the table. vacuum is not triggered automatically. The default retention threshold for the files is 7 days. To change this behavior, see Data retention.

VACUUM eventsTable -- Runs VACUUM in 'FULL' mode: deletes data files outside of the retention duration
                   -- and all files in the table directory not referenced by the table.

VACUUM eventsTable LITE -- Runs VACUUM in 'LITE' mode, which is faster.
-- Instead of finding all files in the table directory, `VACUUM LITE` uses the Delta transaction log
-- to identify and remove files no longer referenced by any table versions within the retention duration.
-- If `VACUUM LITE` cannot be completed because the Delta log has been pruned, a `DELTA_CANNOT_VACUUM_LITE`
-- exception is raised. This mode is available only in Delta 3.3 and above.

VACUUM '/data/events' -- vacuum files in path-based table
VACUUM delta.`/data/events/`
VACUUM delta.`/data/events/` RETAIN 100 HOURS -- vacuum files not required by versions more than 100 hours old
VACUUM eventsTable DRY RUN -- do dry run to get the list of files to be deleted
VACUUM eventsTable USING INVENTORY inventoryTable -- vacuum files based on a provided reservoir of files as a Delta table
VACUUM eventsTable USING INVENTORY (select * from inventoryTable) -- vacuum files based on a provided reservoir of files as a Spark SQL query

See the Delta Lake APIs for Scala, Java, and Python syntax details.
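
For reference, a minimal Python sketch of the same operation through the DeltaTable API, assuming the eventsTable from the examples above already exists:

from delta.tables import *

deltaTable = DeltaTable.forName(spark, "eventsTable")  # or DeltaTable.forPath(spark, "/data/events/")

deltaTable.vacuum()     # vacuum files not required by versions older than the default retention period
deltaTable.vacuum(100)  # vacuum files not required by versions more than 100 hours old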

An inventory table contains a list of file paths together with their size, type (directory or not), and last modification time. When an INVENTORY option is provided, VACUUM considers the files listed there instead of doing the full listing of the table directory, which can be time-consuming for very large tables. The inventory table can be specified as a Delta table or as a Spark SQL query that produces the expected table schema. The schema should be as follows:

| Column Name | Type | Description |
| --- | --- | --- |
| path | string | Fully qualified URI. |
| length | integer | Size in bytes. |
| isDir | boolean | Boolean indicating if it is a directory. |
| modificationTime | integer | File update time in milliseconds. |
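
As an illustration of the INVENTORY option, the following hedged sketch registers a hypothetical pre-computed file listing (a table called storage_inventory here, which is an assumption and not part of Delta Lake) as a temporary view and passes it to VACUUM as a query:

# Expose a hypothetical file listing with the columns VACUUM expects:
# path, length, isDir, modificationTime.
inventoryDF = spark.table("storage_inventory").selectExpr(
    "path", "length", "isDir", "modificationTime")
inventoryDF.createOrReplaceTempView("events_inventory")

# Vacuum using the inventory instead of listing the table directory.
spark.sql("VACUUM eventsTable USING INVENTORY (SELECT * FROM events_inventory)")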

Retrieve Delta table history

You can retrieve information on the operations, user, timestamp, and so on for each write to a Delta table by running the history command. The operations are returned in reverse chronological order. By default, table history is retained for 30 days.

SQL

DESCRIBE HISTORY '/data/events/' -- get the full history of the table
DESCRIBE HISTORY delta.`/data/events/`
DESCRIBE HISTORY '/data/events/' LIMIT 1 -- get the last operation only
DESCRIBE HISTORY eventsTable

See Configure SparkSession for the steps to enable support for SQL commands in Apache Spark.

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable)

fullHistoryDF = deltaTable.history()    # get the full history of the table
lastOperationDF = deltaTable.history(1) # get the last operation only

See the Delta Lake APIs for Scala/Java/Python syntax details.

The output of the history operation has the following columns.

| Column | Type | Description |
| --- | --- | --- |
| version | long | Table version generated by the operation. |
| timestamp | timestamp | When this version was committed. |
| userId | string | ID of the user that ran the operation. |
| userName | string | Name of the user that ran the operation. |
| operation | string | Name of the operation. |
| operationParameters | map | Parameters of the operation (for example, predicates). |
| job | struct | Details of the job that ran the operation. |
| notebook | struct | Details of the notebook from which the operation was run. |
| clusterId | string | ID of the cluster on which the operation ran. |
| readVersion | long | Version of the table that was read to perform the write operation. |
| isolationLevel | string | Isolation level used for this operation. |
| isBlindAppend | boolean | Whether this operation appended data. |
| operationMetrics | map | Metrics of the operation (for example, number of rows and files modified). |
| userMetadata | string | User-defined commit metadata if it was specified. |
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
|version| timestamp|userId|userName|operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend| operationMetrics|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+
| 5|2019-07-29 14:07:47| null| null| DELETE|[predicate -> ["(...|null| null| null| 4| Serializable| false|[numTotalRows -> ...|
| 4|2019-07-29 14:07:41| null| null| UPDATE|[predicate -> (id...|null| null| null| 3| Serializable| false|[numTotalRows -> ...|
| 3|2019-07-29 14:07:29| null| null| DELETE|[predicate -> ["(...|null| null| null| 2| Serializable| false|[numTotalRows -> ...|
| 2|2019-07-29 14:06:56| null| null| UPDATE|[predicate -> (id...|null| null| null| 1| Serializable| false|[numTotalRows -> ...|
| 1|2019-07-29 14:04:31| null| null| DELETE|[predicate -> ["(...|null| null| null| 0| Serializable| false|[numTotalRows -> ...|
| 0|2019-07-29 14:01:40| null| null| WRITE|[mode -> ErrorIfE...|null| null| null| null| Serializable| true|[numFiles -> 2, n...|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+

The history operation returns a collection of operation metrics in the operationMetrics column map.

The following table lists the map key definitions by operation; a short sketch for reading these metrics from Python follows the table.

| Operation | Metric name | Description |
| --- | --- | --- |
| WRITE, CREATE TABLE AS SELECT, REPLACE TABLE AS SELECT, COPY INTO | numFiles | Number of files written. |
| | numOutputBytes | Size in bytes of the written contents. |
| | numOutputRows | Number of rows written. |
| STREAMING UPDATE | numAddedFiles | Number of files added. |
| | numRemovedFiles | Number of files removed. |
| | numOutputRows | Number of rows written. |
| | numOutputBytes | Size of write in bytes. |
| DELETE | numAddedFiles | Number of files added. Not provided when partitions of the table are deleted. |
| | numRemovedFiles | Number of files removed. |
| | numDeletedRows | Number of rows removed. Not provided when partitions of the table are deleted. |
| | numCopiedRows | Number of rows copied in the process of deleting files. |
| | executionTimeMs | Time taken to execute the entire operation. |
| | scanTimeMs | Time taken to scan the files for matches. |
| | rewriteTimeMs | Time taken to rewrite the matched files. |
| TRUNCATE | numRemovedFiles | Number of files removed. |
| | executionTimeMs | Time taken to execute the entire operation. |
| MERGE | numSourceRows | Number of rows in the source DataFrame. |
| | numTargetRowsInserted | Number of rows inserted into the target table. |
| | numTargetRowsUpdated | Number of rows updated in the target table. |
| | numTargetRowsDeleted | Number of rows deleted in the target table. |
| | numTargetRowsCopied | Number of target rows copied. |
| | numOutputRows | Total number of rows written out. |
| | numTargetFilesAdded | Number of files added to the sink (target). |
| | numTargetFilesRemoved | Number of files removed from the sink (target). |
| | executionTimeMs | Time taken to execute the entire operation. |
| | scanTimeMs | Time taken to scan the files for matches. |
| | rewriteTimeMs | Time taken to rewrite the matched files. |
| UPDATE | numAddedFiles | Number of files added. |
| | numRemovedFiles | Number of files removed. |
| | numUpdatedRows | Number of rows updated. |
| | numCopiedRows | Number of rows just copied over in the process of updating files. |
| | executionTimeMs | Time taken to execute the entire operation. |
| | scanTimeMs | Time taken to scan the files for matches. |
| | rewriteTimeMs | Time taken to rewrite the matched files. |
| FSCK | numRemovedFiles | Number of files removed. |
| CONVERT | numConvertedFiles | Number of Parquet files that have been converted. |
| OPTIMIZE | numAddedFiles | Number of files added. |
| | numRemovedFiles | Number of files optimized. |
| | numAddedBytes | Number of bytes added after the table was optimized. |
| | numRemovedBytes | Number of bytes removed. |
| | minFileSize | Size of the smallest file after the table was optimized. |
| | p25FileSize | Size of the 25th percentile file after the table was optimized. |
| | p50FileSize | Median file size after the table was optimized. |
| | p75FileSize | Size of the 75th percentile file after the table was optimized. |
| | maxFileSize | Size of the largest file after the table was optimized. |
| VACUUM | numDeletedFiles | Number of deleted files. |
| | numVacuumedDirectories | Number of vacuumed directories. |
| | numFilesToDelete | Number of files to delete. |
| RESTORE | tableSizeAfterRestore | Table size in bytes after restore. |
| | numOfFilesAfterRestore | Number of files in the table after restore. |
| | numRemovedFiles | Number of files removed by the restore operation. |
| | numRestoredFiles | Number of files that were added as a result of the restore. |
| | removedFilesSize | Size in bytes of files removed by the restore. |
| | restoredFilesSize | Size in bytes of files added by the restore. |
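
For example, a short PySpark sketch (table name is a placeholder) that prints the operationMetrics map of the most recent commit:

from delta.tables import *

deltaTable = DeltaTable.forName(spark, "eventsTable")  # placeholder table name

# history(1) returns a DataFrame containing only the latest commit.
lastCommit = deltaTable.history(1).select("version", "operation", "operationMetrics").collect()[0]
print(lastCommit["version"], lastCommit["operation"])
for metric, value in lastCommit["operationMetrics"].items():
    print(metric, "=", value)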

Retrieve Delta table details

You can retrieve detailed information about a Delta table (for example, number of files, data size) using DESCRIBE DETAIL.

SQL

DESCRIBE DETAIL '/data/events/'
DESCRIBE DETAIL eventsTable

See Configure SparkSession for the steps to enable support for SQL commands in Apache Spark.

from delta.tables import *
deltaTable = DeltaTable.forPath(spark, pathToTable)
detailDF = deltaTable.detail()

See the Delta Lake APIs for Scala/Java/Python syntax details.

The output of this operation has only one row with the following schema.

| Column | Type | Description |
| --- | --- | --- |
| format | string | Format of the table, that is, delta. |
| id | string | Unique ID of the table. |
| name | string | Name of the table as defined in the metastore. |
| description | string | Description of the table. |
| location | string | Location of the table. |
| createdAt | timestamp | When the table was created. |
| lastModified | timestamp | When the table was last modified. |
| partitionColumns | array of strings | Names of the partition columns if the table is partitioned. |
| numFiles | long | Number of files in the latest version of the table. |
| sizeInBytes | int | The size of the latest snapshot of the table in bytes. |
| properties | string-string map | All the properties set for this table. |
| minReaderVersion | int | Minimum version of readers (according to the log protocol) that can read the table. |
| minWriterVersion | int | Minimum version of writers (according to the log protocol) that can write to the table. |
+------+--------------------+------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+
|format| id| name|description| location| createdAt| lastModified|partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|
+------+--------------------+------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+
| delta|d31f82d2-a69f-42e...|default.deltatable| null|file:/Users/tuor/...|2020-06-05 12:20:...|2020-06-05 12:20:20| []| 10| 12345| []| 1| 2|
+------+--------------------+------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+

Generate a manifest file

You can generate a manifest file for a Delta table that can be used by other processing engines (that is, other than Apache Spark) to read the Delta table. For example, to generate a manifest file that can be used by Presto and Athena to read a Delta table, you run the following:

GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`

See Configure SparkSession for the steps to enable support for SQL commands in Apache Spark.
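
The DeltaTable API exposes the same operation; a minimal Python sketch with the path placeholder left as-is:

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")  # generate manifest files for the table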

Convert a Parquet table to a Delta table

Convert a Parquet table to a Delta table in-place. This command lists all the files in the directory, creates a Delta Lake transaction log that tracks these files, and automatically infers the data schema by reading the footers of all Parquet files. If your data is partitioned, you must specify the schema of the partition columns as a DDL-formatted string (that is, <column-name1> <type>, <column-name2> <type>, ...).

By default, this command will collect per-file statistics (e.g. minimum and maximum values for each column). These statistics will be used at query time to provide faster queries. You can disable this statistics collection in the SQL API using NO STATISTICS.

-- Convert unpartitioned Parquet table at path '<path-to-table>'
CONVERT TO DELTA parquet.`<path-to-table>`

-- Convert unpartitioned Parquet table and disable statistics collection
CONVERT TO DELTA parquet.`<path-to-table>` NO STATISTICS

-- Convert partitioned Parquet table at path '<path-to-table>' and partitioned by integer columns named 'part' and 'part2'
CONVERT TO DELTA parquet.`<path-to-table>` PARTITIONED BY (part int, part2 int)

-- Convert partitioned Parquet table and disable statistics collection
CONVERT TO DELTA parquet.`<path-to-table>` NO STATISTICS PARTITIONED BY (part int, part2 int)
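
The equivalent conversion through the Python API uses DeltaTable.convertToDelta; a minimal sketch with the path placeholder left as-is:

from delta.tables import *

# Convert an unpartitioned Parquet table at the given path.
deltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`")

# Convert a partitioned Parquet table, passing the partition schema as a DDL-formatted string.
partitionedDeltaTable = DeltaTable.convertToDelta(
    spark, "parquet.`<path-to-table>`", "part int, part2 int")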

Convert an Iceberg table to a Delta table

You can convert an Iceberg table to a Delta table in place if the underlying file format of the Iceberg table is Parquet. As with conversion from a Parquet table, the conversion is in-place and there won’t be any data copy or data rewrite. The original Iceberg table and the converted Delta table have separate histories, so modifying the Delta table should not affect the Iceberg table as long as the source data Parquet files are not touched or deleted.

The following command creates a Delta Lake transaction log based on the Iceberg table’s native file manifest, schema and partitioning information. The converter also collects column stats during the conversion, unless NO STATISTICS is specified.

-- Convert the Iceberg table in the path <path-to-table>.
CONVERT TO DELTA iceberg.`<path-to-table>`

-- Convert the Iceberg table in the path <path-to-table> without collecting statistics.
CONVERT TO DELTA iceberg.`<path-to-table>` NO STATISTICS

Convert a Delta table to a Parquet table

You can easily convert a Delta table back to a Parquet table using the following steps:

  1. If you have performed Delta Lake operations that can change the data files (for example, delete or merge), run vacuum with a retention of 0 hours to delete all data files that do not belong to the latest version of the table (see the sketch after these steps).
  2. Delete the _delta_log directory in the table directory.
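
A hedged PySpark sketch of these two steps for a path-based table at /data/events/; the retentionDurationCheck override is needed because VACUUM rejects a zero-hour retention by default, and step 2 uses the JVM Hadoop FileSystem only as one option (any filesystem tool works):

from delta.tables import *

# Step 1: vacuum with a retention of 0 hours (requires disabling the safety check first).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
DeltaTable.forPath(spark, "/data/events/").vacuum(0)

# Step 2: delete the _delta_log directory, leaving only the Parquet data files.
logPath = spark.sparkContext._jvm.org.apache.hadoop.fs.Path("/data/events/_delta_log")
fs = logPath.getFileSystem(spark.sparkContext._jsc.hadoopConfiguration())
fs.delete(logPath, True)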

Restore a Delta table to an earlier state

You can restore a Delta table to its earlier state by using the RESTORE command. A Delta table internally maintains historic versions of the table that enable it to be restored to an earlier state. The RESTORE command accepts either the version number of the earlier state or a timestamp of when the earlier state was created.

RESTORE TABLE db.target_table TO VERSION AS OF <version>
RESTORE TABLE delta.`/data/target/` TO TIMESTAMP AS OF <timestamp>

For example:

| Table version | Operation | Delta log updates | Records in data change log updates |
| --- | --- | --- | --- |
| 0 | INSERT | AddFile(/path/to/file-1, dataChange = true) | (name = Viktor, age = 29), (name = George, age = 55) |
| 1 | INSERT | AddFile(/path/to/file-2, dataChange = true) | (name = George, age = 39) |
| 2 | OPTIMIZE | AddFile(/path/to/file-3, dataChange = false), RemoveFile(/path/to/file-1), RemoveFile(/path/to/file-2) | (No records, as OPTIMIZE compaction does not change the data in the table) |
| 3 | RESTORE(version=1) | RemoveFile(/path/to/file-3), AddFile(/path/to/file-1, dataChange = true), AddFile(/path/to/file-2, dataChange = true) | (name = Viktor, age = 29), (name = George, age = 55), (name = George, age = 39) |

In the preceding example, the RESTORE command results in updates that were already seen when reading the Delta table versions 0 and 1. If a streaming query was reading this table, these files would be considered newly added data and would be processed again.

RESTORE reports the following metrics as a single row DataFrame once the operation is complete:

  • table_size_after_restore: The size of the table after restoring.
  • num_of_files_after_restore: The number of files in the table after restoring.
  • num_removed_files: Number of files removed (logically deleted) from the table.
  • num_restored_files: Number of files restored due to rolling back.
  • removed_files_size: Total size in bytes of the files that are removed from the table.
  • restored_files_size: Total size in bytes of the files that are restored.
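
A rough Python equivalent via the DeltaTable API, which returns these metrics as a single-row DataFrame (table name and version are placeholders):

from delta.tables import *

deltaTable = DeltaTable.forName(spark, "db.target_table")  # placeholder table name

metricsDF = deltaTable.restoreToVersion(1)                          # restore to a specific version
# metricsDF = deltaTable.restoreToTimestamp("2019-01-01 00:00:00")  # or restore to a timestamp
metricsDF.show()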

Shallow clone a Delta table

You can create a shallow copy of an existing Delta table at a specific version using the shallow clone command.

Any changes made to shallow clones affect only the clones themselves and not the source table, as long as they don’t touch the source data Parquet files.

The metadata that is cloned includes: schema, partitioning information, invariants, and nullability. For shallow clones, stream metadata is not cloned. Metadata that is not cloned includes the table description and user-defined commit metadata.

CREATE TABLE delta.`/data/target/` SHALLOW CLONE delta.`/data/source/` -- Create a shallow clone of /data/source at /data/target
CREATE OR REPLACE TABLE db.target_table SHALLOW CLONE db.source_table -- Replace the target; the target table must be emptied first
CREATE TABLE IF NOT EXISTS delta.`/data/target/` SHALLOW CLONE db.source_table -- No-op if the target table exists
CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source`
CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source` VERSION AS OF version
CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source` TIMESTAMP AS OF timestamp_expression -- timestamp can be like "2019-01-01" or like date_sub(current_date(), 1)
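
From PySpark, the same clone can be issued with spark.sql; a minimal sketch using the table names from the examples above, capturing the DataFrame the command returns:

# Issue the shallow clone through Spark SQL and keep the returned DataFrame,
# which carries the clone metrics described below.
cloneMetricsDF = spark.sql("CREATE OR REPLACE TABLE db.target_table SHALLOW CLONE db.source_table")
cloneMetricsDF.show()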

CLONE reports the following metrics as a single row DataFrame once the operation is complete:

  • source_table_size: Size of the source table that’s being cloned in bytes.
  • source_num_of_files: The number of files in the source table.

If you have created a shallow clone, any user that reads the shallow clone needs permission to read the files in the original table, since the data files remain in the source table’s directory from which it was cloned. To make changes to the clone, users need write access to the clone’s directory.

Data archiving

When doing machine learning, you may want to archive a certain version of a table on which you trained an ML model. Future models can be tested using this archived data set.

-- Trained model on version 15 of Delta table
CREATE TABLE delta.`/model/dataset` SHALLOW CLONE entire_dataset VERSION AS OF 15

Short-term experiments on a production table


To test a workflow on a production table without corrupting the table, you can easily create a shallow clone. This allows you to run arbitrary workflows on the cloned table that contains all the production data but does not affect any production workloads.

-- Perform shallow clone
CREATE OR REPLACE TABLE my_test SHALLOW CLONE my_prod_table;
UPDATE my_test SET invalid = true WHERE user_id IS NULL;
-- Run a bunch of validations. Once happy:

-- This should leverage the update information in the clone to prune to only
-- changed files in the clone if possible
MERGE INTO my_prod_table
USING my_test
ON my_test.user_id <=> my_prod_table.user_id
WHEN MATCHED AND my_test.user_id IS NULL THEN UPDATE *;
DROP TABLE my_test;

When creating a clone, you can also override table properties with TBLPROPERTIES, as in the example below. Table property overrides are particularly useful for:

  • Annotating tables with owner or user information when sharing data with different business units.
  • Archiving Delta tables when time travel is required. You can specify the log retention period independently for the archive table. For example:
CREATE OR REPLACE TABLE archive.my_table SHALLOW CLONE prod.my_table
TBLPROPERTIES (
delta.logRetentionDuration = '3650 days',
delta.deletedFileRetentionDuration = '3650 days'
)
LOCATION 'xx://archive/my_table'

Shallow clone for Parquet and Iceberg combines the functionality used to clone Delta tables and to convert tables to Delta Lake: you can use clone to convert data from Parquet or Iceberg data sources to managed or external Delta tables with the same basic syntax.

Replace has the same limitation as Delta shallow clone: the target table must be emptied before applying the replace.

CREATE OR REPLACE TABLE <target_table_name> SHALLOW CLONE parquet.`/path/to/data`;
CREATE OR REPLACE TABLE <target_table_name> SHALLOW CLONE iceberg.`/path/to/data`;