Delta Lake I/O¶
About¶
Import data into and export data from Delta Lake tables (paper), for humans and machines.
Synopsis¶
Load from Delta Lake table:
ctk load {file,http,https,s3,abfs,gs,hdfs,lakefs}+deltalake://...
Export to Delta Lake table:
ctk save {file,http,https,s3,abfs,gs,hdfs,lakefs}+deltalake://...
Install¶
uv tool install --upgrade 'cratedb-toolkit[deltalake]'
Tip
For speedy installations, we recommend using the uv package manager.
Install it using brew install uv on macOS or pipx install uv on
other operating systems.
Usage¶
Delta Lake supports different storage backends like S3, Azure Data Lake, and Google Cloud Storage; see Loading a Delta Table.
Please look up the available configuration parameters in the reference documentation
of object_store, the library that provides the uniform storage API (s3 options, azure options, gcs options).
Then derive your ETL commands from the examples shared below.
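Storage options are passed to the source and target URLs as query parameters, as the examples below show. A minimal sketch, using only the Python standard library, of how such a URL can be composed programmatically; the helper name `delta_source_url` is hypothetical, not part of the toolkit:

```python
from urllib.parse import urlencode

def delta_source_url(scheme: str, path: str, **options: str) -> str:
    """Compose a `ctk` Delta Lake URL, appending storage options
    as URL query parameters. Hypothetical helper for illustration."""
    query = urlencode(options)
    return f"{scheme}+deltalake://{path}" + (f"?{query}" if query else "")

print(delta_source_url("s3", "bucket1/demo/taxi-tiny", AWS_REGION="eu-west-1"))
# s3+deltalake://bucket1/demo/taxi-tiny?AWS_REGION=eu-west-1
```

Values are URL-encoded by `urlencode`, so credentials containing special characters remain valid query parameters.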
Load¶
Load from filesystem.
ctk load \
"file+deltalake://./var/lib/delta/demo/taxi-tiny" \
"crate://crate:crate@localhost:4200/demo/taxi-tiny"
Load from HTTP.
ctk load \
"https+deltalake://datahub.example.org/delta/demo/taxi-tiny" \
"crate://crate:crate@localhost:4200/demo/taxi-tiny"
Load from AWS S3.
ctk load \
"s3+deltalake://bucket1/demo/taxi-tiny?AWS_ACCESS_KEY_ID=<your_access_key_id>&AWS_SECRET_ACCESS_KEY=<your_secret_access_key>&AWS_ENDPOINT=<endpoint_url>&AWS_REGION=<s3-region>" \
"crate://crate:crate@localhost:4200/demo/taxi-tiny"
Load from Azure Data Lake.
ctk load \
"abfs+deltalake://container/path?AZURE_STORAGE_ACCOUNT_NAME=devstoreaccount1&AZURE_STORAGE_ACCOUNT_KEY=foo&..." \
"crate://crate:crate@localhost:4200/demo/example"
Load from Google Cloud.
ctk load \
"gs+deltalake://bucket1/demo/taxi-tiny?GOOGLE_SERVICE_ACCOUNT=...&GOOGLE_SERVICE_ACCOUNT_KEY=" \
"crate://crate:crate@localhost:4200/demo/example"
Load from HDFS.
ctk load \
"hdfs+deltalake://localhost:9000" \
"crate://crate:crate@localhost:4200/demo/example"
Load from LakeFS.
ctk load \
"lakefs+deltalake://bucket1/demo/taxi-tiny?endpoint=https://lakefs.example.org&access_key_id=LAKEFSID&secret_access_key=LAKEFSKEY" \
"crate://crate:crate@localhost:4200/demo/example"
Tip
After loading your data into CrateDB, query it.
ctk shell --command 'SELECT * FROM "demo"."taxi-tiny";'
ctk show table '"demo"."taxi-tiny"'
Save¶
Save to filesystem.
ctk save \
"crate://crate:crate@localhost:4200/demo/taxi-tiny" \
"file+deltalake://./var/lib/delta/demo/taxi-tiny"
Save to AWS S3.
ctk save \
"crate://crate:crate@localhost:4200/demo/taxi-tiny" \
"s3+deltalake://bucket1/demo/taxi-tiny?AWS_ACCESS_KEY_ID=<your_access_key_id>&AWS_SECRET_ACCESS_KEY=<your_secret_access_key>&AWS_ENDPOINT=<endpoint_url>&AWS_REGION=<s3-region>"
For other target URLs, see the “Load” section.
Cloud¶
A canonical invocation for copying data from a Delta Lake table on AWS S3 to CrateDB Cloud.
Please note the ssl=true query parameter on the CrateDB cluster URL.
ctk load \
"s3+deltalake://bucket1/demo/taxi-tiny?AWS_ACCESS_KEY_ID=<your_access_key_id>&AWS_SECRET_ACCESS_KEY=<your_secret_access_key>&AWS_ENDPOINT=<endpoint_url>&AWS_REGION=<s3-region>" \
"crate://admin:dZ...6LqB@green-shaak-ti.eks1.eu-west-1.aws.cratedb.net:4200/testdrive/demo?ssl=true"
Options¶
Generic options¶
batch-size¶
The source URL accepts the batch-size option to configure pagination.
You will need to find an optimal value based on the shape of your data.
The default value is 75000.
Example usage
ctk load \
"file+deltalake://./var/lib/delta/?batch-size=200000" \
"crate://crate:crate@localhost:4200/demo/taxi-tiny"
ctk save \
"crate://crate:crate@localhost:4200/demo/taxi-tiny?batch-size=200000" \
"file+deltalake://./var/lib/delta/"
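Conceptually, the batch-size option consumes the record stream in fixed-size chunks, so the last batch may be smaller. A sketch of that pagination pattern, assuming a plain Python iterable stands in for the record stream:

```python
from itertools import islice

def batched(rows, batch_size=75_000):
    """Yield lists of at most `batch_size` rows from an iterable.
    Illustrative sketch, not the toolkit's actual implementation."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch

sizes = [len(b) for b in batched(range(10), batch_size=4)]
print(sizes)  # [4, 4, 2]
```

A larger batch size means fewer round trips but higher memory usage per batch, which is why the optimal value depends on the shape of your data.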
CrateDB options¶
if-exists¶
The target CrateDB table will be created automatically if it does not exist. If it
does exist, the if-exists URL query parameter configures the behavior. The default
value is fail; the possible values are:
fail: Raise a ValueError. (default)
replace: Drop the table before inserting new values.
append: Insert new values into the existing table.
Example usage
To always replace the target table, i.e. to drop and re-create it
prior to inserting data, use ?if-exists=replace.
export CRATEDB_CLUSTER_URL="crate://crate:crate@localhost:4200/testdrive/demo?if-exists=replace"
ctk load ...
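The three if-exists behaviors can be pictured as a choice of SQL statements against the target table. A hypothetical sketch of that decision logic; the function and the statement placeholders are illustrative only, not the toolkit's internals:

```python
def plan_statements(table: str, table_exists: bool, if_exists: str = "fail") -> list[str]:
    """Sketch of if-exists semantics: which statements run against the target table.
    Hypothetical helper for illustration."""
    if not table_exists:
        # Table is created automatically when missing; if-exists does not apply.
        return [f"CREATE TABLE {table} (...)", f"INSERT INTO {table} ..."]
    if if_exists == "fail":
        raise ValueError(f"Table {table} already exists")
    if if_exists == "replace":
        return [f"DROP TABLE {table}", f"CREATE TABLE {table} (...)", f"INSERT INTO {table} ..."]
    if if_exists == "append":
        return [f"INSERT INTO {table} ..."]
    raise ValueError(f"Unknown if-exists value: {if_exists!r}")
```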
Delta Lake options¶
mode¶
The Delta Lake URL supports three modes of operation when used as a data sink.
mode="error" (default): Raise an error if the table exists.
mode="append": Add the DataFrame’s rows to the existing table.
mode="overwrite": Replace the table contents with the DataFrame.
Example usage
ctk save "file+deltalake://./var/lib/delta/?...&mode=append"
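The effect of each sink mode on an existing table can be simulated against an in-memory stand-in. A sketch under the assumption that a dict of row lists represents the table store; this is illustrative only, not how Delta Lake persists data:

```python
def write_table(store: dict, name: str, rows: list, mode: str = "error") -> None:
    """Simulate the Delta Lake sink modes against an in-memory table store.
    Illustrative sketch only."""
    if name in store:
        if mode == "error":
            raise FileExistsError(f"Table {name} already exists")
        if mode == "overwrite":
            store[name] = list(rows)
            return
        if mode == "append":
            store[name].extend(rows)
            return
        raise ValueError(f"Unknown mode: {mode!r}")
    store[name] = list(rows)

store = {"demo": [1, 2]}
write_table(store, "demo", [3], mode="append")
print(store["demo"])  # [1, 2, 3]
```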