Databricks¶
Load data from Databricks into CrateDB.
Databricks is a cloud-based data warehouse platform for big data analytics and artificial intelligence based on Apache Spark and Delta Lake.
Prerequisites¶
Use Docker or Podman to run all components. This approach works consistently across Linux, macOS, and Windows. You also need an account on https://databricks.com/.
Install¶
Install the most recent Python package cratedb-toolkit, or evaluate alternative installation methods.
uv tool install --upgrade 'cratedb-toolkit[io-ingest]'
Tutorial¶
5-minute step-by-step instructions about how to work with Databricks and CrateDB.
Authentication¶
For accessing Databricks SQL warehouses, you need an access token, your server hostname, and the HTTP path (endpoint of the warehouse).
On the Databricks platform, in your settings dialog (top right corner),
navigate to “User » Developer » Access tokens » Manage”, and select
“Generate new token”. Then, copy the generated token as suggested on
the result dialog (example: dapi367e0b1....), and use it in your
command below.
– https://<instance>.cloud.databricks.com/settings/user/developer
Within the primary navigation, navigate to “SQL » SQL Warehouses”, select the “Serverless Starter Warehouse” or any other, navigate to “Connection Details”, and use the values in “Server hostname” and “HTTP path” in your command below.
– https://<instance>.cloud.databricks.com/sql/warehouses/<warehouse>/connectionDetails
Services¶
Run CrateDB using Docker or Podman.
docker run --rm --name=cratedb \
--publish=4200:4200 --publish=5432:5432 --env=CRATE_HEAP_SIZE=2g \
docker.io/crate:latest '-Cdiscovery.type=single-node'
Populate data¶
The Databricks starter warehouse includes a few example tables. You can use them for data transfer explorations, so you don’t need to load any data into the Databricks SQL warehouse before.
Load data¶
Use CrateDB Toolkit to load data from Databricks SQL warehouse into CrateDB table.
samples.nyctaxi.trips¶
Transfer data.
ctk load \
"databricks://token:<access_token>@<instance>.cloud.databricks.com:443/?http_path=/sql/1.0/warehouses/<warehouse>&catalog=samples&table=nyctaxi.trips" \
"crate://crate:crate@localhost:4200/testdrive/nyctaxi_trips"
Query data using crash.
crash -c 'SHOW CREATE TABLE testdrive.nyctaxi_trips'
crash -c 'SELECT * FROM testdrive.nyctaxi_trips LIMIT 5'
samples.accuweather.forecast_hourly_metric¶
Transfer data.
ctk load \
"databricks://token:<access_token>@<instance>.cloud.databricks.com:443/?http_path=/sql/1.0/warehouses/<warehouse>&catalog=samples&table=accuweather.forecast_hourly_metric" \
"crate://crate:crate@localhost:4200/testdrive/accuweather_forecast_hourly_metric"
Query data.
crash -c 'SHOW CREATE TABLE testdrive.accuweather_forecast_hourly_metric'
crash -c 'SELECT * FROM testdrive.accuweather_forecast_hourly_metric LIMIT 5'
Documentation¶
The Databricks table name can be provided by using the &table= query parameter.
Databricks options
server_hostname: Host name of the Databricks instance.http_path: Path to the Databricks instance.catalog: Database catalog name.schema: Database schema name.
Databricks access token authentication
access_token: Access token for Databricks.
A fully qualified traditional URL template for Databricks looks like this.
databricks://token:${DATABRICKS_ACCESS_TOKEN}@${DATABRICKS_SERVER_HOSTNAME}?http_path=${DATABRICKS_HTTP_PATH}&catalog=${DATABRICKS_CATALOG}&schema=${DATABRICKS_SCHEMA}
Databricks OAuth M2M authentication (service principal)
client_id: OAuth service principal’s client ID (application ID).client_secret: OAuth secret for service principal.
A fully qualified M2M URL template for Databricks looks like this.
databricks://@${DATABRICKS_SERVER_HOSTNAME}?http_path=${DATABRICKS_HTTP_PATH}&client_id=${DATABRICKS_CLIENT_ID}&client_secret=${DATABRICKS_CLIENT_SECRET}&catalog=${DATABRICKS_CATALOG}&schema=${DATABRICKS_SCHEMA}
To set up OAuth M2M authentication:
Create a service principal in your Databricks workspace.
Generate an OAuth secret for the service principal.
Ensure the service principal has the necessary permissions to access your workspace resources.
To learn more details, you can read about Databricks OAuth M2M authentication on the original documentation.
See also the documentation about SQLAlchemy’s Databricks dialect.
CrateDB options
Please make sure to replace username, password, and hostname with values matching your environment.
ssl: Use the?ssl=truequery parameter to enable SSL. Also use this when connecting to CrateDB Cloud.'crate://crate:crate@cratedb.example.org:4200/schema/table?ssl=true'
See also¶
CrateDB also provides native data import capabilities and support for different ETL applications and frameworks, see load data into CrateDB. If you have additional requirements on this or other I/O adapters, for example to support advanced processing options or different data formats, or if you want us to provide a managed variant, please let us know through any of our support channels, preferably on our community forum.