(databricks)= # Databricks :::{div} sd-text-muted Load data from Databricks into CrateDB. ::: [Databricks] is a cloud-based data warehouse platform for big data analytics and artificial intelligence based on Apache Spark and Delta Lake. ## Prerequisites Use Docker or Podman to run all components. This approach works consistently across Linux, macOS, and Windows. You also need an account on https://databricks.com/. ## Install Install the most recent Python package [cratedb-toolkit], or evaluate {ref}`alternative installation methods `. ```shell uv tool install --upgrade 'cratedb-toolkit[io-ingest]' ``` ## Tutorial 5-minute step-by-step instructions about how to work with Databricks and CrateDB. ### Authentication For accessing Databricks SQL warehouses, you need an **access token**, your **server hostname**, and the **HTTP path** (endpoint of the warehouse). On the Databricks platform, in your settings dialog (top right corner), navigate to "User » Developer » Access tokens » Manage", and select "Generate new token". Then, copy the generated token as suggested on the result dialog (example: `dapi367e0b1....`), and use it in your command below. -- `https://.cloud.databricks.com/settings/user/developer` Within the primary navigation, navigate to "SQL » SQL Warehouses", select the "Serverless Starter Warehouse" or any other, navigate to "Connection Details", and use the values in "Server hostname" and "HTTP path" in your command below. -- `https://.cloud.databricks.com/sql/warehouses//connectionDetails` ### Services Run CrateDB using Docker or Podman. ```shell docker run --rm --name=cratedb \ --publish=4200:4200 --publish=5432:5432 --env=CRATE_HEAP_SIZE=2g \ docker.io/crate:latest '-Cdiscovery.type=single-node' ``` ### Populate data The Databricks starter warehouse includes a few example tables. You can use them for data transfer explorations, so you don't need to load any data into the Databricks SQL warehouse before. ### Load data Use {ref}`index` to load data from Databricks SQL warehouse into CrateDB table. #### `samples.nyctaxi.trips` Transfer data. ```shell ctk load \ "databricks://token:@.cloud.databricks.com:443/?http_path=/sql/1.0/warehouses/&catalog=samples&table=nyctaxi.trips" \ "crate://crate:crate@localhost:4200/testdrive/nyctaxi_trips" ``` Query data using [crash](https://pypi.org/project/crash/). ```shell crash -c 'SHOW CREATE TABLE testdrive.nyctaxi_trips' ``` ```shell crash -c 'SELECT * FROM testdrive.nyctaxi_trips LIMIT 5' ``` #### `samples.accuweather.forecast_hourly_metric` Transfer data. ```shell ctk load \ "databricks://token:@.cloud.databricks.com:443/?http_path=/sql/1.0/warehouses/&catalog=samples&table=accuweather.forecast_hourly_metric" \ "crate://crate:crate@localhost:4200/testdrive/accuweather_forecast_hourly_metric" ``` Query data. ```shell crash -c 'SHOW CREATE TABLE testdrive.accuweather_forecast_hourly_metric' ``` ```shell crash -c 'SELECT * FROM testdrive.accuweather_forecast_hourly_metric LIMIT 5' ``` ## Documentation The Databricks table name can be provided by using the `&table=` query parameter. :::{rubric} Databricks options ::: - `server_hostname`: Host name of the Databricks instance. - `http_path`: Path to the Databricks instance. - `catalog`: Database catalog name. - `schema`: Database schema name. :::{rubric} Databricks access token authentication ::: - `access_token`: Access token for Databricks. A fully qualified traditional URL template for Databricks looks like this. ```shell databricks://token:${DATABRICKS_ACCESS_TOKEN}@${DATABRICKS_SERVER_HOSTNAME}?http_path=${DATABRICKS_HTTP_PATH}&catalog=${DATABRICKS_CATALOG}&schema=${DATABRICKS_SCHEMA} ``` :::{rubric} Databricks OAuth M2M authentication (service principal) ::: - `client_id`: OAuth service principal's client ID (application ID). - `client_secret`: OAuth secret for service principal. A fully qualified M2M URL template for Databricks looks like this. ```shell databricks://@${DATABRICKS_SERVER_HOSTNAME}?http_path=${DATABRICKS_HTTP_PATH}&client_id=${DATABRICKS_CLIENT_ID}&client_secret=${DATABRICKS_CLIENT_SECRET}&catalog=${DATABRICKS_CATALOG}&schema=${DATABRICKS_SCHEMA} ``` To set up OAuth M2M authentication: 1. Create a service principal in your Databricks workspace. 1. Generate an OAuth secret for the service principal. 1. Ensure the service principal has the necessary permissions to access your workspace resources. To learn more details, you can read about [Databricks OAuth M2M authentication] on the original documentation. See also the documentation about [SQLAlchemy's Databricks dialect]. :::{include} ../../_cratedb-options.md ::: ## See also :::{include} /_snippet/ingest-see-also.md ::: [cratedb-toolkit]: https://pypi.org/project/cratedb-toolkit/ [Databricks]: https://www.databricks.com/ [Databricks OAuth M2M authentication]: https://getbruin.com/docs/ingestr/supported-sources/databricks.html [SQLAlchemy's Databricks dialect]: https://docs.databricks.com/aws/en/dev-tools/sqlalchemy