CrateDB I/O Subsystem¶
Import and export data into/from CrateDB.
About¶
A versatile data I/O framework and command-line application to copy data between any source and any destination. It supports many data sources, destinations, and data loading strategies out of the box.
Adapters for CrateDB let you migrate data from any proprietary enterprise data warehouse or database to CrateDB or CrateDB Cloud, to consolidate infrastructure and save operational costs.
The polyglot pipeline subsystem covers data transfer from and to AWS DMS, Databricks, DuckDB, DynamoDB, InfluxDB, MongoDB, MongoDB Atlas, MotherDuck, PostgreSQL, and many more streaming sources, databases, and data platforms or services with CrateDB and CrateDB Cloud. For a full list of integrations, see I/O adapter coverage.
Synopsis¶
You can run jobs from the command-line or by using the Python API.
CLI¶
The CLI entrypoints to the I/O subsystem are the ctk load
and ctk save commands.
Load data from external resource into CrateDB.
ctk load \
'protocol://username:password@hostname:port/resource' \
'crate://crate:crate@cratedb.example.org:4200/schema/table'
Save data from CrateDB to external resource.
ctk save \
'crate://crate:crate@cratedb.example.org:4200/schema/table' \
'protocol://username:password@hostname:port/resource'
Python API¶
Alternatively, use the Python API to import or export data.
from cratedb_toolkit import DatabaseCluster, InputOutputResource
# Connect to CrateDB database cluster.
with DatabaseCluster.from_params(cluster_url="crate://crate:crate@cratedb.example.org:4200/schema/table") as cluster:
# Load data from external resource into CrateDB.
cluster.load_table(source=InputOutputResource(url="protocol://username:password@hostname:port/resource"))
# Save data from CrateDB to external resource.
cluster.save_table(target=InputOutputResource(url="protocol://username:password@hostname:port/resource"))
Installation¶
CrateDB Toolkit uses the Python programming language with native performance extensions, which is installed on most machines today. Otherwise, we recommend to download and install Python from the original source. For installing additional Python packages, we recommend to install the uv package manager.
uv tool install --upgrade 'cratedb-toolkit'
An alternative way to install Python packages is to use pipx
or pip install --user.
pipx install 'cratedb-toolkit'
Another way to invoke CrateDB Toolkit without installing it is to use its container image with Docker, Podman, Kubernetes, and friends.
docker run --rm ghcr.io/crate/cratedb-toolkit ctk
Special considerations
Individual I/O adapters need different sets of dependency packages, please
consult relevant installation notes in the corresponding documentation
sections. Support for I/O adapter types is currently divided into two
families defined by Python package extras io-curated and io-ingest,
which are mutually exclusive to each other.
Support for files, open table formats, InfluxDB, and MongoDB.
uv tool install --upgrade 'cratedb-toolkit[io-curated]'
Support for other databases, streams, platforms, and services.
uv tool install --upgrade 'cratedb-toolkit[io-ingest]'
Alternatively, use Docker or Podman to invoke the container image.
docker run --rm ghcr.io/crate/cratedb-toolkit ctk --version
docker run --rm ghcr.io/crate/cratedb-toolkit-ingest ctk --version
General notes¶
URLs everywhere
The I/O subsystem uses URLs across the board to address data sources and sinks.
Authentication
External: Different data pipeline elements use their specific way to configure access credentials or tokens, using individual parameters.
CrateDB: CrateDB as a pipeline source or sink element uses the same way to
configure credentials across the board.
Please note you must specify a password. If your account does not use a
password, use a random string or just na.
Address arbitrary resources
The resource address will be picked from the resource locator URL path
/resource, which has different semantics based on the adapter type.
It can be a table name, a bucket name and object path, or anything else
that identifies an URL-based resource uniquely within the namespace of
the base URL.
Some adapter types also accept the ?table= URL query parameter that
can optionally encode two components separated by a dot, like
database.table or database.collection. Others encode the database
name into the hostname fragment of the URL.
Please consult individual adapter documentation pages to learn about available URL parameters and differences.
Address CrateDB schema and table
CrateDB schema and table names will be picked from the resource locator
URL path /schema/table.
When addressing CrateDB as a data sink, and omitting those parameters, the target table address will be derived from the address of the data source. When addressing CrateDB as a data source, the source table parameter is obligatory.
If you would like to specify the table name differently, use the ?table= URL
query parameter, the --table command line option, or the CRATEDB_TABLE
environment variable.
If you want to target a different database schema, use the ?schema= URL
query parameter, the --schema command line option, or the CRATEDB_SCHEMA
environment variable. If this parameter is not defined, CrateDB’s default
schema doc will be used.
Connect to CrateDB using SSL
Use the ?ssl=true query parameter, and replace username, password, and
hostname with values matching your environment. Also use this variant to
connect to CrateDB Cloud.
'crate://crate:crate@cratedb.example.org:4200/schema/table?ssl=true'
Transfer multiple resources
Currently, the pipeline system can transfer single resources / tables with most of the I/O adapter types, and multiple resources / catalogs / collections with some adapter types. A few file-based adapters provide file globbing, and the MongoDB I/O adapter permits transfer of whole MongoDB databases, including multiple collections.
This detail (resource globbing and selection) will be improved in future iterations across the board. In the meanwhile, please iterate all sibling resources in a loop where multi-resource selection is not possible yet, i.e. transfer table by table.
Incremental loading
Adapters of the io-ingest family support incremental loading, which means
you can choose to append, merge, or delete+insert data into the destination
table using different strategies.
Incremental loading allows you to ingest only the new rows from the source table into the destination table, which means that you do not have to load the entire table every time you run the data migration procedure.
This comes at a minor cost that a few bookkeeping columns exist in the target table, however that is rarely an issue.
Coverage¶
Supported data formats, database types, data platforms, analytics engines, and other services.
- File formats:
CSV, JSONL/NDJSON, Parquet
- Open table formats:
Apache Iceberg, DeltaLake
- Cloud storage:
Amazon S3, Azure Cloud Storage, Google Cloud Storage (GCS)
- Databases:
Actian Data Platform, Actian X, Amazon Athena, Amazon Redshift, Apache Drill, Apache Druid, Apache Hive and Presto, Apache Solr, Clickhouse, CockroachDB, CrateDB, Databend, Databricks, Denodo, DuckDB, EXASOL DB, Elasticsearch, Firebird, Firebolt, Google BigQuery, Google Sheets, Greenplum, HyperSQL (hsqldb), IBM DB2 and Informix, IBM Netezza Performance Server, Impala, Ingres, Kinetica, Microsoft Access, Microsoft SQL Server, MonetDB, MongoDB, MySQL and MariaDB, OpenGauss, OpenSearch, Oracle, PostgreSQL, Rockset, SAP ASE, SAP HANA, SAP Sybase SQL Anywhere, Snowflake, SQLite, Teradata Vantage, TiDB, Vector, YDB, YugabyteDB
- Streams:
Amazon Kinesis, Apache Kafka (Amazon MSK, Confluent Kafka, Redpanda, RobustMQ, WarpStream)
- Services:
Airtable, Asana, Facebook Ads, GitHub, Google Ads, Google Analytics, Google Sheets, Jira, HubSpot, Linear, LinkedIn Ads, Mailchimp, Mixpanel, Notion, Personio, Pinterest, Pipedrive, Salesforce, Shopify, Slack, Stripe, TikTok Ads, Zendesk, Zoom, etc.