CrateDB I/O Subsystem¶

Import and export data into/from CrateDB.

About¶

A versatile data I/O framework and command-line application to copy data between any source and any destination. It supports many data sources, destinations, and data loading strategies out of the box.

Adapters for CrateDB let you migrate data from any proprietary enterprise data warehouse or database to CrateDB or CrateDB Cloud, to consolidate infrastructure and save operational costs.

The polyglot pipeline subsystem covers data transfer from and to AWS DMS, Databricks, DuckDB, DynamoDB, InfluxDB, MongoDB, MongoDB Atlas, MotherDuck, PostgreSQL, and many more streaming sources, databases, and data platforms or services with CrateDB and CrateDB Cloud. For a full list of integrations, see I/O adapter coverage.

Synopsis¶

You can run jobs from the command-line or by using the Python API.

CLI¶

The CLI entrypoints to the I/O subsystem are the ctk load and ctk save commands.

Load data from external resource into CrateDB.

ctk load \
  'protocol://username:password@hostname:port/resource' \
  'crate://crate:crate@cratedb.example.org:4200/schema/table'

Save data from CrateDB to external resource.

ctk save \
  'crate://crate:crate@cratedb.example.org:4200/schema/table' \
  'protocol://username:password@hostname:port/resource'

Python API¶

Alternatively, use the Python API to import or export data.

from cratedb_toolkit import DatabaseCluster, InputOutputResource

# Connect to CrateDB database cluster.
with DatabaseCluster.from_params(cluster_url="crate://crate:crate@cratedb.example.org:4200/schema/table") as cluster:

    # Load data from external resource into CrateDB.
    cluster.load_table(source=InputOutputResource(url="protocol://username:password@hostname:port/resource"))

    # Save data from CrateDB to external resource.
    cluster.save_table(target=InputOutputResource(url="protocol://username:password@hostname:port/resource"))

Installation¶

CrateDB Toolkit uses the Python programming language with native performance extensions, which is installed on most machines today. Otherwise, we recommend to download and install Python from the original source. For installing additional Python packages, we recommend to install the uv package manager.

uv tool install --upgrade 'cratedb-toolkit'

An alternative way to install Python packages is to use pipx or pip install --user.

pipx install 'cratedb-toolkit'

Another way to invoke CrateDB Toolkit without installing it is to use its container image with Docker, Podman, Kubernetes, and friends.

docker run --rm ghcr.io/crate/cratedb-toolkit ctk

Special considerations

Individual I/O adapters need different sets of dependency packages, please consult relevant installation notes in the corresponding documentation sections. Support for I/O adapter types is currently divided into two families defined by Python package extras io-curated and io-ingest, which are mutually exclusive to each other.

Support for files, open table formats, InfluxDB, and MongoDB.

uv tool install --upgrade 'cratedb-toolkit[io-curated]'

Support for other databases, streams, platforms, and services.

uv tool install --upgrade 'cratedb-toolkit[io-ingest]'

Alternatively, use Docker or Podman to invoke the container image.

docker run --rm ghcr.io/crate/cratedb-toolkit ctk --version

docker run --rm ghcr.io/crate/cratedb-toolkit-ingest ctk --version

General notes¶

URLs everywhere

The I/O subsystem uses URLs across the board to address data sources and sinks.

Authentication

External: Different data pipeline elements use their specific way to configure access credentials or tokens, using individual parameters.

CrateDB: CrateDB as a pipeline source or sink element uses the same way to configure credentials across the board. Please note you must specify a password. If your account does not use a password, use a random string or just na.

Address arbitrary resources

The resource address will be picked from the resource locator URL path /resource, which has different semantics based on the adapter type. It can be a table name, a bucket name and object path, or anything else that identifies an URL-based resource uniquely within the namespace of the base URL.

Some adapter types also accept the ?table= URL query parameter that can optionally encode two components separated by a dot, like database.table or database.collection. Others encode the database name into the hostname fragment of the URL.

Please consult individual adapter documentation pages to learn about available URL parameters and differences.

Address CrateDB schema and table

CrateDB schema and table names will be picked from the resource locator URL path /schema/table.

When addressing CrateDB as a data sink, and omitting those parameters, the target table address will be derived from the address of the data source. When addressing CrateDB as a data source, the source table parameter is obligatory.

If you would like to specify the table name differently, use the ?table= URL query parameter, the --table command line option, or the CRATEDB_TABLE environment variable.

If you want to target a different database schema, use the ?schema= URL query parameter, the --schema command line option, or the CRATEDB_SCHEMA environment variable. If this parameter is not defined, CrateDB’s default schema doc will be used.

Connect to CrateDB using SSL

Use the ?ssl=true query parameter, and replace username, password, and hostname with values matching your environment. Also use this variant to connect to CrateDB Cloud.

'crate://crate:crate@cratedb.example.org:4200/schema/table?ssl=true'

Transfer multiple resources

Currently, the pipeline system can transfer single resources / tables with most of the I/O adapter types, and multiple resources / catalogs / collections with some adapter types. A few file-based adapters provide file globbing, and the MongoDB I/O adapter permits transfer of whole MongoDB databases, including multiple collections.

This detail (resource globbing and selection) will be improved in future iterations across the board. In the meanwhile, please iterate all sibling resources in a loop where multi-resource selection is not possible yet, i.e. transfer table by table.

Incremental loading

Adapters of the io-ingest family support incremental loading, which means you can choose to append, merge, or delete+insert data into the destination table using different strategies.

Incremental loading allows you to ingest only the new rows from the source table into the destination table, which means that you do not have to load the entire table every time you run the data migration procedure.

This comes at a minor cost that a few bookkeeping columns exist in the target table, however that is rarely an issue.

Coverage¶

Supported data formats, database types, data platforms, analytics engines, and other services.

File formats:: CSV, JSONL/NDJSON, Parquet
Open table formats:: Apache Iceberg, DeltaLake
Cloud storage:: Amazon S3, Azure Cloud Storage, Google Cloud Storage (GCS)
Databases:: Actian Data Platform, Actian X, Amazon Athena, Amazon Redshift, Apache Drill, Apache Druid, Apache Hive and Presto, Apache Solr, Clickhouse, CockroachDB, CrateDB, Databend, Databricks, Denodo, DuckDB, EXASOL DB, Elasticsearch, Firebird, Firebolt, Google BigQuery, Google Sheets, Greenplum, HyperSQL (hsqldb), IBM DB2 and Informix, IBM Netezza Performance Server, Impala, Ingres, Kinetica, Microsoft Access, Microsoft SQL Server, MonetDB, MongoDB, MySQL and MariaDB, OpenGauss, OpenSearch, Oracle, PostgreSQL, Rockset, SAP ASE, SAP HANA, SAP Sybase SQL Anywhere, Snowflake, SQLite, Teradata Vantage, TiDB, Vector, YDB, YugabyteDB
Streams:: Amazon Kinesis, Apache Kafka (Amazon MSK, Confluent Kafka, Redpanda, RobustMQ, WarpStream)
Services:: Airtable, Asana, Facebook Ads, GitHub, Google Ads, Google Analytics, Google Sheets, Jira, HubSpot, Linear, LinkedIn Ads, Mailchimp, Mixpanel, Notion, Personio, Pinterest, Pipedrive, Salesforce, Shopify, Slack, Stripe, TikTok Ads, Zendesk, Zoom, etc.