(managed-etl)=

# Managed ETL

:::{include} /_snippet/links.md
:::

:::{div} sd-text-muted
CrateDB Cloud offers a managed ETL subsystem.
:::

:::{div}
Using the [Import API] or the [Export API] of [CrateDB Cloud], you can import
or export data from/into files, or load data, also continuously using [change
data capture], from databases and streaming sources.

- File formats: CSV, JSON, Parquet
- Databases: DynamoDB, MongoDB; including CDC

This walkthrough describes how to orchestrate managed ETL jobs using either the
command-line, or the Python API. Alternatively, if you prefer an interactive
user interface, please use the web-based [CrateDB Cloud Console].
:::

:::{include} /_snippet/install-ctk.md
:::

:::{include} /_snippet/cloud-prerequisites.md
:::

## Usage

You can run jobs from the command-line or by using the Python API.

### CLI

Use the `ctk load` command to load data into database tables.
By default, the table name is derived from the filename.
Use the `--table` option to specify a custom name.
```shell
ctk load 'https://cdn.crate.io/downloads/datasets/cratedb-datasets/cloud-tutorials/data_weather.csv.gz'
ctk load 'https://cdn.crate.io/downloads/datasets/cratedb-datasets/cloud-tutorials/data_marketing.json.gz'
ctk load 'https://cdn.crate.io/downloads/datasets/cratedb-datasets/timeseries/yc.2019.07-tiny.parquet.gz' --table='yc-2019-07-tiny'
```

Use the `ctk shell` command to query and aggregate data using SQL.
```shell
ctk shell --command="SELECT * FROM data_weather LIMIT 10;"
ctk shell --command="SELECT * FROM data_weather LIMIT 10;" --format=csv
ctk shell --command="SELECT * FROM data_weather LIMIT 10;" --format=json
```
:::{note}
`ctk shell` effectively just invokes the [CrateDB Shell], with the additional
benefit of transparently authenticating users from the existing session
established by CTK/croud. Otherwise, you would need to provide authentication
credentials separately.
:::

### Python API

Use the Python API to import and query data.

```python
# Import API classes.
from cratedb_toolkit import InputOutputResource, ManagedCluster

# Define data source.
url = "https://cdn.crate.io/downloads/datasets/cratedb-datasets/machine-learning/timeseries/nab-machine-failure.csv"
source = InputOutputResource(url=url)

# Connect to CrateDB Cloud.
with ManagedCluster.from_env() as cluster:

    # Invoke the import job.
    cluster.load_table(source=source)

    # Query imported data.
    results = cluster.query('SELECT * FROM "nab-machine-failure" LIMIT 10;')
    
    # Display data.
    from pprint import pprint
    pprint(results)
```


[change data capture]: https://en.wikipedia.org/wiki/Change_data_capture
[CrateDB Shell]: https://cratedb.com/docs/crate/crash/
[Export API]: https://cratedb.com/docs/cloud/en/latest/cluster/export.html
[Import API]: https://cratedb.com/docs/cloud/en/latest/cluster/import.html