Architecture
An aqueduct is a pipeline definition and consists of 3 main parts
- Source -> the source data for this pipeline
- Stage -> transformations applied within this pipeline
- Destination -> output of the pipeline result
Source
An Aqueduct source can be:
- CSV or Parquet file(s)
- single file
- directory
- Delta table
- ODBC query (EXPERIMENTAL)
For file based sources a schema can be provided optionally.
The source is registered within the SessionContext
as a table that can be referenced using the sources configured name. A prerequisite here is that the necessary features for the underlying object stores are enabled.
This can be provided by an external SessionContext
passed into the run_pipeline
function or by registering the correct handlers for deltalake.
EXPERIMENTAL ODBC support
As an experimental feature it is possible to query various databases using ODBC. This is enabled through arrow-odbc.
Besides enabling the odbc
feature flag in your Cargo.toml
there are some other prerequisites for the executing system:
unixodbc
on unix based systems- ODBC driver for the database you want to access like ODBC Driver for SQL server or psqlodbc
- registering the driver in the ODBC manager configuration (usually located in
/etc/odbcinst.ini
)
If you have issues setting this up there are many resources online explaining how to set this up, it is a bit of a hassle.
Stage
An Aqueduct stage defines a transformation using SQL. Each stage has access to all defined sources and to every previously executed stage within the SQL context using the respectively configured names. Once executed the stage will then persist its result into the SQL context making it accessible to downstream consumers.
The stage can be set to print the result and/or the result schema to the stdout
. This is useful for development/debugging purposes.
Nested stages are executed in parallel
Destination
An Aqueduct destination can be:
- CSV or Parquet file(s)
- single file
- directory
- Delta table
- ODBC query (NOT IMPLEMENTED YET)
An Aqueduct destination is the target for the execution of the pipeline, the result of the final stage that was executed is used as the input for the destination to write the data to the underlying table/file.
File based destinations
File based destinations have support for HDFS style partitioning (output/location=1/...
) and can be set to output only a single file or multiple files based on the configuration.
Delta Table destination
For a DeltaTable there is some additional logic that is utilized to maintain the table integrity.
The destination will first cast and validate the schema of the input data and then use one of 3 configurable modes to write the data:
- Append -> appends the data to the destination
- Upsert -> merges the data to the destination, using the provided configuration for this mode to identify cohort columns that are used to determine which data should be updated
- provided merge columns are used to check equality e.g.
vec!["date", "country"]
-> update data whereold.date = new.date AND old.country = new.country
- Replace -> replaces the data using a configurable predicate to determine which data should be replaced by the operation
- provided replacement conditions are used to check equality e.g.
ReplacementCondition { column: "date", value: "1970-01-01" }
-> replace data whereold.date = '1970-01-01'