Execution¶
Aqueducts supports both local and remote pipeline execution. Use the CLI for local execution or submit jobs to remote executors running in your infrastructure.
Local Execution¶
Execute pipelines directly on your local machine using the Aqueducts CLI.
Basic Usage¶
# Run a pipeline
aqueducts run --file pipeline.yml
# Run with a parameter
aqueducts run --file pipeline.yml --param key1=value1
# Pass multiple parameters by repeating the flag
aqueducts run --file pipeline.yml --param key1=value1 --param key2=value2
Supported Formats¶
The CLI supports multiple pipeline definition formats:
Format | Support
---|---
YAML | Supported by default
JSON | Requires feature flag
TOML | Requires feature flag
JSON and TOML support require appropriate feature flags during CLI installation.
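A hedged sketch of enabling the additional formats at install time; the feature names `json` and `toml` are assumptions, so check the crate's feature list before installing:

```shell
# Hypothetical: enable JSON and TOML pipeline definitions at install time.
# The feature names "json" and "toml" are assumptions, not confirmed flags.
cargo install aqueducts-cli --features json,toml
```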
Local Execution Benefits¶
- Direct access to local files and databases
- No network overhead for file-based operations
- Immediate feedback and debugging
- Full control over execution environment
Remote Execution¶
Execute pipelines on remote executors deployed within your infrastructure, closer to your data sources.
Architecture Overview¶
Remote execution follows this flow:
- CLI Client submits pipeline to Remote Executor
- Remote Executor processes data from Data Sources
- Remote Executor writes results to Destinations
- Remote Executor sends status updates back to CLI Client
Setting Up Remote Execution¶
1. Deploy an Executor within your infrastructure.
2. Submit the pipeline from the CLI.
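A minimal sketch of the two steps, assuming a locally installed executor binary; the `--executor` and `--api-key` flags on `aqueducts run` are assumptions inferred from the cancel example shown below:

```shell
# Step 1: start an executor on a host inside your infrastructure
# (see the "Aqueducts Executor" section for Docker-based deployment)
aqueducts-executor --api-key your_secret_key --port 3031 --max-memory 4

# Step 2: submit a pipeline from your workstation to that executor
aqueducts run --file pipeline.yml \
  --executor executor-host:3031 \
  --api-key your_secret_key
```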
Remote Execution Benefits¶
- Minimize network transfer by processing data close to sources
- Scale processing power independently of client machines
- Secure execution within your infrastructure boundaries
- Centralized resource management with memory limits
Managing Remote Executions¶
Monitor and control remote pipeline executions:
# Check executor status
curl http://executor-host:3031/api/health
# Cancel a running pipeline
aqueducts cancel --executor executor-host:3031 \
  --api-key your_secret_key \
  --execution-id abc-123
Aqueducts Executor¶
The Aqueducts Executor is a deployable application for running pipelines within your infrastructure.
Key Features¶
- Remote Execution: Run data pipelines securely within your infrastructure
- Memory Management: Configure maximum memory usage with DataFusion's memory pool
- Real-time Feedback: WebSocket communication for live progress updates
- Cloud Storage Support: Native S3, GCS, and Azure Blob Storage integration
- Database Connectivity: ODBC support for various database systems
- Exclusive Execution: Single-pipeline execution for optimal resource utilization
Docker Deployment (Recommended)¶
The Docker image includes ODBC support with PostgreSQL drivers pre-installed:
# Pull from GitHub Container Registry
docker pull ghcr.io/vigimite/aqueducts/aqueducts-executor:latest
# Run with command line arguments
docker run -d \
  --name aqueducts-executor \
  -p 3031:3031 \
  ghcr.io/vigimite/aqueducts/aqueducts-executor:latest \
  --api-key your_secret_key --max-memory 4
Environment Variables¶
Configure the executor using environment variables:
docker run -d \
  --name aqueducts-executor \
  -p 3031:3031 \
  -e AQUEDUCTS_API_KEY=your_secret_key \
  -e AQUEDUCTS_HOST=0.0.0.0 \
  -e AQUEDUCTS_PORT=3031 \
  -e AQUEDUCTS_MAX_MEMORY=4 \
  -e AQUEDUCTS_LOG_LEVEL=info \
  ghcr.io/vigimite/aqueducts/aqueducts-executor:latest
Configuration Options¶
Option | Description | Default | Environment Variable
---|---|---|---
--api-key | API key for authentication | - | AQUEDUCTS_API_KEY
--host | Host address to bind to | 0.0.0.0 | AQUEDUCTS_HOST
--port | Port to listen on | 8080 | AQUEDUCTS_PORT
--max-memory | Maximum memory usage in GB (0 for unlimited) | 0 | AQUEDUCTS_MAX_MEMORY
--server-url | URL of Aqueducts server for registration (optional) | - | AQUEDUCTS_SERVER_URL
--executor-id | Unique identifier for this executor | auto-generated | AQUEDUCTS_EXECUTOR_ID
--log-level | Logging level (info, debug, trace) | info | AQUEDUCTS_LOG_LEVEL
Docker Compose Setup¶
For local development and testing:
# Start database only (default)
docker-compose up
# Start database + executor
docker-compose --profile executor up
# Build and start from source
docker-compose --profile executor up --build
The executor will be available at:
- API: http://localhost:3031
- Health check: http://localhost:3031/api/health
- WebSocket: ws://localhost:3031/ws/connect
Manual Installation¶
Install using Cargo for custom deployments:
# Standard installation with cloud storage features
cargo install aqueducts-executor
# Installation with ODBC support
cargo install aqueducts-executor --features odbc
API Endpoints¶
Endpoint | Method | Auth | Description
---|---|---|---
/api/health | GET | No | Basic health check
/ws/connect | GET | Yes | WebSocket endpoint for bidirectional communication
ODBC Configuration¶
For database connectivity, ODBC support requires the odbc feature flag during installation and proper system configuration.
Installation Requirements¶
First, install Aqueducts with ODBC support:
# CLI with ODBC support
cargo install aqueducts-cli --features odbc
# Executor with ODBC support
cargo install aqueducts-executor --features odbc
System Dependencies¶
# Debian/Ubuntu: install UnixODBC development libraries
sudo apt-get update
sudo apt-get install unixodbc-dev
# PostgreSQL driver
sudo apt-get install odbc-postgresql
# MySQL driver
sudo apt-get install libmyodbc
# SQL Server driver (optional)
curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
curl https://packages.microsoft.com/config/ubuntu/20.04/prod.list | sudo tee /etc/apt/sources.list.d/msprod.list
sudo apt-get update
sudo apt-get install msodbcsql17
# Fedora/RHEL: install UnixODBC development libraries
sudo dnf install unixODBC-devel
# PostgreSQL driver
sudo dnf install postgresql-odbc
# MySQL driver
sudo dnf install mysql-connector-odbc
# SQL Server driver (optional)
sudo curl -o /etc/yum.repos.d/msprod.repo https://packages.microsoft.com/config/rhel/8/prod.repo
sudo dnf install msodbcsql17
Driver Configuration¶
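Drivers are registered in /etc/odbcinst.ini. A typical entry, assuming the Debian/Ubuntu driver paths (adjust the Driver path to the actual .so location on your system):

```ini
[PostgreSQL]
Description = PostgreSQL ODBC driver
Driver      = /usr/lib/x86_64-linux-gnu/odbc/psqlodbcw.so

[MySQL]
Description = MySQL ODBC driver
Driver      = /usr/lib/x86_64-linux-gnu/odbc/libmyodbc8w.so
```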
Data Source Configuration¶
Configure database connections in /etc/odbc.ini (system-wide) or ~/.odbc.ini (per-user).
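A sketch of an ~/.odbc.ini defining the PostgreSQL-Local and MySQL-Local DSNs used in the connection string examples; server, port, and database values are placeholders:

```ini
[PostgreSQL-Local]
Driver   = PostgreSQL
Server   = localhost
Port     = 5432
Database = mydb

[MySQL-Local]
Driver   = MySQL
Server   = localhost
Port     = 3306
Database = mydb
```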
Connection String Examples¶
# PostgreSQL via DSN
sources:
  - type: odbc
    name: postgres_data
    connection_string: "DSN=PostgreSQL-Local"
    load_query: "SELECT * FROM users WHERE created_at > '2024-01-01'"

# PostgreSQL via direct connection string
sources:
  - type: odbc
    name: postgres_data
    connection_string: "Driver={PostgreSQL};Server=localhost;Database=mydb;UID=user;PWD=pass;"
    load_query: "SELECT * FROM users LIMIT 1000"

# MySQL via DSN
sources:
  - type: odbc
    name: mysql_data
    connection_string: "DSN=MySQL-Local"
    load_query: "SELECT * FROM products WHERE price > 100"

# MySQL via direct connection string
sources:
  - type: odbc
    name: mysql_data
    connection_string: "Driver={MySQL};Server=localhost;Database=mydb;User=user;Password=pass;"
    load_query: "SELECT * FROM orders WHERE date >= '2024-01-01'"
Testing Your Setup¶
1. Test ODBC Installation¶
# List registered ODBC drivers
odbcinst -q -d
# Show ODBC configuration file locations
odbcinst -j
2. Test Database Connection¶
# Test with isql (interactive SQL)
isql -v PostgreSQL-Local username password
# Test MySQL connection
isql -v MySQL-Local username password
3. Test with Aqueducts¶
Create a minimal test pipeline:
# yaml-language-server: $schema=https://raw.githubusercontent.com/vigimite/aqueducts/main/json_schema/aqueducts.schema.json
version: "v2"

sources:
  - type: odbc
    name: test_connection
    connection_string: "DSN=PostgreSQL-Local"
    load_query: "SELECT 1 as test_column"

stages:
  - - name: verify
      query: "SELECT * FROM test_connection"
      show: 1
Run the test (assuming the pipeline above was saved as test_pipeline.yml):
aqueducts run --file test_pipeline.yml
Common Driver Paths¶
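Driver library locations vary by distribution; the paths below are typical defaults and may differ on your system (verify with odbcinst -q -d):

```
# Debian/Ubuntu
/usr/lib/x86_64-linux-gnu/odbc/psqlodbcw.so    # PostgreSQL
/usr/lib/x86_64-linux-gnu/odbc/libmyodbc8w.so  # MySQL

# Fedora/RHEL
/usr/lib64/psqlodbcw.so      # PostgreSQL
/usr/lib64/libmyodbc8w.so    # MySQL
```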
Performance Considerations¶
Optimization Tips
- Limit query results: Use LIMIT clauses to avoid memory issues
- Filter early: Apply WHERE conditions in your load_query
- Use indexes: Ensure your database queries use appropriate indexes
- Memory management: Set executor --max-memory limits appropriately
ODBC Troubleshooting¶
Driver Loading Issues¶
# Check if drivers are registered
odbcinst -q -d
# Test driver loading
ldd /path/to/driver.so # Linux
otool -L /path/to/driver.so # macOS
Connection Issues¶
# Enable ODBC tracing for debugging
export ODBCSYSINI=/tmp
export ODBCINSTINI=/etc/odbcinst.ini
export ODBCINI=/etc/odbc.ini
# Test with verbose output
isql -v DSN_NAME username password
Common Error Solutions¶
- Driver not found: Verify driver paths in odbcinst.ini
- DSN not found: Check odbc.ini configuration
- Permission denied: Ensure ODBC files are readable
- Library loading: Install missing system dependencies
Troubleshooting¶
Common Issues¶
Local Execution:
- Pipeline validation errors: Check YAML syntax and schema compliance
- Missing features: Ensure CLI was compiled with required feature flags
- File not found: Verify file paths and permissions
Remote Execution:
- Connection timeouts: Check network connectivity and firewall rules
- Authentication failures: Verify API key configuration
- Executor busy: Only one pipeline runs at a time per executor
- Memory errors: Increase --max-memory or optimize pipeline queries
ODBC Issues:
- Driver not found: Install database-specific ODBC drivers
- Connection failures: Verify DSN configuration in odbc.ini
Performance Optimization¶
Memory Management
- Set appropriate --max-memory limits for executors
- Break large queries into smaller stages
- Add filtering early in the pipeline
- Use partitioning for large datasets
Network Optimization
- Deploy executors close to data sources
- Use cloud storage in the same region as executors
- Minimize data movement between stages