DuckDB Integration

This guide covers the DuckDB and Hive-partitioned Parquet export paths for analytical workloads.

For destination-selection guidance across MongoDB, Neo4j, and Snowflake, see Export Destinations.

For record-to-DataFrame mapping internals, see Record Mapping Process. For workflow interaction patterns, see Workflow Interactions.

Installation

# Minimal DuckDB-only install
pip install 'imednet[duckdb]'

# Full analytical stack (includes pandas, pyarrow, sqlalchemy, openpyxl, duckdb)
pip install 'imednet[export]'

export_to_duckdb quickstart

Use imednet.integrations.export_to_duckdb() to export one study into one DuckDB table and query it immediately:

import duckdb
from imednet import ImednetSDK
from imednet.integrations import export_to_duckdb

sdk = ImednetSDK(...)
export_to_duckdb(sdk, "MY_STUDY", "./clinical.duckdb", "study_records")

conn = duckdb.connect("./clinical.duckdb", read_only=True)
print(conn.execute("SELECT COUNT(*) FROM study_records").fetchone())

export_to_duckdb_by_form

Use imednet.integrations.export_to_duckdb_by_form() when you want one table per form key:

import duckdb
from imednet import ImednetSDK
from imednet.integrations import export_to_duckdb_by_form

sdk = ImednetSDK(...)
export_to_duckdb_by_form(sdk, "MY_STUDY", "./clinical_by_form.duckdb")

conn = duckdb.connect("./clinical_by_form.duckdb", read_only=True)
print(conn.execute("SHOW TABLES").fetchall())

Hive-partitioned Parquet for multi-study workloads

Use imednet.integrations.export_to_hive_parquet() when many studies are being extracted concurrently and you want each study/form stored independently on disk. In that scenario, partitioned Parquet avoids a single shared database file lock boundary while preserving direct queryability from DuckDB.

Example layout:

lake/
├── study_key=STUDY_A/
│   ├── form_key=DEMOGRAPHICS/
│   │   └── records.parquet
│   └── form_key=LABS/
│       └── records.parquet
└── study_key=STUDY_B/
    └── form_key=DEMOGRAPHICS/
        └── records.parquet

DuckDB query with Hive partition discovery and schema union:

import duckdb

conn = duckdb.connect()
rows = conn.execute(
    """
    SELECT study_key, form_key, COUNT(*)
    FROM read_parquet('./lake/**/*.parquet', hive_partitioning = true, union_by_name = true)
    GROUP BY 1, 2
    ORDER BY 1, 2
    """
).fetchall()
print(rows)

You can also generate the query string via imednet.integrations.hive_parquet_query():

import duckdb
from imednet.integrations import hive_parquet_query

conn = duckdb.connect()
print(conn.execute(hive_parquet_query("./lake")).fetchdf())

DuckDBIngestionWorkflow

For incremental bronze/silver ingestion, use imednet_workflows.DuckDBIngestionWorkflow.

import duckdb
from imednet import ImednetSDK
from imednet_workflows import DuckDBIngestionWorkflow

sdk = ImednetSDK(...)
workflow = DuckDBIngestionWorkflow(sdk, "./centralized.duckdb")

# Incremental bronze load
inserted = workflow.ingest_revisions(
    "MY_STUDY",
    start_date="2026-01-01",
    end_date="2026-01-31",
    mode="append",
)
print(f"Inserted {inserted} revision rows")

# Build silver current-state projection
workflow.build_silver_view("MY_STUDY")

conn = duckdb.connect("./centralized.duckdb", read_only=True)
print(
    conn.execute(
        "SELECT record_id, variable_name, value "
        "FROM silver_current_state ORDER BY record_id, variable_name LIMIT 10"
    ).fetchall()
)

The two key methods are imednet_workflows.duckdb_centralizer.DuckDBIngestionWorkflow.ingest_revisions() for incremental loads and imednet_workflows.duckdb_centralizer.DuckDBIngestionWorkflow.build_silver_view() for current-state materialization.

CLI duckdb subcommand reference

Inspect the full command and option reference:

imednet export duckdb --help

Annotated examples:

# Export only selected variables to a single DuckDB table
imednet export duckdb MY_STUDY study_records ./clinical.duckdb --vars AGE,SEX,WEIGHT

# Export only selected forms (comma-separated numeric form IDs)
imednet export duckdb MY_STUDY study_records ./clinical.duckdb --forms 10,20,30

# Use variable labels as column names instead of variable names
imednet export duckdb MY_STUDY study_records ./clinical.duckdb --use-labels

Performance notes

The DuckDB path behind imednet.integrations.export_to_duckdb() and imednet.integrations.export_to_duckdb_by_form() registers a DataFrame directly in DuckDB before a CREATE TABLE AS SELECT operation. Compared to row-by-row insert patterns often used with SQLAlchemy targets, this is typically much lower overhead for large exports.

As an illustrative estimate (not a guaranteed benchmark), teams often observe whole-table loads finishing in seconds to minutes where row-wise insert flows can take materially longer for the same dataset shape.

Hive-partitioned Parquet via imednet.integrations.export_to_hive_parquet() also avoids SQLite’s 2000-column table ceiling by storing each form independently and allowing federated reads through DuckDB.

Out of scope

This guide does not cover MotherDuck/cloud DuckDB or multi-process orchestration patterns.