DuckDB Integration
This guide covers the DuckDB and Hive-partitioned Parquet export paths for analytical workloads.
For destination-selection guidance across MongoDB, Neo4j, and Snowflake, see Export Destinations.
For record-to-DataFrame mapping internals, see Record Mapping Process. For workflow interaction patterns, see Workflow Interactions.
Installation
# Minimal DuckDB-only install
pip install 'imednet[duckdb]'
# Full analytical stack (includes pandas, pyarrow, sqlalchemy, openpyxl, duckdb)
pip install 'imednet[export]'
export_to_duckdb quickstart
Use imednet.integrations.export_to_duckdb() to export one study into one DuckDB
table and query it immediately:
import duckdb
from imednet import ImednetSDK
from imednet.integrations import export_to_duckdb
sdk = ImednetSDK(...)
export_to_duckdb(sdk, "MY_STUDY", "./clinical.duckdb", "study_records")
conn = duckdb.connect("./clinical.duckdb", read_only=True)
print(conn.execute("SELECT COUNT(*) FROM study_records").fetchone())
export_to_duckdb_by_form
Use imednet.integrations.export_to_duckdb_by_form() when you want one table per
form key:
import duckdb
from imednet import ImednetSDK
from imednet.integrations import export_to_duckdb_by_form
sdk = ImednetSDK(...)
export_to_duckdb_by_form(sdk, "MY_STUDY", "./clinical_by_form.duckdb")
conn = duckdb.connect("./clinical_by_form.duckdb", read_only=True)
print(conn.execute("SHOW TABLES").fetchall())
Hive-partitioned Parquet for multi-study workloads
Use imednet.integrations.export_to_hive_parquet() when many studies are being
extracted concurrently and you want each study/form stored independently on disk. In
that scenario, partitioned Parquet avoids a single shared database file lock boundary
while preserving direct queryability from DuckDB.
Example layout:
lake/
├── study_key=STUDY_A/
│ ├── form_key=DEMOGRAPHICS/
│ │ └── records.parquet
│ └── form_key=LABS/
│ └── records.parquet
└── study_key=STUDY_B/
└── form_key=DEMOGRAPHICS/
└── records.parquet
DuckDB query with Hive partition discovery and schema union:
import duckdb
conn = duckdb.connect()
rows = conn.execute(
"""
SELECT study_key, form_key, COUNT(*)
FROM read_parquet('./lake/**/*.parquet', hive_partitioning = true, union_by_name = true)
GROUP BY 1, 2
ORDER BY 1, 2
"""
).fetchall()
print(rows)
You can also generate the query string via
imednet.integrations.hive_parquet_query():
import duckdb
from imednet.integrations import hive_parquet_query
conn = duckdb.connect()
print(conn.execute(hive_parquet_query("./lake")).fetchdf())
DuckDBIngestionWorkflow
For incremental bronze/silver ingestion, use
imednet_workflows.DuckDBIngestionWorkflow.
import duckdb
from imednet import ImednetSDK
from imednet_workflows import DuckDBIngestionWorkflow
sdk = ImednetSDK(...)
workflow = DuckDBIngestionWorkflow(sdk, "./centralized.duckdb")
# Incremental bronze load
inserted = workflow.ingest_revisions(
"MY_STUDY",
start_date="2026-01-01",
end_date="2026-01-31",
mode="append",
)
print(f"Inserted {inserted} revision rows")
# Build silver current-state projection
workflow.build_silver_view("MY_STUDY")
conn = duckdb.connect("./centralized.duckdb", read_only=True)
print(
conn.execute(
"SELECT record_id, variable_name, value "
"FROM silver_current_state ORDER BY record_id, variable_name LIMIT 10"
).fetchall()
)
The two key methods are
imednet_workflows.duckdb_centralizer.DuckDBIngestionWorkflow.ingest_revisions()
for incremental loads and
imednet_workflows.duckdb_centralizer.DuckDBIngestionWorkflow.build_silver_view()
for current-state materialization.
CLI duckdb subcommand reference
Inspect the full command and option reference:
imednet export duckdb --help
Annotated examples:
# Export only selected variables to a single DuckDB table
imednet export duckdb MY_STUDY study_records ./clinical.duckdb --vars AGE,SEX,WEIGHT
# Export only selected forms (comma-separated numeric form IDs)
imednet export duckdb MY_STUDY study_records ./clinical.duckdb --forms 10,20,30
# Use variable labels as column names instead of variable names
imednet export duckdb MY_STUDY study_records ./clinical.duckdb --use-labels
Performance notes
The DuckDB path behind imednet.integrations.export_to_duckdb() and
imednet.integrations.export_to_duckdb_by_form() registers a DataFrame directly
in DuckDB before a CREATE TABLE AS SELECT operation. Compared to row-by-row insert
patterns often used with SQLAlchemy targets, this is typically much lower overhead for
large exports.
As an illustrative estimate (not a guaranteed benchmark), teams often observe whole-table loads finishing in seconds to minutes where row-wise insert flows can take materially longer for the same dataset shape.
Hive-partitioned Parquet via imednet.integrations.export_to_hive_parquet() also
avoids SQLite’s 2000-column table ceiling by storing each form independently and allowing
federated reads through DuckDB.
Out of scope
This guide does not cover MotherDuck/cloud DuckDB or multi-process orchestration patterns.