DuckDB Integration ================== This guide covers the DuckDB and Hive-partitioned Parquet export paths for analytical workloads. For destination-selection guidance across MongoDB, Neo4j, and Snowflake, see :doc:`export_destinations`. For record-to-DataFrame mapping internals, see :doc:`record_mapping`. For workflow interaction patterns, see :doc:`workflow_interactions`. Installation ------------ .. code-block:: bash # Minimal DuckDB-only install pip install 'imednet[duckdb]' # Full analytical stack (includes pandas, pyarrow, sqlalchemy, openpyxl, duckdb) pip install 'imednet[export]' `export_to_duckdb` quickstart ----------------------------- Use :func:`imednet.integrations.export_to_duckdb` to export one study into one DuckDB table and query it immediately: .. code-block:: python import duckdb from imednet import ImednetSDK from imednet.integrations import export_to_duckdb sdk = ImednetSDK(...) export_to_duckdb(sdk, "MY_STUDY", "./clinical.duckdb", "study_records") conn = duckdb.connect("./clinical.duckdb", read_only=True) print(conn.execute("SELECT COUNT(*) FROM study_records").fetchone()) `export_to_duckdb_by_form` -------------------------- Use :func:`imednet.integrations.export_to_duckdb_by_form` when you want one table per form key: .. code-block:: python import duckdb from imednet import ImednetSDK from imednet.integrations import export_to_duckdb_by_form sdk = ImednetSDK(...) export_to_duckdb_by_form(sdk, "MY_STUDY", "./clinical_by_form.duckdb") conn = duckdb.connect("./clinical_by_form.duckdb", read_only=True) print(conn.execute("SHOW TABLES").fetchall()) Hive-partitioned Parquet for multi-study workloads --------------------------------------------------- Use :func:`imednet.integrations.export_to_hive_parquet` when many studies are being extracted concurrently and you want each study/form stored independently on disk. In that scenario, partitioned Parquet avoids a single shared database file lock boundary while preserving direct queryability from DuckDB. Example layout: .. code-block:: text lake/ ├── study_key=STUDY_A/ │ ├── form_key=DEMOGRAPHICS/ │ │ └── records.parquet │ └── form_key=LABS/ │ └── records.parquet └── study_key=STUDY_B/ └── form_key=DEMOGRAPHICS/ └── records.parquet DuckDB query with Hive partition discovery and schema union: .. code-block:: python import duckdb conn = duckdb.connect() rows = conn.execute( """ SELECT study_key, form_key, COUNT(*) FROM read_parquet('./lake/**/*.parquet', hive_partitioning = true, union_by_name = true) GROUP BY 1, 2 ORDER BY 1, 2 """ ).fetchall() print(rows) You can also generate the query string via :func:`imednet.integrations.hive_parquet_query`: .. code-block:: python import duckdb from imednet.integrations import hive_parquet_query conn = duckdb.connect() print(conn.execute(hive_parquet_query("./lake")).fetchdf()) `DuckDBIngestionWorkflow` ------------------------- For incremental bronze/silver ingestion, use :class:`imednet_workflows.DuckDBIngestionWorkflow`. .. code-block:: python import duckdb from imednet import ImednetSDK from imednet_workflows import DuckDBIngestionWorkflow sdk = ImednetSDK(...) workflow = DuckDBIngestionWorkflow(sdk, "./centralized.duckdb") # Incremental bronze load inserted = workflow.ingest_revisions( "MY_STUDY", start_date="2026-01-01", end_date="2026-01-31", mode="append", ) print(f"Inserted {inserted} revision rows") # Build silver current-state projection workflow.build_silver_view("MY_STUDY") conn = duckdb.connect("./centralized.duckdb", read_only=True) print( conn.execute( "SELECT record_id, variable_name, value " "FROM silver_current_state ORDER BY record_id, variable_name LIMIT 10" ).fetchall() ) The two key methods are :meth:`imednet_workflows.duckdb_centralizer.DuckDBIngestionWorkflow.ingest_revisions` for incremental loads and :meth:`imednet_workflows.duckdb_centralizer.DuckDBIngestionWorkflow.build_silver_view` for current-state materialization. CLI `duckdb` subcommand reference --------------------------------- Inspect the full command and option reference: .. code-block:: bash imednet export duckdb --help Annotated examples: .. code-block:: bash # Export only selected variables to a single DuckDB table imednet export duckdb MY_STUDY study_records ./clinical.duckdb --vars AGE,SEX,WEIGHT # Export only selected forms (comma-separated numeric form IDs) imednet export duckdb MY_STUDY study_records ./clinical.duckdb --forms 10,20,30 # Use variable labels as column names instead of variable names imednet export duckdb MY_STUDY study_records ./clinical.duckdb --use-labels Performance notes ----------------- The DuckDB path behind :func:`imednet.integrations.export_to_duckdb` and :func:`imednet.integrations.export_to_duckdb_by_form` registers a DataFrame directly in DuckDB before a ``CREATE TABLE AS SELECT`` operation. Compared to row-by-row insert patterns often used with SQLAlchemy targets, this is typically much lower overhead for large exports. As an illustrative estimate (not a guaranteed benchmark), teams often observe whole-table loads finishing in seconds to minutes where row-wise insert flows can take materially longer for the same dataset shape. Hive-partitioned Parquet via :func:`imednet.integrations.export_to_hive_parquet` also avoids SQLite's 2000-column table ceiling by storing each form independently and allowing federated reads through DuckDB. Out of scope ------------ This guide does not cover MotherDuck/cloud DuckDB or multi-process orchestration patterns.