# sqlmodel-pg-kit

Reusable SQLModel + PostgreSQL kit with src layout, sync/async engines, and generic CRUD repositories. Managed via uv (PEP 621/517) and built with hatchling. Includes minimal tests and a conda recipe.

## Features

- Config dataclass -> builds sync/async URLs
- Engine + Session factories (sync/async)
- Generic `Repository`/`AsyncRepository` with create/get/list/update/delete/bulk_insert
- Examples and notebooks for common SQLModel patterns
- PyPI/conda packaging setup, smoke test

## Quickstart (uv)

- Create venv: `uv venv`
- Editable install: `uv pip install -e .`
- Run tests: `uv pip install pytest && pytest -q`

## Local Postgres (Docker)

- Start the container: `docker compose up -d`
- Stop when done: `docker compose down`
- Default credentials match `DatabaseConfig`: user `appuser`, password `changeme`, database `appdb`
- Export `SQL_SSLMODE=disable` (the container does not enable TLS by default)
- Port: host 5433 (mapped to container port 5432)

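If you need to recreate the compose file, a minimal service matching these defaults would look roughly like the following sketch (the image tag and service name are assumptions, not taken from the repo):

```yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: appdb
    ports:
      - "5433:5432"  # host 5433 -> container 5432
```
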
## Usage

- Configure the environment (either `SQL_*` or Postgres `PG*` variables). Example for a container address:
  - `export SQL_HOST=192.168.64.8`
  - `export SQL_PORT=5432`
  - `export SQL_USER=postgres`
  - `export SQL_PASSWORD=change-me-strong`
  - `export SQL_DATABASE=appdb`
  - `export SQL_SSLMODE=disable`

You can perform CRUD with SQLModel without writing raw SQL strings. This kit exposes:

- `create_all()` to create tables from models collected in `SQLModel.metadata`
- Generic repositories: `Repository(Model)` and `AsyncRepository(Model)` for CRUD and bulk insert
- Session helpers: `get_session()` and `get_async_session()` for custom queries using SQLModel/SQLAlchemy expressions

### Choose a Backend: SQLite or Postgres

- SQLite (in-memory) quick demo (no environment variables needed):

```python
from typing import Optional

from sqlmodel import SQLModel, Field

from sqlmodel_pg_kit import create_all, db  # db exposes the engine factory

# Override the engine to SQLite in-memory
db.engine = db.create_engine("sqlite:///:memory:", echo=False)

class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str

create_all()  # creates tables on the current engine (SQLite)
```

- SQLite (file) quick demo:

```python
from sqlmodel_pg_kit import db, create_all

db.engine = db.create_engine("sqlite:///./demo.db", echo=False)
create_all()
```

- Postgres (recommended for production) via environment variables:

```bash
export SQL_HOST=127.0.0.1
export SQL_PORT=5432
export SQL_USER=postgres
export SQL_PASSWORD=change-me-strong
export SQL_DATABASE=appdb
export SQL_SSLMODE=disable  # or require/verify-full as needed
```

Then your Python code can simply call:

```python
from sqlmodel_pg_kit import create_all

create_all()  # uses the default Postgres engine constructed from env vars
```

- Postgres explicit configuration (no env vars needed):

```python
from sqlmodel_pg_kit import db, create_all

cfg = db.DatabaseConfig(
    host="127.0.0.1", port=5432, user="postgres",
    password="change-me-strong", database="appdb", sslmode="disable",
)
db.engine = db.create_engine(cfg.sync_url(), echo=False)
create_all()
```

### CSV → SQLModel → Table (Auto-generate Model)

Auto-generate a SQLModel class from a CSV header and import the rows.

- CLI example:

```bash
uv run python examples/06_csv_to_sqlmodel.py --csv ./data/molecules.csv --sqlite ./demo.db
# Or with Postgres via env vars:
# export SQL_HOST=...; export SQL_USER=...; ...
# uv run python examples/06_csv_to_sqlmodel.py --csv ./data/molecules.csv
```

- In code:

```python
from sqlmodel_pg_kit.csv_import import build_model_from_csv, insert_rows
from sqlmodel_pg_kit import create_all
from sqlmodel_pg_kit.db import get_session

spec, rows = build_model_from_csv("./data/molecules.csv", class_name="Molecules", table_name="molecules")
create_all()
with get_session() as s:
    n = insert_rows(spec.model, rows, s)
    print("inserted:", n)
```

Type inference rules per column: bool if all values are true/false/1/0/yes/no; else int if all values parse as integers; else float; otherwise str. Empty/NA/NaN/None/null values become NULL.

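A hypothetical sketch of that inference order (the real helper lives in `sqlmodel_pg_kit.csv_import` and may differ in detail):

```python
# NA-like tokens map to NULL; bool tokens are checked before int/float
NA_TOKENS = {"", "na", "nan", "none", "null"}
BOOL_TOKENS = {"true", "false", "1", "0", "yes", "no"}

def infer_column_type(values: list[str]) -> type:
    """Try bool, then int, then float; fall back to str. NA-like cells are skipped."""
    cells = [v.strip() for v in values if v.strip().lower() not in NA_TOKENS]
    if cells and all(c.lower() in BOOL_TOKENS for c in cells):
        return bool
    for candidate in (int, float):
        try:
            for c in cells:
                candidate(c)
            return candidate
        except ValueError:
            continue
    return str
```
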
### Sync CRUD (helpers)

```bash
uv run python examples/01_sync_crud.py
```

Shows create/get/list/update/delete using `Repository`. Minimal snippet:

```python
from typing import Optional

from sqlmodel import SQLModel, Field

from sqlmodel_pg_kit import create_all, Repository
from sqlmodel_pg_kit.db import get_session

class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str = Field(index=True)
    age: Optional[int] = None

create_all()
repo = Repository(Hero)
with get_session() as s:
    h = repo.create(s, {"name": "Alice", "age": 20})
    h2 = repo.get(s, h.id)
    page = repo.list(s, page=1, size=5)
    h3 = repo.update(s, h.id, age=21)
    ok = repo.delete(s, h.id)
```

For a container-backed workflow, run `uv run python examples/07_postgres_minimal.py` (paired with `notebooks/07_postgres_minimal.ipynb`). That walkthrough mirrors a REST-style request cycle:

- wipes existing demo rows so each run is deterministic
- upserts seed rows via `Repository.bulk_insert`
- paginates filtered results with `Repository.list(... where=..., order_by=..., page=..., size=...)`
- performs fuzzy search using `select(...).where(Model.name.ilike(...))`
- counts inventory with `session.exec(select(func.count(...))).scalar()` (compatible with SQLAlchemy 1.4/2.x)
- updates and deletes with the repository helpers you would call from PATCH/DELETE handlers

### Bulk insert + filters/pagination

```bash
uv run python examples/02_bulk_and_filters.py
```

Demonstrates `Repository.bulk_insert(rows)` and filtering with SQLModel expressions:

```python
from typing import Optional

from sqlmodel import select, SQLModel, Field

from sqlmodel_pg_kit import create_all, Repository
from sqlmodel_pg_kit.db import get_session

class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    age: Optional[int] = None

create_all()
with get_session() as s:
    Repository(Hero).bulk_insert(s, [{"name": "PG Hero", "age": 1}, {"name": "PG Hero", "age": 2}])
    heroes = s.exec(select(Hero).where(Hero.name == "PG Hero")).all()
```

### Relationships (Team <- Hero)

```bash
uv run python examples/03_relationships.py
```

Creates a `Team`, assigns heroes, and reads back with `selectinload` eager loading.

### Async CRUD

```bash
uv run python examples/04_async_crud.py
```

Uses `get_async_session()` with `AsyncRepository` to write/read without raw SQL. Call `create_all()` once before the async operations.

## Tests via Makefile

- `make test-sqlite`: run fast smoke tests on SQLite (no Postgres needed)
- `make test`: run all tests found by pytest
- `make test-pg`: run Postgres integration across Python 3.10–3.13 (requires `uv` and `SQL_*`/`PG*` env)
- `make test-pg-once`: run Postgres integration once on the current Python (requires `SQL_*`/`PG*` env)

## Cheminformatics Example

This repo includes a practical multi-table example tailored for cheminformatics workflows: `examples/05_cheminformatics.py`.

What it shows:

- Molecules with descriptors: `smiles`, `selfies`, `qed`, `sa_score` (no RDKit dependency at runtime; compute descriptors upstream and store them).
- Many-to-many datasets: `Dataset` and link table `MoleculeDataset` for train/holdout/etc.
- Dataclass interop: a lightweight `@dataclass MoleculeDTO` that converts to SQLModel for fast CRUD.
- Typical queries: threshold filters, pattern matching, eager-loaded relationships, join filters.

Run it (requires `SQL_*`/`PG*` env):

```bash
uv run python examples/05_cheminformatics.py
```

Step-by-step outline you can adapt:

1) Define models

```python
from typing import List, Optional

from sqlmodel import Field, Relationship, SQLModel

# Define the link table first so it can be passed as link_model below
class MoleculeDataset(SQLModel, table=True):
    molecule_id: int = Field(foreign_key="molecule.id", primary_key=True)
    dataset_id: int = Field(foreign_key="dataset.id", primary_key=True)

class Molecule(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    smiles: str = Field(index=True)
    selfies: Optional[str] = None
    qed: Optional[float] = Field(default=None, index=True)
    sa_score: Optional[float] = Field(default=None, index=True)
    datasets: List["Dataset"] = Relationship(back_populates="molecules", link_model=MoleculeDataset)

class Dataset(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str = Field(index=True)
    molecules: List[Molecule] = Relationship(back_populates="datasets", link_model=MoleculeDataset)
```

2) Bring your RDKit pipeline data in via a dataclass

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MoleculeDTO:
    smiles: str
    selfies: Optional[str] = None
    qed: Optional[float] = None
    sa_score: Optional[float] = None

    def to_model(self) -> Molecule:
        # vars() works here because the DTO fields match Molecule's columns
        return Molecule(**vars(self))
```

3) Fast CRUD

```python
from sqlmodel import select

from sqlmodel_pg_kit import create_all
from sqlmodel_pg_kit.db import get_session

create_all()  # once

with get_session() as s:
    s.add(Molecule(smiles="CCO", qed=0.5, sa_score=2.1))
    s.commit()

with get_session() as s:
    m = s.exec(select(Molecule).where(Molecule.smiles == "CCO")).one()
    m.qed = 0.55
    s.add(m); s.commit(); s.refresh(m)

with get_session() as s:
    s.delete(m); s.commit()
```

4) Link datasets and join queries

```python
with get_session() as s:
    ds = Dataset(name="train"); s.add(ds); s.commit(); s.refresh(ds)
    m = s.exec(select(Molecule).where(Molecule.smiles == "CCO")).one()
    s.add(MoleculeDataset(molecule_id=m.id, dataset_id=ds.id)); s.commit()

with get_session() as s:
    stmt = (
        select(Molecule)
        .join(MoleculeDataset, Molecule.id == MoleculeDataset.molecule_id)
        .join(Dataset, Dataset.id == MoleculeDataset.dataset_id)
        .where(Dataset.name == "train")
    )
    train_mols = s.exec(stmt).all()
```

5) Tips for high-throughput tasks

- Bulk insert many rows: use SQLAlchemy Core `insert(Model).values(list_of_dicts)`, then `s.commit()`.
- Paginate: `select(Model).offset((page - 1) * size).limit(size)`.
- Eager-load related rows: `.options(selectinload(Model.rel))` to avoid N+1 queries.
- Partial updates: load the row, set fields, commit; or use Core `update()` when you don't need ORM instances.

See `examples/05_cheminformatics.py` for a complete, runnable walkthrough.

## Cheminformatics: RDKit + Mordred

Below are patterns for integrating RDKit and Mordred descriptor pipelines.

- Modeling patterns:
  - Wide columns: put high-value fields directly on `Molecule` (e.g., `qed`, `sa_score`) with B-Tree indexes for fast filters.
  - JSONB payload: store a large descriptor dict in a JSONB column when you need many descriptors but query only a subset.
  - Normalized table (EAV): `MoleculeDescriptor(molecule_id, name, value)` when you frequently query and index specific descriptors by name.

- JSONB example (Postgres):

```python
from typing import Dict, Optional

from sqlalchemy import Column
from sqlalchemy.dialects.postgresql import JSONB
from sqlmodel import SQLModel, Field

class Molecule(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    smiles: str = Field(index=True)
    descriptors: Dict[str, float] = Field(default_factory=dict, sa_column=Column(JSONB))
```

- EAV example:

```python
from typing import Optional

from sqlmodel import SQLModel, Field

class MoleculeDescriptor(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)  # surrogate key; SQLModel tables need a primary key
    molecule_id: int = Field(foreign_key="molecule.id", index=True)
    name: str = Field(index=True)
    value: float = Field(index=True)
```

- RDKit + Mordred pipeline sketch:

```python
# Optional installs in Jupyter:
# %pip install rdkit-pypi mordred

from rdkit import Chem
from rdkit.Chem import QED
from mordred import Calculator, descriptors

mol = Chem.MolFromSmiles("c1ccccc1O")
qed = float(QED.qed(mol))
calc = Calculator(descriptors, ignore_3D=True)
md = calc(mol)  # mapping-like Result object
# keep only numeric values; Mordred returns error objects for missing descriptors
desc = {k: float(v) for k, v in md.asdict().items() if isinstance(v, (int, float))}

from sqlmodel_pg_kit.db import get_session

with get_session() as s:
    m = Molecule(smiles="c1ccccc1O", descriptors=desc)
    s.add(m); s.commit(); s.refresh(m)
    # to filter by a descriptor threshold with JSONB, use SQLAlchemy JSON
    # operators, or extract hot features into wide columns
```

Indexing tips:

- Create B-Tree indexes on hot numeric columns: `qed`, `sa_score`, etc.
- For JSONB, consider GIN indexes with `jsonb_path_ops` on frequently accessed keys.
- For EAV, add `(name, value)` composite indexes and partial indexes for the top N descriptor names.

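These indexes can be declared in SQLAlchemy rather than raw DDL. A compile-only sketch with assumed index names (no database needed):

```python
from sqlalchemy import Column, Float, Index, Integer, MetaData, String, Table, text
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.schema import CreateIndex

metadata = MetaData()
molecule = Table(
    "molecule", metadata,
    Column("id", Integer, primary_key=True),
    Column("qed", Float),
    Column("descriptors", JSONB),
)
molecule_descriptor = Table(
    "molecule_descriptor", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("value", Float),
)

# B-Tree index on a hot numeric column
ix_qed = Index("ix_molecule_qed", molecule.c.qed)
# GIN index with jsonb_path_ops for containment queries on the JSONB payload
ix_desc = Index(
    "ix_molecule_descriptors", molecule.c.descriptors,
    postgresql_using="gin", postgresql_ops={"descriptors": "jsonb_path_ops"},
)
# composite (name, value) index for EAV lookups
ix_name_value = Index("ix_md_name_value", molecule_descriptor.c.name, molecule_descriptor.c.value)
# partial index covering only one hot descriptor name
ix_qed_rows = Index(
    "ix_md_qed_value", molecule_descriptor.c.value,
    postgresql_where=text("name = 'qed'"),
)

gin_sql = str(CreateIndex(ix_desc).compile(dialect=postgresql.dialect()))
partial_sql = str(CreateIndex(ix_qed_rows).compile(dialect=postgresql.dialect()))
```
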
## Jupyter Notebooks

Ready-to-run tutorial notebooks are included under `notebooks/`:

- `notebooks/01_cheminformatics_quickstart.ipynb`: end-to-end walkthrough (install, configuration, CRUD, joins, optional RDKit/Mordred computation). You can execute it in your Jupyter environment.
- `notebooks/05_cheminformatics.ipynb`: a teaching version of `examples/05_cheminformatics.py`, with step-by-step CRUD and joins tailored for learning.
- `notebooks/01_sync_crud.ipynb`: helpers-based create/get/list/update/delete (with optional SQLite override).
- `notebooks/02_bulk_and_filters.ipynb`: bulk insert and SQLModel filtering examples.
- `notebooks/03_relationships.ipynb`: Team ↔ Hero relationships with eager loading.
- `notebooks/04_async_crud.ipynb`: async session CRUD, with optional async SQLite override.
- `notebooks/06_csv_import.ipynb`: CSV → SQLModel automatic model generation and import (including SQLite/Postgres switching and filtered queries).

Typical start in Jupyter:

```python
%pip install -e . pytest  # kit + tests
# Optional dependencies for chem pipelines:
# %pip install rdkit-pypi mordred
```

### Micromamba environment

If you already have a micromamba env named `sqlmodel`:

```bash
micromamba activate sqlmodel
jupyter lab  # or jupyter notebook
```

Then open one of the notebooks under `notebooks/` to follow along.

## Build & Publish

- PyPI build: `uv build`
- PyPI publish: `uv publish` (configure a token via `UV_PUBLISH_TOKEN` or `--token`)
- Conda build: `conda build conda/`

## Notes

- For production, prefer `sslmode=verify-full`, least-privileged DB users, and consider PgBouncer for connection pooling.
- Bring your own Alembic migrations; set `target_metadata = SQLModel.metadata` in `env.py`.
