# sqlmodel-pg-kit
Reusable SQLModel + PostgreSQL kit with src layout, sync/async engines, and generic CRUD repositories. Managed via uv (PEP 621/517) and built with hatchling. Includes minimal tests and a conda recipe.
## Features
- Config dataclass -> builds sync/async URLs
- Engine + Session factories (sync/async)
- Generic `Repository`/`AsyncRepository` with create/get/list/update/delete/bulk_insert
- Examples and notebooks for common SQLModel patterns
- PyPI/conda packaging setup, smoke test
## Quickstart (uv)
- Create venv: `uv venv`
- Editable install: `uv pip install -e .`
- Run tests: `uv pip install pytest && pytest -q`
## Local Postgres (Docker)
- Start the container: `docker compose up -d`
- Stop when done: `docker compose down`
- Default credentials match `DatabaseConfig`: user `appuser`, password `changeme`, database `appdb`
- Export `SQL_SSLMODE=disable` (container does not use TLS by default)
- Port: 5433 on the host (mapped to container port 5432)
## Usage
- Configure the environment (either `SQL_*` or Postgres `PG*` variables). Example for a container address:
  - `export SQL_HOST=192.168.64.8`
  - `export SQL_PORT=5432`
  - `export SQL_USER=postgres`
  - `export SQL_PASSWORD=change-me-strong`
  - `export SQL_DATABASE=appdb`
  - `export SQL_SSLMODE=disable`
You can perform CRUD with SQLModel without writing raw SQL strings. This kit exposes:
- `create_all()` to create tables from models collected in `SQLModel.metadata`
- Generic repositories: `Repository(Model)` and `AsyncRepository(Model)` for CRUD and bulk insert
- Session helpers: `get_session()` and `get_async_session()` for custom queries using SQLModel/SQLAlchemy expressions
### Choose Backend: SQLite or Postgres
- SQLite (in-memory) quick demo — no environment variables needed:
```python
from typing import Optional
from sqlmodel import SQLModel, Field
from sqlmodel_pg_kit import create_all
from sqlmodel_pg_kit import db # access engine factory
# Override engine to SQLite in-memory
db.engine = db.create_engine("sqlite:///:memory:", echo=False)
class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
create_all() # creates tables on the current engine (SQLite)
```
- SQLite (file) quick demo:
```python
from sqlmodel_pg_kit import db, create_all
db.engine = db.create_engine("sqlite:///./demo.db", echo=False)
create_all()
```
- Postgres (recommended for production) via environment variables:
```bash
export SQL_HOST=127.0.0.1
export SQL_PORT=5432
export SQL_USER=postgres
export SQL_PASSWORD=change-me-strong
export SQL_DATABASE=appdb
export SQL_SSLMODE=disable # or require/verify-full as needed
```
Then your Python code can just call:
```python
from sqlmodel_pg_kit import create_all
create_all() # uses the default Postgres engine constructed from env
```
- Postgres explicit configuration (no env needed):
```python
from sqlmodel_pg_kit import db, create_all
cfg = db.DatabaseConfig(
    host="127.0.0.1", port=5432, user="postgres",
    password="change-me-strong", database="appdb", sslmode="disable",
)
db.engine = db.create_engine(cfg.sync_url(), echo=False)
create_all()
```
### CSV → SQLModel → Table (Autogenerate Model)
Autogenerate a SQLModel class from a CSV header and import the rows.
- CLI example:
```bash
uv run python examples/06_csv_to_sqlmodel.py --csv ./data/molecules.csv --sqlite ./demo.db
# Or with Postgres via env vars:
# export SQL_HOST=...; export SQL_USER=...; ...
# uv run python examples/06_csv_to_sqlmodel.py --csv ./data/molecules.csv
```
- In code:
```python
from sqlmodel_pg_kit.csv_import import build_model_from_csv, insert_rows
from sqlmodel_pg_kit import create_all
from sqlmodel_pg_kit.db import get_session
spec, rows = build_model_from_csv("./data/molecules.csv", class_name="Molecules", table_name="molecules")
create_all()
with get_session() as s:
    n = insert_rows(spec.model, rows, s)
    print("inserted:", n)
```
Type inference per column: `bool` if every value is one of true/false/1/0/yes/no; otherwise `int` if all values parse as integers; otherwise `float`; otherwise `str`. Empty strings and NA/NaN/None/null values become NULL.
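These rules can be sketched as a small standalone helper (illustrative only; the kit's actual `csv_import` logic may differ in detail):

```python
def infer_type(values: list[str]) -> type:
    """Infer a column type from raw CSV strings: bool, then int, then float, else str."""
    BOOL_TOKENS = {"true", "false", "1", "0", "yes", "no"}
    NULL_TOKENS = {"", "na", "nan", "none", "null"}
    # Empty/NA/NaN/None/null cells become NULL and are ignored for inference
    cleaned = [v.strip().lower() for v in values]
    cleaned = [v for v in cleaned if v not in NULL_TOKENS]
    if not cleaned:
        return str
    if all(v in BOOL_TOKENS for v in cleaned):
        return bool
    for candidate in (int, float):
        try:
            for v in cleaned:
                candidate(v)
            return candidate
        except ValueError:
            pass
    return str
```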
### Sync CRUD (helpers)
```bash
uv run python examples/01_sync_crud.py
```
Shows create/get/list/update/delete using Repository. Minimal snippet:
```python
from typing import Optional
from sqlmodel import SQLModel, Field
from sqlmodel_pg_kit import create_all, Repository
from sqlmodel_pg_kit.db import get_session
class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str = Field(index=True)
    age: Optional[int] = None

create_all()
repo = Repository(Hero)
with get_session() as s:
    h = repo.create(s, {"name": "Alice", "age": 20})
    h2 = repo.get(s, h.id)
    page = repo.list(s, page=1, size=5)
    h3 = repo.update(s, h.id, age=21)
    ok = repo.delete(s, h.id)
```
For a container-backed workflow, run `uv run python examples/07_postgres_minimal.py` (paired with `notebooks/07_postgres_minimal.ipynb`). That walkthrough now mirrors a REST-style request cycle:
- wipes existing demo rows so each run is deterministic
- upserts seed rows via `Repository.bulk_insert`
- paginates filtered results with `Repository.list(... where=..., order_by=..., page=..., size=...)`
- performs fuzzy search using `select(...).where(Model.name.ilike(...))`
- counts inventory with `session.exec(select(func.count(...))).scalar()` (compatible with SQLAlchemy 1.4/2.x)
- updates and deletes with the repository helpers you would call from PATCH/DELETE handlers
### Bulk insert + filters/pagination
```bash
uv run python examples/02_bulk_and_filters.py
```
Demonstrates `Repository.bulk_insert(rows)` and filtering with SQLModel expressions:
```python
from typing import Optional
from sqlmodel import select, SQLModel, Field
from sqlmodel_pg_kit import Repository
from sqlmodel_pg_kit.db import get_session
class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    age: Optional[int] = None

with get_session() as s:
    Repository(Hero).bulk_insert(s, [{"name": "PG Hero", "age": 1}, {"name": "PG Hero", "age": 2}])
    heroes = s.exec(select(Hero).where(Hero.name == "PG Hero")).all()
```
### Relationships (Team <- Hero)
```bash
uv run python examples/03_relationships.py
```
Creates a `Team`, assigns heroes, and reads back with `selectinload` eager loading.
### Async CRUD
```bash
uv run python examples/04_async_crud.py
```
Uses `get_async_session()` with `AsyncRepository` to write/read without raw SQL. You may call `create_all()` once before async operations.
## Tests via Makefile
- `make test-sqlite`: run fast smoke tests on SQLite (no Postgres needed)
- `make test`: run all tests found by pytest
- `make test-pg`: run Postgres integration once on current Python (requires SQL_*/PG* env)
- `make test-pg-once`: run Postgres integration across Python 3.10–3.13 (requires `uv` and SQL_*/PG* env)
## Cheminformatics Example
This repo includes a practical multi-table example tailored for cheminformatics workflows: `examples/05_cheminformatics.py`.
What it shows:
- Molecules with descriptors: `smiles`, `selfies`, `qed`, `sa_score` (no RDKit dependency in runtime; compute upstream and store).
- Many-to-many datasets: `Dataset` and link table `MoleculeDataset` for train/holdout/etc.
- Dataclass interop: lightweight `@dataclass MoleculeDTO` that converts to SQLModel for fast CRUD.
- Typical queries: threshold filters, pattern matching, eager-loaded relationships, join filters.
Run it (requires SQL_*/PG* env):
```bash
uv run python examples/05_cheminformatics.py
```
Step-by-step outline you can adapt:
1) Define models
```python
class MoleculeDataset(SQLModel, table=True):
    molecule_id: int = Field(foreign_key="molecule.id", primary_key=True)
    dataset_id: int = Field(foreign_key="dataset.id", primary_key=True)

class Molecule(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    smiles: str = Field(index=True)
    selfies: Optional[str] = None
    qed: Optional[float] = Field(default=None, index=True)
    sa_score: Optional[float] = Field(default=None, index=True)
    datasets: List["Dataset"] = Relationship(back_populates="molecules", link_model=MoleculeDataset)

class Dataset(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str = Field(index=True)
    molecules: List[Molecule] = Relationship(back_populates="datasets", link_model=MoleculeDataset)
```
2) Bring your RDKit pipeline data via dataclass
```python
@dataclass
class MoleculeDTO:
    smiles: str
    selfies: Optional[str] = None
    qed: Optional[float] = None
    sa_score: Optional[float] = None

    def to_model(self) -> Molecule:
        return Molecule(**vars(self))
```
3) Fast CRUD
```python
from sqlmodel_pg_kit.db import get_session
create_all() # once
with get_session() as s:
    s.add(Molecule(smiles="CCO", qed=0.5, sa_score=2.1))
    s.commit()
with get_session() as s:
    m = s.exec(select(Molecule).where(Molecule.smiles == "CCO")).one()
    m.qed = 0.55
    s.add(m); s.commit(); s.refresh(m)
with get_session() as s:
    s.delete(m); s.commit()
```
4) Link datasets and join queries
```python
with get_session() as s:
    ds = Dataset(name="train"); s.add(ds); s.commit(); s.refresh(ds)
    m = s.exec(select(Molecule).where(Molecule.smiles == "CCO")).one()
    s.add(MoleculeDataset(molecule_id=m.id, dataset_id=ds.id)); s.commit()
with get_session() as s:
    stmt = (
        select(Molecule)
        .join(MoleculeDataset, Molecule.id == MoleculeDataset.molecule_id)
        .join(Dataset, Dataset.id == MoleculeDataset.dataset_id)
        .where(Dataset.name == "train")
    )
    train_mols = s.exec(stmt).all()
```
5) Tips for high-throughput tasks
- Bulk insert many rows: use SQLAlchemy Core `insert(Model).values(list_of_dicts)` then `s.commit()`.
- Paginate: `select(Model).offset((page-1)*size).limit(size)`.
- Eager-load related rows: `.options(selectinload(Model.rel))` to avoid N+1.
- Partial updates: load the row, set fields, commit; or use Core `update()` when you don't need ORM instances.
See `examples/05_cheminformatics.py` for a complete, runnable walkthrough.
## Cheminformatics: RDKit + Mordred
Below are patterns to integrate RDKit and Mordred descriptor pipelines.
- Modeling patterns:
- Wide columns: put high-value fields directly on `Molecule` (e.g., `qed`, `sa_score`) with B-tree indexes for fast filters.
- JSONB payload: store a large descriptor dict in a JSONB column when you need many descriptors but query only a subset.
- Normalized table (EAV): `MoleculeDescriptor(molecule_id, name, value)` when you frequently query and index specific descriptors by name.
- JSONB example (Postgres):
```python
from typing import Optional, Dict
from sqlmodel import SQLModel, Field
from sqlalchemy import Column
from sqlalchemy.dialects.postgresql import JSONB
class Molecule(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    smiles: str = Field(index=True)
    descriptors: Dict[str, float] = Field(default_factory=dict, sa_column=Column(JSONB))
```
- EAV example:
```python
class MoleculeDescriptor(SQLModel, table=True):
    # composite primary key: one row per (molecule, descriptor name)
    molecule_id: int = Field(foreign_key="molecule.id", primary_key=True, index=True)
    name: str = Field(primary_key=True, index=True)
    value: float = Field(index=True)
```
- RDKit + Mordred pipeline sketch:
```python
# Optional installs in Jupyter:
# %pip install rdkit-pypi mordred
from rdkit import Chem
from rdkit.Chem import QED
from mordred import Calculator, descriptors
mol = Chem.MolFromSmiles("c1ccccc1O")
qed = float(QED.qed(mol))
calc = Calculator(descriptors, ignore_3D=True)
md = calc(mol) # returns mapping-like object
desc = {k: float(v) for k, v in md.items() if v is not None and isinstance(v, (int, float))}
from sqlmodel_pg_kit.db import get_session
from sqlmodel import select
with get_session() as s:
    m = Molecule(smiles="c1ccccc1O", descriptors=desc)
    s.add(m); s.commit(); s.refresh(m)
# filter by a descriptor threshold
# for JSONB: use SQLAlchemy JSON operators or extract into wide columns for hot features
```
Indexing tips:
- Create B-tree indexes on hot numeric columns: `qed`, `sa_score`, etc.
- For JSONB, consider GIN indexes with jsonb_path_ops on frequently accessed keys.
- For EAV, add `(name, value)` composite indexes and partial indexes for the top N descriptor names.
## Jupyter Notebooks
A ready-to-run tutorial notebook is included under `notebooks/`:
- `notebooks/01_cheminformatics_quickstart.ipynb`: end-to-end walkthrough (install, configuration, CRUD, joins, optional RDKit/Mordred computation). You can execute it in your Jupyter environment.
- `notebooks/05_cheminformatics.ipynb`: a teaching version of `examples/05_cheminformatics.py`, with step-by-step CRUD and joins tailored for learning.
- `notebooks/01_sync_crud.ipynb`: helpers-based create/get/list/update/delete (with optional SQLite override).
- `notebooks/02_bulk_and_filters.ipynb`: bulk insert and SQLModel filtering examples.
- `notebooks/03_relationships.ipynb`: Team ↔ Hero relationships with eager loading.
- `notebooks/04_async_crud.ipynb`: async session CRUD, with optional async SQLite override.
- `notebooks/06_csv_import.ipynb`: CSV → SQLModel automatic model generation and import (includes SQLite/PG switching and filtered queries).
Typical start in Jupyter:
```python
%pip install -e . pytest # kit + tests
# Optional dependencies for chem pipelines:
# %pip install rdkit-pypi mordred
```
### Micromamba environment
If you already have a micromamba env named `sqlmodel`:
```bash
micromamba activate sqlmodel
jupyter lab # or jupyter notebook
```
Then open one of the notebooks under `notebooks/` to follow along.
## Build & Publish
- PyPI build: `uv build`
- PyPI publish: `uv publish` (configure token via `uv keyring set ...`)
- Conda build: `conda build conda/`
## Notes
- For production, prefer `sslmode=verify-full`, least-privileged DB users, and consider PgBouncer.
- Bring your own Alembic migrations; set `target_metadata = SQLModel.metadata` in env.py.
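A minimal `env.py` fragment for that wiring (the imported module path is a placeholder for your own models package):

```python
# alembic env.py (fragment)
from sqlmodel import SQLModel

# Import model modules first so their tables register on SQLModel.metadata
# import myapp.models  # noqa: F401  (hypothetical module path)

target_metadata = SQLModel.metadata
```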