first add

commit 093d8efd3b, 2025-08-17 22:18:45 +08:00
32 changed files with 3531 additions and 0 deletions

README.md (new file, 436 lines)
# sqlmodel-pg-kit
Reusable SQLModel + PostgreSQL kit with src layout, sync/async engines, and generic CRUD repositories. Managed via uv (PEP 621/517) and built with hatchling. Includes minimal tests and a conda recipe.
## Features
- Config dataclass -> builds sync/async URLs
- Engine + Session factories (sync/async)
- Generic `Repository`/`AsyncRepository` with create/get/list/update/delete/bulk_insert
- Examples and notebooks for common SQLModel patterns
- PyPI/conda packaging setup, smoke test
## Quickstart (uv)
- Create venv: `uv venv`
- Editable install: `uv pip install -e .`
- Run tests: `uv pip install pytest && pytest -q`
## Usage
- Configure the environment (either `SQL_*` or Postgres `PG*` variables). Example for a container address:
- `export SQL_HOST=192.168.64.8`
- `export SQL_PORT=5432`
- `export SQL_USER=postgres`
- `export SQL_PASSWORD=change-me-strong`
- `export SQL_DATABASE=appdb`
- `export SQL_SSLMODE=disable`
You can perform CRUD with SQLModel without writing raw SQL strings. This kit exposes:
- `create_all()` to create tables from models collected in `SQLModel.metadata`
- Generic repositories: `Repository(Model)` and `AsyncRepository(Model)` for CRUD and bulk insert
- Session helpers: `get_session()` and `get_async_session()` for custom queries using SQLModel/SQLAlchemy expressions
### Choose Backend: SQLite or Postgres
- SQLite (in-memory) quick demo — no environment variables needed:
```python
from typing import Optional
from sqlmodel import SQLModel, Field
from sqlmodel_pg_kit import create_all
from sqlmodel_pg_kit import db # access engine factory
# Override engine to SQLite in-memory
db.engine = db.create_engine("sqlite:///:memory:", echo=False)
class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str

create_all()  # creates tables on the current engine (SQLite)
```
- SQLite (file) quick demo:
```python
from sqlmodel_pg_kit import db, create_all
db.engine = db.create_engine("sqlite:///./demo.db", echo=False)
create_all()
```
- Postgres (recommended for production) via environment variables:
```bash
export SQL_HOST=127.0.0.1
export SQL_PORT=5432
export SQL_USER=postgres
export SQL_PASSWORD=change-me-strong
export SQL_DATABASE=appdb
export SQL_SSLMODE=disable # or require/verify-full as needed
```
Then your Python code can just call:
```python
from sqlmodel_pg_kit import create_all
create_all() # uses the default Postgres engine constructed from env
```
- Postgres explicit configuration (no env needed):
```python
from sqlmodel_pg_kit import db, create_all
cfg = db.DatabaseConfig(
    host="127.0.0.1", port=5432, user="postgres",
    password="change-me-strong", database="appdb", sslmode="disable",
)
db.engine = db.create_engine(cfg.sync_url(), echo=False)
create_all()
```
### CSV → SQLModel → Table (Autogenerate Model)
Autogenerate a SQLModel class from a CSV header and import the rows.
- CLI example:
```bash
uv run python examples/06_csv_to_sqlmodel.py --csv ./data/molecules.csv --sqlite ./demo.db
# Or with Postgres via env vars:
# export SQL_HOST=...; export SQL_USER=...; ...
# uv run python examples/06_csv_to_sqlmodel.py --csv ./data/molecules.csv
```
- In code:
```python
from sqlmodel_pg_kit.csv_import import build_model_from_csv, insert_rows
from sqlmodel_pg_kit import create_all
from sqlmodel_pg_kit.db import get_session
spec, rows = build_model_from_csv("./data/molecules.csv", class_name="Molecules", table_name="molecules")
create_all()
with get_session() as s:
    n = insert_rows(spec.model, rows, s)
    print("inserted:", n)
```
Type inference per column: a column is bool if every value is one of true/false/1/0/yes/no; otherwise int if all values parse as integers; otherwise float; otherwise str. Empty strings and NA/NaN/None/null become NULL.
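The inference rules above can be sketched as a small helper (hypothetical code for illustration; the kit's actual logic lives in `sqlmodel_pg_kit.csv_import`):

```python
# Hypothetical sketch of the per-column type inference described above.
NULLS = {"", "na", "nan", "none", "null"}
BOOLS = {"true", "false", "1", "0", "yes", "no"}

def infer_type(values: list[str]) -> type:
    """Return bool, int, float, or str for a CSV column's raw strings."""
    cells = [v for v in values if v.strip().lower() not in NULLS]
    if not cells:
        return str  # all-null column
    if all(v.strip().lower() in BOOLS for v in cells):
        return bool
    for caster in (int, float):
        try:
            for v in cells:
                caster(v)
            return caster
        except ValueError:
            continue
    return str
```

Because "1"/"0" are valid booleans and integers, bool is checked before int, matching the rule order above.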
### Sync CRUD (helpers)
```bash
uv run python examples/01_sync_crud.py
```
Shows create/get/list/update/delete using Repository. Minimal snippet:
```python
from typing import Optional
from sqlmodel import SQLModel, Field
from sqlmodel_pg_kit import create_all, Repository
from sqlmodel_pg_kit.db import get_session
class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str = Field(index=True)
    age: Optional[int] = None

create_all()
repo = Repository(Hero)
with get_session() as s:
    h = repo.create(s, {"name": "Alice", "age": 20})
    h2 = repo.get(s, h.id)
    page = repo.list(s, page=1, size=5)
    h3 = repo.update(s, h.id, age=21)
    ok = repo.delete(s, h.id)
```
### Bulk insert + filters/pagination
```bash
uv run python examples/02_bulk_and_filters.py
```
Demonstrates `Repository.bulk_insert(rows)` and filtering with SQLModel expressions:
```python
from typing import Optional
from sqlmodel import select, SQLModel, Field
from sqlmodel_pg_kit import Repository
from sqlmodel_pg_kit.db import get_session
class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    age: Optional[int] = None

with get_session() as s:
    Repository(Hero).bulk_insert(s, [{"name": "PG Hero", "age": 1}, {"name": "PG Hero", "age": 2}])
    heroes = s.exec(select(Hero).where(Hero.name == "PG Hero")).all()
```
### Relationships (Team <- Hero)
```bash
uv run python examples/03_relationships.py
```
Creates a `Team`, assigns heroes, and reads back with `selectinload` eager loading.
### Async CRUD
```bash
uv run python examples/04_async_crud.py
```
Uses `get_async_session()` with `AsyncRepository` to write/read without raw SQL. Call `create_all()` once before running async operations.
## Tests via Makefile
- `make test-sqlite`: run fast smoke tests on SQLite (no Postgres needed)
- `make test`: run all tests found by pytest
- `make test-pg`: run Postgres integration once on current Python (requires SQL_*/PG* env)
- `make test-pg-once`: run Postgres integration across Python 3.10–3.13 (requires `uv` and SQL_*/PG* env)
## Cheminformatics Example
This repo includes a practical multi-table example tailored for cheminformatics workflows: `examples/05_cheminformatics.py`.
What it shows:
- Molecules with descriptors: `smiles`, `selfies`, `qed`, `sa_score` (no RDKit dependency in runtime; compute upstream and store).
- Many-to-many datasets: `Dataset` and link table `MoleculeDataset` for train/holdout/etc.
- Dataclass interop: lightweight `@dataclass MoleculeDTO` that converts to SQLModel for fast CRUD.
- Typical queries: threshold filters, pattern matching, eager-loaded relationships, join filters.
Run it (requires SQL_*/PG* env):
```bash
uv run python examples/05_cheminformatics.py
```
Step-by-step outline you can adapt:
1) Define models
```python
from typing import List, Optional
from sqlmodel import SQLModel, Field, Relationship

# Define the link table first so link_model can reference it directly
class MoleculeDataset(SQLModel, table=True):
    molecule_id: int = Field(foreign_key="molecule.id", primary_key=True)
    dataset_id: int = Field(foreign_key="dataset.id", primary_key=True)

class Molecule(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    smiles: str = Field(index=True)
    selfies: Optional[str] = None
    qed: Optional[float] = Field(default=None, index=True)
    sa_score: Optional[float] = Field(default=None, index=True)
    datasets: List["Dataset"] = Relationship(back_populates="molecules", link_model=MoleculeDataset)

class Dataset(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str = Field(index=True)
    molecules: List[Molecule] = Relationship(back_populates="datasets", link_model=MoleculeDataset)
```
2) Bring your RDKit pipeline data via dataclass
```python
from dataclasses import dataclass

@dataclass
class MoleculeDTO:
    smiles: str
    selfies: Optional[str] = None
    qed: Optional[float] = None
    sa_score: Optional[float] = None

    def to_model(self) -> Molecule:
        return Molecule(**vars(self))
```
3) Fast CRUD
```python
from sqlmodel import select
from sqlmodel_pg_kit.db import get_session

create_all()  # once
with get_session() as s:
    s.add(Molecule(smiles="CCO", qed=0.5, sa_score=2.1))
    s.commit()
with get_session() as s:
    m = s.exec(select(Molecule).where(Molecule.smiles == "CCO")).one()
    m.qed = 0.55
    s.add(m); s.commit(); s.refresh(m)
with get_session() as s:
    s.delete(m); s.commit()
```
4) Link datasets and join queries
```python
with get_session() as s:
    ds = Dataset(name="train"); s.add(ds); s.commit(); s.refresh(ds)
    m = s.exec(select(Molecule).where(Molecule.smiles == "CCO")).one()
    s.add(MoleculeDataset(molecule_id=m.id, dataset_id=ds.id)); s.commit()
with get_session() as s:
    stmt = (
        select(Molecule)
        .join(MoleculeDataset, Molecule.id == MoleculeDataset.molecule_id)
        .join(Dataset, Dataset.id == MoleculeDataset.dataset_id)
        .where(Dataset.name == "train")
    )
    train_mols = s.exec(stmt).all()
```
5) Tips for high-throughput tasks
- Bulk insert many rows: use SQLAlchemy Core `insert(Model).values(list_of_dicts)` then `s.commit()`.
- Paginate: `select(Model).offset((page-1)*size).limit(size)`.
- Eager-load related rows: `.options(selectinload(Model.rel))` to avoid N+1.
- Partial updates: load row, set fields, commit; or use Core `update()` when you don't need ORM instances.
See `examples/05_cheminformatics.py` for a complete, runnable walkthrough.
## Cheminformatics: RDKit + Mordred
Below are patterns to integrate RDKit and Mordred descriptor pipelines.
- Modeling patterns:
- Wide columns: put high-value fields directly on `Molecule` (e.g., `qed`, `sa_score`) with B-tree indexes for fast filters.
- JSONB payload: store a large descriptor dict in a JSONB column when you need many descriptors but query only a subset.
- Normalized table (EAV): `MoleculeDescriptor(molecule_id, name, value)` when you frequently query and index specific descriptors by name.
- JSONB example (Postgres):
```python
from typing import Optional, Dict
from sqlmodel import SQLModel, Field
from sqlalchemy import Column
from sqlalchemy.dialects.postgresql import JSONB
class Molecule(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    smiles: str = Field(index=True)
    descriptors: Dict[str, float] = Field(default_factory=dict, sa_column=Column(JSONB))
```
- EAV example:
```python
class MoleculeDescriptor(SQLModel, table=True):
    # composite primary key: one row per (molecule, descriptor name)
    molecule_id: int = Field(foreign_key="molecule.id", primary_key=True)
    name: str = Field(primary_key=True, index=True)
    value: float = Field(index=True)
```
- RDKit + Mordred pipeline sketch:
```python
# Optional installs in Jupyter:
# %pip install rdkit-pypi mordred
from rdkit import Chem
from rdkit.Chem import QED
from mordred import Calculator, descriptors
mol = Chem.MolFromSmiles("c1ccccc1O")
qed = float(QED.qed(mol))
calc = Calculator(descriptors, ignore_3D=True)
md = calc(mol) # returns mapping-like object
desc = {k: float(v) for k,v in md.items() if v is not None and isinstance(v, (int, float))}
from sqlmodel_pg_kit.db import get_session
from sqlmodel import select
with get_session() as s:
    m = Molecule(smiles="c1ccccc1O", descriptors=desc)
    s.add(m); s.commit(); s.refresh(m)
    # filter by a descriptor threshold:
    # for JSONB, use SQLAlchemy JSON operators, or extract hot features into wide columns
```
Indexing tips:
- Create B-tree indexes on hot numeric columns: `qed`, `sa_score`, etc.
- For JSONB, consider GIN indexes with jsonb_path_ops on frequently accessed keys.
- For EAV, add `(name, value)` composite indexes and partial indexes for the top N descriptor names.
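These indexes can be declared with SQLAlchemy DDL constructs; a sketch with illustrative table, column, and index names:

```python
# Sketch of the indexing tips as SQLAlchemy DDL; emit via metadata.create_all()
# or Alembic. Table/index names here are illustrative.
from sqlalchemy import Column, Index, Integer, MetaData, Numeric, String, Table, text
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.schema import CreateIndex

metadata = MetaData()
molecule = Table(
    "molecule", metadata,
    Column("id", Integer, primary_key=True),
    Column("qed", Numeric),
    Column("descriptors", JSONB),
)

# B-tree index on a hot numeric column
ix_qed = Index("ix_molecule_qed", molecule.c.qed)

# GIN index with jsonb_path_ops for containment queries on the JSONB payload
ix_desc = Index(
    "ix_molecule_descriptors",
    molecule.c.descriptors,
    postgresql_using="gin",
    postgresql_ops={"descriptors": "jsonb_path_ops"},
)

descriptor = Table(
    "molecule_descriptor", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("value", Numeric),
)

# Composite (name, value) index, partial on one hot descriptor name
ix_eav = Index(
    "ix_desc_name_value",
    descriptor.c.name, descriptor.c.value,
    postgresql_where=text("name = 'qed'"),
)

# Render the GIN index DDL to confirm the opclass
ddl = str(CreateIndex(ix_desc).compile(dialect=postgresql.dialect()))
```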
## Jupyter Notebooks
A ready-to-run tutorial notebook is included under `notebooks/`:
- `notebooks/01_cheminformatics_quickstart.ipynb`: end-to-end walkthrough (install, configuration, CRUD, joins, optional RDKit/Mordred computation). You can execute it in your Jupyter environment.
- `notebooks/05_cheminformatics.ipynb`: a teaching version of `examples/05_cheminformatics.py`, step-by-step CRUD and joins tailored for learning.
- `notebooks/01_sync_crud.ipynb`: helpers-based create/get/list/update/delete (with optional SQLite override).
- `notebooks/02_bulk_and_filters.ipynb`: bulk insert and SQLModel filtering examples.
- `notebooks/03_relationships.ipynb`: Team ↔ Hero relationships with eager loading.
- `notebooks/04_async_crud.ipynb`: async session CRUD, with optional async SQLite override.
- `notebooks/06_csv_import.ipynb`: CSV → SQLModel automatic model generation and import (includes SQLite/Postgres switching and filter queries).
Typical start in Jupyter:
```python
%pip install -e . pytest # kit + tests
# Optional dependencies for chem pipelines:
# %pip install rdkit-pypi mordred
```
### Micromamba environment
If you already have a micromamba env named `sqlmodel`:
```bash
micromamba activate sqlmodel
jupyter lab # or jupyter notebook
```
Then open one of the notebooks under `notebooks/` to follow along.
## Build & Publish
- PyPI build: `uv build`
- PyPI publish: `uv publish` (supply a token via `--token` or the `UV_PUBLISH_TOKEN` environment variable)
- Conda build: `conda build conda/`
## Notes
- For production, prefer `sslmode=verify-full`, least-privileged DB users, and consider PgBouncer.
- Bring your own Alembic migrations; set `target_metadata = SQLModel.metadata` in env.py.
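For the Alembic note above, a minimal `env.py` fragment might look like this (a sketch; `app.models` is a hypothetical module that imports your table definitions so they register on `SQLModel.metadata`):

```python
# Sketch of the relevant lines in Alembic's env.py; adjust module paths.
from sqlmodel import SQLModel

import app.models  # noqa: F401  (hypothetical: importing registers the tables)

# Alembic autogenerate compares this metadata against the live database
target_metadata = SQLModel.metadata
```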
## Apple container start command
```shell
# Stop and delete any existing container
container stop pg-app
container delete pg-app
container run \
--name pg-app \
--detach \
--mount type=tmpfs,target=/var/lib/postgresql/data \
--volume ./docker-init:/docker-entrypoint-initdb.d \
--env POSTGRES_USER=postgres \
--env POSTGRES_PASSWORD=change-me-strong \
--env POSTGRES_INITDB_ARGS="--encoding=UTF8 --locale=en_US.UTF-8" \
postgres:16
container ls --all
container logs pg-app
```
## Set environment variables
```bash
export SQL_HOST=192.168.64.8
export SQL_PORT=5432
export SQL_USER=postgres
export SQL_PASSWORD=change-me-strong
export SQL_DATABASE=appdb
export SQL_SSLMODE=disable
```
## Run tests with uv
Test the connection: you can now use `psql` to verify the PostgreSQL connection:
```bash
psql -h $SQL_HOST -p $SQL_PORT -U $SQL_USER -d $SQL_DATABASE -c '\l'
```
```bash
make test-pg-once
```
## Verify data with psql
- List tables: `\dt`
- Inspect example tables (depending on the example you ran), e.g.:
- `SELECT * FROM hero ORDER BY id DESC LIMIT 10;`
- `SELECT * FROM dataset ORDER BY id DESC LIMIT 10;`