feat(storage): implement NAS content storage with read/write capabilities

feat(docker): configure NAS content and EPUB source directories in docker-compose
feat(migrations): add tables for SourceAsset, ImportJob, ChapterContentRef, and AssetNovelMapping
feat(scripts): create backfill script for populating ChapterContentRef from MongoDB chapters
2026-04-30 01:53:52 +07:00
parent 82f502acd2
commit 9a3bb4b6ce
10 changed files with 1297 additions and 18 deletions
@@ -12,3 +12,8 @@ build/
 .env.local
 # Local debug/test scripts
 test_*.py
 # Local NAS mount test data
 data/epub-source/*
 data/content/*
 data/nas-content/*
@@ -90,6 +90,51 @@ Notes:
 - `api-local` listens on port `8001` and automatically points to `postgres` + `mongo` containers.
 - `web` listens on port `3000` and calls API internally through `http://api:8000`.
 ### NAS mount points (chapter content + EPUB source)
 API containers now reserve two mount folders:
 - `/data/content`: converted chapter files (`txt` + `raw_html`)
 - `/data/epub-source`: source EPUB library
 Default env mapping (already wired in compose):
 ```env
 NAS_CONTENT_ROOT=/data/content
 EPUB_SOURCE_ROOT=/data/epub-source
 ```
 If you want to bind to host folders for local testing:
 ```yaml
 services:
  api:
    volumes:
      - /absolute/local/path/content:/data/content
      - /absolute/local/path/epub-source:/data/epub-source
 ```
 If you want to use NFS-backed docker volumes, define them under `volumes:`. Example:
 ```yaml
 volumes:
  nas_chapter_content:
    driver: local
    driver_opts:
      type: nfs
      o: addr=100.93.79.10,nolock,soft,rw
      device: ":/volume2/apps/reader-content"
  nas_epub_source:
    driver: local
    driver_opts:
      type: nfs
      o: addr=100.93.79.10,nolock,soft,rw
      device: ":/volume2/apps/reader-epub"
 ```
 If your EPUB library uses one folder per novel with multiple `.epub` parts inside, mount the parent folder at `/data/epub-source`.
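To illustrate the folder-per-novel layout described above, here is a minimal sketch of how the parts under the mount could be grouped by novel. The helper name `group_epub_parts` is hypothetical and not part of this commit. It assumes every direct subfolder of the source root is one novel, and it ignores loose files at the top level.

```python
from __future__ import annotations

from pathlib import Path


def group_epub_parts(source_root: str) -> dict[str, list[str]]:
    """Map each novel folder name to its sorted .epub file names (hypothetical helper)."""
    root = Path(source_root)
    grouped: dict[str, list[str]] = {}
    for novel_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Only regular files with an .epub suffix (case-insensitive) count as parts.
        parts = sorted(
            f.name
            for f in novel_dir.iterdir()
            if f.is_file() and f.suffix.lower() == ".epub"
        )
        if parts:
            grouped[novel_dir.name] = parts
    return grouped
```

Calling `group_epub_parts("/data/epub-source")` inside the API container would then list whatever the mount exposes, which is a quick way to confirm the parent folder (and not a single novel folder) was mounted.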
 ## Implemented Endpoints
 - GET /api/health
@@ -109,6 +154,52 @@ Notes:
 - GET /api/truyen/suggest
 - GET /api/chapters/{chapterId}
 ## NAS Migration Ops
 ### 1) Apply SQL migration manually
 Run SQL in `migrations/2026_04_nas_content_storage.sql` against PostgreSQL.
 ### 2) Backfill existing chapter content from Mongo -> NAS + ChapterContentRef
 Dry-run first:
 ```bash
 python scripts/backfill_chapter_content_refs.py --limit 1000 --dry-run
 ```
 Then execute:
 ```bash
 python scripts/backfill_chapter_content_refs.py --limit 1000
 ```
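 Without `--after-id` or `--state-file`, every run scans from the first chapter and skips chapters that already have a ref, so repeating the same `--limit` maps nothing new. Raise `--limit` to reach further, or use one of the resume options below.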
 Checkpoint/resume mode:
 ```bash
 python scripts/backfill_chapter_content_refs.py --limit 1000 --state-file .backfill_state.json
 ```
 Or continue from a known ObjectId:
 ```bash
 python scripts/backfill_chapter_content_refs.py --limit 1000 --after-id 680f7f3a2f0d53f4f2b7a123
 ```
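The resume rules can be summarized in a small sketch that mirrors the argument handling in the script's `main()`: an explicit `--after-id` wins, otherwise the `afterId` stored in the state file is used, and a missing or unreadable state file means the scan starts from the beginning. `resolve_after_id` is a hypothetical name used only for illustration; the script itself inlines this logic and catches any exception, while the sketch narrows that to the errors expected here.

```python
from __future__ import annotations

import json
from pathlib import Path


def resolve_after_id(after_id: str, state_file: str) -> str | None:
    """Pick the resume point following the backfill CLI's precedence (illustrative)."""
    explicit = after_id.strip() or None
    if explicit:
        return explicit  # --after-id takes precedence over the state file
    path = state_file.strip()
    if not path or not Path(path).exists():
        return None  # no checkpoint yet: scan from the first chapter
    try:
        return json.loads(Path(path).read_text(encoding="utf-8")).get("afterId")
    except (OSError, ValueError, AttributeError):
        return None  # unreadable or malformed checkpoint: start over
```

Note that the script only writes the state file on a real run, so a `--dry-run` never advances the checkpoint.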
 ## Chapter Read Cutover Flag
 Set in `.env`:
 ```env
 CHAPTER_CONTENT_MODE=nas_first
 ```
 Values:
 - `nas_first` (default): read the NAS ref first and fall back to Mongo.
 - `mongo_first`: keep Mongo-first reads during a cautious rollout.
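A minimal sketch of how such a flag could drive the read path, under stated assumptions: the function `read_chapter_text` and the two reader callables are hypothetical, and the idea that `mongo_first` also falls back to the other source is an assumption, not something this README confirms.

```python
from __future__ import annotations

from typing import Callable, Optional

# A reader takes a chapter id and returns its text, or None when it has nothing.
Reader = Callable[[str], Optional[str]]


def read_chapter_text(
    chapter_id: str, mode: str, read_nas: Reader, read_mongo: Reader
) -> Optional[str]:
    """Try the preferred source first, then the other one (assumed behavior)."""
    if mode == "mongo_first":
        order = (read_mongo, read_nas)
    else:
        order = (read_nas, read_mongo)  # nas_first is the default
    for reader in order:
        try:
            text = reader(chapter_id)
        except (OSError, KeyError):
            text = None  # treat a missing file or missing ref as a miss
        if text:
            return text
    return None
```

Passing the readers in as callables keeps the ordering logic testable without a NAS mount or a Mongo connection.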
 ## Notes
 - Web session auth is supported via NextAuth session cookies (next-auth.session-token and secure variants).
@@ -0,0 +1,30 @@
 # Rollout Checklist - NAS Chapter Storage
 ## Pre-Deploy
 - [ ] Backup PostgreSQL schema + critical tables
 - [ ] Verify NAS mount/access permissions in API runtime
 - [ ] Enable feature flags (default: Mongo fallback on)
 ## Deploy Order
 1. Deploy DB migrations
 2. Deploy API with dual-read disabled by default
 3. Enable discover/approve/convert job APIs
 4. Run pilot import set (small curated EPUB batch)
 5. Enable NAS-first for pilot users/env
 6. Gradually ramp NAS-first traffic
 ## Runtime Verification
 - [ ] `/api/health` stable
 - [ ] Chapter read success rate >= target
 - [ ] NAS read timeout/error rate below threshold
 - [ ] Mongo fallback rate trending down
 ## Rollback
 - [ ] Set `CHAPTER_CONTENT_MODE=mongo_first` immediately to switch reads back to Mongo-first
 - [ ] Stop import jobs
 - [ ] Keep imported refs for investigation (no destructive cleanup)
 ## Post-Deploy
 - [ ] Compare chapter counts and random content samples
 - [ ] Review failed/review_required import queue
 - [ ] Publish release notes for web/mobile teams
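For the content sample comparison, one possible spot check is to re-hash a stored text file and compare it with the `contentHash` recorded in `ChapterContentRef`. The sketch below is illustrative only: `content_matches_ref` is a hypothetical helper, and it assumes the file bytes are exactly the UTF-8 text that was hashed, which holds for refs written by the backfill script on Linux but may not hold for refs produced by other import paths.

```python
import hashlib
from pathlib import Path


def content_matches_ref(content_root: str, txt_href: str, content_hash: str) -> bool:
    """Return True when the stored text hashes to the recorded contentHash."""
    # hrefs are stored relative to the content root, e.g. "legacy/<chapterId>.txt".
    path = Path(content_root) / txt_href.lstrip("/")
    data = path.read_bytes()
    return hashlib.sha256(data).hexdigest() == content_hash
```

Running this over a random sample of rows gives a cheap integrity signal before ramping NAS-first traffic.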
@@ -121,7 +121,10 @@ async def resolve_current_user(db: AsyncSession, request: Request) -> dict[str,
    return await _get_user_from_session_cookie(db, request)
-async def require_current_user(db: AsyncSession, request: Request) -> dict[str, Any]:
+async def require_current_user(
    request: Request,
    db: AsyncSession = Depends(get_db_session),
 ) -> dict[str, Any]:
    user = await resolve_current_user(db, request)
    if not user:
        raise HTTPException(status_code=401, detail="Unauthorized")
@@ -20,6 +20,9 @@ class Settings(BaseSettings):
    r2_secret_access_key: str = ""
    r2_bucket_name: str = ""
    r2_public_base_url: str = ""
    nas_content_root: str = "./data/content"
    epub_source_root: str = "./data/epub-source"
    chapter_content_mode: str = "nas_first"  # nas_first | mongo_first
    deepseek_key: str = ""
    deepseek_model: str = "deepseek-chat"
@@ -0,0 +1,33 @@
 from __future__ import annotations
 import hashlib
 from pathlib import Path
 from app.config import settings
 class NasContentStorage:
    def __init__(self, root_dir: str):
        self.root = Path(root_dir).resolve()
        self.root.mkdir(parents=True, exist_ok=True)
    def _resolve(self, href: str) -> Path:
        rel = href.strip().lstrip("/")
        target = (self.root / rel).resolve()
        if self.root not in target.parents and target != self.root:
            raise ValueError("Invalid storage href")
        return target
    def read_text(self, href: str) -> str:
        path = self._resolve(href)
        return path.read_text(encoding="utf-8")
    def write_text(self, href: str, content: str) -> dict[str, str | int]:
        path = self._resolve(href)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Encode once so the hash and size describe exactly the bytes written to disk.
        data = content.encode("utf-8")
        path.write_bytes(data)
        digest = hashlib.sha256(data).hexdigest()
        return {"href": href, "sha256": digest, "size": len(data)}
 storage = NasContentStorage(settings.nas_content_root)
@@ -9,6 +9,12 @@ services:
      - .env
    ports:
      - "8000:8000"
    environment:
      NAS_CONTENT_ROOT: ${NAS_CONTENT_ROOT:-/data/content}
      EPUB_SOURCE_ROOT: ${EPUB_SOURCE_ROOT:-/data/epub-source}
    volumes:
      - nas_chapter_content:/data/content
      - nas_epub_source:/data/epub-source
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/api/health').read()"]
@@ -30,8 +36,13 @@ services:
    environment:
      DATABASE_URL: postgresql://reader:reader@postgres:5432/reader
      MONGODB_URI: mongodb://mongo:27017/reader
      NAS_CONTENT_ROOT: ${NAS_CONTENT_ROOT:-/data/content}
      EPUB_SOURCE_ROOT: ${EPUB_SOURCE_ROOT:-/data/epub-source}
    ports:
      - "8001:8000"
    volumes:
      - nas_chapter_content:/data/content
      - nas_epub_source:/data/epub-source
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/api/health').read()"]
@@ -104,3 +115,5 @@ volumes:
  web_uploads:
  postgres_data:
  mongo_data:
  nas_chapter_content:
  nas_epub_source:
@@ -0,0 +1,43 @@
 CREATE EXTENSION IF NOT EXISTS unaccent;
 CREATE TABLE IF NOT EXISTS "SourceAsset" (
    id TEXT PRIMARY KEY,
    path TEXT NOT NULL,
    sha256 TEXT NOT NULL,
    opf_identifier TEXT,
    title TEXT,
    author TEXT,
    status TEXT NOT NULL DEFAULT 'discovered',
    "createdAt" TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    "updatedAt" TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
 CREATE UNIQUE INDEX IF NOT EXISTS "SourceAsset_sha256_key" ON "SourceAsset"(sha256);
 CREATE TABLE IF NOT EXISTS "ImportJob" (
    id TEXT PRIMARY KEY,
    "sourceAssetId" TEXT NOT NULL REFERENCES "SourceAsset"(id) ON DELETE CASCADE,
    status TEXT NOT NULL DEFAULT 'pending',
    error TEXT,
    "createdAt" TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    "updatedAt" TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
 CREATE TABLE IF NOT EXISTS "ChapterContentRef" (
    "chapterId" TEXT PRIMARY KEY,
    "txtHref" TEXT NOT NULL,
    "rawHtmlHref" TEXT NOT NULL,
    "contentHash" TEXT NOT NULL,
    "createdAt" TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    "updatedAt" TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
 CREATE TABLE IF NOT EXISTS "AssetNovelMapping" (
    id TEXT PRIMARY KEY,
    "sourceAssetId" TEXT NOT NULL REFERENCES "SourceAsset"(id) ON DELETE CASCADE,
    "novelId" TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    note TEXT,
    "createdAt" TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    "updatedAt" TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
@@ -0,0 +1,118 @@
 from __future__ import annotations
 import argparse
 import asyncio
 import hashlib
 import json
 from pathlib import Path
 from bson import ObjectId
 from sqlalchemy import text
 from app.config import settings
 from app.database import SessionLocal, mongo_db
 from app.storage import storage
 async def backfill(limit: int, dry_run: bool, after_id: str | None, state_file: str | None) -> None:
    query = {
        "$or": [
            {"content": {"$exists": True, "$type": "string", "$ne": ""}},
            {"contentHtml": {"$exists": True, "$type": "string", "$ne": ""}},
        ]
    }
    if after_id:
        query["_id"] = {"$gt": ObjectId(after_id)}
    docs = (
        await mongo_db["chapters"]
        .find(query, {"content": 1, "contentHtml": 1})
        .sort("_id", 1)
        .limit(limit)
        .to_list(limit)
    )
    mapped = 0
    skipped = 0
    async with SessionLocal() as db:
        for doc in docs:
            chapter_id = str(doc.get("_id") or "")
            if not chapter_id:
                skipped += 1
                continue
            exists = (
                await db.execute(
                    text('SELECT "chapterId" FROM "ChapterContentRef" WHERE "chapterId" = :id LIMIT 1'),
                    {"id": chapter_id},
                )
            ).mappings().first()
            if exists:
                skipped += 1
                continue
            txt = str(doc.get("content") or "").strip()
            raw_html = str(doc.get("contentHtml") or doc.get("content") or "")
            if not txt:
                skipped += 1
                continue
            txt_href = f"legacy/{chapter_id}.txt"
            raw_href = f"legacy/{chapter_id}.raw.html"
            content_hash = hashlib.sha256(txt.encode("utf-8")).hexdigest()
            if not dry_run:
                storage.write_text(txt_href, txt)
                storage.write_text(raw_href, raw_html)
                await db.execute(
                    text(
                        'INSERT INTO "ChapterContentRef" ("chapterId", "txtHref", "rawHtmlHref", "contentHash") '
                        'VALUES (:chapter_id, :txt_href, :raw_href, :hash) '
                        'ON CONFLICT ("chapterId") DO NOTHING'
                    ),
                    {
                        "chapter_id": chapter_id,
                        "txt_href": txt_href,
                        "raw_href": raw_href,
                        "hash": content_hash,
                    },
                )
            mapped += 1
        if not dry_run:
            await db.commit()
    last_id = str(docs[-1]["_id"]) if docs else None
    summary = {
        "scanned": len(docs),
        "mapped": mapped,
        "skipped": skipped,
        "dryRun": dry_run,
        "contentRoot": settings.nas_content_root,
        "nextAfterId": last_id,
    }
    if state_file and last_id and not dry_run:
        Path(state_file).write_text(json.dumps({"afterId": last_id}, ensure_ascii=True), encoding="utf-8")
    print(summary)
 def main() -> None:
    parser = argparse.ArgumentParser(description="Backfill ChapterContentRef from Mongo chapters")
    parser.add_argument("--limit", type=int, default=1000)
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--after-id", type=str, default="")
    parser.add_argument("--state-file", type=str, default="")
    args = parser.parse_args()
    after_id = args.after_id.strip() or None
    state_file = args.state_file.strip() or None
    if state_file and not after_id:
        p = Path(state_file)
        if p.exists():
            try:
                after_id = json.loads(p.read_text(encoding="utf-8")).get("afterId")
            except Exception:
                after_id = None
    asyncio.run(backfill(limit=args.limit, dry_run=args.dry_run, after_id=after_id, state_file=state_file))
 if __name__ == "__main__":
    main()