feat(storage): implement NAS content storage with read/write capabilities
Build and Push Reader API Image / docker (push) Successful in 1m3s

feat(docker): configure NAS content and EPUB source directories in docker-compose

feat(migrations): add tables for SourceAsset, ImportJob, ChapterContentRef, and AssetNovelMapping

feat(scripts): create backfill script for populating ChapterContentRef from MongoDB chapters
This commit is contained in:
2026-04-30 01:53:52 +07:00
parent 82f502acd2
commit 9a3bb4b6ce
10 changed files with 1297 additions and 18 deletions
+5
View File
@@ -12,3 +12,8 @@ build/
.env.local .env.local
# Local debug/test scripts # Local debug/test scripts
test_*.py test_*.py
# Local NAS mount test data
data/epub-source/*
data/content/*
data/nas-content/*
+91
View File
@@ -90,6 +90,51 @@ Notes:
- `api-local` listens on port `8001` and automatically points to `postgres` + `mongo` containers. - `api-local` listens on port `8001` and automatically points to `postgres` + `mongo` containers.
- `web` listens on port `3000` and calls API internally through `http://api:8000`. - `web` listens on port `3000` and calls API internally through `http://api:8000`.
### NAS mount points (chapter content + EPUB source)
API containers now reserve two mount folders:
- `/data/content`: converted chapter files (`txt` + `raw_html`)
- `/data/epub-source`: source EPUB library
Default env mapping (already wired in compose):
```env
NAS_CONTENT_ROOT=/data/content
EPUB_SOURCE_ROOT=/data/epub-source
```
If you want to bind to host folders for local testing:
```yaml
services:
api:
volumes:
- /absolute/local/path/content:/data/content
- /absolute/local/path/epub-source:/data/epub-source
```
If you want to use NFS-backed docker volumes, define them under `volumes:`. Example:
```yaml
volumes:
nas_chapter_content:
driver: local
driver_opts:
type: nfs
o: addr=100.93.79.10,nolock,soft,rw
device: ":/volume2/apps/reader-content"
nas_epub_source:
driver: local
driver_opts:
type: nfs
o: addr=100.93.79.10,nolock,soft,rw
device: ":/volume2/apps/reader-epub"
```
For your EPUB structure (folder per novel, multiple `.epub` parts inside), mount the parent folder to `/data/epub-source`.
## Implemented Endpoints ## Implemented Endpoints
- GET /api/health - GET /api/health
@@ -109,6 +154,52 @@ Notes:
- GET /api/truyen/suggest - GET /api/truyen/suggest
- GET /api/chapters/{chapterId} - GET /api/chapters/{chapterId}
## NAS Migration Ops
### 1) Apply SQL migration manually
Run SQL in `migrations/2026_04_nas_content_storage.sql` against PostgreSQL.
### 2) Backfill existing chapter content from Mongo -> NAS + ChapterContentRef
Dry-run first:
```bash
python scripts/backfill_chapter_content_refs.py --limit 1000 --dry-run
```
Then execute:
```bash
python scripts/backfill_chapter_content_refs.py --limit 1000
```
You can run multiple batches by increasing/changing `--limit`.
Checkpoint/resume mode:
```bash
python scripts/backfill_chapter_content_refs.py --limit 1000 --state-file .backfill_state.json
```
Or continue from a known ObjectId:
```bash
python scripts/backfill_chapter_content_refs.py --limit 1000 --after-id 680f7f3a2f0d53f4f2b7a123
```
## Chapter Read Cutover Flag
Set in `.env`:
```env
CHAPTER_CONTENT_MODE=nas_first
```
Values:
- `nas_first` (default): read NAS ref first, fallback Mongo.
- `mongo_first`: keep Mongo-first during cautious rollout.
## Notes ## Notes
- Web session auth is supported via NextAuth session cookies (next-auth.session-token and secure variants). - Web session auth is supported via NextAuth session cookies (next-auth.session-token and secure variants).
+30
View File
@@ -0,0 +1,30 @@
# Rollout Checklist - NAS Chapter Storage
## Pre-Deploy
- [ ] Backup PostgreSQL schema + critical tables
- [ ] Verify NAS mount/access permissions in API runtime
- [ ] Enable feature flags (default: Mongo fallback on)
## Deploy Order
1. Deploy DB migrations
2. Deploy API with dual-read disabled by default
3. Enable discover/approve/convert job APIs
4. Run pilot import set (small curated EPUB batch)
5. Enable NAS-first for pilot users/env
6. Gradually ramp NAS-first traffic
## Runtime Verification
- [ ] `/api/health` stable
- [ ] Chapter read success rate >= target
- [ ] NAS read timeout/error rate below threshold
- [ ] Mongo fallback rate trending down
## Rollback
- [ ] Switch feature flag to Mongo-first immediately
- [ ] Stop import jobs
- [ ] Keep imported refs for investigation (no destructive cleanup)
## Post-Deploy
- [ ] Compare chapter counts and random content samples
- [ ] Review failed/review_required import queue
- [ ] Publish release notes for web/mobile teams
+4 -1
View File
@@ -121,7 +121,10 @@ async def resolve_current_user(db: AsyncSession, request: Request) -> dict[str,
return await _get_user_from_session_cookie(db, request) return await _get_user_from_session_cookie(db, request)
async def require_current_user(db: AsyncSession, request: Request) -> dict[str, Any]: async def require_current_user(
request: Request,
db: AsyncSession = Depends(get_db_session),
) -> dict[str, Any]:
user = await resolve_current_user(db, request) user = await resolve_current_user(db, request)
if not user: if not user:
raise HTTPException(status_code=401, detail="Unauthorized") raise HTTPException(status_code=401, detail="Unauthorized")
+3
View File
@@ -20,6 +20,9 @@ class Settings(BaseSettings):
r2_secret_access_key: str = "" r2_secret_access_key: str = ""
r2_bucket_name: str = "" r2_bucket_name: str = ""
r2_public_base_url: str = "" r2_public_base_url: str = ""
nas_content_root: str = "./data/content"
epub_source_root: str = "./data/epub-source"
chapter_content_mode: str = "nas_first" # nas_first | mongo_first
deepseek_key: str = "" deepseek_key: str = ""
deepseek_model: str = "deepseek-chat" deepseek_model: str = "deepseek-chat"
+956 -16
View File
File diff suppressed because it is too large Load Diff
+33
View File
@@ -0,0 +1,33 @@
from __future__ import annotations
import hashlib
from pathlib import Path
from app.config import settings
class NasContentStorage:
def __init__(self, root_dir: str):
self.root = Path(root_dir).resolve()
self.root.mkdir(parents=True, exist_ok=True)
def _resolve(self, href: str) -> Path:
rel = href.strip().lstrip("/")
target = (self.root / rel).resolve()
if self.root not in target.parents and target != self.root:
raise ValueError("Invalid storage href")
return target
def read_text(self, href: str) -> str:
path = self._resolve(href)
return path.read_text(encoding="utf-8")
def write_text(self, href: str, content: str) -> dict[str, str | int]:
path = self._resolve(href)
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(content, encoding="utf-8")
digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
return {"href": href, "sha256": digest, "size": len(content.encode("utf-8"))}
storage = NasContentStorage(settings.nas_content_root)
+13
View File
@@ -9,6 +9,12 @@ services:
- .env - .env
ports: ports:
- "8000:8000" - "8000:8000"
environment:
NAS_CONTENT_ROOT: ${NAS_CONTENT_ROOT:-/data/content}
EPUB_SOURCE_ROOT: ${EPUB_SOURCE_ROOT:-/data/epub-source}
volumes:
- nas_chapter_content:/data/content
- nas_epub_source:/data/epub-source
restart: unless-stopped restart: unless-stopped
healthcheck: healthcheck:
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/api/health').read()"] test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/api/health').read()"]
@@ -30,8 +36,13 @@ services:
environment: environment:
DATABASE_URL: postgresql://reader:reader@postgres:5432/reader DATABASE_URL: postgresql://reader:reader@postgres:5432/reader
MONGODB_URI: mongodb://mongo:27017/reader MONGODB_URI: mongodb://mongo:27017/reader
NAS_CONTENT_ROOT: ${NAS_CONTENT_ROOT:-/data/content}
EPUB_SOURCE_ROOT: ${EPUB_SOURCE_ROOT:-/data/epub-source}
ports: ports:
- "8001:8000" - "8001:8000"
volumes:
- nas_chapter_content:/data/content
- nas_epub_source:/data/epub-source
restart: unless-stopped restart: unless-stopped
healthcheck: healthcheck:
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/api/health').read()"] test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/api/health').read()"]
@@ -104,3 +115,5 @@ volumes:
web_uploads: web_uploads:
postgres_data: postgres_data:
mongo_data: mongo_data:
nas_chapter_content:
nas_epub_source:
@@ -0,0 +1,43 @@
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE TABLE IF NOT EXISTS "SourceAsset" (
id TEXT PRIMARY KEY,
path TEXT NOT NULL,
sha256 TEXT NOT NULL,
opf_identifier TEXT,
title TEXT,
author TEXT,
status TEXT NOT NULL DEFAULT 'discovered',
"createdAt" TIMESTAMPTZ NOT NULL DEFAULT NOW(),
"updatedAt" TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE UNIQUE INDEX IF NOT EXISTS "SourceAsset_sha256_key" ON "SourceAsset"(sha256);
CREATE TABLE IF NOT EXISTS "ImportJob" (
id TEXT PRIMARY KEY,
"sourceAssetId" TEXT NOT NULL REFERENCES "SourceAsset"(id) ON DELETE CASCADE,
status TEXT NOT NULL DEFAULT 'pending',
error TEXT,
"createdAt" TIMESTAMPTZ NOT NULL DEFAULT NOW(),
"updatedAt" TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE IF NOT EXISTS "ChapterContentRef" (
"chapterId" TEXT PRIMARY KEY,
"txtHref" TEXT NOT NULL,
"rawHtmlHref" TEXT NOT NULL,
"contentHash" TEXT NOT NULL,
"createdAt" TIMESTAMPTZ NOT NULL DEFAULT NOW(),
"updatedAt" TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE IF NOT EXISTS "AssetNovelMapping" (
id TEXT PRIMARY KEY,
"sourceAssetId" TEXT NOT NULL REFERENCES "SourceAsset"(id) ON DELETE CASCADE,
"novelId" TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
note TEXT,
"createdAt" TIMESTAMPTZ NOT NULL DEFAULT NOW(),
"updatedAt" TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
+118
View File
@@ -0,0 +1,118 @@
from __future__ import annotations
import argparse
import asyncio
import hashlib
import json
from pathlib import Path
from bson import ObjectId
from sqlalchemy import text
from app.config import settings
from app.database import SessionLocal, mongo_db
from app.storage import storage
async def backfill(limit: int, dry_run: bool, after_id: str | None, state_file: str | None) -> None:
query = {
"$or": [
{"content": {"$exists": True, "$type": "string", "$ne": ""}},
{"contentHtml": {"$exists": True, "$type": "string", "$ne": ""}},
]
}
if after_id:
query["_id"] = {"$gt": ObjectId(after_id)}
docs = (
await mongo_db["chapters"]
.find(query, {"content": 1, "contentHtml": 1})
.sort("_id", 1)
.limit(limit)
.to_list(limit)
)
mapped = 0
skipped = 0
async with SessionLocal() as db:
for doc in docs:
chapter_id = str(doc.get("_id") or "")
if not chapter_id:
skipped += 1
continue
exists = (
await db.execute(
text('SELECT "chapterId" FROM "ChapterContentRef" WHERE "chapterId" = :id LIMIT 1'),
{"id": chapter_id},
)
).mappings().first()
if exists:
skipped += 1
continue
txt = str(doc.get("content") or "").strip()
raw_html = str(doc.get("contentHtml") or doc.get("content") or "")
if not txt:
skipped += 1
continue
txt_href = f"legacy/{chapter_id}.txt"
raw_href = f"legacy/{chapter_id}.raw.html"
content_hash = hashlib.sha256(txt.encode("utf-8")).hexdigest()
if not dry_run:
storage.write_text(txt_href, txt)
storage.write_text(raw_href, raw_html)
await db.execute(
text(
'INSERT INTO "ChapterContentRef" ("chapterId", "txtHref", "rawHtmlHref", "contentHash") '
'VALUES (:chapter_id, :txt_href, :raw_href, :hash) '
'ON CONFLICT ("chapterId") DO NOTHING'
),
{
"chapter_id": chapter_id,
"txt_href": txt_href,
"raw_href": raw_href,
"hash": content_hash,
},
)
mapped += 1
if not dry_run:
await db.commit()
last_id = str(docs[-1]["_id"]) if docs else None
summary = {
"scanned": len(docs),
"mapped": mapped,
"skipped": skipped,
"dryRun": dry_run,
"contentRoot": settings.nas_content_root,
"nextAfterId": last_id,
}
if state_file and last_id and not dry_run:
Path(state_file).write_text(json.dumps({"afterId": last_id}, ensure_ascii=True), encoding="utf-8")
print(summary)
def main() -> None:
parser = argparse.ArgumentParser(description="Backfill ChapterContentRef from Mongo chapters")
parser.add_argument("--limit", type=int, default=1000)
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--after-id", type=str, default="")
parser.add_argument("--state-file", type=str, default="")
args = parser.parse_args()
after_id = args.after_id.strip() or None
state_file = args.state_file.strip() or None
if state_file and not after_id:
p = Path(state_file)
if p.exists():
try:
after_id = json.loads(p.read_text(encoding="utf-8")).get("afterId")
except Exception:
after_id = None
asyncio.run(backfill(limit=args.limit, dry_run=args.dry_run, after_id=after_id, state_file=state_file))
if __name__ == "__main__":
main()