🔄 卡若AI 同步 2026-02-22 09:12 | 更新：总索引与入口、火种知识模型、运营中枢参考资料、运营中枢工作台 | 排除 >20MB: 8 个

2026-02-22 09:12:01 +08:00
parent a46942b3fb
commit 42453c643d
7 changed files with 387 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -54,6 +54,9 @@ Serverruntime/
 # 大文件外置目录（本地保留，不上传）
 _大文件外置/

+# Local codebase index (can be large; each environment builds its own)
+04_卡火（火）/火种_知识模型/本地代码库索引/index/
+
 # === 自动排除：超过5MB的文件（脚本自动管理，勿手动修改）===
 01_卡资（金）/_团队成员/金盾/存客宝备份/cunkebao_v3/Server/application/common/Server.rar
 01_卡资（金）/_团队成员/金盾/存客宝备份/cunkebao_v3/Server/fonts.ttf
--- a/04_卡火（火）/火种_知识模型/本地代码库索引/SKILL.md
+++ b/04_卡火（火）/火种_知识模型/本地代码库索引/SKILL.md
@@ -0,0 +1,94 @@
+---
+name: 本地代码库索引
+description: Index the 卡若AI codebase with local Ollama embeddings and run semantic search over it; nothing is uploaded to the cloud
+triggers: 本地索引、本地搜索、不上传云端、本地代码库、索引卡若AI
+owner: 火种
+group: 火
+version: "1.0"
+updated: "2026-02-22"
+---
+
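+# Local Codebase Index (本地代码库索引)
+
+> **Administrator**: 卡火（火）  
+> **Catchphrase**: "Let me think..."  
+> **Responsibility**: Build an embedding index of the 卡若AI codebase and run semantic search on the local machine, **without uploading any data to the cloud**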
+
+---
+
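+## 1. What it does
+
+- **Build an index**: scans the 卡若AI directory, embeds the content locally with `nomic-embed-text`, and stores the vectors in a local file
+- **Semantic search**: retrieves the most relevant code/document chunks for a natural-language question, entirely on the local machine
+- **Fully local**: embedding and indexing both run on this machine; nothing is uploaded to the cloud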
+
+---
+
+## 2. Steps
+
+### 2.1 Prerequisites
+
+1. **Ollama is installed and running**: `ollama serve` is running in the background
+2. **nomic-embed-text has been pulled**: `ollama pull nomic-embed-text`
+3. **Check**: `curl http://localhost:11434/api/tags` lists `nomic-embed-text`
+
+### 2.2 Build the index (first run or refresh)
+
+```bash
+cd /Users/karuo/Documents/个人/卡若AI
+python3 04_卡火（火）/火种_知识模型/本地代码库索引/脚本/local_codebase_index.py index
+```
+
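+- Default index root: `/Users/karuo/Documents/个人/卡若AI` (override with the `KARUO_INDEX_ROOT` environment variable)
+- Excluded by default: `node_modules`, `.git`, `__pycache__`, `.venv`, and similar directories
+- Index output: `04_卡火（火）/火种_知识模型/本地代码库索引/index/local_index.json`
+
+### 2.3 Semantic search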
+
+```bash
+python3 04_卡火（火）/火种_知识模型/本地代码库索引/脚本/local_codebase_index.py search "如何做语义搜索"
+```
+
+Or, limiting the number of results:
+
+```bash
+python3 04_卡火（火）/火种_知识模型/本地代码库索引/脚本/local_codebase_index.py search "本地模型embed怎么用" --top 5
+```
+
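+- Output: file path, chunk content (truncated to 400 characters), and cosine-similarity score
+
+### 2.4 Using it in a Cursor conversation
+
+1. **Turn off Cursor's cloud indexing**: Settings → Indexing & Docs → Pause Indexing
+2. **Build the local index** (see 2.2)
+3. In the conversation, say 「用本地索引查 XXX」 ("look up XXX with the local index") or 「@本地索引 搜索 YYY」 ("@本地索引 search YYY")
+4. The AI runs `python3 .../local_codebase_index.py search "XXX"` and answers based on the results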
+
+---
+
+## 3. Working alongside Cursor
+
+| Cursor feature        | Recommendation               |
+|:----------------------|:-----------------------------|
+| Codebase Indexing     | **Pause** or **Delete**      |
+| Local index           | Re-run `index` periodically  |
+| In-chat retrieval     | Say 「本地索引搜索 XXX」     |
+
+See: `运营中枢/参考资料/Cursor索引与本地索引方案.md`
+
+---
+
+## 4. Related files
+
+| File | Description |
+|:-----|:-----|
+| `脚本/local_codebase_index.py` | Main indexing and search script |
+| `index/local_index.json` | Local index data (generated by the `index` command) |
+| `运营中枢/参考资料/Cursor索引与本地索引方案.md` | Design notes |
+
+---
+
+## 5. Dependencies
+
+- Prerequisite skill: `04_卡火（火）/火种_知识模型/本地模型` (Ollama + nomic-embed-text)
+- External: `ollama`, `requests` (same as local_llm_sdk)
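Reviewer note on section 2.1 of the SKILL.md above: the `curl` check can also be scripted. The following is a minimal sketch, assuming the `/api/tags` response is a JSON object with a `models` list whose entries carry a `name` field such as `nomic-embed-text:latest`. The helper names `has_model` and `check_ollama` are hypothetical and are not part of the committed script.

```python
import json
import urllib.request
from typing import Any, Dict

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # endpoint quoted in section 2.1


def has_model(tags_payload: Dict[str, Any], wanted: str = "nomic-embed-text") -> bool:
    """Return True if `wanted` appears in an /api/tags response body.

    Assumed payload shape: {"models": [{"name": "nomic-embed-text:latest"}, ...]}.
    """
    for model in tags_payload.get("models", []):
        name = str(model.get("name", ""))
        # Model names usually carry a tag suffix such as ":latest".
        if name == wanted or name.split(":", 1)[0] == wanted:
            return True
    return False


def check_ollama(url: str = OLLAMA_TAGS_URL, timeout: float = 3.0) -> bool:
    """Query the local server; False if it is unreachable or the model is missing."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # URLError is an OSError; a malformed body raises a ValueError subclass.
        return False
    return has_model(payload)
```

Calling something like `check_ollama()` before `index` could let the script fail fast with one clear message instead of reporting an embed error per chunk.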
--- a/04_卡火（火）/火种_知识模型/本地代码库索引/脚本/local_codebase_index.py
+++ b/04_卡火（火）/火种_知识模型/本地代码库索引/脚本/local_codebase_index.py
@@ -0,0 +1,200 @@
+#!/usr/bin/env python3
+"""
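+卡若AI local codebase index
+
+Builds a local embedding index of the 卡若AI directory and supports semantic search. No data is uploaded to the cloud.
+Dependencies: Ollama + nomic-embed-text, same as local_llm_sdk.
+
+Usage:
+  python local_codebase_index.py index            # build the index
+  python local_codebase_index.py search "query"   # semantic search
+  python local_codebase_index.py status           # show index status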
+"""
+
+import os
+import sys
+import json
+import math
+import argparse
+from pathlib import Path
+from typing import List, Dict, Any
+
+# Repository root (four levels above this script's directory)
+_REPO_ROOT = Path(__file__).resolve().parents[4]
+_SCRIPT_DIR = Path(__file__).resolve().parent
+_INDEX_DIR = _SCRIPT_DIR.parent / "index"
+_INDEX_FILE = _INDEX_DIR / "local_index.json"
+
+# Index configuration
+INDEX_ROOT = os.environ.get("KARUO_INDEX_ROOT", str(_REPO_ROOT))
+EXCLUDE_DIRS = {
+    "node_modules", ".git", "__pycache__", ".venv", "venv",
+    "dist", "build", ".next", ".cursor", ".github", ".gitea",
+    "chroma_db", "大文件外置", "_大文件外置",  # .gitignore names the directory "_大文件外置"
+}
+EXCLUDE_SUFFIXES = {".pyc", ".pyo", ".map", ".min.js", ".lock", ".log"}
+CHUNK_SIZE = 800   # about 800 characters per chunk, a convenient size for embedding
+CHUNK_OVERLAP = 80
+
+# File suffixes included in the index
+INCLUDE_SUFFIXES = {".md", ".py", ".js", ".ts", ".tsx", ".json", ".mdc", ".txt", ".sh"}
+
+
+def _add_local_llm():
+    """Make sure local_llm_sdk is importable."""
+    sdk_dir = _REPO_ROOT / "04_卡火（火）" / "火种_知识模型" / "本地模型" / "脚本"
+    if str(sdk_dir) not in sys.path:
+        sys.path.insert(0, str(sdk_dir))
+
+
+def _chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[str]:
+    """Split long text into overlapping chunks."""
+    chunks = []
+    start = 0
+    while start < len(text):
+        end = min(start + size, len(text))
+        chunk = text[start:end].strip()
+        if chunk:
+            chunks.append(chunk)
+        start = end if end == len(text) else end - overlap  # stop at the end; avoids a redundant tail chunk
+    return chunks
+
+
+def _collect_files(root: str) -> List[Dict[str, str]]:
+    """Collect the files to index; returns [{path, content}]."""
+    items = []
+    root_path = Path(root)
+    for fp in root_path.rglob("*"):
+        if not fp.is_file():
+            continue
+        rel = fp.relative_to(root_path)
+        parts = rel.parts
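+        if any(d in parts for d in EXCLUDE_DIRS) or _INDEX_DIR in fp.resolve().parents:  # also skip our own index output
+            continue
+        if any(fp.name.lower().endswith(s) for s in EXCLUDE_SUFFIXES):  # endswith() so ".min.js" matches; fp.suffix is only ".js"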
+            continue
+        if fp.suffix.lower() not in INCLUDE_SUFFIXES:
+            continue
+        try:
+            content = fp.read_text(encoding="utf-8", errors="ignore")
+        except Exception:
+            continue
+        if len(content.strip()) < 20:
+            continue
+        items.append({"path": str(rel), "content": content})
+    return items
+
+
+def _embed_via_ollama(text: str) -> List[float]:
+    """Get a text embedding via Ollama."""
+    _add_local_llm()
+    from local_llm_sdk import get_llm
+    llm = get_llm()
+    result = llm.embed(text[:8000], show_notice=False)
+    if result.get("success") and result.get("embedding"):
+        return result["embedding"]
+    raise RuntimeError(f"Embed failed: {result}")
+
+
+def cmd_index():
+    """Build the index."""
+    import time
+    print(f"📁 Index root: {INDEX_ROOT}")
+    print("📂 Collecting files...")
+    files = _collect_files(INDEX_ROOT)
+    print(f"   {len(files)} files found")
+    if not files:
+        print("   No files to index")
+        return
+    _add_local_llm()
+    from local_llm_sdk import get_llm
+    llm = get_llm()
+    records = []
+    total = 0
+    for i, f in enumerate(files):
+        path, content = f["path"], f["content"]
+        chunks = _chunk_text(content)
+        for j, chunk in enumerate(chunks):
+            if len(chunk) < 20:
+                continue
+            try:
+                emb = llm.embed(chunk[:8000], show_notice=False)
+                if emb.get("success") and emb.get("embedding"):
+                    records.append({
+                        "path": path,
+                        "chunk": chunk,
+                        "embedding": emb["embedding"]
+                    })
+                    total += 1
+            except Exception as e:
+                print(f"   ⚠️ {path} chunk {j}: {e}")
+        if (i + 1) % 20 == 0:
+            print(f"   Processed {i+1}/{len(files)} files, {total} chunks")
+        time.sleep(0.3)
+    _INDEX_DIR.mkdir(parents=True, exist_ok=True)
+    with open(_INDEX_FILE, "w", encoding="utf-8") as f:
+        json.dump({"records": records, "root": INDEX_ROOT}, f, ensure_ascii=False, indent=0)
+    print(f"✅ Index complete: {len(records)} chunks → {_INDEX_FILE}")
+
+
+def cmd_search(query: str, top_k: int = 5):
+    """Semantic search."""
+    if not _INDEX_FILE.exists():
+        print("❌ Index not found. Run first: python local_codebase_index.py index")
+        return
+    with open(_INDEX_FILE, "r", encoding="utf-8") as f:
+        data = json.load(f)
+    records = data.get("records", [])
+    if not records:
+        print("❌ Index is empty")
+        return
+    query_emb = _embed_via_ollama(query)
+    n1 = math.sqrt(sum(a * a for a in query_emb))  # query norm, computed once
+    scores = []
+    for r in records:
+        v = r["embedding"]
+        dot = sum(a * b for a, b in zip(query_emb, v))
+        n2 = math.sqrt(sum(b * b for b in v))
+        score = dot / (n1 * n2) if n1 and n2 else 0
+        scores.append((score, r))
+    scores.sort(key=lambda x: -x[0])
+    print(f"\n🔍 Query: {query}\n")
+    for i, (score, r) in enumerate(scores[:top_k], 1):
+        print(f"--- [{i}] {r['path']} (score={score:.3f}) ---")
+        txt = r["chunk"][:400].replace("\n", " ")
+        print(f"{txt}{'...' if len(r['chunk']) > 400 else ''}\n")
+
+
+def cmd_status():
+    """Show index status."""
+    if not _INDEX_FILE.exists():
+        print("❌ Index not created. Run: python local_codebase_index.py index")
+        return
+    with open(_INDEX_FILE, "r", encoding="utf-8") as f:
+        data = json.load(f)
+    n = len(data.get("records", []))
+    root = data.get("root", "?")
+    print(f"📁 Index root: {root}")
+    print(f"📊 Indexed chunks: {n}")
+    print(f"📄 Index file: {_INDEX_FILE}")
+
+
+def main():
+    parser = argparse.ArgumentParser(description="卡若AI local codebase index")
+    sub = parser.add_subparsers(dest="cmd", required=True)
+    sub.add_parser("index")
+    sp = sub.add_parser("search")
+    sp.add_argument("query", help="search question")
+    sp.add_argument("--top", "-n", type=int, default=5, help="return the top N results")
+    sub.add_parser("status")
+    args = parser.parse_args()
+    if args.cmd == "index":
+        cmd_index()
+    elif args.cmd == "search":
+        cmd_search(args.query, top_k=args.top)
+    elif args.cmd == "status":
+        cmd_status()
+
+
+if __name__ == "__main__":
+    main()
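Reviewer note on `_chunk_text` above: with `CHUNK_SIZE = 800` and `CHUNK_OVERLAP = 80`, each window starts 720 characters after the previous one. The standalone sketch below computes only the window offsets to make that arithmetic visible. `chunk_spans` is a hypothetical helper, not part of the commit; it stops once a window reaches the end of the text, and it rejects `overlap >= size`, a setting under which a sliding window would never advance.

```python
from typing import List, Tuple


def chunk_spans(length: int, size: int = 800, overlap: int = 80) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of overlapping windows over `length` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    spans: List[Tuple[int, int]] = []
    start = 0
    while start < length:
        end = min(start + size, length)
        spans.append((start, end))
        if end == length:
            break  # the last window already covers the end of the text
        start = end - overlap  # same as stepping forward by size - overlap
    return spans


if __name__ == "__main__":
    for span in chunk_spans(2000):
        print(span)
```

For a 2000-character file this produces three windows starting at offsets 0, 720, and 1440, so adjacent chunks share 80 characters.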
--- a/SKILL_REGISTRY.md
+++ b/SKILL_REGISTRY.md
@@ -77,6 +77,7 @@
 | F06 | 智能追问 | 火眼 | 追问模式、需求澄清 | `04_卡火（火）/火眼_智能追问/智能追问/SKILL.md` | 通过追问澄清模糊需求 |
 | F07 | 读书笔记(模型) | 火种 | 五行拆书 | `04_卡火（火）/火种_知识模型/读书笔记/SKILL.md` | 本地模型辅助拆书 |
 | F08 | 本地模型 | 火种 | ollama、qwen、本地AI | `04_卡火（火）/火种_知识模型/本地模型/SKILL.md` | Ollama/Qwen 本地部署 |
+| F21 | 本地代码库索引 | 火种 | 本地索引、本地搜索、不上传云端 | `04_卡火（火）/火种_知识模型/本地代码库索引/SKILL.md` | 本地 embedding 索引与语义检索，不上传云端 |

 ## 土组 · 卡土（商业复制裂变）

--- a/运营中枢/参考资料/Cursor索引与本地索引方案.md
+++ b/运营中枢/参考资料/Cursor索引与本地索引方案.md
@@ -0,0 +1,87 @@
+# Cursor Indexing vs Local Indexing · Design Notes
+
+> Version: 1.0 | Updated: 2026-02-22
+> Question: Cursor's Codebase Indexing uploads embeddings to the cloud. Can indexing be done entirely locally?
+
+---
+
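+## 1. Current state of Cursor
+
+### 1.1 Current behavior (per Cursor Settings → Indexing & Docs)
+
+| Data type      | Stored in    | Notes                                    |
+|:---------------|:-------------|:-----------------------------------------|
+| Source files   | Local        | Code always stays on this machine        |
+| Embeddings     | **Cloud**    | Vectors used for semantic understanding  |
+| Metadata       | **Cloud**    | File paths, line numbers, etc.           |
+
+**Conclusion**: Cursor currently does **not** support purely local indexing. There is no option to disable the cloud upload; the only choices are to turn indexing off or accept cloud storage.
+
+### 1.2 Community requests
+
+- The Cursor Forum has a feature request: [It's possible to embedding codes entirely at local?](https://forum.cursor.com/t/its-possible-to-embedding-codes-entirely-at-local/15911)
+- What users want: just as chat can point at a self-hosted Ollama, embedding should be able to use a local API
+- **As of 2025**: Cursor has not shipped this capability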
+
+---
+
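+## 2. Options compared
+
+| Option                          | Where data lives  | Cursor integration    | Effort    |
+|:--------------------------------|:------------------|:----------------------|:----------|
+| Turn off Cursor indexing        | Nowhere           | Native                | Very low  |
+| 卡若AI local codebase index     | **Fully local**   | Invoked via a Skill   | Medium    |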
+
+---
+
+## 3. 卡若AI local index (recommended)
+
+### 3.1 Existing building blocks
+
+卡若AI already has:
+- **nomic-embed-text**: local Ollama embedding model (274MB)
+- **local_llm_sdk**: `embed()`, `semantic_search()`, `batch_embed()`
+- **运营中枢/local_llm**: unified entry point for calls
+
+### 3.2 Architecture
+
+```
+Local disk
+├── Code/docs (.md, .py, .js, etc.)
+├── Ollama nomic-embed-text (local embedding)
+├── Vector database / JSON store (local)
+└── Retrieval script (index + search)
+```
+
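+**Flow**:
+1. **Index**: scan the 卡若AI directory → chunk → embed locally → store locally
+2. **Search**: user question → embed the query locally → rank by similarity → return results
+3. **In Cursor**: trigger from a conversation via the 「本地索引搜索」 Skill or `@本地索引`
+
+### 3.3 Working alongside Cursor
+
+| Step | Action |
+|:-----|:-----|
+| ① | In Cursor Settings → Indexing & Docs, choose **Pause Indexing** or **Delete Index** |
+| ② | Run the `index` command of the 卡若AI local index Skill to index this project locally |
+| ③ | In a conversation, say 「用本地索引查 XXX」 or 「@本地索引 搜索 YYY」 |
+| ④ | The AI calls `脚本/local_codebase_index.py search "XXX"` and answers from the local results |
+
+**Note**: Cursor's AI still uses its built-in codebase understanding (based on @ files, open files, and so on), but it will **no longer** send embeddings to the cloud. The local index is a **supplement** for semantic search that you want to keep fully local.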
+
+---
+
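+## 4. When to use it
+
+- ✅ Sensitive projects where no embedding may be uploaded
+- ✅ Offline environments that cannot reach Cursor's cloud
+- ✅ You need semantic search and accept an index-first, search-later workflow
+- ❌ Not suitable for features that depend on Cursor's native index (for example, real-time smart completion that @-references the whole repo)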
+
+---
+
+## 5. References
+
+- Cursor Forum: [It's possible to embedding codes entirely at local?](https://forum.cursor.com/t/its-possible-to-embedding-codes-entirely-at-local/15911)
+- 本地模型 SKILL: `04_卡火（火）/火种_知识模型/本地模型/SKILL.md`
+- 本地代码库索引 SKILL: `04_卡火（火）/火种_知识模型/本地代码库索引/SKILL.md`
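Reviewer note on the search step in section 3.2 above: ranking is a cosine-similarity comparison between the query vector and every stored chunk vector. The sketch below shows that step on made-up 3-dimensional vectors. Real vectors would come from `nomic-embed-text`; the names `cosine`, `top_k`, and `toy_records` are illustrative only and do not appear in the commit.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec: Sequence[float],
          records: List[Dict[str, Any]],
          k: int = 5) -> List[Tuple[float, str]]:
    """Score every record against the query and return the k best (score, path) pairs."""
    scored = [(cosine(query_vec, r["embedding"]), r["path"]) for r in records]
    scored.sort(key=lambda item: -item[0])  # highest similarity first
    return scored[:k]


# Toy stand-ins for index records; the embeddings are invented for illustration.
toy_records = [
    {"path": "a.md", "embedding": [1.0, 0.0, 0.0]},
    {"path": "b.py", "embedding": [0.0, 1.0, 0.0]},
    {"path": "c.ts", "embedding": [0.7, 0.7, 0.0]},
]

if __name__ == "__main__":
    for score, path in top_k([1.0, 0.1, 0.0], toy_records, k=2):
        print(f"{path}: {score:.3f}")
```

A linear scan like this is simple and needs no extra dependency; whether it stays fast enough depends on how many chunks the index holds.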
--- a/运营中枢/工作台/gitea_push_log.md
+++ b/运营中枢/工作台/gitea_push_log.md
@@ -59,3 +59,4 @@
 | 2026-02-22 06:47:15 | 🔄 卡若AI 同步 2026-02-22 06:47 | 更新：卡木、运营中枢工作台 | 排除 >20MB: 8 个 |
 | 2026-02-22 07:20:22 | 🔄 卡若AI 同步 2026-02-22 07:20 | 更新：总索引与入口、金仓、卡木、运营中枢工作台 | 排除 >20MB: 9 个 |
 | 2026-02-22 07:27:08 | 🔄 卡若AI 同步 2026-02-22 07:27 | 更新：金仓、卡木、运营中枢工作台 | 排除 >20MB: 9 个 |
+| 2026-02-22 09:06:58 | 🔄 卡若AI 同步 2026-02-22 09:06 | 更新：总索引与入口、金仓、运营中枢工作台 | 排除 >20MB: 8 个 |
--- a/运营中枢/工作台/代码管理.md
+++ b/运营中枢/工作台/代码管理.md
@@ -62,3 +62,4 @@
 | 2026-02-22 06:47:15 | 成功 | 成功 | 🔄 卡若AI 同步 2026-02-22 06:47 | 更新：卡木、运营中枢工作台 | 排除 >20MB: 8 个 | [仓库](http://open.quwanzhi.com:3000/fnvtk/karuo-ai) [百科](http://open.quwanzhi.com:3000/fnvtk/karuo-ai/wiki) |
 | 2026-02-22 07:20:22 | 成功 | 成功 | 🔄 卡若AI 同步 2026-02-22 07:20 | 更新：总索引与入口、金仓、卡木、运营中枢工作台 | 排除 >20MB: 9 个 | [仓库](http://open.quwanzhi.com:3000/fnvtk/karuo-ai) [百科](http://open.quwanzhi.com:3000/fnvtk/karuo-ai/wiki) |
 | 2026-02-22 07:27:08 | 成功 | 成功 | 🔄 卡若AI 同步 2026-02-22 07:27 | 更新：金仓、卡木、运营中枢工作台 | 排除 >20MB: 9 个 | [仓库](http://open.quwanzhi.com:3000/fnvtk/karuo-ai) [百科](http://open.quwanzhi.com:3000/fnvtk/karuo-ai/wiki) |
+| 2026-02-22 09:06:58 | 成功 | 成功 | 🔄 卡若AI 同步 2026-02-22 09:06 | 更新：总索引与入口、金仓、运营中枢工作台 | 排除 >20MB: 8 个 | [仓库](http://open.quwanzhi.com:3000/fnvtk/karuo-ai) [百科](http://open.quwanzhi.com:3000/fnvtk/karuo-ai/wiki) |