🔄 卡若AI sync 2026-02-22 13:08 | Updated: 卡木, master index and entry points, operations hub workbench | Excluded >20MB: 8 files
106
03_卡木(木)/木叶_视频内容/抖音视频解析/SKILL.md
Normal file
@@ -0,0 +1,106 @@
---
name: 抖音视频解析
description: Douyin link → parse IDs → extract caption → download video. Given any Douyin video link, automatically parses aweme_id/video_id/file_id, extracts the title and caption, and downloads the watermark-free video.
triggers: 抖音视频、抖音链接、抖音解析、抖音下载、提取抖音文案、抖音无水印
owner: 木叶
group: 木
version: "1.0"
updated: "2026-02-22"
---

# Douyin Video Parsing (抖音视频解析)

> **Input**: a Douyin video link (short link `v.douyin.com/xxx` or full link `www.douyin.com/video/xxx`)
> **Output**: the parsed IDs, the caption (title / body / hashtags), and the downloaded video file

---
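
## Core capabilities

1. **Parse IDs**: extract `aweme_id`, `video_id`, and `file_id` from the link or the page
2. **Extract the caption**: extract the title, body text, and hashtags from the page metadata
3. **Download the video**: fetch the watermark-free video and save it locally

---

## Trigger words

- 抖音视频 (Douyin video), 抖音链接 (Douyin link)
- 抖音解析 (Douyin parsing), 抖音下载 (Douyin download)
- 提取抖音文案 (extract Douyin caption), 抖音无水印 (Douyin watermark-free)

---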
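
## Execution steps

### When the user provides a Douyin link

1. **Parse the link**: identify whether it is a short link or a full link, and extract `aweme_id`
2. **Fetch the page**: request the page with `requests` (mobile UA); if that fails, fall back to the MCP browser
3. **Extract the caption**: extract the title, body text, and hashtags from the page `<title>`, meta tags, `__vid`, `ROUTER_DATA`, and similar sources
4. **Extract the video URL**: take the CDN/play link from `<source src="...">` or from the embedded JSON
5. **Download the video**: stream the download with `requests`, preferring the watermark-free link (`playwm` → `play`)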
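
Step 1 above can be sketched roughly as follows. This is a minimal illustration rather than the script's actual code: `classify_link` and `extract_aweme_id` are hypothetical helper names, and the link is assumed to include its `https://` scheme.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def classify_link(url: str) -> str:
    """Return "short", "full", or "unknown" based on the link's host."""
    host = urlparse(url.strip()).netloc.lower()
    if host == "v.douyin.com":
        return "short"  # must be resolved via redirect before parsing
    if host == "douyin.com" or host.endswith(".douyin.com"):
        return "full"
    return "unknown"


def extract_aweme_id(url: str) -> Optional[str]:
    """Extract the numeric aweme_id from a full /video/<id> link."""
    m = re.search(r"/video/(\d+)", url)
    return m.group(1) if m else None
```

A short link yields no `aweme_id` until it has been resolved, which is why the fetch in step 2 has to follow redirects first.
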

### One-shot commands

```bash
# Quote the path: it contains parentheses, which the shell would otherwise try to interpret
cd "/Users/karuo/Documents/个人/卡若AI/03_卡木(木)/木叶_视频内容/抖音视频解析/脚本"
python3 douyin_parse.py "https://v.douyin.com/SpVK8mlOUUo/"

# Parse only, do not download
python3 douyin_parse.py "https://v.douyin.com/xxx" --no-download

# Specify the output directory
python3 douyin_parse.py "https://v.douyin.com/xxx" -o /path/to/output
```

---

## Related files

| File | Description |
|------|------|
| `脚本/douyin_parse.py` | Main parsing and download script |
| `参考资料/ID与文案解析规则.md` | Rules for extracting IDs, captions, and video URLs |

---

## Output directory

- **Video file**: `/Users/karuo/Documents/卡若Ai的文件夹/视频/` (or the directory given with `-o`)
- **Caption JSON**: `{aweme_id}_文案.json` in the same directory
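
The naming scheme can be illustrated with a small sketch. It assumes the video file is named `{aweme_id}_{title}.mp4` with the title sanitised, as the accompanying script does; `output_paths` is a hypothetical helper name, not a function in the script.

```python
import re
from pathlib import Path
from typing import Tuple


def output_paths(out_dir: Path, aweme_id: str, title: str) -> Tuple[Path, Path]:
    """Return (caption_json_path, video_path) for one parsed video."""
    # Drop characters that are awkward in file names; fall back to the ID if nothing is left
    safe_title = re.sub(r"[^\w\s-]", "", title)[:50] or aweme_id
    caption = out_dir / f"{aweme_id}_文案.json"
    video = out_dir / f"{aweme_id}_{safe_title}.mp4"
    return caption, video
```
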

---

## AI execution notes (Cursor)

When the user provides a Douyin link:

1. Read this SKILL.md
2. Run `python3 douyin_parse.py "<link provided by the user>"`
3. If `requests` is blocked (403 or timeout), use the MCP browser:
   - `browser_navigate` to the link
   - `browser_snapshot` to get the page title (which contains the caption)
   - Extract `__vid`, `video_id`, `file_id`, and `<source src>` from the snapshot or the page source
   - Pass the extracted values to the script, or assemble the download URL manually
4. Reply to the user with the results in the review (复盘) format

---

## Dependencies

- Python 3.8+
- requests
- Optional: MCP browser (when `requests` fails)

---

## Parsing rules (summary)

| Field | Source | Example |
|------|------|------|
| aweme_id / __vid | URL or `__vid=` | 7607519346462286491 |
| video_id | `video_id=` | v02f52g10003d69l7afog65sirkjgcag |
| file_id | `file_id=` | f7a8f7b2af594e6d93f3588e7ff4ec66 |
| Caption | Page title, meta, ROUTER_DATA | Title + body + hashtags |
| Video URL | `<source src>` or play API | douyinvod.com / aweme/v1/play |
57
03_卡木(木)/木叶_视频内容/抖音视频解析/参考资料/ID与文案解析规则.md
Normal file
@@ -0,0 +1,57 @@
# Douyin Video ID and Caption Parsing Rules

> Used to extract aweme_id, video_id, file_id, the caption, and the video URL from a Douyin page's HTML.

---

## 1. Link formats

| Format | Example | Notes |
|------|------|------|
| Short link | `https://v.douyin.com/SpVK8mlOUUo/` | Must be resolved to the full link first |
| Full link | `https://www.douyin.com/video/7607519346462286491` | `aweme_id` can be extracted directly |

---

## 2. ID parsing rules

| Field | Source | Regex / location | Example |
|------|------|----------|------|
| **aweme_id** | URL `/video/(\d+)` or `__vid=` | `r"/video/(\d+)"` | `7607519346462286491` |
| **video_id** | HTML `video_id=` | `r'video_id["\']?\s*[:=]\s*["\']?([a-zA-Z0-9_]+)'` | `v02f52g10003d69l7afog65sirkjgcag` |
| **file_id** | HTML `file_id=` | `r'file_id["\']?\s*[:=]\s*["\']?([a-f0-9]{32})'` | `f7a8f7b2af594e6d93f3588e7ff4ec66` |
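
As a rough illustration, the three patterns in the table can be applied like this. The function name `extract_ids` and the sample HTML fragment are made up for the example; real pages may embed these fields differently.

```python
import re
from typing import Dict, Optional

# Patterns copied from the table above
ID_PATTERNS = {
    "video_id": r'video_id["\']?\s*[:=]\s*["\']?([a-zA-Z0-9_]+)',
    "file_id": r'file_id["\']?\s*[:=]\s*["\']?([a-f0-9]{32})',
}


def extract_ids(url: str, html: str) -> Dict[str, Optional[str]]:
    """Extract aweme_id from the URL and video_id / file_id from the page HTML."""
    ids: Dict[str, Optional[str]] = {"aweme_id": None, "video_id": None, "file_id": None}
    m = re.search(r"/video/(\d+)", url)
    if m:
        ids["aweme_id"] = m.group(1)
    for key, pattern in ID_PATTERNS.items():
        m = re.search(pattern, html)
        if m:
            ids[key] = m.group(1)
    return ids


# Hypothetical HTML fragment, built from the example values in the table
sample_html = (
    '<source src="https://example.invalid/play/'
    "?video_id=v02f52g10003d69l7afog65sirkjgcag"
    '&file_id=f7a8f7b2af594e6d93f3588e7ff4ec66">'
)
ids = extract_ids("https://www.douyin.com/video/7607519346462286491", sample_html)
```
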

---

## 3. Caption extraction rules

| Field | Source | Notes |
|------|------|------|
| **title** | `<title>` or ROUTER_DATA | Page title; usually contains the title plus the body text |
| **desc** | ROUTER_DATA or the latter part of the title | Body description |
| **hashtags** | `#xxx` in the body or title | Hashtags |
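
A minimal sketch of these rules when only the page `<title>` is available. It assumes the title ends with a ` - 抖音` suffix and that the first whitespace separates title from body, which mirrors the accompanying script; `parse_caption` is a hypothetical name.

```python
import re
from typing import Dict, List, Union


def parse_caption(page_title: str) -> Dict[str, Union[str, List[str]]]:
    """Split a page title into title, desc, and de-duplicated hashtags."""
    raw = page_title.strip().replace(" - 抖音", "")
    parts = raw.split(None, 1)  # split on the first run of whitespace
    title = parts[0] if parts else ""
    desc = parts[1] if len(parts) > 1 else ""
    tags = re.findall(r"#([^#\s]+)", desc + " " + title)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return {"title": title, "desc": desc, "hashtags": list(dict.fromkeys(tags))}
```

Splitting on the first whitespace is a heuristic: a title that itself contains spaces will be cut short, so the ROUTER_DATA `desc` is the better source when it is present.
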

---

## 4. Video URL extraction

| Source | Notes |
|------|------|
| `<source src="...">` | Direct CDN link in the video tag |
| `window._ROUTER_DATA` | `play_addr.url_list` or `url_list` in the JSON |
| Watermark-free | Replace `playwm` with `play` in the URL |
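
The ROUTER_DATA lookup and the watermark rewrite might look like the sketch below. The nesting of the sample JSON is invented for illustration (only `play_addr` and `url_list` come from the table), and the helper names are hypothetical.

```python
from typing import Any, Optional


def find_play_url(obj: Any) -> Optional[str]:
    """Depth-first search for the first play_addr.url_list entry in nested JSON."""
    if isinstance(obj, dict):
        play_addr = obj.get("play_addr")
        if isinstance(play_addr, dict) and play_addr.get("url_list"):
            return play_addr["url_list"][0]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        url = find_play_url(child)
        if url:
            return url
    return None


def to_watermark_free(url: str) -> str:
    """Apply the playwm -> play substitution from the table."""
    return url.replace("playwm", "play")


# Invented structure; real ROUTER_DATA nesting may differ
sample = {"data": {"items": [{"video": {"play_addr": {
    "url_list": ["https://example.invalid/aweme/v1/playwm/?video_id=abc"]}}}]}}
```
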

---

## 5. Play API format (fallback)

```
https://www.douyin.com/aweme/v1/play/
?file_id=xxx
&video_id=xxx
&sign=xxx
&uifid=xxx
...
```

This requires dynamic parameters such as `sign` and `uifid`, so the CDN direct link or ROUTER_DATA is more stable.
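
For completeness, assembling such a URL from already-known values could be sketched as below. This only covers the four parameters listed above; the parameters elided by `...` are not reconstructed here, and `build_play_url` is a hypothetical helper, so the result should not be expected to play without them.

```python
from urllib.parse import urlencode

PLAY_ENDPOINT = "https://www.douyin.com/aweme/v1/play/"


def build_play_url(file_id: str, video_id: str, sign: str, uifid: str) -> str:
    """Join the endpoint and the four listed query parameters, URL-encoding the values."""
    query = urlencode({
        "file_id": file_id,
        "video_id": video_id,
        "sign": sign,    # dynamic value, must be taken from the page
        "uifid": uifid,  # dynamic value, must be taken from the page
    })
    return f"{PLAY_ENDPOINT}?{query}"
```
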
224
03_卡木(木)/木叶_视频内容/抖音视频解析/脚本/douyin_parse.py
Normal file
@@ -0,0 +1,224 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Douyin video parsing: link → parse IDs → extract caption → download video
Input: Douyin short link (v.douyin.com) or full link (www.douyin.com/video/xxx)
Output: aweme_id, video_id, file_id, caption (title / body / hashtags), video file
"""

from __future__ import annotations  # keeps the `str | None` hints valid on Python 3.8/3.9

import argparse
import json
import re
import sys
from pathlib import Path

import requests

# Default output directory: 卡若Ai的文件夹/视频
DEFAULT_OUTPUT = Path.home() / "Documents" / "卡若Ai的文件夹" / "视频"
DEFAULT_OUTPUT.mkdir(parents=True, exist_ok=True)

# Mobile UA, to reduce the chance of being blocked
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1"
)


def parse_url_to_aweme_id(url: str) -> str | None:
    """Extract aweme_id from a Douyin link."""
    url = url.strip()
    # A full link can be parsed directly
    m = re.search(r"/video/(\d+)", url)
    if m:
        return m.group(1)
    return None


def fetch_and_parse(url: str) -> tuple[dict, str | None]:
    """
    Request the video page and parse the IDs, caption, and video URL.
    Supports v.douyin.com short links and full links.
    Returns (info_dict, video_url).
    """
    url = url.strip()
    # A short link must first be resolved to the full link
    if "v.douyin.com" in url:
        try:
            r = requests.get(url, allow_redirects=True, timeout=15, headers={"User-Agent": MOBILE_UA})
            url = r.url
            html = r.text
        except Exception as e:
            return {"error": str(e), "aweme_id": None}, None
    else:
        try:
            r = requests.get(url, headers={"User-Agent": MOBILE_UA, "Referer": "https://www.douyin.com/"}, timeout=15)
            r.raise_for_status()
            html = r.text
        except Exception as e:
            return {"error": str(e), "aweme_id": None}, None

    aweme_id = parse_url_to_aweme_id(url)
    info = {
        "aweme_id": aweme_id or "unknown",
        "video_id": None,
        "file_id": None,
        "title": "",
        "desc": "",
        "hashtags": [],
        "author": "",
    }
    video_url = None

    # 1. Parse __vid, video_id, file_id
    for pattern, key in [
        (r'["\']?__vid["\']?\s*[:=]\s*["\']?(\d+)["\']?', "aweme_id"),
        (r'video_id["\']?\s*[:=]\s*["\']?([a-zA-Z0-9_]+)["\']?', "video_id"),
        (r'file_id["\']?\s*[:=]\s*["\']?([a-f0-9]{32})["\']?', "file_id"),
    ]:
        m = re.search(pattern, html)
        if m:
            info[key] = m.group(1)

    # 2. Extract the video URL from <source src="...">
    src_match = re.search(r'<source[^>]+src=["\']([^"\']+)["\']', html)
    if src_match:
        video_url = src_match.group(1)
        # Attribute values are HTML-escaped, so decode "&amp;" back to "&"
        if "&amp;" in video_url:
            video_url = video_url.replace("&amp;", "&")

    # 3. Extract the video URL and caption from ROUTER_DATA (fallback)
    router = re.search(r"window\._ROUTER_DATA\s*=\s*(\{.*?\});?\s*</script>", html, re.DOTALL)
    if router:
        try:
            data = json.loads(router.group(1).strip())

            # Depth-first search for play_addr / url_list
            def find_url(obj):
                if isinstance(obj, dict):
                    # Guard against play_addr being null or a non-dict value
                    play_addr = obj.get("play_addr")
                    if isinstance(play_addr, dict) and play_addr.get("url_list"):
                        return play_addr["url_list"][0]
                    if "url_list" in obj and obj["url_list"]:
                        return obj["url_list"][0]
                    for v in obj.values():
                        u = find_url(v)
                        if u:
                            return u
                elif isinstance(obj, list):
                    for item in obj:
                        u = find_url(item)
                        if u:
                            return u
                return None

            u = find_url(data)
            if u and not video_url:
                video_url = u.replace("playwm", "play")  # watermark-free

            # Caption: depth-first search for the first non-empty "desc"
            def find_desc(obj, key="desc"):
                if isinstance(obj, dict):
                    if key in obj and obj[key]:
                        return str(obj[key])
                    for v in obj.values():
                        r = find_desc(v, key)
                        if r:
                            return r
                elif isinstance(obj, list):
                    for item in obj:
                        r = find_desc(item, key)
                        if r:
                            return r
                return ""

            info["desc"] = find_desc(data) or info["desc"]
        except json.JSONDecodeError:
            pass

    # 4. Extract the title (which includes the caption) from <title>
    title_match = re.search(r"<title>([^<]+)</title>", html)
    if title_match:
        raw = title_match.group(1).strip()
        if " - 抖音" in raw:
            raw = raw.replace(" - 抖音", "")
        parts = raw.split(None, 1)
        info["title"] = parts[0] if parts else raw
        if len(parts) > 1 and not info["desc"]:
            info["desc"] = parts[1]

    # 5. Hashtags
    tag_matches = re.findall(r"#([^#\s]+)", info.get("desc", "") + " " + info.get("title", ""))
    info["hashtags"] = list(dict.fromkeys(tag_matches))  # de-duplicate, keep order

    # 6. If the title is empty, use the first line of desc
    if not info["title"] and info["desc"]:
        info["title"] = info["desc"].split("\n")[0].strip()[:80]

    # Watermark-free rewrite
    if video_url and "playwm" in video_url:
        video_url = video_url.replace("playwm", "play")

    return info, video_url


def download_video(url: str, out_path: Path) -> bool:
    """Download the video to a local file; return True on success."""
    try:
        r = requests.get(url, headers={"User-Agent": MOBILE_UA}, stream=True, timeout=60)
        r.raise_for_status()
        with open(out_path, "wb") as f:
            # Stream in 8 KiB chunks so the whole video is never held in memory
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
        return True
    except Exception:
        return False


def main():
    parser = argparse.ArgumentParser(description="Douyin video parsing: link → IDs + caption + download")
    parser.add_argument("url", help="Douyin video link (short or full)")
    parser.add_argument("-o", "--output", type=Path, default=DEFAULT_OUTPUT, help="Output directory")
    parser.add_argument("--no-download", action="store_true", help="Parse only, do not download the video")
    args = parser.parse_args()

    url = args.url.strip()
    if not url:
        print("Please provide a Douyin video link", file=sys.stderr)
        sys.exit(1)

    # 1. Request and parse the page
    info, video_url = fetch_and_parse(url)
    # Check for a request error first: aweme_id is always None when "error" is set,
    # so testing aweme_id first would hide the real error message
    if info.get("error"):
        print(f"Parsing failed: {info['error']}", file=sys.stderr)
        sys.exit(1)
    aweme_id = info.get("aweme_id")
    if not aweme_id or aweme_id == "unknown":
        print("Could not parse the video; check the link format or the network", file=sys.stderr)
        sys.exit(1)

    # 2. Write the caption JSON
    args.output.mkdir(parents=True, exist_ok=True)
    caption_path = args.output / f"{aweme_id}_文案.json"
    with open(caption_path, "w", encoding="utf-8") as f:
        json.dump(info, f, ensure_ascii=False, indent=2)
    print(f"✅ Caption saved: {caption_path}")

    # 3. Download the video
    if not args.no_download and video_url:
        # "title" is always present but may be empty, so fall back to the ID explicitly
        safe_title = re.sub(r'[^\w\s-]', '', info.get("title") or aweme_id)[:50]
        out_file = args.output / f"{aweme_id}_{safe_title}.mp4"
        if download_video(video_url, out_file):
            print(f"✅ Video downloaded: {out_file}")
        else:
            print("⚠️ Video download failed; check the network or try fetching the page with the MCP browser", file=sys.stderr)
    elif args.no_download:
        print("Download skipped (--no-download)")
    else:
        print("⚠️ No video URL was parsed; try opening the page with the MCP browser", file=sys.stderr)

    # 4. Print a summary
    print("\n--- Parse result ---")
    print(json.dumps(info, ensure_ascii=False, indent=2))


if __name__ == "__main__":
    main()
@@ -61,6 +61,7 @@
| # | Skill | Member | Trigger words | SKILL path | One-line summary |
|:--|:---|:---|:---|:---|:---|
| M01 | 视频切片 | 木叶 | **视频剪辑、切片发布、切片动效包装、程序化包装、片头片尾、批量封面、视频包装** | `03_卡木(木)/木叶_视频内容/视频切片/SKILL.md` | Slice long videos, add subtitles, and publish; works with slice motion packaging (intro / outro / programmatic) |
| M01b | 抖音视频解析 | 木叶 | **抖音视频、抖音链接、抖音解析、抖音下载、提取抖音文案、抖音无水印** | `03_卡木(木)/木叶_视频内容/抖音视频解析/SKILL.md` | Link → parse IDs → extract caption → download watermark-free video |
| M02 | 网站逆向分析 | 木根 | 逆向分析、模拟登录 | `03_卡木(木)/木根_逆向分析/网站逆向分析/SKILL.md` | Website API analysis and SDK generation |
| M03 | 项目生成 | 木果 | 生成项目、五行模板 | `03_卡木(木)/木果_项目模板/项目生成/SKILL.md` | Generate new projects from the 五行 (Five Elements) template |
| M04 | 开发模板 | 木果 | 创建项目、初始化模板 | `03_卡木(木)/木果_项目模板/开发模板/SKILL.md` | Front-end and back-end project template library |
@@ -86,3 +86,4 @@
| 2026-02-22 11:44:40 | 🔄 卡若AI sync 2026-02-22 11:44 | Updated: 水桥 platform integration, 卡木, operations hub workbench | Excluded >20MB: 8 files |
| 2026-02-22 11:47:38 | 🔄 卡若AI sync 2026-02-22 11:47 | Updated: 水桥 platform integration, operations hub workbench | Excluded >20MB: 8 files |
| 2026-02-22 11:58:17 | 🔄 卡若AI sync 2026-02-22 11:58 | Updated: 金仓, 水桥 platform integration, 卡木, operations hub workbench | Excluded >20MB: 8 files |
| 2026-02-22 12:42:56 | 🔄 卡若AI sync 2026-02-22 12:42 | Updated: 金仓, 卡木, operations hub workbench | Excluded >20MB: 8 files |
@@ -89,3 +89,4 @@
| 2026-02-22 11:44:40 | Success | Success | 🔄 卡若AI sync 2026-02-22 11:44 | Updated: 水桥 platform integration, 卡木, operations hub workbench | Excluded >20MB: 8 files | [Repository](http://open.quwanzhi.com:3000/fnvtk/karuo-ai) [Wiki](http://open.quwanzhi.com:3000/fnvtk/karuo-ai/wiki) |
| 2026-02-22 11:47:38 | Success | Success | 🔄 卡若AI sync 2026-02-22 11:47 | Updated: 水桥 platform integration, operations hub workbench | Excluded >20MB: 8 files | [Repository](http://open.quwanzhi.com:3000/fnvtk/karuo-ai) [Wiki](http://open.quwanzhi.com:3000/fnvtk/karuo-ai/wiki) |
| 2026-02-22 11:58:17 | Success | Success | 🔄 卡若AI sync 2026-02-22 11:58 | Updated: 金仓, 水桥 platform integration, 卡木, operations hub workbench | Excluded >20MB: 8 files | [Repository](http://open.quwanzhi.com:3000/fnvtk/karuo-ai) [Wiki](http://open.quwanzhi.com:3000/fnvtk/karuo-ai/wiki) |
| 2026-02-22 12:42:56 | Success | Success | 🔄 卡若AI sync 2026-02-22 12:42 | Updated: 金仓, 卡木, operations hub workbench | Excluded >20MB: 8 files | [Repository](http://open.quwanzhi.com:3000/fnvtk/karuo-ai) [Wiki](http://open.quwanzhi.com:3000/fnvtk/karuo-ai/wiki) |