Show HN: Kage – Shadow any website to a single binary for offline viewing

Original link: https://github.com/tamnd/kage

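**Kage** is a tool that clones a website into a fully functional folder you can browse offline. It removes all JavaScript so the pages stay readable indefinitely. Unlike the standard "Save As" feature, which often breaks the page layout or leaves elements unloaded, Kage uses a headless browser to render the site completely, captures the final DOM a human would see, and saves all related resources (CSS, images, fonts) to local paths.

**Key features:**

* **Offline reliability:** Produces clean, script-free HTML that runs with no network connection and no trackers.
* **Portable packaging:** A mirror can be "packed" into a ZIM archive (compatible with the open Kiwix ecosystem) or into a standalone executable, so the site can be read without installing extra software.
* **Flexible mirroring:** Supports incremental updates, URL scope limits, and automatic handling of lazy-loaded images.
* **Lightweight:** Runs from the command line (CLI), using the Chrome/Chromium already installed on the system or the bundled container image.

Whether you want to read articles on a plane or preserve content against link rot, Kage creates a permanent, searchable, shareable snapshot of any website. Full documentation is at [kage.tamnd.com](https://kage.tamnd.com).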

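The Hacker News community is discussing a new tool called **Kage**, which mirrors an entire website and packages it into a single binary for offline browsing. Unlike traditional mirroring tools such as `wget`, Kage is built for modern JavaScript-dependent sites (such as Next.js apps), which must be rendered before they can be captured.

Discussion highlights:

* **Comparison with similar tools:** Users compared Kage with *SingleFile*, which saves a single page as a portable HTML file. Kage focuses on mirroring whole sites, but the author expressed interest in adding a single-file export.
* **Possible improvements:** Suggestions included rate limiting to reduce load on servers, media filtering, and `mitmproxy` integration for high-fidelity, archive-grade snapshots.
* **Use cases:** Potential uses include offline copies of company wikis for regions with poor connectivity, and archiving articles or complex web content.

The developer, *tamnd*, is actively collecting feedback and feature requests through GitHub issues to improve the tool, for example by adding finer control over cloning specific parts of a site.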

Original text


kage (影, "shadow") clones a website into a folder you can browse offline, with every script stripped out. It opens each page in real headless Chrome, waits for the page to settle, snapshots the DOM a human would have seen, then deletes all the JavaScript and pulls the CSS, images, and fonts down to local paths. What lands on disk looks like the live site and runs no code.


[Demo: kage cloning paulgraham.com, packing it into one file, and serving it back offline]

You already know the problem. You hit "Save As" on a page you want to keep, and six months later you open it to find a blank screen, a spinner that never stops, or a copy that still tries to phone home to an analytics server that no longer exists. The page was never really yours. It was a thin client for someone else's JavaScript.

kage takes the other road. It drives a real browser, lets the page finish doing whatever it does, grabs the finished result, and then rips every script out of it. No tracking, no network calls, no surprises. Just .html files you can open straight off disk, hand to a friend, or pack into a single file and forget about for a decade.

Full docs and guides live at kage.tamnd.com.

go install github.com/tamnd/kage/cmd/kage@latest

Prefer a prebuilt binary? Grab an archive, a .deb/.rpm/.apk, or a checksum from releases. Or skip installing Chrome yourself and use the container image, which bundles Chromium:

docker run --rm -v "$PWD/out:/out" ghcr.io/tamnd/kage clone paulgraham.com

kage drives a real browser, so it needs Chrome or Chromium on the host. It finds a system install on its own; point it somewhere specific with --chrome or the KAGE_CHROME environment variable. The container needs nothing extra.
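
To make that lookup concrete, here is a small Python sketch of one way it could work. It is not kage's code: the precedence (flag, then KAGE_CHROME, then a PATH search) and the list of candidate executable names are assumptions made for illustration.

```python
import os
import shutil

# Candidate executable names are an assumption, not taken from kage.
CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chrome")

def resolve_chrome(flag_value=None, env=None, which=shutil.which):
    """Return a browser path: explicit flag, then KAGE_CHROME, then PATH."""
    env = os.environ if env is None else env
    if flag_value:                      # an explicit --chrome value wins
        return flag_value
    if env.get("KAGE_CHROME"):          # then the environment variable
        return env["KAGE_CHROME"]
    for name in CANDIDATES:             # finally, look for a system install
        found = which(name)
        if found:
            return found
    return None                         # nothing found; the caller must report it
```

Passing `env` and `which` in as parameters keeps the function easy to exercise without touching the real environment.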

Shell completion ships in the box: kage completion bash|zsh|fish|powershell.

Let's mirror Paul Graham's essays so you can read them on a plane, on a laptop with no wifi, or in the year 2050 after the site has finally changed its design:

# 1. Clone the site into $HOME/data/kage/paulgraham.com/
kage clone paulgraham.com

# 2. Read it back offline in your browser
kage serve $HOME/data/kage/paulgraham.com
# open http://127.0.0.1:8800

That's the whole loop. Every essay, every image, every stylesheet, frozen on your disk and runnable with zero network. The next two steps are optional but nice: collapse the whole thing into one file, and pop it open in its own window.

# 3. Squeeze the mirror into a single shareable file
kage pack paulgraham.com               # -> paulgraham.com.zim
kage open paulgraham.com.zim

# 4. Or into one executable that *is* the site
kage pack paulgraham.com --format binary -o paulgraham
./paulgraham                           # serves itself, needs nothing installed

| Command | What it does |
| --- | --- |
| `kage clone <url>` | render a site in headless Chrome and write a browsable, script-free mirror |
| `kage serve [dir]` | preview a cloned folder over a local HTTP server |
| `kage pack <mirror-dir>` | collapse a mirror into one ZIM archive, or a self-contained viewer binary |
| `kage open <file.zim>` | serve a packed ZIM back for offline reading |

# The whole site, into $HOME/data/kage/<host>/
kage clone https://paulgraham.com

# Just the first 50 pages, two links deep, for a quick taste
kage clone paulgraham.com --max-pages 50 --max-depth 2

# Only one section of a bigger site
kage clone go.dev --scope-prefix /doc

# Pull in subdomains too, and scroll each page to trip lazy-loaded images
kage clone example.com --subdomains --scroll

# Come back next month and re-render in place to catch new essays
kage clone paulgraham.com --refresh

A clone is a polite, breadth-first crawl. It reads robots.txt, seeds itself from sitemap.xml, and stays on the seed host unless you tell it otherwise. It is also stubbornly idempotent: each page is keyed by the file it writes, so the same essay reached over http and https, with or without a trailing slash, gets fetched exactly once. Hit Ctrl-C and it saves its place on the way out; run it again and it picks up where it stopped. --refresh re-renders in place, --force wipes the host and starts clean.
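
The breadth-first, fetch-once behaviour can be sketched in a few lines of Python. This is a simplified model, not kage's implementation: `page_key` is a stand-in for keying by output file (it only folds the scheme and a trailing slash), and `links_of` is a hypothetical callback that replaces rendering a page and extracting its links.

```python
from collections import deque
from urllib.parse import urlsplit

def page_key(url):
    """Collapse scheme and trailing-slash variants onto one key (assumed rule)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return parts.netloc.lower() + path

def crawl(seed, links_of, max_pages=0, max_depth=0):
    """Breadth-first walk; 0 means no limit, mirroring the flag defaults."""
    visited = set()
    queue = deque([(seed, 0)])
    fetched = []
    while queue:
        url, depth = queue.popleft()
        key = page_key(url)
        if key in visited:
            continue                      # same page reached via another spelling
        if max_pages and len(fetched) >= max_pages:
            break                         # page budget used up
        visited.add(key)
        fetched.append(url)
        if max_depth and depth >= max_depth:
            continue                      # keep the page, but follow no deeper
        for link in links_of(url):
            queue.append((link, depth + 1))
    return fetched
```

Because the visited set holds keys and not raw URLs, `http://example.com/a` and `https://example.com/a/` count as one page.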

The flags you'll actually reach for:

| Flag | Default | Meaning |
| --- | --- | --- |
| `-o, --out` | `$HOME/data/kage` | Output root; the mirror lands in `<out>/<host>/` |
| `-p, --max-pages` | `0` | Stop after N pages (0 = no limit) |
| `-d, --max-depth` | `0` | How many links deep to follow (0 = no limit) |
| `--scope-prefix` | | Only crawl paths starting with this prefix |
| `--subdomains` | `false` | Treat subdomains of the seed host as in scope |
| `--exclude` | | Path prefixes to skip (repeatable) |
| `--scroll` | `false` | Auto-scroll each page to trigger lazy loading |
| `--workers` | `4` | How many pages to render at once |
| `--no-robots` | `false` | Ignore robots.txt (be nice) |
| `-f, --force` | `false` | Delete any existing mirror for the host first |
| `--chrome` | | Path to the Chrome/Chromium binary |

kage clone --help has the rest, including render-timing, concurrency, and asset-size knobs.
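
If it helps to see how the scoping flags interact, here is a rough Python model of the decision a crawler makes for each discovered link. The function and its parameter names are hypothetical; only the behaviour of --scope-prefix, --subdomains, and --exclude as described in the table is taken from above, and the exact matching rules kage uses may differ.

```python
from urllib.parse import urlsplit

def in_scope(url, seed_host, scope_prefix="", subdomains=False, exclude=()):
    """Decide whether a discovered link should be crawled."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    same_host = host == seed_host
    is_subdomain = host.endswith("." + seed_host)
    if not (same_host or (subdomains and is_subdomain)):
        return False                     # off-site link
    if scope_prefix and not path.startswith(scope_prefix):
        return False                     # outside --scope-prefix
    if any(path.startswith(prefix) for prefix in exclude):
        return False                     # matches an --exclude prefix
    return True
```

With `seed_host="go.dev"` and `scope_prefix="/doc"`, a link to `https://go.dev/doc/effective_go` passes while `https://go.dev/blog/` does not.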

kage serve runs a tiny static file server over a cloned folder so links and assets resolve the way they would on a real host:

kage serve $HOME/data/kage/paulgraham.com
# open http://127.0.0.1:8800
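
For a sense of how little a static preview server needs, this is a rough Python equivalent built on the standard library. It is only an analogy for what kage serve does; the function name is made up, and kage's own server is written in Go and may handle paths differently.

```python
import functools
import http.server
import threading

def serve_mirror(directory, host="127.0.0.1", port=8800):
    """Start a static file server over `directory` on a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server                        # call server.shutdown() to stop it
```

Serving over HTTP instead of opening files directly is what lets root-relative links and assets resolve as they would on a real host.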

A mirror is a folder, which is great for browsing and lousy for moving around. Copying thousands of little files is slow, and "here, have this directory" is a clumsy thing to hand someone. kage pack collapses the whole mirror into one artifact, and you choose the shape: an open ZIM archive, or a single executable that is the site.

kage pack paulgraham.com               # -> paulgraham.com.zim
kage open paulgraham.com.zim

ZIM is an open file format built for exactly this: a whole website (or a whole Wikipedia) squeezed into one compressed, indexed, read-only file. kage writes the entire mirror into it, text zstd-compressed and media stored as-is. It is the format behind Kiwix, the offline-content project people use to carry Wikipedia, Stack Overflow, and Project Gutenberg onto boats, into classrooms with no internet, and onto a phone for a long flight. Because the format is a documented standard and not a kage invention, a paulgraham.com.zim you make today will still open in any ZIM reader years from now.
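
Because the format is documented, you can peek inside a ZIM without any special tooling. The Python sketch below decodes the first few fixed fields of the header as laid out in the openZIM specification (magic number, version, UUID, entry and cluster counts). It is an illustration, not part of kage, and it stops well short of a full reader; the UUID is shown in plain byte order.

```python
import struct
import uuid

ZIM_MAGIC = 72173914          # 0x044D495A, the bytes "ZIM\x04" read little-endian

def read_zim_header(data):
    """Decode the leading fixed-size fields of a ZIM file header."""
    if len(data) < 32:
        raise ValueError("need at least 32 bytes of header")
    magic, major, minor = struct.unpack_from("<IHH", data, 0)
    if magic != ZIM_MAGIC:
        raise ValueError("not a ZIM file")
    archive_id = uuid.UUID(bytes=bytes(data[8:24]))
    entry_count, cluster_count = struct.unpack_from("<II", data, 24)
    return {
        "major": major,
        "minor": minor,
        "uuid": archive_id,
        "entry_count": entry_count,
        "cluster_count": cluster_count,
    }
```

To try it on a real archive, read the first 32 bytes of the file in binary mode and pass them in.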

So you are not locked into kage. kage open is the quickest way back in, but the very same file works across the wider Kiwix ecosystem:

kage open paulgraham.com.zim            # read it back with kage
kiwix-serve paulgraham.com.zim          # or serve it with Kiwix at http://localhost

You can also double-click the file in the Kiwix desktop app, or load it on Kiwix for Android or iOS to read your mirror on your phone. One caveat: kage writes a structurally valid archive with the standard metadata, but it does not build the full-text search index that Kiwix's own packs ship with, so browsing and clicking work everywhere while in-reader search is limited.

Packing is deterministic. The same mirror always produces a byte-identical file, with the archive UUID derived from the content instead of randomized, so a pack is safe to checksum and cache. A bare host name resolves against the default output directory, which is why kage pack paulgraham.com just works right after kage clone paulgraham.com.
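
One way to get a content-derived UUID is to hash the mirror in a fixed order and keep 16 bytes of the digest. The Python sketch below shows the idea; how kage actually derives its UUID is not stated here, so the choice of SHA-256 and the length-prefixed layout are assumptions.

```python
import hashlib
import uuid

def content_uuid(files):
    """Derive a stable UUID from {path: bytes}, independent of insertion order."""
    digest = hashlib.sha256()
    for path in sorted(files):                     # fixed order gives a fixed hash
        name = path.encode("utf-8")
        data = files[path]
        # Length-prefix each field so different splits cannot collide.
        digest.update(len(name).to_bytes(8, "big"))
        digest.update(name)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return uuid.UUID(bytes=digest.digest()[:16])
```

The same files always hash to the same UUID, and changing a single byte of any file changes it.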

--format binary glues the archive onto a copy of kage and hands you a single executable that serves the site offline when you run it. Whoever you send it to needs nothing installed: not kage, not a ZIM reader, nothing.

kage pack paulgraham.com --format binary -o paulgraham
./paulgraham
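
Appending data to an executable works because the operating system's loader only reads what the executable header describes and ignores trailing bytes. A common way to find the payload again is a small trailer at the very end of the file. The Python sketch below shows that general technique; the marker string and trailer layout are invented for illustration and are not kage's actual on-disk format.

```python
import struct

TRAILER_MAGIC = b"KAGEZIM1"              # hypothetical marker, not kage's real one
TRAILER = struct.Struct("<8sQ")          # marker + payload length in bytes

def append_payload(exe_bytes, payload):
    """Return executable + payload + trailer as one blob."""
    return exe_bytes + payload + TRAILER.pack(TRAILER_MAGIC, len(payload))

def extract_payload(blob):
    """Recover the payload from the end of the blob, or None if absent."""
    if len(blob) < TRAILER.size:
        return None
    magic, length = TRAILER.unpack(blob[-TRAILER.size:])
    end = len(blob) - TRAILER.size
    if magic != TRAILER_MAGIC or length > end:
        return None                      # plain executable, nothing attached
    return blob[end - length:end]
```

At startup a viewer built this way would read its own file, call something like `extract_payload`, and serve the archive it finds.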

The appended archive is platform-independent; only the base executable carries the architecture. By default kage appends to itself, so you get a viewer for the machine you ran it on. Point --base at a kage built for another OS (grab one from a release; every platform ships one) to produce a viewer for that platform from your own machine. kage reads the base's executable header to figure out the target, so a Windows viewer automatically gets a .exe name:

# Sitting on a Mac, build a Windows viewer
kage pack paulgraham.com --format binary --base kage-windows-amd64.exe   # -> paulgraham.exe
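
Telling executable formats apart only takes the first few bytes of the file. Here is a Python sketch of that kind of header sniffing, using the well-known magic numbers for ELF, PE, and Mach-O. The function names are hypothetical, and kage may inspect more of the header than this (for example, the CPU architecture).

```python
# Mach-O magics in both byte orders, plus the fat (universal) binary magic.
MACHO_MAGICS = {
    b"\xfe\xed\xfa\xce", b"\xce\xfa\xed\xfe",   # 32-bit
    b"\xfe\xed\xfa\xcf", b"\xcf\xfa\xed\xfe",   # 64-bit
    b"\xca\xfe\xba\xbe",                        # fat binary
}

def sniff_executable(header):
    """Classify an executable from its leading bytes."""
    if header[:4] == b"\x7fELF":
        return "elf"                     # Linux and most Unix systems
    if header[:2] == b"MZ":
        return "pe"                      # Windows
    if header[:4] in MACHO_MAGICS:
        return "macho"                   # macOS
    return "unknown"

def viewer_name(stem, header):
    """Pick an output name: Windows targets get a .exe suffix."""
    return stem + ".exe" if sniff_executable(header) == "pe" else stem
```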

The trade is size. The binary carries a whole kage, so it weighs around 13 MiB plus the site no matter how small the mirror is. When you only need the content, the ZIM is far leaner.

A real window, not a browser tab

By default a packed binary opens your system browser, which means the site shows up as yet another tab, address bar and all, next to the 47 you already have open. Build kage with the webview tag and it opens the site in its own window instead, backed by the operating system's WebView (WKWebView on macOS, WebView2 on Windows, WebKitGTK on Linux). Paul Graham's essays, offline, in something that looks and feels like a real app:

[Screenshot: paulgraham.com served offline in a native kage window]

make build-webview                       # or: CGO_ENABLED=1 go build -tags webview ./cmd/kage
kage pack paulgraham.com --format binary --base bin/kage -o paulgraham
./paulgraham                             # opens a window, no browser in sight

This build needs cgo and links the platform WebView, so it stays opt-in. The default build is pure Go (CGO_ENABLED=0) and the prebuilt release binaries open the browser, which keeps the cross-compiled release simple. kage open honours the same tag, so built with -tags webview it shows a ZIM in a native window too.

seed URL ─▶ headless Chrome ─▶ final DOM ─▶ strip JS ─▶ localise assets ─▶ disk
              (render)          (snapshot)   (sanitize)   (rewrite links)

A pool of Chrome tabs renders pages; a separate pool fetches assets over plain HTTP. Every URL maps deterministically to a local path, so links get rewritten before the asset they point at has even finished downloading. The output looks like this:

paulgraham.com/
├── index.html                  # the home page, scripts stripped
├── greatwork.html              # /greatwork.html, an essay
├── _kage/                      # reserved: assets and crawl state
│   ├── paulgraham.com/site.css  # localised stylesheet (url() rewritten)
│   ├── paulgraham.com/pg.png
│   └── state.json              # visited set, for resuming
└── ...
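
A deterministic URL-to-path mapping is what lets links be rewritten before downloads finish: the target path depends only on the URL. The Python sketch below reproduces the layout in the tree above. The rules for directory-style and extension-less URLs are assumptions, and the real mapping lives in the urlx package and may differ.

```python
import posixpath
from urllib.parse import urlsplit

def page_path(url):
    """Map a page URL to its file inside the mirror."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        path += "index.html"             # assumed rule for directory-style URLs
    elif not posixpath.splitext(path)[1]:
        path += ".html"                  # assumed rule for extension-less URLs
    return path.lstrip("/")

def asset_path(url):
    """Map an asset URL under the reserved _kage/<host>/ directory."""
    parts = urlsplit(url)
    return posixpath.join("_kage", parts.netloc.lower(), parts.path.lstrip("/"))

def relative_link(from_page, to_path):
    """Compute the link to write into from_page so it resolves offline."""
    return posixpath.relpath(to_path, posixpath.dirname(from_page) or ".")
```

For example, an image referenced from a page stored at `essays/index.html` would be rewritten to `../_kage/paulgraham.com/pg.png`.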

pack rides on the same idea: the mirror's links are already mirror-relative paths, and those map one-to-one onto the archive's content entries, so a click in a served page hits the right entry with no rewriting at all.

git clone https://github.com/tamnd/kage
cd kage
make build          # -> bin/kage (pure Go, opens the browser)
make build-webview  # -> bin/kage with the native-window viewer (needs cgo)
make test           # full suite, including the Chrome-driven end-to-end tests
make test-short     # skip the tests that launch a browser

The repo is split by concern:

cmd/kage/   thin main: pins the main thread, then hands off to cli.Execute
cli/        the cobra command tree and flag wiring
clone/      the crawl: frontier, render workers, asset workers, resume state
browser/    headless Chrome control and DOM snapshotting
sanitize/   strip scripts, handlers, and javascript: URLs from the DOM
asset/      download and localise CSS, images, and fonts
urlx/       the deterministic URL-to-path mapping
zim/        a pure-Go ZIM reader and writer
pack/       mirror to ZIM or self-contained binary, and the offline HTTP handler
viewer/     present a served site: system browser, or native window (webview tag)
docs/       the tago documentation site
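
To give a feel for what the sanitize step listed above involves, here is a deliberately small Python sketch that drops script elements, on* handler attributes, and javascript: URLs from an HTML string. It is not kage's sanitizer, which works on the rendered DOM in Go; this version is crude (it also discards comments and treats any attribute starting with "on" as a handler).

```python
from html import escape
from html.parser import HTMLParser

class ScriptStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = 0

    def _attrs(self, attrs):
        kept = []
        for name, value in attrs:
            if name.startswith("on"):
                continue                 # inline event handler (crude match)
            if value and value.strip().lower().startswith("javascript:"):
                continue                 # javascript: URL
            kept.append(name if value is None else f'{name}="{escape(value)}"')
        return "".join(" " + a for a in kept)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script += 1
        elif not self.in_script:
            self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if tag != "script" and not self.in_script:
            self.out.append(f"<{tag}{self._attrs(attrs)} />")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = max(0, self.in_script - 1)
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:           # script bodies arrive here; drop them
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def strip_scripts(html):
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```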

Push a version tag and GitHub Actions runs GoReleaser, which builds the archives, the .deb/.rpm/.apk packages, a multi-arch GHCR image with Chromium bundled, checksums, SBOMs, and a cosign signature:

git tag v0.1.1
git push --tags

The image tag carries no v prefix (ghcr.io/tamnd/kage:0.1.1). The Homebrew and Scoop steps self-disable until their tokens exist, so the first release works with no extra secrets.

MIT. See LICENSE.
