How we run iSCSI over the internet

Original link: https://scsipub.com/blog/how-we-run-iscsi-over-the-internet

## scsipub: an iSCSI target for the public internet

scsipub is an iSCSI target designed to serve clients directly over the public internet, unlike traditional rack-scale Fibre Channel setups. It was originally created to support PXE booting for a Raspberry Pi and an ESP32 USB bridge, and prioritizes robustness and simplicity.

The system uses Ranch to listen on TCP (3260) and TLS (3261), handling each connection in a lightweight BEAM process and leaning on Erlang's concurrency model for efficiency. A session walks through security negotiation and parameter setup before handling SCSI commands. The design follows the "let it crash" philosophy: a failed process simply triggers an initiator retry, avoiding complex error recovery.

Data is stored in sparse overlay files on top of read-only base images, minimizing disk usage. Multi-LUN support and SCSI persistent reservations enable clustering. Security comes from TLS, with certificates rotated automatically via Caddy, and the focus is on isolating malicious requests rather than exhaustively validating input.

For now, scsipub targets a single datacenter and does not support S3 backends or RDMA. Future work will stress-test the system under heavy iSCSI load to find its performance limits and failure modes.

## scsipub: iSCSI over the internet

A new project called scsipub lets users run iSCSI targets directly on the public internet, exposing block devices reachable with `iscsiadm`. The free tier provides a 64 MB ephemeral disk with no signup; paid tiers add persistent sessions, multiple LUNs, and full SCSI-3 persistent reservations, enough to back a two-node failover cluster.

Tom, the project's creator, details the architectural decisions in the accompanying article, covering Ranch 2.x, Elixir's BEAM, copy-on-write overlays, and Caddy for TLS. He is also open about the limitations, such as the lack of multi-region support or RDMA.

Beyond the core service, scsipub has two companion projects: an ESP32-S3 iSCSI-to-USB bridge and a Raspberry Pi netboot solution. Tom welcomes questions about the protocol, the deployment, and the BEAM design.

Original article

iSCSI is a protocol from the era when “the network” meant a rack-scale fibre channel replacement. Initiators and targets trusted each other, CHAP was optional theatre, and a packet from an initiator carried the implicit assumption “we’re on the same L2 segment.”

scsipub serves iSCSI targets to arbitrary clients on the public internet. That’s a different set of assumptions. This post is the decision log — the small choices that add up to “this works and doesn’t break from day one.”

It started as the missing dependency for two adjacent projects of mine — a Raspberry Pi netboot shim and an ESP32-based USB-mass-storage bridge — both of which needed an iSCSI target out on the open internet to point demos at, and there wasn't one. Building a target turned out to be the bigger of the three problems.

The listener

Both ports are Ranch 2.x listeners — plain TCP on 3260, TLS on 3261. Scsipub.Target.Listener returns a pair of child specs that the application supervisor adds at boot:

def child_specs(opts) do
  protocol_opts = opts[:protocol_opts] || []
  certfile = opts[:tls_certfile]
  keyfile = opts[:tls_keyfile]

  tcp_spec = tcp_child_spec(opts[:port] || 3260, protocol_opts)

  if certfile && keyfile && File.exists?(certfile) && File.exists?(keyfile) do
    tls_spec = tls_child_spec(opts[:tls_port] || 3261, certfile, keyfile, protocol_opts)
    [tcp_spec, tls_spec]
  else
    [tcp_spec]
  end
end

Ranch runs a small acceptor pool in front of a :ranch_protocol callback. When a connection arrives, Ranch spawns a fresh BEAM process and hands it the socket. For iSCSI that’s the unit we want: one process per TCP connection, one TCP connection per initiator session, one initiator session per user-visible mountable disk.

“One BEAM process per connection” only works because processes here aren’t OS threads. A BEAM process is ~2.5 KB of initial heap and some bookkeeping — the scheduler happily runs tens of thousands of them on a single core. iSCSI sessions sit idle waiting for SCSI PDUs most of the time, which is the ideal shape for green threads: cheap to park, cheap to wake.

Contrast with the C implementations: target_core_iblock and friends carry a thread pool and a queue, and tuning the pool size is an ongoing concern. We don't tune anything, and the BEAM comfortably handled 446 req/s in our web-side load test before latency started climbing — and that's the Phoenix surface with its DB hops, not the iSCSI listener, which has smaller payloads and no SQL in the hot path at all.

One process per session

The protocol module is Scsipub.Target.Session, a plain GenServer. Its state machine walks through three phases:

phase: :security_negotiation  # csg=0, CHAP challenge/response
phase: :operational           # csg=1, negotiate parameters
phase: :full_feature          # csg=1 transit done, handling SCSI PDUs

Each PDU comes in on the socket, gets parsed into a struct, and routed to a handler. If a handler raises — malformed PDU, unexpected state transition, disk error — the process dies. That’s on purpose. The supervisor doesn’t restart it, because there’s no meaningful recovery; the initiator will notice the TCP close and try to log in again. State doesn’t leak between sessions because state doesn’t leave the process.

This is the standard Erlang story (“let it crash”), but it’s more than a platitude for iSCSI. The real-world alternative — carefully defending every parser branch against every attacker-shaped PDU — is how RFC 7143’s more colourful edge cases turn into CVEs in other implementations. We don’t defend; we fence. One bad PDU kills one session.

The Registry (Scsipub.Sessions.Registry, ETS-backed) is how a session announces itself once it reaches Full Feature Phase:

Registry.set_pid(iqn, self())

The Registry monitors the pid and auto-cleans the entry on :DOWN. The admin dashboard reads from the same ETS table to show live connections.

COW overlays

The base image is a regular file — .img, .iso, or .qcow2 decompressed to raw on fetch. It’s read-only. Every concurrent session gets its own overlay file, sparse-allocated to the same size as the base:

/var/lib/scsipub/overlays/
  71a61232479cc467.img          ← overlay, sparse
  71a61232479cc467.img.bitmap   ← 1 bit per sector

The bitmap tracks which 512-byte sectors have been written. Reads check the bit: if set, the overlay has the sector; if clear, fall through to the base image. Writes set the bit and write to the overlay.
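The bitmap dispatch can be modelled in a few lines. This is a hedged sketch, not the production code (which is Elixir and works against real sparse files); the in-memory `bytearray` here just stands in for the overlay file, and the class name is invented for illustration:

```python
SECTOR = 512

class CowOverlay:
    """Minimal copy-on-write model: a read-only base plus an overlay,
    with one bit per sector recording which side holds the current data."""

    def __init__(self, base: bytes):
        self.base = base                      # read-only base image
        self.overlay = bytearray(len(base))   # stands in for the sparse file
        n_sectors = len(base) // SECTOR
        self.bitmap = bytearray((n_sectors + 7) // 8)

    def _bit(self, sector: int) -> bool:
        return bool(self.bitmap[sector // 8] & (1 << (sector % 8)))

    def _set_bit(self, sector: int) -> None:
        self.bitmap[sector // 8] |= 1 << (sector % 8)

    def read(self, sector: int) -> bytes:
        off = sector * SECTOR
        # Bit set: the overlay owns the sector. Clear: fall through to base.
        src = self.overlay if self._bit(sector) else self.base
        return bytes(src[off:off + SECTOR])

    def write(self, sector: int, data: bytes) -> None:
        assert len(data) == SECTOR
        off = sector * SECTOR
        self.overlay[off:off + SECTOR] = data
        self._set_bit(sector)                 # overlay owns this sector now

disk = CowOverlay(b"A" * 4096)          # 8-sector base image
disk.write(2, b"B" * SECTOR)
assert disk.read(2) == b"B" * SECTOR    # served from the overlay
assert disk.read(3) == b"A" * SECTOR    # falls through to the base
```

The same shape works at file granularity: writes to the overlay file land in a sparse region, and the base image's bytes are never modified.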

The layout means:

  • The base image is never touched. CI verifies this — we SHA-256 the base before and after an integration run.
  • The overlay file is sparse. A session that only writes the MBR costs ~512 bytes on disk, not “the full virtual size of the disk.” Filesystem holes do the work.
  • Disconnecting is cheap. Non-persistent tiers delete the overlay on the TCP close; persistent tiers keep it until the session’s TTL elapses or the user destroys it explicitly.
  • Writes are counted. Each overlay write bumps a counter against write_limit from the user’s tier config. Hit the limit and the target responds WRITE_PROTECT until the session ends.

The Janitor, a GenServer on a 10-minute tick, sweeps the overlay directory and deletes files that don’t match any live session in the database. That’s how we clean up from the rare case where a process dies before its terminate callback runs.
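The sweep itself is a directory walk against the set of live sessions. A minimal sketch (Python rather than the real Elixir GenServer; the filename convention of `<session_id>.img` / `<session_id>.img.bitmap` is taken from the layout above, and the function name is invented):

```python
import os

def sweep_overlays(overlay_dir: str, live_session_ids: set) -> list:
    """Delete overlay and bitmap files whose session id is no longer live.
    A file like '71a61232479cc467.img' or '71a61232479cc467.img.bitmap'
    belongs to the session whose id is the part before the first dot."""
    removed = []
    for name in os.listdir(overlay_dir):
        session_id = name.split(".", 1)[0]
        if session_id not in live_session_ids:
            os.remove(os.path.join(overlay_dir, name))
            removed.append(name)
    return removed
```

Running this on a tick is idempotent, so a sweep that races a clean disconnect just finds nothing to delete.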

Caddy in front, TLS everywhere

Caddy terminates HTTPS on port 443 and reverse-proxies to the Phoenix app on port 4000. The same Let’s Encrypt certificate also protects the iSCSI-TLS listener on port 3261 — which is the interesting part, because the iSCSI listener isn’t behind Caddy. It binds :ranch_ssl directly.

Caddy writes the ACME-obtained cert to its internal storage (/var/lib/caddy/.local/share/caddy/...), which the app user can’t read. The bridge is a tiny systemd service running inotifywait against that directory and copying the cert into /var/lib/scsipub/tls/ — owned by a shared group both users can read — whenever the bytes change.

The iSCSI listener picks up rotations without a restart because its sni_fun re-reads the PEM on every TLS handshake, with guardrails:

# lib/scsipub/target/tls_certs.ex
def sni_opts(certfile, keyfile) do
  now = System.monotonic_time(:second)

  case :persistent_term.get(cache_key, nil) do
    {_cert_mtime, _key_mtime, loaded_at, opts}
    when now - loaded_at < @min_reload_interval ->
      opts                           # 60s cooldown — serve cache unconditionally

    {cert_mtime, key_mtime, _loaded_at, opts} ->
      if stat_unchanged?(certfile, keyfile, cert_mtime, key_mtime) do
        opts                         # mtime unchanged — still fresh
      else
        reload_and_cache(...)        # rotation happened — re-read PEM
      end

    nil ->
      reload_and_cache(...)          # cold cache — first load
  end
end

Two guards, in order: a 60-second cooldown that serves the cached opts without any syscall (absorbs a thundering-herd handshake burst), and an mtime check after the cooldown that only pays for a fresh PEM read when the files have actually changed. Both matter — sni_fun is on the hot path for every TLS handshake, and without them a rotation every few months would still cost two stat syscalls per mount.
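The two-guard structure translates directly. Here is a Python sketch of the same caching discipline (a single global cache slot for brevity — the real Elixir version keys its `persistent_term` entry per certificate, and `load_pem` here is a caller-supplied stand-in for the actual PEM decode):

```python
import os, time

MIN_RELOAD_INTERVAL = 60  # seconds: cooldown that absorbs handshake bursts

_cache = None  # (cert_mtime, key_mtime, loaded_at, opts)

def sni_opts(certfile: str, keyfile: str, load_pem):
    """Return TLS options, re-reading the PEMs only when the cooldown
    has expired AND the files' mtimes have actually changed."""
    global _cache
    now = time.monotonic()
    if _cache is not None:
        cert_mtime, key_mtime, loaded_at, opts = _cache
        if now - loaded_at < MIN_RELOAD_INTERVAL:
            return opts                       # guard 1: zero syscalls
        if (os.stat(certfile).st_mtime == cert_mtime
                and os.stat(keyfile).st_mtime == key_mtime):
            _cache = (cert_mtime, key_mtime, now, opts)
            return opts                       # guard 2: two stats, no read
    opts = load_pem(certfile, keyfile)        # cold cache or rotation
    _cache = (os.stat(certfile).st_mtime, os.stat(keyfile).st_mtime, now, opts)
    return opts
```

Guard 1 makes a burst of simultaneous handshakes cost nothing; guard 2 keeps the steady state at two `stat` calls per reload window rather than a full PEM parse.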

Things open-iscsi cares about

If you’re building against the open-iscsi initiator that ships in every Linux distro, the protocol is less “what’s on the wire” and more “what iscsiadm does with what’s on the wire.” Three concrete examples that each cost us a day.

/ in the IQN type-name separator

Our first cut of anonymous target names was iqn.2025-01.pub.scsipub:image/ubuntu. That parses fine as an IQN. iscsiadm even does discovery against it happily. What it can’t do is log in:

iscsiadm: Could not make /etc/iscsi/nodes/iqn.2025-01.pub.scsipub:image/ubuntu

open-iscsi stores its persistent state in /etc/iscsi/nodes/<iqn>/... — it uses the IQN verbatim as a filesystem path. Any / in the name becomes a subdirectory boundary, and the create-if-missing path walk fails. We switched to . as the type/name separator (iqn.2025-01.pub.scsipub:image.ubuntu), which parses the same way and sidesteps the whole problem.
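Screening target names for path safety before they ever reach an initiator is cheap. A sketch of the check (the regex is a deliberately conservative subset of what RFC 3720 allows for IQNs, and the function name is invented — the point is only that `/` must never appear):

```python
import re

# iscsiadm mirrors the IQN verbatim into /etc/iscsi/nodes/<iqn>/..., so any
# character special in a path breaks login even when the IQN itself parses.
SAFE_IQN = re.compile(r"^iqn\.\d{4}-\d{2}\.[a-z0-9.\-]+(:[A-Za-z0-9.\-]+)?$")

def path_safe_iqn(iqn: str) -> bool:
    return "/" not in iqn and SAFE_IQN.fullmatch(iqn) is not None

assert not path_safe_iqn("iqn.2025-01.pub.scsipub:image/ubuntu")  # breaks mkdir
assert path_safe_iqn("iqn.2025-01.pub.scsipub:image.ubuntu")      # logs in fine
```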

SendTargets has to advertise an address the client can reach

When an initiator does discovery, the target replies with a list of TargetName + TargetAddress records. The initiator saves that address as the portal for future logins — even if the discovery request itself went through a different IP.

In our CI, the target runs inside a CI container and the initiator inside a QEMU VM. QEMU's user-mode networking NATs to 10.0.2.2 from the VM's perspective. If we let the server advertise whatever sockname() returns (127.0.0.1:3260 here), iscsiadm dutifully saves that as the portal, and every subsequent login attempt tries to reach the runner's loopback from inside the VM and fails forever.

# lib/scsipub/target/session.ex
defp advertise_address(socket, transport) do
  case Application.get_env(:scsipub, :public_host) do
    host when is_binary(host) -> "#{host}:#{port(socket, transport)}"
    _ -> sockname_string(socket, transport)
  end
end

Pin :public_host (we ship this as PHX_HOST in deploy env) and SendTargets returns something the client can actually get back to.

The -o new dance for static logins

Once you’ve been bitten by the SendTargets-saves-the-portal behaviour enough times, you learn to skip discovery for anything that needs a non-default portal. For example: iSCSI-over-TLS via stunnel. The natural flow would be “discover via the tunnel, then log in.” But the discovery response names the server’s public portal, not 127.0.0.1:3260 where stunnel is terminating, so iscsiadm saves the wrong portal and logs in plain instead of through the tunnel.

The fix is static login:

IQN=iqn.2025-01.pub.scsipub:blank

iscsiadm -m node -T $IQN -p 127.0.0.1:3260 -o new
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 \
  -o update -n node.session.auth.authmethod -v None
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 --login

-o new creates a fresh node record at the portal you specify instead of using whatever the discovery step saved. Our landing page renders exactly that command sequence for the TLS path, because the alternative is an infuriating 30 minutes with iscsiadm --debug=6.

Bonus: stale records retry forever

Once a node record exists under /etc/iscsi/nodes/, iscsid retries the login indefinitely if the session drops. If the target has been destroyed server-side, that manifests as a steady 1-every-3-second stream of “unknown target” login attempts in our server logs. The cure is on the client:

iscsiadm -m node -T <iqn> -o delete

On the server we throttle the log line (once per (ip, target) per 5 minutes at warning level, debug after that) so a stale initiator doesn't bury real warnings under 28,800 copies of the same complaint per day. See Scsipub.Target.Session.log_unknown_target/2.
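The throttle is just a timestamp per (ip, target) key. A Python sketch of that shape (the real function lives in the Elixir session module; the `now` parameter here exists only to make the window testable):

```python
import time, logging

log = logging.getLogger("scsipub.target")
THROTTLE = 300  # seconds: one WARNING per (ip, target) per 5 minutes
_last_warned = {}  # (ip, iqn) -> monotonic time of last WARNING

def log_unknown_target(ip: str, iqn: str, now=None) -> str:
    """First sighting per (ip, iqn) in each window logs at WARNING;
    repeats within the window drop to DEBUG."""
    now = time.monotonic() if now is None else now
    key = (ip, iqn)
    if now - _last_warned.get(key, float("-inf")) >= THROTTLE:
        _last_warned[key] = now
        log.warning("login to unknown target %s from %s", iqn, ip)
        return "warning"
    log.debug("login to unknown target %s from %s", iqn, ip)
    return "debug"
```

Keying on the pair (ip, target) means a second misconfigured initiator still gets its own first warning instead of being silenced by the first one's window.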

Cluster primitives: PR and multi-LUN

What turns this from “a fancy iSCSI sandbox” into “a target real cluster software can drive” is two SAM-5 / SPC-4 features — multi-LUN sessions and SCSI-3 Persistent Reservations. The wire protocol already supports both; the work is on our side, plumbing them into the Session and into something that survives a BEAM restart.

Multiple LUNs per session

A SCSI Logical Unit Number is the byte in each CDB that selects which device behind a target the initiator is addressing. Real storage products expose one target with N LUNs all the time; our Session struct holds a map keyed by LUN number, and the SCSI dispatcher routes by pdu.lun:

case Map.get(state.lun_backends, pdu.lun) do
  nil -> {:error, :logical_unit_not_supported}
  cow -> Handler.dispatch(pdu.cdb, pdu.data, cow, ...)
end

There’s an anonymous demo target wired up — iqn.2025-01.pub.scsipub:multi exposes two LUNs, each backed by a different image — and the session-creation API on the paid side takes an images: [...] array. The unglamorous half of the work was cleanup: multi-LUN sessions write to <sid>.lun0.img, <sid>.lun1.img, etc., and a terminator that only knew about state.overlay_path (the single-LUN field) leaked overlays on disconnect. The fix is a separate cleanup_multi_lun_overlays/1 walker, gated on state.overlay_path == nil so the single-LUN path’s own File.rm doesn’t double-close the same fd.

Persistent reservations

SCSI-3 PR is the primitive cluster software uses to fence a node out of shared storage. The per-LUN state is small: a set of registered initiator keys, plus an optional “reservation” naming one of them as holder along with a type (Write Exclusive, Exclusive Access, and four flavours combining “Registrants Only” and “All Registrants”). Pacemaker, ESXi HA, and Windows MSCS all drive this via sg_persist.

The state machine is Scsipub.Sessions.PR — a pure module, no DB or process baggage, so it's tested as a struct. The runtime layer (SharedLU, one GenServer per (session_id, lun)) wraps it with write-through to the persistent_reservations Postgres table on every successful PR OUT. SPC-4 says PR state must survive a target reboot, and the table is the only honest way to honour that. A BEAM-restart unit test cycles the SharedLU through stop+restart and asserts that the registrations and the reservation reload identically.
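The core of that pure state machine is small enough to sketch. This Python model covers only the Write Exclusive type (the real module handles all six); the class and method names are invented for illustration, not the Elixir API:

```python
WRITE_EXCLUSIVE = 1  # sg_persist --prout-type=1

class PersistentReservation:
    """Pure SCSI-3 PR state for one LU: a map of I_T nexus -> key,
    plus an optional (holder, type) reservation. Write Exclusive only."""

    def __init__(self):
        self.registrations = {}   # nexus -> registered key
        self.reservation = None   # (holder_nexus, pr_type) or None

    def register(self, nexus: str, key: int) -> None:
        self.registrations[nexus] = key

    def reserve(self, nexus: str, key: int, pr_type: int) -> bool:
        if self.registrations.get(nexus) != key:
            return False                      # must register before reserving
        if self.reservation and self.reservation[0] != nexus:
            return False                      # another nexus holds it
        self.reservation = (nexus, pr_type)
        return True

    def release(self, nexus: str, key: int) -> bool:
        if self.reservation and self.reservation[0] == nexus \
                and self.registrations.get(nexus) == key:
            self.reservation = None
            return True
        return False

    def write_allowed(self, nexus: str) -> bool:
        if self.reservation is None:
            return True
        holder, pr_type = self.reservation
        if pr_type == WRITE_EXCLUSIVE:
            return nexus == holder            # reads stay open to everyone
        return True

pr = PersistentReservation()
pr.register("iqn.a", 0xA)
assert pr.reserve("iqn.a", 0xA, WRITE_EXCLUSIVE)
assert not pr.write_allowed("iqn.b")          # RESERVATION CONFLICT
assert pr.release("iqn.a", 0xA)
assert pr.write_allowed("iqn.b")              # free again after release
```

Keeping the transition logic pure like this is what makes the write-through layer easy to test: persistence is just serialize-on-success, reload-on-boot.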

Two subtle bits of plumbing.

The I_T nexus identifier is the iSCSI InitiatorName, not the CHAP user. Two initiators behind the same CHAP credential are distinct nexuses by design, and trusting CHAP_N would let a second client write under the first’s reservation. The Session struct keeps both:

:initiator_name,        # CHAP_N for paid sessions
:iscsi_initiator_name,  # the InitiatorName from the first login PDU
                        # — what PR identifies by

The other surprise was that Linux’s open-iscsi doesn’t send PR OUT parameter lists as immediate data. It uses the R2T (Ready To Transfer) flow, the same way it does WRITE — which makes sense, the spec lets it, but the original implementation only handled the immediate path. sg_persist --register returned Invalid opcode until R2T-driven PR OUT joined the existing two-phase command machinery that SCSI WRITE already used.

Two-initiator scenario, end to end:

# Initiator A: register a key, reserve Write Exclusive
sg_persist --out --register --param-sark=$KEY_A $DEV_A
sg_persist --out --reserve --param-rk=$KEY_A --prout-type=1 $DEV_A

# Initiator B (different InitiatorName, may share CHAP user):
# READ is allowed, WRITE returns RESERVATION CONFLICT.
dd if=$DEV_B bs=512 count=1 iflag=direct >/dev/null   # ok
dd if=/dev/zero of=$DEV_B bs=512 count=1 oflag=direct # EBUSY

# A releases; B's write now succeeds.
sg_persist --out --release --param-rk=$KEY_A --prout-type=1 $DEV_A

The CI integration suite runs that exact sequence. Combined with the restart-resume contract above, that’s enough to back a 2-node failover cluster off a target on the public internet — the BEAM deploy ritual (SIGTERM, wait for sessions to checkpoint, SIGKILL, restart, Resumer wakes the suspended LUs) doesn’t lose reservations along the way.

What we’re not solving

Deliberate omissions, for the record:

  • Multi-region. Everything runs in a single datacenter. A multi-region story would need per-session persistence to be a distributed system problem; it currently isn’t, and we like that.
  • S3- or NBD-backed base images. Images are local sparse files. Upload via the admin UI or an ecto run script; that’s the whole ingestion story. Cloud-backed storage changes the read-path latency distribution meaningfully enough that we’d want to think about it rather than bolt it on.
  • iSER / RDMA. No. scsipub is a public-internet service; RDMA is a rack-scale protocol. If you need 40 Gbit/s into a block device, the physics say you aren’t on the public internet anyway.
  • MPIO. Not yet. The initiator side of multipath works fine, but until we have multi-region there’s nowhere to failover to.
  • Per-session encryption above TLS. The iSCSI protocol has IPsec and a few other approaches for payload secrecy; none are widely deployed, and adding our own on top of TLS would just be framing for framing’s sake.

What comes next

The two projects scsipub originally existed to serve are now both shipped and have their own posts — the Pi netboot shim and how it killed the SD-card shuffle is at Netboot a Pi fleet from iSCSI; the ESP32 USB-mass-storage bridge for lab equipment is at An ESP32 as a network-attached USB stick.

Past that, the interesting question is what happens when a Phoenix app serving iSCSI meets someone who really wants to use it — tens of thousands of sessions, sustained writes, a pathological initiator. We’ve done a load test up to a few hundred concurrent web requests; we haven’t yet found the shape of the BEAM’s failure mode under actual iSCSI load. That’s the next thing to measure.
