链接较小的 Haskell 二进制文件

链接较小的 Haskell 二进制文件
Linking Smaller Haskell Binaries (2023)

原始链接: https://brandon.si/code/linking-smaller-haskell-binaries/

## 减小 Haskell 二进制文件大小：链接与代码折叠 Haskell 二进制文件由于传递依赖性可能出乎意料地很大。本文探讨了在链接时减小它们大小的技术，并以 `pandoc` 项目为例进行演示。提出了两种主要策略。首先，使用 GHC 选项 `-split-sections` 和 `--gc-sections`（通过 `-fuse-ld=lld` 使用 `lld` 作为链接器）可以将二进制文件大小减少 27%，从而实现死代码移除。更具实验性的是，使用 `lld` 的*相同代码折叠* (ICF) 可以进一步缩小二进制文件（在本例中减少了另外 23%）。ICF 识别并合并功能上等效的代码段。虽然有效，但 ICF 并非完全安全，可能会导致依赖于特定函数地址的 C 代码出现问题。分析表明 Haskell 项目内部存在大量代码重复，表明在编译过程中存在优化的潜力——缓存编译单元以避免重复工作。尝试了 `bloaty` 和 `kcov` 等工具进行进一步分析，但证明与 Haskell 代码不兼容。作者还指出 ICF 与调试工具（如 `-fdistinct-constructor-tables`）之间可能存在交互，需要仔细考虑以保留调试信息。

对不起。

Publish date: Jan 7, 2023
Last updated: Jan 8, 2023

Haskell binaries can get quite large (think ~100MB), especially for projects with many transitive dependencies. Here are two strategies that can help at link time, the latter being more experimental.

I used the test-pandoc binary from pandoc on GHC 9.2.5 below. This was nice because obviously it was easy to test if linking broke anything (just run the tests).

`-split-sections` and `--gc-sections`

You can instruct ghc to emit code in individual minimal sections, allowing the linker to easily find and remove dead code. This looks like, in your cabal.project:

package *
  ghc-options: -split-sections
  -- For C code; This likely won't have much effect,  but worth including:
  gcc-options: -fdata-sections -ffunction-sections  

package pandoc
  -- All of the major thinkers support gc-sections,  but we'll use lld since 
  -- it supports the next experiment we want to do (and it's fast).
  -- NOTE: assumes gcc >= 10, which knows how to invoke lld
  ld-options: -fuse-ld=lld -Wl,--gc-sections,--build-id

These options take the size of the stripped binary down from 113M to 83M (-27%).

See Vidar’s blog for further exploration of this technique, in the context of shellcheck.

Identical code folding

Both gold and lld are able to do a limited form of icf, identifying functionally-equivalent sections at link time, and combining them, fixing up any inbound references. lld seems to have a more effective implementation.

You can try it by adding the following to the settings above:

package pandoc
  ...

  -- --print-icf-sections is just for debugging
  ld-options: -Wl,--icf=all,--ignore-data-address-equality,--ignore-function-address-equality,--print-icf-sections

This gets the binary down to 64M for me (a further -23%).

Looking closer at ICF

There are lots of reasons we can imagine a Haskell binary having a lot of potential code folding candidates; for instance the same function being inlined and maybe specialized the same way in several modules. I wanted to poke at the binary briefly and see if I could learn anything.

Taking a look at the logs from lld and picking a section arbitrarily:

selected section /tmp/pandoc/dist-newstyle/build/x86_64-linux/ghc-9.2.5/pandoc-3.0/build/libHSpandoc-3.0-inplace.a(HTML.o):(.text..Ls10LWe_info)
  removing identical section /tmp/pandoc/dist-newstyle/build/x86_64-linux/ghc-9.2.5/pandoc-3.0/build/libHSpandoc-3.0-inplace.a(HTML.o):(.text..Ls10LWD_info)
   … (snip)
  removing identical section /tmp/pandoc/dist-newstyle/build/x86_64-linux/ghc-9.2.5/pandoc-3.0/build/libHSpandoc-3.0-inplace.a(LaTeX.o):(.text..LsWGUO_info)
   … (snip)

Here’s a section that’s repeated across half a dozen modules within pandoc. Compiling with debug symbols, and inspecting with:

$ objdump -g --dwarf=decodedline -S -d -Mintel dist-newstyle/build/x86_64-linux/ghc-9.2.5/pandoc-3.0/build/libHSpandoc-3.0-inplace.a

…we can find two of the sections mentioned above, and indeed they look the same:

identical folded sections

I wasn’t really able to understand how these two map back on to the source code. The dwarf debug symbols associated come from:

pTagContents :: PandocMonad m => InlinesParser m Inlines
pTagContents =
      B.displayMath <$> mathDisplay
  <|> B.math        <$> mathInline
-- XXX FOLDED (orig):
  <|> pStr
  <|> pSpace
  <|> smartPunctuation pTagContents
  <|> pRawTeX
  <|> pSymbol
  <|> pBad

…and…

-- XXX FOLDED
-- FYI: type LP m = ParsecT TokStream LaTeXState m
doubleQuote :: PandocMonad m => LP m Inlines
doubleQuote =
       quoted' doubleQuoted (try $ count 2 $ symbol '`')
                     (void $ try $ count 2 $ symbol '\'')
   <|> quoted' doubleQuoted ((:[]) <$> symbol '“') (void $ symbol '”')
   -- the following is used by babel for localized quotes:
   <|> quoted' doubleQuoted (try $ sequence [symbol '"', symbol '`'])
                            (void $ try $ sequence [symbol '"', symbol '\''])

Both polymorphic parser code, but beyond that it’s difficult to see what’s going on.

Misc

Interactions with `-fdistinct-constructor-tables`

I’ve come to rely on -fdistinct-constructor-tables -finfo-table-map for profiling and debugging, and I was curious how it interacted with icf.

I created a little test project with:

{-# LANGUAGE GeneralizedNewtypeDeriving #-}
module Main where

main :: IO ()
main = do
    let xs1 = func1 1000000 
    print $ sum xs1
    print xs1
    let xs2 = func2 1000000 
    print $ sum xs2
    print xs2

newtype X = X Int deriving (Num,Show)

func1 :: Int -> [X]
func1 n = map X [1..n]

func2 :: Int -> [X]
func2 n = map X [1,3..(n*4)]

… Compiled with all the options above, and ran with +RTS -l-agu -hi -RTS. I noticed that unsurprisingly the duplicate info tables are folded away, which isn’t what we want for debugging:

distinct info tables get folded

You’d need to instruct the linker that these separate infotables should be treated as GC roots, and not removed.

An opportunity to speed up compilation?

All these duplicate sections represent not only wasted space in the binary, but also wasted time during compilation: code generation, but also probably time in the simplifier, etc etc. Could GHC be smarter about e.g. caching compilation units that eventually are emitted as sections early in compilation?

Approximately half of the 120K folded sections come from pandoc itself, that is are from just a single invocation of ghc, so the idea doesn’t seem outlandish.

Tools that didn’t work

I tried to play with bloaty to investigate the composition of the binary, but it chokes on Haskell code.

I thought it might be neat to try to use kcov to investigate how much of the resulting binary was dead code, before and after different linker options, but it chokes on (or with) probably the same bug.