单次遍历哈夫曼编码

原文

Single-Pass Huffman Coding

Posted on February 17, 2018

While working on something else, I figured out a nice Haskell implementation of Huffman coding, and I thought I’d share it here. I’ll go through a few techniques for transforming a multi-pass algorithm into a single-pass one first, and then I’ll show how to use them for Huffman. If you just want to skip to the code, it’s provided at the end .

The algorithm isn’t single-pass in the sense of Adaptive Huffman Coding: it still uses the normal Huffman algorithm, but the input is transformed in the same traversal that builds the tree to transform it.

Circular Programming

There are several techniques for turning multi-pass algorithms into single-pass ones in functional languages. Perhaps the most famous is circular programming: using laziness to eliminate a pass. R. S. Bird (1984) used this to great effect in solving the repmin problem:

Given a tree of integers, replace every integer with the minimum integer in the tree, in one pass.

For an imperative programmer, the problem is relatively easy: first, write the code to find the minimum value in the tree in the standard way, using a loop and a “smallest so far” accumulator. Then, inside the loop, after updating the accumulator, set the value of the leaf to be a reference to the accumulator.

At first, that solution may seem necessarily impure: we’re using global, mutable state to update many things at once. However, as the paper shows, we can claw back purity using laziness:

data Tree a = Leaf a | Tree a :*: Tree a

repMin :: Tree Integer -> Tree Integer
repMin xs = ys where
  (m, ys) = go xs
  go (Leaf x) = (x, Leaf m)
  go (xs :*: ys) = (min x y, xs' :*: ys')
    where
      (x,xs') = go xs
      (y,ys') = go ys

There and Back Again

Let’s say we don’t have laziness at our disposal: are we hosed? No ! Danvy and Goldberg (2005) explore this very issue, by posing the question:

Given two lists, xs and ys, can you zip xs with the reverse of ys in one pass?

The technique used to solve the problem is named “There and Back Again”; it should be clear why from one of the solutions:

convolve xs ys = walk xs const where
  walk [] k = k [] ys
  walk (x:xs) k = walk xs (\r (y:ys) -> k ((x,y) : r) ys)

The traversal of one list builds up the function to consume the other. We could write repmin in the same way:

repMin = uncurry ($) . go where
  go (Leaf x) = (Leaf, x)
  go (xs :*: ys) = (\m -> xs' m :*: ys' m, min xm ym) where
    (xs',xm) = go xs
    (ys',ym) = go ys

Cayley Representations

If you’re doing a lot of appending to some list-like structure, you probably don’t want to use actual lists: you’ll end up traversing the left-hand-side of the append many more times than necessary. A type you can drop in to use instead is difference lists (Hughes 1986):

type DList a = [a] -> [a]

rep :: [a] -> DList a
rep = (++)

abs :: DList a -> [a]
abs xs = xs []

append :: DList a -> DList a -> DList a
append = (.)

append is $\mathcal{O}(1)$ in this representation. In fact, for any monoid with a slow mappend, you can use the same trick: it’s called the Cayley representation, and available as Endo in Data.Monoid.

rep :: Monoid a => a -> Endo a
rep x = Endo (mappend x)

abs :: Monoid a => Endo a -> a
abs (Endo f) = f mempty

instance Monoid (Endo a) where
  mempty = Endo id
  mappend (Endo f) (Endo g) = Enfo (f . g)

You can actually do the same transformation for “monoids” in the categorical sense: applying it to monads, for instance, will give you codensity (Rivas and Jaskelioff 2014).

Traversable

Looking back—just for a second—to the repmin example, we should be able to spot a pattern we can generalize. There’s really nothing tree-specific about it, so why can’t we apply it to lists? Or other structures, for that matter? It turns out we can: the mapAccumL function is tailor-made to this need:

repMin :: Traversable t => t Integer -> t Integer
repMin xs = ys where
  (~(Just m), ys) = mapAccumL f Nothing xs
  f Nothing x = (Just x, m)
  f (Just y) x = (Just (min x y), m)

The tilde before the Just ensures this won’t fail on empty input.

Finally, it’s time for the main event. Huffman coding is a very multi-pass algorithm, usually. The steps look like this:

Build a frequency table for each character in the input.
Build a priority queue from that frequency table.
Iteratively pop elements and combine them (into Huffman trees) from the queue until there’s only one left.
That Huffman tree can be used to construct the mapping from items back to their Huffman codes.
Traverse the input again, using the constructed mapping to replace elements with their codes.

We can’t skip any of these steps: we can try perform them all at once, though.

Let’s write the multi-pass version first. We’ll need the frequency table:

frequencies :: Ord a => [a] -> Map a Int
frequencies = Map.fromListWith (+) . map (flip (,) 1)

And a heap, ordered on the frequencies of its elements (I’m using a skew heap here):

data Heap a
  = Nil
  | Node {-# UNPACK #-} !Int a (Heap a) (Heap a)

instance Monoid (Heap a) where
  mappend Nil ys = ys
  mappend xs Nil = xs
  mappend h1@(Node i x lx rx) h2@(Node j y ly ry)
    | i <= j    = Node i x (mappend h2 rx) lx
    | otherwise = Node j y (mappend h1 ry) ly
  mempty = Nil

Next, we need to build the tree. We can use the tree type from above.

buildTree :: Map a Int -> Maybe (Tree a)
buildTree = prune . toHeap where
  toHeap = Map.foldMapWithKey (\k v -> Node v (Leaf k) Nil Nil)
  prune Nil = Nothing
  prune (Node i x l r) = case mappend l r of
    Nil -> Just x
    Node j y l' r' ->
      prune (mappend (Node (i+j) (x :*: y) Nil Nil) (mappend l' r'))

Then, a way to convert between the tree and a map:

toMapping :: Ord a => Tree a -> Map a [Bool]
toMapping (Leaf x) = Map.singleton x []
toMapping (xs :*: ys) =
    Map.union (fmap (True:) (toMapping xs)) (fmap (False:) (toMapping ys))

And finally, putting the whole thing together:

huffman :: Ord a => [a] -> (Maybe (Tree a), [[Bool]])
huffman xs = (tree, map (mapb Map.!) xs) where
  freq = frequencies xs
  tree = buildTree freq
  mapb = maybe Map.empty toMapping tree

Removing the passes

The first thing to fix is the toMapping function: at every level, it calls union, a complex and expensive operation. However, union and empty form a monoid, so we can use the Cayley representation to reduce the calls to a minimum. Next, we want to get rid of the fmaps: we can do that by assembling a function to perform the fmap as we go, as in convolve.

toMapping :: Ord a => Tree a -> Map a [Bool]
toMapping tree = go tree id Map.empty where
  go (Leaf x) k = Map.insert x (k [])
  go (xs :*: ys) k =
    go xs (k . (:) True) . go ys (k . (:) False)

Secondly, we can integrate the toMapping function with the buildTree function, removing another pass:

buildTree :: Ord a => Map a Int -> Maybe (Tree a, Map a [Bool])
buildTree = prune . toHeap where
  toHeap = Map.foldMapWithKey (\k v -> Node v (Leaf k, leaf k) Nil Nil)
  prune Nil = Nothing
  prune (Node i x l r) = case mappend l r of
    Nil -> Just (fmap (\k -> k id Map.empty) x)
    Node j y l' r' ->
      prune (mappend (Node (i+j) (cmb x y) Nil Nil) (mappend l' r'))
  leaf x k = Map.insert x (k [])
  node xs ys k = xs (k . (:) True) . ys (k . (:) False)
  cmb (xt,xm) (yt,ym) = (xt :*: yt, node xm ym)

Finally, to remove the second pass over the list, we can copy repmin, using mapAccumL to both construct the mapping and apply it to the structure in one go.

huffman :: (Ord a, Traversable t) => t a -> (Maybe (Tree a), t [Bool])
huffman xs = (fmap fst tree, ys) where
  (freq,ys) = mapAccumL f Map.empty xs
  f fm x = (Map.insertWith (+) x 1 fm, mapb Map.! x)
  tree = buildTree freq
  mapb = maybe Map.empty snd tree

And that’s it!

The similarity between the repmin function and the solution above is suggestive: is there a way to encode this idea of making a multi-pass algorithm single-pass? Of course! We can use an applicative:

data Circular a b c =
    Circular !a
             (b -> c)

instance Functor (Circular a b) where
    fmap f (Circular tally run) = Circular tally (f . run)

instance Monoid a =>
         Applicative (Circular a b) where
    pure x = Circular mempty (const x)
    Circular fl fr <*> Circular xl xr =
        Circular
            (mappend fl xl)
            (\r -> fr r (xr r))

liftHuffman
    :: Ord a
    => a -> Circular (Map a Int) (Map a [Bool]) [Bool]
liftHuffman x = Circular (Map.singleton x 1) (Map.! x)

runHuffman
    :: Ord a
    => Circular (Map a Int) (Map a [Bool]) r -> (Maybe (Tree a), r)
runHuffman (Circular smry run) =
    maybe (Nothing, run Map.empty) (Just *** run) (buildTree smry)

huffman
    :: (Ord a, Traversable t)
    => t a -> (Maybe (Tree a), t [Bool])
huffman = runHuffman . traverse liftHuffman

Thanks to it being an applicative, you can do all the fun lensy things with it:

showBin :: [Bool] -> String
showBin = map (bool '0' '1')

>>> let liftBin = fmap showBin . liftHuffman
>>> (snd . runHuffman . (each.traverse) liftBin) ("abb", "cad", "c")
(["01","11","11"],["00","01","10"],["00"])

Bringing us back to the start, it can also let us solve repmin!

liftRepMin :: a -> Circular (Option (Min a)) a a
liftRepMin x = Circular (pure (pure x)) id

runRepMin :: Circular (Option (Min a)) a b -> b
runRepMin (Circular m r) = r (case m of
  Option (Just (Min x)) -> x)

repMin :: (Ord a, Traversable t) => t a -> t a
repMin = runRepMin . traverse liftRepMin

So the Circular type is actually just the product of reader and writer, and is closely related to the sort type.

It’s also related to the Prescient type, which I noticed after I’d written the above.

Bird, R. S. 1984. “Using Circular Programs to Eliminate Multiple Traversals of Data.” Acta Inf. 21 (3) (October): 239–250. doi:10.1007/BF00264249. http://dx.doi.org/10.1007/BF00264249.

Bird, Richard, Geraint Jones, and Oege De Moor. 1997. “More haste‚ less speed: Lazy versus eager evaluation.” Journal of Functional Programming 7 (5) (September): 541–547. doi:10.1017/S0956796897002827. https://ora.ox.ac.uk/objects/uuid:761a4646-60a2-4622-a1e0-ddea11507d57/datastreams/ATTACHMENT01.

Danvy, Olivier, and Mayer Goldberg. 2005. “There and Back Again.” http://brics.dk/RS/05/3/BRICS-RS-05-3.pdf.

Hughes, R. John Muir. 1986. “A Novel Representation of Lists and Its Application to the Function "Reverse".” Information Processing Letters 22 (3) (March): 141–144. doi:10.1016/0020-0190(86)90059-1. http://www.sciencedirect.com/science/article/pii/0020019086900591.

Pippenger, Nicholas. 1997. “Pure Versus Impure Lisp.” ACM Trans. Program. Lang. Syst. 19 (2) (March): 223–238. doi:10.1145/244795.244798. http://doi.acm.org/10.1145/244795.244798.

Rivas, Exequiel, and Mauro Jaskelioff. 2014. “Notions of Computation as Monoids.” arXiv:1406.4823 [cs, math] (May). http://arxiv.org/abs/1406.4823.