Zürich, Switzerland
I am recovering from surgery right now and have some time on my hands, so I wanted to share a story about a learning experience.
This post serves a dual purpose: it is a teachable moment for software engineers, and it is a chance for me to offer a retrospective on bad decisions and earn penance for them.
One episode sticks out in my memory with a fair bit of shame: implementing the original metric hashing for the metric family data type in Prometheus's client metrics. This hashing was used in the internals of the client metrics, so users wouldn't be exposed to it directly, except in terms of CPU cost!
You see, I implemented it initially in a very naive way, using FNV to digest each of the metric key-value metadata pairs. The underlying data type looked something like this:
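A sketch of that shape, reconstructed from the discussion that follows (the `Value` field type is an assumption; `Label`, `Metric`, and the hash-keyed `map[uint64]*Metric` index are what the rest of the article relies on):

```go
type Label struct {
	Name  string
	Value string
}

type Metric struct {
	Labels []Label
	Value  float64
}

type MetricFamily struct {
	Name string
	// Metrics indexes each child Metric by a hash of its Labels.
	Metrics map[uint64]*Metric
}
```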
Note: The eagle-eyed reader will immediately notice that hash code-based indexing needs to consider the eventuality of a legitimate hash collision and that the proposed structure above does not solve for that. It is OK to ignore that problem, as it is merely tangential to the story. I have reduced the problem to its essence: bad, specious hashing on my part.
If why I am calling this out draws a blank stare from you as a reader, consider how the backing storage of something like a Go `map` or a Java `HashMap` works: there is a backing store of entries (e.g., linked list, bucket, or similar) that provides the ability to disambiguate between keys whose hash codes collide. This is one of the reasons that Java requires classes that override `Object#hashCode` to also provide `Object#equals` (i.e., to use `Object#equals` to disambiguate between entries whose keys produce the same `Object#hashCode`).
Metrics were identified by their labels (`[]Label`). In real-world systems, you'd have label pairs like these, since the data describes a real-world system in a deaggregated way. To understand this, consider a metric family around a parent metric where there are multiple child metrics that represent individual label pairs. Structurally this looks as follows:
- metric_parent: `http_requests_total`
  - child metric
    - labels:
      - `method`: `GET`
      - `response_code`: `200`
    - value: 42
  - child metric
    - labels:
      - `method`: `GET`
      - `response_code`: `400`
    - value: 21
  - child metric
    - …
We can see that the server served a total of 63 GET requests (what manages and implements this metric is beside the point). So how do we represent this state in the telemetry instrumentation library? Let’s consider this value literal:
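Reconstructed here as two `Metric` values (the names `metric1` and `metric2` are assumed; a later snippet references `metric1`):

```go
var (
	metric1 = Metric{
		Labels: []Label{
			{Name: "method", Value: "GET"},
			{Name: "response_code", Value: "200"},
		},
		Value: 42,
	}
	metric2 = Metric{
		Labels: []Label{
			{Name: "method", Value: "GET"},
			{Name: "response_code", Value: "400"},
		},
		Value: 21,
	}
)
```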
How would we go about getting these two children into a `MetricFamily` such that we could efficiently look up individual `Metric` values based on their respective `[]Label` values?
Those of you with any experience with Go will immediately recognize that `MetricFamily` can't index individual metrics keyed on `[]Label`, due to this note in the Go Language Specification on `map` types:
The comparison operators == and != must be fully defined for operands of the key type; thus the key type must not be a function, map, or slice. If the key type is an interface type, these comparison operators must be defined for the dynamic key values; failure will cause a run-time panic.
And these comparison operators are not implemented for slices per the specification’s section on comparison operators:
Slice, map, and function types are not comparable.
That means code like this is out:
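A sketch of the rejected shape:

```go
// Does not compile: "invalid map key type []Label" because slices are
// not comparable.
type MetricFamily struct {
	Name    string
	Metrics map[[]Label]*Metric
}
```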
Well, when I tell you what I did, you'll recoil in horror (post-morteming is easy to do in retrospect). It looked something like this under the hood:
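A reconstruction of the naive approach (the function name is illustrative; the `fmt.Fprintf` call itself is quoted verbatim later in the article):

```go
import (
	"fmt"
	"hash/fnv"
)

// hashLabels digests each label pair into a single FNV-64a hash,
// passing the hash to fmt.Fprintf as an io.Writer.
func hashLabels(labels []Label) uint64 {
	h := fnv.New64a()
	for _, pair := range labels {
		fmt.Fprintf(h, "%q=%q", pair.Name, pair.Value)
	}
	return h.Sum64()
}
```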
There was a lot wrong with this approach:

- Inefficiency: There was a potential for heap allocations (additional garbage for the garbage collector) due to `h` being passed as an `io.Writer` to `fmt.Fprintf`, as well as the format arguments escaping to the heap.
- Inefficiency: `fmt.Fprintf` has to inspect the values and types of its arguments at runtime in order to print them.
- Obtuseness: Including a `=` in the format string for good measure to prevent keys and values from colliding into each other in the hash generation.
Aside: In my defense, I had hitherto relied on my IDE to generate hash function implementations for me. This was one of my first projects in Go, and I had just come to the language from Java.
Were I to rewrite this in the (near) modern day (preserving the overall morphology of the prior API), I might have considered something like this (we'll reconsider this sketch below):
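A sketch consistent with the step-by-step tables below (the worked values later imply `hashString` is FNV-64a; exactly how `hashLabels`'s loop folds in each label is my assumption):

```go
import "hash/fnv"

// hashString is the baseline primitive: the FNV-64a digest of a string.
func hashString(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// hashLabel folds a label's name and value into one hash with the
// multiply-by-31-and-add recipe explored below.
func hashLabel(l Label) uint64 {
	v := uint64(1)
	v = 31*v + hashString(l.Name)
	v = 31*v + hashString(l.Value)
	return v
}

// hashLabels rolls each label's hash into a hash over the whole slice.
func hashLabels(labels []Label) uint64 {
	v := uint64(1)
	for _, l := range labels {
		v = 31*v + hashLabel(l)
	}
	return v
}
```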
Where `hashLabels` could, in turn, be used to generate the `MetricFamily` as such:
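For example (the constructor name is hypothetical):

```go
// NewMetricFamily indexes each child Metric under the hash of its labels.
func NewMetricFamily(name string, metrics ...*Metric) *MetricFamily {
	mf := &MetricFamily{
		Name:    name,
		Metrics: make(map[uint64]*Metric, len(metrics)),
	}
	for _, m := range metrics {
		mf.Metrics[hashLabels(m.Labels)] = m
	}
	return mf
}
```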
You could imagine the `MetricFamily` being implemented approximately like this now:
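A sketch (the method name `WithLabels` is an assumption; the `if !ok` branch is referenced just below):

```go
type MetricFamily struct {
	Name    string
	Metrics map[uint64]*Metric
}

// WithLabels returns the child Metric for the given labels, creating
// it on first use.
func (mf *MetricFamily) WithLabels(labels []Label) *Metric {
	key := hashLabels(labels)
	m, ok := mf.Metrics[key]
	if !ok {
		m = &Metric{Labels: labels}
		mf.Metrics[key] = m
	}
	return m
}
```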
Aside: You’ll note that this sketch of a metric type omits synchronization. I left that out due to being extraneous to the discussion.
With the typical functional requirements of a telemetry instrumentation library, I could easily imagine `sync.Map` being used instead of an ordinary `map[uint64]*Metric`:

- Loose cross-metric consistency requirements across values.
- Values being effectively set-once (`if !ok` above).
Why a Hash Function Is Written as It Is
I want to take a little bit of time to explore why `hashLabels` and `hashLabel` above were written as they were. I'll reproduce the functions here to save the scrolling:
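As sketched above:

```go
func hashLabel(l Label) uint64 {
	v := uint64(1)
	v = 31*v + hashString(l.Name)
	v = 31*v + hashString(l.Value)
	return v
}

func hashLabels(labels []Label) uint64 {
	v := uint64(1)
	for _, l := range labels {
		v = 31*v + hashLabel(l)
	}
	return v
}
```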
Why is this written as it is? Namely:
- Where does the number 31 come from?
- Why are we multiplying against it and adding new hash values to it?
This approach comes from Bloch's Effective Java (Second Edition, pp. 45–50). From here on, and throughout the rest of this article (except where expressly noted), I am importing some of the Java practices into Go to demonstrate how a lot of this is done by hand.
Let's examine what's happening by walking through this code piecemeal and considering some edge cases. First we'll examine what `hashString` computes as a baseline:
| Input | Output |
|---|---|
| "method" | 2525399976365011888 |
| "GET" | 16897051813516574231 |
| "response_code" | 6624505961732004106 |
| "200" | 6935200482265501725 |
| "400" | 3159633682633943387 |
(Source)
31 is a prime number. It is also odd. This means that the individual hash computation sequences found in `hashLabels`'s for-loop can be transformed into unique values when multiplied with it. In the context of Effective Java, another machine-, architecture-, and runtime-specific reason was cited for 31:

The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern [Java] VMs do this sort of optimization automatically.
So let's work through the computation of the hash code of a single `Label` value step-by-step:
h := hashLabel(metric1.Labels[0]) // {Name: "method", Value: "GET"}
| Step | v Before | v After | What's Computed | Input |
|---|---|---|---|---|
| 1 | - | 1 | Initial Assignment | - |
| 2 | 1 | 2525399976365011919 (31 * 1 + 2525399976365011888) | Hash of Label's Name | hashString("method") |
| 3 | 2525399976365011919 | 2950730712284185640 (31 * 2525399976365011919 + 16897051813516574231) | Hash of Label's Value | hashString("GET") |
Collision Risk
This rolling mechanism for hashing has several nice properties that my naive `fmt.Fprintf(h, "%q=%q", pair.Name, pair.Value)` implementation above was also trying to address: it ensures that label names and values that concatenate to the same string are not incorrectly treated as the same `Metric` by way of having the same hash.
Let's suppose we want to run `hashLabel` on two more labels that are very similar in that they concatenate to the same string value:

- "method" + "" = "method"
- "metho" + "d" = "method"
| No. | Name | Value |
|---|---|---|
| 1 | "method" | "" |
| 2 | "metho" | "d" |
Input: `hashLabel(Label{Name: "method", Value: ""})`

Output: 749659938114267446

| Step | v Before | v After | What's Computed | Input |
|---|---|---|---|---|
| 1 | - | 1 | Initial Assignment | - |
| 2 | 1 | 2525399976365011919 (31 * 1 + 2525399976365011888) | Hash of Label's Name | hashString("method") |
| 3 | 2525399976365011919 | 749659938114267446 (31 * 2525399976365011919 + 14695981039346656037) | Hash of Label's Value | hashString("") |
Input: `hashLabel(Label{Name: "metho", Value: "d"})`

Output: 6921288831359542208

| Step | v Before | v After | What's Computed | Input |
|---|---|---|---|---|
| 1 | - | 1 | Initial Assignment | - |
| 2 | 1 | 4576034113516619283 (31 * 1 + 4576034113516619252) | Hash of Label's Name | hashString("metho") |
| 3 | 4576034113516619283 | 6921288831359542208 (31 * 4576034113516619283 + 12638183902020757363) | Hash of Label's Value | hashString("d") |
What's important to note is how the hashes of `Label{Name: "method", Value: ""}` and `Label{Name: "metho", Value: "d"}` do not collide (respectively 749659938114267446 and 6921288831359542208).
The rolling effect of multiplying the existing hash by the prime and then adding the new value incrementally ensures that the hash computation process does not have commutative properties.
For the sake of argument, let's examine what happens with my `fmt.Fprintf(h, "%q=%q", pair.Name, pair.Value)` approach:
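A runnable sketch of that approach (the helper name is mine):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

func hashPair(name, value string) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%q=%q", name, value)
	return h.Sum64()
}

func main() {
	fmt.Println(hashPair("method", "")) // digests "method"=""
	fmt.Println(hashPair("metho", "d")) // digests "metho"="d"
}
```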
| Label Name | Label Value | Format String | Computed Hash |
|---|---|---|---|
| "method" | "" | "method"="" | 2044830391780925169 |
| "metho" | "d" | "metho"="d" | 7086825952594498043 |
Like the rolling approach, this one is also impervious to the concatenation problem.
Using a format string like `fmt.Fprintf(h, "%v%v", pair.Name, pair.Value)` makes the identity defect that the original design solves for more evident:
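A sketch of the defective variant (again, the helper name is mine):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

func hashPairAmbiguously(name, value string) uint64 {
	h := fnv.New64a()
	// No delimiting: "method"+"" and "metho"+"d" feed the hash the
	// identical byte stream "method".
	fmt.Fprintf(h, "%v%v", name, value)
	return h.Sum64()
}

func main() {
	fmt.Println(hashPairAmbiguously("method", "")) // same as the next line!
	fmt.Println(hashPairAmbiguously("metho", "d"))
}
```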
| Label Name | Label Value | Format String | Computed Hash |
|---|---|---|---|
| "method" | "" | "method" | 2525399976365011888 |
| "metho" | "d" | "method" | 2525399976365011888 |
That’s probably not the behavior you wanted.
Similarly, we can demonstrate what happens when a defective hash computation lacks the prime multiply-and-add process and thereby exhibits the commutative property:
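A sketch of such a defective computation:

```go
// badHashLabel merely sums the constituent hashes; because addition
// commutes, swapping name and value yields the same result.
func badHashLabel(l Label) uint64 {
	return hashString(l.Name) + hashString(l.Value)
}
```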
| Label Name | Label Value | Computed Hash |
|---|---|---|
| "method" | "GET" | 975707716172034503 |
| "GET" | "method" | 975707716172034503 |
Observe how swapping the label name and value results in the same hash! This would not happen with `hashLabel`, the `"%q=%q"` format string, or prime multiplication and addition:
| Label Name | Label Value | Prime Multiplication and Addition | "%q=%q" Format String |
|---|---|---|---|
| "method" | "GET" | 2950730712284185640 | 18332136931986247339 |
| "GET" | "method" | 9825172131511368762 | 6902691168400010101 |
Looking at a More Complex Data Type
Now that we’ve shown some of the pitfalls of careless hash function design, let’s explore some techniques for handling more complex types. Suppose we have a complex type as follows that we want to hash. How could we do it?
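Reconstructed from the field-by-field discussion below; the field names and their order follow the text, while the element types of `Optional`, `Unordered`, `Variable`, and `transient` are assumptions:

```go
type TheZoo struct {
	ID        int16             // fixed size
	Optional  *int32            // may be absent
	Unsized   int               // machine architecture-specific size
	Unordered map[string]string // unordered; may be nil
	Small     bool
	Variable  []string          // variable length
	transient any               // not significant for hashing
	Recursive *TheZoo           // may be absent
}
```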
Aside: These field names are not great. I am choosing them in lieu of using quasi-meaningless names like “foo” and “bar”.
The name “transient” is an homage to Java serialization’s concept of transience, which is to say that it is data that we don’t care about for purposes of storage or comparison.
The data type `TheZoo` provides us with a slew of considerations to address:

- We see data of fixed sizes (e.g., `int16`) and variable sizes (e.g., `Unsized`, `Unordered`, and `Variable`).
- We are confronted with data that may not be present (e.g., `Optional`, `Unordered`, and `Recursive`).
- We need to handle data that may not need to be hashed at all (e.g., `transient`).
- We have data that could have special semantic meaning to consider in terms of absence and how to map that absence to a value space that contains an empty set of data (e.g., `Unordered` and `Variable`).
This is a rather tall order to reconcile. The first step I would take is to employ a divide-and-conquer approach whereby we handle each field individually. We can define a small helper function that enables us to compose the hash value piecemeal from the successive field values:
Aside: This kind of divide-and-conquer approach is not strictly needed, but it allows us to zero in on the nuance of how each case is handled in isolation, thereby reducing noise. It provides a nice property: each field's value is treated as a black box and doesn't pollute `hashTheZoo` with extraneous complexity.
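A minimal sketch of such a helper (the name `compose` is mine):

```go
// compose folds the next field's hash into the accumulator using the
// prime multiply-and-add recipe from earlier.
func compose(acc, fieldHash uint64) uint64 {
	return 31*acc + fieldHash
}
```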
First let's look at a structure for how to hash `TheZoo`. We need to handle the case that `TheZoo` is `nil` per the original field definition above, so we end up with this opening definition:
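A sketch of the frontmatter:

```go
func hashTheZoo(z *TheZoo) uint64 {
	v := uint64(1)
	if z == nil {
		// Absence gets its own representation in the value space …
		return compose(v, 0)
	}
	// … distinct from presence.
	v = compose(v, 1)

	// Field-by-field handling follows below.

	return v
}
```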
We need to communicate that the value is present or absent, which is needed later.
First Field
Now we can look at the fields in order. For `ID` we can directly represent an `int16` in `uint64`'s value space.
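Continuing inside `hashTheZoo` (the `uint16` conversion is my choice to keep sign extension from smearing the high bits):

```go
// An int16's 16 bits embed directly in uint64's value space.
v = compose(v, uint64(uint16(z.ID)))
```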
Second Field
Like with the frontmatter of `hashTheZoo`, we need to handle data that is optional, which presents some challenges in that we need to:

- represent the value in its value space when it is present
- distinctly represent the condition when the value is absent

A common technique is to use separate paths for value generation, as you can see below:
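Continuing inside `hashTheZoo`:

```go
if z.Optional == nil {
	v = compose(v, 0) // absent
} else {
	v = compose(v, 1) // present, followed by the value itself
	v = compose(v, uint64(uint32(*z.Optional)))
}
```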
Third Field
We now move on to field `Unsized`, which uses a machine architecture-specific size. In its largest form, an `int`'s bits will fit into the space of a `uint64`.
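Continuing inside `hashTheZoo`:

```go
// int is 32 or 64 bits depending on the architecture; either fits.
v = compose(v, uint64(z.Unsized))
```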
Fourth Field
Handling `Unordered` requires considerable care:

- Traversing a map is non-deterministic.
- We need to encode map state (e.g., the count of items) when hashing to reduce the risk of collisions.

Aside: Whether map traversal should be deterministic seems to be a near-religious conversation among some developers. My personal approach in developing and maintaining systems is not to rely on fragile implicit coupling of components and their behavior, so I don't mind the non-determinism; I actually prefer it.
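A sketch that sorts the keys for a deterministic traversal (using `maps.Keys` and `slices.Sorted` from the standard library):

```go
// Encode the map's size first; a nil map hashes the same as an empty one.
v = compose(v, uint64(len(z.Unordered)))
for _, k := range slices.Sorted(maps.Keys(z.Unordered)) {
	v = compose(v, hashString(k))
	v = compose(v, hashString(z.Unordered[k]))
}
```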
I've taken a similar position with maps to the one the Google Go Style Guide takes on `nil` slices: a `nil` map is treated the same as `len(m) == 0`.
Aside: Java takes an interesting approach in `AbstractMap#hashCode`, whereby each key-value pair is hashed and summed, and the rolling sum is returned to the caller. The common `Map` implementation `HashMap` contains an internal type called `Node` that instead XORs the key's hash with the value's. This approach is commutative, meaning it does not care about the order of the `AbstractMap`'s elements. As a fun bit of trivia: since Java does not encode the length of the map in the hash code, it is not inconceivable (however improbable) for a collision to arise between a 0-element `AbstractMap` and an `AbstractMap` containing two elements whose hashes cancel each other out to zero, which, again, could happen for any number of items whose sum works out to be as such.
Fifth Field
There are multiple ways of representing boolean values like field `Small`. I took inspiration from the implementation found in Java's `Boolean#hashCode`:
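Continuing inside `hashTheZoo`:

```go
// After Java's Boolean#hashCode: one prime per state.
if z.Small {
	v = compose(v, 1231)
} else {
	v = compose(v, 1237)
}
```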
Both `1231` and `1237` are prime and are sufficiently large to add noise to a composite hash value. If you want a fun treat, decompose each of those numbers into their constituent digits and add them together; see if you notice anything!
Sixth Field
Handling of a repeated field like `Variable` (a slice) is interesting. It is not too dissimilar from `Unordered`. Here is Bloch again:
If the field is an array, treat it as if each significant element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine the values per step 2.b. If the array has no significant elements, use a constant, preferably not 0. If all elements are significant, use Arrays.hashCode.
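Applying that advice inside `hashTheZoo`:

```go
// As with Unordered: encode the length, then each element in order.
v = compose(v, uint64(len(z.Variable)))
for _, e := range z.Variable {
	v = compose(v, hashString(e))
}
```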
Seventh and Eighth Fields
We can ignore field `transient` (see Bloch's allusion to "significant" above), so we continue to field `Recursive`. `hashTheZoo`'s frontmatter already included handling for this case, so we merely need to call the function recursively.
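Closing out `hashTheZoo`:

```go
// transient is skipped entirely; Recursive reuses hashTheZoo's own
// nil handling.
v = compose(v, hashTheZoo(z.Recursive))
```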
Wrapping Up
As you can see, doing all of this by hand involves a fair bit of work; it is toilsome and error-prone.
Many of the considerations we see here are not too dissimilar from what those who implement type-length-value (TLV) serialization protocols face. TLV is a common encoding scheme where each piece of data is bundled with a tag identifying its type and a prefix specifying its length, ensuring that a stream of data can be parsed without ambiguity (see also: Protocol Buffers). The approach I took above was rather Java-centric in terms of how a slew of the algorithms are implemented.
Making this More Idiomatic
It's the year 2025. Surely we can do better than this, right? Let's explore package `maphash` from the Go standard library. (The package itself arrived back in Go 1.14; the `maphash.Comparable` and `maphash.WriteComparable` APIs we'll lean on below arrived in Go 1.24.) This will enable us to approach this problem in a much more concise and Go-idiomatic way. Package `maphash` has some rather interesting caveats about its use, so I recommend you read its API documentation before using it (e.g., about not durably storing the hash values). I won't go into detail about that here; that's for you to read about.
Let's start by amending the structure of the hashing function for `TheZoo`. I've reframed `hashTheZoo` now as this:
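A sketch under the same field assumptions as before (the `writeHash` prefix mirrors the convention discussed next):

```go
import (
	"hash/maphash"
	"maps"
	"slices"
)

// writeHashTheZoo appends *TheZoo's hash input to a caller-supplied
// *maphash.Hash rather than producing a finished sum itself.
func writeHashTheZoo(h *maphash.Hash, z *TheZoo) {
	if z == nil {
		maphash.WriteComparable(h, 0) // absent
		return
	}
	maphash.WriteComparable(h, 1) // present
	maphash.WriteComparable(h, z.ID)
	if z.Optional == nil {
		maphash.WriteComparable(h, 0)
	} else {
		maphash.WriteComparable(h, 1)
		maphash.WriteComparable(h, *z.Optional)
	}
	maphash.WriteComparable(h, z.Unsized)
	maphash.WriteComparable(h, len(z.Unordered))
	for _, k := range slices.Sorted(maps.Keys(z.Unordered)) {
		maphash.WriteComparable(h, k)
		maphash.WriteComparable(h, z.Unordered[k])
	}
	maphash.WriteComparable(h, z.Small)
	maphash.WriteComparable(h, len(z.Variable))
	for _, e := range z.Variable {
		maphash.WriteComparable(h, e)
	}
	// transient is skipped; Recursive reuses the nil handling above.
	writeHashTheZoo(h, z.Recursive)
}
```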
The core thing to note is that this follows a structure with several conventions that package `maphash` sets. The name and signature tell users that they can use this function to add the hash of a `*TheZoo` value to any `*maphash.Hash` value, be that:

- a new hash with no state, or
- an existing hash (that might be hashing an outer type that has `*TheZoo` as a field).
And nothing stops you from offering a helper function like this in addition:
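For example:

```go
// hashTheZoo produces a finished sum from a seed in one call.
func hashTheZoo(seed maphash.Seed, z *TheZoo) uint64 {
	var h maphash.Hash
	h.SetSeed(seed)
	writeHashTheZoo(&h, z)
	return h.Sum64()
}
```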
Now, `writeHashTheZoo` and `hashTheZoo` could have been methods on `*TheZoo`, too. I elected to use free-standing functions to keep the domains more separated for demonstration value.
Deeper Dive into Modern Mechanics
One thing should be immediately evident to you: this implementation has to take into account far fewer things than the original Java-based approach did.
I want to explain some of that nuance here.
Important: `maphash.WriteComparable` internally performs TLV-like delimiting on the data it hashes, meaning that capturing information on string length, item count, and the like is not needed when you use this API directly. This stands in contrast to the other APIs in package `maphash`.
Try running this code snippet to see what I mean:
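A sketch contrasting the plain byte-stream writes with `maphash.WriteComparable`:

```go
package main

import (
	"fmt"
	"hash/maphash"
)

func main() {
	seed := maphash.MakeSeed()

	// WriteString appends raw bytes: both hashes see the identical
	// stream "method", so they are guaranteed equal.
	var h1, h2 maphash.Hash
	h1.SetSeed(seed)
	h2.SetSeed(seed)
	h1.WriteString("method")
	h1.WriteString("")
	h2.WriteString("metho")
	h2.WriteString("d")
	fmt.Println(h1.Sum64() == h2.Sum64()) // true

	// WriteComparable frames each value, so the two sequences differ.
	var h3, h4 maphash.Hash
	h3.SetSeed(seed)
	h4.SetSeed(seed)
	maphash.WriteComparable(&h3, "method")
	maphash.WriteComparable(&h3, "")
	maphash.WriteComparable(&h4, "metho")
	maphash.WriteComparable(&h4, "d")
	fmt.Println(h3.Sum64() == h4.Sum64()) // false (with overwhelming probability)
}
```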
(Run this)
A takeaway for me is that, when hashing discrete items in their totality, I should lean on `maphash.WriteComparable` insofar as possible.
A second observation is that I encoded the lengths of fields `Unordered` and `Variable` in both the manual hashing and the `maphash`-based hashing. This might not have been strictly needed here, but note that it helps produce unambiguously distinguishable results in the typical case. To see why, let's suppose we had a struct like this that we wanted to hash:
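A sketch (the field types are assumptions; the name `DynamicComposite` is referenced again later):

```go
type DynamicComposite struct {
	A, B, C, D, E, F []string
}
```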
If we did not include the lengths of fields `A`, `B`, `C`, `D`, `E`, and `F` and instead only encoded the elements when they are present, we'd automatically get collisions for values like these:
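For instance:

```go
// Without lengths (or other per-field framing), both values contribute
// exactly one element "x" to otherwise identical hash input.
a := DynamicComposite{A: []string{"x"}}
b := DynamicComposite{F: []string{"x"}}
// hash(a) == hash(b), even though a != b.
```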
Similar reasoning applies to the pointer presence used with `Optional` and `Recursive`. Consecutive repeated fields suffer the same risk:
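For instance:

```go
// Without lengths, adjacent slices blur together: both values flatten
// to the element stream "x", "y".
c := DynamicComposite{A: []string{"x", "y"}}
d := DynamicComposite{A: []string{"x"}, B: []string{"y"}}
// hash(c) == hash(d), even though c != d.
```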
Revisiting Labels with a Modern, Idiomatic API
So, back to the original `hashLabel` assignment: let's reframe it with package `maphash` using what we've learned.
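A sketch (the function names are mine, following the `writeHash` convention from above):

```go
// writeHashLabel appends one label. WriteComparable frames each string,
// so no manual "=" delimiter is needed to prevent concatenation collisions.
func writeHashLabel(h *maphash.Hash, l Label) {
	maphash.WriteComparable(h, l.Name)
	maphash.WriteComparable(h, l.Value)
}

// writeHashLabels encodes the length up front and requires labels to be
// sorted by name (the sorting invariant) so that permutations of the
// same label set hash identically.
func writeHashLabels(h *maphash.Hash, labels []Label) {
	maphash.WriteComparable(h, len(labels))
	for _, l := range labels {
		writeHashLabel(h, l)
	}
}
```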
And, now, you can see how we arrived at such a clear, concise, and deliberate outcome.
An Even Better Approach
As a nice followup, Klaus Post kindly pointed out a better approach that has some interesting properties in that it:

- waives the need for the sorting invariant above,
- has commutative properties (in part), which has no negative impact on the computed outcome if the clients of this API use it correctly, and
- parallels how Java's `HashMap#hashCode` is implemented under the hood.
A naive implementation would look like this:
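A sketch of that shape, consistent with the analogy and the `hashMetric` calls below:

```go
// hashLabels hashes each label independently and sums the per-label
// hashes; because addition commutes, label order cannot matter.
func hashLabels(seed maphash.Seed, labels []Label) uint64 {
	var sum uint64
	for _, l := range labels {
		var h maphash.Hash
		h.SetSeed(seed)
		maphash.WriteComparable(&h, l.Name)
		maphash.WriteComparable(&h, l.Value)
		sum += h.Sum64()
	}
	return sum
}

// hashMetric folds the order-independent label sum into a final hash.
func hashMetric(seed maphash.Seed, m Metric) uint64 {
	var h maphash.Hash
	h.SetSeed(seed)
	maphash.WriteComparable(&h, hashLabels(seed, m.Labels))
	return h.Sum64()
}
```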
Above, `hashLabels` is analogous to `HashMap.Node#hashCode`, and `hashMetric` to `AbstractMap#hashCode`.
For concision, I inverted the definition to accept a `maphash.Seed` instead of a `*maphash.Hash`. This helps avoid the extraneous noise that would arise from calling `(*maphash.Hash).Clone`, handling an error that should never happen, computing constituent sums, and so on and so forth.
Caution: `hashMetric` above does not encode the length of `m.Labels` the way the implementation in the previous section does. That is not a problem on its own, but using a pattern like this in `*TheZoo` or `DynamicComposite` means you would still need to differentiate between the constituent fields by hashing something into the overall container type's representation.
```go
type MetricsAndMetricsAndMoreMetrics struct{ A, B, C Metric }
```
If you want to hash `MetricsAndMetricsAndMoreMetrics`, you should be able to differentiate between an equivalent value of `Metric` being in field `A`, `B`, or `C`. Observe how `hashMetric`'s output is itself included in the outer value for `MetricsAndMetricsAndMoreMetrics`'s hash.
```go
func (m *MetricsAndMetricsAndMoreMetrics) Hash(seed maphash.Seed) uint64 {
	var h maphash.Hash
	h.SetSeed(seed)
	if m == nil {
		maphash.WriteComparable(&h, 0)
		return h.Sum64()
	}
	maphash.WriteComparable(&h, 1)
	maphash.WriteComparable(&h, hashMetric(seed, m.A))
	maphash.WriteComparable(&h, hashMetric(seed, m.B))
	maphash.WriteComparable(&h, hashMetric(seed, m.C))
	return h.Sum64()
}
```
Thank you, Klaus! :-)
Closing
You might wonder why I started this article by demonstrating hashing by hand. I figure it is a useful exploration to characterize how we got to today and the type of care that had to be taken. It provides a wonderful practical introduction and the context needed to appreciate why a facility like package `maphash` is so nice to have.
Now, I can imagine some readers taking umbrage that I’ve saved a crucial point for the end: in the vast majority of situations, creating a manual hashing mechanism is unnecessary. The Go runtime provides a built-in, highly efficient facility for hashing any comparable type. It automatically leverages architecture-specific capabilities without you having to. A good way of understanding this is to peel back the curtain of how the runtime implements maps (namely with a necessary detour into the compiler):
`genhash` is a rather interesting read, particularly the section on generating hashes for custom composite types using `hashFunc`. I haven't dug into its guts too deeply, but it appears to recursively generate instructions that get added to an intermediate representation (IR) store and are eventually returned as a closure. Much of this uses `runtime·memhash`, which will use hardware hashing support like AES if it is available.
From this, the compiler assigns the hashing algorithm to `abi.SwissMapType`'s `Hasher` field, which the runtime uses.
And the fun part about this: package `maphash` is largely implemented in terms of the compiler+runtime mechanisms above! You'll find that both use `runtime·memhash`.
In any case, it is better to understand how this is implemented than not. May you learn your life lessons less painfully than I did, and certainly with less embarrassment.