Zürich, Switzerland
I am recovering from surgery right now and have some time on my hands, so I wanted to share a story about a learning experience.
This post serves a dual purpose: it is a teachable moment for software engineers, and it is a chance for me to offer a retrospective on bad decisions and earn penance for them.
One episode sticks out in my memory with a fair bit of shame: implementing the original metric hashing for the metric family data type in Prometheus's client metrics. This hashing was used in the internals of the client metrics, so users wouldn't be exposed to it directly, except in terms of CPU cost!
You see, I implemented it initially in a very naive way, using FNV to digest each of the metric key-value metadata pairs. The underlying data type looked something like this:
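A sketch of that shape, reconstructed from the discussion that follows (the `Value` field type is an assumption; `Label`, `Metric`, and the hash-keyed `map[uint64]*Metric` index are what the rest of the article relies on):

```go
type Label struct {
	Name  string
	Value string
}

type Metric struct {
	Labels []Label
	Value  float64
}

type MetricFamily struct {
	Name string
	// Metrics indexes each child Metric by a hash of its Labels.
	Metrics map[uint64]*Metric
}
```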
Note: The eagle-eyed reader will immediately notice that hash code-based indexing needs to consider the eventuality of a legitimate hash collision and that the proposed structure above does not solve for that. It is OK to ignore that problem, as it is merely tangential to the story. I have reduced the problem to its essence: bad, specious hashing on my part.
If why I am calling this out draws a blank stare from you as a reader, consider how the backing storage of something like a Go `map` or a Java `HashMap` works: there is a backing store of entries (e.g., linked list, bucket, or similar) that provides the ability to disambiguate between keys whose hash codes collide. This is one of the reasons that Java requires classes that override `Object#hashCode` to also provide `Object#equals` (i.e., to use `Object#equals` to disambiguate between entries whose keys produce the same `Object#hashCode`).
Metrics were identified by their labels (`[]Label`). In real-world systems, you'd have label pairs like these, since the data describes a real-world system in a deaggregated way. To understand this, consider a metric family around a parent metric where there are multiple child metrics that represent individual label pairs. Structurally this looks as follows:
- metric_parent: `http_requests_total`
  - child metric
    - labels:
      - `method`: `GET`
      - `response_code`: `200`
    - value: 42
  - child metric
    - labels:
      - `method`: `GET`
      - `response_code`: `400`
    - value: 21
  - child metric
    - …
We can see that the server served a total of 63 GET requests (what manages and implements this metric is beside the point). So how do we represent this state in the telemetry instrumentation library? Let’s consider this value literal:
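Reconstructed here as two `Metric` values (the names `metric1` and `metric2` are assumed; a later snippet references `metric1`):

```go
var (
	metric1 = Metric{
		Labels: []Label{
			{Name: "method", Value: "GET"},
			{Name: "response_code", Value: "200"},
		},
		Value: 42,
	}
	metric2 = Metric{
		Labels: []Label{
			{Name: "method", Value: "GET"},
			{Name: "response_code", Value: "400"},
		},
		Value: 21,
	}
)
```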
How would we go about getting these two children into a `MetricFamily` such that we could efficiently look up individual `Metric` values based on their respective `[]Label` values?
Those of you with any experience with Go will immediately recognize that `MetricFamily` can't index individual metrics keyed on `[]Label`, due to this note in the Go Language Specification on `map` types:
The comparison operators == and != must be fully defined for operands of the key type; thus the key type must not be a function, map, or slice. If the key type is an interface type, these comparison operators must be defined for the dynamic key values; failure will cause a run-time panic.
And these comparison operators are not implemented for slices per the specification’s section on comparison operators:
Slice, map, and function types are not comparable.
That means code like this is out:
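A sketch of the rejected shape:

```go
// Does not compile: "invalid map key type []Label" because slices are
// not comparable.
type MetricFamily struct {
	Name    string
	Metrics map[[]Label]*Metric
}
```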
Well, when I tell you what I did, you'll recoil in horror (post-morteming is easy to do in retrospect). It looked something like this under the hood:
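A reconstruction of the naive approach (the function name is illustrative; the `fmt.Fprintf` call itself is quoted verbatim later in the article):

```go
import (
	"fmt"
	"hash/fnv"
)

// hashLabels digests each label pair into a single FNV-64a hash,
// passing the hash to fmt.Fprintf as an io.Writer.
func hashLabels(labels []Label) uint64 {
	h := fnv.New64a()
	for _, pair := range labels {
		fmt.Fprintf(h, "%q=%q", pair.Name, pair.Value)
	}
	return h.Sum64()
}
```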
There was a lot wrong with this approach:

- Inefficiency: There was a potential for heap allocations (additional garbage for the garbage collector) due to `h` being passed as an `io.Writer` to `fmt.Fprintf`, as well as the format arguments escaping to the heap.
- Inefficiency: `fmt.Fprintf` has to inspect the values and types of its arguments at runtime in order to print them.
- Obtuseness: Including a `=` in the format string for good measure to prevent keys and values from colliding into each other in the hash generation.
Aside: In my defense, I had hitherto relied on my IDE to generate hash function implementations for me. This was one of my first projects in Go, and I had just come to the language from Java.
Were I to rewrite this in the (near) modern day (preserving the overall morphology of the prior API), I might have considered something like this (we'll reconsider this sketch below):
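A sketch consistent with the step-by-step tables below (the worked values later imply `hashString` is FNV-64a; exactly how `hashLabels`'s loop folds in each label is my assumption):

```go
import "hash/fnv"

// hashString is the baseline primitive: the FNV-64a digest of a string.
func hashString(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// hashLabel folds a label's name and value into one hash with the
// multiply-by-31-and-add recipe explored below.
func hashLabel(l Label) uint64 {
	v := uint64(1)
	v = 31*v + hashString(l.Name)
	v = 31*v + hashString(l.Value)
	return v
}

// hashLabels rolls each label's hash into a hash over the whole slice.
func hashLabels(labels []Label) uint64 {
	v := uint64(1)
	for _, l := range labels {
		v = 31*v + hashLabel(l)
	}
	return v
}
```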
Where `hashLabels` could, in turn, be used to generate the `MetricFamily` as such:
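For example (the constructor name is hypothetical):

```go
// NewMetricFamily indexes each child Metric under the hash of its labels.
func NewMetricFamily(name string, metrics ...*Metric) *MetricFamily {
	mf := &MetricFamily{
		Name:    name,
		Metrics: make(map[uint64]*Metric, len(metrics)),
	}
	for _, m := range metrics {
		mf.Metrics[hashLabels(m.Labels)] = m
	}
	return mf
}
```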
You could imagine the `MetricFamily` being implemented approximately like this now:
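A sketch (the method name `WithLabels` is an assumption; the `if !ok` branch is referenced just below):

```go
type MetricFamily struct {
	Name    string
	Metrics map[uint64]*Metric
}

// WithLabels returns the child Metric for the given labels, creating
// it on first use.
func (mf *MetricFamily) WithLabels(labels []Label) *Metric {
	key := hashLabels(labels)
	m, ok := mf.Metrics[key]
	if !ok {
		m = &Metric{Labels: labels}
		mf.Metrics[key] = m
	}
	return m
}
```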
Aside: You’ll note that this sketch of a metric type omits synchronization. I left that out due to being extraneous to the discussion.
With the typical functional requirements of a telemetry instrumentation library, I could easily imagine `sync.Map` being used instead of an ordinary `map[uint64]*Metric`:

- Loose cross-metric consistency requirements across values.
- Values being effectively set-once (`if !ok` above).
Why a Hash Function Is Written as It Is
I want to take a little bit of time to explore why `hashLabels` and `hashLabel` above were written as they were. I'll reproduce the functions here to save the scrolling:
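As sketched above:

```go
func hashLabel(l Label) uint64 {
	v := uint64(1)
	v = 31*v + hashString(l.Name)
	v = 31*v + hashString(l.Value)
	return v
}

func hashLabels(labels []Label) uint64 {
	v := uint64(1)
	for _, l := range labels {
		v = 31*v + hashLabel(l)
	}
	return v
}
```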
Why is this written as it is? Namely:
- Where does the number 31 come from?
- Why are we multiplying against it and adding new hash values to it?
This approach comes from Bloch's Effective Java (Second Edition, pp. 45–50). From here on, and throughout the rest of this article (except where expressly noted), I am importing some of the Java practices into Go to demonstrate how a lot of this is done by hand.
Let's examine what's happening by walking through this code piecemeal and considering some edge cases. First we'll examine what `hashString` computes as a baseline:
| Input | Output |
|---|---|
| "method" | 2525399976365011888 |
| "GET" | 16897051813516574231 |
| "response_code" | 6624505961732004106 |
| "200" | 6935200482265501725 |
| "400" | 3159633682633943387 |
(Source)
31 is a prime number. It is also odd. This means that the individual hash computation sequences found in `hashLabels`'s for-loop can be transformed into unique values when multiplied with it. In the context of Effective Java, another machine-, architecture-, and runtime-specific reason was cited for 31:

The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern [Java] VMs do this sort of optimization automatically.
So let's work through the computation of the hash code of a single `Label` value step-by-step:
h := hashLabel(metric1.Labels[0]) // {Name: "method", Value: "GET"}
| Step | v Before | v After | What's Computed | Input |
|---|---|---|---|---|
| 1 | - | 1 | Initial Assignment | - |
| 2 | 1 | 2525399976365011919 (31 * 1 + 2525399976365011888) | Hash of Label's Name | hashString("method") |
| 3 | 2525399976365011919 | 2950730712284185640 (31 * 2525399976365011919 + 16897051813516574231) | Hash of Label's Value | hashString("GET") |
Collision Risk
This rolling mechanism for hashing has several nice properties that my naive `fmt.Fprintf(h, "%q=%q", pair.Name, pair.Value)` implementation above was also trying to address: it ensures that label names and values that concatenate to the same string are not incorrectly treated as the same `Metric` by way of having the same hash.
Let's suppose we want to run `hashLabel` on two more labels that are very similar in that they concatenate to the same string value:

- "method" + "" = "method"
- "metho" + "d" = "method"
| No. | Name | Value |
|---|---|---|
| 1 | "method" | "" |
| 2 | "metho" | "d" |
Input: `hashLabel(Label{Name: "method", Value: ""})`

Output: 749659938114267446

| Step | v Before | v After | What's Computed | Input |
|---|---|---|---|---|
| 1 | - | 1 | Initial Assignment | - |
| 2 | 1 | 2525399976365011919 (31 * 1 + 2525399976365011888) | Hash of Label's Name | hashString("method") |
| 3 | 2525399976365011919 | 749659938114267446 (31 * 2525399976365011919 + 14695981039346656037) | Hash of Label's Value | hashString("") |
Input: `hashLabel(Label{Name: "metho", Value: "d"})`

Output: 6921288831359542208

| Step | v Before | v After | What's Computed | Input |
|---|---|---|---|---|
| 1 | - | 1 | Initial Assignment | - |
| 2 | 1 | 4576034113516619283 (31 * 1 + 4576034113516619252) | Hash of Label's Name | hashString("metho") |
| 3 | 4576034113516619283 | 6921288831359542208 (31 * 4576034113516619283 + 12638183902020757363) | Hash of Label's Value | hashString("d") |
What's important to note is how the hashes of `Label{Name: "method", Value: ""}` and `Label{Name: "metho", Value: "d"}` do not collide (respectively 749659938114267446 and 6921288831359542208).
The rolling effect of multiplying the existing hash by the prime and then adding the new value incrementally ensures that the hash computation process does not have commutative properties.
For the sake of argument, let's examine what happens with my `fmt.Fprintf(h, "%q=%q", pair.Name, pair.Value)` approach:
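A runnable sketch of that approach (the helper name is mine):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

func hashPair(name, value string) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%q=%q", name, value)
	return h.Sum64()
}

func main() {
	fmt.Println(hashPair("method", "")) // digests "method"=""
	fmt.Println(hashPair("metho", "d")) // digests "metho"="d"
}
```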
| Label Name | Label Value | Format String | Computed Hash |
|---|---|---|---|
| "method" | "" | "method"="" | 2044830391780925169 |
| "metho" | "d" | "metho"="d" | 7086825952594498043 |
Like the rolling approach, this one is also impervious to the concatenation problem.
Using a format string like `fmt.Fprintf(h, "%v%v", pair.Name, pair.Value)` makes the identity defect that the original design solves for more evident:
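A sketch of the defective variant (again, the helper name is mine):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

func hashPairAmbiguously(name, value string) uint64 {
	h := fnv.New64a()
	// No delimiting: "method"+"" and "metho"+"d" feed the hash the
	// identical byte stream "method".
	fmt.Fprintf(h, "%v%v", name, value)
	return h.Sum64()
}

func main() {
	fmt.Println(hashPairAmbiguously("method", "")) // same as the next line!
	fmt.Println(hashPairAmbiguously("metho", "d"))
}
```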
| Label Name | Label Value | Format String | Computed Hash |
|---|---|---|---|
| "method" | "" | "method" | 2525399976365011888 |
| "metho" | "d" | "method" | 2525399976365011888 |
That’s probably not the behavior you wanted.
Similarly, we can demonstrate what happens when a defective hash computation lacks the prime multiply-and-add process and thereby exhibits the commutative property:
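A sketch of such a defective computation:

```go
// badHashLabel merely sums the constituent hashes; because addition
// commutes, swapping name and value yields the same result.
func badHashLabel(l Label) uint64 {
	return hashString(l.Name) + hashString(l.Value)
}
```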
| Label Name | Label Value | Computed Hash |
|---|---|---|
| "method" | "GET" | 975707716172034503 |
| "GET" | "method" | 975707716172034503 |
Observe how swapping the label name and value results in the same hash! This would not happen with `hashLabel`, the `"%q=%q"` format string, or prime multiplication and addition:
| Label Name | Label Value | Prime Multiplication and Addition | "%q=%q" Format String |
|---|---|---|---|
| "method" | "GET" | 2950730712284185640 | 18332136931986247339 |
| "GET" | "method" | 9825172131511368762 | 6902691168400010101 |
Looking at a More Complex Data Type
Now that we’ve shown some of the pitfalls of careless hash function design, let’s explore some techniques for handling more complex types. Suppose we have a complex type as follows that we want to hash. How could we do it?
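Reconstructed from the field-by-field discussion below; the field names and their order follow the text, while the element types of `Optional`, `Unordered`, `Variable`, and `transient` are assumptions:

```go
type TheZoo struct {
	ID        int16             // fixed size
	Optional  *int32            // may be absent
	Unsized   int               // machine architecture-specific size
	Unordered map[string]string // unordered; may be nil
	Small     bool
	Variable  []string          // variable length
	transient any               // not significant for hashing
	Recursive *TheZoo           // may be absent
}
```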
Aside: These field names are not great. I am choosing them in lieu of using quasi-meaningless names like “foo” and “bar”.
The name “transient” is an homage to Java serialization’s concept of transience, which is to say that it is data that we don’t care about for purposes of storage or comparison.
The data type `TheZoo` provides us with a slew of considerations to address:

- We see data of fixed sizes (e.g., `int16`) and variable sizes (e.g., `Unsized`, `Unordered`, and `Variable`).
- We are confronted with data that may not be present (e.g., `Optional`, `Unordered`, and `Recursive`).
- We need to handle data that may not need to be hashed at all (e.g., `transient`).
- We have data that could have special semantic meaning to consider in terms of absence and how to map that absence to a value space that contains an empty set of data (e.g., `Unordered` and `Variable`).
This is a rather tall order to reconcile. The first step I would take is to employ a divide-and-conquer approach whereby we handle each field individually. We can define a small helper function that enables us to compose the hash value piecemeal from the successive field values:
Aside: This kind of divide-and-conquer approach is not strictly needed, but it allows us to zero in on the nuance of how each case is handled in isolation, thereby reducing noise. It provides a nice property: each field's value is treated as a black box and doesn't pollute `hashTheZoo` with extraneous complexity.
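A minimal sketch of such a helper (the name `compose` is mine):

```go
// compose folds the next field's hash into the accumulator using the
// prime multiply-and-add recipe from earlier.
func compose(acc, fieldHash uint64) uint64 {
	return 31*acc + fieldHash
}
```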
First let's look at a structure for how to hash `TheZoo`. We need to handle the case that `TheZoo` is `nil` per the original field definition above, so we end up with this opening definition:
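A sketch of the frontmatter:

```go
func hashTheZoo(z *TheZoo) uint64 {
	v := uint64(1)
	if z == nil {
		// Absence gets its own representation in the value space …
		return compose(v, 0)
	}
	// … distinct from presence.
	v = compose(v, 1)

	// Field-by-field handling follows below.

	return v
}
```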
We need to communicate that the value is present or absent, which is needed later.
First Field
Now we can look at the fields in order. For `ID` we can directly represent an `int16` in `uint64`'s value space.
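Continuing inside `hashTheZoo` (the `uint16` conversion is my choice to keep sign extension from smearing the high bits):

```go
// An int16's 16 bits embed directly in uint64's value space.
v = compose(v, uint64(uint16(z.ID)))
```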
Second Field
Like with the frontmatter of `hashTheZoo`, we need to handle data that is optional, which presents some challenges in that we need to:

- represent the value in its value space when it is present
- distinctly represent the condition when the value is absent

A common technique is to use separate paths for value generation, as you can see below:
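Continuing inside `hashTheZoo`:

```go
if z.Optional == nil {
	v = compose(v, 0) // absent
} else {
	v = compose(v, 1) // present, followed by the value itself
	v = compose(v, uint64(uint32(*z.Optional)))
}
```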
Third Field
We now move on to field `Unsized`, which uses a machine architecture-specific size. In its largest form, an `int`'s bits will fit into the space of a `uint64`.
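Continuing inside `hashTheZoo`:

```go
// int is 32 or 64 bits depending on the architecture; either fits.
v = compose(v, uint64(z.Unsized))
```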
Fourth Field
Handling `Unordered` requires considerable care:

- Traversing a map is non-deterministic.
- We need to encode map state (e.g., the count of items) when hashing to reduce the risk of collisions.

Aside: Whether map traversal should be deterministic seems to be a near-religious conversation among some developers. My personal approach in developing and maintaining systems is not to rely on fragile implicit coupling of components and their behavior, so I don't mind the non-determinism; I actually prefer it.
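A sketch that sorts the keys for a deterministic traversal (using `maps.Keys` and `slices.Sorted` from the standard library):

```go
// Encode the map's size first; a nil map hashes the same as an empty one.
v = compose(v, uint64(len(z.Unordered)))
for _, k := range slices.Sorted(maps.Keys(z.Unordered)) {
	v = compose(v, hashString(k))
	v = compose(v, hashString(z.Unordered[k]))
}
```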
I've taken a similar position with maps to the one the Google Go Style Guide takes on `nil` slices: a `nil` map is treated the same as `len(m) == 0`.
Aside: Java takes an interesting approach in `AbstractMap#hashCode`, whereby each key-value pair is hashed and summed, and the rolling sum is returned to the caller. The common `Map` implementation `HashMap` contains an internal type called `Node` that instead XORs the key's hash with the value's. This approach is commutative, meaning it does not care about the order of the `AbstractMap`'s elements. As a fun bit of trivia: since Java does not encode the length of the map in the hash code, it is not inconceivable (however improbable) for a collision to arise between a 0-element `AbstractMap` and an `AbstractMap` containing two elements whose hashes cancel each other out to zero, which, again, could happen for any number of items whose sum works out to be as such.
Fifth Field
There are multiple ways of representing boolean values like field `Small`. I took inspiration from the implementation found in Java's `Boolean#hashCode`:
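Continuing inside `hashTheZoo`:

```go
// After Java's Boolean#hashCode: one prime per state.
if z.Small {
	v = compose(v, 1231)
} else {
	v = compose(v, 1237)
}
```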
Both `1231` and `1237` are prime and are sufficiently large to add noise to a composite hash value. If you want a fun treat, decompose each of those numbers into their constituent digits and add them together; see if you notice anything!
Sixth Field
Handling of a repeated field like `Variable` (a slice) is interesting. It is not too dissimilar from `Unordered`. Here is Bloch again:
If the field is an array, treat it as if each significant element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine the values per step 2.b. If the array has no significant elements, use a constant, preferably not 0. If all elements are significant, use Arrays.hashCode.
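Applying that advice inside `hashTheZoo`:

```go
// As with Unordered: encode the length, then each element in order.
v = compose(v, uint64(len(z.Variable)))
for _, e := range z.Variable {
	v = compose(v, hashString(e))
}
```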
Seventh and Eighth Fields
We can ignore field `transient` (see Bloch's allusion to "significant" above), so we continue to field `Recursive`. `hashTheZoo`'s frontmatter already included handling for this case, so we merely need to call the function recursively.
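Closing out `hashTheZoo`:

```go
// transient is skipped entirely; Recursive reuses hashTheZoo's own
// nil handling.
v = compose(v, hashTheZoo(z.Recursive))
```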
Wrapping Up
As you can see, doing all of this by hand involves a fair bit of work; it is toilsome and error-prone.
Many of the considerations we see here are not too dissimilar from what those who implement type-length-value (TLV) serialization protocols face. TLV is a common encoding scheme where each piece of data is bundled with a tag identifying its type and a prefix specifying its length, ensuring that a stream of data can be parsed without ambiguity (see also: Protocol Buffers). The approach I took above was rather Java-centric in terms of how a slew of the algorithms are implemented.
Making this More Idiomatic
It's the year 2025. Surely we can do better than this, right? Let's explore package `maphash` from the Go standard library. (The package itself arrived back in Go 1.14; the `maphash.Comparable` and `maphash.WriteComparable` APIs we'll lean on below arrived in Go 1.24.) This will enable us to approach this problem in a much more concise and Go-idiomatic way. Package `maphash` has some rather interesting caveats about its use, so I recommend you read its API documentation before using it (e.g., about not durably storing the hash values). I won't go into detail about that here; that's for you to read about.
Let's start by amending the structure of the hashing function for `TheZoo`. I've reframed `hashTheZoo` now as this:
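A sketch under the same field assumptions as before (the `writeHash` prefix mirrors the convention discussed next):

```go
import (
	"hash/maphash"
	"maps"
	"slices"
)

// writeHashTheZoo appends *TheZoo's hash input to a caller-supplied
// *maphash.Hash rather than producing a finished sum itself.
func writeHashTheZoo(h *maphash.Hash, z *TheZoo) {
	if z == nil {
		maphash.WriteComparable(h, 0) // absent
		return
	}
	maphash.WriteComparable(h, 1) // present
	maphash.WriteComparable(h, z.ID)
	if z.Optional == nil {
		maphash.WriteComparable(h, 0)
	} else {
		maphash.WriteComparable(h, 1)
		maphash.WriteComparable(h, *z.Optional)
	}
	maphash.WriteComparable(h, z.Unsized)
	maphash.WriteComparable(h, len(z.Unordered))
	for _, k := range slices.Sorted(maps.Keys(z.Unordered)) {
		maphash.WriteComparable(h, k)
		maphash.WriteComparable(h, z.Unordered[k])
	}
	maphash.WriteComparable(h, z.Small)
	maphash.WriteComparable(h, len(z.Variable))
	for _, e := range z.Variable {
		maphash.WriteComparable(h, e)
	}
	// transient is skipped; Recursive reuses the nil handling above.
	writeHashTheZoo(h, z.Recursive)
}
```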
The core thing to note is that this follows a structure with several conventions that package `maphash` sets. The name and signature tell users that they can use this function to add the hash of a `*TheZoo` value to any `*maphash.Hash` value, be that:

- a new hash with no state, or
- an existing hash (that might be hashing an outer type that has `*TheZoo` as a field).
And nothing stops you from offering a helper function like this in addition:
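For example:

```go
// hashTheZoo produces a finished sum from a seed in one call.
func hashTheZoo(seed maphash.Seed, z *TheZoo) uint64 {
	var h maphash.Hash
	h.SetSeed(seed)
	writeHashTheZoo(&h, z)
	return h.Sum64()
}
```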
Now, `writeHashTheZoo` and `hashTheZoo` could have been methods on `*TheZoo`, too. I elected to use free-standing functions to keep the domains more separated for demonstration value.
Deeper Dive into Modern Mechanics
One thing should be immediately evident to you: this implementation has to take into account far fewer things than the original Java-based approach did.
I want to explain some of that nuance here.
Important: `maphash.WriteComparable` internally performs TLV-like delimiting on the data it hashes, meaning that capturing information on string length, item count, and the like is not needed when you use this API directly. This stands in contrast to the other APIs in package `maphash`.
Try running this code snippet to see what I mean:
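A sketch contrasting the plain byte-stream writes with `maphash.WriteComparable`:

```go
package main

import (
	"fmt"
	"hash/maphash"
)

func main() {
	seed := maphash.MakeSeed()

	// WriteString appends raw bytes: both hashes see the identical
	// stream "method", so they are guaranteed equal.
	var h1, h2 maphash.Hash
	h1.SetSeed(seed)
	h2.SetSeed(seed)
	h1.WriteString("method")
	h1.WriteString("")
	h2.WriteString("metho")
	h2.WriteString("d")
	fmt.Println(h1.Sum64() == h2.Sum64()) // true

	// WriteComparable frames each value, so the two sequences differ.
	var h3, h4 maphash.Hash
	h3.SetSeed(seed)
	h4.SetSeed(seed)
	maphash.WriteComparable(&h3, "method")
	maphash.WriteComparable(&h3, "")
	maphash.WriteComparable(&h4, "metho")
	maphash.WriteComparable(&h4, "d")
	fmt.Println(h3.Sum64() == h4.Sum64()) // false (with overwhelming probability)
}
```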
(Run this)
A takeaway for me is that, when hashing discrete items in their totality, I should lean on `maphash.WriteComparable` insofar as possible.
A second observation is that I encoded the lengths of fields `Unordered` and `Variable` in both the manual hashing and the `maphash`-based hashing. This might not have been strictly needed here, but note that it helps produce unambiguously distinguishable results in the typical case. To see why, let's suppose we had a struct like this that we wanted to hash:
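A sketch (the field types are assumptions; the name `DynamicComposite` is referenced again later):

```go
type DynamicComposite struct {
	A, B, C, D, E, F []string
}
```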
If we did not include the lengths of fields `A`, `B`, `C`, `D`, `E`, and `F` and instead only encoded the elements when they are present, we'd automatically get collisions for values like these:
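For instance:

```go
// Without lengths (or other per-field framing), both values contribute
// exactly one element "x" to otherwise identical hash input.
a := DynamicComposite{A: []string{"x"}}
b := DynamicComposite{F: []string{"x"}}
// hash(a) == hash(b), even though a != b.
```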
Similar reasoning applies to the pointer presence used with `Optional` and `Recursive`. Consecutive repeated fields suffer the same risk:
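For instance:

```go
// Without lengths, adjacent slices blur together: both values flatten
// to the element stream "x", "y".
c := DynamicComposite{A: []string{"x", "y"}}
d := DynamicComposite{A: []string{"x"}, B: []string{"y"}}
// hash(c) == hash(d), even though c != d.
```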
Revisiting Labels with a Modern, Idiomatic API
So, back to the original `hashLabel` assignment: let's reframe it with package `maphash` using what we've learned.
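A sketch (the function names are mine, following the `writeHash` convention from above):

```go
// writeHashLabel appends one label. WriteComparable frames each string,
// so no manual "=" delimiter is needed to prevent concatenation collisions.
func writeHashLabel(h *maphash.Hash, l Label) {
	maphash.WriteComparable(h, l.Name)
	maphash.WriteComparable(h, l.Value)
}

// writeHashLabels encodes the length up front and requires labels to be
// sorted by name (the sorting invariant) so that permutations of the
// same label set hash identically.
func writeHashLabels(h *maphash.Hash, labels []Label) {
	maphash.WriteComparable(h, len(labels))
	for _, l := range labels {
		writeHashLabel(h, l)
	}
}
```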
And, now, you can see how we arrived at such a clear, concise, and deliberate outcome.
An Even Better Approach
As a nice followup, Klaus Post kindly pointed out a better approach that has some interesting properties in that it:

- waives the need for the sorting invariant above,
- has commutative properties (in part), which has no negative impact on the computed outcome if the clients of this API use it correctly, and
- parallels how Java's `HashMap#hashCode` is implemented under the hood.
A naive implementation would look like this:
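A sketch of that shape, consistent with the analogy and the `hashMetric` calls below:

```go
// hashLabels hashes each label independently and sums the per-label
// hashes; because addition commutes, label order cannot matter.
func hashLabels(seed maphash.Seed, labels []Label) uint64 {
	var sum uint64
	for _, l := range labels {
		var h maphash.Hash
		h.SetSeed(seed)
		maphash.WriteComparable(&h, l.Name)
		maphash.WriteComparable(&h, l.Value)
		sum += h.Sum64()
	}
	return sum
}

// hashMetric folds the order-independent label sum into a final hash.
func hashMetric(seed maphash.Seed, m Metric) uint64 {
	var h maphash.Hash
	h.SetSeed(seed)
	maphash.WriteComparable(&h, hashLabels(seed, m.Labels))
	return h.Sum64()
}
```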
Above, `hashLabels` is analogous to `HashMap.Node#hashCode`, and `hashMetric` to `AbstractMap#hashCode`.
For concision, I inverted the definition to accept a `maphash.Seed` instead of a `*maphash.Hash`. This helps avoid the extraneous noise that would arise from calling `(*maphash.Hash).Clone`, handling an error that should never happen, computing constituent sums, and so on and so forth.
Caution: `hashMetric` above does not encode the length of `m.Labels` the way the implementation in the previous section does. That is not a problem on its own, but using a pattern like this in `*TheZoo` or `DynamicComposite` means you would still need to differentiate between the constituent fields by hashing something into the overall container type's representation.
```go
type MetricsAndMetricsAndMoreMetrics struct{ A, B, C Metric }
```
If you want to hash `MetricsAndMetricsAndMoreMetrics`, you should be able to differentiate between an equivalent value of `Metric` being in field `A`, `B`, or `C`. Observe how `hashMetric`'s output is itself included in the outer value for `MetricsAndMetricsAndMoreMetrics`'s hash.
```go
func (m *MetricsAndMetricsAndMoreMetrics) Hash(seed maphash.Seed) uint64 {
	var h maphash.Hash
	h.SetSeed(seed)
	if m == nil {
		maphash.WriteComparable(&h, 0)
		return h.Sum64()
	}
	maphash.WriteComparable(&h, 1)
	maphash.WriteComparable(&h, hashMetric(seed, m.A))
	maphash.WriteComparable(&h, hashMetric(seed, m.B))
	maphash.WriteComparable(&h, hashMetric(seed, m.C))
	return h.Sum64()
}
```
Thank you, Klaus! :-)
Closing
You might wonder why I started this article by demonstrating hashing by hand. I figure it is a useful exploration to characterize how we got to today and the type of care that had to be taken. It provides a wonderful practical introduction and the context needed to appreciate why a facility like package `maphash` is so nice to have.
Now, I can imagine some readers taking umbrage that I’ve saved a crucial point for the end: in the vast majority of situations, creating a manual hashing mechanism is unnecessary. The Go runtime provides a built-in, highly efficient facility for hashing any comparable type. It automatically leverages architecture-specific capabilities without you having to. A good way of understanding this is to peel back the curtain of how the runtime implements maps (namely with a necessary detour into the compiler):
`genhash` is a rather interesting read, particularly the section on generating hashes for custom composite types using `hashFunc`. I haven't dug into its guts too deeply, but it appears to recursively generate instructions that get added to an intermediate representation (IR) store and are eventually returned as a closure. Much of this uses `runtime·memhash`, which will use hardware hashing support like AES if it is available.
From this, the compiler assigns the hashing algorithm to `abi.SwissMapType`'s `Hasher` field, which the runtime uses.
And the fun part about this: package `maphash` is largely implemented in terms of the compiler+runtime mechanisms above! You'll find that both use `runtime·memhash`.
In any case, it is better to understand how this is implemented than not. May you learn your life lessons less painfully than I did, and certainly with less embarrassment.