(comments)

Original link: https://news.ycombinator.com/item?id=40988837

In 1978, Brian Kernighan, author of the celebrated The C Programming Language, supposedly regarded hash tables as "a bit exotic." That claim is contradicted by the historical record, which shows hash tables were in widespread use and under active study throughout the 1960s and 1970s. Hash tables, a standard kind of associative retrieval structure, were a routine feature of compilers, assemblers, linkers, databases, and file systems. They appear in Knuth's highly regarded The Art of Computer Programming as early as 1973, and throughout the 1970s Communications of the ACM published a steady stream of articles on hashing and related topics. Kernighan himself published regularly in those venues, suggesting he was thoroughly familiar with the technique. His remark appears instead to reflect his focus on making programming languages simpler and easier for beginners, advocating hash tables as a built-in language or library facility rather than something users implement by hand. By 1979, when Unix Version 7 shipped with AWK, the scripting language Kernighan co-authored, hash tables were not considered esoteric but a routine programming tool; today they are a ubiquitous component of modern languages such as Python, C++, and Java. Notably, Robert Morris's influential survey of hashing appeared in Communications of the ACM in 1968, CACM citations of hash tables go back to 1959, and the concept predates both milestones, originating inside IBM in 1953 and entering the open literature via Dumey in 1956. Given that hash tables were thus more than twenty years old when The C Programming Language appeared in 1978, describing them as novel is implausible.

Related articles

Original article

the picture you're painting here is absurdly far from reality

— kernighan definitely did not think hash tables were 'a bit exotic' in 01978 —

hash tables (or sometimes other associative retrieval structures) were standard fare in computer science degree programs and were a standard feature of compilers, assemblers, and linkers, of which kernighan had written several at that point; associative retrieval structures are also fundamental to databases and filesystems. in 01979, 7th edition unix (in which kernighan published awk) barely had date formatting but included an on-disk hash table library called 'dbm'

hash tables are the exclusive topic of pp. 506–549 of knuth's taocp volume 3, a standard textbook which came out in 01973. after recommending you some hash functions, on p. 512 (525/739) knuth recommends you check out a survey of hash functions published in cacm by lum, yuen, and dodd in 01971, and he cites papers on hashing in cacm as early as 01959 and in an ibm journal as early as 01957. in the 'history' section (pp. 540–542 (553–555/739)) he traces it back to an internal ibm report by luhn in 01953, and its first report in the open literature to a paper by dumey in 01956. most of the rest of the 739-page book is about how to implement associative arrays of various kinds, although some parts instead concern the string search problem, and the first half is about sorting, and a sorted table can be used for purposes other than associative retrieval

knuth also mentions a 'very influential survey' of hashing published in cacm in 01968 by robert morris, kernighan's coworker at bell labs, and one of the very first users of unix and a major contributor to it

there are several papers about hashing every year in cacm throughout the 01970s, as well as alternative associative-array structures such as tries, sorted arrays and binary search trees. in march 01975 we have https://cacm.acm.org/research/the-algorithm-sequential-acces..., in april 01975 we have https://cacm.acm.org/research/the-quadratic-hash-method-when..., in may 01975 we have https://cacm.acm.org/research/analysis-and-performance-of-in..., in july 01975 we have https://cacm.acm.org/research/a-note-on-hash-linking/, in september 01975 we have https://cacm.acm.org/research/multidimensional-binary-search..., in october 01975 we have https://cacm.acm.org/research/optimizing-the-performance-of-... and https://cacm.acm.org/research/merging-with-parallel-processo..., in january 01976 we have https://cacm.acm.org/research/performance-of-height-balanced..., in february 01976 we have https://cacm.acm.org/research/on-self-organizing-sequential-... and https://cacm.acm.org/research/a-stochastic-evaluation-model-..., in june 01976 we have https://cacm.acm.org/research/a-practitioners-guide-to-addre... (which is about hashing despite the name), in july 01976 we have https://cacm.acm.org/research/compressed-tries/, and in august 01976 we have https://cacm.acm.org/research/an-insertion-technique-for-one.... you could not read cacm in the 01960s and 01970s without frequently coming across hashing, as well as other associative-array structures.

but perhaps kernighan hadn't read cacm? on the contrary, he and his colleagues regularly published in it; he published in cacm once in 01972, once in 01973, and once in 01975, and his colleagues such as stephen johnson, mike lesk, al aho (his coauthor on awk), and, as i said above, robert morris, published there regularly as well

in the practice of programming, kernighan and pike claim that almost all of the time, the only data structures you need are arrays, linked lists, trees, and hash tables (though they wrote this in 01999)

what kernighan is talking about here is including them as a basic language construct, a design choice his 01978 tech report on awk likens to that of snobol4, which was a well-known language but not a mainstream one. in the context of the above cacm articles, you can perhaps see that the daring choice was to offer a single 'good enough' hash table implementation rather than one optimized for a given purpose—and to use it even in place of conventional arrays

— kernighan was not isolated like you were —

kernighan was at bell labs, which was probably the most connected place in the software world at the time

when awk was released in 01979 the internet was already ten years old, and at&t (the parent company of bell labs) had installed all its long-distance links

in 01976 they'd also developed their own 'internet' called uucp, which at the time of awk's release was interconnecting 80 different unix sites, interchanging software updates on a daily basis, over 300-baud and 1200-baud modems. they were also in contact with ken thompson's alma mater berkeley; thompson took a sabbatical year to teach there in 01975. berkeley got a darpa contract to add tcp/ip support to the bell labs operating system 'unix', which would become the majority of the internet during the 01980s; berkeley participated actively in the uucp network

bell labs was also working on the so-called 'advanced communication system', its own misconceived version of the internet, where data would be routed using phone numbers, because data communications had become the fastest-growing source of call traffic. (colin berkshire has written a fantastic overview of this forgotten history.)

in the interview, kernighan said, 'The main idea in Awk was associative arrays, which were newish at the time, but which now show up in most languages either as library functions (hashmaps in Java or C++) or directly in the language (dictionaries in Perl and Python). Associative arrays are a very powerful construct, and can be used to simulate lots of other data structures.'

zambyte expressed surprise that kernighan referred to 'associative arrays' as 'newish' and said that lisp had had them 20 years before. what lisp had 20 years before (presumably they meant in https://www-formal.stanford.edu/jmc/recursive.pdf in 01959, not 01957, when mccarthy was trying to figure out how to add linked lists to fortran) was this line of code:

    assoc[x; y] = eq[caar[y]; x] → cadar[y]; T → assoc[x; cdr[y]]
which we can translate to python as
    assoc = lambda x, y: y[0][1] if y[0][0] == x else assoc(x, y[1:])
and which allows you to use a list of 2-tuples such as [('red', '#f00'), ('green', '#0f0'), ('white', '#fff')] as an associative array
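a quick check of that translation, just exercising it on the example list above (note that, like the m-expression version, it makes no provision for a missing key and will simply blow up):

    >>> assoc = lambda x, y: y[0][1] if y[0][0] == x else assoc(x, y[1:])
    >>> assoc('green', [('red', '#f00'), ('green', '#0f0'), ('white', '#fff')])
    '#0f0'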

you replied, saying, 'it was (...) easy to be (...) knowledgeable and yet never have been exposed (...) in the 1990s (...) I can tell you a lot of the C world would still consider associative data structures like a hash table something to be a bit exotic (...). It was very easy to not be very up to date on these things.'

that comment is only relevant to zambyte's comment—to which you posted it as a reply—if you had the implicit assumption that what kernighan meant by 'associative arrays' was 'associative data structures like a hash table'. but associative data structures like hash tables were not newish; hash tables in particular were 22 years old in 01979, and were already the most commonplace of all data structures except for strings and arrays. i listed 13 articles in cacm in 01975 and 01976 about associative data structures. cacm only published about 6 articles per issue (excluding the news and opinion departments), so that's roughly 10% of everything they published

you say 'you can't prove the normal world of computing was all aware of all these things', but the everyday world of computing was, if anything, even more preoccupied with associative data structures than the academic world. ibm's most lucrative product was ims, an online database, which is to say, a large on-disk associative data structure. filesystems (including dectapes), variable namespaces in interpreters and compilers, symbol tables (in assemblers, linkers, and compilers), and databases are associative arrays. even in 01979 it was already hard to find programmers who did not use any of filesystems, assemblers, interpreters, or databases (though they did exist)

so it is completely implausible that what kernighan was describing as 'newish' (in 01979) was associative data structures like a hash table. i'm not doubting you when you say you knew c programmers in the 90s who considered them 'exotic', perhaps the province of the wizards who wrote compilers and filesystems, but those c programmers were not 'the normal world of computing' or 'experienced and knowledgeable'; they were beginners

the 01978 edition of k&r presents binary search (using a sorted array as an efficient associative array) in chapter 3 on p. 54 (62/236), before presenting such advanced language features as the 'break' statement, '#define', structs, non-integer function return values, and pointers. in chapter 6 on p. 125 (133/235), they use such an associative array to count the number of times each c keyword occurs in the input. §6.5 (pp. 130–134 (138–142/235)) demonstrates a more scalable associative array using a binary search tree. §6.6 (pp. 134–136 (142–144/235)) gives another program using a hash table
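to make the 'sorted array as associative array' idea concrete, here is roughly the shape of that keyword-counting example transposed into python (the book's original is of course in c, and the keyword list here is abbreviated):

    import bisect

    keywords = ['break', 'case', 'char', 'continue', 'default', 'do']  # sorted, as in k&r; abbreviated here
    counts = [0] * len(keywords)

    def count_keyword(word):
        # binary search over the sorted array, standing in for an associative array
        i = bisect.bisect_left(keywords, word)
        if i < len(keywords) and keywords[i] == word:
            counts[i] += 1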

so for k&r in 01978, hash tables were more basic than features such as unions, typedef, string formatting (%.10s, etc.), opening files, and listing directories, all of which are relegated to later in the book

please don't tell me 'a lot of the c world' in the 01990s hadn't heard of k&r. next you're going to be telling me a lot of the c world hadn't heard of c

getting back to the interview, what kernighan was evidently referring to was not associative array data structures in general, but having a general-purpose associative data structure library always at hand. lisp's assoc (or assq) isn't general-purpose because it's too slow for large associative data structures. lisp plists aren't general-purpose because they leak memory (you can't remove a property from all atoms, and each atom you put a property on continues to occupy memory even if you remove the property from it) and because they get too slow if you have more than a few properties. ims isn't general-purpose because you have to go talk to the database administrator to create a new file; you can't create a temporary one in ram. the example hash table in k&r isn't general-purpose because it uses a static array, a fixed size of 100 hash buckets, has no deletion, associates specifically a string value with each (string) key, and is quite inefficient; it's a reasonable starting point, but you have to tweak it for your application
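to make those limitations concrete, here is a rough python rendering of the shape of the k&r example (the bucket count and the string-key, string-value restriction come from the book; the structure and the names here are mine, not a transcription of its c):

    HASHSIZE = 100                              # fixed table size, as in the book's example
    hashtab = [[] for _ in range(HASHSIZE)]     # static table, chaining within each bucket

    def lookup(s):
        for k, v in hashtab[hash(s) % HASHSIZE]:
            if k == s:
                return v                        # a string value for a string key, nothing else
        return None

    def install(name, defn):
        bucket = hashtab[hash(name) % HASHSIZE]
        for i, (k, _) in enumerate(bucket):
            if k == name:
                bucket[i] = (name, defn)        # replace an existing definition
                return
        bucket.append((name, defn))             # note: no delete, no resizing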

java hashmaps, awk arrays, and python dicts are general-purpose as long as your data fits in ram. (you could gripe that awk arrays aren't first-class objects and can only have string and numeric keys, but that's just because awk doesn't have first-class objects at all in the lisp sense.)

precisely as a result of this, hashing (like sorting) has ceased to be a beginner-programmer topic; if you need an associative array or a sorted list, you can get quite far just by invoking the built-in facilities of your programming language. sooner or later you will probably want one optimized for some specific application, but it's no longer something you have to do all the time. but to any working programmer in the 01970s, it would have seemed absurd to propose that linear searches or hash tables were 'newish'

kernighan said in the interview

> The main idea in Awk was associative arrays, which were newish at the time, but which now show up in most languages either as library functions (hashmaps in Java or C++) or directly in the language (dictionaries in Perl and Python). Associative arrays are a very powerful construct, and can be used to simulate lots of other data structures.

awk was released in 01979, and the authors published this paper in sp&e that year: https://plan9.io/sources/contrib/steve/other-docs/awk.pdf but as you can see, this report version is dated september 01978. but i don't think the report was widely circulated until the next year, when it was included in 7th edition unix as /usr/doc/awk (sudo apt install groff; groff -ms -Tutf8 v7/usr/doc/awk | less -r). it explains:

> Array elements may be named by non-numeric values, which gives awk a capability rather like the associative memory of Snobol tables. (...) There is an alternate form of the for statement which is suited for accessing the elements of an associative array:

this pdf has evidently been retypeset from the troff sources from the open-source 7th edition release, but without the correct bibliographic database, so the references are missing. a comment in the troff source says:

> ....It supersedes TM-77-1271-5, dated September 8, 1977.

but possibly that reference is inaccurate

python goes beyond merely having dicts 'directly in the language'. python's primary data structure is the dict; among other things, it uses dicts for modules, (most) class instances, associating methods with classes, the locals() user interface to the local-variable namespace, and passing keyword arguments to functions and methods. that is, it uses associative arrays to simulate lots of other data structures, as you are obliged to do in awk, lua, and js. so where did python get dicts?
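a few of those uses are easy to poke at from the repl (a small illustrative sketch; the throwaway names m and f are mine):

    >>> import types
    >>> m = types.ModuleType('demo')
    >>> m.x = 1
    >>> m.__dict__['x']                 # a module's namespace is a dict
    1
    >>> def f(**kw): return kw          # keyword arguments arrive as a dict
    ...
    >>> f(red='#f00', green='#0f0')
    {'red': '#f00', 'green': '#0f0'}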

python got dicts (and tuples) from abc, a teaching language which wikipedia claims was started in 01987, 8 years after awk's release, and added conventional arrays (lists) back in. the five data types in abc https://homepages.cwi.nl/~steven/abc/language.html are numbers, strings, compounds (called tuples in ml), lists (really multisets because they're implicitly sorted), and tables (dictionaries), which are awk's 'associative arrays'—and, as in awk, js, lua, and tcl, they're used to provide the functionality of conventional arrays as well

however, lambert meertens credits the use of tables in abc to jack schwartz's setl https://inference-review.com/article/the-origins-of-python rather than to awk. he says of the addition of tables to b (the early name for abc, not to be confused with the b that was an earlier version of c)

> Having coded a few algorithms in SETL, I had experienced its power firsthand—a power that stemmed entirely from its high-level inbuilt data types. Particularly powerful were sets and maps, also known as “associative arrays,” containing data that can be indexed not only by consecutive integers but by arbitrary values. A programmer could introduce a simple database of quotations named whosaid, in which the value ”Descartes” could be stored in the location whosaid[”cogito ergo sum”]. These high-level types made it possible to express algorithms that required many steps in B1 using just a few steps in SETL. In a clear violation of the Fair-Expectation Rule, B1 allowed only integers as array indices. This design decision had been driven by fear: we had been concerned that aiming too high would make our language unimplementable on the small personal computers that were starting to appear on the market. But Dewar, in particular, convinced me that this meant we were designing for the past, not the future. This led us to redesign the system of data types for our beginners’ language. This time we used only the criteria of ease of learning and ease of use to select candidate systems. The winner turned out to be remarkably similar to the data type system of SETL. The set of possible data type systems to choose from was very large, and to make the process more manageable I had written a program to select the competitive (Pareto-optimal) candidate systems. Interestingly, but quite incidentally, that selection program itself was written in SETL. The winning type system became that of B2, and remained unchanged in the final iteration, released in 1985 under the name “ABC.”

'associative arrays', of course, is the term used by awk

this story of adding associative arrays to abc only for b2 is somewhat complicated by the fact that the version of b (b1?) in meertens's 01981 'draft proposal for the b programming language' https://ir.cwi.nl/pub/16732 already includes tables, three years after the release of awk as part of 7th edition; p. 6 (9/91) says,

> Tables are somewhat like dictionaries. A short English-Dutch dictionary (not sufficient to maintain a conversation) might be (...) Table entries, like entries in a dictionary, consist of two parts. The first part is called the key , and the second part is called the associate. All keys must be the same type of value, and similarly for associates. A table may be written thus: {[’I’]: 1; [’V’]: 5; [’X’]: 10}.

> If this table has been put in a target roman, then roman[’X’] = 10.

note that this is also awk's syntax for indexing an associative array, though it doesn't have a syntax for writing one down.
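for comparison, the same little table as a python dict literal (awk, as noted, shares the indexing syntax but has no literal form, so there you would build it with three assignments):

    >>> roman = {'I': 1, 'V': 5, 'X': 10}
    >>> roman['X']
    10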

(to be continued; hn says, 'that comment included too many facts')

a more recent set of slides on the relation between abc and python is https://www.cwi.nl/documents/195216/Meertens-20191121.pdf which describes again how abc was started in 01975. this helpfully clarifies the timeline: b0 was 01975; b1 was 01978; b2 was 01979; and b∞ = abc was 01985. so specifically the point at which setl inspired the replacement of conventional arrays in b1 with associative arrays in b2 was 01979, which was the year 7th edition unix was released and the aho, weinberger, and kernighan paper was published in sp&e

a question of some interest to me here is what platform they were developing abc on in 01979. clearly it couldn't have been the ibm pc, which wouldn't come out until 01981 (and as far as i know abc on the ibm pc only runs under cygwin or 32-bit microsoft windows), or macos (which came out in 01984) or atari tos, which wouldn't come out until 01985. and so far i haven't seen any mention in the history of abc of other operating systems of the time like cp/m, vm/cms, dg rdos, tenex, or tops-20. the most likely platform would seem to have been unix, on which awk was one of the relatively few programming languages available. perhaps at some point i'll run across an answer to that question in the abc papers

python adopted awk's syntax for putting 10 into roman['x'], which was `put 10 in roman['x']` in abc, but `roman['x'] = 10` in awk and python. abc's syntax is uppercase, presumably case-insensitive, separates words with apostrophes, and departs widely from conventional infix syntax. python's syntax is case-sensitive, mostly lowercase, and conventionally infix, features that have become common through the influence of unix. python's control structures are for, while, and if/elif/else, as in algol and in abc, and indentation-sensitive as in abc, but uses a conventional ascii syntax rather than abc's scratch-like syntax-directed editor

abc was statically typed with a hindley-milner type system ('the type system is similar to that of lcf', p. 15 (18/91) of the draft proposal), while python is dynamically typed, like smalltalk, lisp, and awk

if meertens got his daring notion of storing everything in associative arrays from awk, he certainly doesn't mention it. instead he mentions setl a lot! the draft proposal doesn't cite awk but it also doesn't cite setl; it cites the algol-68 report, milner's lcf typing paper, a cleaveland and uzgalis paper about grammars, gehani, and three of his own papers, from 01976, 01978, and 01981. unfortunately i can't find any of those earlier meertens papers online

the wikipedia page about setl says

> SETL provides two basic aggregate data types: (unordered) sets, and tuples.[1][2][5] The elements of sets and tuples can be of any arbitrary type, including sets and tuples themselves, except the undefined value om[1] (sometimes capitalized: OM).[6] Maps are provided as sets of pairs (i.e., tuples of length 2) and can have arbitrary domain and range types.[1][5]

but it's citing papers about setl from 01985 there, well after awk had supposedly popularized the notion of associative arrays

however, in meertens's essay on python's history, he cites a 01975 paper on setl! https://www.softwarepreservation.org/projects/SETL/setl/doc/...

> Jacob T. Schwartz. ON PROGRAMMING: An Interim Report on the SETL Project. Part I: Generalities; Part II: The SETL Language and Examples of Its Use. Computer Science Department, Courant Institute of Mathematical Sciences, New York University, revised June 1975.

this discusses how setl represented data in memory starting on p. 57 (57/689). it used hash tables to represent sets, including sets of tuples, rather than the ill-advised balanced-tree approach used by abc. (python, like awk and setl, uses hash tables.) on pp. 62–63 (62–63/689) it explains:

> The hash code of a tuple is taken to be the hash code of its first component, for reasons that will become clear in the next section. The hash code of a set is the exclusive or of the hash codes of all its members. (...)

> — Tuples in Sets —

> Though expressible in terms of the membership test, "with", and "less" operations, functional evaluation plays so important a role in SETL algorithms that we treat it as a primitive.

> SETL makes three types of set-related functional evaluation operators available:

> - f(x)

> - f{x}

> - f[s]

> The most fundamental of these is f{x}, which invokes a search over f for all n-tuples that begin with x (n ≥ 2), and which yields as result the set of all tails of these n-tuples. More precisely, in SETL:

> f{x} = {if #y eq 2 then y(2) else tℓ y, y ∈ f | type y eq tupl and #y ge 2 and hd y eq x}

> The operation f(x) has a similar definition but includes a single valuedness check:

> f(x) = if #f{x} eq 1 then ∋f{x} else Ω

> The operation f[s] is adequately defined in terms of f{x}:

> f[s] = [+: x ∈ s] f{x}

i am fairly confident that the f{x} definition translates into current vernacular python as the set-comprehension {y[1] if len(y) == 2 else y[1:] for y in f if type(y) == tuple and len(y) >= 2 and y[0] == x}.
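by the same token, the single-valued f(x) form might be rendered something like this (using None for Ω, which is my choice, not setl's):

    >>> f_multi = lambda f, x: {y[1] if len(y) == 2 else y[1:] for y in f if type(y) == tuple and len(y) >= 2 and y[0] == x}
    >>> f_single = lambda f, x: next(iter(f_multi(f, x))) if len(f_multi(f, x)) == 1 else None
    >>> f_single({('x', 10), ('v', 5), ('i', 1)}, 'v')
    5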

so, it becomes clear that already in 01975 setl treated sets of tuples as maps, which is to say associative arrays, but it didn't use the 'associative array' terminology used by meertens in 01981, or for that matter 'maps'. to look up an element in the map, it didn't use the f[x] notation used by python, awk, and abc; instead it used f(x). further explanation on pp. 64–65 (64–65/689) clarifies that really it is more accurate to think of 'sets of tuples' as trees; each item in the tuple entails following an additional level of pointers to a further hash table

(in a number of other notational details, python and presumably abc follow setl: start: or start:end for array index ranges, + for string concatenation, * for string repetition, boolean operators spelled out as and, or, and not. but indexing maps is far from the only difference)

abc (including b as described in the 01981 report) also seems to lack the f{x} operation and its possibility of associating an arbitrary-sized set of values with each key. this is a nontrivial semantic divergence

so if abc got its idea of tables from setl, but used awk's terminology, notation, and semantics for them (and its own ill-conceived balanced-tree implementation, used by neither), and decided to adopt the table idea in the year when awk was released, probably on the platform that awk was released on, i think it's reasonable to assign some share of the credit for abc's tables to awk? even if not all of it

but if that's so, then why didn't meertens credit aho, weinberger, and kernighan? i don't know. maybe awk's loosey-goosey nature was repugnant to him. maybe weinberger is jewish and meertens is secretly anti-semitic. maybe meertens thought that awk's loosey-goosey nature would be repugnant to the dijkstra-honoring dutch computer science establishment. maybe aho insulted meertens's favorite beer one time when he visited the netherlands. or maybe he thought it would be unfair for aho, weinberger, and kernighan to steal the thunder of schwartz, who did after all precede them in developing a general-purpose language whose memory was based entirely on hash tables. from a certain point of view that would be like crediting carl sagan with the theory of relativity because he explained it on nova

it sounds a little bogus to me, but i can outline some ways to approach it

if you have a set of features, generating its powerset is fairly simple

    >>> powerset = lambda xs: [x0 + others for others in powerset(xs[1:]) for x0 in [[], [xs[0]]]] if xs else [[]]
    >>> powerset(['int', 'float', 'set', 'array'])
    [[], ['int'], ['float'], ['int', 'float'], ['set'], ['int', 'set'], ['float', 'set'], ['int', 'float', 'set'], ['array'], ['int', 'array'], ['float', 'array'], ['int', 'float', 'array'], ['set', 'array'], ['int', 'set', 'array'], ['float', 'set', 'array'], ['int', 'float', 'set', 'array']]
then you just need some way to calculate various merits of the different designs. one merit is simplicity, which you could maybe try to operationalize as something like this:
    >>> [(d, {'simplicity': 10 - len(d)}) for d in powerset(['int', 'set', 'array'])]
    [([], {'simplicity': 10}), (['int'], {'simplicity': 9}), (['set'], {'simplicity': 9}), (['int', 'set'], {'simplicity': 8}), (['array'], {'simplicity': 9}), (['int', 'array'], {'simplicity': 8}), (['set', 'array'], {'simplicity': 8}), (['int', 'set', 'array'], {'simplicity': 7})]
comparing two merit-maps such as {'simplicity': 8} and {'simplicity': 6} for pareto optimality is simple enough with some thought, especially if we assume the keys are the same:
    >>> some_way_better = lambda a, b: any(a[k] > b[k] for k in a)
    >>> defeats = lambda a, b: some_way_better(a, b) and not some_way_better(b, a)
    >>> defeats({'simplicity': 9}, {'simplicity': 8})
    True
    >>> defeats({'simplicity': 8}, {'simplicity': 9})
    False
    >>> defeats({'simplicity': 9, 'turing-complete': 0}, {'simplicity': 8, 'turing-complete': 1})
    False
    >>> defeats({'simplicity': 9, 'turing-complete': 1}, {'simplicity': 8, 'turing-complete': 1})
    True
then it's easy enough to calculate which of a set of candidate designs can be eliminated because some other design defeats them
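continuing the sketch (the surviving helper and its name are mine), the elimination step might look like:

    >>> surviving = lambda cands: [(d, m) for (d, m) in cands if not any(defeats(m2, m) for (_, m2) in cands)]
    >>> surviving([(d, {'simplicity': 10 - len(d)}) for d in powerset(['int', 'set', 'array'])])
    [([], {'simplicity': 10})]
with only the single simplicity merit, the empty design wins outright, which is exactly why you need several competing merits before the pareto frontier gets interesting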

the bogus part is how you automatically calculate the various merits of a hypothetical collection of language features
