
Original link: https://news.ycombinator.com/item?id=39606885


Hacker News new | past | comments | ask | show | jobs | submit login
The "missing" graph datatype already exists. It was invented in the '70s (tylerhou.com)
319 points by tylerhou 1 day ago | hide | past | favorite | 85 comments










Related:

The hunt for the missing data type - https://news.ycombinator.com/item?id=39592444



In cybersecurity there's a startlingly large amount of Datalog: CodeQL compiles down to Datalog and runs via Souffle, Joern expresses program analysis in Datalog, there are a lot of hand-rolled rules inside companies using Datalog to compute transitive reachability of resources given an ACL, Biscuits are capabilities using Datalog-style rules instead of JWT tokens.

I've used https://s-arash.github.io/ascent/ in a handful of Rust sideprojects, and it was very nice: you can treat it as a built-in graph datatype in the same way Hillel and the OP talk about, because it's a proc macro that can reuse all the rest of your Rust program datatypes and functions with your Datalog queries, instead of an entirely separate program you have to bolt on like if you want to integrate SWI-Prolog or something.



Also notable that Rego (the language in Open Policy Agent) [1] is a Datalog (or is heavily inspired by one)

It's used in various policy evaluators (unsurprisingly) around access management, but also by Gatekeeper [2], which allows security teams to define constraints around Kubernetes resources; similarly, conftest [3] can do the same for Terraform

For a lot of simple cases it's a fair bit more complicated (and unfamiliar) than tailored query languages, but really shines for matching over a complex graph of interlinking resources and then evaluating Boolean logic against the matches.

[1]: https://www.openpolicyagent.org/docs/latest/policy-language/....

[2]: https://open-policy-agent.github.io/gatekeeper/website/

[3]: https://www.conftest.dev/



Are there people working in cybersecurity who actually use those code querying tools in their daily work? I've heard of some showcase projects that end up in blog posts and such, but everyone I've spoken to who's tried CodeQL or Joern or similar says that by the time you've figured out your query with all its edge cases, you might as well have just looked at the code and found any vulns quicker that way. And probably with a better understanding of the program too.


I imagine that lots of “vulnerability detection software/SaaS” is built on that kind of tools plus queries developed for specific vulnerabilities.

OTOH, my experience with these systems has been that they are popular with enterprise ISOs, but have very high inaccuracy (lots of false positives, and lots of misses on the kind of vulns they purport to detect), and while they are marginally useful, they seem to do more for security checklist compliance than for security.



Agree on inaccuracy, but I don't think that this reduces its value as much as you posit.

Ultimately other forms of static analysis won't guarantee my code is good, but I still use linters.

Similarly, tests (unit, integration, etc..) won't prove that nothing can go wrong - but I'm not going to stop writing them.

I'd love to be using a "perfect" language which could prove my program correct at compile time, and would presumably also make static analysis borderline magical for these use cases - but until that's an option I think there's a place for all these tools. (Beyond simply ticking compliance boxes)

With all that said, false negatives are indeed a hard problem - and one not helped by large orgs having painful bureaucracy around false _positives_.

Some of this appears to be the fault of tooling (need better filtration, deferral, weighting) but much of it seems a side effect of institutional silos rather than a lack of perfect analyses.

TLDR; pobody's nerfect, but more info generally better than less



Most of the time you shouldn't worry too much about all the edge cases. I mostly use it for a general query for variant analysis of some fixed bug, which will have false positives but also give you a list of actual code positions you can triage and review. You still are looking at the code, like in your case; you're just doing it faster, and with more confidence that you're looking at all the relevant code, instead of having to grep and pray there isn't a construct that had an extra space somewhere. It's also useful IMO as basically just a search engine for constructs you're interested in when approaching new codebases: "show me all classes that inherit from X and have a pointer member of type Y" or whatever is really easy to query, and something that comes up surprisingly often.


For someone not familiar with this topic, maybe you can give a one- or two-sentence description of some use case which you use this for?


It's well understood, graphs can be conveniently represented as matrices/tables/relations, and they are equivalent to edge lists.

It might be interesting to discuss to what extent did "graph databases" (you know who you are!) get a foothold because relational database platforms were slow to develop convenient notations and algebras (libraries) for working with abstract graphs. As Hou points out, there is some justifiable skepticism about the argument that graph databases are somehow intrinsically "more efficient" than relational databases for working with graphs. This would be surprising, given the obsession with optimization and performance that dominated the database community for many years, while issues like usability were a bit neglected (leaving the door open for other communities to innovate graph databases and data visualization platforms.) (Another point, don't I sometimes need both relational and graph algebras?)

Because I'm looking mainly for expressive convenience (with good in-memory runtime performance), it's not enough to know that Datalog can represent any abstract graph. If I find textbook pseudocode for, say, maximum matching in graphs, or transitive closure or connected components, how hard will it be to program in the target graph programming system? I'm confident that Datalog, Recursive SQL, and Cypher or Gremlin can all get the job done, but at what expressive cost (assuming my algorithm is not already a graph language primitive)? Will anyone still even recognize the algorithm?



> It's well understood, graphs can be conveniently represented as matrices/tables/relations, and they are equivalent to edge lists.

Maybe I'm missing what you're saying here, but matrices are not "equivalent" to edge lists, they are abstract mathematical objects decoupled from whatever storage method you use in a computer. For example, SuiteSparse:GraphBLAS has four storage formats: dense, bitmap, sparse, and hypersparse. None of these are equivalent to edge lists.

Edge lists are a way to store a representation of a graph, but they are not a graph. Like a matrix, a graph is an abstract mathematical concept that has different methods of representing itself in computers. But mathematically, graphs and matrices are isomorphic: every graph is a matrix, and every matrix is a graph. The isomorphism extends to operations as well, every BFS is a matrix multiplication, and vice versa.



GP says "graphs can be conveniently represented as matrices", which I believe is totally correct.

The GP's claim that graphs (not matrices) are equivalent to edge lists is almost correct. The textbook definition of a graph is G = (V, E), where V is the set of vertices and E is the set of edges. Implicitly E contains relationships between some of the vertices, so the only things in V that aren't part of an edge in E are vertices not incident to any edge. So if you have a connected graph, you can unambiguously define it by way of a set of edges (~= "edge lists") without explicitly mentioning the vertices.

In short, I don't see any fundamental error in GP's quoted sentence unless you're trying to be extra pedantic.
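The "edge set determines the graph" point is easy to check concretely; here's a tiny Python sketch (the graph data is made up for illustration):

```python
# For a graph with no isolated vertices, the edge set alone
# determines the vertex set: V is the union of all edge endpoints.
edges = {("a", "b"), ("b", "c"), ("c", "d")}

vertices = {v for edge in edges for v in edge}

print(sorted(vertices))  # ['a', 'b', 'c', 'd']
```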



I believe what they are saying is that any graph can be represented by either a matrix or edge list. So they are equivalently capable representations of graphs. The "they" refers to matrices, not graphs, when they said "and they are equivalent to edge lists". That's how I read it anyways.


> The isomorphism extends to operations as well, every BFS is a matrix multiplication, and vice versa.

Can you expand on that?



When you consider that a graph and a matrix are isomorphic, doing vector matrix multiplication takes a vector with a set value, say row 4, and multiplies it by a matrix where row 4 has values present that represent edges to the nodes that are adjacent to it (ie "adjacency" matrix). The result is a vector with the next "step" in a BFS across the graph, do that in a loop and you step across the whole graph.

A cool result of this is, for example, taking an adjacency matrix and squaring it is the "Friend of a Friend" graph. It takes every node/row and multiplies it by itself, returning a matrix that links each node to the adjacencies of its adjacencies, ie, the friends (adjacencies of the adjacencies) of friends (adjacencies) of the nodes.

Deeper traversals are just higher powers: a matrix cubed gives the friends of the friends of the friends. You literally say `A @ A @ A` in Python, and you're done, no dictionaries or lists or loops or anything else.
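For readers without numpy at hand, the same idea can be sketched in plain Python over the (or, and) semiring; `bool_matmul` below is a hypothetical stand-in for `@`, and the path graph is invented for illustration:

```python
def bool_matmul(A, B):
    """Multiply two square boolean matrices over the (or, and) semiring."""
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]

# Path graph 0 -> 1 -> 2 -> 3 as an adjacency matrix.
A = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]

A2 = bool_matmul(A, A)    # "friend of a friend": nodes two steps away
A3 = bool_matmul(A2, A)   # nodes three steps away

print(A2[0])  # [False, False, True, False]  -> node 0 reaches 2 in two steps
print(A3[0])  # [False, False, False, True]  -> node 0 reaches 3 in three steps
```

Each row of the result is exactly one "step" of a BFS frontier, which is the BFS-is-matrix-multiplication correspondence described above.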

A picture is worth a thousand words, see figure 7 of this paper:

https://arxiv.org/pdf/1606.05790.pdf

Also check out figure 8, which shows how incidence matrices can work to represent hyper- and multigraphs. A pair of incidence matrices represents two graphs, one from nodes to edges and the other from edges to nodes; these are n by m and m by n. When you multiply them, you get a square adjacency matrix that "projects" the incidence into an adjacency. This can be used to collapse hypergraphs into simple graphs, using different semirings to combine the multiple edges.
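A toy version of that projection in plain Python (the incidence data below is invented for illustration; a real GraphBLAS would use sparse matrices and configurable semirings):

```python
def matmul(A, B):
    """Plain integer matrix multiply; entries count how many edges connect each pair."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

# 3 nodes, 2 hyperedges.
# Node-to-edge incidence (3 x 2): which edges leave each node.
Out = [[1, 0],
       [0, 1],
       [0, 0]]
# Edge-to-node incidence (2 x 3): which nodes each edge enters.
In = [[0, 1, 1],   # e0 enters nodes 1 and 2 (a hyperedge)
      [0, 0, 1]]   # e1 enters node 2

Adj = matmul(Out, In)  # square adjacency "projection" of the hypergraph
print(Adj)  # [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
```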

For some pretty pictures of this kind of stuff, check out CoinBLAS (note I am not a crypto-bro, it was just a very handy extremely large multi-graph that I could easily download in chunks to play with):

https://github.com/Graphegon/CoinBLAS/



Earlier versions of SQL did not support recursive queries (though there were vendor-specific extensions) so graph databases had a very real edge there. In fact, recent SQL versions have added "property-based" syntactic sugar for such queries to further improve ease of use.


graph databases win on actual data layouts in disk blocks. trying to make relational databases ‘understand’ graphs really happens only at the query language layer. the data on disk remains optimized for relational queries. i’m not aware of any relational database shipping a storage engine optimized for graphs. if you know any, please share. thanks :)


There's more potential applicability/overlap for columnar relational engines (vs. row stores) - this 2023 paper offers some useful background: https://arxiv.org/pdf/2308.08702.pdf


On the other hand, nearly everybody’s data fits in RAM nowadays, and NVMe is so fast the disk layout constraints have changed a lot. So maybe now’s a good time to rethink anyway.


But only if the data is valuable enough to cover the (much) higher costs that come with those assumptions.




>> The answers, of course, are yes. We already have a declarative programming language where expressing graph algorithms is extremely natural—Datalog, whose semantics are based on the relational algebra, which was developed in the 1970s.

If that's true - Datalog is based on relational algebra - then I'd be very interested to see the author's reference, because the way I know it, Datalog is a subset of Prolog, without functions so that termination can be guaranteed, and without negation-as-failure so that it's monotonic, depending on the variant, and in any case, incomplete (because of termination).

For example, see:

What you always wanted to know about Datalog (and never dared to ask)

https://ieeexplore.ieee.org/document/43410

Which begins with:

Datalog, a database query language based on the logic programming paradigm, is described.

So the right abstraction to think about Datalog is logic programming, and the First Order Predicate Calculus, not the relational calculus. It's true that Datalog is used as a database query language, unlike Prolog that is a general purpose language, but that's because of the incompleteness of Datalog, which is something you probably want for a db query language, but certainly don't want for a general purpose programming language.



Based on is a bit loose -- it is indeed more precise to say that Datalog programs are often evaluated by database systems by translating rules into the relational algebra and running those queries until fixpoint. I can add that as an aside.

Datalog /happens/ to be a subset of Prolog syntactically, but their semantics are very different. An analogy is how LL parsers and LR parsers can both parse context free grammars, but their properties are different -- LL parsers parse from "top down" and thus will not halt on left-recursive grammars (just like Prolog) while LR parsers parse from "bottom up" and can support left-recursive grammars, at the cost of increased space.

Note that the demand transformation / magic set optimization, which is a common optimization, closes the gap between Datalog and Prolog semantically. In particular, it gives Datalog the best of both worlds: (1) increased speed, because it is not computing all possible facts, just the ones "reachable" from the query (like Prolog) and also (2) termination guarantee because all programs in the base Datalog language terminate.



First, thanks for the article, and the reply. I'm always happy to see logic programming-y material on HN (even if it is not labelled as such).

>> Datalog /happens/ to be a subset of Prolog syntactically, but their semantics are very different. An analogy is how LL parsers and LR parsers can both parse context free grammars, but their properties are different -- LL parsers parse from "top down" and thus will not halt on left-recursive grammars (just like Prolog) while LR parsers parse from "bottom up" and can support left-recursive grammars, at the cost of increased space.

You have to be careful when you're talking about "semantics" because there's a difference between the semantics of a language and the semantics of its execution. Like you say, Datalog is normally evaluated bottom-up, with what we call a TP Operator. If a Datalog program is evaluated top-down, like a Prolog program, then it's not guaranteed to terminate. On the flip side, Prolog is normally evaluated by SLD-Resolution, implemented as a Depth-First Search with backtracking and so it can get stuck in loops on left-recursions, as you say very correctly, but it can also be evaluated by SLG-Resolution, implemented as Breadth-First Search with memoization (a.k.a. tabling) in which case it _doesn't_ get stuck in loops on left-recursions.
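To make the tabling intuition concrete, here is a Python sketch (illustrative only, not real SLG-Resolution; for plain reachability, memoization collapses to a visited set): a top-down search that records in-progress calls terminates on a cyclic graph, where naive depth-first recursion without that record would loop forever.

```python
def reachable(edges, start, goal):
    """Top-down reachability with a "tabled" set of already-explored nodes."""
    visited = set()

    def go(node):
        if node == goal:
            return True
        if node in visited:   # already tabled: don't re-derive, breaking the loop
            return False
        visited.add(node)
        return any(go(nxt) for (src, nxt) in edges if src == node)

    return go(start)

# A cyclic graph: a <-> b, b -> c.  Naive recursion would bounce between a and b.
cyclic = {("a", "b"), ("b", "a"), ("b", "c")}
print(reachable(cyclic, "a", "c"))  # True
print(reachable(cyclic, "c", "a"))  # False
```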

In short, no I wouldn't agree that Datalog "happens" to be a subset of Prolog syntactically. That's what it is, by design. Evaluation is a different matter.

And this is where we get into the discussion of trade-offs.

>> Note that the demand transformation / magic set optimization, which is a common optimization, closes the gap between Datalog and Prolog semantically. In particular, it gives Datalog the best of both worlds: (1) increased speed, because it is not computing all possible facts, just the ones "reachable" from the query (like Prolog) and also (2) termination guarantee because all programs in the base Datalog language terminate.

Well I'm not sure whether that's right, because I'm not sure what the two "worlds" are that we get the best of. Efficiency is one thing. When you say that Datalog is not computing all facts, I understand this as saying it doesn't have to ground the Herbrand base of a logic program, like ASP has to, for example. True.

But the important trade-off (in my opinion anyway) is between efficiency and completeness, which is another way to look at termination guarantees, a.k.a. decidability. If a language, under some inference rule, is decidable, then it is not complete. And that's the limitation with Datalog, which comes from the fact it's a function-free language [1]. Without functions there is much you can't do. For example, function free Datalogs can't have lists, the main data structure in Prolog (other than well, "terms") because the Prolog list-constructor operator ([Head|Tail]) is a function. Without functions you can't do integer arithmetic. And so on.

I'm not sure how Datalog systems deal with those limitations (I don't really work with Datalog). I suspect they bolt-on some extra-logical system to do e.g. arithmetic, pretty much like Prolog does. In any case, the limitations of Datalog are I guess the reason it's popular as a database query language, because you don't really need arithmetic, or lists, in a database query language. But recursion, that terminates, is nice to have.

Btw, if you have a reference to the creation of Datalog older than the one I linked to, please share it. I've been trying to find the "original" datalog reference for a while and couldn't. I have no idea at this point who, exactly, came up with it, and how.

___________________

[1] Prolog is not function free, but calls its functions "terms". Which is very confusing, because it also calls everything else a "term"; including constants, which it calls "atoms", and literals, i.e. atoms and negations of atoms. If you're lost, that's because you should be. Prolog is a terminological atrocity.



> Btw, if you have a reference to the creation of Datalog older than the one I linked to, please share it

1982 was as far as I got last time I went digging: https://news.ycombinator.com/item?id=34819400



Thanks. Seems I commented in the same thread back then also. I don't remember if I checked your ref back then, but I should now.


Btw, it goes without saying that if you wanted to have a general-purpose relational language then you should use Prolog, not Datalog. After all, if you use Prolog there's nothing stopping you from sticking to Datalog when you want, and only using the full power of Turing-completeness when you must.

Seriously. Learn Prolog. It's a powerful language and you'll never worry about the Object-Relational Impedance-Mismatch ever again in your life. The only reason not to learn it is that you will forever be sad that you can't use it in your day job. Or you'll find one where you can, like I did.



Wait, what happened? There was a comment here by a user who was expressing frustration with their first attempts to learn Prolog.

Dude, I wasn't going to be mean to you. I was going to say I get it: learning Prolog is hard and you need to understand the subject matter very deeply, on a whole other level than the usual languages we learn at uni, Python, or Java, or, dunno, Ada in the olden days.

Prolog's not easy to learn. But it pays off in spades for the effort.



// the right abstraction to think about Datalog is … first order predicate calculus, not the relational calculus//

Umm… I hate to be contrary, but FOPC is a logic about relations. You might think that it is about objects, and it is, but only indirectly, mediated by relations. Constants which refer to objects only appear in the argument lists of relations.

FOPC is all about relations. It’s the language we use when we want to formally define a relation.



I agree, yes. I don't think I said something that disagrees with what you say.


The back-and-forth exchange between blogs, each with comment threads on HN, is a great use of the Internet.


This ecosystem used to be called the Blogosphere


I agree, we should invent the World Wide Web again.


Especially because both blogs and HN comments can be linked to.


Hypertext !


Seriously, so cool. I love watching my fellow nerds debate data types online.

On a side note, is there a Bitter Lesson for datatypes, the way there is for algorithms?



Probably some variation of "hardware usually influences what the optimal datatype is way more than any theoretical runtime differences". For example, B-trees for databases needing to be adapted to the block size of the underlying hardware device. Another example: as soon as you introduce any form of pointer chasing to your datatype, you are often going to struggle against relatively simple array-based options, because L1 cache is crazy fast compared to RAM.


In the language of Linear Algebra, the type of a graph is a sparse matrix. Adjacency matrices can express simple directed and undirected graphs, and Incidence matrices can express multi, hyper, and ubergraphs.

The real power of using matrices for graphs is that you can use Linear Algebra to process them. Instead of "edge and node" thinking, LA brings the power of matrix multiplication and semirings to graph algorithms. Instead of worrying about edge lists, hashmaps of visited nodes, thread pools and when to fork or not to fork, by using a standard like the GraphBLAS you can just express an algorithm as a system of matrix operations, and the underlying library can choose how to run it, and on what hardware.

For example, the current state-of-the-art GraphBLAS implementation, SuiteSparse:GraphBLAS, has a JIT compiler that runs graph algorithms on a variety of CPUs and CUDA GPUs. The same sparse deep neural network inference code that is a few lines of Python can run on a chromebook all the way up to a large GPU system with no changes; the only difference is the size of the graph and the time it takes to process it.

As graphs get into billions and trillions of edges, writing algorithms by hand that target different architectures gets extremely difficult and tedious. Future versions of SuiteSparse have a lot of exciting features planned, including operation fusion and distributed processing. Retargeting hand-written algorithms will be a thing of the past.

One of the best parts about the GraphBLAS is that the graph really does have a "type" in the programming language sense, it's a Matrix, and the same operators and operations you expect to work are there. There is great support for both Julia and Python at the moment for beginners and data science oriented folks to dive in quickly.

Here's an interesting paper on how to express centrality algorithms like PageRank and Triange Centrality (disclaimer: I am one of the paper authors):

https://www.researchgate.net/publication/356707900_The_Graph...

I also created an introductory video some time ago explaining the very basic concepts:

https://www.youtube.com/watch?v=JUbXW_f03W0



Abstractions should be seen as models. They are always wrong, but they are sometimes useful. (And sometimes not.)

When I think of graphs, I usually think of ones used for representing the alignment of biological sequences. Nodes have two sides, left and right, and both sides have their own neighbors. A forward traversal >v of node v enters from the left, reads the sequence stored in the node, and exits from the right. A reverse traversal <v enters from the right, reads the reverse complement of the sequence, and exits from the left.

Sometimes the graphs are path-centric. There is a fixed set of paths (or walks, if you prefer), and the graph is induced by them. The typical query is path extension: if you have already traversed >AC>D, what are the possible left/right extensions according to the underlying paths matching the context. In a good graph representation, you can do this by maintaining a small state that does not grow significantly with the length of the context or the number of underlying paths.

Matrices don't feel like a good abstraction for graphs like this.

There are a number of representations for graphs like this, mostly differing by whether they are mutable or immutable, faster or more space-efficient, and graph-centric or path-centric. Generic algorithms use either node identifiers, which are consistent across graph representations, or opaque handles, which are representation-specific and often more efficient to use. Sometimes you select the representation according to the algorithm you want to use. Sometimes you select the algorithm according to the graph you already have (because conversions can be expensive). And sometimes you adjust the problem definition to reach something that can be computed efficiently with the available tools.



> Abstractions should be seen as models. They are always wrong, but they are sometimes useful. (And sometimes not.)

George Box was very specifically talking about statistical models when he coined that aphorism. Matrices are linear algebra and graphs are graph theory; I find it hard to think they are not correct and useful models.

> A forward traversal >v of node v enters from the left, reads the sequence stored in the node, and exits from the right. A reverse traversal

I'm not an expert in this field but I'm guessing you're talking about De Bruijn graphs, which can be very elegantly modeled with incidence matrices, here's an example of one using the GraphBLAS that downloads data from BioPython, loads it into incidence matrices and graphs it. This is just a simple example, SuiteSparse can handle many billions of edges:

https://github.com/Graphegon/Graphony?tab=readme-ov-file#exa...

Traversing bidirectionally is quite easy, the upper triangle of a matrix are the directed outgoing edges, and the lower triangle are the incoming. This style of "push/pull" optimization is common in the GraphBLAS.

> In a good graph representation, you can do this by maintaining a small state that does not grow significantly with the length of the context or the number of underlying paths.

Again if I understand you correctly, in the GraphBLAS this is accomplished by using accumulators and masks. During traversal data can be accumulated, with a stock operator or one you define, into a vector or matrix, and that object can be used to efficiently mask subsequent computations to avoid unnecessary work or determine when you've reached a termination condition.

> Matrices don't feel like a good abstraction for graphs like this.

Mathematically, graphs and matrices are isomorphic. Regardless of algorithm or storage format like edge lists, tuples or CSR, every graph is a matrix, and vice versa. And if you have a matrix, you have linear algebra to operate on it.

Some people don't like Linear Algebra to operate on graphs, so I guess for them it is "not good", but on the other hand, it's Linear Algebra and Graph Theory, whose roots date back to the 2nd century BC, forward through great minds like Descartes and Euler, permeating every kind of math, science, physics and engineering discipline humans have ever created. That's a strong argument for its goodness.

Now it is entirely possible, likely even, that the current SuiteSparse implementation doesn't have exactly the tool needed or maybe not the precise best storage format, but these missing pieces do not invalidate the underlying mathematical foundation that it's based on.



The model I was talking about is sometimes called a bidirected sequence graph. It's another member of a more general graph family de Bruijn graphs also belong to.

In that model, nodes have separate sets of left edges and right edges, and the edges connect node sides rather than nodes. For example, there can be an edge between the right sides of two nodes. The edges are undirected, but you can't exit a node from the side you entered it. Alternatively, the edges become directed once you fix the orientation of the node visit. Then the successors of a node in one orientation are its predecessors in the other orientation. Some graph representations have an underlying directed graph with separate nodes for the two orientations, but that's an implementation detail people usually don't want to think about.

The path-centric model can be thought of as predicting the token preceding/following a context. Node D may have right edges to >E and >G, but if you are in context C>D, only some of those extensions may be possible according to the underlying paths; and in a longer context such as AC>D, your only option may be narrower still.

Matrices are a useful model when you want to do similar things to most nodes in a graph. But when you are exploring the graph locally in an iterative fashion, it's more convenient to think about nodes and edges.



Does it avoid the downsides of a matrix representation mentioned?

The article responds to another, noting that there's inherent tradeoffs.

ex. "with 100 nodes and 200 edges...If we use an adjacency matrix representation...we need a 100×100 matrix containing 200 ones and 9,800 zeros. If we instead use an edge list we need only 200 pairs of nodes."

(n.b. this flattens a lot of the interesting info in both articles into 'ah, matrix!" - open to that being true but it feels unlikely)



> ex. "with 100 nodes and 200 edges...If we use an adjacency matrix representation...we need a 100×100 matrix containing 200 ones and 9,800 zeros. If we instead use an edge list we need only 200 pairs of nodes."

The GraphBLAS is a sparse matrix library, it does not store the non-present values.

Also, a non-present value may or may not be zero. For example in shortest path algorithms, the non present value is positive infinity.



So, critically, the missing data type is still missing. TFA talks about an input specification, but actually highlights that the lack of a fixed representation for graphs is a strength, because the "computer" can optimize the representation and algorithms on the fly using something like query optimization.

So it's not that the graph datatype already exists, it's that, just like the referenced article posits, there is no good representation. And rather than lament and gnash teeth, we use a neato programming / query language to turn that into a strength.



Small nitpick: I would say that an abstract graph datatype does exist: it's a relation. I think that idea was not really properly explored by the referenced article. And the relational algebra gives us a powerful way of manipulating the abstract relational datatype, which informs efficient concrete representations.


Insightful, and reminds me of Kierkegaard:

Man is spirit. But what is spirit? Spirit is the self. But what is the self? The self is a relation which relates itself to its own self, or it is that in the relation [which accounts for it] that the relation relates itself to its own self; the self is not the relation but [consists in the fact] that the relation relates itself to its own self.

Would this specification suffice for a sentient datalog program? Or is datalog itself sentient?

You've captured my intrigue, and now I want to explore datalog, so thank you for writing the article.



Regarding relational algebra in particular: It is interesting that important and frequently needed relations on graphs cannot be expressed in relational algebra. The transitive closure of a relation is a well-known example, and as you nicely show in your article this relation can be easily and very naturally expressed in two lines of Datalog. For example, we can easily express reachability in a graph:

    reachable(V, V) :- vertex(V).     
    reachable(From, To) :-            
            arc_from_to(From, Next),  
            reachable(Next, To).      
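For readers who want to see what evaluating these two rules actually does, here is a minimal naive-fixpoint sketch in Python (toy facts; real engines use semi-naive evaluation and indexes, but the semantics are the same):

```python
# Facts: vertex/1 and arc_from_to/2, matching the rules above (toy data).
vertex = {"a", "b", "c"}
arc_from_to = {("a", "b"), ("b", "c")}

# reachable(V, V) :- vertex(V).
reachable = {(v, v) for v in vertex}

# reachable(From, To) :- arc_from_to(From, Next), reachable(Next, To).
# Apply the rule repeatedly until no new facts are derived (a fixpoint).
changed = True
while changed:
    derived = {(f, t)
               for (f, nxt) in arc_from_to
               for (nxt2, t) in reachable
               if nxt == nxt2}
    changed = not derived <= reachable
    reachable |= derived

print(sorted(reachable))
```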
One can show that Datalog with two very conservative and simple extensions (allowing negation of extensional database relations, and assuming a total order on the domain elements) captures the complexity class P, so can be used to decide exactly those properties of databases (and hence graphs) that are evaluable in polynomial time, a major result from descriptive complexity theory.

An example of such a property is CONNECTIVITY ("Is the graph connected?"), which can be easily expressed with Datalog on ordered databases, where we assume 3 built-in predicates (such as first/1, succ/2 and last/1) to express an ordering of domain elements:

    connected(X) :- first(X).
    connected(Y) :- connected(X), succ(X, Y), reachable(X, Y).

    connected    :- last(X), connected(X).
If such an ordering is not available via built-in predicates, then we can easily define it ourselves for any given concrete database by adding suitable facts. Also negated EDB relations can be easily defined for any database as concrete additional relations.


Yes, you're right that one cannot express Datalog semantics (and also transitive closure semantics) with just one "query", as queries cannot be recursive.

If you view each rule as a query, however, looping over rules does capture Datalog semantics. Furthermore, by optimizing over rules using the relational algebra, one can derive algorithms "equivalent" to traditional graph algorithms.

(I don't think you would disagree with me; just want to clarify for other people who might be reading.)



I mean, isn't that what happens when you use SQL? When was the last time you specified how the bytes were aligned in disk in your database?

By the same logic, the SQL datatype is also missing.



Datalog for graphs is classic; the points-to analysis papers were some of my first "aha"s for why it's practical

If you enjoy this kind of thinking, we recently released GFQL for dataframe-native accelerated graph querying & compute that builds on some of the under-the-hood insights here

Imagine Neo4j Cypher, except no need for a database -- just import it -- and automatically vectorizes for significantly faster CPU+GPU performance. This is fundamentally similar to the kinds of optimized engines the article's datalog approach enables. In fact, one of our big internal questions was whether to use ~datalog syntax as the frontend!

We've run it on 100M+ edge graphs in seconds on some of the cheapest GPUs you can get, and are getting ready for the next rev with aggregate compute as it's becoming more important for our community: https://github.com/graphistry/pygraphistry/blob/master/demos...



As someone who's been spending an inordinate amount of time thinking about spreadsheets, their dependency graphs, linear algebra and matrices, parallel formula evaluation taking advantage of the GPU, declarative reactive programming, Elm, Flix, Rust... this conversation feels like validation that there's something to explore there but I'm also kinda losing my mind in the process...


Usually a sign of deep learning (pun half intended)


I'm always deeply impressed by people who can write complex, coherent essays above 2000 words with like a day of advance notice. The "missing data type" essay was just 3000 words and took me months. Show me your dark magic please.


I'm lucky to have been surrounded by researchers who have been teaching and thinking about these ideas over the last ~6 months, so this essay has been marinating for a long time... your article just provided the necessary activation energy for me to write this all out. Thank you!

Here are the researchers, in no particular order:

Max Willsey (Berkeley): https://www.mwillsey.com/ and his PL class (https://inst.eecs.berkeley.edu/~cs294-260/sp24/)

Joe Hellerstein (Berkeley): https://dsf.berkeley.edu/jmh/

Dan Suciu (UW): https://homes.cs.washington.edu/~suciu/ and his DB theory class (https://berkeley-cs294-248.github.io/)

Remy Wang (UCLA): https://remy.wang/

Hung Ngo (relationalAI): https://hung-q-ngo.github.io/

The Hydro Project at Berkeley: https://hydro.run/



Joins are great for OLTP workloads but are horrible for all but the simplest graph algorithms. A datalog-based system makes for a nice story, but it would not survive benchmarking.

Pick a classic algorithm, say triangle counting, implement it in Datalog, compare against GBBS [1] and come back here to report results.

[1] https://github.com/ParAlg/gbbs



I love this back and forth (subscriber of Hillel Wayne; gonna bookmark this blog too)! I think the authors probably agree more than they disagree. There's probably energy around adding better graph support to mainstream languages, but I have significant doubts about whether or not Datalog will get traction there. It's really hard to get programmers to help themselves (I'm in this group FWIW--didn't use linting for waaaay too long), whether it's GC, Rust's ownership model, formal methods, flow analysis, Datalog, functional programming, blah blah. Maybe it's cultural, maybe it's some kind of deep/shared aesthetic amongst engineers, I don't know.


Interesting how this article mentions Rust, and program analysis, but completely fails to mention that Polonius, the next-generation borrow checker project for Rust, is based on datalog (source: https://blog.rust-lang.org/inside-rust/2023/10/06/polonius-u...)


Recursive queries are the killer application for datalog.

As a nuts-and-bolts developer, I like https://www.cozodb.org for datalog-style data operations. It interoperates nicely with python, C, and Swift. It comes with some basic graph algorithms, data loaders, etc. I haven't found any bugs, but fair warning: it's not 1.0, and the developers seem to be pivoting to AI vector data support.

Data modeling is a lot closer to NoSQL key/value stores, or FoundationDB. The append-mainly model requires maintenance compaction. I typically tokenize before putting anything into the database and never had scaling problems with large but not internet-scale uses. Explain-query is a bit opaque.



While it doesn’t refute the “missing data type” article that hit the front page yesterday I think it is a very interesting part of the discussion.

But mostly I’m impressed someone read an article, thought “not quite true” and created a well written piece like this, in what? A day and a half?



Sorry, kicking things off with a tangent here. I've been seeing datalog for graph queries popping up a lot recently, but every datalog example I see looks totally different. Is it just a general concept? Or are there some actual unifying syntax features for different datalog implementations? Does saying "this uses datalog" guarantee some core functionality?


Datalog is basically more of a paradigm than a single language: the "core" is just that you have Horn clauses and facts, and synthesize more facts (usually to a fixed point) using those clauses. Everything on top is basically up to the implementation: a lot of modern Datalog implementations have lattice types, for example, so that introducing a new fact for the same subject "updates" the previous fact instead of duplicating. Or they might allow for negative clauses instead of only positive clauses, or implement the synthesizing in a way that is more efficient. But it's really up to the implementation what set of features they implement, and what syntax to use for it, since there isn't a specification or agreed-upon set of features for what "Datalog" is if you say you use it.
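To make the lattice idea concrete, here's a toy sketch (invented data; not any particular engine's API) where a "distance" fact merges with an existing fact via min instead of duplicating, which yields single-source shortest paths as a fixpoint:

```python
import math

# Weighted arcs: (source, target, weight) -- toy data.
edges = {("s", "a", 1.0), ("a", "b", 2.0), ("s", "b", 5.0)}

# dist maps vertex -> best known distance; absent means +infinity.
dist = {"s": 0.0}

# Re-derive facts until the fixpoint; the "merge" is min, so re-deriving
# a fact for the same vertex updates it rather than duplicating it.
changed = True
while changed:
    changed = False
    for (u, v, w) in edges:
        cand = dist.get(u, math.inf) + w
        if cand < dist.get(v, math.inf):
            dist[v] = cand
            changed = True

print(dist)
```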


To add to this description, the Prolog-derived syntactic core of Datalog (the Horn clauses and facts) can be viewed as a combination of "unification" and mutually recursive rules to find a fixpoint over your data+query. It's essentially like solving simultaneous equations.

The Prolog-derived syntax is routinely extended because the core is typically too simplistic/inexpressive to be directly useful, e.g. see https://www.fdi.ucm.es/profesor/fernan/des/html/manual/manua...



The 'Verse' language by SPJ and co. explores the notion that Datalog semantics can be expressed using 'normal' programs in SSA form. As long as you use pure values this works out quite well. There are some convenient normal Datalog abstractions, like implicit union of clauses with the same name, that don't map so well.

I expect this is going to show up as a really popular model at some point - don't have to have two separate worlds for queries and other logic.



> don't have to have two separate worlds for queries and other logic

That's definitely the dream. Another point along that spectrum (from the author of Apache Calcite): https://github.com/hydromatic/morel



But this doesn't actually address the graph "datatype".

This article basically says "use an edge list" and plug it into a very fancy library/database that may internally transform the representation and/or otherwise do magic to evaluate the graph better.

And I mean like sure, but now you're just sort of burying the problem. You're saying "I've invented the one true graph library and that library will handle all the hard parts."

But datalog has limitations; stuff as simple as weighted edges (much less graphs with non-integer annotations, which require some declarative analysis) is the realm of academic research.

Like, when "we don't support page rank" (https://link.springer.com/article/10.1007/s11280-021-00960-w) is noted in the research paper from 2022, I think saying "datalog" solves all these problems seems incorrect.



I'm confused by your link. The part where the paper said that they don't support page rank was referencing prior work on Cog. The system that they are now presenting seems to support PageRank queries.

> However, despite the high performance and declarativeness benefits of Cog, it does not support common complex data analytics, such as PageRank... We present Nexus, a Datalog evaluation system that overcomes all the aforementioned challenges.

And Figure 21 shows two Datalog systems that seem to be able to run the PageRank algorithm: https://link.springer.com/article/10.1007/s11280-021-00960-w...

> But datalog has limitations, stuff as simple as weighted [edges]... are the realm of academic research.

Weighted edges are well-supported by Soufflé, which is stable and I would be comfortable using it in production. Soufflé also supports ADTs, so it also can augment paths with proofs in the same manner as Egglog. I used a more "research" implementation (Egglog) for the post because they have an online interactive demo. It is true that there is academic research being done on Soufflé, but there is academic research being done on e.g. Rust, and people still use Rust in production.

I also explicitly say that there needs to be more research into better Datalog engines and integrating Datalog support into programming languages. ("Languages could have amazing graph support! In maybe a decade? And only after lots of research effort?")



So, critically, the missing data type is still missing. TFA talks about an input specification, but actually highlights that the lack of a fixed representation for graphs is a strength, because the "computer" can optimize the representation and algorithms on the fly using something like query optimization.

So it's not that the graph datatype already exists, it's that, just like the referenced article posits, there is no good representation. And rather than lament and gnash teeth, we use a neato programming / query language to turn that into a strength.



And this works great, until some necessary thing isn't supported by the language/library/black-box/datalog implementation you're using.

In reality the original (hillelwayne) article is saying that the "graph" datatype is missing, as a standard type, which is true. Imperative language implementations abstract that away to libraries, which may be graph databases (datalog being one of many), or may be more tightly coupled things like networkx for cases where you need some kind of closer-knit integration and custom computation.

Like, taking a step back, you're saying that no single representation is a panacea. And the original article is taking the same stance, that no single graph representation or library is a panacea, because so much is computation dependent.



Well I'm only saying what the article was saying. And you're repeating what both articles were saying. So I think we're in perfect agreement?

> until some necessary thing isn't supported

Taking a third side, I don't think the possibility of a missing algorithm should remove the positives of not assuming a graph format a priori.

What I mean is, it seems that the approach of "use a data representation and implement your algorithm using something like queries" is meant to be an argument for empowering graph library writers to support a wider array of inputs (e.g., all of them) without specifying different implementations for those inputs.

That's cool. Maybe I misunderstood what you said.



> What I mean is, it seems that the approach of "use a data representation and implement your algorithm using something like queries" is meant to be an argument for empowering graph library writers to support a wider array of inputs (e.g., all of them) without specifying different implementations for those inputs.

I think this gets back to the thing I initially brought up, which is that if you take this as the guidance, features as simple as weighted edges jump to the realm of SoTA. It's perhaps good research guidance, but that's not useful for me, who needs to analyze a graph today.



Well sure, my comment does not apply outside its intended scope. But for your scope, the article implies that this is not an esoteric, half baked thing - that it's real right now. So maybe if I were in your shoes that's where I'd look.


> But this doesn't actually address the graph "datatype".

This is correct. What is shown is in fact a programming environment with a query engine/optimizer using an internal DSL. That is cool, and it is something you see with SQL, tensors, etc., but it is a full-blown system.

Not a datatype.



It seems premature to worry about graphs when most languages don't even have good support for tables.


I'm curious about graph algorithms that are "higher order" than what the article describes - e.g. how does Datalog encode the fact that two graphs (given as two relation sets) are isomorphic? How do I write a graph isomorphism algorithm in Datalog, or e.g. enumerate all graphs that have 6 vertices?


This perhaps isn't what you're asking about, but a very useful thing to realize is that datalog or SQL is quite useful for subgraph isomorphism, aka graph pattern matching. One can compile the graph you're looking for into the rhs/query of a rule. This is probably only wise for small pattern graphs in large search-target graphs.

There's interesting depth here in that constraint satisfaction problems can be modeled as finding homomorphisms (I associate this line of thought with Moshe Vardi and others). Big patterns into small targets tend to look like coloring problems or SAT, whereas small patterns into big targets look like queries. Isomorphism is right in the middle.

   triangle(X,Y,Z) :- edge(X,Y), edge(Y,Z), edge(Z,X).
   square(X,Y,Z,W) :- edge(X,Y), edge(Y,Z), edge(Z,W), edge(W,X).
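As a sketch of how such a rule body becomes a query, the triangle pattern is just a three-way self-join of the edge relation (naive nested loops here, on made-up data; real engines plan this with join algorithms, and newer ones use worst-case optimal joins):

```python
# Toy directed edge relation containing one directed triangle a->b->c->a.
edge = {("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")}

# triangle(X,Y,Z) :- edge(X,Y), edge(Y,Z), edge(Z,X).
triangles = {(x, y, z)
             for (x, y) in edge
             for (y2, z) in edge if y2 == y
             for (z2, x2) in edge if (z2, x2) == (z, x)}

print(sorted(triangles))
```

Each rotation of the same triangle appears once, which is why deduplicating by a canonical vertex order matters when counting.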


Didn't mention it in the last post, but it would be interesting to see Stanford GraphBase explored in some of this discussion. It is very much "in the weeds" as it were, but the entire point is to explore the impact of a graph struct on various algorithms.


nit: the turnstile expressions ⊢ are swapped in the examples

    body :- head    should be written    head ⊢ body


Aren't graphs already represented through matrices?


The original blog post [0] referenced hypergraphs and other more complicated structures like property graphs.

[0] https://news.ycombinator.com/item?id=39592444



Which can all be represented with Incidence Matrices:

https://en.wikipedia.org/wiki/Incidence_matrix
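For instance, a hypergraph fits directly: one row per vertex, one column per hyperedge, with a 1 where the vertex belongs to the edge. A tiny sketch with made-up data:

```python
vertices = ["a", "b", "c", "d"]
hyperedges = [{"a", "b", "c"}, {"c", "d"}]  # edges may join any number of vertices

# Incidence matrix: inc[i][j] == 1 iff vertex i belongs to hyperedge j.
inc = [[1 if v in e else 0 for e in hyperedges] for v in vertices]

for row in inc:
    print(row)
```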



A (set) relation of arity N is isomorphic to an N-dimensional boolean tensor. The mapping is to interpret each tuple as a coordinate in the tensor, and set the entry at that coordinate to 1.
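Concretely (toy data; plain lists standing in for a real tensor library): an arity-2 relation over a 3-element domain becomes a 3×3 0/1 matrix with one 1 per tuple:

```python
# Domain and an arity-2 relation (an edge set).
names = ["a", "b", "c"]
idx = {v: i for i, v in enumerate(names)}
edge = {("a", "b"), ("b", "c")}

# Each tuple is a coordinate; set that entry of the boolean tensor to 1.
n = len(names)
tensor = [[0] * n for _ in range(n)]
for (u, v) in edge:
    tensor[idx[u]][idx[v]] = 1

print(tensor)
```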


Great Insight. Why do we need datalog, though? Why not just use SQL?

Granted, datalog is simpler, has a strong theoretical foundation, is more secure and securable, has over 50 years of technological development, is currently wicked fast...and its opportunities for and- and or-parallelism, combined with its single-assignment variables, mean that it's perfect for the multicore, shared-cache chips we'll have to build to keep Moore's Law going. I could go on...

....but that's all been true since the Clinton Administration. Heck, most of it was true since the Nixon administration, but NOBODY WANTS TO USE IT.

I can go to my boss and persuade him to let me implement something in Python rather than C++. But he wouldn't even understand what I was proposing if I argued for datalog over SQL. All he wants to do is finish updating Jira so that he can go home. I can't launch into a week-long tutorial about what a Horn clause is, or why negation-by-failure is more coherent than the corner cases of SQL's three-valued logic...
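For anyone curious what those three-valued-logic corner cases look like, a quick sqlite3 session shows the classic one: NULL = NULL is neither true nor false, so a NULL row escapes both a filter and its negation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL evaluates to UNKNOWN (surfaced as None), not TRUE.
eq, = conn.execute("SELECT NULL = NULL").fetchone()
print(eq)  # None

conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (NULL)")

# The NULL row matches neither x = 1 nor x != 1.
n_eq = conn.execute("SELECT COUNT(*) FROM t WHERE x = 1").fetchone()[0]
n_ne = conn.execute("SELECT COUNT(*) FROM t WHERE x != 1").fetchone()[0]
print(n_eq, n_ne)  # 0 0
```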

We're stuck with SQL. Fortunately, the OP's great points about using relations to implement graphs are still valid even w/o the datalog.



You can, without too much work, transpile Datalog to SQL, and SQL's ecosystem is strong enough to make this genuinely useful: https://github.com/philzook58/snakelog or perhaps just do it manually: https://www.philipzucker.com/tiny-sqlite-datalog/
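As a sketch of the transpilation (table and column names invented), the two classic reachability rules map onto a single recursive common table expression, which SQLite has supported since 3.8.3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?)",
                 [("a", "b"), ("b", "c"), ("c", "d")])

# reachable(X, Y) :- edge(X, Y).
# reachable(X, Z) :- edge(X, Y), reachable(Y, Z).
rows = conn.execute("""
    WITH RECURSIVE reachable(src, dst) AS (
        SELECT src, dst FROM edge
        UNION
        SELECT e.src, r.dst FROM edge AS e JOIN reachable AS r ON e.dst = r.src
    )
    SELECT src, dst FROM reachable ORDER BY src, dst
""").fetchall()

print(rows)
```

Note that UNION (not UNION ALL) is what gives set semantics and guarantees termination on cyclic graphs.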


Interesting. I'm afraid if my boss saw me trying those out at work, he'd ask me why I'm two weeks behind, and if I told him it was the drawbacks of SQL which are holding me back and this is exactly what I needed to get back on track, he's say "you are two weeks behind because you are always getting distracted by stuff like this."

Sucks but honestly, how could I blame him for thinking that way? He--like practically every other boss on the planet-- so utterly lacks any capacity to understand what datalog brings to the table, that really, on what basis could he even make the call?

And you can't even blame him for not taking two weeks off to investigate it--new gee-whiz techniques and methodologies appear every day, most of which promise more than they can possibly deliver. If he spent all his time studying them, he'd get nothing else done.

So why should this be any different? Besides, everybody else on the planet manages to get their shit done with SQL, so just shut up and get with the program.

Uh...I'm not bitter tho...chuckle



Your boss either doesn't pay as much attention as you think, or you should consider getting a new boss if possible. Life is too short not to indulge your curiosity.


Aye, but these days, with all the layoffs, you are either not that thrilled about rocking the boat, or you are already involuntarily looking for a new boss ;-( With not that much hope that the new one would be better.

And again, it's hard to blame the boss, he's just as scared about it as everybody else. Why should he take a chance and rock the boat?

Especially for something like datalog. No iteration, just recursion? What's with this bizarre syntax? Why are the variables write-once??? Not everybody is just going to be able to take all that in and see how all these weird little pieces add up to a superior solution.









