(comments)

Original link: https://news.ycombinator.com/item?id=40474712

A former Google employee discusses various Google services, including Gmail, Google Groups, Google Drive, and Gchat. They clarify that an algorithm analyzing data without human involvement is not considered a privacy violation, but draw a distinction between where the data lives and where the check is performed: a simple spell checker raises no privacy concern if it runs locally on the device. They voice concern about the constant updates complex systems such as AI require and the challenge of continuously processing large amounts of data. They note that old word processors had effective spell checkers that ran standalone, whereas modern ones often involve redirecting content to the cloud for analysis, creating potential privacy issues. Furthermore, they argue that anything with access to private information is relevant to the privacy discussion, including metadata and even seemingly innocuous elements like email headers, citing the former NSA director's remark, "We kill people based on metadata," to underline its significance. The employee suggests that the Go team's collaboration with Google Cloud Platform and Drive on hosting potentially harmful files helps reduce security risks. They also advise against playing games over software requirements, sharing experience using PyPI for non-Python projects such as FFmpeg and Eigen. In closing, the former Google employee stresses that defining what counts as a valid Go module is critical given the risks of caching arbitrary content, and discusses the evolving nature of Go's module design in relation to challenges addressed in CUE's module design.

Related Articles

Original Text


And Gmail and Google Groups, and Google Drive, and Gchat, on and on. The data you store doesn't even have to be public. With Gmail they would distribute credentials to log in and read attachments that they uploaded via IMAP.

(I am a former Google SAD-SRE [Spam, Abuse, Delivery])



No inside information, but presumably this means Delivery to other organizations, which, among other things, includes maintaining outbound IP reputation, which is closely related to Spam and Abuse.



An algorithm that processes private user data is by itself not invading anyone's privacy. It's clear to me that invasion of privacy only happens when humans look at private user data directly, or look at user data that's not sufficiently processed by an algorithm.

Otherwise, something as simple as a spell checker would be an invasion of privacy because it literally looks at every word in an email you write. That's absurd.



At least in my opinion, there's a big difference between where the data lives and where the checking algorithm runs. I don't think a spell checker would fall into what I'd consider a privacy concern as long as it's running locally on my device.



I don't work in email nor at Google, but I see two problems.

1) You need to constantly update the spell checker, so each time you say "this is a word" or something like that, the data most likely gets sent, and the problem is that your content is part of that data. I assume Google does something similar with messages you mark as spam or not spam. That's full email redirection and analysis, not partial like old word processors.

2) I feel AI makes this even harder: you can no longer simply check patterns as easily as before, you need to check the whole content constantly.



We've had spell/grammar checkers in word processors that worked totally offline for a long time now. They definitely can be improved with a hosted service but that's by no means necessary and comes with tradeoffs like latency and offline support.



An algorithm that denies service, changes ad behavior, etc. based on user content is definitely invading privacy compared to your spell checker case.

The spell checker would also be a massive privacy invasion if it flagged users based on the content of what they wrote.



If an algorithm is looking through private stuff and making a decision based on it or is sending signals where the signal depends on the private stuff, then it's pretty much by definition leaking private information.

An algorithm that leaked no private information would not be useful to a business. It would do a bunch of computation and then throw it away. So realistically anything that looks at private information is privacy-relevant.

That includes even just the email headers. To quote the former head of the NSA: "We Kill People Based on Metadata" https://abcnews.go.com/blogs/headlines/2014/05/ex-nsa-chief-...

You can have debates about how much private information should be leaked and for what purposes. But I don't think having a threshold like "it's all private unless another human reads it" is a good way to think about the issue.



It seems like it would be pretty easy to use PyPI for this, because packages can contain arbitrary non-Python files. And you can also do things like base64-encoding your files into strings in Python code.
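
As a rough illustration of that last trick, stashing arbitrary bytes inside ordinary-looking Python source only takes a few lines; all file and module names below are made up:

```python
import base64
from pathlib import Path

# Packaging side: bake any file's bytes into normal-looking Python
# source as a base64 string constant (all names here are hypothetical).
def embed(src: str, dest_module: str) -> None:
    data = base64.b64encode(Path(src).read_bytes()).decode("ascii")
    Path(dest_module).write_text(f'DATA = "{data}"\n')

# Consumer side: recover the original bytes from the embedded constant.
def extract(data: str, out: str) -> None:
    Path(out).write_bytes(base64.b64decode(data))
```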



Googler, opinions are my own. I know nothing about this space.

I would hope the Go team collaborated with GCP and Drive, as hosting malicious files is something Google has to deal with all the time. This isn't much different from other endpoints Google already allows people to put random data on.



> I would hope the Go team collaborated with GCP and Drive

Former Googler, I know nothing about the Go Dev Tools team, but Google collaborates in this way better than almost any massive company I've worked at or heard about from close friends.

Google is really good at having a central team manage infrastructure and share it across the company. As long as it's not a messenger app. Surely (pure guessing) the Go team is using the internal blob store, and I think there are some internal-infra teams that handle abuse and file scanning automatically.



I know PyPI has some non-Python projects as well. Python needs the ability to distribute wheels, which are compiled binaries, as the user may not be able to compile library code. Lots of that code is written in C, but Golang[1] is also possible. I can't find an example, but I believe I've seen this used for distributing applications (not libraries) as well. It's kinda cool to write some app in C, upload it to PyPI, and then ask users to install it with `pip install`.

[1] https://github.com/popatam/gopy_build_wheel_example



Hypothetically, if they did try to add some requirement to use Python, people could just comply maliciously by providing the most minimal stub of Python code, right? Linux, but `ls` is written in Python. So it is probably better just not to play games.



I've never encountered this requirement in many years of daily use - pip for me has always happily installed anything if it can.

Now I've definitely seen customized distributions of Python from package managers that have taken steps to prevent you from using pip. IIRC, the Python you get from `apt-get install python` in Debian does this? I.e., it's designed to support system utilities, not to serve as a user's general-purpose Python environment, and they want `apt-get` to control this environment, not pip. So they've removed pip, ensurepip, and easy_install from the core system Python environment.

TLDR: In my experience, that requirement doesn't come from pip, it's your distro taking steps to prevent https://xkcd.com/1987/



Yeah I copied CMake's idea of using PyPI and I also use it to distribute some pure Rust CLI tools using Maturin. It works really well. Pip is... well it's about on par with most other package managers, i.e. not great, not terrible, but it has some pretty huge advantages over any other software distribution method on Linux:

* Very likely to be installed already on Linux and probably Mac too.

* Doesn't require root to install. You can even have isolated installs via pyenv.

* I don't have to ask anyone's permission to publish a package.

* I only have to make one package.

If anyone can think of a better option, I'm all ears, but until then I'm fairly happy with this hack.



Some of those arguments are getting harder to make as pip and the distros push venvs; pip now requires a scary --break-system-packages flag if you use the preinstalled interpreter.



I've been using PyPI a lot recently for non-Python stuff such as FFmpeg and Eigen. It's part of the reason I've been able to ditch Homebrew entirely!



That's maybe naive, but... how is that different than just pushing files to e.g. a GitHub repository? Is it just the fact that you need to create an account for GitHub? Because I can store arbitrary data there, too. Without the 500M limit...



For me on desktop, the version seems to be the fourth thing down in the right column, under weekly downloads, and there's a checkmark. (Or maybe I'm missing something.)



Sure, but the gosum database is a critical piece of worldwide software infrastructure, so you can count on it being accessible behind many firewalls and always up. And it's completely free and anonymous.

Perfect for the purpose.
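
For illustration, the lookup endpoint is plain anonymous HTTPS. The endpoint shape below follows the Go checksum database protocol as I understand it, and the module path is just an example:

```python
import urllib.request

# Ask the public checksum database for a module's record; the response
# contains the record ID, the go.sum lines, and a signed tree head.
url = "https://sum.golang.org/lookup/github.com/google/uuid@v1.3.0"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode())
```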



Yeah, when I did that there was no public Rekor instance run by the sigstore project, so I chose the only available public transparency log I could bend to my needs (X.509 certificate transparency logs were an alternative, but they'd quickly hit rate limits from ACME providers).



The CUE team worked with the Go team on the module system. From these interactions, and community input, they decided against using a proxy like Go has. The "exploit" in the article was one of the reasons they made this decision, and chose to use OCI registries instead. The V1 proposal actually proposed using the same Go proxy servers as a stopgap, which received significant pushback from the community (I was probably the loudest voice against the idea). The Go team was supportive at the time, but this would have been exactly what OP talks about, having non-Go projects in the proxy/sumdb.

So CUE's module design can be seen as an evolution on Go's, building on the good parts while addressing some of the shortcomings.

Fun fact: CUE started as a fork of Go, mainly for the internal compiler tooling and packages.



One thing that the Go module system solves that seems to be unaddressed in CUE's design based on OCI is the sum database / transparency log.

I could add a "Statement that we might wish to make for a module M" to the "Module contents assurance" section:

- The content of module M is the same content that everyone else sees for the same `$path@$version`.

Though I guess users can utilize existing solutions like https://github.com/sigstore/cosign or rekor (mentioned elsewhere itt).
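
For what it's worth, the "same content everyone else sees" property is checkable by hand: a module's "h1:" checksum, as I understand Go's dirhash scheme, is a SHA-256 over the per-file SHA-256 lines of the zip, sorted by file name. A rough sketch (treat as illustrative, not a reference implementation; the zip path is hypothetical):

```python
import base64
import hashlib
import zipfile

# Recompute a module zip's "h1:" checksum: hash each file, then hash the
# name-sorted "<sha256 hex>  <name>\n" lines as one stream.
def h1(zip_path: str) -> str:
    h = hashlib.sha256()
    with zipfile.ZipFile(zip_path) as zf:
        for name in sorted(zf.namelist()):
            digest = hashlib.sha256(zf.read(name)).hexdigest()
            h.update(f"{digest}  {name}\n".encode())
    return "h1:" + base64.b64encode(h.digest()).decode()

print(h1("uuid@v1.3.0.zip"))  # compare against the sumdb's go.sum line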



How do you "fix" that at all?

In the end, there is no definition of "a source control repository that is a Go module" that is robust to this sort of "attack"... although calling it an "attack" is kind of dubious; the reasons why this is a bad thing strike me as very strained and relatively weak. Mostly it hurts Google by hosting too much stuff, but good luck bringing them down that way.



Color me unsurprised Marwan is on this issue. He and Aaron wrote Athens, Marwan wrote (to my knowledge) the first Go download protocol implementation that Athens is based on.

This issue is kind of curious because Athens already uses the `go mod download -json` command mentioned as a preflight check for module verification. More or less, if the repo passes the go module commands' understanding of a module, then Athens will serve it. In more concrete terms:

- a module version, pseudo-version, or +incompatible version must be able to be formulated

- that module (and it's dependencies) must produce a valid checksum

The checksum of a module just covers its current .mod file and all of its files, plus the same recursively for each dependency. So, as the author pointed out, you have lots of room for arbitrary files by design, so long as you include a basic Go program.
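
That preflight is roughly the following shape. This is a sketch, not Athens' actual code: it assumes `go` is on PATH and is run inside some module context, and the Version/Sum field names are as I recall them from the command's JSON output:

```python
import json
import subprocess

# Athens-style preflight sketch: if the go tool can resolve the module
# and compute a checksum for it, treat it as servable.
def preflight(module: str) -> bool:
    result = subprocess.run(
        ["go", "mod", "download", "-json", module],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return False
    info = json.loads(result.stdout)
    return bool(info.get("Version")) and bool(info.get("Sum"))

print(preflight("github.com/google/uuid@v1.3.0"))
```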



W3C laid the groundwork for everything on the Web to be heavily cacheable, so it's weird that there are so few general-purpose proxy caches. Are publishers sending short "Cache-Control: max-age" or "Vary: Cookie" responses when they didn't need to? Are too many ISPs paying for transit rather than peering?



In general there's no way to ensure the cache hasn't tampered with the contents (e.g. ISP proxy ad injection on non HTTPS sites). For software downloads usually there are signatures and checksums. Arbitrary content, not so much.



Maybe I'm being stupid but what exactly is the issue here? It's probably a bit wasteful of the proxy to cache non-Go repos, but even if it didn't you could make it store arbitrary data just by having it cache a Go repo surely? Sounds like a complete non-issue unless I've missed something.



I don't think you've missed anything. The news here appears to be that an unsecured public proxy is willing to proxy things and make them available to the public in an unsecured fashion.

The article does make the point that some monitored networks might trust golang proxy URLs more than arbitrary web URLs and that this could be used for bypassing reputation filters etc -- but there are already several ways to do that, and this one doesn't seem particularly special.
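
For concreteness, anyone can hit the documented proxy endpoints directly; the module path below is just an example:

```python
import urllib.request

# List a module's cached versions, then pull one zip straight from the
# public proxy; no Go toolchain involved.
base = "https://proxy.golang.org/github.com/google/uuid/@v"
versions = urllib.request.urlopen(f"{base}/list").read().decode().split()
data = urllib.request.urlopen(f"{base}/{versions[0]}.zip").read()
print(f"{versions[0]}: {len(data)} bytes served from the proxy cache")
```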
