(Comments)

Original link: https://news.ycombinator.com/item?id=40383978

Companies sometimes collect and use customer data to improve their AI models without explicit consent. This practice has raised concerns among users, especially when it involves paid services or sensitive data in private communications. While some acknowledge the potential benefits of improved products, many remain uneasy about the privacy intrusion. Some efforts have been made to address these concerns, but broad awareness and action are still limited. The debate continues over the ethical boundaries of monetizing customer data at industrial scale in business settings. While some stress the importance of consent and transparency, others argue that the practice is essential to advancing AI technology. Ultimately, personal preferences and values shape the ongoing discussion around data privacy and its use in commercial contexts.

Related articles

Original article


I contacted support to opt out. Here is the answer.

"Hi there,

Thank you for reaching out to Slack support. Your opt-out request has been completed.

For clarity, Slack has platform-level machine learning models for things like channel and emoji recommendations and search results. We do not build or train these models in such a way that they could learn, memorize, or be able to reproduce some part of customer data. Our published policies cover those here (https://slack.com/trust/data-management/privacy-principles), and as shared above your opt out request has been processed.

Slack AI is a separately purchased add-on that uses Large Language Models (LLMs) but does not train those LLMs on customer data. Slack AI uses LLMs hosted directly within Slack’s AWS infrastructure, so that customer data remains in-house and is not shared with any LLM provider. This ensures that Customer Data stays in that organization’s control and exclusively for that organization’s use. You can read more about how we’ve built Slack AI to be secure and private here: https://slack.engineering/how-we-built-slack-ai-to-be-secure....

Kind regards, Best regards,"



It's not like L1 support can make official statements on their own initiative. That was written by someone higher up and they're just copypasting it to panicked customers.


> We offer Customers a choice around these practices. If you want to exclude your Customer Data from helping train Slack global models, you can opt out. If you opt out, Customer Data on your workspace will only be used to improve the experience on your own workspace and you will still enjoy all of the benefits of our globally trained AI/ML models without contributing to the underlying models.

Why would anyone not opt-out? (Besides not knowing they have to of course…)

Seems like only a losing situation.



What's baffling to me is why companies think that when they slap AI on the press release, their customers will suddenly be perfectly fine with them scraping and monetizing all of their data on an industrial scale, without even asking for permission. In a paid service. Where the service is private communication.


I am not pro-exploiting users' ignorance for their data, but I would counter this with the observation that slapping AI on a product suddenly makes people care about the fact that companies are monetizing their usage data.

Monetizing user activity data through opt-out collection is not new. Pretending that this phenomenon has anything to do with AI seems like a play for attention that exploits people's AI fears.

I'll sandwich my comments with a reminder that I am not pro-exploiting users' ignorance for their data.



Sure - but isn't this a little like comparing manual wiretapping to dragnet? (Or comparing dragnet to ubiquitous scrape-and-store systems like those employed by five-eyes?)

Scale matters



I mean, I am in complete agreement, but at least in theory the only reason for them to add AI to the product would be to make the product better, which would give you a better product per dollar.


Most people don't care, paid service or not. People are already used to companies stealing and selling their data up and down. Yes, this is absolutely crazy. But was anything substantial done against it before? No, hardly anyone was raising awareness against it. Now we keep reaping what we were sowing. The world keeps sinking deeper and deeper into digital fascism.


Companies do care: why would you take on the additional risk of data leakage for free? In the best-case scenario nothing happens but you also don't get anything out of it; in the worst-case scenario, extremely sensitive data from private chats gets exposed and hits your company hard.


Companies are comprised of people. Some people in some enterprises care. I'd wager that in any company beyond a tiny upstart you'll have people all over the hierarchy that don't care. And some of them will be responsible for toggling that setting... or not, because they just can't be arsed to, given how little they care about the chat histories of people they'll likely never even interact with being used to train some AI.


Because they don't seem to make it easy. It doesn't seem that, as an individual user, I have any say in how my data is used; I have to contact the Workspace Owner. When I do, I'll be asking them to look at alternative platforms instead.

"Contact us to opt out. If you want to exclude your Customer Data from Slack global models, you can opt out. To opt out, please have your Org or Workspace Owners or Primary Owner contact our Customer Experience team at [email protected] with your Workspace/Org URL and the subject line “Slack Global model opt-out request.” We will process your request and respond once the opt out has been completed."



I'm the one who picked Slack over a decade ago for chat, so hopefully my opinion still holds weight on the matter.

One of the primary reasons Slack was chosen was because they were a chat company, not an ad company, and we were paying for the service. Under these parameters, what was appropriate to say and exchange on Slack was both informally and formally solidified in various processes.

With this change, beyond just my personal concerns, there are legitimate concerns at a business level that need to be addressed. At this point, it's hard to imagine anything but self-hosted as being a viable path forward. The fact that chat as a technology has devolved into its current form is absolutely maddening.



> Why would anyone not opt-out?

This is basically like all privacy on the internet.

Everyone WOULD opt out, if it were easy, and it becomes a whack-a-opt-out game.

note how you opt-out (generic contact us), and what happens when you do opt-out (they still train anyway)



> We offer Customers a choice around these practices.

I'm reminded of the joke from The Hitchhiker's Guide to the Galaxy; maybe they will have a small hint in a very inconspicuous place, like inserting this into the user agreement on page 300 or so.



“But the plans were on display…” “On display? I eventually had to go down to the cellar to find them.” “That’s the display department.” “With a flashlight.” “Ah, well, the lights had probably gone.” “So had the stairs.” “But look, you found the notice, didn’t you?” “Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard.’”


Never more true than with Apple.

Activating an iPhone, for example, has a screen devoted to how privacy is important!

It will show you literally thousands of pages of how they take privacy seriously!

(and you can't say NO anywhere in the dialog, they just show you)

They are normalizing "you cannot do anything", and then everyone does it.



Which third party does Apple sell your profile to or your data to?

What do they do with the content of your emails for advertising?

Which credit card networks at which retailers feed that same profile?

So it’s certainly “more true” with, well, every other vendor you can buy a phone from in a telco store or big box store.



> Why would anyone not opt-out?

Because you might actually want to have the best possible global models? Think of "not opting out" as "helping them build a better product". You are already paying for that product; if there is anything you can do, for free and without any additional time investment on your side, that makes their next release better, why not do it?

You gain a better product for the same price, they get a better product to sell. It might look like they get more than you do in the trade, and that's probably true; but just because they gain more does not mean you lose. A "win less / win more" situation is still a win-win. (It's even a win-win-win if you take into account all the other users of the platform).

Of course, if you value the privacy of these data a lot, and if you believe that by allowing them to train on them it is actually going to risk exposing private info, the story changes. But then you have an option to say stop. It's up to you to measure how much you value "getting a better product" vs "estimated risk of exposing some information considered private". Some will err on one side, some on the other.



> Of course, if you value the privacy of these data a lot, and if you believe that by allowing them to train on them it is actually going to risk exposing private info, the story changes. But then you have an option to say stop. It's up to you to measure how much you value "getting a better product" vs "estimated risk of exposing some information considered private". Some will err on one side, some on the other.

The problem with this reasoning, at least as I understand it, is that you don't really know when or where the training on your data crosses the line into information you don't want to share until it's too late. It's also a slippery slope.



> Think of "not opting out" as "helping them build a better product"

Then they can simply pay me for that. I have zero interest in helping any company improve their products for free -- I need some reasonable consideration in return. For example, a percent of their revenues from products that use my data in their development. I'm totally willing to share the data with them for 2-3% of their revenues, that seems acceptable to me.



I for one consider it my duty to bravely sacrifice my privacy at the altar of corporate profit so that the true beauty of an LLM trained on emojis and cat gifs can bring humanity to the next epoch.


> Think of "not opting out" as "helping them build a better product"

I feel like someone would only have this opinion if they've never ever dealt with anyone in the tech industry, or any capitalist, in their entire life. So, like, 8-19 year olds? Except even they seem to understand that profit-absolutist goals undermine everything.

This idea has the same smell as "We're a family" company meetings.



Yep, much like just about every credit card company shares your personal information BY DEFAULT with third parties unless you explicitly opt out (this includes Chase, Amex, Capital One, but likely all others).


For Chase Personal and Amex you can opt out in the settings. When you get new credit cards, these institutions default to sharing your data. For Capital One you need to call them and have a chat saying you want to exercise the restriction advertised in their privacy policy, and they'll do it for you.

PG&E has a "Do not sell my info" form.

For other institutions, go check the settings and read the privacy policies.

I don't see the point of Rocket Money. They seem like they exist to sell your info.

You should keep track of your own subscriptions. My way of doing this is to have a separate zero-annual-fee credit card ONLY for subscriptions and I never use that card for anything else. That way I can cleanly see all my subscriptions on that credit card's bill, cleanly laid out, one per line, without other junk. I can also quickly spot sudden increases in monthly charges. I also never use that card in physical stores so that reduces the chance of a fraud incident where I need to cancel that card and then update all my subscriptions.

If you want to organize it even more, get a zero-annual-fee credit card that lets you set up virtual cards. You can then categorize your subscriptions (utilities, car, cloud/API, media, memberships, etc.) and that lets you keep track of how much you're spending on each category each month.



I'm willing to bet that for smaller companies, they just won't care enough to consider this an issue, and that's what Slack/Salesforce is counting on.

I can't see a universe in which large corpos would allow such blatant corporate espionage for a product they pay for no less. But I can already imagine trying to talk my CTO (who is deep into the AI sycophancy) into opting us out is gonna be arduous at best.



I'd be surprised if any legal department in any company that has one doesn't freak the f out when they read this. They will likely lose the biggest customers first, so even if it is 1% of customers, it will likely affect their bottom line enough to give it a second thought. I don't see how they might profit from an in-house LLM more than from their enterprise-tier plans.

Their customer support will have a hell of a day today.



> For any model that will be used broadly across all of our customers, we do not build or train these models in such a way that they could learn, memorise, or be able to reproduce some part of Customer Data

This feels so full of subtle qualifiers and weasel words that it generates far more distrust than trust.

It only refers to models used "broadly across all" customers - so if it's (a) not used "broadly" or (b) only used for some subset of customers, the whole statement doesn't apply. Which actually sounds really bad because the logical implication is that data CAN leak outside those circumstances.

They need to reword this. Whoever wrote it is a liability.



Sure lawyers wrote it but I'd bet a lot there's a product or business person standing behind the lawyer saying - "we want to do this but don't be so obvious about it because we don't want to scare users away". And so lawyers would love to be very upfront about what is happening because that's the best way to avoid liability. However, that conflicts with what the business wants, and because the lawyer will still refuse to write anything that's patently inaccurate, you end up with a weasel word salad that is ambiguous and unhelpful.


Yes, lawyers do tend to have a part to play in writing things that present a legally binding commitment being made by an organisation. Developers really can’t throw stones from their glass houses here. How many of you have a pre-canned spiel explaining why the complexities of whichever codebase you spend your days on are ACTUALLY necessary, and are certainly NOT the result of over-engineering? Thought so.


> How many of you have a pre-canned spiel explaining why the complexities of whichever codebase you spend your days on are ACTUALLY necessary, and are certainly NOT the result of over-engineering? Thought so.

Hm, now you mention it, I don't think I've ever seen this specific example.

Not that we don't have jargon that's bordering on cant, leading to our words being easily mis-comprehended by outsiders: https://i.imgur.com/SL88Z6g.jpeg

Canned cliches are also the only thing I get whenever I try to find out why anyone likes the VIPER design pattern — and that's despite being totally convinced that (one of) the people I was talking to, had genuinely and sincerely considered my confusion and had actually experimented with a different approach to see if my point was valid.



Especially when a few paragraphs below they say:

> If you want to exclude your Customer Data from helping train Slack global models, you can opt out.

So Customer Data is not used to train models "used broadly across all of our customers [in such a way that ...]", but... it is used to help train global models. Uh.



To me it says that they _do_ train global models with customer data, but they are trying to ensure no data leakage (which will be hard, but maybe not impossible, if they are training with it).

The caveats are for “local” models, where you would want the model to be able to answer questions about discussions in the workspace.

It makes me wonder how they handle “private” chats, can they leak across a workspace?

Presumably they are trying to train a generic language model which has very low recall for facts in the training data, then using RAG across the chats that the logged on user can see to provide local content.



My intuition is that it's impossible to guarantee there are no leaks in the LLM as it stands today. It would surely require some new computer science to ensure that no part of any output that could ever possibly be generated contains sensitive data from any of the input.

It's one thing if the input is the published internet (even if covered by copyright), it's entirely another to be using private training data from corporate water coolers, where bots and other services routinely send updates and query sensitive internal services.



There is a way. Build a preference model from the sensitive dataset. Then use the preference model with RLAIF (like RLHF but with AI instead of humans) to fine-tune the LLM. This way only judgements about the LLM outputs will pass from the sensitive dataset. Copy the sense of what is good, not the data.
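A rough toy sketch of the shape of this idea (everything here is made up for illustration; a real setup would plug a learned reward model into an actual RLHF/RLAIF fine-tuning loop):

    from typing import Callable, List, Tuple

    def train_preference_model(judged_pairs: List[Tuple[str, str]]) -> Callable[[str], float]:
        # Toy scorer learned from (better, worse) output pairs judged against
        # the sensitive data. Only its judgements (floats) ever leave this
        # closure, never the underlying text.
        good_words = {w for better, _ in judged_pairs for w in better.split()}
        def score(candidate: str) -> float:
            words = candidate.split()
            return sum(w in good_words for w in words) / max(len(words), 1)
        return score

    def rlaif_feedback(prompts: List[str],
                       generate: Callable[[str], List[str]],
                       reward: Callable[[str], float]) -> List[Tuple[str, str, str]]:
        # RLAIF-style loop: the reward model (instead of humans) ranks candidates;
        # a real trainer would fine-tune the LLM on these (prompt, best, worst) triples.
        out = []
        for p in prompts:
            ranked = sorted(generate(p), key=reward, reverse=True)
            out.append((p, ranked[0], ranked[-1]))
        return out

    # Toy usage with a fake "LLM" that returns canned candidates.
    pref = train_preference_model([("ship the fix today", "idk maybe later")])
    print(rlaif_feedback(["status?"], lambda p: ["ship the fix today", "maybe later"], pref))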


If you switch to Teams only for this reason I have some bad news for you - there’s no way Microsoft is not (or will not start in future) doing the same. And you’ll get a subpar experience with that (which is an understatement).


We've been using Mattermost and it works very well. Better than Slack.

The only downside is their mobile app is a bit unreliable, in that it sometimes doesn't load threads properly.



I would guess Microsoft has a lot more government customers (and large customers in general) than Slack does. So I would think they have a lot more to lose if they went this route.


> Why are these kinda things opt-out? And need to be discovered..

Monies.

> We're literally discussing switching to Teams at my company (1500 employees)

Considering what Microsoft does with its "New and Improved(TM)" Outlook and love for OpenAI, I won't be so eager...



we use Teams and it's fine.

Just don't use the "Team" feature of it to chat. Use chat groups and 1-to-1 of course. We use "Team" channels only for bots: CI results, alerts, things like that.

Meetings are also chat groups. We use the daily meeting as the dev-team chat itself so it's all there. Use Loops to track important tasks during the day.

I'm curious what's missing/broken in Teams that you would rather not have chat at all?



The idea that Slack makes companies work better needs some proof behind it, I’d say the amount of extra distraction is a net negative… but as with a lot of things in software and startups nobody researches anything and everyone writes long essays about how they feel things are.


Distraction is not enforced. Learning to control your attention, and how to help yourself do it, is crucial whatever you do, in whatever era and in whatever technological context or otherwise. It is the most valuable long-term resource you have.

I think we start to recognize this at larger scale.

Slack easily saves a ton of time solving complex problems that require the interaction and expertise of a lot of people, often an unpredictable number of them for each problem. They can answer with a delay; in a good culture this is totally accepted, and people can still independently move forward or switch tasks if necessary, same as with slower communication tools. You are not forced to answer within any particular lag, but Slack makes it possible, when needed, to reduce it to zero.

Sometimes you are unsure whether you need help or can do something on your own. I certainly know that a lot of times I eventually had no chance whatsoever, because the knowledge required was too specialized, and this is not always clear up front. Reducing barriers to communication in those cases is crucial, and I don't see Slack being in the way here, only helpful.

The goal of organizing Slack is to pay the right amount of attention to the right parts of communication for you. You can do this if you really spend (hmm) attention trying to figure out what that is and how to tune your tools to achieve it.



That’s a lot of words with no proof, isn’t it; it’s just your theory. Until I see a well-designed study on such things I struggle to believe the conjecture you make either way. It could be quite possible that you benefit from Slack and I don’t.

Even receiving a message and not responding can be disruptive, and on top of that I’d say being offline or ignoring messages is impossible in most companies.



It's your choice whether to trust only statements backed by scientific rigour or to try things out and apply them to your way of life. This is just me talking to you; in that, you are correct.

Regarding “receiving a message”: my devices are allowed only limited use of notifications. Of all the messaging/social apps, only messages from my wife in our messaging app of choice pop up as notifications. Slack certainly is not allowed there.



I'm not sure chat apps improve business communications. They are ephemeral, with differing expectations on different teams. Hardly what I'd label as "cohesive"

Async communications are critical to business success, to be sure -- I'm just not convinced that chat apps are the right tool.



From what I’ve seen (not much, actually), most channels can be replaced by a forum-style discussion board. Chat can be great for 1:1 and small-team interactions. And for tool interactions.


Nah. Whoever decided to create the reality their counsel is dancing around with this disclaimer is the actual problem, though it's mostly a problem for us, rather than them.


I think it's as clear as it can be, they go into much more detail and provide examples in their bullet points, here are some highlights:

Our model learns from previous suggestions and whether or not a user joins the channel we recommend. We protect privacy while doing so by separating our model from Customer Data. We use external models (not trained on Slack messages) to evaluate topic similarity, outputting numerical scores. Our global model only makes recommendations based on these numerical scores and non-Customer Data.

We do this based on historical search results and previous engagements without learning from the underlying text of the search query, result, or proxy. Simply put, our model can't reconstruct the search query or result. Instead, it learns from team-specific, contextual information like the number of times a message has been clicked in a search or an overlap in the number of words in the query and recommended message.

These suggestions are local and sourced from common public message phrases in the user’s workspace. Our algorithm that picks from potential suggestions is trained globally on previously suggested and accepted completions. We protect data privacy by using rules to score the similarity between the typed text and suggestion in various ways, including only using the numerical scores and counts of past interactions in the algorithm.

To do this while protecting Customer Data, we might use an external model (not trained on Slack messages) to classify the sentiment of the message. Our model would then suggest an emoji only considering the frequency with which a particular emoji has been associated with messages of that sentiment in that workspace.
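Read literally, the emoji example amounts to something like this toy sketch (my paraphrase of the policy wording, not Slack's actual code; the sentiment classifier is a stand-in for whatever external model they use):

    from collections import Counter, defaultdict
    from typing import Optional

    def external_sentiment(text: str) -> str:
        # Stand-in for an off-the-shelf classifier (not trained on Slack data).
        positive_cues = ("congrats", "shipped", "great", "launch")
        return "positive" if any(w in text.lower() for w in positive_cues) else "neutral"

    class WorkspaceEmojiStats:
        """Per-workspace counts only: which emoji follows which sentiment label."""
        def __init__(self) -> None:
            self.counts = defaultdict(Counter)

        def record_reaction(self, message: str, emoji: str) -> None:
            self.counts[external_sentiment(message)][emoji] += 1

        def suggest(self, message: str) -> Optional[str]:
            top = self.counts[external_sentiment(message)].most_common(1)
            return top[0][0] if top else None

    ws = WorkspaceEmojiStats()
    ws.record_reaction("We shipped the release, congrats team!", ":tada:")
    ws.record_reaction("Great demo today", ":tada:")
    print(ws.suggest("Congrats on the launch!"))  # -> :tada: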



- Create a Slack account for your 95-year-old grandpa

- Exclude that one account from using the models, he's never going to use Slack anyway

- Now you can learn, memorise, or reproduce all the Customer Data you like



The problem is this also covers very reasonable use cases.

Use sampling across messages for spam detection, predicting customer retention, etc - pretty standard.

Then there are cases where you could have models, more like LLMs, that can output data from the training set, but you're running them only for that customer.



I'm imagining a corporate slack, with information discussed in channels or private chats that exists nowhere else on the internet.. gets rolled into a model.

Then, someone asks a very specific question.. conversationally.. about such a very specific scenario..

Seems plausible confidential data would get out, even if it wasn't attributed to the client.

Not that it’s possible to ask an LLM how a specific or random company in an industry might design something…



Sometimes the obvious questions are met with a lot of silence.

I don't think I can be the only one who has had a conversation with GPT about something obscure they might know but there isn't much about online, and it either can't find anything... or finds it, and more.



Whatever lawyer wrote that should be fired. This poorly written nonsense makes it look like Slack is trying to look shady and subversive. Even if well intended this is a PR blunder.


> They need to reword this. Whoever wrote it is a liability.

Wow you're so right. This multi-billion dollar company should be so thankful for your comment. I can't believe they did not consult their in-house lawyers before publishing this post! Can you believe those idiots? Luckily you are here to save the day with your superior knowledge and wisdom.



In summary, you must opt-out if you want to exclude your data from global models.

Incredibly confusing language since they also vaguely state that "data will not leak across workspaces".

Use tools that cannot leak data, not ones that merely "will not".



In this case they mean leak into the global model — so no. You can have sovereignty of your data if you use an open protocol like IRC or Matrix, or a self-hosted tool like Zulip, Mattermost, Rocket Chat, etc


"File over app" is a good way of putting it!

Something strange is happening on your blog, fwiw: Bookmarking it via command + D flips the color scheme to "night mode" – is that intentional?



"Will not" allows the existence of a bridge but it's not on your route and you say you're not going to go over it. "Cannot" is the absence of a bridge or the ability to cross it.


I'm confused about this statement: "When developing AI/ML models or otherwise analyzing Customer Data, Slack can’t access the underlying content. We have various technical measures preventing this from occurring"

"Can't" is a strong word. I'm curious how an AI model could access data, but Slack, Inc itself couldn't. I suspect they mean "doesn't" instead of "can't", unless I'm missing something.



I also find the word "Slack" in that interesting. I assume they mean "employees of Slack", but the word "Slack" obviously means all the company's assets and agents, systems, computers, servers, AI models, etc.

I would find even a statement from Signal like "we can't access our users content" to be tenuous and overly-optimistic. Like, when I hear the word "can't" my brain goes to: there is nothing anyone in the company could do, within the bounds of the law, to do this. Employees at Slack could turn off the technical measures preventing this from occurring. Employees at Signal could push an app update which side-channels all messages through to a different server, unencrypted.

Better phrasing is "Employees of Slack will not access the underlying content".



> I would find even a statement from Signal like "we can't access our users content" to be tenuous and overly-optimistic.

I don't really agree with this statement. Signal literally can't read user data right now. The statement is true, why can't they use it?

If they can't use it, nobody can. There are no services that can't publish an update reversing any security measure available. Also, doing that would be illegal, because it would render the statement "we can't access our users content" false.

In Slack's case, it is totally different. Data is accessible by Slack systems, so the statement "we can't access our users content" is already false. Probably what they mean is something along the lines of: "The data can be accessed by our systems, but we have measures in place that block access for most of our employees."



Interestingly I'd probably go the other way.

If it's verifiably E2EE then I consider "we can't access this" to be a fairly powerful statement. Sure, the source could change, but if you have a reasonable distribution mechanism (e.g. all users get the same code, verifiably reproducible) then that's about as good as you can get.

Privacy policies that state "we won't do XYZ" have literally zero value to me to the extent that I don't even look at them. If I give you some data, it's already leaked in my mind, it's just a matter of time.



As an engineer who has worked on systems that handle sensitive data, it seems straightforwardly to me to be a statement about:

1. ACLs

2. The systems that provision those ACLs

3. The policies that determine the rules those systems follow.

In other words, the model training batch job might run as a system user that has access to data annotated as 'interactions' (at timestamp T1 user U1 joined channel C1, at timestamp T2 user U2 ran a query that got 137 results), but no access to data annotated as 'content', like (certainly) message text or (probably) the text of users' queries. An RPC from the training job attempting to retrieve such content would be denied, just the same as if somebody tried to access someone else's DMs without being logged in as them.

As a general rule in a big company, you the engineer or product manager don't get to decide what the ACLs will look like, no matter how much you might feel like it. You request access for your batch job from some kind of system that provisions it. In turn, the humans who decide how that system works obey the policies set out by the company.

It's not unlike a bank teller who handles your account number. You generally trust them not to transfer your money to their personal account on the sly while they're tapping away at the terminal--not necessarily because they're law abiding citizens who want to keep their job, but because the bank doesn't make it possible and/or would find out. (A mom and pop bank might not be able to make the same guarantee, but Bank of America does.) [*]

In the same vein, this is a statement that their system doesn't make it possible for some Slack PM to jack their team's OKRs by secretly training on customer data that other teams don't use, just because that particular PM felt like ignoring the policy.

[*] Not a perfect analogy, because a bank teller is like a Slack customer service agent who might, presumably after asking for your consent, be able to access messages on your behalf. But in practice I doubt there's a way for an employee to use their personal, probably very time-limited access to funnel that data to a model training job. And at a certain level of maturity a company (hopefully) also no longer makes it possible for a human employee to train a model in a random notebook using whatever personal data access they have been granted and then deploy that same model to prod. Startups might work that way, though.
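In code, the kind of guard I'm describing is roughly this (a simplified sketch, not Slack's actual infrastructure; the data classes and principal names are invented):

    # Simplified sketch of the ACL idea: the training batch job runs as a system
    # principal granted "interactions" but not "content", and the data service
    # enforces that on every request.
    from dataclasses import dataclass, field

    @dataclass
    class Principal:
        name: str
        scopes: set = field(default_factory=set)

    class AccessDenied(Exception):
        pass

    class DataService:
        def __init__(self, records):
            self._records = records  # each record tagged with a data class

        def fetch(self, principal: Principal, data_class: str):
            if data_class not in principal.scopes:
                raise AccessDenied(f"{principal.name} may not read {data_class!r}")
            return [r for r in self._records if r["class"] == data_class]

    records = [
        {"class": "interactions", "event": "user U1 joined channel C1"},
        {"class": "content", "text": "the actual message body"},
    ]
    svc = DataService(records)
    trainer = Principal("model-training-job", scopes={"interactions"})

    print(svc.fetch(trainer, "interactions"))   # allowed
    try:
        svc.fetch(trainer, "content")           # denied by policy
    except AccessDenied as e:
        print(e)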



From their white paper linked in the same comment

> Provisioning: To minimize the risk of data exposure, Slack adheres to the principles of least privilege and role-based permissions when provisioning access—workers are only authorized to access data that they reasonably must handle in order to fulfill their current job responsibilities. All production access is reviewed at least quarterly.

so... seems like they very clearly can.



"If you can read assembly, all programs are open source."

Sure, it's easier and less effort if the program is actually open source, but it's absolutely still possible to verify on bytecode, decompiled or disassembled programs, too.



We really need to start using self-hosted solutions. Like matrix / element for team messaging.

It's ok not to want to run your own hardware on your own premises. But the answer is to run a solution that is end-to-end encrypted so that the hosting service cannot get at the data. cryptpad.fr is another great piece of software.



Eugh. Has anyone compiled a list of companies that do this, so I can avoid them? If anyone knows of other companies training on customer data without an easy highly visible toggle opt out, please comment them below.


Synology updated this policy back in March (Happened to be a Friday afternoon).

Services Data Collection Disclosure

"Synology only uses the information we obtain from technical support requests to resolve your issue. After removing your personal information, we may use some of the technical details to generate bug reports if the problem was previously unknown to implement a solution for our products."

"Synology utilizes the information gathered through technical support requests exclusively for issue resolution purposes. Following the removal of personal data, certain technical details may be utilized for generating bug reports, especially for previously unidentified problems, aimed at implementing solutions for our product line. Additionally, Synology may transmit anonymized technical information to Microsoft Azure and leverage its OpenAI services to enhance the overall technical support experience. Synology will ensure that personally identifiable information, such as names, phone numbers, addresses, email addresses, IP addresses and product serial numbers, is excluded from this process."

I used to just delete privacy policy update emails and the like but now I make a habit of going in to diff them to see if these have been slipped in.



We can fight back by not posting anything useful or accurate to the internet until there are protections in place and each person gets to decide how their data is used and whether they are compensated for it.


The incentive for first-party tool providers to do this is going to be huge, whether it's Slack, Google, Microsoft, or really any other SaaS tool. Ultimately, if businesses want to avoid getting commoditized by their vendors, they need to be in control of their data and their AI strategy. And that probably ultimately means turning off all of these small-utility-very-expensive-and-might-ruin-your-business features, and actually creating a centralized, access-controlled, well-governed knowledge base into which you can plug any open-source or black-box LLM, from any provider.


all your files? no way that cozy of a blanket statement can be true. if you kept cycling in drives full of /dev/random you could fill up M$ servers with petabytes of junk? sounds like an appealing weekend project


How could this possibly comply with European "right to be forgotten" legislation? In fact, how could any of these AI models comply with that? If a user requests to be forgotten, is the entire model retrained (I don't think so).


This "ai" scam going on now is the ultimate convoluted process to hide sooo much tomfuckery: theres no such thing as copyright anymore! this isn't stealing anything, its transforming it! you must opt out before we train our model on the entire internet! (and we still won't spits in our face) this isn't going to reduce any jobs at all! (every company on earth fires 15% of everyone immediately) you must return to office immediately or be fired! (so we get more car data teehee) this one weird trick will turn you into the ultimate productive programmer! (but we will be selling it to individuals not really making profitable products with it ourselves)

and finally the most egregious and dangerous: censorship at the lowest level of information before it can ever get anywhere near people's fingertips or eyeballs.



> how could any of these AI models comply with that? If a user requests to be forgotten, is the entire model retrained (I don't think so).

I don't believe that is the current interpretation of GDPR, etc. - if the model is trained, it doesn't have to be deleted due to a RTBF request, afaik. There is significant legal uncertainty here.

Recent GDPR court decisions mean that this is probably still non-compliant due to the fact that it is opt-out rather than opt-in. Likely they are just filtering out all data produced in the EEA.



> Likely they are just filtering out all data produced in the EEA.

Likely they are just hoping to not get caught and/or consider it cost of doing business. GDPR has truly shown us (as if we didn't already know) that compliance must be enforced.



To add some nuance to this conversation, what they are using this for is Channel recommendations, Search results, Autocomplete, and Emoji suggestion and the model(s) they train are specific to your workspace (not shared between workspaces). All of which seem like they could be handled fairly privately using some sort of vector (embeddings) search.

I am not defending Slack, and I can think of a number of cases where training on Slack messages could go very badly (i.e., exposing private conversations, data leakage between workspaces, etc.), but I think it helps to understand the context before reacting. Personally, I do think we need better controls over how our data is used, and Slack should be able to do better than "Email us to opt out".



> the model(s) they train are specific to your workspace (not shared between workspaces)

That's incorrect -- they're stating that they use your "messages, content, and files" to train "global models" that are used across workspaces.

They're also stating that they ensure no private information can leak from workspace to workspace in this way. It's up to you if you're comfortable with that.



From the wording, it sounds like they are conscious of the potential for data leakage and have taken steps to avoid it. It really depends on how they are applying AI/ML. It can be done in a private way if you are thoughtful about how you do it. For example:

Their channel recommendations: "We use external models (not trained on Slack messages) to evaluate topic similarity, outputting numerical scores. Our global model only makes recommendations based on these numerical scores and non-Customer Data"

Meaning they use a non-slack trained model to generate embeddings for search. Then they apply a recommender system (which is mostly ML not an LLM). This sounds like it can be kept private.

Search results: "We do this based on historical search results and previous engagements without learning from the underlying text of the search query, result, or proxy" Again, this is probably a combination of non-slack trained embeddings with machine learning algos based on engagement. This sounds like it can be kept private and team specific.

autocomplete: "These suggestions are local and sourced from common public message phrases in the user’s workspace." I would be concerned about private messages being leaked via autocomplete, but if it's based on public messages specific to your team, that should be ok?

Emoji suggestions: "using the content and sentiment of the message, the historic usage of the emoji [in your team]" Again, it sounds like they are using models for sentiment analysis (which they probably didn't train themselves and even if they did, don't really leak any training data) and some ML or other algos to pick common emojis specific to your team.

To me these are all standard applications of NLP / ML that have been around for a long time.
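For what it's worth, the general pattern being described looks something like the sketch below (this assumes "external model" means any off-the-shelf embedding model; only derived numeric scores reach the globally trained part):

    import math
    from typing import Dict, List

    def embed(text: str) -> List[float]:
        # Placeholder for an off-the-shelf embedding model (not trained on Slack data).
        vec = [0.0] * 16
        for i, ch in enumerate(text.lower()):
            vec[i % 16] += ord(ch) / 1000.0
        return vec

    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def channel_features(user_channels: List[str], candidate: str) -> Dict[str, float]:
        # Only these derived numbers cross into the globally trained recommender;
        # no channel names or message text do.
        sims = [cosine(embed(c), embed(candidate)) for c in user_channels]
        return {"max_topic_similarity": max(sims), "num_channels": float(len(user_channels))}

    print(channel_features(["#deploys", "#incident-response"], "#sre-oncall"))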



The way it's written means this just isn't the case. They _MAY_ use it for what you have mentioned above. They explicitly say "...here are a few examples of improvements..." and "How Slack may use Customer Data" (emph mine). They also... may not? And use it for completely different things that can expose who knows what via prompt hacking.


Agreed, and that is my concern as well that if people get too comfortable with it then companies will keep pushing the bounds of what is acceptable. We will need companies to be transparent about ALL the things they are using our data for.


>Our mission is to build a product that makes work life simpler, more pleasant and more productive.

I know it would be impossible, but I wish we could go back to the days when we didn't have Slack (or similar tools). Our Slack is a cesspool of people complaining, talking behind other people's backs, an echo chamber of negativity, etc.

That probably speaks more to the overall culture of the company, but Slack certainly doesn't help. You can also say "the tool is not the problem, people are" - sure, we can always explain things away, but Slack certainly plays a role here.



Your company sucks. I’ve used slack at four workplaces and it’s not been at all like that. A previous company had mailing lists and they were toxic as you describe. The tool was not the issue.


Yeah, written communication is harder than in-person communication.

It’s easy to come across poorly in writing, but that issue has no easy resolution unless you’re prepared to ban Slack, email, and any other text-based communication system between employees.

Slack can sometimes be a place for people who don’t feel heard in conventional spaces to vent — but that’s an organisational problem, not a Slack problem.



HN isn't really a bastion of media literacy or tech criticism. If you ever ask "does [some technology] affect [something qualitative] about [anything]", the response on hn is always going to be "technology isn't responsible, it's how the technology is used that is responsible!", asserting, over and over again, that technology is always neutral.

The idea that the mechanism of how people communicate affects what people communicate is a pretty foundational concept in media studies (a topic which is generally met with a hostile audience on HN). Slack almost certainly does play a role, but people who work in technology are incentivized to believe that technology does not affect people's behaviors, because that belief allows people who work in technology to be free of any and all qualitative or moral judgements on any grounds; the assertion that technology does not play a role is something that technology workers cling to because it absolves them of all guilt in all situations, and makes them, above all else, innocent in every situation. On the specific concept of a medium of communication affecting what is being communicated, McLuhan took these ideas to such an extreme that it's almost ludicrous, but he still had some pretty interesting observations worth thinking on, and his writing on this topic is some of the earlier work. This is generally the place where people first look, because much of the other work assumes you've understood McLuhan's work in advance. https://en.wikipedia.org/wiki/Understanding_Media



I disagree that Slack plays a role. You only mentioned human aspects, nothing to do with technology. There was always going to be instant messaging as software once computers and networks were invented. You'd just say this happens over email and blame email.


> That probably speaks more to the overall culture of the company

Yep. Fun fact, my last workplace had a fairly nontoxic Slack... but there was a whole second Slack dedicated to bitching and shitposting where the bosses weren't invited. Humans gonna human.



Was not limited to just the bosses who were not invited. If you weren’t in the cool club you also did not get an invite.

A very inclusive company on paper that was very exclusionary behind the scenes.



No, I don’t think Slack does play a role in this. It is quite literally a communication tool (and I’d argue one that encourages far _more_ open communication than others).

If Slack is a cesspool, that’s because your company culture is a cesspool.



I think open communication in a toxic environment can obviously amplify toxicity, or at least less open communication can act as a damper on toxicity.

Slack is surely not the generator of toxicity but it seems obvious it could act at increasing the bandwidth.

You can't have it both ways.



Tokens (outside of a few trillion) are worthless imo; I think OAI has pushed that limit. Let the others chase them with billions into the ocean of useless conversational data and drown.


We’ll only use it for…

choosing an emoji, and…

a fun little internal only stock picker tool, that suggests to us some fun stocks to buy based on the anonymised real time inner monologue of hundreds of thousands of tech companies



> Emoji suggestion: Slack might suggest emoji reactions to messages using the content and sentiment of the message, the historic usage of the emoji and the frequency of use of the emoji in the team in various contexts. For instance, if [PARTY EMOJI] is a common reaction to celebratory messages in a particular channel, we will suggest that users react to new, similarly positive messages with [PARTY EMOJI].

Finally someone has figured out a sensible application for "AI". This is the future. Soon "AI" will have a similar connotation as "NFT".



"leadership" at my company tallies emoji reactions to their shitty slack messages and not reacting with emojies over a period of time is considered a slight against them.

I had to up my slack emoji game after joining my current employer



And it continues:

> To do this while protecting Customer Data, we might use an external model (not trained on Slack messages) to classify the sentiment of the message. Our model would then suggest an emoji only considering the frequency with which a particular emoji has been associated with messages of that sentiment in that workspace.

This is so stupid and needlessly complicated. And all it does is remove personality from messages, suggesting everyone conforms to the same reactions.



Finally. I am all for this AI if it is going to learn and suggest my passive aggressive "here" emoji that I use when someone @here s on a public channel with hundreds of people for no good reason.


> Data will not leak across workspaces.

> If you want to exclude your Customer Data from helping train Slack global models, you can opt out.

I don't understand how both these statements can be true. If they are using your data to train models used across workspaces then it WILL leak. If they aren't then why do they need an opt out?

Edit: reading through the examples of AI use at the bottom of the page (search results, emoji suggestions, autocomplete), my guess is this policy was put in place a decade ago and doesn't have anything to do with LLMs.

Another edit: From https://slack.com/help/articles/28310650165907-Security-for-...

> Customer data is never used to train large language models (LLMs).

So yeah, sounds like a nothingburger.



They're saying they won't train generative models that will literally regurgitate your text; my guess is classifiers are fair game in their interpretation.


You are assuming they're saying that, because it's one charitable interpretation of what they're saying.

But they haven't actually said that. It also happens that people say things based on faulty or disputed beliefs of their own, or people willfully misrepresent things, etc.

Until they actually do say something as explicit as what you suggest, they haven't said anything of the sort.



> Data will not leak across workspaces. For any model that will be used broadly across all of our customers, we do not build or train these models in such a way that they could learn, memorize, or be able to reproduce some part of Customer Data.

I feel like that is explicitly what this is saying.



The problem is, it's really really hard to guarantee that.

Yes if they only train say, classifiers, then the only thing that can leak is the classification outcome. But these things can be super subtle. Even a classifier could leak things if you can hack the context fed into it. They are really playing with fire here.



It is also hard to guarantee that, in a multi-tenant application, users will never see other users' data due to causes like mistakes in AuthZ logic, caching gone awry, or other unpredictable situations that come up in distributed systems—yet even before the AI craze we were all happy to use these SaaS products anyway. Maybe this class of vulnerability is indeed harder to tame than most, but third-party software has never been without risks.


yes, i certainly agree with you. i think oftentimes these policies are written by non-technical people

i'm not entirely convinced that classifiers and LLMs are disjoint to begin with



The OP privacy policy explicitly states that autocompletion algorithms are part of the scope. "Our algorithm that picks from potential suggestions is trained globally on previously suggested and accepted completions."

And this can leak: for instance, typing "a good business partner for foobars is" might not send that text upstream per se, but would be consulting a local model whose training data would have contained conversations that other Slack users are having about brands that provide foobars. How can Slack guarantee that the model won't incorporate proprietary insights on sourcing the best foobar producers into its choice of the next token? And sure, one could build an adversarial model that attempts to minimize this kind of leakage, but is Slack incentivized to create such a thing vs. just building an optimal autocomplete as quickly as possible?

Even if it were just creating classifiers, similar leakages could occur there, albeit requiring more effort and time from attackers to extract actionable data.

I can't blame Slack for wanting to improve their product, but I'd also encourage any users with proprietary conversations to encourage their admins to opt out as soon as possible.



> How can Slack guarantee that the model won't incorporate proprietary insights on sourcing the best foobar producers into its choice of the next token?

This is explained literally in the next sentence after the one you quoted: "We protect data privacy by using rules to score the similarity between the typed text and suggestion in various ways, including only using the numerical scores and counts of past interactions in the algorithm."

If all the global model sees is {similarity: 0.93, past_interactions: 6, recommendation_accepted: true} then there is no way to leak tokens, because not only are the tokens not part of the output, they're not even part of the input. But such a simple model could still be very useful for sorting the best autocomplete result to the top.
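To make that concrete, a model fed only those features could be as small as this toy sketch (hypothetical; the feature names come from the example above, not from Slack):

    # Hypothetical sketch: a ranking model whose training inputs are only derived
    # numbers ({similarity, past_interactions}), so no tokens exist anywhere in
    # its training data to be leaked.
    from typing import Dict, List

    def train_ranker(examples: List[Dict]) -> Dict[str, float]:
        """One perceptron-like pass over numeric features only."""
        weights = {"similarity": 0.0, "past_interactions": 0.0}
        for ex in examples:
            label = 1.0 if ex["recommendation_accepted"] else -1.0
            score = sum(weights[k] * ex[k] for k in weights)
            if score * label <= 0:  # misranked -> nudge the weights
                for k in weights:
                    weights[k] += 0.1 * label * ex[k]
        return weights

    history = [
        {"similarity": 0.93, "past_interactions": 6, "recommendation_accepted": True},
        {"similarity": 0.21, "past_interactions": 0, "recommendation_accepted": False},
    ]
    print(train_ranker(history))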



yeah i absolutely agree that even classifiers can leak, and the autocorrect thing sounds like i was wrong about generative (it sounds like an n-gram setup?)... although they also say they don't train LLMs (what is an n-gram? still an LM, not large... i guess?)


This reminds me of a company called C3.ai which claims in its advertising to eliminate hallucinations using any LLM. OpenAI, Mistral, and others at the forefront of this field can't manage this, but a wrapper can?? Hmm...


I would expect this from a free service, but from a paid service with non-trivial cost... It seems insane... Maybe the whole model of doing business is broken...


So if you want to opt out, there's no setting to switch, you need to send an email with a specific subject:

> Contact us to opt out. [...] To opt out, please have your Org or Workspace Owners or Primary Owner contact our Customer Experience team at [email protected] with your Workspace/Org URL and the subject line “Slack Global model opt-out request.” [...]



This feels like a corporate greed play on what should be a relatively simple chat application. Slack has quickly become just another enterprise solution in search of shareholder value at the expense of data privacy. Regulation of these companies should be more apparent to people, but sadly, it is not.

I would recommend https://mattermost.com as an alternative.



Good thing we moved to Matrix already. I just hope they start putting more emphasis on Element X, whose message handling has been broken on iOS for weeks now.


Hm. Is this on iOS or Android, and what version? This is the first I've heard of this; it should be rock solid. I'm wondering if you're stuck on an ancient version or something.


I think I can count the number of crashes I've had with Element X iOS on the fingers of one hand since I started dogfooding it last summer (whereas classic Element iOS would crash every few days). Is this on Android or iOS? Was it a recent build? If so, which one? It really shouldn't be crashing.


> Contact us to opt out. If you want to exclude your Customer Data from Slack global models, you can opt out. To opt out, please have your org, workspace owners or primary owner contact our Customer Experience team at [email protected]

Sounds like an invitation for malicious compliance. Anyone can email them a huge text with the workspace buried somewhere, and they have to decipher it somehow.

Example [Answer is Org-12-Wp]:

"

FORMAL DIRECTIVE AND BINDING COVENANT

WHEREAS, the Parties to this Formal Directive and Binding Covenant, to wit: [Your Name] (hereinafter referred to as "Principal") and [AI Company Name] (hereinafter referred to as "Technological Partner"), wish to enter into a binding agreement regarding certain parameters for the training of an artificial intelligence system;

AND WHEREAS, the Principal maintains control and discretion over certain proprietary data repositories constituting segmented information habitats;

AND WHEREAS, the Principal desires to exempt one such segmented information habitat, namely the combined loci identified as "Org", the region denoted as "12", and the territory designated "Wp", from inclusion in the training data utilized by the Technological Partner for machine learning purposes;

NOW, THEREFORE, in consideration of the mutual covenants and promises contained herein, the receipt and sufficiency of which are hereby acknowledged, the Parties agree as follows:

DEFINITIONS

1.1 "Restricted Information Habitat" shall refer to the proprietary data repository identified by the Principal as the conjoined loci of "Org", the region "12", and the territory "Wp".

OBLIGATIONS OF TECHNOLOGICAL PARTNER

2.1 The Technological Partner shall implement all reasonably necessary technical and organizational measures to ensure that the Restricted Information Habitat, as defined herein, is excluded from any training data sets utilized for machine learning model development and/or refinement.

2.2 The Technological Partner shall maintain an auditable record of compliance with the provisions of this Formal Directive and Binding Covenant, said record being subject to inspection by the Principal upon reasonable notice.

REMEDIES

3.1 In the event of a material breach...

[Additional legalese]

IN WITNESS WHEREOF, the Parties have executed this Formal Directive and Binding Covenant."



> We offer Customers a choice around these practices. If you want to exclude your Customer Data from helping train Slack global models, you can opt out. If you opt out, Customer Data on your workspace will only be used to improve the experience on your own workspace and you will still enjoy all of the benefits of our globally trained AI/ML models without contributing to the underlying models.

Sick and tired of these default opt in explicit opt out legalese.

The default should be opt out.

Just stop using my data.



This is, once again, why I wanted us to go to self-hosted Mattermost instead of Slack. I recognize Slack is probably the better product (or mostly better), but you have to own your data.


In case this is helpful to anyone else, I opted out earlier today with an email to [email protected]

Subject: Slack Global Model opt-out request.

Body:

.slack.com

Please opt the above Slack Workspace out of training of Slack Global Models.



Make sure you put a period at the end of the subject line. Their quoted text includes a period at the end.

Please also scold them for behaving unethically and perhaps breaking the law.



Products that have search, autocomplete, etc… use rankers that are trained on System Metadata to build the core experience.

Microsoft Teams, Slack, etc… all do the same thing under the hood.

Nobody is pumping the text into LLM training. The examples make this very clear as well.

Comment section here is divorced from reality.



"We protect privacy while doing so by separating our model from Customer Data. We use external models (not trained on Slack messages) to evaluate topic similarity, outputting numerical scores. Our global model only makes recommendations based on these numerical scores and non-Customer Data."

I think this deserves more attention. For many tasks like contextual recommendations, you can get most of the way by using an off-the-shelf model, but then you get a floating-point output and need to translate it into a binary "show this to the user, yes or no?" decision. That could be a simple thresholding model "score > θ", but that single parameter still needs to be trained somehow.
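For example, even that single threshold is typically fit from interaction data along these lines (a toy sketch; the scores would come from the external model, the labels from user clicks):

    # Toy sketch: fitting the single threshold of a "score > theta" recommender
    # from (score, was_clicked) pairs. The only Customer-Data-derived inputs are
    # the numeric scores and the click labels.
    def fit_threshold(examples):
        """Pick the cutoff that maximizes accuracy on (score, clicked) pairs."""
        candidates = sorted({s for s, _ in examples})
        best_theta, best_acc = 0.0, -1.0
        for theta in candidates:
            acc = sum((s > theta) == clicked for s, clicked in examples) / len(examples)
            if acc > best_acc:
                best_theta, best_acc = theta, acc
        return best_theta

    history = [(0.91, True), (0.85, True), (0.40, False), (0.62, False), (0.77, True)]
    theta = fit_threshold(history)
    print(theta, [s > theta for s, _ in history])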

I wonder how many trainable parameters people objecting to Slack's training policy would be willing to accept.



It’s very difficult to ensure no data leakage, even for something like an Emoji prediction model. If you can try a large number of inputs and see what the suggested emoji is, that’s going to give you information about the training set. I wouldn’t be surprised to see a trading company or two pop up trying to exploit this to get insider information.


Whatever models are used, or whatever type of account data this operates on, this clause would be red-lined by most of the big customer accounts that have leverage during the sales/renewal process. Small and medium accounts will be supplying most of this data.


How does one technically opt-out after model training is completed? You can't exactly go into the model and "erase" parts of the corpus post-hoc.

Like, when you send an email to [email protected] with that perfect subject line (jeez, really?), what exactly does the customer support rep do on their end to opt you out?

Now is definitely the time to get/stay loud. If it dies down, the precedent has been set.



Not to be glib, but this is why we built Tonic Textual (www.tonic.ai/textual). It’s both very challenging and very important to protect data in training workflows. We designed Textual to make it easy to both redact sensitive data and replace it with contextually relevant synthetic data.


To add on to this: I think it should be mentioned that Slack says they'll prevent data leakage across workspaces in their model, but they don't explain how. They don't seem to go into any detail about their data safeguards or how they're excluding sensitive info from training. Textual is good for this purpose since it redacts PII, preventing it from being leaked by the trained model.

Disclaimer: I work at Tonic



How do you handle proprietary data being leaked? Sure you can easily detect and redact names and phone numbers and addresses, but without significant context it seems difficult to detect whether "11 spices - mix with 2 cups of white flour ... 2/3 teaspoons of salt, 1/2 teaspoons of thyme [...]" is just a normal public recipe or a trade secret kept closely guarded for 70 years


Fair question, but you have to consider the realistic alternatives. For most of our customers inaction isn't an option. The combination of NER models + synthesis LLMs actually handles these types of cases fairly well. I put your comment into our web app and this was the output:

How do you handle proprietary data being leaked? Sure you can easily detect and redact names and phone numbers and addresses, but without significant context it seems difficult to detect whether "17 spices - mix with 2lbs of white flour ... half teaspoon of salt, 1 tablespoon of thyme [...]" is just a normal public recipe or a trade secret kept closely guarded for 75 years.
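
To show roughly what that kind of number perturbation involves, here is a generic toy sketch (my own, not Tonic's implementation); a real deployment would layer NER for names, addresses, and so on on top of this:

    # Generic sketch: swap every number for a nearby but different value so that
    # leaked text loses its value as a trade secret. Not Tonic's implementation.
    import random
    import re

    def synthesize_numbers(text: str) -> str:
        def swap(match):
            value = int(match.group())
            return str(max(1, value + random.choice([-2, -1, 1, 2, 3])))
        return re.sub(r"\d+", swap, text)

    secret = "11 spices - mix with 2 cups of white flour, 2/3 teaspoons of salt"
    print(synthesize_numbers(secret))
    # e.g. "13 spices - mix with 4 cups of white flour, 1/5 teaspoons of salt"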



I wonder how many people who are really mad about these guys or SE using their professional output to train models thought commercial artists were just being whiny sore losers when DeviantArt, Adobe, OpenAI, Stability, et al. did it to them.


squarely in the former camp. there's something deeply abhorrent about creating a place that encourages people to share and build and collaborate, then turning around and using their creative output to put more money in shareholder pockets.

i deleted my reddit and github accounts when they decided the millions of dollars per month they're receiving from their users wasn't enough. don't have the power to move our shop off slack but rest assured many will as a result of this announcement.



Yeah, I haven't put a new codebase on GH in years. It's kind of a PITA hosting my own Gitea server for personal projects, but letting MS copy my work to help make my professional skill set less valuable is far less palatable.

Companies doing this would make me much less angry if they used an opt-in model only for future data. I didn't have a crystal ball and I don't have a time machine, so I simply can't stop these companies from using my work for their gain.



Compared to hosting other things? Nothing! It's great.

Hosting my own service rather than using a free SaaS solution that is entirely someone else's problem? There's a significant difference there. I've been running Linux servers either professionally or personally for almost 25 years, so it's not like it's a giant problem... but my work has been increasingly non-technical over the past 5 years or so, so even minor hiccups require re-acclimating myself to the requisite constructs and tools (wait, how do cron time patterns work? How do I test a variable in bash for this one-liner? How do iptables rules work again?)

It's not a deal breaker, but given the context, it's definitely not not a pain in the ass, either.



Thanks for elaborating! I'm a retired grunt and tech is just a hobby for me. I host my own Gitea with the same challenges, but to me looking up cron patterns etc. is the norm, not the exception, so I don't think much about it.


“But look, you found the notice, didn’t you?”

“Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard’.”



So much for "if you are not paying, you are the product". There is nothing that can stop companies from using your sweet, sweet data once you give it to them.


Wow. I understand business models that are freemium but for a premium priced B2B product? This feels like an incredible rug pull. This changes things for me.


> To develop AI/ML models, our systems analyse Customer Data (e.g. messages, content and files) submitted to Slack

This Is Fine.



While Slack emphasizes that customers own their data, the default of Customer Data being used to train AI/ML models (even if aggregated and disassociated) may not align with all customers' expectations of data ownership and control.


> we do not build or train these models in such a way that they could learn, memorise, or be able to reproduce some part of Customer Data

They don't "build" them this way (whatever that means) but if training data is somehow leaked, they're off the hook because they didn't build it that way?



This is as systemically concerning as the data practices seen on Discord with integrations like statbot.net, though at least Slack is being transparent about it. Regardless, I find all of this highly problematic.


Discord's TOS used to say "we may sell all your convos, including your private ones". Then some time later, they suddenly changed it to "noooo, we would never sell aaanything", and didn't even update the "last changed" date. I deleted my Discord account and stopped using them immediately after I noticed the TOS, but them sneakily trying to cover it up later completely ruined any lingering trust I might have had in them.

And this is just one of many, many problems associated with the platform.



It seems like we've entered an era where not only are you paying for software with money, you're also paying for software with your data, privacy implications be damned. I would love to see people picking f/oss instead.


Problems with f/oss for business applications:

1. Great UX folks almost never work for free. So the UX of nearly all OSS is awful.

2. Great software comes from a close connection to users. That works fine when your software is an OS kernel and your users are programmers, but how many OSS folks want to spend their free time on Zoom talking to hundreds of businesses and understanding their needs, just so they can give them free software?

See also: year of Linux desktop



The good news for FOSS is that the UX of most commercial software is also awful and generally getting worse. The bad news is that FOSS software is copying a lot of the same UX trends.


> Contact us to opt out. If you want to exclude your Customer Data from Slack global models, you can opt out. To opt out, please have your Org or Workspace Owners or Primary Owner contact our Customer Experience team at [email protected] with your Workspace/Org URL and the subject line “Slack Global model opt-out request.” We will process your request and respond once the opt out has been completed.

This is not ok. We didn't have to reach out by email to sign up, this should be a toggle in the UI. This is deliberately high friction.



Even if Salesforce has the purest intentions of following policy, your data is still at risk.

In real life, policies have to be enforced, and it's not always technically feasible to do so. It doesn’t even have to be calculated or malicious!



If you send the opt-out message to Slack, take a second and include [email protected] as well. It helps to get it done faster in most cases.