Study exposes privacy risks of AI chatbot conversations


Major AI companies are training their models on users' chatbot conversations, raising significant privacy concerns and underscoring the need for more transparent policies.

Last month, Anthropic made a quiet change to its terms of service for customers: Conversations you have with its AI chatbot, Claude, will be used for training its large language model by default, unless you opt out.

Anthropic is not alone in adopting this policy. A recent study of frontier developers’ privacy policies found that six leading U.S. companies feed user inputs back into their models to improve capabilities and win market share. Some give consumers the choice to opt out, while others do not.

Given this trend, should users of AI-powered chat systems worry about their privacy? “Absolutely yes,” says Jennifer King, privacy and data policy fellow at the Stanford Institute for Human-Centered AI, and lead author of the study posted to the arXiv preprint server.

“If you share sensitive information in a dialog with ChatGPT, Gemini, or other frontier models, it may be collected and used for training, even if it’s in a separate file that you uploaded during the conversation.”

King and her team of Stanford scholars examined AI developers’ privacy policies and identified several causes for concern, including long data retention periods, training on children’s data, and a general lack of transparency and accountability in developers’ privacy practices. In light of these findings, consumers should think twice about the information they share in AI chat conversations and, whenever possible, affirmatively opt out of having their data used for training.

The history of privacy policies

As a communication tool, the internet-era privacy policy that’s now being applied to AI chats is deeply flawed. Typically written in convoluted legal language, these documents are difficult for consumers to read and understand. Yet, we have to agree to them if we want to visit websites, query search engines, and interact with large language models (LLMs).

In the last five years, AI developers have been scraping massive amounts of information from the public internet to train their models, a process that can inadvertently pull personal information into their datasets.

“We have hundreds of millions of people interacting with AI chatbots, which are collecting personal data for training, and almost no research has been conducted to examine the privacy practices for these emerging tools,” King explains.

In the United States, she adds, privacy protections for personal data collected by or shared with LLM developers are complicated by a patchwork of state-level laws and a lack of federal regulation.

In an effort to help close this research gap, the Stanford team compared the privacy policies of six U.S. companies: Amazon (Nova), Anthropic (Claude), Google (Gemini), Meta (Meta AI), Microsoft (Copilot), and OpenAI (ChatGPT). They analyzed a web of documents for each LLM, including its published privacy policies, linked subpolicies, and associated FAQs and guidance accessible from the chat interfaces, for a total of 28 lengthy documents.

To evaluate these policies, the researchers used a methodology based on the California Consumer Privacy Act, as it is the most comprehensive privacy law in the United States, and all six frontier developers are required to comply with it. For each company, the researchers analyzed language in the documentation to discern how the stated policies address three questions:

  1. Are user inputs to chatbots used to train or improve LLMs?
  2. What sources and categories of personal consumer data are collected, stored, and processed to train or improve LLMs?
  3. What are the users’ options for opting into or out of having their chats used for training?

Blurred boundaries

The scholars found that all six companies use users' chat data by default to train their models, and some developers keep this information in their systems indefinitely. Some, but not all, of the companies state that they de-identify personal information before using it for training purposes. And some developers allow humans to review users' chat transcripts for model training purposes.

In the case of multiproduct companies, such as Google, Meta, Microsoft, and Amazon, user interactions also routinely get merged with information gleaned from other products consumers use on those platforms—search queries, sales/purchases, social media engagement, and the like.

These practices can become problematic when, for example, users share personal biometric and health data without considering the implications. Here's a realistic scenario: Imagine asking an LLM for dinner ideas. Maybe you specify that you want low-sugar or heart-friendly recipes. The chatbot can draw inferences from that input, and the underlying algorithm may classify you as a health-vulnerable individual.

“This determination drips its way through the developer’s ecosystem. You start seeing ads for medications, and it’s easy to see how this information could end up in the hands of an insurance company. The effects cascade over time,” King explains.

Another red flag the researchers discovered concerns the privacy of children: Developers' practices vary in this regard, but most are not taking steps to remove children's input from their data collection and model training processes. Google announced earlier this year that it would train its models on data from teenagers who opt in.

By contrast, Anthropic says it does not collect children's data or allow users under the age of 18 to create accounts, although it does not require age verification. And Microsoft says it collects data from children under 18 but does not use it to build language models. All of these practices raise consent issues, as children cannot legally consent to the collection and use of their data.

Privacy-preserving AI

Across the board, the Stanford scholars observed that developers’ privacy policies lack essential information about their practices. They recommend policymakers and developers address data privacy challenges posed by LLM-powered chatbots through comprehensive federal privacy regulation, affirmative opt-in for model training, and filtering personal information from chat inputs by default.
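The last of those recommendations, filtering personal information from chat inputs, can be pictured with a minimal sketch. The example below is not drawn from the study; it assumes a hypothetical client-side pass that uses simple regular expressions to redact obvious identifiers (emails, phone numbers, Social Security numbers) before a prompt is ever sent to a chatbot provider.

```python
import re

# Illustrative sketch only (not from the study): a hypothetical client-side
# redaction pass that strips obvious identifiers from a prompt before it is
# sent to any chatbot provider. Real personal-information filtering is far
# broader than these few patterns.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(prompt: str) -> str:
    """Replace any matched identifier with a placeholder tag."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label} REDACTED]", prompt)
    return prompt

if __name__ == "__main__":
    raw = "Email me at jane.doe@example.com or call 555-867-5309 about my A1C results."
    print(redact(raw))
    # Email me at [EMAIL REDACTED] or call [PHONE REDACTED] about my A1C results.
```

A production-grade filter would need far broader coverage (names, addresses, health details and the like), which is one reason a default, developer-side safeguard is more realistic than expecting each user to self-censor.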

“As a society, we need to weigh whether the potential gains in AI capabilities from training on chat data are worth the considerable loss of consumer privacy. And we need to promote innovation in privacy-preserving AI, so that user privacy isn’t an afterthought,” King concludes.

More information:
Jennifer King et al, User Privacy and Large Language Models: An Analysis of Frontier Developers’ Privacy Policies, arXiv (2025). DOI: 10.48550/arxiv.2509.05382

Provided by
Stanford University


Citation:
Study exposes privacy risks of AI chatbot conversations (2025, October 17)
retrieved 17 October 2025
from https://techxplore.com/news/2025-10-exposes-privacy-ai-chatbot-conversations.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.