The End of Online Anonymity: How AI and Data Brokers Could Unmask Millions of Internet Users at Scale

For decades, the implicit bargain of the internet has been that pseudonymity — posting under a screen name, browsing behind a VPN, compartmentalizing identities across platforms — offered a reasonable shield against identification. That assumption is now under direct threat. A recent technical analysis by AI safety researcher Simon Lermen lays out in granular detail how current artificial intelligence capabilities, combined with the vast commercial data broker industry, could enable the large-scale deanonymization of online users at costs that are startlingly low.
The implications extend far beyond academic concern. Journalists protecting sources, political dissidents operating under authoritarian regimes, whistleblowers, domestic abuse survivors, and ordinary citizens who simply prefer not to have their Reddit posts linked to their real names all face a new calculus of risk. The technical barriers that once made mass deanonymization impractical are eroding rapidly, and the policy infrastructure to address this shift barely exists.
A Blueprint for Unmasking the Internet
In his detailed Substack post titled “Large-Scale Online Deanonymization,” Lermen outlines a multi-step pipeline that could theoretically link anonymous online accounts to real-world identities. The process begins with what he calls “seed identities” — cases where a person’s real name is already loosely connected to an online handle through some publicly available information. From these seeds, an attacker can use large language models (LLMs) to analyze writing style, posting patterns, topic interests, and temporal activity to expand the web of identified accounts.
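The seed-and-expand loop can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than Lermen's actual implementation: `score_match` stands in for an LLM-based stylometric comparison, which this stub approximates with crude token overlap so the sketch runs end to end.

```python
# Toy sketch of the seed-and-expand pipeline. `score_match` is a stand-in
# for an LLM stylometric comparison; here it just measures token overlap.

def score_match(known_texts, candidate_text):
    """Toy similarity: fraction of candidate tokens seen in the known texts."""
    known_vocab = set(tok for text in known_texts for tok in text.lower().split())
    cand = candidate_text.lower().split()
    if not cand:
        return 0.0
    return sum(tok in known_vocab for tok in cand) / len(cand)

def expand_identities(seeds, anonymous_posts, threshold=0.6):
    """seeds: {real_name: [known writing samples]}
    anonymous_posts: {handle: post_text}
    Returns probabilistic links {handle: (real_name, score)}."""
    links = {}
    for handle, post in anonymous_posts.items():
        best = max(
            ((name, score_match(texts, post)) for name, texts in seeds.items()),
            key=lambda pair: pair[1],
        )
        if best[1] >= threshold:
            links[handle] = best
    return links
```

Each newly linked account could then be fed back in as a fresh seed, which is what makes the web of identified accounts grow rather than stay fixed.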
The approach is not purely theoretical. Lermen references existing research on stylometry — the statistical analysis of writing style — which has shown that even short text samples can be matched to authors with surprising accuracy when AI models are applied. Modern LLMs dramatically reduce the cost and expertise required to perform this kind of analysis. What once demanded a team of computational linguists and custom software can now be accomplished with API calls to commercially available AI systems.
The Data Broker Dimension
Writing style analysis alone would be insufficient for mass deanonymization. The real force multiplier, as Lermen explains, is the commercial data broker industry. Companies like Acxiom, LexisNexis, and dozens of smaller firms aggregate enormous quantities of personal data — purchasing histories, location data from mobile apps, voter registration records, property records, social media activity, and much more. This data is legally bought and sold in the United States with minimal regulatory oversight.
When AI-driven stylometric analysis produces a probabilistic match between an anonymous account and a real person, data broker records can serve as a confirmation layer. If an anonymous Reddit user frequently posts about living in a specific neighborhood, working in a particular industry, and owning a certain breed of dog, those details can be cross-referenced against data broker profiles to narrow candidates dramatically. Lermen estimates that the cost of running such a pipeline at scale — potentially deanonymizing millions of accounts — could be as low as a few dollars per identity, making it accessible not just to nation-states but to corporations, stalkers, and political operatives.
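The confirmation layer amounts to intersecting inferred attributes against a broker table. A minimal sketch, with an invented schema and invented records standing in for a real broker dump:

```python
# Mock confirmation layer: filter broker records down to those consistent
# with every attribute gleaned from an anonymous account's posts.

def narrow_candidates(inferred_attrs, broker_records):
    """inferred_attrs: {attribute: value} gleaned from posts.
    broker_records: list of per-person dicts from a (mock) broker dump.
    Returns only records consistent with every inferred attribute."""
    return [
        rec for rec in broker_records
        if all(rec.get(k) == v for k, v in inferred_attrs.items())
    ]

broker_records = [
    {"name": "A. Smith", "neighborhood": "Capitol Hill", "industry": "nursing",  "dog": "corgi"},
    {"name": "B. Jones", "neighborhood": "Capitol Hill", "industry": "software", "dog": "corgi"},
    {"name": "C. Lee",   "neighborhood": "Ballard",      "industry": "software", "dog": "beagle"},
]

# Three casually shared details collapse the candidate pool to one record.
matches = narrow_candidates(
    {"neighborhood": "Capitol Hill", "industry": "software", "dog": "corgi"},
    broker_records,
)
```

Against a real broker database with thousands of attributes per person, far fewer than three details are often enough to isolate a single individual.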
The Economics of Exposure
Perhaps the most alarming aspect of Lermen’s analysis is the cost structure. He breaks down the expenses involved: API costs for running LLM queries against text corpora, data broker access fees, and computational overhead for matching algorithms. The numbers suggest that a well-funded operation — a political campaign, a corporate intelligence firm, a foreign intelligence service — could deanonymize large populations of online users on budgets well within their reach. A campaign to identify thousands of anonymous critics on social media might cost less than a single television advertisement.
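The arithmetic behind that claim is simple enough to sketch. Every number below is an illustrative placeholder, not a figure from Lermen's post; the point is the shape of the calculation, not the values.

```python
# Back-of-envelope cost model for the pipeline economics. All inputs are
# illustrative placeholders, not figures from Lermen's analysis.

def cost_per_identity(llm_queries, cost_per_query, broker_lookup_fee, compute_overhead):
    """Rough marginal cost of linking one anonymous account to one person."""
    return llm_queries * cost_per_query + broker_lookup_fee + compute_overhead

# Hypothetical run: 20 LLM comparisons at $0.01 each, one $1.50 broker
# lookup, and $0.05 of compute gives $1.75 per identity.
unit_cost = cost_per_identity(20, 0.01, 1.50, 0.05)

# Scaling to 10,000 anonymous accounts stays in ad-buy territory.
campaign_cost = 10_000 * unit_cost
```

Even if these placeholder inputs are off by an order of magnitude, the total remains trivial for the kinds of actors the article names.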
This economic accessibility represents a fundamental change. Previous deanonymization efforts, such as those conducted by law enforcement agencies seeking to identify users of dark web marketplaces, required significant institutional resources and often court orders. The pipeline Lermen describes operates entirely within the bounds of commercially available tools and legally purchasable data. No hacking is required. No warrants are needed. The information is simply assembled from sources that are already, in one form or another, available for purchase.
Stylometry Meets Modern AI
The academic field of stylometry has a long history, stretching back to efforts to identify the anonymous authors of the Federalist Papers. But the application of modern transformer-based language models to this problem has dramatically changed what is possible. Research has demonstrated that GPT-class models can distinguish between authors based on relatively small writing samples, picking up on patterns in punctuation, sentence structure, vocabulary choice, and even the frequency of certain function words that humans would never consciously notice.
Lermen points out that most people maintain remarkably consistent writing habits across platforms, even when they believe they are disguising their identity. The cadence of someone’s Reddit comments often mirrors their Twitter posts, their blog entries, or even their work emails. When an LLM is given samples from a known identity and asked to score anonymous texts for similarity, the results can be disturbingly accurate. Combined with metadata — posting times that correlate with a specific time zone, references to local events or weather, mentions of workplace details — the stylometric signal becomes very strong.
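Why function-word frequencies carry so much signal can be shown with a classical-stylometry sketch: build a relative-frequency profile over common function words and compare profiles with cosine similarity. Real LLM-based attacks are far stronger than this; the word list and texts here are illustrative only.

```python
# Minimal classical stylometry: function-word frequency profiles compared
# with cosine similarity. LLM-based methods pick up much richer signals.
import math

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "but"]

def profile(text):
    """Relative frequency of each function word in the text."""
    toks = text.lower().split()
    total = max(len(toks), 1)
    return [toks.count(w) / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two frequency vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)
```

Because function words are used unconsciously, two samples by the same author tend to score far closer than samples by different authors, even when topics differ completely.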
Who Is Most Vulnerable?
The populations most at risk from large-scale deanonymization are precisely those who most depend on anonymity. Political dissidents in countries like Iran, China, and Russia often use pseudonymous social media accounts to criticize their governments. Whistleblowers in corporate or government settings may post anonymously to forums or contact journalists through channels they believe are secure. Survivors of domestic violence may maintain social media presences under assumed names to avoid being found by abusers.
Even in democratic societies, the consequences of deanonymization can be severe. Anonymous posters on forums discussing addiction, mental health, sexuality, or controversial political views face potential social and professional repercussions if their identities are revealed. Lermen’s analysis suggests that the technical capacity to conduct these exposures at scale already exists and is only becoming cheaper and more accessible with each new generation of AI models. The question is not whether this will happen, but how broadly and by whom.
The Regulatory Vacuum
Current privacy regulations in the United States offer little protection against the kind of deanonymization Lermen describes. The data broker industry operates largely without federal oversight. While the European Union’s General Data Protection Regulation (GDPR) provides stronger protections for personal data, enforcement against cross-border AI-driven analysis remains challenging. There is no U.S. federal law that specifically prohibits the act of linking an anonymous online identity to a real person using commercially available data and AI tools.
Some states have taken partial steps. The California Consumer Privacy Act (CCPA) gives residents the right to request deletion of their personal data from broker databases, but compliance is inconsistent and the process is burdensome. Vermont requires data brokers to register with the state, providing at least some transparency about the industry’s scope. But these measures are patchwork solutions to a problem that is national and international in scale. Legislative proposals for a comprehensive federal privacy law have stalled repeatedly in Congress, leaving a gap that grows more consequential as AI capabilities advance.
Countermeasures and Their Limits
Lermen discusses several potential countermeasures that individuals and platforms might employ. On the individual level, users can attempt to vary their writing style across platforms, use tools that paraphrase or restyle their text before posting, and minimize the personal details they share. However, research suggests that even deliberate attempts to disguise writing style are often insufficient to defeat AI-based stylometric analysis, particularly when the attacker has access to large corpora of the target’s known writing.
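The "minimize personal details" advice can be partly automated with pre-posting hygiene, sketched below with a few illustrative regex patterns. Note the limit the paragraph above describes: this scrubs only explicitly identifying strings and does nothing about the stylometric signal itself.

```python
# Pre-posting hygiene sketch: redact explicitly identifying strings before
# publishing. Patterns are illustrative, not exhaustive, and this does not
# touch the writing-style fingerprint.
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[email]"),   # email addresses
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[phone]"),  # US-style phone numbers
    (re.compile(r"@\w+"), "[handle]"),                     # social media handles
]

def scrub(text):
    """Replace each matched identifier with a neutral placeholder."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text
```

A tool like this removes the low-hanging fruit an attacker would cross-reference, but as the research cited above suggests, the cadence of the surrounding prose still gives the author away.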
Platform-level defenses could include stripping metadata from posts, introducing random delays in posting timestamps, or offering built-in text anonymization tools. But these measures impose costs on user experience and platform functionality, and few companies have shown willingness to implement them. The advertising-driven business model of most social media platforms is, in fact, fundamentally aligned with data collection rather than data protection. Platforms profit from knowing as much as possible about their users, which creates structural resistance to the kind of aggressive anonymization that would be needed to counter the threat Lermen describes.
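The timestamp defense mentioned above is straightforward to sketch: coarsen a post's displayed time to a bucket and add a bounded random publishing delay, so posting times no longer leak a precise time-zone and activity fingerprint. The bucket size and jitter range here are illustrative choices, not any platform's actual policy.

```python
# Platform-side timestamp fuzzing sketch: bounded random delay plus
# rounding down to a coarse display bucket. Parameters are illustrative.
import random

BUCKET_SECONDS = 3600  # display times rounded down to the hour

def fuzz_timestamp(epoch_seconds, max_jitter=900, rng=random):
    """Return a coarsened, jittered display timestamp (epoch seconds)."""
    delayed = epoch_seconds + rng.randint(0, max_jitter)  # random publish delay
    return (delayed // BUCKET_SECONDS) * BUCKET_SECONDS   # floor to the bucket
```

The user-experience cost is visible immediately: replies appear out of order and "posted 3 minutes ago" becomes impossible, which is part of why platforms resist such measures.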
A Turning Point for Digital Privacy
The analysis presented by Lermen on his Substack arrives at a moment when public concern about AI and privacy is intensifying but legislative action remains stalled. The convergence of powerful language models, cheap computational resources, and a largely unregulated data broker industry creates conditions under which mass deanonymization is not a hypothetical future risk but a present technical capability. The only barriers are organizational — someone has to decide to build and deploy the pipeline.
For industry insiders, the implications are significant. Companies that promise anonymity to their users — from social media platforms to healthcare forums to anonymous workplace review sites — may find those promises increasingly difficult to keep, not because of any failure on their part, but because the ambient data environment has made anonymity structurally fragile. The question facing policymakers, technologists, and civil society is whether the right to online anonymity will be actively defended through regulation and technical innovation, or whether it will quietly erode until it exists only as a comforting fiction. The technical research suggests the clock is already running.