How AI Search Engines Build Profiles About You

For two decades, managing digital privacy meant managing search engine results. If an executive wanted to reduce their footprint, the objective was straightforward: push sensitive URLs off the first page of Google.

That approach assumed a human searcher scrolling through ten blue links. It no longer holds.

Agentic AI search engines — Perplexity, ChatGPT with web access, enterprise research assistants — do not present a list of pages to click. They synthesise answers. They pull data from multiple sources in real time, cross-reference it, and deliver a single narrative. The UK Information Commissioner’s Office flagged this shift in its March 2026 guidance on agentic AI, warning that “significant and growing quantities” of personal data are now processed through these systems. The corporate implications of this shift are covered in our Corporate Digital Footprint hub.

The question is no longer which page ranks first. The question is what the AI can assemble from everything it reaches.

What AI Search Engines Already Know

When an agentic AI receives a query about an executive, it uses retrieval-augmented generation (RAG): it retrieves live sources at query time and combines them with what the model absorbed during training. The result is not a list of links. It is a constructed profile.

Researchers at ETH Zurich demonstrated how far this goes. In their paper “Beyond Memorization: Violating Privacy via Inference with Large Language Models”, they tested nine LLMs against real Reddit profiles and found that GPT-4 could infer personal attributes (location, income bracket, age, marital status) with up to 85% top-1 accuracy. The models did not retrieve this information from a database. They inferred it from patterns in ordinary text.

Now apply that capability to an executive whose name appears across corporate filings, property registries, conference programmes, people-search profiles, and breach databases. The AI does not need any single source to contain a full profile. It builds one by connecting fragments:

Property and geospatial data. Residential addresses from land registries, property valuations from municipal records, historical addresses from electoral rolls or utility filings.
Corporate and financial footprint. Directorships across jurisdictions, patent registrations, political donations, public contract awards. In the Netherlands, the KVK (Kamer van Koophandel) publishes director names and registered business addresses by default.
The shadow footprint. Data published not by the executive but by their network. A spouse’s public Instagram showing holiday locations. A charity board’s donor acknowledgement list. A child’s sports club roster. None of this requires the executive to have a social media account.
Breach data. Email-password pairs from corporate breaches a decade old, linked to current accounts through credential reuse patterns. An LLM-driven agent can connect a breached email from 2015 to an active LinkedIn profile in seconds. A human analyst would need hours.

This is the mosaic effect operating at machine speed.
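The linking step can be sketched in a few lines of Python. This is a minimal illustration, not a description of how any particular AI system works: every record, field name and source label below is invented, and the matching rule (exact match on name, email or address) is an assumption that is far cruder than what a language model can do with fuzzy context.

```python
from collections import defaultdict

# Hypothetical fragments. Every value here is invented for illustration.
FRAGMENTS = [
    {"source": "company_registry", "name": "J. Example", "address": "1 Sample Street"},
    {"source": "conference_pdf", "name": "J. Example", "email": "j@example.com", "phone": "+00 000 0000"},
    {"source": "breach_dump", "email": "J@Example.com", "username": "jex2015"},
    {"source": "sports_club_roster", "address": "1 Sample Street", "family_member": "child"},
]

# Fields treated as identifiers that can tie two fragments together (an assumption).
LINK_KEYS = ("name", "email", "address")


def build_profiles(fragments):
    """Group fragments that share any identifier, then merge each group."""
    parent = list(range(len(fragments)))  # union-find over fragment indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    first_seen = {}  # (field, normalised value) -> index of first fragment holding it
    for idx, fragment in enumerate(fragments):
        for key in LINK_KEYS:
            value = fragment.get(key)
            if value is None:
                continue
            token = (key, value.strip().lower())
            if token in first_seen:
                parent[find(idx)] = find(first_seen[token])  # link the two groups
            else:
                first_seen[token] = idx

    groups = defaultdict(list)
    for idx, fragment in enumerate(fragments):
        groups[find(idx)].append(fragment)

    profiles = []
    for members in groups.values():
        profile = {"sources": []}
        for fragment in members:
            profile["sources"].append(fragment["source"])
            for key, value in fragment.items():
                if key != "source" and value is not None:
                    profile.setdefault(key, value)  # keep the first value seen
        profiles.append(profile)
    return profiles


if __name__ == "__main__":
    for profile in build_profiles(FRAGMENTS):
        print(profile)
```

No single fragment contains a name, address, phone number, username and family detail together. The merged profile does, because each fragment shares one identifier with at least one other.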

Where AI Gets Its Information

To reduce what an AI can assemble, you need to understand where it looks. LLMs do not generate personal data from nothing. They pull it from specific, often overlooked repositories.

Training datasets and web archives

Foundation models are trained on massive text corpora. The Common Crawl archive, the largest open web dataset and a primary training source for most major LLMs, contains over 300 billion captured web pages. If an executive’s personal phone number appeared in a conference speaker PDF in 2018, and that PDF was captured in a crawl before being removed from the live web, the information may persist indefinitely in any model trained on that snapshot.

This is not a theoretical risk. It is a documented property of how these models work. Deleting a web page does not delete the data from a model trained on a snapshot of that page.

Data broker APIs and syndication networks

Agentic search engines query live data sources. Data brokers expose structured records through APIs and directory listings that AI agents can parse directly — they read the underlying JSON metadata, not the visual website.
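To make “reading the metadata, not the visual website” concrete, here is a minimal sketch using only the Python standard library. It assumes the listing embeds its record as a JSON-LD script tag, which is one common convention for structured metadata; the sample page and every value in it are invented, and real broker pages may use other formats entirely.

```python
import json
from html.parser import HTMLParser

# Invented sample page. The visible body shows almost nothing,
# while the embedded metadata carries the full record.
SAMPLE_PAGE = """
<html>
  <head>
    <script type="application/ld+json">
      {"@type": "Person", "name": "J. Example",
       "address": {"streetAddress": "1 Sample Street"}}
    </script>
  </head>
  <body><h1>Profile</h1><p>Sign in to view details.</p></body>
</html>
"""


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every JSON-LD script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.records.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed metadata rather than fail


def extract_structured_records(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


if __name__ == "__main__":
    print(extract_structured_records(SAMPLE_PAGE))
```

The point of the sketch is that a login wall or a sparse visual layout offers no protection if the structured record is still present in the page source.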

The EDPB’s March 2026 market study on data brokers, conducted through its Support Pool of Experts programme, identified eight distinct categories of organisations operating as data brokers or performing related data provisioning functions in the EU alone — from traditional personal data brokers to AI platforms integrating personal data and data marketplaces. These are the upstream nodes feeding agentic search results.

Public registries and passive infrastructure

Corporate registries, land records, and court filings are prime targets for AI ingestion. Beyond these, AI agents parse passive DNS databases and historical WHOIS records. An executive who registered a personal domain ten years ago using a home address and private email created a permanent link between their identity and that infrastructure — even if the domain has since expired.

Certificate Transparency logs, which publicly record nearly every certificate issued by a publicly trusted certificate authority, provide another vector. A certificate issued for a staging subdomain of a personal project can reveal hostnames and hosting infrastructure and, in some cases, email addresses and organisational relationships.
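As a rough illustration of how little effort this takes, the sketch below collects hostnames from certificate log search results. It works on invented sample entries rather than a live query, and the `name_value` field (a newline-separated list of names per certificate) is an assumption modelled on the JSON output of public log search services; check the actual schema of whichever service you use.

```python
# Invented sample entries. The "name_value" field name is an assumption.
SAMPLE_CT_ENTRIES = [
    {"name_value": "example.com\nwww.example.com"},
    {"name_value": "staging.example.com"},
    {"name_value": "*.example.com"},
]


def hostnames_from_ct_entries(entries):
    """Return the sorted, de-duplicated hostnames named in the entries."""
    names = set()
    for entry in entries:
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # treat a wildcard as its parent domain
            if name:
                names.add(name)
    return sorted(names)


if __name__ == "__main__":
    for hostname in hostnames_from_ct_entries(SAMPLE_CT_ENTRIES):
        print(hostname)
```

A staging or test hostname that was never linked from any public page still appears in the output, because the certificate itself was logged.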

For a self-diagnostic view of which of these surfaces is currently exposed in your own case, our Executive Exposure Checklist walks through ten reconnaissance categories with a per-row weighting.

Why Privacy Is Now a Supply-Chain Problem

Standard privacy advice does not address this architecture. Making an Instagram account private or deleting a LinkedIn profile reduces surface-level exposure, but the structural data sources — brokers, registries, training datasets, archived pages — remain untouched. The AI still has material to work with.

Automated data removal tools face a similar limitation. As the UC Irvine study of 543 California-registered data brokers found, 40% of brokers simply failed to respond to opt-out requests. Among those that did respond, many deployed friction mechanisms — CAPTCHAs, demands for additional personal information, broken submission forms — that automated scripts cannot navigate reliably. We covered this in more detail in our analysis of automated removal limits.

The structural approach requires mapping which sources the AI actually reaches, tracing exposed data points back to their origin, and removing records at the primary aggregator level — not just the downstream nodes that repopulate within weeks.
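The tracing step can be pictured as walking a small graph. The sketch below is a simplified illustration under stated assumptions: the node names and the `FEEDS` map are hypothetical, each listing is assumed to have exactly one upstream supplier, and in practice establishing who feeds whom is the hard investigative part that this code takes as given.

```python
from collections import Counter

# Hypothetical supply chain: each key republishes data obtained from its value.
FEEDS = {
    "people_search_a": "reseller_y",
    "reseller_y": "aggregator_x",
    "people_search_b": "aggregator_x",
    "directory_c": "aggregator_z",
}


def trace_to_origin(node, feeds):
    """Follow upstream links until reaching a node with no known supplier."""
    visited = set()
    while node in feeds and node not in visited:  # the visited set guards against cycles
        visited.add(node)
        node = feeds[node]
    return node


def removal_priorities(exposures, feeds):
    """Rank origins by how many observed exposures trace back to them."""
    origins = Counter(trace_to_origin(node, feeds) for node in exposures)
    return origins.most_common()


if __name__ == "__main__":
    observed = ["people_search_a", "people_search_b", "directory_c"]
    for origin, count in removal_priorities(observed, FEEDS):
        print(f"{origin}: feeds {count} observed listing(s)")
```

In this invented example, two of the three visible listings trace back to one aggregator, so a single upstream removal addresses both, whereas removing the two downstream listings would leave the source in place to repopulate them.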

This is what a Mirror audit is built to do: adversarial reconnaissance against the same data infrastructure that AI agents query, followed by source-level documentation of where each exposure originates. The output is not a list of Google results. It is a map of the data supply chain feeding the models.

What Changes in an AI-First World

In 2026, privacy management is infrastructure work. The shift from search engines to agentic AI means that hiding a single web page accomplishes very little when the AI draws from dozens of sources simultaneously.

The executives who are hardest to profile are not the ones with no internet presence. They are the ones whose data has been methodically removed from the structural sources — brokers, registries, archives — that the models depend on. When those upstream sources return nothing, the AI’s retrieval mechanism fails. The profile thins. The citations break.

For anyone responsible for executive security, the implication is direct: if your threat model still assumes a human adversary typing queries into Google, it is already outdated.