Most people imagine their digital traces as the sum of what they posted: the photos, the comments, the profile, the account they deleted. Remove the content, the thinking goes, and the record shrinks.
It does not.
The more durable record was built by other people, by automated systems in the background of apps you never installed, by institutions whose job is to write things down, and increasingly by AI assistants that log every prompt you send. You did not author most of this. In most cases you did not consent to it.
This article walks through five structural layers of that record, then adds a sixth that most inventories miss: what happens when you type a question into ChatGPT, Claude, Gemini, or Grok. The earlier layers sit underneath the visible footprint covered in our piece on the assembled profile. The AI layer is newer and almost entirely absent from published guides.
Layer 1: Contact-graph harvesting
Truecaller’s database contains mappings for more than 500 million users globally, and substantially more phone numbers than that. Most of those mappings come from one mechanism: when a user installs the app and grants contact-list access, the app uploads their address book. If your number is in someone’s phone with your name attached, your number ends up in Truecaller’s database with your name attached — regardless of whether you have ever heard of the app.
A 2019 audit of the call-blocking app category by Avast estimated that the major apps in the segment held approximately three billion phone numbers between them. Truecaller does not upload the contact books of European users, presumably to avoid GDPR liability, but does so in most of the rest of the world.
The mechanism illustrates a structural problem in the consent model underlying modern data collection. Your contact can consent on your behalf to the inclusion of your name and number in a third-party database. Once your entry exists, you are in the search interface as soon as anyone with the app types your number.
Facebook and WhatsApp run versions of the same loop. The “find your friends on Facebook” flow, present in some form for over a decade, harvests contact lists of users who agree to it. Where those uploads include data about non-users, the result is a shadow profile: a record assembled about someone who has never registered. The legal status of this practice has been challenged repeatedly under European data-protection law and remains unresolved.
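The contact-upload loop described in this section can be sketched in a few lines. This is a toy model, not any vendor's actual implementation: the function names, the directory structure, and the sample numbers and labels are all hypothetical. It shows how a person who never installed the app still ends up searchable once a few contacts upload their address books.

```python
from collections import defaultdict


def ingest_address_book(directory, uploader, contacts):
    """Merge one user's uploaded address book into a shared directory.

    directory maps phone number -> {label: set of uploaders who used that label}.
    The owner of each number is never asked; only the uploader consents.
    """
    for number, label in contacts:
        directory[number][label].add(uploader)


def lookup(directory, number):
    """Return the label most uploaders used for a number, or None if unknown."""
    labels = directory.get(number)
    if not labels:
        return None
    return max(labels, key=lambda label: len(labels[label]))


# Hypothetical data: three app users, one non-user whose number they all hold.
directory = defaultdict(lambda: defaultdict(set))
ingest_address_book(directory, "alice", [("+15550100", "Sam Doe"), ("+15550101", "Mum")])
ingest_address_book(directory, "bob", [("+15550100", "Sam Doe"), ("+15550102", "Plumber")])
ingest_address_book(directory, "carol", [("+15550100", "Sammy")])

print(lookup(directory, "+15550100"))  # prints "Sam Doe"
```

The owner of +15550100 appears nowhere as an uploader, yet the lookup resolves their number to a name.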
Layer 2: Location ad-tracking
Anyone who has used an ad-supported mobile app has had their precise location sold. The infrastructure works as follows: the app embeds an advertising SDK, the SDK reads device location and a unique advertising identifier, and the resulting location-plus-identifier stream is sold to data brokers who aggregate movements across hundreds of millions of devices.
This was an industry-wide practice with almost no oversight until early 2024. In April 2024 the US Federal Trade Commission finalised an order against X-Mode Social and its successor company Outlogic, prohibiting them from selling sensitive location data such as visits to medical clinics and places of worship. In January 2025 the FTC finalised a parallel order against Mobilewalla, which had collected more than 500 million unique advertising identifiers paired with precise location between January 2018 and June 2020. The Mobilewalla order sets a narrow precedent: it was the first FTC action to treat the collection of consumer data from real-time bidding advertising exchanges, for purposes outside the auction itself, as an unfair practice.
Even if every named broker were ordered out of business tomorrow, the data they sold remains in the hands of buyers not subject to the same orders. Location data is also trivially re-identifiable: the home-work pair from any single weekday is usually unique enough to identify a person.
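The re-identification claim can be illustrated with a minimal sketch. It assumes, purely for illustration, that night-time pings approximate home and weekday mid-day pings approximate work, and that rounding coordinates to three decimal places gives a usable grid cell. The function names, time windows, and sample pings are hypothetical and do not reflect any broker's pipeline.

```python
from collections import Counter
from datetime import datetime


def home_work_pair(pings, precision=3):
    """Infer a (home_cell, work_cell) pair from one device's pings.

    pings: iterable of (iso_timestamp, lat, lon).
    Assumption: pings before 06:00 or from 22:00 count as "home";
    weekday pings between 10:00 and 16:00 count as "work".
    """
    night, day = Counter(), Counter()
    for ts, lat, lon in pings:
        t = datetime.fromisoformat(ts)
        cell = (round(lat, precision), round(lon, precision))
        if t.hour < 6 or t.hour >= 22:
            night[cell] += 1
        elif t.weekday() < 5 and 10 <= t.hour < 16:
            day[cell] += 1
    if not night or not day:
        return None
    return night.most_common(1)[0][0], day.most_common(1)[0][0]


def unique_pairs(pings_by_device):
    """Return the devices whose home-work pair is shared with no other device."""
    pairs = {dev: home_work_pair(p) for dev, p in pings_by_device.items()}
    counts = Counter(p for p in pairs.values() if p is not None)
    return {dev: p for dev, p in pairs.items() if p is not None and counts[p] == 1}


# Hypothetical stream: two advertising identifiers, same workplace, different homes.
pings_by_device = {
    "ad-id-A": [("2024-03-05T02:00:00", 52.3702, 4.8952),
                ("2024-03-05T11:00:00", 52.3380, 4.8720)],
    "ad-id-B": [("2024-03-05T23:30:00", 52.4001, 4.9003),
                ("2024-03-05T13:00:00", 52.3380, 4.8720)],
}
print(sorted(unique_pairs(pings_by_device)))  # prints ['ad-id-A', 'ad-id-B']
```

Both devices share a workplace cell, but the pair is unique to each, which is what makes an "anonymous" identifier attributable once the home cell is matched to an address.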
Layer 3: Insurance and financial-scoring registers
LexisNexis Risk Solutions operates the Comprehensive Loss Underwriting Exchange (C.L.U.E.), which records up to seven years of auto insurance claims and seven years of home and personal-property claims. Insurers query it during quoting and underwriting. The Fair Credit Reporting Act in the United States entitles consumers to a free copy of their C.L.U.E. report once every twelve months through the LexisNexis consumer disclosure portal — most people request one for the first time after a quote that surprises them.
C.L.U.E. is one of the more visible examples of a wider category. Equifax, Experian, and TransUnion all run marketing-data subsidiaries with dossiers that go beyond their credit-bureau function. CIFAS in the United Kingdom maintains markers that affect financial onboarding. Several analytics tiers inside Acxiom and LexisNexis Risk Solutions are sold to non-financial buyers — political campaigns, debt collectors, screening firms — without appearing on any consumer-facing page.
These are data brokers in the strict sense, distinct from the people-search platforms (Spokeo, BeenVerified, Whitepages, Intelius) that aggregate consumer-facing profiles from public records. People-search platforms publish removal forms because consumer visibility is the core of their business model. Strict-sense brokers do not, because their B2B buyers want them invisible to consumers.
The remedy varies by jurisdiction: a Fair Credit Reporting Act dispute in the United States, a GDPR Article 15 subject-access request in the EU and UK, a CCPA right-to-know request in California. We walk through the European version in our analysis of GDPR Article 15 as a corporate-side reconnaissance vector; the access principle generalises across all three regimes.
Layer 4: Public-record and relational register
Courts, land registries, business registries, professional licensing bodies, and electoral rolls publish records about you because they are required to. So do funeral homes, genealogy sites, and local press, for different reasons.
For an adult with more than a decade of activity in a Western jurisdiction, the public record typically surfaces: civil judgments and bankruptcy filings, real-property purchases and mortgages, company directorships (Companies House in the UK, KvK in the Netherlands, Handelsregister in Germany, SEC EDGAR for US-listed positions), professional licences and disciplinary findings, voter registration in jurisdictions with public rolls, marriage and divorce filings in jurisdictions that publish them, and obituaries that name surviving relatives, employers, and home town.
The reach of this layer varies sharply by jurisdiction. The United States sits at the open end of the spectrum: federal court records aggregate through PACER, state court records through county portals, and property and voter records are mostly searchable. Germany sits at the other end. German civil court case files under the Zivilprozessordnung are not public records and are generally inaccessible to anyone other than the parties and their representatives. German personal-status registers operate under statutory disclosure delays of 110 years for births, 80 years for marriages, and 30 years for deaths. The Netherlands publishes court judgments through rechtspraak.nl with personal data anonymised by default. The United Kingdom sits between, with Companies House and the open electoral roll fully searchable while court records vary by tier and tribunal. Any inventory of public-record exposure is jurisdiction-bound — what is recoverable about a German national through institutional channels is structurally less than what is recoverable about an American, even when both have lived comparable public lives.
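As a rough illustration of what the German disclosure delays mean in practice, the arithmetic can be sketched as follows. The function name and the whole-year simplification are assumptions made for this example; the statutory rules may count from exact dates and carry exceptions that this sketch ignores.

```python
# Delay periods as stated above, in years.
DISCLOSURE_DELAY_YEARS = {"birth": 110, "marriage": 80, "death": 30}


def earliest_open_year(record_type, event_year):
    """Return the first calendar year in which a register entry could open.

    Simplification: whole years only, no statutory exceptions.
    """
    return event_year + DISCLOSURE_DELAY_YEARS[record_type]


print(earliest_open_year("birth", 1985))     # prints 2095
print(earliest_open_year("marriage", 2010))  # prints 2090
print(earliest_open_year("death", 2020))     # prints 2050
```

For anyone alive today, the birth and marriage entries stay closed for the rest of their lifetime, which is the structural difference from the US end of the spectrum.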
Adjacent to the formal register sits the relational layer. Family-tree platforms (Ancestry, MyHeritage, FamilySearch) include living relatives in published trees by default unless the tree owner marks them private. A genealogy-curious cousin is often the largest publisher of biographical data about you that you have never met. Obituaries written by family members frequently include enough detail — schools, employers, addresses, birth year — to anchor an identity profile.
Most individual records in this layer are not sensitive on their own. They exist because an institution is required to keep them, which means closing them is rarely an option for the person they describe.
If the public-record and relational layer alone covers more about you than a manual inventory can reach, that is where a structured Mirror pass starts.
Layer 5: The archive layer
The archive layer is why “I deleted it” is a partial defence at best.
The main general-purpose archive is the Internet Archive’s Wayback Machine, which has captured snapshots of public websites since 1996. A page deleted from its origin is, in many cases, still served from a Wayback snapshot taken before deletion. Coverage degraded measurably in 2025: Nieman Lab reported in October 2025 that snapshots of news homepages fell roughly 87% between May and October 2025 due to a breakdown in archiving projects. By February 2026 a number of major news organisations, including The Guardian and The New York Times, had begun blocking the Wayback Machine over AI-scraping concerns. Forward coverage is thinning, but the snapshots already on file are unaffected.
Single-page snapshot services add to this. The leading one is archive.today (also reachable as archive.ph, archive.is, and archive.li), which excels at capturing content behind paywalls or complex client-side rendering and rarely honours takedown requests. A user who wants a permanent copy of a tweet, a deleted blog post, or a paywalled article can hold one through this service indefinitely.
Reddit content has its own archive ecosystem. Reddit revoked the API access of Pushshift, a third-party archive of posts and comments, in mid-2023, and the major archive tools that depended on it (Removeddit, Reveddit, Unddit) lost their feed. A successor called PullPush, plus aggregator tools like RedReap, continues to recover deleted Reddit content, although coverage is partial and inconsistent post-2023. For pre-2023 content, the archive is essentially complete. Anything posted on Reddit before the API change and later deleted is, with high probability, still readable.
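Checking whether a deleted page survives in the Wayback Machine can be scripted. The sketch below assumes the Internet Archive's public availability endpoint and its JSON response shape (an `archived_snapshots` object with an optional `closest` entry) as publicly documented; verify both against the current documentation before relying on them. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Internet Archive availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD target date."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload):
    """Extract (snapshot_url, timestamp) from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"], closest["timestamp"]
    return None


def check(page_url, timestamp=None):
    """Query the endpoint over the network and return the closest snapshot."""
    with urlopen(availability_url(page_url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `check("example.com/old-post", "20150101")` would return the snapshot nearest that date if one exists, or `None` otherwise. A `None` result only means this one archive holds no copy; it says nothing about archive.today or the Reddit archives.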
What AI assistants record when you use them
Most digital-footprint guides stop at Layer 5. They do not cover what happens when you ask an AI assistant a question about yourself, your work, or the people you are investigating.
Every major AI assistant — ChatGPT, Claude, Gemini, Grok — logs your conversations. What differs between providers is whether those conversations are used to train future models, whether humans review them, and what the opt-out mechanism is. The following is drawn from each provider’s published privacy policy, verbatim where quoted, all accessed 28 May 2026.
ChatGPT (OpenAI)
OpenAI’s US Privacy Policy (updated May 18, 2026) states: “We may use Content you provide us to improve our Services, for example to train the models that power ChatGPT.” This applies to consumer accounts by default.
The opt-out sits in Settings → Data Controls → “Improve the model for everyone.” When that toggle is off, new conversations are not used for training. The same policy notes: “If you do not want us to use your Content to train our models, you can opt out by following the instructions in this article.” Conversations already logged remain in OpenAI’s systems unless deleted individually or removed via a formal privacy rights request.
OpenAI’s API operates under a different default: data submitted via the API is not used to train models unless the customer explicitly opts in. This matters for developers and businesses, but not for the majority of users accessing ChatGPT directly through the web or mobile app.
A separate consideration is the Memory feature, available on ChatGPT Plus and Team plans. When memory is enabled, ChatGPT retains facts from previous conversations and carries them forward. A user who never manages their memory store is building a persistent profile across sessions.
Claude (Anthropic)
Anthropic’s Privacy Policy (effective January 12, 2026) states: “We may use your Inputs and Outputs to train our models and improve our Services, unless you opt out through your account settings.”
The safety exception is explicit: even after opting out, Anthropic will use inputs and outputs for model improvement when conversations are “flagged for safety review to improve our ability to detect harmful content, enforce our policies, or advance AI safety research,” or when materials have been explicitly reported. An opt-out does not guarantee that a conversation flagged for safety review is excluded from training use.
The commercial and API distinction is clear in the policy: it “does not apply where Anthropic acts as a data processor” — which covers employer-provisioned accounts and API deployments. If you access Claude through a Claude for Work account provisioned by your employer, Anthropic is acting as a processor for that employer, not as a controller of your data for its own purposes. The employer’s policy governs.
Gemini (Google)
Google’s Gemini Apps Privacy Hub states: “Please don’t enter confidential information that you wouldn’t want a reviewer to see or Google to use to improve our services, including machine-learning technologies.”
The human-review clause is specific: “A subset of chats are reviewed by human reviewers.” Reviewed conversations “are retained for up to three years.” The Gemini Apps Activity dashboard allows users to review and delete conversation history, and turning off Gemini Apps Activity stops conversations from being saved to the user’s Google Account — though Google notes that conversations may still be used to improve its products for a short period regardless.
Gemini conversations are associated with a user’s Google Account by default, placing them alongside search history, Maps queries, and YouTube watch history in Google’s unified data layer.
Grok (xAI)
xAI’s Privacy Policy (effective April 4, 2026) lists among its purposes for using personal data: “To develop and improve our Service and to conduct research: For example to develop new product features, to train our models.”
Grok is embedded in X (formerly Twitter) for most users, meaning conversations are associated with an X account. The policy describes a Private Chat control: “when Private Chat is turned on, conversations will not appear in your conversation history and your conversations will be deleted from xAI systems within 30 days.” Private Chat is opt-in, not the default.
Across all four providers, consumer-facing products train on your conversations by default unless you change a setting that most users do not know exists. At most providers, human reviewers may read a subset of conversations as a routine quality-and-safety function. At OpenAI and Anthropic, API and commercial access carries different defaults that generally give the provider narrower training rights; the average person using the consumer product does not get that protection. Deleting a conversation from your interface does not delete it from provider infrastructure on the same timeline: retention periods run independently of what is visible to you.
If you are conducting any investigation, handling a sensitive matter, or researching any subject you would not want associated with your identity in a third-party log, AI assistant interfaces are not the right tool for that work. The reasoning is covered in our piece on why using AI for OSINT leaves a trail.
The cross-correlation principle
A single layer in isolation is mostly noise. A phone number alone, a property record alone, a single ad-network location point — none of these tells a searcher much. The risk emerges when those layers are stitched together.
Cross-correlation turns several thin signals into a stable identity. A phone number from the contact-graph layer plus a name from a public record plus a location pattern from an ad-broker corpus plus a claim record from C.L.U.E. plus a username from an old forum post is an identity anchor that survives any single rotation. Change your phone number; the location pattern is still tied to your previous device’s advertising identifier. Move house; the property record is now part of the historical chain. Delete the social account; the archive copy is intact. The AI layer follows the same pattern: a researcher who asked a series of questions about a specific legal matter, a specific company, or a specific person has created a timestamped record of their investigative direction — stored in a third-party system they do not control, subject to breach, subpoena, or internal disclosure.
Closing one layer leaves the others legible. Reducing exposure requires working the whole graph.
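The stitching step can be sketched as a connected-components problem: any two records that share an identifier value are merged into one cluster. This is a minimal illustration under strong assumptions (exact string matches, clean data, hypothetical field names and sample records); real linkage involves fuzzy matching and far noisier inputs.

```python
from collections import defaultdict


def link_records(records):
    """Group records that share any identifier value, using union-find.

    Each record is {"layer": str, "ids": {identifier_type: value}}.
    Returns a sorted list of clusters, each a sorted list of layer names.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}
    for idx, rec in enumerate(records):
        for token in rec["ids"].items():
            if token in first_seen:
                parent[find(idx)] = find(first_seen[token])
            else:
                first_seen[token] = idx

    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[find(idx)].append(rec["layer"])
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical records, one per layer, plus one unrelated number.
records = [
    {"layer": "contact_graph", "ids": {"phone": "+15550100", "name": "sam doe"}},
    {"layer": "public_record", "ids": {"name": "sam doe", "address": "12 elm st"}},
    {"layer": "ad_broker", "ids": {"ad_id": "ab-123", "address": "12 elm st"}},
    {"layer": "insurance_claim", "ids": {"name": "sam doe", "address": "12 elm st"}},
    {"layer": "unrelated_number", "ids": {"phone": "+15550199"}},
]
print(link_records(records))
```

The ad-broker record carries no name and no phone number, yet it joins the cluster through the shared address. Removing the phone number from the first record would not detach the other three, which is the sense in which the anchor survives a single rotation.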
What you can do, in order
A first pass costs nothing except time and covers most of what is recoverable through self-service.
- Run your primary email address and any associated addresses through Have I Been Pwned. Note the breaches by date. This bounds your exposure in breach corpora.
- If you are in the United States, request your free annual LexisNexis consumer disclosure — the C.L.U.E. report and the broader FullFile report — through the LexisNexis consumer portal.
- Search yourself on Truecaller’s free web lookup if your number is non-European. The result tells you what name and tag set someone calling you sees.
- Run your old domain names, handles, and personal site URLs through both the Wayback Machine and archive.today. Note what is archived.
- Search yourself in Companies House (UK), KvK (NL), Handelsregister (DE), SEC EDGAR (US), and any equivalent register for jurisdictions where you have lived.
- Read your Google Ads Settings and your Facebook Ads Preferences. The interest categories are a partial inventory of how the ad-graph layer has classified you.
- Check training-data settings on any AI assistant account you use regularly: ChatGPT (Settings → Data Controls → “Improve the model for everyone”), Claude (Settings → Privacy), Gemini (Gemini Apps Activity dashboard), Grok (Private Chat setting in X). The defaults are permissive at every provider.
- For data brokers without a public-facing portal, send a written subject-access request under GDPR Article 15 (UK/EU) or a CCPA right-to-know request (California). The mechanic is described in our Article 15 piece.
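The "note the breaches by date" step from the list above can be sketched as a small helper. It assumes you already hold a list of breach objects with `Name` and `BreachDate` (ISO date) fields, which is how the Have I Been Pwned API describes a breach as far as we are aware; fetching them requires an API key and is left out. The function names and the sample entries are hypothetical.

```python
from datetime import date


def breaches_by_date(breaches):
    """Return (breach_date, name) tuples sorted oldest first."""
    return sorted((date.fromisoformat(b["BreachDate"]), b["Name"]) for b in breaches)


def exposure_window(breaches):
    """Return (earliest, latest) breach dates, or None when the list is empty."""
    ordered = breaches_by_date(breaches)
    if not ordered:
        return None
    return ordered[0][0], ordered[-1][0]


# Hypothetical breach entries for one address.
sample = [
    {"Name": "ExampleShop", "BreachDate": "2019-07-14"},
    {"Name": "ExampleForum", "BreachDate": "2013-02-01"},
]
for breach_date, name in breaches_by_date(sample):
    print(breach_date.isoformat(), name)
```

The earliest date matters most: any password, address, or phone number you were using at that point should be treated as known from then on.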
Done thoroughly, this self-service pass surfaces what a casual searcher would find. It will not surface the cross-correlated graph, stylometric or username-graph linkage to handles you considered separate, the long tail of strict-sense brokers without consumer portals, or stealer-log corpora traded outside HIBP-class indexes. Those require structured tooling and primary-source access.
Where the inventory ends and an investigation begins
A self-service inventory tells you what a curious neighbour, mid-skill recruiter, or early-stage threat actor would find in an hour. A structured investigator with a week produces a different category of result.
The Mirror is a fixed-price OSINT investigation that produces that second category: the cross-correlation graph, broker dossiers without consumer portals, username and stylometric linkage across handles you thought were separate, and breach-corpus search across HIBP plus deeper indexes. The methodology is described in our walkthrough of how a Mirror investigation runs, and the structured overview of what such a digital footprint audit covers sits on its own page.
For organisations running this inventory at scale across executives and named staff — where infrastructure-focused ASM, EASM, and CAASM platforms stop at the network edge and miss the per-individual layer entirely — the corporate-side reasoning sits in our analysis of the identity attack surface ASM vendors miss.
Sources
AI provider privacy policies (accessed 28 May 2026)
- OpenAI US Privacy Policy (updated May 18, 2026) — “We may use Content you provide us to improve our Services, for example to train the models that power ChatGPT”; opt-out via Settings → Data Controls
- Anthropic Privacy Policy (effective January 12, 2026) — training on Inputs and Outputs unless opted out; safety-review exception; API and commercial carve-out
- Google Gemini Apps Privacy Hub — human reviewer clause; three-year retention on reviewed chats; Gemini Apps Activity dashboard
- xAI Privacy Policy (effective April 4, 2026) — training use; Private Chat 30-day deletion
FTC enforcement actions on location data brokers
- FTC Finalizes Order with X-Mode and Successor Outlogic (April 2024)
- FTC Finalizes Order Banning Mobilewalla from Selling Sensitive Location Data (January 2025)
Contact-graph harvesting
- Truecaller — overview, user base, data practices
- Avast: Popular call-blocking apps expose 3 billion users’ phone numbers (2019)
- Shadow profile — concept, history, legal status
Insurance and financial-scoring registers
- LexisNexis Risk Solutions Consumer Disclosure portal
- CFPB: LexisNexis C.L.U.E. and Telematics OnDemand
Archive layer