GUIDE

What Does the Internet Know About You? The Layers You Didn’t Author

Most people imagine their digital footprint as the sum of what they posted. The photos, the comments, the LinkedIn profile, the resignation tweet they regret. Close that account, the thinking goes, and the record shrinks.

It does not.

The larger and more durable record on you was created by other people, by automated systems running in the background of apps you never installed, and by institutions whose mandate is to write things down. You did not author this layer. In most cases you did not consent to it. It is the layer that matters most when someone is assembling a profile of you.

This article walks through five layers of the record on you that you did not create. They sit underneath the visible footprint covered in our earlier piece on the assembled profile, and they are usually the part that surprises people most when an investigator returns a structured report.

Layer 1: Contact-graph harvesting

Truecaller’s database contains more than 500 million users globally, of whom roughly 350 million are in India, and a substantially larger number of phone-number-to-name mappings. Most of those mappings come from a single mechanism: when a user installs the app and grants contact-list access, the app uploads their address book. If your number is in someone’s phone with your name attached, your number ends up in Truecaller’s database with your name attached, regardless of whether you have ever heard of the app.

A 2019 audit of the call-blocking app category by Avast estimated approximately three billion phone numbers held across the major apps in the segment. Truecaller’s posture is asymmetric by jurisdiction: it does not perform contact-book uploads from European users, presumably to avoid GDPR liability, but does so from most of the rest of the world.

This is the cleanest example of a structural problem in the consent model that underwrites most modern data collection. Your friend can consent on your behalf to the inclusion of your name and number in a third-party database. Once it is there, your name appears in the search results as soon as anyone with the app types your number.
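The mechanism can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of how Truecaller or any other app is implemented: the function names, the majority-label rule, and the sample address books are all hypothetical, and the numbers are fictional.

```python
from collections import defaultdict

def build_directory(uploads):
    """Aggregate uploaded address books into number -> {label: count}.

    `uploads` maps each consenting uploader to the (name, number) pairs
    in their address book. The people listed never take part themselves.
    """
    directory = defaultdict(lambda: defaultdict(int))
    for contacts in uploads.values():
        for name, number in contacts:
            directory[number][name] += 1
    return directory

def lookup(directory, number):
    """Return the most frequently uploaded label for a number, or None."""
    labels = directory.get(number)
    if not labels:
        return None
    return max(labels, key=labels.get)

# Hypothetical uploads: Sam Doe has never installed the app.
uploads = {
    "alice": [("Sam Doe", "+15550100"), ("Mum", "+15550101")],
    "bob": [("Sam Doe", "+15550100")],
    "carol": [("Sam (plumber)", "+15550100")],
}

directory = build_directory(uploads)
print(lookup(directory, "+15550100"))
```

Sam appears in the directory under the label most of the uploaders used, although Sam is not among the uploaders and was never asked.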

Facebook and WhatsApp run versions of the same loop. The “find your friends on Facebook” flow, present in some form for over a decade, harvests the contact lists of users who agree to it. Where those uploads include data about non-users, the result is what is usually called a shadow profile: a record assembled about a person who has never registered. The legal status of this practice has been challenged repeatedly under European data-protection law and remains unresolved.

Layer 2: Location ad-tracking

Anyone who has used an ad-supported mobile app has had their precise location sold. The infrastructure works as follows: the app embeds an advertising SDK, the SDK reads device location and a unique advertising identifier, and the resulting location-plus-identifier stream is sold to data brokers who aggregate movements across hundreds of millions of devices.

This was an industry-wide practice with almost no oversight until early 2024. In April 2024 the US Federal Trade Commission finalised an order against X-Mode Social and its successor company Outlogic, prohibiting them from selling sensitive location data such as visits to medical clinics and places of worship. In January 2025 the FTC finalised a parallel order against Mobilewalla, which had collected more than 500 million unique advertising identifiers paired with precise location between January 2018 and June 2020. The Mobilewalla order is precedential in a narrow sense: it was the first FTC action to treat the collection of consumer data from real-time bidding advertising exchanges, for purposes outside the auction itself, as an unfair practice.

Two practical implications follow for the individual reader. First, even if every named broker were ordered out of business tomorrow, the data they sold for years is in the hands of buyers who are not subject to the same orders. Second, location data is trivially re-identifiable: the home-work pair from any single weekday is usually unique enough to identify a person.
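The re-identification claim can be made concrete with a minimal sketch. It assumes a simplified ping format of (ISO timestamp, latitude, longitude) for a single advertising identifier, and it uses hypothetical heuristics: the most frequent night-time cell is treated as home and the most frequent daytime cell as work. The hour windows, the grid size, and the sample coordinates are illustrative assumptions, not a broker's actual method.

```python
from collections import Counter
from datetime import datetime

def round_cell(lat, lon, places=3):
    """Snap a coordinate to a grid cell (3 decimals is roughly 100 m)."""
    return (round(lat, places), round(lon, places))

def home_work_pair(pings, night=(0, 6), day=(10, 16)):
    """Infer (home_cell, work_cell) from one device's pings.

    Assumed heuristic: the most common cell during night hours is home,
    the most common cell during daytime hours is work.
    """
    night_cells, day_cells = Counter(), Counter()
    for ts, lat, lon in pings:
        hour = datetime.fromisoformat(ts).hour
        cell = round_cell(lat, lon)
        if night[0] <= hour < night[1]:
            night_cells[cell] += 1
        elif day[0] <= hour < day[1]:
            day_cells[cell] += 1
    home = night_cells.most_common(1)[0][0] if night_cells else None
    work = day_cells.most_common(1)[0][0] if day_cells else None
    return home, work

# Synthetic pings for one advertising identifier on a single weekday.
pings = [
    ("2024-03-05T02:10:00", 52.37021, 4.89517),
    ("2024-03-05T03:40:00", 52.37018, 4.89522),
    ("2024-03-05T11:00:00", 52.33801, 4.87203),
    ("2024-03-05T14:30:00", 52.33797, 4.87199),
    ("2024-03-05T19:00:00", 52.35000, 4.90000),
]

print(home_work_pair(pings))
```

Nothing in the input carries a name. The pair of cells is the identifier: matching the home cell against a property record or an address from another layer is what attaches a person to the device.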

Layer 3: Insurance and financial-scoring registers

LexisNexis Risk Solutions operates the Comprehensive Loss Underwriting Exchange, known as C.L.U.E., which records up to seven years of auto insurance claims and seven years of home and personal-property claims. Insurers query it during quoting and underwriting. The Fair Credit Reporting Act in the United States entitles consumers to a free copy of their C.L.U.E. report once every twelve months through the LexisNexis consumer disclosure portal — most people request one for the first time after a quote that surprises them.

C.L.U.E. is one of the more visible examples of a wider category. Equifax, Experian, and TransUnion all run marketing-data subsidiaries with dossiers that go beyond their credit-bureau function. CIFAS in the United Kingdom maintains markers that affect financial onboarding. Several of the analytics tiers inside Acxiom and LexisNexis Risk Solutions are sold to non-financial buyers (political campaigns, debt collectors, screening firms) without being visible on any consumer-facing page.

This is the data-broker layer in the strict sense, distinct from the people-search platforms (Spokeo, BeenVerified, Whitepages, Intelius) that aggregate consumer-facing profiles from public records. People-search platforms publish removal forms because consumer visibility is the core of their business. Strict-sense brokers do not, because their B2B buyers want them invisible to consumers in the first place.

The remedy varies by jurisdiction: a Fair Credit Reporting Act dispute in the United States, a GDPR Article 15 subject-access request in the EU and UK, a CCPA right-to-know request in California. The underlying mechanic is the same — a written demand to a named controller for what they hold on you, with statutory teeth behind it. We walk through the European version in our analysis of GDPR Article 15 as a corporate-side reconnaissance vector; the access principle generalises across the three regimes.

Layer 4: Public-record and relational register

Courts, land registries, business registries, professional licensing bodies, and electoral rolls publish records about you because they are required to. So do funeral homes, genealogy sites, and the local press, although for different reasons.

A non-exhaustive list of what is typically findable about an adult who has lived in a Western jurisdiction for more than a decade: civil judgments and bankruptcy filings, real-property purchases and mortgages, company directorships (Companies House in the UK, KvK in the Netherlands, Handelsregister in Germany, SEC EDGAR for US-listed positions), professional licences and disciplinary findings, voter registration in jurisdictions with public rolls, marriage and divorce filings in jurisdictions that publish them, and obituaries that name surviving relatives, employers, and home town.

The reach of this layer varies sharply by jurisdiction, and the variation changes what is actually recoverable about a person. The United States sits at the open end of the spectrum: federal court records are aggregated through PACER, state court records through county portals, and property and voter records are mostly searchable.

Germany sits at the other end. German civil court case files under the Zivilprozessordnung are not public records and are generally inaccessible to anyone other than the parties and their representatives. German personal-status registers (Personenstandsregister) operate under statutory disclosure delays of 110 years for births, 80 years for marriages, and 30 years for deaths, with access before those periods restricted to direct relatives or applicants who can demonstrate a legitimate interest.

The Netherlands publishes court judgments through rechtspraak.nl with personal data anonymised by default. The United Kingdom sits between, with Companies House and the open electoral roll fully searchable while court records vary by tier and tribunal. The practical implication is that any inventory of public-record exposure is jurisdiction-bound: what is recoverable about a German national through institutional channels is structurally less than what is recoverable about an American, even when both have lived comparable public lives.
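The German disclosure delays quoted above reduce to simple date arithmetic. The sketch below is an illustration only: it adds the stated number of years to the event date, and the function name and the handling of 29 February are assumptions rather than a statement of how a registry office calculates the period.

```python
from datetime import date

# Delays as stated in the text, in years.
DISCLOSURE_DELAY_YEARS = {"birth": 110, "marriage": 80, "death": 30}

def open_access_date(event_type, event_date):
    """Return the calendar date on which the stated delay has elapsed."""
    years = DISCLOSURE_DELAY_YEARS[event_type]
    try:
        return event_date.replace(year=event_date.year + years)
    except ValueError:
        # 29 February mapped onto a non-leap year: fall back to 28 February.
        return event_date.replace(year=event_date.year + years, day=28)

print(open_access_date("birth", date(1950, 5, 1)))
print(open_access_date("marriage", date(1975, 9, 12)))
```

On these figures, a birth entry for someone born in 1950 stays restricted until 2060, which is why the register contributes almost nothing to a profile of a living German adult.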

A second layer sits adjacent: the relational layer. Family-tree platforms (Ancestry, MyHeritage, FamilySearch) include living relatives in published trees by default unless the tree owner deliberately marks them private. A genealogy-curious cousin you have never met is often the largest publisher of biographical data about you. Obituaries written by family members frequently include enough biographical detail (schools, employers, addresses, birth year) to anchor an identity profile.

Most individual records in this layer are not sensitive on their own. What matters is that they exist because an institution is required to keep them, which means closing them is rarely an option for the person they describe.

If the public-record and relational layer alone covers more about you than a manual inventory can reach, that is where a structured Mirror pass starts.


Layer 5: The archive layer

The archive layer is why “I deleted it” is a partial defence at best. Three sub-layers matter.

The first is the Internet Archive’s Wayback Machine, which has captured snapshots of public websites since 1996. A page deleted from its origin is, in many cases, still served from a Wayback snapshot taken before deletion. Coverage degraded measurably in 2025: Nieman Lab reported in October 2025 that snapshots of news homepages on the Wayback Machine fell roughly 87% between May and October 2025 due to a breakdown in archiving projects. By February 2026 a number of major news organisations, including The Guardian and The New York Times, had begun blocking the Wayback Machine over AI-scraping concerns. Forward coverage is thinning, but the snapshots already on file are unaffected.

The second is the family of single-page snapshot services led by archive.today (also reachable as archive.ph, archive.is, and archive.li), which excels at capturing content behind paywalls or complex client-side rendering, and which rarely honours takedown requests. A user who wants a permanent copy of a tweet, a deleted blog post, or a paywalled article can hold one through this service indefinitely.

The third is the social-platform archive ecosystem. Reddit revoked Pushshift’s API access in mid-2023, and the major archive tools that depended on it (Removeddit, Reveddit, Unddit) lost their feed. A successor named PullPush plus aggregator tools like RedReap continue to recover deleted Reddit content, although coverage is partial and inconsistent post-2023. For pre-2023 content, the archive is essentially complete. Anything you posted on Reddit before the API change and later deleted is, with high probability, still readable.

Why this matters: the cross-correlation principle

A single layer in isolation is mostly noise. A phone number alone, a property record alone, a single ad-network location point alone — none of these tells a searcher much. The risk emerges when those layers are stitched together.

Cross-correlation is what turns several individually thin signals into a stable identity. A phone number from the contact-graph layer plus a name from a public record plus a location pattern from an ad-broker corpus plus a claim record from C.L.U.E. plus a username from an old forum post is an identity anchor that survives any single rotation. Change your phone number; the location pattern is still tied to your previous device’s advertising identifier. Move house; the property record is now part of the historical chain. Delete the social account; the archive copy is intact.

This is why removal work is harder than it appears from the outside. Closing one layer leaves the others legible. The work of reducing exposure is structurally a graph problem, not a list problem.
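The graph framing can be sketched in a few lines. The example assumes each source record is reduced to a tuple of identifiers that co-occur in it, and it groups identifiers with a union-find structure so that anything connected through any chain of records ends up in one cluster. The identifier prefixes, the sample records, and the function name are hypothetical.

```python
from collections import defaultdict

def link_identifiers(records):
    """Group identifiers that are connected through shared records."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for rec in records:
        items = list(rec)
        if not items:
            continue
        root = find(items[0])
        for other in items[1:]:
            other_root = find(other)
            if other_root != root:
                parent[other_root] = root

    clusters = defaultdict(set)
    for x in list(parent):
        clusters[find(x)].add(x)
    return list(clusters.values())

# Hypothetical records, one per layer.
records = [
    ("phone:+15550100", "name:Sam Doe"),    # contact-graph upload
    ("name:Sam Doe", "addr:12 Elm St"),     # property record
    ("adid:ab12", "addr:12 Elm St"),        # home cell from location data
    ("phone:+15550100", "user:samd_84"),    # old forum profile
    ("phone:+15550199", "name:Alex Roe"),   # an unrelated person
]

for cluster in link_identifiers(records):
    print(sorted(cluster))

# Rotating the phone number removes two edges, not the cluster.
rotated = [r for r in records if "phone:+15550100" not in r]
for cluster in link_identifiers(rotated):
    print(sorted(cluster))
```

After the rotation the name, address, and advertising identifier are still joined. Only the forum handle, which was connected through the phone number alone, drops out. Reducing exposure means cutting enough edges to disconnect the cluster, which is why a list of accounts to close is not a sufficient plan.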

What you can actually do, in order

A first pass costs nothing except time, and it covers most of what is recoverable through self-service. We recommend the following sequence.

  1. Run your primary email address, and any other addresses you have used, through Have I Been Pwned. Note the breaches by date. This bounds the breach-corpus layer.
  2. If you are in the United States, request your free annual LexisNexis consumer disclosure (the C.L.U.E. report and the broader FullFile report) through the LexisNexis consumer portal.
  3. Search yourself on Truecaller’s free web lookup if your number is registered outside Europe. The result tells you what name and tag set someone calling you sees.
  4. Run your old domain names, handles, and personal site URLs through both the Wayback Machine and archive.today. Note what is archived.
  5. Search yourself in Companies House (UK), KvK (NL), Handelsregister (DE), SEC EDGAR (US), and any equivalent register for jurisdictions where you have lived.
  6. Read your Google Ads Settings and your Facebook Ads Preferences. The interest categories are a partial inventory of how the ad-graph layer has classified you.
  7. For data brokers without a public-facing portal, send a written subject-access request under GDPR Article 15 (UK/EU) or a CCPA right-to-know request (California). The mechanic is described in our Article 15 piece.

Done thoroughly, this self-service pass surfaces the bulk of what a casual searcher would find. It will not surface the cross-correlated graph, stylometric or username-graph linkage to handles you thought were separate, the long tail of strict-sense brokers without consumer-facing portals, or stealer-log corpora traded outside HIBP-class indexes. Those require structured tooling and primary-source access.
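The Wayback Machine half of step 4 can be partly scripted. The sketch below uses the Internet Archive's public availability endpoint, which returns the snapshot closest to an optional timestamp. The helper names and the sample payload are illustrative, and the response shape shown is an assumption to verify against the current API documentation. As far as we know archive.today offers no comparable documented API, so that half of the step stays manual.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"

def availability_url(target, timestamp=None):
    """Build the query URL; timestamp is YYYYMMDD or longer, if given."""
    params = {"url": target}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{API}?{urlencode(params)}"

def closest_snapshot(payload):
    """Extract (timestamp, snapshot_url) from a parsed response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["timestamp"], closest["url"]
    return None

def check(target, timestamp=None):
    """Query the live endpoint. Requires network access."""
    with urlopen(availability_url(target, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))

# Hand-written sample shaped like a response, so the parser can be
# exercised offline.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20150101000000/http://example.com/",
            "timestamp": "20150101000000",
            "status": "200",
        }
    }
}

print(availability_url("example.com", "20150101"))
print(closest_snapshot(sample))
```

Calling `check` once per old domain, handle page, or personal site URL gives a quick yes or no on whether a snapshot exists. It returns only the nearest capture, so a full list of captures still needs the Wayback Machine's calendar view.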

Where the inventory ends and an investigation begins

A self-service inventory is a useful exercise. It tells you what a curious neighbour, mid-skill recruiter, or early-stage threat actor would find in an hour. A structured investigator with a week produces a different category of result.

The Mirror is a fixed-price OSINT investigation that produces that second category: the cross-correlation graph, the broker dossiers without consumer portals, the username and stylometric linkage across handles you thought were separate, and the breach-corpus search across HIBP plus deeper indexes. The methodology is described in our walkthrough of how a Mirror investigation runs.

If this article surfaced layers you had not catalogued, that is the gap a Mirror is for.


If this is your situation

If you want to know what a search like this returns about you, a Snapshot Scan tells you in 48 hours.

