Do you ever read The New York Times, The Atlantic, or Gizmodo online? Do you look up medical information on the Mayo Clinic's website, or shop on Home Depot's e-commerce site? Have you ever had to utilize the US Federal Trade Commission's IdentityTheft.gov portal?
If the answer to any of these is “yes”, then it is likely Google knows where you were physically sitting when you browsed these websites. All of these websites, plus several thousand others, use Google's 'free' web analytics service and have not configured IP anonymization.
Google Analytics is a part of the Google Marketing Platform, which allows website administrators to track and analyze website traffic. Google shares data from its Google Analytics silos with its Google Advertising Divisions. For example, data pools from first-party cookies set by Google Analytics can be shared with Google Ads infrastructure.
When you open a website that has installed the Google Analytics JavaScript on its pages, your browser sends an HTTPS network request to google-analytics.com/collect. Google Analytics' measurement protocol has a configuration option that lets website administrators 'anonymize' the IP address of users before Google's servers save the IP address to disk for processing. Note that, regardless of whether or not the anonymization feature is enabled, the initial HTTPS request will always disclose the user's IP address via the IP header. The distinction here is whether or not Google, as a data processor, is instructed to store the full IP address on its servers.
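To make the option concrete, here is a minimal sketch of what a single Measurement Protocol pageview hit could look like with and without the anonymization flag. The tracking ID, client ID, and page path are placeholder values, and the function name is hypothetical; the sketch only builds the URL-encoded payload and does not send anything to Google.

```python
from urllib.parse import urlencode


def build_pageview_hit(tracking_id: str, client_id: str, page: str,
                       anonymize_ip: bool) -> str:
    """Return the URL-encoded payload of a pageview hit (illustrative only)."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # property ID (placeholder below)
        "cid": client_id,    # client ID (placeholder below)
        "t": "pageview",     # hit type
        "dp": page,          # document path
    }
    if anonymize_ip:
        # The "aip" parameter is what asks Google to anonymize the
        # sender's IP address before it is stored.
        params["aip"] = "1"
    return urlencode(params)


# Placeholder values, not taken from any real website.
with_aip = build_pageview_hit("UA-XXXXX-Y", "555", "/home", anonymize_ip=True)
without_aip = build_pageview_hit("UA-XXXXX-Y", "555", "/home", anonymize_ip=False)
print(with_aip)
print(without_aip)
```

Either payload reaches Google from the same IP address; the only difference is the extra parameter in the first one.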
Image from Google Analytics documentation, illustrating how the IP address anonymization feature works. Note that, regardless of whether IP address anonymization is enabled, the user's IP address is still initially sent to Google's servers. The anonymization occurs server-side, before a request is saved to disk. To avoid sending any data to Google, the website would have to entirely remove the Google Analytics JavaScript tag.
If a website is using Google Analytics and has configured the "Anonymize IP" option, network requests made to Google's servers will include the "aip" parameter, either in the query string or as part of the POST body.
Screenshot of Chrome browser Developer Tools Network panel, illustrating an HTTPS network request sent to www.google-analytics.com. This screenshot was taken on irs.gov, and includes the “aip” parameter. The “aip” parameter needs to be configured for Google Analytics to anonymize a user’s IP address.
By analyzing network traffic from the 100,000 most popular websites (according to the Tranco 1M list), this analysis observed that at least 31,639 websites use Google Analytics on their webpages. Of these 31,639, only 4,435 (14%) had enabled the "aip" parameter. The other 86% of websites did not have this parameter present in their HTTPS requests to google-analytics.com/collect, meaning they were sending their customers' full IP addresses to Google. As you may already be aware, an IP address can be converted into an approximate physical geolocation through the use of services such as whatismyipaddress.com or ip2location.com.
Why does enabling the “aip” parameter matter?
If one assumes that Google is a good-faith data processor that honors the IP address anonymization request parameter, then there are several reasons why using this parameter may be important.
The first reason is a matter of legal liability and regulatory compliance. Users' IP addresses may be categorized as "personal data" or "personally identifiable information" (PII) by various legal privacy frameworks, such as the EU's GDPR, California's CCPA, and Brazil's LGPD. If a website is sharing personal data with third parties such as Google, this may trigger additional regulatory and compliance requirements. Several sources have noted that IP anonymization may be a requirement for meeting GDPR compliance. In 2016, the European Court of Justice reviewed the Breyer case, in which the Court ruled that "IP addresses may be personal data even though information may have to be sought from third parties to identify the subjects."
Google's own documentation suggests the use of various privacy controls if a website may be regulated under GDPR or CCPA. The Google Analytics Terms of Service somewhat ironically stipulate that website owners "will not and will not assist or permit any third party to pass information to Google that Google could use or recognize as personally identifiable information".
The second is a matter of potential punitive action. The GDPR Enforcement Tracker includes 498 examples of fines issued under GDPR by various European Union data protection authorities. Some of these were issued for "Non-compliance with general data processing principles" or "Insufficient fulfillment of data subjects rights". A handful of these fines cost tens of millions of Euros. Google itself is currently facing a class action lawsuit in California over the use of tools like Google Analytics to track users, even when they have switched to "Incognito mode" on their browsers.
The third reason is a matter of simple consumer trust. Robin Berjon, the Vice President of Data Governance at the New York Times (NYT), wrote in July that "privacy is about trust" and "the trust of our readers is essential." Berjon notes that readers "are overwhelmingly unhappy with data being shared with third parties that can use the data for entirely different purposes."
Consumers may lose trust in institutions like the Mayo Clinic if they find out that their browsing patterns, geo-location, and device fingerprints are being relayed by these organizations directly to Google without any anonymization.
So who is sending their users' IP addresses to Google?
By analyzing network HTTPS requests to google-analytics.com/collect or google-analytics.com/__utm.gif endpoint made when a user browses to different websites, one can look for the presence or absence of the "aip" query string or POST body parameters. Of 31,639 websites in the top Tranco 100k, only 4,435 appear to include the "aip" parameter in at least one network request (14%).
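The check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for this analysis: the function names are hypothetical, the input is assumed to be a list of (URL, POST body) pairs captured while loading a page, and the check only tests whether "aip" is present, not what value it carries.

```python
from urllib.parse import urlsplit, parse_qs

GA_DOMAIN = "google-analytics.com"
GA_PATH_ENDINGS = ("/collect", "/__utm.gif")


def is_ga_hit(url: str) -> bool:
    """True if the URL targets a Google Analytics collection endpoint."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    on_ga_domain = host == GA_DOMAIN or host.endswith("." + GA_DOMAIN)
    return on_ga_domain and parts.path.endswith(GA_PATH_ENDINGS)


def has_aip(url: str, post_body: str = "") -> bool:
    """True if "aip" appears in the query string or the POST body."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    body = parse_qs(post_body, keep_blank_values=True)
    return "aip" in query or "aip" in body


def site_uses_anonymization(requests):
    """Classify one site from its captured (url, post_body) pairs.

    Returns None if no Google Analytics hits were seen, otherwise True
    if at least one hit carried the "aip" parameter.
    """
    hits = [(url, body) for url, body in requests if is_ga_hit(url)]
    if not hits:
        return None
    return any(has_aip(url, body) for url, body in hits)
```

Counting a site as anonymizing if any single request carries the parameter mirrors the "at least one network request" criterion above, so it is a generous measure: a site with several Google Analytics accounts could anonymize one and not the others.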
Screenshot of Chrome browser Developer Tools Network panel, illustrating an HTTPS network request sent to www.google-analytics.com, with the “aip” parameter enabled. This is from one of the 4,435 websites that were observed to be using IP address anonymization in Google Analytics.
This is consistent with consumer expectations. A recent Twitter poll found that the vast majority of consumers expect <25% of websites to properly anonymize their IP address before transmitting data to Google Analytics.
Twitter poll conducted in December 2020.
A large number of United States government institutions appear to be using Google Analytics without IP address anonymization. Somewhat ironically, the Federal Trade Commission's (FTC) IdentityTheft.gov is not using IP address anonymization. Other sensitive government websites, such as FBI.gov, FCC.gov, clinicaltrials.gov, studentaid.gov, and the US Patent Office's uspto.gov, also transmit the full IP addresses of their users to Google. California’s ca.gov portal was also in this list.
A previous study found that nearly all US Senators and Representatives (98.9%) are using third party tracking scripts on their Congressional websites. 502 out of 537 (93.4%) members of Congress utilize Google Analytics on their taxpayer-funded senate.gov and house.gov websites. The vast majority of these lawmakers are also relaying their constituents' full IP addresses to Google’s servers, including self-proclaimed champions of digital privacy such as Elizabeth Warren and Ron Wyden.
Table showing which domains are using IP address anonymization in their Google Analytics accounts.
In Europe, where there are stricter regulations on consumer PII, far fewer government websites use Google Analytics to begin with. However, even here one can observe possible examples of PII leakage. For example, the Polish government's portal for requesting EU grants (funduszeeuropejskie.gov.pl) and the Polish national pension and social insurance agency ZUS (zus.pl) do not appear to use the "aip" parameter.
Major newspapers in virtually every EU country were observed to be using Google Analytics without IP address anonymization. German news portals t-online.de and stern.de, the French newspaper Le Figaro, the Polish Gazeta Wyborcza, the Dutch public broadcaster NH, Austrian daily Kurier, the Italian IT industry site agendadigitale.eu, and the leading European Biotech news site labiotech.eu were all observed making Google Analytics requests without the "aip" query string.
Even venerable medical information sources such as the Mayo Clinic (both mayoclinic.com and mayoclinichealthsystem.org), Cleveland Clinic, healthcare.gov, clinicaltrials.gov, and Duke University Health System were operating Google Analytics without the "aip" parameter. This theoretically means that, if an individual user was browsing articles about sexually transmitted diseases or depression on the Mayo Clinic's website, Google would know about this and could target that IP address with ads for STD medication or mental health treatments.
Robin Berjon's excellent NYT article on protecting their users’ online privacy states that "As of April 2019, we removed all third-party data controllers from our homepage, section fronts and articles." However, this most recent analysis indicates that the nytimes.com landing page is utilizing Google Analytics and has not activated IP address anonymization. The very well-written and thorough NYT privacy policy page notes that the NYT stores users' IP addresses and location, and that the Times uses Google Analytics. However, it does not explicitly disclose the fact that the Times sends users' full 32-bit IPv4 addresses to Google.
Screenshot from the New York Times privacy policy page.
Screenshot from The Markup’s Blacklight privacy inspector, showing that nytimes.com uses Google Analytics with “remarketing audiences”.
Screenshot from Chrome browser’s Developer Tools, illustrating that the “anonymizeIp” parameter is “undefined” for all Google Analytics accounts running on the nytimes.com landing page.
Other entities that are 'copying' Google Analytics data
It appears that a number of third parties are making direct copies of data sent to Google. When a user browses to a webpage like feedingamerica.org or marchofdimes.org, these pages send over a dozen different data points to Google Analytics' server. These include:
- cid - Google Analytics client ID, which is a “unique identifier for a browser–device pair”
- uid - Google Analytics user ID
- _gid - Used to distinguish users for 24 hours
Each of these parameters is a highly unique identifier that can be used to label users or devices when they browse the internet. It appears, though, that these unique identifiers are being copied and sent to domains other than Google's. For example, on feedingamerica.org or marchofdimes.org, the Google Analytics _gid parameters are also being sent to 'px.steelhousemedia.com', which is owned by a California-based ad tech company. Browsing on allrecipes.com or cargurus.com shows Google Analytics query string parameters being copied to beacon.krxd.net, which is owned by Krux, a Data Management Platform that was acquired by Salesforce in 2017.
Screenshot from Chrome browser’s Developer Tools on feedingamerica.org, illustrating that some of the user ID query string parameters sent to Google Analytics are also being sent to px.steelhousemedia.com.
It is not clear what the purpose of these data captures is. A large number of these domains appear to be either ad tech, data management, or analytics related. It is possible that some website administrators would like to incorporate parts of Google's user measurement data into other platforms. It is also unclear whether or not such 're-playing' of Google Analytics user identifiers is compatible with the Google Analytics Terms of Service. A previous article noted that in some cases, it is possible for third parties to intentionally manipulate Google Analytics telemetry through client-side JavaScript exploits.
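One way to spot this kind of identifier copying in captured traffic is sketched below. It is a simplified illustration rather than the method used for this analysis: the function names are hypothetical, only query strings are inspected, and an identifier counts as copied only if its exact value reappears in a request to a non-Google-Analytics host.

```python
from urllib.parse import urlsplit, parse_qsl

GA_DOMAIN = "google-analytics.com"
ID_PARAMS = ("cid", "uid", "_gid")  # identifiers listed above


def is_ga_host(host: str) -> bool:
    return host == GA_DOMAIN or host.endswith("." + GA_DOMAIN)


def find_copied_ids(urls):
    """Map each non-GA host to the GA identifier values it also received.

    `urls` is the list of request URLs observed during one page load.
    """
    # Pass 1: collect identifier values sent to Google Analytics.
    ga_values = set()
    for url in urls:
        parts = urlsplit(url)
        if is_ga_host(parts.hostname or ""):
            for key, value in parse_qsl(parts.query):
                if key in ID_PARAMS and value:
                    ga_values.add(value)

    # Pass 2: look for the same values in requests to other hosts,
    # regardless of the parameter name they travel under.
    copied = {}
    for url in urls:
        parts = urlsplit(url)
        host = parts.hostname or ""
        if is_ga_host(host):
            continue
        for _, value in parse_qsl(parts.query):
            if value in ga_values:
                copied.setdefault(host, set()).add(value)
    return copied
```

Exact-value matching will miss identifiers that are hashed, truncated, or re-encoded before being forwarded, so a check like this gives a lower bound on the copying.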
Table illustrating which domains were observed receiving Google Analytics user ID values most frequently during a sample of the landing pages of the most popular 100,000 websites.
Conclusion & recommendations
This analysis was performed from a static US IP address, which may bias the results. For example, it is possible that many websites behave differently with regard to Google Analytics telemetry if a user is in a different geographic jurisdiction. The study also did not look at other Google Analytics query string parameters, such as "npa" (for disabling advertising personalization), "ua" (for user agent override), or "uip" (for user IP address override). Nonetheless, this study illustrates that the vast majority of web properties that have installed Google Analytics do not appear to instruct Google to perform server-side IP address anonymization before processing HTTPS requests.
One possible reason for this is the need to obtain very granular information on user geo-location. A previous study found that adding IP anonymization reduces the accuracy of user city-level geo-location data (but not country-level data).
Website maintainers can very easily change their Google Analytics configuration to utilize the "aip" parameters. Here is one simple walkthrough. A more comprehensive guide is also available here.
On a further note, even when a web developer enables the "aip" parameter, HTTPS requests are still being made from a user's browser to the Google Analytics server. The anonymization feature is purely a server-side function that Google applies before saving the data to disk. Web developers who use Google Analytics and have enabled this feature are effectively trusting a third party to correctly execute this IP address anonymization on its own hardware.
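For a sense of what the server-side step amounts to, Google's public documentation describes it as setting the last octet of an IPv4 address, or the last 80 bits of an IPv6 address, to zeros. The sketch below reproduces that description with Python's standard ipaddress module. It is an illustration of the documented behavior, not Google's actual code, and the function name and example addresses (taken from reserved documentation ranges) are assumptions.

```python
import ipaddress


def mask_ip(address: str) -> str:
    """Zero the last octet (IPv4) or the last 80 bits (IPv6) of an address."""
    ip = ipaddress.ip_address(address)
    # Keep the first 24 bits of IPv4, or the first 128 - 80 = 48 bits of IPv6.
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


print(mask_ip("203.0.113.77"))
print(mask_ip("2001:db8:85a3:8d3:1319:8a2e:370:7348"))
```

Because only the final octet is dropped for IPv4, the masked address still identifies a block of 256 addresses, which fits the observation that anonymization degrades city-level but not country-level geo-location.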
To avoid the risk of leaking any personal data to Google's servers, web developers can consider a number of privacy-focused alternatives. Many of these are designed for GDPR and CCPA compliance by default, and may give more accurate page view data by virtue of not being on ad block lists. Some of the most highly recommended are: