Webpage Download Tracking

G5 Cyber Security

4 months ago

TL;DR

Yes, in some cases. A webpage cannot directly see that you used your browser’s ‘Save Page As…’ or viewed its source code, but it can sometimes infer it from indirect signals: keyboard shortcuts, extra network requests, or resources that load when the saved copy is opened. There are ways to reduce this tracking.

How Webpages Track Downloads

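JavaScript Events: Browsers do not expose a dedicated ‘page saved’ event, so websites that try to detect saving rely on indirect JavaScript signals, such as listening for the Ctrl+S (or Cmd+S) keyboard shortcut.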

beforeunload event: This event fires when the page is about to be unloaded (closing the tab, reloading, or navigating away). It does not fire when you save the page, so at most it tells the website when you left.
Script-Initiated Downloads: If the website itself starts a download from JavaScript (e.g., dynamically generated files offered through a download button), its code runs at that moment and can log the action directly.

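Network Requests: When you save a complete page, your browser may re-request some of its resources (HTML, CSS, images, etc.) instead of taking them from its cache. The server logs these requests like any others, and an unusual burst can be correlated with a save, although it looks much the same as an ordinary page load.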
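
As a rough illustration of this kind of correlation, the Python sketch below groups access-log entries by client address and flags any client that requested a page and all of its known sub-resources within a short window. The simplified (timestamp, client, path) log format, the resource list, and the 5-second window are assumptions made for illustration, not something a particular server is known to use.

```python
from collections import defaultdict

# Hypothetical sub-resources of one page; a real site would derive these from its HTML.
PAGE_RESOURCES = {"/article.html", "/style.css", "/logo.png"}
WINDOW_SECONDS = 5.0  # assumed burst window


def clients_fetching_everything(entries, resources=PAGE_RESOURCES, window=WINDOW_SECONDS):
    """Return the clients that requested every resource within `window` seconds.

    `entries` is an iterable of (timestamp_seconds, client_ip, path) tuples,
    a simplified stand-in for parsed access-log lines.
    """
    by_client = defaultdict(list)
    for ts, ip, path in entries:
        if path in resources:
            by_client[ip].append((ts, path))

    flagged = set()
    for ip, hits in by_client.items():
        hits.sort()
        # Try each hit as the start of a window and see which paths fall inside it.
        for i, (start, _) in enumerate(hits):
            seen = {p for t, p in hits[i:] if t - start <= window}
            if seen == set(resources):
                flagged.add(ip)
                break
    return flagged


sample_log = [
    (0.0, "203.0.113.5", "/article.html"),
    (0.4, "203.0.113.5", "/style.css"),
    (0.9, "203.0.113.5", "/logo.png"),
    (2.0, "198.51.100.7", "/article.html"),
    (60.0, "198.51.100.7", "/style.css"),
]
print(clients_fetching_everything(sample_log))
```

A first-time visitor with an empty cache produces the same burst, which is why this signal on its own says little about whether a page was saved.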

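Web Archive Services: Your browser’s own web archive formats (such as .mhtml or .webarchive) are created locally and send nothing to a third party. If you instead use an online archiving service (like Archive.org), that service fetches the page itself, so the website sees a request from the service, and the service keeps a record of the saved page and its origin.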

Content Integrity Checks: Some websites embed hidden code or unique identifiers within the HTML they serve to each visitor. The website cannot inspect your saved file, but if the saved copy is later opened and still runs its scripts or loads remote resources, those requests can reveal that a copy exists and which visit it came from.
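
One way such an identifier could work is sketched below in Python: the server derives a per-visit token with an HMAC and embeds it in a resource URL, so any later request for that URL can be tied back to the original visit. The secret key, the tracker.example host, the /pixel.gif path, and the function names are all hypothetical and chosen only for illustration.

```python
import hashlib
import hmac

SECRET_KEY = b"example-secret"  # hypothetical server-side secret


def make_token(visit_id: str) -> str:
    """Derive an opaque token for one page view from a visit identifier."""
    digest = hmac.new(SECRET_KEY, visit_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def render_page(visit_id: str) -> str:
    """Return HTML whose image URL carries the per-visit token."""
    token = make_token(visit_id)
    return (
        "<html><body><p>Article text</p>"
        f'<img src="https://tracker.example/pixel.gif?t={token}" alt="">'
        "</body></html>"
    )


def token_matches(visit_id: str, token: str) -> bool:
    """Check whether a token seen in a later request belongs to this visit."""
    return hmac.compare_digest(make_token(visit_id), token)


page = render_page("visit-001")
print(page)
```

A saved copy that is only read in a text editor never requests the image, so the token is never sent back.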

How to Reduce Tracking

Disable JavaScript (Use with Caution): Disabling JavaScript will prevent many tracking methods but may break website functionality.

In your browser settings, find the JavaScript options and disable it for all sites or specific sites you suspect are tracking downloads.

Browser Extensions: Use privacy-focused browser extensions like uBlock Origin or Privacy Badger to block trackers and scripts. These can often prevent download tracking code from running.

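Save as Text Only: Instead of ‘Save Page As…’, try viewing the page source (usually right-click -> ‘View Page Source’) and then copying and pasting it into a plain text editor. The scripts and other tracking elements are still present in the copied text, but a plain text file does not execute them or load remote resources unless you reopen it in a browser. Note that some browsers may re-request the page in order to display its source.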

Right-click on the webpage, select ‘View Page Source’.
Select all the content in the source code window (Ctrl+A or Cmd+A).
Copy the content (Ctrl+C or Cmd+C).
Paste it into a plain text editor like Notepad (Windows) or TextEdit (Mac).
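
The copied source still contains any script tags as text. If you want a copy with no executable script in it at all, the Python sketch below shows one possible approach using only the standard html.parser module: it drops script elements and inline on* event-handler attributes from an HTML string. It is a minimal illustration and not a full sanitizer (for example, it does not inspect javascript: links, and it discards comments); the names ScriptStripper and strip_scripts are made up for this example.

```python
import html
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Rebuild HTML while omitting <script> elements and on* attributes."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def _emit_tag(self, tag, attrs, closing=""):
        kept = []
        for name, value in attrs:
            if name.startswith("on"):
                continue  # drop inline handlers such as onclick
            if value is None:
                kept.append(name)
            else:
                kept.append(f'{name}="{html.escape(value, quote=True)}"')
        self.out.append("<" + " ".join([tag] + kept) + closing + ">")

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        if tag != "script" and not self.in_script:
            self._emit_tag(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(html_text: str) -> str:
    """Return `html_text` without script elements or on* attributes."""
    parser = ScriptStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


sample = '<p onclick="x()">Hi &amp; bye<script>track()</script><img src="a.png"></p>'
cleaned = strip_scripts(sample)
print(cleaned)
```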

Use Command-Line Tools: Use tools like wget or curl to download the page content. These tools never execute JavaScript and, by default, fetch only the URL you give them, so you control exactly what is downloaded. The server still logs the request itself.

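wget -q -O filename.html URL

Replace URL with the address of the page. Avoid adding --no-check-certificate unless you have to, because it disables TLS certificate verification.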
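
For readers who prefer a script, a rough Python equivalent using only the standard library is sketched below. It issues a single request for the given URL and writes the response bytes to disk; nothing in the response is executed, and linked sub-resources are not fetched. The function name save_page is hypothetical, and the data: URL in the demonstration is used only so the example runs without network access; in practice you would pass the page’s https:// address.

```python
import tempfile
import urllib.request
from pathlib import Path


def save_page(url: str, destination: str, timeout: float = 10.0) -> int:
    """Fetch `url` once and write the raw response body to `destination`.

    Returns the number of bytes written. Scripts in the page are saved as
    text and never run, and linked images or stylesheets are not requested.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read()
    Path(destination).write_bytes(body)
    return len(body)


# Demonstration with a data: URL so no real server is contacted.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "filename.html"
    written = save_page("data:text/plain,hello", str(target))
    saved = target.read_bytes()

print(written, saved)
```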

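Incognito/Private Browsing: While not a perfect solution, using incognito mode or a private browsing window can limit the amount of tracking data associated with your session, because cookies and other site data are discarded when the window is closed. It does not hide your IP address from the website.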

VPN and Tor: Using a VPN (Virtual Private Network) or the Tor network can mask your IP address and make it harder to track your downloads.

Important Considerations

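Saving as PDF: Saving a page as a PDF is less likely to be tracked than saving the HTML source. The PDF is generated locally by your browser, so the website cannot embed tracking code in the file itself, but while its scripts are running the page can detect that printing was started (for example, through the beforeprint event).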
Dynamic Content: If the webpage relies heavily on dynamic content loaded after the initial page load, simply downloading the HTML source might not capture everything you see on the screen.