TL;DR
Yes, system administrators can detect if their website is being scraped by a headless browser using various techniques like checking User-Agent strings, analysing JavaScript execution patterns, implementing CAPTCHAs, and monitoring traffic for unusual behaviour. These methods aren’t foolproof but significantly increase the difficulty for scrapers.
How to Check for Headless Browser Scraping
- Check the User-Agent String:
- Headless browsers often have identifiable User-Agent strings. Look for substrings such as ‘HeadlessChrome’ (the default for headless Chrome and for tools like Puppeteer that drive it) in your server logs.
- You can access this information in your web server configuration (e.g., Apache, Nginx) or through server-side scripting languages like PHP or Python.
- Example (Python using Flask):
```python
from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def index():
    user_agent = request.headers.get('User-Agent', '')
    print(f"User Agent: {user_agent}")
    # Check if 'HeadlessChrome' or similar is present in the User-Agent string
    if "HeadlessChrome" in user_agent:
        return "Possible headless browser detected!"
    else:
        return "Normal request."
```
- JavaScript Execution Analysis:
- Headless browsers execute JavaScript. You can use JavaScript to detect characteristics of the environment that are unusual for typical user browsers.
- For example, check if certain browser APIs or features are available and behave as expected.
- Example (JavaScript):
```javascript
if (navigator.webdriver) {
  // navigator.webdriver is true in automated/headless browsers.
  console.log("Possible headless browser detected!");
} else {
  // Likely a normal browser.
  console.log("Normal request.");
}
```
- Implement CAPTCHAs:
- CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are effective at blocking automated scraping.
- Use a reputable CAPTCHA service like reCAPTCHA or hCaptcha.
- Be mindful of user experience; excessive CAPTCHAs can frustrate legitimate users.
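- Example sketch (Python using the requests library): a minimal server-side check against reCAPTCHA’s documented siteverify endpoint. The secret key and function name are illustrative placeholders, not part of any specific setup.
```python
import requests

RECAPTCHA_SECRET = "your-secret-key"  # hypothetical placeholder; keep it out of source control

def verify_captcha(captcha_token: str, remote_ip: str) -> bool:
    """Return True if Google's siteverify endpoint confirms the CAPTCHA was solved."""
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={
            "secret": RECAPTCHA_SECRET,
            "response": captcha_token,  # token the client-side widget posts with the form
            "remoteip": remote_ip,
        },
        timeout=5,
    )
    return resp.json().get("success", False)
```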
- Rate Limiting:
- Limit the number of requests from a single IP address within a specific timeframe. This prevents scrapers from overwhelming your server.
- Configure rate limiting in your web server, or automatically ban abusive IPs with a tool like fail2ban.
- Example (Nginx):
```nginx
limit_req_zone $binary_remote_addr zone=mylimit:10m rate=5r/s;

server {
    ...
    location / {
        limit_req zone=mylimit burst=20 nodelay;
        ...
    }
}
```
- Monitor Traffic Patterns:
- Look for unusual traffic patterns, such as a high volume of requests from the same IP address or requests that follow a predictable pattern.
- Use web analytics tools (e.g., Google Analytics) and server logs to identify suspicious activity.
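- A rough log-analysis sketch (Python, standard library only): it assumes a combined-format access log at the hypothetical path /var/log/nginx/access.log and an arbitrary request threshold; adjust both to your environment.
```python
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path; adjust for your server
THRESHOLD = 1000                        # arbitrary example cut-off per log file

def find_heavy_hitters():
    """Count requests per client IP and flag unusually busy addresses."""
    counts = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            # In the common/combined log formats, the client IP is the first field.
            counts[line.split(" ", 1)[0]] += 1
    return [(ip, n) for ip, n in counts.most_common() if n > THRESHOLD]

if __name__ == "__main__":
    for ip, n in find_heavy_hitters():
        print(f"{ip}: {n} requests - possible scraper")
```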
- HTTP Header Checks:
- Check for missing or inconsistent HTTP headers that are typically present in normal browser requests.
- For example, the ‘Accept-Language’ header might be missing or have an unusual value.
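- Example sketch (Python using Flask): the header names are standard, but the blocking logic is only illustrative; some legitimate clients omit headers too, so treat a match as a signal rather than proof.
```python
from flask import Flask, request, abort

app = Flask(__name__)

@app.before_request
def check_headers():
    # Most real browsers send Accept-Language and Accept; many bots omit them.
    if not request.headers.get("Accept-Language") or not request.headers.get("Accept"):
        abort(403)  # or just log the request for review instead of blocking outright
```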
- Web Application Firewalls (WAFs):
- A WAF can help detect and block malicious traffic, including scraping attempts.
- Popular WAF options include Cloudflare, Sucuri, and ModSecurity.
Important Considerations:
- These techniques are not foolproof. Sophisticated scrapers can often bypass these measures.
- False positives are possible; legitimate users might be incorrectly identified as scrapers.
- Regularly update your detection methods to stay ahead of evolving scraping techniques.

