Website Bot Activity: Find Data Leaks

TL;DR

Bots are likely discovering your new website through publicly available information (like WHOIS records) and automated scanning. Check for data leaks by examining server logs, using web vulnerability scanners, monitoring search engine indexing, and reviewing third-party services that might be exposing your site’s details.

How Bots Find New Websites

When you launch a new website, bots (often used by search engines, but also malicious actors) quickly find it. Here’s how:

  • DNS Records: Your domain’s DNS records become publicly queryable as soon as the domain is registered and its zone is published.
  • WHOIS Data: Public WHOIS databases contain registration information.
  • Crawling: Search engine bots crawl the web, discovering new links and sites.
  • Server Scans: Automated tools scan IP address ranges for open ports and running services. The sketch after this list shows what these public records and scans reveal.
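
You can check this public footprint yourself with standard command-line tools. A minimal sketch using whois, dig, and nmap; replace yourdomain.com with your own domain, and only scan infrastructure you own or are authorised to test:

    # Registration details any bot can read (registrar, dates, name servers)
    whois yourdomain.com

    # Published DNS records for the domain
    dig yourdomain.com A +short
    dig yourdomain.com MX +short

    # What an automated scan sees: open ports and service banners
    # (run only against hosts you own or have permission to scan)
    nmap -sV yourdomain.com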

Checking For Data Leaks

Here’s a step-by-step guide to finding potential leaks:

1. Server Logs

  1. Access Your Logs: Open your web server’s access logs (e.g., Apache or Nginx); these record every request made to your site.
  2. Look for Unusual Activity: Search for patterns indicating bot activity:
    • High Request Rates: A large number of requests from a single IP address in a short time (see the counting sketch after this list).
    • Unusual User Agents: Requests with strange or unknown user agent strings (the software identifying the requester). Common bot user agents include those from search engine crawlers, but also tools like curl or wget.
    • Requests for Non-Existent Pages: Bots often try to access common files and directories that shouldn’t exist (e.g., /wp-admin if you don’t use WordPress).
  3. Example Log Analysis (Apache): Use tools like grep or log analysis software.
    grep -i 'bot' /var/log/apache2/access.log | less
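
To quantify the patterns above, count requests per client IP and look at probes for missing pages. A minimal sketch, assuming Apache’s default combined log format (client IP is the first field, status code the ninth; adjust the path and field numbers for Nginx or a custom format):

    # Top 10 client IPs by request volume
    awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -10

    # Most frequently requested non-existent paths (HTTP 404), typical of bot probing
    awk '$9 == 404 {print $7}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -10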

2. Web Vulnerability Scanners

  1. Choose a Scanner: Use a web vulnerability scanner such as OWASP ZAP or Burp Suite Community Edition (run locally), or a hosted service like Detectify. Many offer free tiers.
  2. Run the Scan: Enter your website’s URL and start a scan (a command-line example follows this list). The scanner will check for common vulnerabilities like SQL injection, cross-site scripting (XSS), and outdated software.
  3. Review Results: Carefully examine the scanner’s report and address any identified issues.
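
If Docker is available, OWASP ZAP’s baseline scan is a quick way to run this from the command line. A sketch, assuming the current ZAP image on GitHub Container Registry and that you are scanning a site you own:

    # Passive baseline scan: spiders the target and reports common issues
    docker run --rm -t ghcr.io/zaproxy/zaproxy:stable zap-baseline.py -t https://yourdomain.com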

3. Search Engine Indexing

  1. Check What Google Has Indexed: Use site:yourdomain.com in Google search to list the pages it has indexed (example queries follow this list).
  2. Google Search Console: Add your website to Google Search Console and check its indexing status.
    • Look for any unexpected or sensitive information being indexed.
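
Combining site: with other standard Google operators helps surface sensitive indexed content quickly. Adjust the file types and paths below to whatever your site should (and should not) expose:

    site:yourdomain.com filetype:pdf        # indexed documents
    site:yourdomain.com inurl:admin         # admin pages in the index
    site:yourdomain.com intitle:"index of"  # exposed directory listings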

4. Third-Party Service Checks

  1. Archive.org (Wayback Machine): Check whether your website has been archived, which can reveal older versions of your content (a scripted check follows this list).
  2. Shodan: Search Shodan (https://www.shodan.io/) for your IP address to see what services are exposed and any associated banners or information. With the Shodan CLI (after running shodan init with your API key), look up a specific host with:
    shodan host your_ip_address
  3. BuiltWith: Use BuiltWith (https://builtwith.com/) to see what technologies your website is using, which can help identify potential vulnerabilities.
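
The Wayback Machine check can be scripted against Archive.org’s public availability API. A minimal sketch; substitute your own domain:

    # Returns JSON describing the closest archived snapshot of the URL, if any
    curl "https://archive.org/wayback/available?url=yourdomain.com"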

5. Robots.txt

  1. Review Your robots.txt File: Ensure it’s configured to keep well-behaved crawlers out of areas you don’t want indexed, and remember that the file is itself public and purely advisory: a Disallow rule naming a sensitive path advertises that path to anyone who reads the file, and malicious bots ignore the rules entirely. Protect sensitive areas with authentication rather than robots.txt alone.
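
Fetching the file takes one command, and the snippet below shows the kind of entry to avoid (the /secret-admin-panel/ path is a hypothetical example):

    # robots.txt is served from the site root, so anyone can read it
    curl https://yourdomain.com/robots.txt

    # Anti-pattern inside robots.txt: this tells every reader exactly where to look
    # Disallow: /secret-admin-panel/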