Web Page Authenticity Checker Robot

TL;DR

This guide shows you how to build a simple robot (crawler) in Python, using libraries such as requests, BeautifulSoup4 and, where JavaScript is involved, Selenium, to automatically check whether web pages are authentic. We'll focus on checking for common signs of fake or altered websites.

1. Set Up Your Environment

  1. Install Python: Make sure you have Python 3 installed. You can download it from python.org.
  2. Create a Virtual Environment (Recommended): This keeps your project dependencies separate.
    python -m venv my_checker_env
    source my_checker_env/bin/activate  # On Linux/macOS
    my_checker_env\Scripts\activate  # On Windows
  3. Install Libraries: Use pip to install the necessary libraries.
    pip install requests beautifulsoup4 selenium webdriver-manager

2. Basic Web Page Retrieval

Use the requests library to download the HTML content of a web page.

  1. Import the Library: Add this line at the top of your Python script.
    import requests
  2. Fetch the Page:
    url = "https://www.example.com" # Replace with the URL you want to check
    try:
      response = requests.get(url, timeout=10)  # A timeout stops the crawler hanging on unresponsive sites
      response.raise_for_status()  # Raise an exception for bad status codes
      html_content = response.text
    except requests.exceptions.RequestException as e:
      print(f"Error fetching {url}: {e}")
      exit()

3. Parsing HTML with BeautifulSoup4

Use BeautifulSoup4 to easily navigate and extract information from the downloaded HTML.

  1. Import the Library: Add this line at the top of your script.
    from bs4 import BeautifulSoup
  2. Create a BeautifulSoup Object:
    soup = BeautifulSoup(html_content, 'html.parser')
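
For example, once the HTML is parsed you can pull a few quick signals out of the soup object, such as the page title and a count of the links on the page:
    page_title = soup.title.string if soup.title else "(no title)"
    links = soup.find_all('a', href=True)  # Every anchor tag that has an href attribute
    print(f"Title: {page_title}")
    print(f"Links found: {len(links)}")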

4. Authenticity Checks

Here are some checks you can perform. Combine these for a more robust assessment.

  1. Check SSL Certificate: Verify the website uses HTTPS and has a valid certificate.
    • requests verifies SSL certificates by default (it raises an SSLError when verification fails), and Python's built-in ssl and socket modules can be used for more detailed checks such as inspecting the certificate's expiry date; see the SSL sketch after this list.
  2. Check Domain Age: Newly registered domains are a common sign of phishing sites, while long-established domains are generally more trustworthy. Use a WHOIS lookup (Python libraries are available) or an online tool; see the WHOIS sketch after this list.
  3. Look for Contact Information: A legitimate website should have clear contact details (address, phone number, email).
    contact_info = soup.find('a', href=lambda href: href and 'mailto:' in href) # Find mailto links
  4. Check for Privacy Policy & Terms of Service: These are standard on legitimate websites.
    privacy_policy = soup.find('a', string=lambda text: text and 'Privacy Policy' in text) # Find link to privacy policy
  5. Analyze Website Content: Look for poor grammar, spelling errors, or unusual phrasing.
    • This is harder to automate; consider using a natural language processing (NLP) library.
  6. Check for Broken Links: A large number of broken links can indicate neglect or malicious intent.
    for link in soup.find_all('a', href=True):
      href = link['href']
      if not href.startswith(('http://', 'https://')):
        continue # Skip relative and non-HTTP links (e.g. mailto:)
      try:
        response = requests.head(href, timeout=10, allow_redirects=True)
        if response.status_code >= 400:
          print(f"Broken link: {href}")
      except requests.exceptions.RequestException:
        print(f"Error checking link: {href}")

5. Using Selenium for Dynamic Content

If the website uses JavaScript to load content, requests and BeautifulSoup4 might not be enough. Use Selenium.

  1. Import the Library: Add this line at the top of your script.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager # Automatically manage ChromeDriver
  2. Set up WebDriver:
    service = Service(executable_path=ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)
  3. Load the Page:
    driver.get("https://www.example.com") # Replace with your URL
  4. Get HTML Content:
    html_content = driver.page_source
  5. Close the Browser:
    driver.quit()
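
For an unattended checker you will usually want Chrome to run without opening a visible window. One way to do that, assuming a recent Chrome and Selenium 4, is to pass headless options when creating the driver:
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # Use "--headless" on older Chrome versions
    driver = webdriver.Chrome(service=service, options=options)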

6. Automating with a Crawler

To check multiple pages, create a loop that iterates through a list of URLs.

  1. Create a URL List: Store the URLs you want to check in a file or a Python list.
  2. Loop Through URLs: Iterate through the list and run the checks on each page; a combined sketch follows the snippet below.
    urls = ["https://www.example.com", "https://anotherwebsite.com"] # Replace with your URLs
    for url in urls:
      try:
        print(f"Checking {url}...")
        # Fetch, parse, and run the checks described above
      except Exception as e:
        print(f"Error processing {url}: {e}")