TL;DR
This guide shows you how to build a simple crawler (robot) using Python and libraries like requests, BeautifulSoup4, and, where needed, Selenium to automatically check whether web pages are authentic. We’ll focus on checking for common signs of fake or altered websites.
1. Set Up Your Environment
- Install Python: Make sure you have Python 3 installed. You can download it from python.org.
- Create a Virtual Environment (Recommended): This keeps your project dependencies separate.
```
python -m venv my_checker_env
source my_checker_env/bin/activate   # On Linux/macOS
my_checker_env\Scripts\activate      # On Windows
```
- Install Libraries: Use pip to install the necessary libraries.
```
pip install requests beautifulsoup4 selenium webdriver-manager
```
2. Basic Web Page Retrieval
Use the requests library to download the HTML content of a web page.
- Import the Library: Add this line at the top of your Python script.
```python
import requests
```
- Fetch the Page:
url = "https://www.example.com" # Replace with the URL you want to checktry: response = requests.get(url) response.raise_for_status() # Raise an exception for bad status codes html_content = response.text except requests.exceptions.RequestException as e: print(f"Error fetching {url}: {e}") exit()
3. Parsing HTML with BeautifulSoup4
Use BeautifulSoup4 to easily navigate and extract information from the downloaded HTML.
- Import the Library: Add this line at the top of your script.
```python
from bs4 import BeautifulSoup
```
- Create a BeautifulSoup Object:
```python
soup = BeautifulSoup(html_content, 'html.parser')
```
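As a quick sanity check that the parse worked, you can pull a couple of basic elements out of the soup object (which elements you inspect is up to you):
```python
# Print the page title and count the links found on the page
title = soup.title.string if soup.title else "(no title)"
links = soup.find_all('a', href=True)
print(f"Title: {title}")
print(f"Found {len(links)} links")
```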
4. Authenticity Checks
Here are some checks you can perform. Combine these for a more robust assessment.
- Check SSL Certificate: Verify the website uses HTTPS and has a valid certificate.
requests automatically performs basic SSL verification (it raises an error for invalid certificates), but you can use Python's built-in ssl module for more detailed checks.
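For example, here is a minimal sketch using only Python's built-in socket and ssl modules to see how long a site's certificate remains valid; the hostname and timeout below are illustrative:
```python
import socket
import ssl
import time

def certificate_days_remaining(hostname, port=443):
    """Return days until the site's TLS certificate expires.
    Raises ssl.SSLCertVerificationError if the certificate is invalid."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()  # Verified certificate details as a dict
    expiry = ssl.cert_time_to_seconds(cert["notAfter"])  # e.g. 'Jun  1 12:00:00 2026 GMT'
    return int((expiry - time.time()) // 86400)

print(certificate_days_remaining("www.example.com"))
```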
- Check Domain Age: Older domains are generally more trustworthy. Use a WHOIS lookup service (there are Python libraries available) or an online tool.
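If you want to automate this, one option is the third-party python-whois package; the sketch below assumes that package and its whois.whois() helper, so adapt it to whichever WHOIS library you pick:
```python
# Assumes the third-party package: pip install python-whois
import whois

domain_info = whois.whois("example.com")  # Replace with the domain you are checking
creation = domain_info.creation_date
if isinstance(creation, list):  # Some registrars return several dates; take the earliest
    creation = min(creation)
print(f"Domain registered on: {creation}")
```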
- Look for Contact Information: A legitimate website should have clear contact details (address, phone number, email).
```python
contact_info = soup.find('a', href=lambda href: href and 'mailto:' in href)  # Find mailto links
```
- Check for Privacy Policy & Terms of Service: These are standard on legitimate websites.
```python
privacy_policy = soup.find('a', string=lambda text: text and 'Privacy Policy' in text)  # Find link to privacy policy
```
- Analyze Website Content: Look for poor grammar, spelling errors, or unusual phrasing.
- This is harder to automate; consider using a natural language processing (NLP) library.
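One way to experiment is the language_tool_python package, which wraps the LanguageTool grammar checker; the sketch below assumes that package is installed and simply counts how many issues it flags in the page text:
```python
# Assumes the third-party package: pip install language-tool-python
import language_tool_python

tool = language_tool_python.LanguageTool('en-US')
page_text = soup.get_text(separator=' ', strip=True)  # Plain text from the parsed page
issues = tool.check(page_text)
print(f"Grammar/spelling issues flagged: {len(issues)}")
tool.close()
```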
- Check for Broken Links: A large number of broken links can indicate neglect or malicious intent.
```python
for link in soup.find_all('a', href=True):
    href = link['href']
    if not href.startswith(('http://', 'https://')):
        continue  # Skip relative links
    try:
        response = requests.head(href)
        if response.status_code >= 400:
            print(f"Broken link: {href}")
    except requests.exceptions.RequestException:
        print(f"Error checking link: {href}")
```
5. Using Selenium for Dynamic Content
If the website uses JavaScript to load content, requests and BeautifulSoup4 might not be enough. Use Selenium.
- Import the Library: Add this line at the top of your script.
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager  # Automatically manage ChromeDriver
```
- Set up WebDriver:
```python
service = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
```
- Load the Page:
```python
driver.get("https://www.example.com")  # Replace with your URL
```
- Get HTML Content:
```python
html_content = driver.page_source
```
- Close the Browser:
```python
driver.quit()
```
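When the crawler runs unattended, you will usually want Chrome to run headless (no visible window). Here is a minimal sketch combining the steps above; the --headless=new flag applies to recent Chrome versions (older versions use --headless):
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # Run Chrome without opening a window

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)
try:
    driver.get("https://www.example.com")  # Replace with your URL
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.title.string if soup.title else "(no title)")
finally:
    driver.quit()  # Always close the browser, even if an error occurs
```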
6. Automating with a Crawler
To check multiple pages, create a loop that iterates through a list of URLs.
- Create a URL List: Store the URLs you want to check in a file or a Python list.
- Loop Through URLs: Iterate through the list and perform the checks on each page.
```python
urls = ["https://www.example.com", "https://anotherwebsite.com"]  # Replace with your URLs

for url in urls:
    try:
        # Fetch, parse, and check the URL as described above
        print(f"Checking {url}...")
    except Exception as e:
        print(f"Error processing {url}: {e}")
```
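To make that loop do real work, you can wrap a few of the checks from section 4 in a helper function. The check_page() name and the particular checks included below are just one way to organize it:
```python
import requests
from bs4 import BeautifulSoup

def check_page(url):
    """Run a few basic authenticity checks on a single URL."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return {
        "uses_https": url.startswith("https://"),
        "has_contact_email": soup.find('a', href=lambda h: h and 'mailto:' in h) is not None,
        "has_privacy_policy": soup.find('a', string=lambda t: t and 'Privacy Policy' in t) is not None,
    }

urls = ["https://www.example.com", "https://anotherwebsite.com"]  # Replace with your URLs
for url in urls:
    try:
        print(url, check_page(url))
    except Exception as e:
        print(f"Error processing {url}: {e}")
```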

