Automated URL Finding

TL;DR

This guide shows you how to automatically find URLs on a website using command-line tools and Python scripts. This is useful for security testing, web archiving, or simply understanding the structure of a site.

1. Using wget (Simple but Effective)

wget is a common tool for downloading files from the internet. It can also be used to recursively download an entire website, effectively discovering all its URLs.

  1. Basic Recursive Download: This downloads everything linked from the starting page. Be careful with this on large sites!
  2. wget -r -l 5 --no-parent https://www.example.com/
    • -r: Recursive download.
    • -l 5: Limit recursion depth to 5 levels (adjust as needed). Higher numbers take longer and use more bandwidth.
    • --no-parent: Don’t ascend to the parent directory during recursion; only URLs at or below the starting directory are followed. (wget also stays on the starting host by default unless you add -H/--span-hosts.)
  3. Extracting URLs from the downloaded pages: wget saves the pages themselves (starting with index.html); the URLs are inside that HTML. You can then use command-line tools like grep to extract them.
  4. grep -o 'href="[^"]*"' index.html | sed 's/href="//g' | sed 's/"//g' > urls.txt
    • This command pulls out every href="..." attribute, strips the surrounding href=" and closing quote with sed, and saves the resulting URLs to a file called urls.txt. A Python sketch after this list shows how to process every downloaded page, not just index.html.
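
A recursive wget run saves many pages, not just index.html. The following is a minimal Python sketch, assuming the mirror ended up under a www.example.com/ directory (wget names the folder after the host); it walks the downloaded files and collects every href with a regular expression.

    import re
    from pathlib import Path

    urls = set()
    # Walk every HTML file wget saved and pull out href="..." values
    for page in Path('www.example.com').rglob('*.html'):
        text = page.read_text(errors='ignore')
        urls.update(re.findall(r'href="([^"]*)"', text))

    for url in sorted(urls):
        print(url)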

2. Using curl and HTML Parsing

curl is another command-line tool for transferring data with URLs. Combined with tools like grep or dedicated HTML parsers, it’s a powerful option.

  1. Download the HTML:
  2. curl https://www.example.com/ > index.html
  3. Extract URLs with grep (similar to wget):
  4. grep -o 'href="[^"]*"' index.html | sed 's/href="//g' | sed 's/"//g' > urls.txt

3. Using Python with requests and BeautifulSoup4

Python offers more flexibility for parsing HTML and handling complex websites.

  1. Install Libraries:
  2. pip install requests beautifulsoup4
  3. Python Script Example: This script downloads the HTML, parses it with BeautifulSoup4, and extracts the href attribute from every <a> (anchor) tag. A sketch after this list shows how to tidy up the extracted links.
  4. from bs4 import BeautifulSoup
    import requests
    
    url = 'https://www.example.com/'
    
    # Fetch the page and keep the raw HTML
    response = requests.get(url)
    html_content = response.content
    
    # Parse it with the built-in html.parser backend
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Collect the href attribute of every <a> tag that has one
    urls = []
    for a_tag in soup.find_all('a', href=True):
        urls.append(a_tag['href'])
    
    for url in urls:
        print(url)
  5. Running the Script: Save the code as a Python file (e.g., url_finder.py) and run it from your terminal.
  6. python url_finder.py
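
In practice the raw href list usually contains duplicates, bare fragments such as #top, and non-web schemes like mailto: or javascript:. The following is a minimal sketch of one way to tidy the output; it reuses the urls list from the script above, and the filter_links helper name is just for illustration.

    from urllib.parse import urldefrag

    def filter_links(urls):
        """Drop fragments, duplicates, and non-web schemes (illustrative helper)."""
        cleaned = []
        seen = set()
        for href in urls:
            href, _fragment = urldefrag(href)   # strip any '#...' fragment
            if not href or href.startswith(('mailto:', 'javascript:', 'tel:')):
                continue
            if href not in seen:
                seen.add(href)
                cleaned.append(href)
        return cleaned

    for link in filter_links(urls):
        print(link)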

4. Handling Relative URLs

Websites often use relative URLs (e.g., /about instead of https://www.example.com/about). You need to convert these into absolute URLs.

  1. Python Example (using urljoin):
  2. from bs4 import BeautifulSoup
    import requests
    from urllib.parse import urljoin
    
    base_url = 'https://www.example.com/'
    response = requests.get(base_url)
    html_content = response.content
    soup = BeautifulSoup(html_content, 'html.parser')
    
    urls = []
    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href']
        absolute_url = urljoin(base_url, href)
        urls.append(absolute_url)
    
    for url in urls:
        print(url)
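
Once everything is absolute, you will often want to keep only the links that stay on the site you are crawling. Below is a minimal sketch that reuses the base_url and urls variables from the example above; urlparse comes from the standard library.

    from urllib.parse import urlparse

    base_host = urlparse(base_url).netloc

    # Keep only http(s) links whose host matches the starting site
    internal = [u for u in urls
                if urlparse(u).scheme in ('http', 'https')
                and urlparse(u).netloc == base_host]

    for link in internal:
        print(link)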

5. Considerations for Large Websites

  • Respect robots.txt: Check the website’s robots.txt file to see which paths are disallowed for crawlers, and skip them.
  • Rate Limiting: Don’t overload the server with requests. Add delays between requests using time.sleep() in your Python script.
  • Error Handling: Implement error handling to gracefully handle broken links, timeouts, and other network issues.
  • Cyber Security: Be mindful of potential vulnerabilities when crawling websites, especially if you are submitting data or interacting with forms. Avoid sending sensitive information. A sketch pulling these points together follows this list.
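
The sketch below combines these considerations: it checks robots.txt with the standard-library urllib.robotparser, pauses between requests, and catches request errors. It is a minimal illustration rather than a production crawler; the one-second delay, the user agent string, and the example page list are arbitrary choices.

    import time
    import requests
    from urllib import robotparser
    from urllib.parse import urljoin

    base_url = 'https://www.example.com/'
    user_agent = 'my-url-finder'   # illustrative user agent

    # Load the site's robots.txt so its rules can be honoured
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(base_url, '/robots.txt'))
    rp.read()

    pages = [base_url, urljoin(base_url, '/about')]   # example pages to fetch

    for page in pages:
        if not rp.can_fetch(user_agent, page):
            print(f'Skipping disallowed page: {page}')
            continue
        try:
            response = requests.get(page, headers={'User-Agent': user_agent}, timeout=10)
            response.raise_for_status()
            print(f'Fetched {page} ({len(response.content)} bytes)')
        except requests.RequestException as exc:
            print(f'Failed to fetch {page}: {exc}')
        time.sleep(1)   # be polite: pause between requests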