
Automated URL Finding

TL;DR

This guide shows you how to automatically find URLs on a website using command-line tools and Python scripts. This is useful for security testing, web archiving, or simply understanding the structure of a site.

1. Using wget (Simple but Effective)

wget is a common tool for downloading files from the internet. It can also be used to recursively download an entire website, effectively discovering all its URLs.

  1. Basic Recursive Download: This downloads everything linked from the starting page, up to five levels deep. Be careful with this on large sites!
     wget -r -l 5 --no-parent https://www.example.com/
  2. Extracting URLs from wget output: The downloaded pages (the starting page is usually saved as index.html) contain the links as href attributes. You can extract them with command-line tools like grep and sed; the Python sketch after this list handles an entire mirrored directory at once.
     grep -o 'href="[^"]*"' index.html | sed 's/href="//g' | sed 's/"//g' > urls.txt
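Because wget -r writes many HTML files (one directory per host), a short Python script can collect the href values from all of them at once. This is a minimal sketch: the mirror directory name www.example.com is an assumption, and the regular expression only matches double-quoted href attributes, just like the grep pattern above.

    import re
    from pathlib import Path

    # Directory created by "wget -r" (assumed name; adjust to your mirror).
    mirror_dir = Path('www.example.com')

    href_pattern = re.compile(r'href="([^"]*)"')

    urls = set()
    for html_file in mirror_dir.rglob('*.html'):
        text = html_file.read_text(encoding='utf-8', errors='ignore')
        # Collect every double-quoted href value, like the grep/sed pipeline above.
        urls.update(href_pattern.findall(text))

    for url in sorted(urls):
        print(url)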

2. Using curl and HTML Parsing

curl is another command-line tool for transferring data with URLs. Combined with tools like grep or dedicated HTML parsers, it is a powerful option.

  1. Download the HTML:
     curl https://www.example.com/ > index.html
  2. Extract URLs with grep (the same approach as with wget), or hand the saved file to a real HTML parser, as sketched below:
     grep -o 'href="[^"]*"' index.html | sed 's/href="//g' | sed 's/"//g' > urls.txt
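The grep pattern above misses single-quoted and unquoted attributes. As a lightweight alternative that needs no extra installs, Python's standard-library html.parser can read the file curl saved. A minimal sketch, assuming the page was saved as index.html in step 1:

    from html.parser import HTMLParser

    class HrefCollector(HTMLParser):
        """Collects the href attribute of every <a> tag."""
        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.urls.append(value)

    collector = HrefCollector()
    with open('index.html', encoding='utf-8', errors='ignore') as f:
        collector.feed(f.read())

    for url in collector.urls:
        print(url)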

3. Using Python with requests and BeautifulSoup4

Python offers more flexibility for parsing HTML and handling complex websites.

  1. Install the libraries:
     pip install requests beautifulsoup4
  2. Python script example: This script downloads the HTML, parses it with BeautifulSoup4, and extracts the href attribute from every <a> tag.
     from bs4 import BeautifulSoup
     import requests

     url = 'https://www.example.com/'

     # Download the page and keep the raw bytes for parsing.
     response = requests.get(url)
     html_content = response.content

     soup = BeautifulSoup(html_content, 'html.parser')

     # Collect the href attribute of every <a> tag that has one.
     urls = []
     for a_tag in soup.find_all('a', href=True):
         urls.append(a_tag['href'])

     for found_url in urls:
         print(found_url)
  3. Running the Script: Save the code as a Python file (e.g., url_finder.py) and run it from your terminal.
     python url_finder.py
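In practice, pages sometimes hang or return error statuses. A small variation of the script above, using the same placeholder URL, adds a request timeout and stops on HTTP errors before parsing:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.example.com/'

    # Fail fast if the server is slow or returns an error status.
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')
    for a_tag in soup.find_all('a', href=True):
        print(a_tag['href'])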

4. Handling Relative URLs

Websites often use relative URLs (e.g., /about instead of https://www.example.com/about). You need to convert these into absolute URLs; once converted, they can also be filtered down to a single host, as the second sketch below shows.

  1. Python Example (using urljoin):
     from bs4 import BeautifulSoup
     import requests
     from urllib.parse import urljoin

     base_url = 'https://www.example.com/'
     response = requests.get(base_url)
     html_content = response.content
     soup = BeautifulSoup(html_content, 'html.parser')

     urls = []
     for a_tag in soup.find_all('a', href=True):
         href = a_tag['href']
         # urljoin resolves relative links such as /about against the base URL.
         absolute_url = urljoin(base_url, href)
         urls.append(absolute_url)

     for found_url in urls:
         print(found_url)
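Once every link is absolute, it is often useful to keep only the URLs on the host you are examining. A short sketch of that filtering step; the urls list here is a stand-in for the output of the script above:

    from urllib.parse import urlparse

    base_url = 'https://www.example.com/'
    base_host = urlparse(base_url).netloc

    # Stand-in input: absolute URLs like those produced by the script above.
    urls = [
        'https://www.example.com/about',
        'https://www.example.com/contact',
        'https://other-site.example.org/page',
        'mailto:info@example.com',
    ]

    # Keep only links that point back to the same host
    # (this drops mailto: links and external sites).
    internal_urls = [u for u in urls if urlparse(u).netloc == base_host]

    for url in sorted(set(internal_urls)):
        print(url)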

5. Considerations for Large Websites
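On a large site, an unrestricted recursive crawl can generate thousands of requests. Limit the crawl depth, pause between requests, deduplicate URLs, and check the site's robots.txt (and your authorization to test) before you start. The sketch below is a minimal depth-limited, rate-limited crawler built from the same requests and BeautifulSoup pieces used earlier; the start URL, depth, and delay are placeholder values to adjust for your target.

    import time
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    START_URL = 'https://www.example.com/'   # placeholder start page
    MAX_DEPTH = 2                            # how many link levels to follow
    DELAY_SECONDS = 1.0                      # pause between requests

    def crawl(start_url, max_depth, delay):
        base_host = urlparse(start_url).netloc
        seen = set()
        queue = [(start_url, 0)]

        while queue:
            url, depth = queue.pop(0)
            if url in seen or depth > max_depth:
                continue
            seen.add(url)

            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
            except requests.RequestException:
                continue  # skip pages that fail or time out

            soup = BeautifulSoup(response.content, 'html.parser')
            for a_tag in soup.find_all('a', href=True):
                # Resolve relative links, drop fragments, and stay on the same host.
                absolute_url = urljoin(url, a_tag['href']).split('#')[0]
                if urlparse(absolute_url).netloc == base_host:
                    queue.append((absolute_url, depth + 1))

            time.sleep(delay)  # be polite: rate-limit the crawl

        return seen

    if __name__ == '__main__':
        for found_url in sorted(crawl(START_URL, MAX_DEPTH, DELAY_SECONDS)):
            print(found_url)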
