
Automated URL Finding

TL;DR

This guide shows you how to automatically find URLs on a website using command-line tools and Python scripts. This is useful for security testing, web archiving, or simply understanding the structure of a site.

1. Using wget (Simple but Effective)

wget is a common tool for downloading files from the internet. It can also be used to recursively download an entire website, effectively discovering all its URLs.

  1. Basic Recursive Download: This downloads everything linked from the starting page, up to five levels deep. Be careful with this on large sites!
     wget -r -l 5 --no-parent https://www.example.com/
  2. Extracting URLs from wget output: The downloaded pages (the starting page is usually saved as index.html) contain the links as href attributes. You can extract them with command-line tools like grep and sed; the Python sketch after this list handles an entire mirrored directory at once.
     grep -o 'href="[^"]*"' index.html | sed 's/href="//g' | sed 's/"//g' > urls.txt
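Because wget -r writes many HTML files (one directory per host), a short Python script can collect the href values from all of them at once. This is a minimal sketch: the mirror directory name www.example.com is an assumption, and the regular expression only matches double-quoted href attributes, just like the grep pattern above.

    import re
    from pathlib import Path

    # Directory created by "wget -r" (assumed name; adjust to your mirror).
    mirror_dir = Path('www.example.com')

    href_pattern = re.compile(r'href="([^"]*)"')

    urls = set()
    for html_file in mirror_dir.rglob('*.html'):
        text = html_file.read_text(encoding='utf-8', errors='ignore')
        # Collect every double-quoted href value, like the grep/sed pipeline above.
        urls.update(href_pattern.findall(text))

    for url in sorted(urls):
        print(url)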

2. Using curl and HTML Parsing

curl is another command-line tool for transferring data with URLs. Combined with tools like grep or dedicated HTML parsers, it is a powerful option.

  1. Download the HTML:
     curl https://www.example.com/ > index.html
  2. Extract URLs with grep (the same approach as with wget), or hand the saved file to a real HTML parser, as sketched below:
     grep -o 'href="[^"]*"' index.html | sed 's/href="//g' | sed 's/"//g' > urls.txt
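The grep pattern above misses single-quoted and unquoted attributes. As a lightweight alternative that needs no extra installs, Python's standard-library html.parser can read the file curl saved. A minimal sketch, assuming the page was saved as index.html in step 1:

    from html.parser import HTMLParser

    class HrefCollector(HTMLParser):
        """Collects the href attribute of every <a> tag."""
        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.urls.append(value)

    collector = HrefCollector()
    with open('index.html', encoding='utf-8', errors='ignore') as f:
        collector.feed(f.read())

    for url in collector.urls:
        print(url)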

3. Using Python with requests and BeautifulSoup4

Python offers more flexibility for parsing HTML and handling complex websites.

  1. Install the libraries:
     pip install requests beautifulsoup4
  2. Python script example: This script downloads the HTML, parses it with BeautifulSoup4, and extracts the href attribute from every <a> tag.
     from bs4 import BeautifulSoup
     import requests

     url = 'https://www.example.com/'

     # Download the page and keep the raw bytes for parsing.
     response = requests.get(url)
     html_content = response.content

     soup = BeautifulSoup(html_content, 'html.parser')

     # Collect the href attribute of every <a> tag that has one.
     urls = []
     for a_tag in soup.find_all('a', href=True):
         urls.append(a_tag['href'])

     for found_url in urls:
         print(found_url)
  3. Running the Script: Save the code as a Python file (e.g., url_finder.py) and run it from your terminal.
     python url_finder.py
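In practice, pages sometimes hang or return error statuses. A small variation of the script above, using the same placeholder URL, adds a request timeout and stops on HTTP errors before parsing:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.example.com/'

    # Fail fast if the server is slow or returns an error status.
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')
    for a_tag in soup.find_all('a', href=True):
        print(a_tag['href'])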

4. Handling Relative URLs

Websites often use relative URLs (e.g., /about instead of https://www.example.com/about). You need to convert these into absolute URLs; once converted, they can also be filtered down to a single host, as the second sketch below shows.

  1. Python Example (using urljoin):
     from bs4 import BeautifulSoup
     import requests
     from urllib.parse import urljoin

     base_url = 'https://www.example.com/'
     response = requests.get(base_url)
     html_content = response.content
     soup = BeautifulSoup(html_content, 'html.parser')

     urls = []
     for a_tag in soup.find_all('a', href=True):
         href = a_tag['href']
         # urljoin resolves relative links such as /about against the base URL.
         absolute_url = urljoin(base_url, href)
         urls.append(absolute_url)

     for found_url in urls:
         print(found_url)
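Once every link is absolute, it is often useful to keep only the URLs on the host you are examining. A short sketch of that filtering step; the urls list here is a stand-in for the output of the script above:

    from urllib.parse import urlparse

    base_url = 'https://www.example.com/'
    base_host = urlparse(base_url).netloc

    # Stand-in input: absolute URLs like those produced by the script above.
    urls = [
        'https://www.example.com/about',
        'https://www.example.com/contact',
        'https://other-site.example.org/page',
        'mailto:info@example.com',
    ]

    # Keep only links that point back to the same host
    # (this drops mailto: links and external sites).
    internal_urls = [u for u in urls if urlparse(u).netloc == base_host]

    for url in sorted(set(internal_urls)):
        print(url)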

5. Considerations for Large Websites
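On a large site, an unrestricted recursive crawl can generate thousands of requests. Limit the crawl depth, pause between requests, deduplicate URLs, and check the site's robots.txt (and your authorization to test) before you start. The sketch below is a minimal depth-limited, rate-limited crawler built from the same requests and BeautifulSoup pieces used earlier; the start URL, depth, and delay are placeholder values to adjust for your target.

    import time
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    START_URL = 'https://www.example.com/'   # placeholder start page
    MAX_DEPTH = 2                            # how many link levels to follow
    DELAY_SECONDS = 1.0                      # pause between requests

    def crawl(start_url, max_depth, delay):
        base_host = urlparse(start_url).netloc
        seen = set()
        queue = [(start_url, 0)]

        while queue:
            url, depth = queue.pop(0)
            if url in seen or depth > max_depth:
                continue
            seen.add(url)

            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
            except requests.RequestException:
                continue  # skip pages that fail or time out

            soup = BeautifulSoup(response.content, 'html.parser')
            for a_tag in soup.find_all('a', href=True):
                # Resolve relative links, drop fragments, and stay on the same host.
                absolute_url = urljoin(url, a_tag['href']).split('#')[0]
                if urlparse(absolute_url).netloc == base_host:
                    queue.append((absolute_url, depth + 1))

            time.sleep(delay)  # be polite: rate-limit the crawl

        return seen

    if __name__ == '__main__':
        for found_url in sorted(crawl(START_URL, MAX_DEPTH, DELAY_SECONDS)):
            print(found_url)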
