TL;DR
This guide shows you how to automatically find URLs on a website using command-line tools and Python scripts. This is useful for security testing, web archiving, or simply understanding the structure of a site.
1. Using wget (Simple but Effective)
wget is a common tool for downloading files from the internet. It can also be used to recursively download an entire website, effectively discovering all its URLs.
- Basic Recursive Download: This downloads everything linked from the starting page. Be careful with this on large sites!
wget -r -l 5 --no-parent https://www.example.com/
  - -r: Recursive download.
  - -l 5: Limit recursion depth to 5 levels (adjust as needed). Higher numbers take longer and use more bandwidth.
  - --no-parent: Don't ascend above the starting directory, so the crawl stays within the original path (wget already stays on the original domain by default).
- Extracting URLs from wget output: The downloaded pages are saved under a directory named after the host (here, www.example.com/). You can then use command-line tools like grep to extract the URLs. This command finds every href="..." attribute in the downloaded files, strips the surrounding quotes, and saves the results to a file called urls.txt.
grep -rho 'href="[^"]*"' www.example.com/ | sed 's/href="//g' | sed 's/"//g' > urls.txt
2. Using curl and HTML Parsing
curl is another command-line tool for transferring data with URLs. Combined with tools like grep or dedicated HTML parsers, it’s a powerful option.
- Download the HTML:
curl https://www.example.com/ > index.html
- Extract URLs with grep (same approach as with wget):
grep -o 'href="[^"]*"' index.html | sed 's/href="//g' | sed 's/"//g' > urls.txt
3. Using Python with requests and BeautifulSoup4
Python offers more flexibility for parsing HTML and handling complex websites.
- Install Libraries:
pip install requests beautifulsoup4
- Python Script Example: This script downloads the HTML, parses it with BeautifulSoup4, and extracts all href attributes from <a> tags.
from bs4 import BeautifulSoup
import requests

url = 'https://www.example.com/'

# Download the page and parse the HTML
response = requests.get(url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')

# Collect the href attribute of every <a> tag
urls = []
for a_tag in soup.find_all('a', href=True):
    urls.append(a_tag['href'])

# Print the results
for url in urls:
    print(url)
- Running the Script: Save the code as a Python file (e.g., url_finder.py) and run it from your terminal:
python url_finder.py
4. Handling Relative URLs
Websites often use relative URLs (e.g., /about instead of https://www.example.com/about). You need to convert these into absolute URLs.
- Python Example (using urljoin):
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

base_url = 'https://www.example.com/'

# Download and parse the page
response = requests.get(base_url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')

# Resolve every href against the base URL so relative links become absolute
urls = []
for a_tag in soup.find_all('a', href=True):
    href = a_tag['href']
    absolute_url = urljoin(base_url, href)
    urls.append(absolute_url)

for url in urls:
    print(url)
5. Considerations for Large Websites
- Respect robots.txt: Check the website's robots.txt file to see which pages are disallowed from crawling (see the sketch after this list).
- Rate Limiting: Don't overload the server with requests. Add delays between requests using time.sleep() in your Python script.
- Error Handling: Implement error handling to gracefully handle broken links or network issues.
- Security: Be mindful of potential vulnerabilities when crawling websites, especially if you are submitting data or interacting with forms. Avoid sending sensitive information.
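Putting these together, here is a minimal sketch (not a production crawler) that extends the earlier requests/BeautifulSoup approach with a robots.txt check via urllib.robotparser, a fixed delay between requests, and basic error handling. The start URL, the my-url-finder User-Agent string, the 1-second delay, and the 50-page cap are placeholder assumptions to adjust for your own target site.

import time
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.example.com/'  # placeholder start page
user_agent = 'my-url-finder'           # hypothetical User-Agent string
delay_seconds = 1                      # pause between requests (rate limiting)
max_pages = 50                         # safety cap on how many pages to fetch

# Load robots.txt so disallowed pages can be skipped
robots = RobotFileParser()
robots.set_url(urljoin(base_url, '/robots.txt'))
robots.read()

seen = set()
queue = deque([base_url])
found_urls = []

while queue and len(seen) < max_pages:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)

    # Respect robots.txt
    if not robots.can_fetch(user_agent, url):
        continue

    # Error handling: skip pages that fail instead of crashing
    try:
        response = requests.get(url, headers={'User-Agent': user_agent}, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f'Skipping {url}: {error}')
        continue

    soup = BeautifulSoup(response.content, 'html.parser')
    for a_tag in soup.find_all('a', href=True):
        absolute_url = urljoin(url, a_tag['href'])
        found_urls.append(absolute_url)
        # Only crawl further within the original site
        if urlparse(absolute_url).netloc == urlparse(base_url).netloc:
            queue.append(absolute_url)

    # Rate limiting: pause before the next request
    time.sleep(delay_seconds)

for found in sorted(set(found_urls)):
    print(found)

Save it to a file and run it with python as before; after swapping in your own values, it prints a de-duplicated, sorted list of every URL discovered within the page limit.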

