Module 4: Information Gathering - Web Edition
Introduction
Web reconnaissance is the first step in the information gathering phase of penetration testing. Objective: enumerate assets, identify exposures, map the attack surface, and collect intelligence for later use.
Active Reconnaissance
Direct interaction with the target. Produces detailed results but carries higher detection risk.
Port Scanning Command:
user01@AcmeCorp:~$ nmap -p- -T4 target.AcmeCorp.local
Output:
PORT STATE SERVICE
80/tcp open http
443/tcp open https
8080/tcp open http-proxy
-p- → scan all 65,535 TCP ports
-T4 → faster/aggressive timing (less stealthy)
Purpose: Identify open ports and services. High detection risk. Tools: Nmap, Masscan, Unicornscan
Vulnerability Scanning Command:
Output:
-h → specify target host
Purpose: Probe for misconfigurations and known CVEs. Very noisy; often logged. Tools: Nessus, OpenVAS, Nikto
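A representative Nikto invocation consistent with the -h flag above (target is illustrative):
user01@AcmeCorp:~$ nikto -h http://target.AcmeCorp.local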
Network Mapping Command:
Output:
Purpose: Show the path packets take across hops to reach the target. Medium–high detection risk. Tools: Traceroute, Nmap
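A minimal sketch using traceroute (target illustrative); nmap --traceroute produces similar hop data during a scan:
user01@AcmeCorp:~$ traceroute target.AcmeCorp.local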
Banner Grabbing Command:
Output:
Purpose: Retrieve service banners to reveal software and version. Low interaction, but often logged. Tools: Netcat, curl
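Example probes (host and port illustrative); with nc, send a HEAD request manually after connecting:
user01@AcmeCorp:~$ nc target.AcmeCorp.local 80
user01@AcmeCorp:~$ curl -I http://target.AcmeCorp.local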
OS Fingerprinting Command:
Output:
-O → enable OS detection
Purpose: Identify operating system via TCP/IP fingerprinting. Low detection risk. Tools: Nmap, Xprobe2
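Example consistent with the -O flag above (raw packet probes require root):
user01@AcmeCorp:~$ sudo nmap -O target.AcmeCorp.local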
Service Enumeration Command:
Output:
-sV → probe open ports to determine service version
-p → specify which ports to scan
Purpose: Gather service versions for vulnerability matching. Tools: Nmap
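Example consistent with the flags above (port list illustrative):
user01@AcmeCorp:~$ nmap -sV -p 80,443,8080 target.AcmeCorp.local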
Web Spidering (mapping mode) Command:
--spider → spider mode; check links without downloading
-r → recursive crawling
-l 2 → recursion depth of 2
-e robots=off → ignore robots.txt rules
-O /dev/null → discard any file output
--no-parent → don’t crawl above start directory
--domains=… → restrict crawl to this domain
-nv + -o → quiet output, log results
Purpose: Build a map of site endpoints without saving files. Tools: Burp Suite Spider, OWASP ZAP Spider, Scrapy
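A wget invocation reconstructed from the flags above (target URL and log file name are illustrative):
user01@AcmeCorp:~$ wget --spider -r -l 2 -e robots=off -O /dev/null --no-parent --domains=AcmeCorp.local -nv -o spider.log http://AcmeCorp.local/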
Passive Reconnaissance
No direct interaction with target infrastructure. Stealthier, less complete.
Search Engine Queries Command:
Purpose: Use search operators to locate public documents and leaks. Very low detection risk. Tools: Google, DuckDuckGo, Bing, Shodan
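Example query (entered into a search engine, not a shell):
site:AcmeCorp.local filetype:pdf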
WHOIS Lookup Command:
Output:
Purpose: Retrieve domain ownership, contacts, and nameservers. Very low detection. Tools: whois command-line, online WHOIS services
DNS Enumeration Command:
Output:
axfr → request full zone transfer
@ns1 → query specific nameserver
Purpose: Collect all DNS records if zone transfer is misconfigured. Very low detection. Tools: dig, nslookup, host, dnsenum, fierce, dnsrecon
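Example consistent with the flags above (name server is illustrative):
user01@AcmeCorp:~$ dig axfr @ns1.AcmeCorp.local AcmeCorp.local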
Web Archive Analysis Command:
Purpose: Review historical versions of the website for deprecated endpoints and leaks. Very low detection. Tools: Wayback Machine
Social Media Analysis Command:
Purpose: Identify employees, roles, and technologies for pivoting or social engineering. Very low detection. Tools: LinkedIn, Twitter, Facebook, specialised OSINT tools
Code Repositories Command:
Purpose: Search public repos for credentials, tokens, or config leaks. Very low detection. Tools: GitHub, GitLab
WHOIS
WHOIS is a query and response protocol for retrieving registration data about internet resources. Primarily used for domains, but also supports IP address ranges and autonomous systems. Think of it as a phonebook for the internet: it shows who owns or manages online assets.
Example Command
Output:
Typical WHOIS Record Fields
Domain Name → e.g., example.com
Registrar → company managing the registration (GoDaddy, Namecheap, etc.)
Registrant Contact → individual/organization who owns the domain
Administrative Contact → responsible for domain management
Technical Contact → handles domain technical issues
Creation/Expiration Dates → when domain was registered, when it expires
Name Servers → servers resolving the domain into IP addresses
History of WHOIS
Elizabeth Feinler and her team at the Stanford Research Institute’s NIC created the first WHOIS directory in the 1970s for ARPANET resource management. It stored hostnames, users, and domains. This groundwork evolved into the modern WHOIS protocol.
Why WHOIS Matters for Web Recon
WHOIS records provide valuable intel during reconnaissance:
Identifying Key Personnel Contact details (names, emails, phone numbers) can highlight potential phishing or social engineering targets.
Discovering Network Infrastructure Name servers and IP address data reveal parts of the network footprint, useful for finding entry points or misconfigurations.
Historical Data Analysis Services like WhoisFreaks show how ownership, contacts, or infrastructure changed over time, helping track target evolution.
Utilising WHOIS
WHOIS provides valuable intelligence across multiple scenarios and is a key recon tool for analysts, researchers, and threat hunters.
Scenario 1: Phishing Investigation
Trigger: Suspicious email flagged by gateway
Look for:
Domain registered only days ago
Registrant hidden by privacy service
Nameservers tied to bulletproof hosting
Interpretation: Strong phishing indicators → block domain, alert employees, investigate hosting/IP for related domains.
Scenario 2: Malware Analysis
Trigger: Malware communicating with C2 server
Look for:
Free/anonymous registrant email
Registrant address in high-risk cybercrime country
Registrar with lax abuse history
Interpretation: C2 likely on bulletproof/compromised infra → pivot to hosting provider, expand infra hunting.
Scenario 3: Threat Intelligence Report
Trigger: Tracking activity of a threat actor group
Look for:
Clusters of registrations before attacks
Fake or alias registrants
Shared name servers across campaigns
Past takedowns of similar domains
Interpretation: Identify attacker TTPs, generate IOCs, feed into threat intel reporting and detections.
Using WHOIS
Install WHOIS on Linux
Perform WHOIS Lookup
Output:
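A minimal sketch of the workflow (Debian-based install shown; output omitted):
user01@AcmeCorp:~$ sudo apt update && sudo apt install whois -y
user01@AcmeCorp:~$ whois AcmeCorp.local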
DNS & Subdomains
DNS
The Domain Name System (DNS) translates human-readable domain names into machine-usable IP addresses. It functions like an online GPS, ensuring users don’t need to remember raw IPs when navigating the web. Without DNS, browsing would be like navigating without a map.
How DNS Works
Local Cache Check – Computer first checks memory for stored IP mappings.
DNS Resolver Query – If not cached, query sent to resolver (usually ISP’s).
Root Name Server – Root directs query to appropriate TLD server.
TLD Name Server – TLD server points to the authoritative server for the requested domain.
Authoritative Name Server – Provides the correct IP address.
Resolver Returns Answer – Resolver gives IP back to computer and caches it.
Client Connects – Browser connects to the web server using the IP.
Think of DNS as a relay race: request passes from resolver → root → TLD → authoritative → back to resolver → to client.
Hosts File
A local file that maps hostnames to IP addresses, bypassing DNS. Useful for testing, overrides, or blocking.
Windows:
C:\Windows\System32\drivers\etc\hosts
Linux/macOS:
/etc/hosts
Format:
Examples:
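Illustrative format and entries (addresses and names are examples; the 0.0.0.0 entry shows the blocking use case):
<IP address>    <hostname> [aliases...]
127.0.0.1       localhost
192.0.2.50      app.AcmeCorp.local
0.0.0.0         ads.tracker.example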
Key DNS Concepts
Zone – Portion of namespace managed by an entity. Example: example.com and its subdomains.
Zone File – Text file defining resource records. Example:
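A minimal BIND-style zone file sketch for AcmeCorp.local (names, addresses, and timers are illustrative):
$TTL 3600
@       IN  SOA ns1.AcmeCorp.local. admin.AcmeCorp.local. (
            2025010101 ; serial
            3600       ; refresh
            900        ; retry
            604800     ; expire
            86400 )    ; minimum TTL
@       IN  NS  ns1.AcmeCorp.local.
@       IN  MX  10 mail.AcmeCorp.local.
www     IN  A   203.0.113.10
mail    IN  A   203.0.113.20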
Common DNS Concepts
Domain Name – Human-readable identifier (e.g., www.example.com).
IP Address – Numeric identifier (e.g., 192.0.2.1).
DNS Resolver – Translates names to IPs (ISP resolver, Google DNS 8.8.8.8).
Root Name Server – Top-level servers that direct queries to TLD servers.
TLD Name Server – Responsible for domains like .com or .org.
Authoritative Name Server – Holds actual IPs for a domain.
DNS Record Types – Store specific info (A, AAAA, CNAME, MX, NS, TXT, SOA, SRV, PTR).
Common DNS Record Types
A (Address Record) – Maps hostname to IPv4.
AAAA (IPv6 Address Record) – Maps hostname to IPv6.
CNAME (Canonical Name Record) – Alias to another hostname.
MX (Mail Exchange Record) – Mail servers for domain.
NS (Name Server Record) – Delegates a DNS zone.
TXT (Text Record) – Arbitrary data, often for verification/security.
SOA (Start of Authority Record) – Zone administration info.
SRV (Service Record) – Defines hostname/port for services.
PTR (Pointer Record) – Reverse DNS (IP → hostname).
Why DNS Matters for Web Recon
Uncovering Assets – Records may expose subdomains, MX servers, name servers, or outdated CNAMEs (e.g., dev.example.com → oldserver.example.net).
Mapping Infrastructure – A/NS/MX records reveal providers, load balancers, and interconnections. Useful for network mapping and identifying choke points.
Monitoring for Changes – New records (e.g., vpn.example.com) may indicate new entry points. TXT records may reveal tools in use (_1password=), enabling social engineering.
Digging DNS
After reviewing DNS fundamentals and record types, reconnaissance moves into practical tooling. These utilities query DNS servers to extract records, uncover infrastructure, and identify potential entry points.
DNS Tools
dig – Flexible DNS lookup; supports many record types (A, MX, NS, TXT, etc.), zone transfers, troubleshooting.
nslookup – Simpler DNS lookup; mainly for A, AAAA, MX queries.
host – Streamlined DNS lookups with concise output.
dnsenum – Automates enumeration; brute-forces subdomains, attempts zone transfers.
fierce – Recon and subdomain discovery; recursive search and wildcard detection.
dnsrecon – Combines multiple techniques; outputs in various formats.
theHarvester – OSINT tool; collects DNS records, email addresses, and related data.
Online DNS Lookup Services – Web-based interfaces for quick lookups when CLI tools aren’t available.
The Domain Information Groper
The dig command (Domain Information Groper) is a versatile and powerful utility for querying DNS servers and retrieving various types of DNS records. Its flexibility and detailed output make it a go-to choice for DNS recon.
Common dig Commands
dig AcmeCorp.local → Default A record lookup
dig AcmeCorp.local A → IPv4 address
dig AcmeCorp.local AAAA → IPv6 address
dig AcmeCorp.local MX → Mail servers
dig AcmeCorp.local NS → Authoritative name servers
dig AcmeCorp.local TXT → TXT records
dig AcmeCorp.local CNAME → Canonical name record
dig AcmeCorp.local SOA → Start of authority record
dig @1.1.1.1 AcmeCorp.local → Query a specific resolver (Cloudflare in this case)
dig +trace AcmeCorp.local → Show full DNS resolution path
dig -x 203.0.113.10 → Reverse lookup for an IP address
dig +short AcmeCorp.local → Short answer only
dig +noall +answer AcmeCorp.local → Display only the answer section
dig AcmeCorp.local ANY → Request all record types (often ignored per RFC 8482)
Note: Some DNS servers may detect or block excessive queries. Always respect rate limits and get permission before performing extensive DNS reconnaissance.
Groping DNS
Output:
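An illustrative query and trimmed response, consistent with the breakdown below (timings, server, and message size are placeholder values):
user01@AcmeCorp:~$ dig AcmeCorp.local

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5421
;; flags: qr rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; QUESTION SECTION:
;AcmeCorp.local.                IN      A

;; ANSWER SECTION:
AcmeCorp.local.         3600    IN      A       203.0.113.10

;; Query time: 12 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; MSG SIZE  rcvd: 59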
Breakdown
Header: Type = QUERY, status = NOERROR, transaction ID = 5421
Flags:
qr (response), rd (recursion desired), ad (authentic data)
Question Section: Requested A record for AcmeCorp.local
Answer Section: Returned IP 203.0.113.10 with TTL of 3600s
Footer: Response time, responding server, timestamp, message size
Short Answer Example
Output:
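Sketch (answer value matches the example above):
user01@AcmeCorp:~$ dig +short AcmeCorp.local
203.0.113.10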
Subdomains
Subdomains extend a main domain into functional segments (e.g., blog.AcmeCorp.local, shop.AcmeCorp.local, mail.AcmeCorp.local). They often host resources and services not visible on the main site.
Why Subdomains Matter in Web Recon
Development/Staging Environments – May be less secure, exposing features or sensitive data.
Hidden Login Portals – Admin panels or internal logins not intended for public access.
Legacy Applications – Old apps may remain online with unpatched vulnerabilities.
Sensitive Information – Configs, docs, or internal data might be exposed.
Subdomain Enumeration
Process of identifying subdomains, typically via A/AAAA records (direct mappings) or CNAME records (aliases).
Active Enumeration
Direct interaction with DNS servers or brute-force guessing.
Zone Transfer Attempt
Output:
Brute Force Enumeration with dnsenum
Output:
Fuzzing Subdomains with ffuf
Output:
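A representative ffuf run fuzzing the subdomain portion of the URL (wordlist path is illustrative):
user01@AcmeCorp:~$ ffuf -w subdomains.txt -u http://FUZZ.AcmeCorp.local/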
Brute Force with gobuster
Output:
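A representative gobuster DNS brute-force (wordlist path is illustrative):
user01@AcmeCorp:~$ gobuster dns -d AcmeCorp.local -w subdomains.txt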
Passive Enumeration
No direct interaction with the target. Uses public data.
Certificate Transparency Logs
Example tool: crt.sh
Query:
%.AcmeCorp.local → returns certificates listing subdomains in SAN fields.
Search Engine Operators
site:AcmeCorp.local → reveals indexed subdomains (e.g., vpn.AcmeCorp.local, blog.AcmeCorp.local).
Aggregated DNS Databases
Public repositories collect DNS records and expose subdomain lists without querying target servers.
Strategy Note
Active Enumeration – More comprehensive but noisy and detectable.
Passive Enumeration – Stealthier but may miss subdomains.
Best Practice – Combine both for stronger coverage.
Subdomain Bruteforcing
Subdomain brute-force enumeration is an active discovery technique that tests lists of possible names against a target domain to identify valid subdomains. Wordlists are critical:
General-purpose → common names (dev, staging, blog, mail, admin, test).
Targeted → industry- or technology-specific patterns.
Custom → created from intel or observed naming conventions.
Process
Wordlist Selection – Choose appropriate wordlist (broad, targeted, or custom).
Iteration and Querying – Tool appends each word to the domain (e.g., dev.AcmeCorp.local).
DNS Lookup – Query each candidate with A/AAAA lookups.
Filtering/Validation – Keep resolving subdomains, validate by further checks.
Tools for Subdomain Brute-Forcing
DNSEnum
Perl-based toolkit for DNS recon.
Record Enumeration – A, AAAA, NS, MX, TXT.
Zone Transfer Attempts – Attempts AXFR on name servers.
Subdomain Brute-Forcing – Uses wordlists.
Google Scraping – Finds subdomains via search results.
Reverse Lookups – Maps IPs back to domains.
WHOIS Queries – Gathers registration info.
Example Command
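Representative invocation matching the flags below (wordlist path is illustrative):
user01@AcmeCorp:~$ dnsenum --enum AcmeCorp.local -f subdomains.txt -r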
--enum → shortcut enabling multiple options (record lookups, transfers, brute force).
-f → specify wordlist.
-r → recursive brute force on discovered subdomains.
Example Output
Fierce
Python-based tool for recursive DNS recon.
Supports recursive search (find sub-subdomains).
Handles wildcard detection to reduce false positives.
Example Command
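Representative invocation following the flags noted below (flag names as listed in these notes; wordlist path is illustrative):
user01@AcmeCorp:~$ fierce --domain AcmeCorp.local --wordlist subdomains.txt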
--domain → target domain.
--wordlist → specify subdomain list.
Example Output
DNSRecon
Comprehensive enumeration framework.
Supports standard record enumeration, brute force, zone transfers.
Can export results in multiple formats (JSON, XML, CSV).
Example Command
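Representative invocation matching the flags below (wordlist path is illustrative):
user01@AcmeCorp:~$ dnsrecon -d AcmeCorp.local -D subdomains.txt -t brt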
-d → target domain.
-D → wordlist file.
-t brt → brute force mode.
Example Output
Amass
Popular subdomain enumeration tool with extensive integrations.
Supports brute force, API integrations, and OSINT sources.
Maintains updated databases of discovered assets.
Example Command
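Representative invocation matching the flags below (wordlist path is illustrative):
user01@AcmeCorp:~$ amass enum -brute -d AcmeCorp.local -w subdomains.txt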
enum → enumeration mode.
-brute → enable brute force.
-d → target domain.
-w → wordlist.
Example Output
Assetfinder
Lightweight tool focused on discovering subdomains.
Uses OSINT sources and APIs.
Designed for quick checks.
Example Command
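A typical quick check (the --subs-only flag, which limits output to subdomains, is an assumption about the current CLI):
user01@AcmeCorp:~$ assetfinder --subs-only AcmeCorp.local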
Example Output
PureDNS
Efficient brute-forcer and resolver.
Handles wildcards and filters results.
Designed for performance at scale.
Example Command
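Representative brute-force run (wordlist and resolver list paths are illustrative):
user01@AcmeCorp:~$ puredns bruteforce subdomains.txt AcmeCorp.local -r resolvers.txt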
Example Output
Strategy Note
dnsenum / dnsrecon / fierce → classic brute-forcing and recursive discovery.
amass / assetfinder / puredns → modern, scalable, OSINT-integrated.
Best Practice → combine both classes of tools for comprehensive coverage and validation.
DNS Zone Transfers
Zone transfers are designed for replication between DNS servers but can expose a complete domain map if misconfigured.
What is a Zone Transfer
A DNS zone transfer is a copy of all records in a zone (domain and subdomains) from one server to another. It ensures redundancy and consistency across DNS infrastructure.
Steps in the process:
Zone Transfer Request (AXFR): Secondary server requests transfer from primary.
SOA Record Transfer: Primary sends Start of Authority (SOA) record with zone details.
DNS Records Transmission: All records (A, AAAA, MX, CNAME, NS, etc.) are transferred.
Zone Transfer Complete: Primary signals end of records.
Acknowledgement: Secondary confirms receipt.
The Zone Transfer Vulnerability
If misconfigured, anyone can request a zone transfer and obtain:
Subdomains – complete list, including hidden or internal services.
IP Addresses – mappings for each subdomain, useful for network recon.
Name Server Records – reveals authoritative servers and potential hosting info.
This effectively hands over the target’s DNS map. Historically common, but now mitigated by restricting transfers to trusted secondary servers. Misconfigurations still appear due to human error or outdated setups.
Exploiting Zone Transfers
Use dig to attempt a transfer:
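Sketch matching the flag breakdown below:
user01@AcmeCorp:~$ dig axfr @ns1.AcmeCorp.local AcmeCorp.local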
axfr → request a full zone transfer.
@ns1.AcmeCorp.local → query specific name server.
AcmeCorp.local → target domain.
Example Output (fictionalized):
Remediation
Restrict zone transfers to trusted secondary servers only.
Monitor logs for unauthorized AXFR requests.
Regularly review DNS server configs for errors.
Field Notes
Safe practice domain: zonetransfer.me (intentionally misconfigured for training).
Quick test command:
dig axfr @nsztm1.digi.ninja zonetransfer.me
If a real target responds with records → severe misconfiguration, report immediately.
Virtual Hosts
Virtual hosting allows one web server to serve multiple sites using the HTTP Host header. Servers such as Apache HTTP Server, Nginx, and Microsoft IIS support this to separate domains, subdomains, and application roots.
How Virtual Hosts Work: VHosts vs Subdomains
Subdomains: blog.example.com → DNS record for parent domain; resolves to same or different IPs; used for segmentation.
VHosts: Server configs mapping Host header → document root and settings. Supports top-level domains and subdomains.
Local Overrides: /etc/hosts or hosts file entry bypasses DNS.
Private Names: Internal subdomains not in public DNS; discovered via VHost fuzzing.
Server VHost Lookup
Browser requests server IP with Host header.
Web server reads Host header.
Server matches to VHost config.
Files from matched document root returned.
Types of Virtual Hosting
Name-Based: Common. Host header selects site. One IP, many sites. Requires SNI for TLS.
IP-Based: Each site has unique IP. Protocol-agnostic. More isolation. Consumes IPs.
Port-Based: Different sites on different ports (80, 8080). Saves IPs, requires port in URL.
Field Notes:
Use name-based by default.
IP-based if isolation or legacy TLS required.
Port-based suitable for admin tools/labs.
Virtual Host Discovery Tools
gobuster
Bruteforces Host headers against target IP.
Valid vhosts return distinct responses.
Preparation:
Identify target server IP.
Use curated or custom wordlist.
Command Usage:
-u = target URL/IP.
-w = wordlist path.
--append-domain = required in newer versions.
Version Notes: Older releases appended base domain automatically; newer require --append-domain.
Performance and Output:
-t = threads.
-k = ignore TLS errors.
-o = save output file.
Example:
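Representative run combining the options above (target, wordlist, and output file are illustrative):
user01@AcmeCorp:~$ gobuster vhost -u http://AcmeCorp.local -w subdomains.txt --append-domain -t 50 -k -o vhost_results.txt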
Field Notes:
Adjust -t carefully; too high = rate limiting.
Save output with -o for review.
Validate small-size 200 responses (may be default pages).
Certificate Transparency Logs
SSL/TLS certificates enable encrypted communication between browsers and websites. Attackers can abuse mis-issued or rogue certificates to impersonate domains, intercept data, or spread malware. Certificate Transparency (CT) logs mitigate this risk by recording certificate issuance publicly.
What are Certificate Transparency Logs?
CT logs = public, append-only ledgers of SSL/TLS certificates.
Certificate Authorities (CAs) must submit new certificates to multiple CT logs.
Maintained by independent organisations, open for inspection.
Purposes:
Early Detection: Spot rogue/misissued certificates early, revoke before abuse.
CA Accountability: Public visibility of issuance practices; missteps damage trust.
Strengthen Web PKI: Adds oversight and verification to the Public Key Infrastructure.
Field Notes:
Think of CT logs as a global registry of certificates.
Transparency = trust enforcement for CAs.
CT Logs and Web Recon
Subdomain enumeration from CT logs = based on actual certificate records, not guesses.
Reveals historical and inactive subdomains (expired/old certs).
Exposes assets missed by brute-force or wordlist-based methods.
Searching CT Logs
crt.sh → Web interface, search by domain, shows cert details and SAN entries. Best for quick subdomain checks and certificate history. Pros: free, no registration, simple to use. Cons: limited filtering and analysis.
Censys → Search engine for devices and certificates with advanced filtering. Best for deep analysis, misconfig detection, and finding related hosts. Pros: extensive data, API access, flexible filters. Cons: requires registration (free tier).
Field Notes:
crt.sh = fast, simple queries.
Censys = powerful filtering + pivoting on cert/IP attributes.
crt.sh Lookup
API queries allow automation. Example: find “dev” subdomains for facebook.com.
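A sketch of the pipeline described below (the jq filter is one possible way to select matching names):
user01@AcmeCorp:~$ curl -s "https://crt.sh/?q=facebook.com&output=json" | jq -r '.[] | select(.name_value | contains("dev")) | .name_value' | sort -u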
curl fetches JSON output from crt.sh.
jq filters name_value fields containing "dev".
sort -u removes duplicates and sorts results.
Fingerprinting
Fingerprinting extracts technical details about the technologies behind a site to expose stack components, versions, and potential weaknesses. Findings guide targeted exploitation, reveal misconfigurations, and help prioritize targets.
Why Fingerprinting Matters
Targeted Attacks: Map tech/version → known exploits.
Find Misconfigurations: Default settings, outdated software, risky headers.
Prioritization: Focus on systems with higher risk/value.
Comprehensive Profile: Combine with other recon for full context.
Field Notes:
Correlate versions with CVEs; validate before exploitation.
Respect scope and authorization boundaries.
Fingerprinting Techniques
Banner Grabbing: Read service banners for product/version.
HTTP Header Analysis: Inspect Server, X-Powered-By, security headers.
Probing for Specific Responses: Send crafted requests; analyze unique errors/behaviors.
Page Content Analysis: Inspect HTML/JS; look for framework/CMS artifacts and comments.
Tools
Fingerprinting SecureMail.net
Apply manual + automated techniques to a purpose-built host (external demo domain).
Banner Grabbing
Fetch headers only with curl:
Follow the redirect to HTTPS:
Final destination:
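Illustrative sequence (the exact redirect chain depends on the host; the www. host is an assumption):
user01@AcmeCorp:~$ curl -I http://SecureMail.net
user01@AcmeCorp:~$ curl -I https://SecureMail.net
user01@AcmeCorp:~$ curl -I https://www.SecureMail.net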
wafw00f
Detect presence/type of WAF with wafw00f:
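Typical install and run (pip install shown; target is illustrative):
user01@AcmeCorp:~$ pip3 install wafw00f
user01@AcmeCorp:~$ wafw00f https://SecureMail.net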
Nikto (Fingerprinting Modules)
Use Nikto for software identification (-Tuning b):
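Representative command (target is illustrative; Tuning option b limits checks to software identification):
user01@AcmeCorp:~$ nikto -h SecureMail.net -Tuning b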
Field Notes:
Dual-stack IPs observed (IPv4 + IPv6).
Apache/2.4.41 (Ubuntu) + WordPress; check for known CVEs and hardening gaps.
Headers: add HSTS; set X-Content-Type-Options: nosniff; review redirects.
Outdated server version → verify before reporting; consider vendor backports.
Crawling
Concept
Crawling (spidering) = automated bots systematically browse the web.
Process: seed URL → fetch page → extract links → add to queue → repeat.
Purpose: indexing, reconnaissance, mapping.
Example Crawl
Homepage shows link1, link2, link3.
Visiting link1 reveals: Homepage, link2, link4, link5.
Crawler continues expanding until all reachable links are found.
Difference from fuzzing: crawling follows discovered links; fuzzing guesses paths.
Strategies
Breadth-First: explore wide first, level by level. Best for site overview.
Depth-First: follow one path deep, then backtrack. Best for nested content.
Data Collected
Links (internal/external): map structure, hidden areas, external ties.
Comments: may leak sensitive info.
Metadata: titles, keywords, authors, timestamps.
Sensitive files: backups (.bak, .old), configs (web.config, settings.php), logs, credentials, snippets.
Context
One data point (e.g., “software version” in a comment) grows in value when linked with:
Metadata showing outdated software.
Exposed config/backup files.
Example: repeated /files/ directory → open browsing exposes archives/docs.
Example: “file server” in comments + /files/ discovery = exposed storage confirmed.
robots.txt
Concept
robots.txt = simple text file in a website’s root directory (www.example.com/robots.txt).
Follows the Robots Exclusion Standard to guide crawlers.
Acts like an etiquette guide: tells bots which areas they may or may not access.
Structure
Organized into records, separated by blank lines.
Each record =
User-agent → specifies which bot (e.g., * for all, Googlebot, Bingbot).
Directives → instructions for that bot.
Common Directives
Disallow → Block crawling of specified path(s). Example: Disallow: /admin/
Allow → Explicitly permit crawling of a path even under broader restrictions. Example: Allow: /public/
Crawl-delay → Sets time (seconds) between requests. Example: Crawl-delay: 10
Sitemap → Points bots to XML sitemap. Example: Sitemap: https://www.example.com/sitemap.xml
Example
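A robots.txt consistent with the points below:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml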
All bots blocked from /admin/ and /private/.
All bots allowed into /public/.
Googlebot must wait 10s between requests.
Sitemap provided at /sitemap.xml.
Importance
Server protection: avoids overload from aggressive bots.
Sensitive info: prevents indexing of private/confidential areas.
Compliance: ignoring rules can breach terms of service or laws.
Limitations: not enforceable—rogue bots can ignore it.
Use in Reconnaissance
Hidden directories: disallowed entries often reveal admin panels, backups, or sensitive files.
Mapping: disallow/allow entries create a rough site structure.
Crawler traps: honeypot directories may be listed to catch malicious bots.
.Well-Known URIs
Concept
Defined in RFC 8615, .well-known is a standardized directory located at /.well-known/ in a website’s root.
Purpose: simplify discovery and access for browsers, apps, and security tools.
Example:
https://example.com/.well-known/security.txt → security policy information.
IANA Registry
Registry maintained by the Internet Assigned Numbers Authority (IANA).
Each URI suffix is tied to a specification and standard.
security.txt → Contact info for security researchers to report vulnerabilities. Status: Permanent. Reference: RFC 9116.
change-password → Standard URL for directing users to a password change page. Status: Provisional. Reference: W3C draft.
openid-configuration → Configuration details for OpenID Connect (OIDC). Status: Permanent. Reference: OpenID spec.
assetlinks.json → Verifies ownership of digital assets (e.g., apps) linked to a domain. Status: Permanent. Reference: Google spec.
mta-sts.txt → Policy for SMTP MTA Strict Transport Security (MTA-STS). Status: Permanent. Reference: RFC 8461.
Web Recon Use
.well-known entries often reveal endpoints and configurations of interest.
Reconnaissance value: discover hidden areas, authentication details, or security policies.
Particularly useful: the openid-configuration endpoint.
Example: OpenID Connect Discovery
Endpoint: https://example.com/.well-known/openid-configuration
Sample JSON Response:
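A trimmed illustrative response (field names are standard OIDC discovery metadata; values are placeholders):
{
  "issuer": "https://example.com",
  "authorization_endpoint": "https://example.com/oauth2/authorize",
  "token_endpoint": "https://example.com/oauth2/token",
  "userinfo_endpoint": "https://example.com/oauth2/userinfo",
  "jwks_uri": "https://example.com/oauth2/keys",
  "scopes_supported": ["openid", "profile", "email"]
}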
Creepy Crawlies
Popular Web Crawlers
Burp Suite Spider → integrated crawler in Burp Suite, maps applications, identifies hidden content, and uncovers vulnerabilities.
OWASP ZAP Spider → part of ZAP, a free and open-source scanner; supports automated and manual crawling.
Scrapy → Python framework for custom crawlers; powerful for structured data extraction and tailored reconnaissance.
Apache Nutch → Java-based, extensible, scalable crawler; suitable for massive crawls or domain-focused projects.
Scrapy
Used here with a custom spider called ReconSpider for reconnaissance on AcmeCorp.local.
Additional information on crawling techniques is covered in the “Using Web Proxies” module in CBBH.
Installing Scrapy
Installs Scrapy and its dependencies.
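Typical installation (assuming pip3 is available):
user01@AcmeCorp:~$ pip3 install scrapy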
ReconSpider
Download and extract the custom spider:
Run the spider against a target:
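Representative invocation (script name and target URL are assumptions based on the custom spider described above):
user01@AcmeCorp:~$ python3 ReconSpider.py http://AcmeCorp.local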
Replace domain with target of choice.
Output is saved to results.json.
results.json
Sample structure:
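An illustrative skeleton using the keys listed below (values are placeholders):
{
  "emails": ["info@AcmeCorp.local"],
  "links": ["http://AcmeCorp.local/about"],
  "external_files": ["http://AcmeCorp.local/docs/report.pdf"],
  "js_files": ["http://AcmeCorp.local/js/app.js"],
  "form_fields": [],
  "images": ["http://AcmeCorp.local/images/logo.png"],
  "videos": [],
  "audio": [],
  "comments": ["<!-- TODO: remove test account -->"]
}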
JSON Keys
emails → Email addresses found.
links → Internal and external links.
external_files → External files such as PDFs.
js_files → JavaScript files referenced.
form_fields → HTML form fields.
images → Image files referenced.
videos → Video files (if found).
audio → Audio files (if found).
comments → HTML comments in source code.
Search Engine Discovery
Concept
Also called OSINT (Open Source Intelligence) gathering.
Uses search engines as reconnaissance tools to uncover information on websites, organizations, and individuals.
Leverages indexing and search operators to extract data not directly visible on target sites.
Importance
Open Source → publicly accessible, legal, and ethical.
Broad Coverage → access to a wide range of indexed data.
Ease of Use → requires no advanced technical skills.
Cost-Effective → freely available resource.
Applications
Security Assessment → identify vulnerabilities, exposed data, login pages.
Competitive Intelligence → gather data on competitors’ products, services, strategies.
Investigative Journalism → reveal hidden connections or unethical practices.
Threat Intelligence → track malicious actors and emerging threats.
Search Operators
Operators refine searches to uncover precise information. Syntax may vary by search engine.
site: → Limit results to a domain. Example: site:AcmeCorp.local (find indexed pages on AcmeCorp.local).
inurl: → Match term in URL. Example: inurl:login (look for login pages).
filetype: → Find specific file types. Example: filetype:pdf (locate PDF documents).
intitle: → Match term in page title. Example: intitle:"confidential report" (find pages titled with “confidential report”).
intext: / inbody: → Match term in body. Example: intext:"password reset" (find text mentioning password reset).
cache: → Show cached version. Example: cache:AcmeCorp.local (view cached snapshot).
link: → Find backlinks. Example: link:AcmeCorp.local (show sites linking to AcmeCorp.local).
related: → Find similar sites. Example: related:AcmeCorp.local (show similar websites).
info: → Show page details. Example: info:AcmeCorp.local (get metadata on the domain).
define: → Provide definitions. Example: define:phishing (fetch definitions of phishing).
numrange: → Search within number ranges. Example: site:AcmeCorp.local numrange:1000-2000 (find numbers in range).
allintext: → Match all words in body. Example: allintext:admin password reset (find both terms in body).
allinurl: → Match all words in URL. Example: allinurl:admin panel (find “admin panel” in URLs).
allintitle: → Match all words in title. Example: allintitle:confidential report 2025 (find these words in titles).
AND → Require all terms. Example: site:AcmeCorp.local AND inurl:admin (find admin pages on AcmeCorp.local).
OR → Match any term. Example: "Linux" OR "Ubuntu" (find pages with either term).
NOT → Exclude terms. Example: site:BankCorp.local NOT inurl:login (find pages excluding login).
* (wildcard) → Placeholder for words. Example: filetype:pdf user* manual (match “user guide,” “user handbook,” etc.).
.. → Range search. Example: "price" 100..500 (match numbers between 100 and 500).
"" (quotes) → Exact phrase search. Example: "information security policy" (match the exact phrase).
- (minus) → Exclude term. Example: site:NewsPortal.net -inurl:sports (exclude sports content).
Google Dorking
Technique using Google search operators to find sensitive or hidden information.
Often referenced in the Google Hacking Database.
Examples:
Login Pages
site:AcmeCorp.local inurl:login
site:AcmeCorp.local (inurl:login OR inurl:admin)
Exposed Files
site:AcmeCorp.local filetype:pdf
site:AcmeCorp.local (filetype:xls OR filetype:docx)
Configuration Files
site:AcmeCorp.local inurl:config.php
site:AcmeCorp.local (ext:conf OR ext:cnf)
Database Backups
site:AcmeCorp.local inurl:backup
site:AcmeCorp.local filetype:sql
Web Archives
What is the Wayback Machine?
A digital archive of the World Wide Web and other internet resources.
Created by the Internet Archive, a non-profit organization.
Online since 1996, capturing website snapshots (“archives” or “captures”).
Allows users to revisit earlier versions of websites, showing historical design, content, and functionality.
How it Works
Three-step process:
Crawling → Automated bots systematically browse websites, following links and downloading pages.
Archiving → Pages and resources (HTML, CSS, JS, images, etc.) are stored with a timestamp, creating a snapshot.
Accessing → Users enter a URL and select a date to view historical captures, search terms within archives, or download archived content.
Frequency of snapshots varies (multiple daily to few per year).
Influenced by popularity, update rate, and Internet Archive resources.
Not every page is captured; priority is given to cultural, historical, or research value.
Website owners can request exclusions, though not always guaranteed.
Reconnaissance Value
Uncover Hidden Assets → Old pages, files, directories, or subdomains may expose sensitive data.
Track Changes → Compare historical snapshots to identify shifts in structure, technologies, or vulnerabilities.
Gather Intelligence → Archived content provides OSINT on past activities, marketing, employees, and technology.
Stealthy Reconnaissance → Accessing archives is passive, leaving no trace on the target’s infrastructure.
Example
The first archived version of HackTheBox is available on the Wayback Machine.
Earliest capture: 2017-06-10 @ 04:23:01.
Automating Recon
Why Automate Reconnaissance?
Automation improves web reconnaissance by:
Efficiency → handles repetitive tasks faster than humans.
Scalability → expands recon across many targets or domains.
Consistency → follows rules for reproducible results and fewer errors.
Comprehensive Coverage → tasks include DNS enumeration, subdomain discovery, crawling, port scanning, etc.
Integration → frameworks connect with other tools for seamless workflows.
Reconnaissance Frameworks
FinalRecon → Python tool with modules for SSL checks, Whois, headers, crawling, DNS, subdomains, and directories.
Recon-ng → modular Python framework for DNS, subdomains, crawling, port scanning, and vulnerability exploitation.
theHarvester → gathers emails, subdomains, employee names, and host data from search engines, PGP servers, Shodan, etc.
SpiderFoot → OSINT automation tool collecting IPs, domains, emails, and social media data; supports DNS lookups, crawling, and port scans.
OSINT Framework → curated collection of OSINT tools and resources.
FinalRecon
Capabilities include:
Header Information → server details, technologies, security misconfigurations.
Whois Lookup → domain registration and contact details.
SSL Certificate Info → validity, issuer, and details.
Crawler → extracts links, resources, comments, robots.txt, and sitemap.xml.
DNS Enumeration → supports over 40 record types.
Subdomain Enumeration → queries crt.sh, AnubisDB, ThreatMiner, CertSpotter, VirusTotal API, Shodan API, etc.
Directory Enumeration → custom wordlists/extensions to uncover hidden files/paths.
Wayback Machine → retrieves URLs from historical archives.
Fast Port Scan → quick service discovery.
Full Recon → runs all modules together.
Installing FinalRecon
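Typical installation from the public repository (repository URL given as the upstream project; adjust if mirrored):
user01@AcmeCorp:~$ git clone https://github.com/thewhiteh4t/FinalRecon.git
user01@AcmeCorp:~$ cd FinalRecon && pip3 install -r requirements.txt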
Help Output (excerpt):
Example Command
Gather header information and Whois lookup for AcmeCorp.local:
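Representative command (flag names assumed to match the module list above):
user01@AcmeCorp:~$ python3 finalrecon.py --headers --whois --url http://AcmeCorp.local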
Sample Output (excerpt):