Arachnida: Web Scraping and Metadata Analysis in Cybersecurity
Within the cybersecurity bootcamp at 42Madrid, the Arachnida project combines two disciplines with direct real-world relevance: automated information gathering and metadata analysis. The result is two complementary tools, Spider and Scorpion, designed to show how much sensitive information can be passively extracted from publicly available web content.
This project is not just about downloading files; it’s about learning to look beyond the obvious and detect data that is often published unintentionally.
🌐 What Is Metadata and Why Does It Matter?
Metadata is information associated with other data. In files such as images, PDFs, or office documents, metadata can include details like:
- Creation date and time
- Software used to generate the file
- User or author information
- Camera or device model
- Operating system details
- GPS coordinates (in some cases)
From a cybersecurity and OSINT (Open Source Intelligence) perspective, this information can reveal critical insights about individuals, companies, or infrastructures without exploiting any vulnerabilities.
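To illustrate why GPS metadata matters: EXIF stores each coordinate as degrees, minutes, and seconds (as rational numbers) plus a hemisphere letter, and a few lines of arithmetic turn that into a position on a map. The following is a minimal sketch using only the standard library; the function name `dms_to_decimal` and the sample values are hypothetical and are not taken from the Arachnida code.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert EXIF-style (degrees, minutes, seconds) to signed decimal degrees.

    `dms` is a sequence of three numbers or (numerator, denominator) pairs;
    `ref` is the hemisphere letter: "N", "S", "E" or "W".
    """
    def to_fraction(value):
        # EXIF stores each component as a rational (numerator, denominator).
        if isinstance(value, tuple):
            return Fraction(value[0], value[1])
        return Fraction(value)

    degrees, minutes, seconds = (to_fraction(v) for v in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative by convention.
    if ref.upper() in ("S", "W"):
        decimal = -decimal
    return float(decimal)


# Hypothetical values, shaped like the GPS tags a photo might carry.
latitude = dms_to_decimal(((40, 1), (25, 1), (0, 1)), "N")
longitude = dms_to_decimal(((3, 1), (42, 1), (0, 1)), "W")
print(round(latitude, 4), round(longitude, 4))
```

Anyone who can read the tags can do this conversion, which is why a single unedited photo may be enough to locate where it was taken.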
🧩 Project Objective
The main goal of Arachnida is to develop two tools that work together to:
- Automatically extract public files from a website
- Analyze these files for sensitive metadata
- Display the information clearly
- Raise awareness about the importance of removing metadata before sharing content online
🕷️ Spider: Automated File Extraction
Spider is responsible for the data collection phase. Its main functions include:
- Receiving a URL as a parameter
- Crawling the website recursively, following internal links
- Automatically downloading relevant files such as images, PDFs, and documents
- Saving files locally for later analysis
This behavior, commonly known as scraping, closely resembles what real-world tools do during security audits and information-gathering work.
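The core of one crawl step can be sketched as follows: parse a page, resolve every link against the page URL, and sort the results into internal pages to visit next and files worth downloading. This is only an illustration built on the standard library; the names `LinkCollector`, `classify_links`, and the extension list `TARGET_EXTENSIONS` are assumptions, not Spider's actual implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of "interesting" extensions; the real tool may use a different list.
TARGET_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".pdf", ".docx")


class LinkCollector(HTMLParser):
    """Collect href/src attribute values from anchors and images."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def classify_links(page_url, html):
    """Split a page's links into internal pages to crawl and files to download."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(page_url).netloc
    pages, files = set(), set()
    for raw in collector.urls:
        absolute = urljoin(page_url, raw)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.path.lower().endswith(TARGET_EXTENSIONS):
            files.add(absolute)
        elif parsed.netloc == base_host:
            pages.add(absolute)  # only follow links on the same host
    return pages, files


sample = """
<a href="/about">About</a>
<a href="https://other.example/page">External</a>
<a href="docs/report.pdf">Report</a>
<img src="/img/team.JPG">
<a href="mailto:admin@example.com">Mail</a>
"""
pages, files = classify_links("https://example.com/index.html", sample)
print(sorted(pages))
print(sorted(files))
```

A recursive crawler would repeat this step for each URL in `pages` until a depth limit is reached, keeping a set of visited URLs to avoid loops. In this sketch files are collected from any host, since images are often served from a separate domain, while pages are followed only on the starting host.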

🦂 Scorpion: Metadata Analysis
Once the files are downloaded, Scorpion comes into play as the analysis tool. Its capabilities include:
- Receiving files as input parameters
- Analyzing EXIF and other embedded metadata
- Displaying basic attributes such as creation date, author, and software used
- Extracting any other available metadata
- Supporting at least the same file types handled by Spider
This part of the project highlights its true educational value: seemingly harmless files can reveal more information than expected.
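As one concrete case of a "harmless" file carrying identifying data: modern office documents (.docx, .xlsx, .pptx) are ZIP archives, and the author name and timestamps usually sit in a small XML part called docProps/core.xml. The sketch below reads that part with the standard library only. Whether Scorpion handles office files this way is not stated in this post, so treat `read_core_properties` and the sample values as hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespaces used by the OOXML core-properties part.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(source):
    """Return core metadata from a .docx/.xlsx/.pptx path or file object."""
    with zipfile.ZipFile(source) as archive:
        try:
            raw = archive.read("docProps/core.xml")
        except KeyError:
            return {}  # the part is optional
    root = ET.fromstring(raw)
    found = {}
    for label, tag in FIELDS.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            found[label] = node.text
    return found


# Build a tiny stand-in archive in memory so the sketch runs without a real document.
core_xml = (
    f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}" '
    f'xmlns:dcterms="{NS["dcterms"]}">'
    "<dc:creator>jdoe</dc:creator>"
    "<dcterms:created>2024-01-15T09:30:00Z</dcterms:created>"
    "</cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", core_xml)
buffer.seek(0)
properties = read_core_properties(buffer)
print(properties)
```

Image formats need a dedicated EXIF parser, which is where external libraries typically come in; the principle of opening the container and reading the fields the author never saw is the same.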

🔍 Real-World Cybersecurity Application
The Spider → Scorpion workflow reflects a common professional scenario:
- Collecting publicly available information from the web
- Analyzing downloaded files
- Identifying exposed sensitive data
- Evaluating risks and providing mitigation recommendations
These techniques are used in pentesting, security audits, digital forensics, OSINT investigations, and security awareness initiatives.
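The "identifying exposed sensitive data" step can be as simple as filtering extracted metadata for fields that tend to identify people, devices, or locations. The sketch below assumes metadata has already been collected into a plain dictionary; the keyword list `SENSITIVE_HINTS` and the function `flag_sensitive` are illustrative choices, not part of Arachnida.

```python
# Assumed keyword list; which fields count as sensitive depends on the engagement.
SENSITIVE_HINTS = ("gps", "author", "creator", "software", "model", "serial", "user")


def flag_sensitive(metadata):
    """Return the subset of a metadata dict whose keys hint at sensitive data."""
    return {
        key: value
        for key, value in metadata.items()
        if any(hint in key.lower() for hint in SENSITIVE_HINTS)
    }


# Hypothetical metadata, shaped like the output of an EXIF reader.
sample_metadata = {
    "ImageWidth": 4032,
    "GPSLatitude": 40.4167,
    "Software": "Adobe Photoshop",
    "Author": "jdoe",
    "ColorSpace": "sRGB",
}
flagged = flag_sensitive(sample_metadata)
print(sorted(flagged))
```

Key matching like this is deliberately crude: it will miss sensitive values hidden under generic field names, so in practice it serves as a first pass before manual review.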
⚙️ Running the Project
Arachnida is developed in Python and uses external libraries for both web scraping and metadata analysis. To run the scripts, first install the dependencies listed in requirements.txt, typically with pip install -r requirements.txt.
After installation, the tools can be executed from the command line, allowing quick and automated testing.
🕷️ Spider

    # Recursive mode
    python3 spider.py -r <URL>
    # Recursive mode + depth level
    python3 spider.py -r <URL> -l <Nº>
    # Recursive mode + directory download path
    python3 spider.py -r <URL> -p <PATH>
    # Recursive mode + silent output
    python3 spider.py -r <URL> -S
    # File mode
    python3 spider.py -f <URL-RESOURCE>
    # Print help message
    python3 spider.py -h

Running Spider generates /data and /logs directories in the repository. /data contains downloaded files, and /logs stores an action log.
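For readers curious how such a command line might be wired up, here is a minimal `argparse` sketch that accepts the same flags shown above. It is not Spider's actual parser: the attribute names, the default depth of 5, the default path `./data`, and the choice to make `-r` and `-f` mutually exclusive are all assumptions made for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="spider.py",
        description="Download files from a website (illustrative sketch).",
    )
    # Assumption: exactly one of the two modes must be chosen.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-r", dest="recursive_url", metavar="URL",
                      help="crawl the site recursively starting at URL")
    mode.add_argument("-f", dest="file_url", metavar="URL-RESOURCE",
                      help="download a single resource")
    # Default values below are assumptions, not taken from the project.
    parser.add_argument("-l", dest="level", type=int, default=5, metavar="N",
                        help="maximum recursion depth")
    parser.add_argument("-p", dest="path", default="./data", metavar="PATH",
                        help="directory where downloads are saved")
    parser.add_argument("-S", dest="silent", action="store_true",
                        help="suppress console output")
    return parser


args = build_parser().parse_args(["-r", "https://example.com", "-l", "2", "-S"])
print(args.recursive_url, args.level, args.path, args.silent)
```

The `-h` option needs no code of its own, because `argparse` generates the help message from the argument definitions.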
🦂 Scorpion

    # Resources mode
    python3 scorpion.py FILE1 FILE2 FILE3 ...
    # Directory mode
    python3 scorpion.py -d <DIRECTORY-PATH>

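Both modes come down to building one list of files to analyze, either from explicit arguments or from the contents of a directory. A minimal sketch of that step, using `pathlib`, could look like this; the function `collect_targets` is hypothetical, and listing only the top level of the directory is an assumption, since the post does not say whether subdirectories are scanned.

```python
from pathlib import Path


def collect_targets(files=(), directory=None):
    """Return the sorted list of existing regular files to analyze."""
    candidates = [Path(name) for name in files]
    if directory is not None:
        # Non-recursive listing; whether the real tool descends into
        # subdirectories is not stated, so this is an assumption.
        candidates.extend(Path(directory).iterdir())
    # Silently drop paths that do not exist or are not regular files.
    return sorted(path for path in candidates if path.is_file())
```

Calling `collect_targets(directory="data")` would then yield every regular file directly inside the data folder, ready to be passed to a metadata reader one by one.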
🐳 Running a Docker Test
- Install Docker Desktop and run the application.
- Install make and use the Makefile to build the container and access a bash session:

    make && make exec

    # Build image and container ->> make
    # Get a bash from container ->> make exec
    # Build a new container ->> make dock
    # Build image ->> make image
    # Remove image and container ->> make fclean

Inside the container, two additional directories are created in the user’s /home directory and synchronized with both tools’ directories using volumes. This allows sharing and persisting data between the container environment and the user’s system.
🚀 Conclusion
Arachnida is an excellent project to learn about:
- Controlled web scraping
- Metadata and EXIF analysis
- Python automation
- Basic OSINT techniques
- Privacy awareness and information exposure
Beyond the code, it leaves a clear lesson: in cybersecurity, what is not immediately visible is often the most interesting… and the most dangerous 🕵️‍♂️💻.
📌 Source code available on GitHub
- Date: 2025-12-29
- Categories: Python - Security
- Source code: goldcod3/Arachnida