Arachnida – Gomesold

Description

Arachnida: Web Scraping and Metadata Analysis in Cybersecurity

Within the cybersecurity bootcamp at 42Madrid, the Arachnida project combines two disciplines with direct real-world relevance: automated information gathering and metadata analysis. The result is two complementary tools, Spider and Scorpion, designed to show how much sensitive information can be passively extracted from publicly available web content.

This project is not just about downloading files; it’s about learning to look beyond the obvious and detect data that is often published unintentionally.

🌐 What Is Metadata and Why Does It Matter?

Metadata is data that describes other data. In files such as images, PDFs, or office documents, metadata can include details like:

Creation date and time
Software used to generate the file
User or author information
Camera or device model
Operating system details
GPS coordinates (in some cases)

From a cybersecurity and OSINT (Open Source Intelligence) perspective, this information can reveal critical insights about individuals, companies, or infrastructures without exploiting any vulnerabilities.
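To make the GPS item concrete: EXIF stores each coordinate as three rational numbers (degrees, minutes, seconds) plus a reference letter (N, S, E, or W). The sketch below converts such a value into the signed decimal degrees that most map services accept. The function name dms_to_decimal and the (numerator, denominator) input shape are illustrative assumptions, not part of Arachnida.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert an EXIF-style GPS value to signed decimal degrees.

    dms: three (numerator, denominator) pairs for degrees, minutes, seconds.
    ref: "N", "S", "E", or "W".
    """
    degrees, minutes, seconds = (Fraction(num, den) for num, den in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative by convention.
    if ref.upper() in ("S", "W"):
        decimal = -decimal
    return float(decimal)


# Hypothetical sample values: 40 deg 25' 0" N, 3 deg 42' 0" W.
latitude = dms_to_decimal(((40, 1), (25, 1), (0, 1)), "N")
longitude = dms_to_decimal(((3, 1), (42, 1), (0, 1)), "W")
print(round(latitude, 4), round(longitude, 4))  # 40.4167 -3.7
```

A photo carrying values like these pinpoints where it was taken to within a few meters, which is why location data is usually the first thing to strip before publishing.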

🧩 Project Objective

The main goal of Arachnida is to develop two tools that work together to:

Automatically extract public files from a website
Analyze these files for sensitive metadata
Display the information clearly
Raise awareness about the importance of removing metadata before sharing content online

🕷️ Spider: Automated File Extraction

Spider is responsible for the data collection phase. Its main functions include:

Receiving a URL as a parameter
Crawling the website recursively, following internal links
Automatically downloading relevant files such as images, PDFs, and documents
Saving files locally for later analysis

This behavior closely resembles the crawling and scraping tools used in real-world security audits and passive information gathering.
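The functions listed above can be sketched with the standard library alone. This is a minimal illustration, not Spider's actual code: the names LinkCollector and crawl, the extension list, and the injected fetch callable are all assumptions. The crawl is breadth-first, follows only links on the starting host, and stops following links once max_depth is reached.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed set of "relevant" file types; the real tool may use a different list.
FILE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".pdf", ".docx")


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def crawl(start_url, fetch, max_depth=2):
    """Return the sorted file URLs reachable from start_url.

    fetch is a callable that takes a URL and returns the page HTML as text.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    files = set()
    queue = deque([(start_url, 0)])
    while queue:
        page, depth = queue.popleft()
        parser = LinkCollector()
        parser.feed(fetch(page))
        for raw in parser.urls:
            # Resolve relative links and drop "#fragment" suffixes.
            url = urldefrag(urljoin(page, raw)).url
            parts = urlparse(url)
            if parts.path.lower().endswith(FILE_EXTENSIONS):
                files.add(url)  # files are collected even from other hosts
            elif parts.netloc == host and url not in seen and depth < max_depth:
                seen.add(url)
                queue.append((url, depth + 1))
    return sorted(files)


# Hypothetical in-memory site so the sketch runs without network access.
SITE = {
    "http://example.test/": '<a href="/about.html">About</a><img src="/img/logo.png">'
                            '<a href="http://other.test/x.html">External</a>',
    "http://example.test/about.html": '<a href="/deep.html">Deep</a>'
                                      '<a href="docs/report.pdf">Report</a>',
    "http://example.test/deep.html": '<img src="secret.jpg">',
}
found = crawl("http://example.test/", lambda url: SITE.get(url, ""), max_depth=1)
print(found)
```

Passing fetch in as a parameter keeps the traversal logic testable offline. Against a real site it could wrap urllib.request.urlopen, ideally with a timeout and some rate limiting so the crawl stays polite.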

🦂 Scorpion: Metadata Analysis

Once the files are downloaded, Scorpion comes into play as the analysis tool. Its capabilities include:

Receiving files as input parameters
Analyzing EXIF and other embedded metadata
Displaying basic attributes such as creation date, author, and software used
Extracting any other available metadata
Supporting at least the same file types handled by Spider

This part of the project highlights its true educational value: seemingly harmless files can reveal more information than expected.
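As one illustration of how much a harmless-looking file can carry, the sketch below reads author and timestamp fields from a .docx file. A .docx is a ZIP archive whose docProps/core.xml part holds these properties, so the standard library is enough. EXIF data in images would normally be read with a dedicated library instead. The function name and the output keys are assumptions for this example and are not taken from Scorpion.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output label -> element inside docProps/core.xml.
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_docx_core_properties(source):
    """Return the core properties found in a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        try:
            xml_bytes = archive.read("docProps/core.xml")
        except KeyError:
            return {}  # the archive has no core properties part
    # Note: for untrusted input, a hardened parser such as defusedxml is safer.
    root = ET.fromstring(xml_bytes)
    found = {}
    for label, tag in FIELDS.items():
        element = root.find(tag, NAMESPACES)
        if element is not None and element.text:
            found[label] = element.text
    return found


# Build a tiny hypothetical document in memory to exercise the function.
core_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<cp:coreProperties'
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    ' xmlns:dcterms="http://purl.org/dc/terms/">'
    "<dc:creator>alice</dc:creator>"
    "<dcterms:created>2024-01-02T03:04:05Z</dcterms:created>"
    "</cp:coreProperties>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("docProps/core.xml", core_xml)
sample.seek(0)
properties = read_docx_core_properties(sample)
print(properties)
```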

🔍 Real-World Cybersecurity Application

The Spider → Scorpion workflow reflects a common professional scenario:

Collecting publicly available information from the web
Analyzing downloaded files
Identifying exposed sensitive data
Evaluating risks and providing mitigation recommendations

These techniques are used in pentesting, security audits, digital forensics, OSINT investigations, and security awareness initiatives.
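The "identifying exposed sensitive data" step can be as simple as filtering extracted metadata for fields that tend to identify people, devices, or locations. The sketch below is a hedged illustration: the hint list and the function name flag_sensitive are assumptions, and a real audit would tune them to the engagement.

```python
# Substrings that suggest a field may identify a person, device, or location.
# This list is an assumption for illustration, not an established standard.
SENSITIVE_HINTS = (
    "gps", "author", "creator", "artist", "owner",
    "software", "make", "model", "lastmodifiedby",
)


def flag_sensitive(metadata):
    """Return (key, value) pairs whose key looks sensitive and has a value."""
    findings = []
    for key, value in metadata.items():
        normalized = key.lower().replace("_", "").replace(" ", "")
        if value and any(hint in normalized for hint in SENSITIVE_HINTS):
            findings.append((key, value))
    return findings


# Hypothetical metadata, shaped like the output of an extraction step.
sample = {
    "Make": "Canon",
    "GPSLatitude": 40.4167,
    "ImageWidth": 800,
    "Author": "",
    "last_modified_by": "j.doe",
}
for key, value in flag_sensitive(sample):
    print(f"[!] {key}: {value}")
```

Substring matching is deliberately broad, so it will produce some false positives. For a first triage pass that is usually preferable to missing a field.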

⚙️ Running the Project

Arachnida is developed in Python and uses external libraries for both web scraping and metadata analysis. To run the scripts, first install the dependencies listed in requirements.txt:

python3 -m pip install -r requirements.txt

After installation, the tools can be executed from the command line, allowing quick and automated testing.

🕷️ Spider

# Recursive mode
python3 spider.py -r <URL>

# Recursive mode + maximum depth level
python3 spider.py -r <URL> -l <N>

# Recursive mode + download directory path
python3 spider.py -r <URL> -p <PATH>

# Recursive mode + silent output
python3 spider.py -r <URL> -S

# File mode
python3 spider.py -f <URL-RESOURCE>

# Print help message
python3 spider.py -h

Running Spider generates /data and /logs directories in the repository. /data contains downloaded files, and /logs stores an action log.
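The sketch below shows one way a downloader could turn a URL into a safe local path under a data directory and append a line to an action log. It is an illustration under assumptions, not Spider's code: the function names, the spider.log file name, the log line format, and the collision-handling scheme are all hypothetical.

```python
from datetime import datetime
from pathlib import Path
from urllib.parse import unquote, urlparse


def local_path_for(url, data_dir):
    """Map a URL to a file path inside data_dir without overwriting files."""
    # Keeping only the final path component blocks "../" traversal tricks.
    name = Path(unquote(urlparse(url).path)).name
    if name in ("", ".."):
        name = "index"
    target = Path(data_dir) / name
    stem, suffix = target.stem, target.suffix
    counter = 1
    while target.exists():
        target = target.with_name(f"{stem}_{counter}{suffix}")
        counter += 1
    return target


def save_file(url, content, data_dir="data", log_dir="logs"):
    """Write content to data_dir and record the action in log_dir."""
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    target = local_path_for(url, data_dir)
    target.write_bytes(content)
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(Path(log_dir) / "spider.log", "a", encoding="utf-8") as log:
        log.write(f"{stamp} SAVED {url} -> {target}\n")
    return target
```

Calling save_file twice with URLs that end in the same file name produces logo.png and then logo_1.png, so files from different pages do not overwrite each other.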

🦂 Scorpion

# Resources mode
python3 scorpion.py FILE1 FILE2 FILE3 ...

# Directory mode
python3 scorpion.py -d <DIRECTORY-PATH>
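A directory mode like the one above could be sketched as follows: walk the directory, keep the supported file types, and report basic filesystem attributes for each file. The extension set, function names, and output keys are assumptions for illustration. Embedded metadata such as EXIF would come from a separate parsing step.

```python
from datetime import datetime, timezone
from pathlib import Path

# Assumed list of supported types; the real tool may differ.
SUPPORTED = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".pdf", ".docx"}


def basic_attributes(path):
    """Return name, size, and last-modified time (UTC) for one file."""
    path = Path(path)
    info = path.stat()
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": path.name,
        "size_bytes": info.st_size,
        "modified": modified.isoformat(timespec="seconds"),
    }


def scan_directory(directory):
    """Collect basic attributes for every supported file in directory."""
    return [
        basic_attributes(entry)
        for entry in sorted(Path(directory).iterdir())
        if entry.is_file() and entry.suffix.lower() in SUPPORTED
    ]
```

Note that the filesystem timestamp of a downloaded file reflects when it was saved locally, not when the original was created. That distinction is exactly why the embedded metadata is the more revealing source.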

🐳 Running a Docker Test

Install Docker Desktop and start it.
Install make, then use the Makefile to build the container and open a bash session:

make && make exec

# Build the image and the container
make
# Open a bash session in the container
make exec
# Build a new container
make dock
# Build the image
make image
# Remove the image and the container
make fclean

Inside the container, two additional directories are created in the user’s /home directory and mapped to the two tools’ directories through volumes. Data is therefore shared between the container and the host system, and it persists after the container stops.

🚀 Conclusion

Arachnida is an excellent project to learn about:

Controlled web scraping
Metadata and EXIF analysis
Python automation
Basic OSINT techniques
Privacy awareness and information exposure

Beyond the code, it leaves a clear lesson: in cybersecurity, what is not immediately visible is often the most interesting… and the most dangerous 🕵️‍♂️💻.

📌 Source code available on GitHub

Details

Date: 2025-12-29
Categories: Python - Security
Source Code: goldcod3/Arachnida