Arachnida


Arachnida: Web Scraping and Metadata Analysis in Cybersecurity

Within the cybersecurity bootcamp at 42Madrid, the Arachnida project combines two disciplines that matter in real-world security work: automated information gathering and metadata analysis. The result is two complementary tools, Spider and Scorpion, designed to show how much sensitive information can be passively extracted from publicly available web content.

This project is not just about downloading files; it’s about learning to look beyond the obvious and detect data that is often published unintentionally.


🌐 What Is Metadata and Why Does It Matter?

Metadata is information associated with other data. In files such as images, PDFs, or office documents, metadata can include details like:

  • Creation date and time

  • Software used to generate the file

  • User or author information

  • Camera or device model

  • Operating system details

  • GPS coordinates (in some cases)

From a cybersecurity and OSINT (Open Source Intelligence) perspective, this information can reveal critical insights about individuals, companies, or infrastructures without exploiting any vulnerabilities.
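To make the GPS case concrete: EXIF stores latitude and longitude as degrees, minutes, and seconds plus a hemisphere letter (N, S, E, or W). The sketch below shows the standard conversion to signed decimal degrees, the form most mapping tools accept. The function name and the sample coordinates are made up for illustration.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    if ref not in ("N", "S", "E", "W"):
        raise ValueError(f"unexpected hemisphere reference: {ref!r}")
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if ref in ("S", "W") else value


# Hypothetical coordinates, made up for illustration
lat = dms_to_decimal(40, 30, 0, "N")
lon = dms_to_decimal(3, 45, 0, "W")
print(f"{lat:.4f}, {lon:.4f}")  # prints "40.5000, -3.7500"
```

A single photo carrying these two values is enough to place its author on a map, which is why GPS tags are usually the first thing to check.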


🧩 Project Objective

The main goal of Arachnida is to develop two tools that work together to:

  • Automatically extract public files from a website

  • Analyze these files for sensitive metadata

  • Display the information clearly

  • Raise awareness about the importance of removing metadata before sharing content online

🕷️ Spider: Automated File Extraction

Spider is responsible for the data collection phase. Its main functions include:

  • Receiving a URL as a parameter

  • Crawling the website recursively, following internal links

  • Automatically downloading relevant files such as images, PDFs, and documents

  • Saving files locally for later analysis

This behavior, commonly known as web scraping, closely resembles that of real-world tools used in security audits and passive information-gathering processes.
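The core of the crawl step described above can be sketched with the standard library alone: parse a page, resolve every link against the page URL, then split the results into files to download and internal pages to visit next. This is a minimal sketch, not Spider's actual code. The names LinkCollector, classify_links, and the extension list are assumptions, and so is the choice to accept files hosted on other domains.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of extensions; the real tool may target a different list.
FILE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".pdf", ".docx")


class LinkCollector(HTMLParser):
    """Collect absolute URLs from href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page they came from.
                self.urls.append(urljoin(self.base_url, value))


def classify_links(base_url, html):
    """Split a page's links into files to download and internal pages to visit."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    site = urlparse(base_url).netloc
    files, pages = [], []
    for url in collector.urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.path.lower().endswith(FILE_EXTENSIONS):
            files.append(url)  # assumption: files are kept regardless of host
        elif parsed.netloc == site:
            pages.append(url)  # only internal pages are followed
    return files, pages
```

A recursive crawler would call classify_links on each fetched page, queue the returned pages until the depth limit is reached, and keep a set of visited URLs to avoid loops.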

🦂 Scorpion: Metadata Analysis

Once the files are downloaded, Scorpion comes into play as the analysis tool. Its capabilities include:

  • Receiving files as input parameters

  • Analyzing EXIF and other embedded metadata

  • Displaying basic attributes such as creation date, author, and software used

  • Extracting any other available metadata

  • Supporting at least the same file types handled by Spider

This part of the project highlights its true educational value: seemingly harmless files can reveal more information than expected.
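As a small illustration of how embedded metadata can be read, the sketch below walks the chunks of a PNG file and returns its tEXt entries. The PNG format predefines keywords such as Author, Software, and Creation Time for these chunks. This is only one narrow case, chosen because it needs nothing beyond the standard library. It does not read EXIF, it ignores compressed (zTXt) and international (iTXt) text chunks, and it does not verify chunk CRCs. The function name is an assumption, not part of Scorpion.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_text(data):
    """Return the keyword/text pairs stored in a PNG's tEXt chunks."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    metadata = {}
    offset = len(PNG_SIGNATURE)
    while offset + 8 <= len(data):
        # Each chunk is: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[offset:offset + 8])
        payload = data[offset + 8:offset + 8 + length]
        if chunk_type == b"tEXt":
            # Payload layout: keyword, a null separator, then Latin-1 text.
            keyword, _, text = payload.partition(b"\x00")
            metadata[keyword.decode("latin-1")] = text.decode("latin-1")
        elif chunk_type == b"IEND":
            break
        offset += 12 + length  # header (8) + payload + CRC (4)
    return metadata
```

With a hypothetical file photo.png, usage would look like read_png_text(open("photo.png", "rb").read()), which returns a dictionary such as {"Software": "..."} when the image editor left its name behind.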


🔍 Real-World Cybersecurity Application

The Spider → Scorpion workflow reflects a common professional scenario:

  • Collecting publicly available information from the web

  • Analyzing downloaded files

  • Identifying exposed sensitive data

  • Evaluating risks and providing mitigation recommendations

These techniques are used in pentesting, security audits, digital forensics, OSINT investigations, and security awareness initiatives.
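The "identifying exposed sensitive data" step can be as simple as filtering extracted metadata against a watch list. The sketch below assumes the metadata has already been loaded into a dictionary. The watch list and the function name are hypothetical, and which fields count as sensitive depends on the audit.

```python
# Hypothetical watch list; adjust it to the scope of the audit.
SENSITIVE_KEYS = ("author", "creator", "software", "gps", "make", "model")


def flag_sensitive(metadata):
    """Return the metadata entries whose key contains a watched term."""
    return {
        key: value
        for key, value in metadata.items()
        if any(term in key.lower() for term in SENSITIVE_KEYS)
    }


# Made-up metadata for illustration
sample = {"Author": "jdoe", "Width": 800, "GPSLatitude": 40.5, "Software": "GIMP 2.10"}
print(flag_sensitive(sample))
# prints {'Author': 'jdoe', 'GPSLatitude': 40.5, 'Software': 'GIMP 2.10'}
```

Substring matching is deliberately loose: it catches variants such as GPSLatitude and GPSLongitude at the cost of occasional false positives, which is usually the right trade-off for a first pass.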


⚙️ Running the Project

Arachnida is developed in Python and uses external libraries for both web scraping and metadata analysis. To run the scripts, first install the dependencies listed in requirements.txt:

pip install -r requirements.txt

After installation, the tools can be executed from the command line, allowing quick and automated testing.

🕷️ Spider

# Recursive mode
python3 spider.py -r <URL>

# Recursive mode + depth level
python3 spider.py -r <URL> -l <Nº>

# Recursive mode + download directory path
python3 spider.py -r <URL> -p <PATH>

# Recursive mode + silent output
python3 spider.py -r <URL> -S

# File mode
python3 spider.py -f <URL-RESOURCE>

# Print help message
python3 spider.py -h

Running Spider creates the /data and /logs directories inside the repository: /data holds the downloaded files, and /logs stores a log of the actions performed.
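For readers curious how a command-line interface like the one above could be wired up, here is a minimal argparse sketch. It is not taken from spider.py. It assumes that -r and -f each take the URL as their value and cannot be combined, and it leaves the depth and path defaults unset because the write-up does not state them. All destination names are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags listed above (assumed semantics)."""
    parser = argparse.ArgumentParser(
        prog="spider.py", description="Download files from a website."
    )
    # Assumption: recursive mode and file mode are mutually exclusive.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-r", metavar="URL", dest="recursive_url",
                      help="crawl the site recursively starting at URL")
    mode.add_argument("-f", metavar="URL-RESOURCE", dest="file_url",
                      help="download a single resource")
    parser.add_argument("-l", metavar="N", dest="depth", type=int, default=None,
                        help="maximum recursion depth")
    parser.add_argument("-p", metavar="PATH", dest="path", default=None,
                        help="directory where files are saved")
    parser.add_argument("-S", dest="silent", action="store_true",
                        help="silent output")
    return parser
```

Calling build_parser().parse_args(["-r", "https://example.com", "-l", "2"]) yields a namespace with recursive_url and depth set, and argparse supplies the -h help message automatically.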

🦂 Scorpion

# Resources mode
python3 scorpion.py FILE1 FILE2 FILE3 ...

# Directory mode
python3 scorpion.py -d <DIRECTORY-PATH>
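Both modes boil down to building one list of files to analyze. The sketch below shows one way to do that with pathlib. The function name is hypothetical, and it assumes directory mode looks only at the top level of the directory, which the write-up does not specify.

```python
from pathlib import Path


def collect_targets(files=(), directory=None):
    """Return the existing regular files named directly or found in `directory`."""
    candidates = [Path(name) for name in files]
    if directory is not None:
        # Assumption: directory mode is not recursive.
        candidates.extend(sorted(Path(directory).iterdir()))
    # Silently drop subdirectories and paths that do not exist.
    return [path for path in candidates if path.is_file()]
```

Resources mode would correspond to collect_targets(files=[...]) and directory mode to collect_targets(directory=...), with each returned path then passed to the metadata readers.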

🐳 Running a Docker Test

  • Install Docker Desktop and run the application.

  • Install make and use the Makefile to build the container and access a bash session:

make && make exec

The Makefile provides the following targets:

# Build the image and the container
make
# Open a bash session in the container
make exec
# Build a new container
make dock
# Build the image
make image
# Remove the image and the container
make fclean

Inside the container, two additional directories are created in the user’s /home directory and synchronized with both tools’ directories using volumes. This allows sharing and persisting data between the container environment and the user’s system.


🚀 Conclusion

Arachnida is an excellent project to learn about:

  • Controlled web scraping

  • Metadata and EXIF analysis

  • Python automation

  • Basic OSINT techniques

  • Privacy awareness and information exposure

Beyond the code, it leaves a clear lesson: in cybersecurity, what is not immediately visible is often the most interesting… and the most dangerous 🕵️‍♂️💻.

📌 Source code available on GitHub
