Data Processing Pipeline
From raw data to actionable insights
Market Research & Platform Selection
Conducted comprehensive research into the Polish job market to identify the most suitable data source. After evaluating multiple job boards, we selected JustJoin.it for its extensive tech job listings and structured data format.
Web Structure Investigation & Scraper Development
Analyzed the website's structure, API endpoints, and data formats. Built a robust web scraper capable of extracting job offers across multiple technology categories (Java, PHP, Ruby, Python, JavaScript, Data).
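As a rough illustration, the sketch below shows the general shape of such a scraper in Python. The listing URL, query parameters, and response fields are placeholders for illustration only, not JustJoin.it's actual API.

```python
import json
import time
import requests

# Hypothetical listing endpoint and category names -- placeholders, not the site's documented API.
LISTING_URL = "https://example-job-board/api/offers"
CATEGORIES = ["java", "php", "ruby", "python", "javascript", "data"]

def fetch_category(category: str) -> list[dict]:
    """Fetch all offers for one technology category, one page at a time."""
    offers, page = [], 1
    while True:
        resp = requests.get(LISTING_URL, params={"category": category, "page": page}, timeout=30)
        resp.raise_for_status()
        batch = resp.json()            # assumed to be a list of offer dicts
        if not batch:
            break
        offers.extend(batch)
        page += 1
        time.sleep(1)                  # be polite to the server
    return offers

if __name__ == "__main__":
    for cat in CATEGORIES:
        with open(f"offers_{cat}.json", "w", encoding="utf-8") as f:
            json.dump(fetch_category(cat), f, ensure_ascii=False, indent=2)
```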
Data Scraping
Executed the scraper to collect job offers from all target categories. The raw data was aggregated into offersCombined.json, containing thousands of job postings with details on skills, salaries, locations, and requirements.
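A minimal sketch of that aggregation step, assuming one JSON dump per category (the offers_*.json naming follows the scraper sketch above and is illustrative):

```python
import glob
import json

combined = []
# Merge every per-category dump (e.g. offers_java.json, offers_python.json, ...) into one list.
for path in glob.glob("offers_*.json"):
    with open(path, encoding="utf-8") as f:
        combined.extend(json.load(f))

with open("offersCombined.json", "w", encoding="utf-8") as f:
    json.dump(combined, f, ensure_ascii=False, indent=2)

print(f"Aggregated {len(combined)} offers")
```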
Core Data Processing Pipeline
Implemented a comprehensive data cleaning and transformation pipeline to standardize and enrich the raw data:
graph TD
RawData[("Raw Data<br/>(offersCombined.json)")] --> |Load JSON| DeDup{Duplicate URL?}
DeDup -- Yes --> Skip[Skip Entry]
DeDup -- No --> Extraction
subgraph Processing["Processing Pipeline"]
Extraction[Extract Data]
%% Location Branch
Extraction --> LocProc[Location Processing]
LocProc --> |"Warsaw → Warszawa"| City[Clean City]
%% Salary Branch
Extraction --> SalProc[Salary Processing]
SalProc --> |"Hourly × 168"| Monthly[Monthly Basis]
Monthly --> |"NBP API Rates"| EurConv[Convert to EUR]
%% Skill Branch
Extraction --> SkillProc[Skill Categorizer]
SkillProc --> |"Embeddings & Cosine Sim"| AI[Sentence Transformer]
AI --> |"Similarity > 0.65"| Category[Standardized Category]
end
City --> ObjBuilder[Build Pydantic Object]
EurConv --> ObjBuilder
Category --> ObjBuilder
ObjBuilder --> |Save| Output[("Clean Data<br/>(ClearOffers2.json)")]
style RawData fill:#e8d5b7
style Output fill:#e8d5b7
style Processing fill:#f5f0e8
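The sketch below illustrates the core transformations from the diagram in Python. The 168-hour monthly basis, the NBP exchange rates, the 0.65 similarity threshold, and the Pydantic output object come from the pipeline above; the field names, category list, embedding model, and exact response shape of the NBP call are assumptions for illustration.

```python
import json
import requests
from functools import lru_cache
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer, util

HOURS_PER_MONTH = 168          # hourly rates are normalized to a monthly basis
SIMILARITY_THRESHOLD = 0.65    # minimum cosine similarity to accept a category match

# Standardized categories and the embedding model are illustrative assumptions.
CATEGORIES = ["Backend", "Frontend", "DevOps", "Data", "Mobile", "Testing"]
model = SentenceTransformer("all-MiniLM-L6-v2")
category_embeddings = model.encode(CATEGORIES)

class CleanOffer(BaseModel):
    url: str
    city: str
    salary_eur: float | None
    skill_categories: list[str]

def current_eur_rate() -> float:
    """Fetch the current PLN/EUR mid rate from the NBP table A endpoint."""
    resp = requests.get("https://api.nbp.pl/api/exchangerates/rates/a/eur/?format=json", timeout=30)
    resp.raise_for_status()
    return resp.json()["rates"][0]["mid"]

@lru_cache(maxsize=None)
def categorize_skill(skill: str) -> str | None:
    """Map a raw skill name to a standardized category via embedding similarity."""
    scores = util.cos_sim(model.encode(skill), category_embeddings)[0]
    best = int(scores.argmax())
    return CATEGORIES[best] if float(scores[best]) > SIMILARITY_THRESHOLD else None

eur_rate = current_eur_rate()
seen_urls: set[str] = set()
clean_offers: list[CleanOffer] = []

with open("offersCombined.json", encoding="utf-8") as f:
    raw_offers = json.load(f)

for offer in raw_offers:                                  # field names are assumptions
    if offer["url"] in seen_urls:                         # skip duplicate URLs
        continue
    seen_urls.add(offer["url"])

    city = "Warszawa" if offer["city"] == "Warsaw" else offer["city"]

    salary_pln = offer.get("salary_pln")
    if salary_pln is not None and offer.get("salary_unit") == "hourly":
        salary_pln *= HOURS_PER_MONTH                     # hourly -> monthly basis
    salary_eur = round(salary_pln / eur_rate, 2) if salary_pln is not None else None

    categories = sorted({c for s in offer.get("skills", []) if (c := categorize_skill(s))})
    clean_offers.append(CleanOffer(url=offer["url"], city=city,
                                   salary_eur=salary_eur, skill_categories=categories))

with open("ClearOffers2.json", "w", encoding="utf-8") as f:
    json.dump([o.model_dump() for o in clean_offers], f, ensure_ascii=False, indent=2)
```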
Visualization-Specific Data Processing
Generated specialized datasets for each visualization component:
calculateJaccardIndex.js
Computes skill co-occurrence patterns using the Jaccard similarity index for the skill relationships network (see the sketch after this list)
calculateBoxplotData.js
Generates salary distribution statistics (quartiles, outliers) grouped by skill
and experience level
processExperienceLevel.js
Aggregates job offer counts and statistics by experience level (Junior, Mid,
Senior, Lead)
processContractType.js
Analyzes the distribution of contract types (B2B, UoP, etc.) across job offers
processWorkMode.js
Categorizes work arrangements (Remote, Hybrid, Office) for market trend
analysis
CategoriesCount.py
Counts job offers per technology category for treemap visualization
SkillToSalary.py
Correlates individual skills with salary ranges for skill value analysis
AverageSalary.py
Calculates average salaries segmented by experience level for career trajectory
insights
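For reference, here is the idea behind calculateJaccardIndex.js as a short Python sketch (the project script itself is JavaScript; this only illustrates the Jaccard computation, and the "skills" field name is an assumption): the Jaccard index of two skills is the number of offers mentioning both divided by the number of offers mentioning either.

```python
from itertools import combinations

def jaccard_index(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| -- 1.0 means the skills always co-occur, 0.0 means never."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def skill_cooccurrence(offers: list[dict]) -> dict[tuple[str, str], float]:
    """For every pair of skills, compare the sets of offers each skill appears in."""
    offers_per_skill: dict[str, set[int]] = {}
    for i, offer in enumerate(offers):
        for skill in offer.get("skills", []):      # field name is an assumption
            offers_per_skill.setdefault(skill, set()).add(i)

    return {
        (s1, s2): jaccard_index(offers_per_skill[s1], offers_per_skill[s2])
        for s1, s2 in combinations(sorted(offers_per_skill), 2)
    }

# Example: Python and SQL co-occur in 1 of the 3 offers mentioning either skill -> 1/3.
offers = [{"skills": ["Python", "SQL"]}, {"skills": ["Python"]}, {"skills": ["SQL", "AWS"]}]
print(skill_cooccurrence(offers)[("Python", "SQL")])   # 0.333...
```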
Data → Insights
The final transformation brings processed data to life through interactive visualizations. This entire pipeline was built with a clear purpose: to help us, as developers and data enthusiasts, make informed decisions about which skills and technologies to learn next.
By analyzing thousands of job offers, salary ranges, and skill combinations, we can now see clear patterns in the market. Which technologies are in highest demand? What skills command the best salaries? How do different experience levels affect compensation? What's the optimal career progression path?
These insights transform raw market data into actionable knowledge, empowering anyone to strategically plan their learning journey and career development based on real market trends rather than guesswork.