Data Processing Pipeline
From raw data to actionable insights
Market Research & Platform Selection
Conducted comprehensive research into the Polish job market to identify the most suitable data source. After evaluating multiple job boards, we selected JustJoin.it for its extensive tech job listings and structured data format.
Web Structure Investigation & Scraper Development
Analyzed the website's structure, API endpoints, and data formats. Built a robust web scraper capable of extracting job offers across multiple technology categories (Java, PHP, Ruby, Python, JavaScript, Data).
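As a rough illustration, the sketch below shows the general shape of such a scraper in Python. The listing URL, query parameters, and response fields are placeholders for illustration only, not JustJoin.it's actual API.

```python
import json
import time
import requests

# Hypothetical listing endpoint and category names -- placeholders, not the site's documented API.
LISTING_URL = "https://example-job-board/api/offers"
CATEGORIES = ["java", "php", "ruby", "python", "javascript", "data"]

def fetch_category(category: str) -> list[dict]:
    """Fetch all offers for one technology category, one page at a time."""
    offers, page = [], 1
    while True:
        resp = requests.get(LISTING_URL, params={"category": category, "page": page}, timeout=30)
        resp.raise_for_status()
        batch = resp.json()            # assumed to be a list of offer dicts
        if not batch:
            break
        offers.extend(batch)
        page += 1
        time.sleep(1)                  # be polite to the server
    return offers

if __name__ == "__main__":
    for cat in CATEGORIES:
        with open(f"offers_{cat}.json", "w", encoding="utf-8") as f:
            json.dump(fetch_category(cat), f, ensure_ascii=False, indent=2)
```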
Data Scraping
Executed the scraper to collect job offers from all target categories. The raw data was aggregated into offersCombined.json, containing thousands of job postings with details on skills, salaries, locations, and requirements.
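A minimal sketch of that aggregation step, assuming one JSON dump per category (the offers_*.json naming follows the scraper sketch above and is illustrative):

```python
import glob
import json

combined = []
# Merge every per-category dump (e.g. offers_java.json, offers_python.json, ...) into one list.
for path in glob.glob("offers_*.json"):
    with open(path, encoding="utf-8") as f:
        combined.extend(json.load(f))

with open("offersCombined.json", "w", encoding="utf-8") as f:
    json.dump(combined, f, ensure_ascii=False, indent=2)

print(f"Aggregated {len(combined)} offers")
```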
Core Data Processing Pipeline
Implemented a comprehensive data cleaning and transformation pipeline to standardize and enrich the raw data:
graph TD
RawData[("Raw Data<br/>(offersCombined.json)")] --> |Load JSON| DeDup{Duplicate URL?}
DeDup -- Yes --> Skip[Skip Entry]
DeDup -- No --> Extraction
subgraph Processing["Processing Pipeline"]
Extraction[Extract Data]
%% Location Branch
Extraction --> LocProc[Location Processing]
LocProc --> |"Warsaw → Warszawa"| City[Clean City]
%% Salary Branch
Extraction --> SalProc[Salary Processing]
SalProc --> |"Hourly × 168"| Monthly[Monthly Basis]
Monthly --> |"NBP API Rates"| EurConv[Convert to EUR]
%% Skill Branch
Extraction --> SkillProc[Skill Categorizer]
SkillProc --> |"Embeddings & Cosine Sim"| AI[Sentence Transformer]
AI --> |"Similarity > 0.65"| Category[Standardized Category]
end
City --> ObjBuilder[Build Pydantic Object]
EurConv --> ObjBuilder
Category --> ObjBuilder
ObjBuilder --> |Save| Output[("Clean Data<br/>(ClearOffers2.json)")]
style RawData fill:#e8d5b7
style Output fill:#e8d5b7
style Processing fill:#f5f0e8
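The sketch below illustrates the core transformations from the diagram in Python. The 168-hour monthly basis, the NBP exchange rates, the 0.65 similarity threshold, and the Pydantic output object come from the pipeline above; the field names, category list, embedding model, and exact response shape of the NBP call are assumptions for illustration.

```python
import json
import requests
from functools import lru_cache
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer, util

HOURS_PER_MONTH = 168          # hourly rates are normalized to a monthly basis
SIMILARITY_THRESHOLD = 0.65    # minimum cosine similarity to accept a category match

# Standardized categories and the embedding model are illustrative assumptions.
CATEGORIES = ["Backend", "Frontend", "DevOps", "Data", "Mobile", "Testing"]
model = SentenceTransformer("all-MiniLM-L6-v2")
category_embeddings = model.encode(CATEGORIES)

class CleanOffer(BaseModel):
    url: str
    city: str
    salary_eur: float | None
    skill_categories: list[str]

def current_eur_rate() -> float:
    """Fetch the current PLN/EUR mid rate from the NBP table A endpoint."""
    resp = requests.get("https://api.nbp.pl/api/exchangerates/rates/a/eur/?format=json", timeout=30)
    resp.raise_for_status()
    return resp.json()["rates"][0]["mid"]

@lru_cache(maxsize=None)
def categorize_skill(skill: str) -> str | None:
    """Map a raw skill name to a standardized category via embedding similarity."""
    scores = util.cos_sim(model.encode(skill), category_embeddings)[0]
    best = int(scores.argmax())
    return CATEGORIES[best] if float(scores[best]) > SIMILARITY_THRESHOLD else None

eur_rate = current_eur_rate()
seen_urls: set[str] = set()
clean_offers: list[CleanOffer] = []

with open("offersCombined.json", encoding="utf-8") as f:
    raw_offers = json.load(f)

for offer in raw_offers:                                  # field names are assumptions
    if offer["url"] in seen_urls:                         # skip duplicate URLs
        continue
    seen_urls.add(offer["url"])

    city = "Warszawa" if offer["city"] == "Warsaw" else offer["city"]

    salary_pln = offer.get("salary_pln")
    if salary_pln is not None and offer.get("salary_unit") == "hourly":
        salary_pln *= HOURS_PER_MONTH                     # hourly -> monthly basis
    salary_eur = round(salary_pln / eur_rate, 2) if salary_pln is not None else None

    categories = sorted({c for s in offer.get("skills", []) if (c := categorize_skill(s))})
    clean_offers.append(CleanOffer(url=offer["url"], city=city,
                                   salary_eur=salary_eur, skill_categories=categories))

with open("ClearOffers2.json", "w", encoding="utf-8") as f:
    json.dump([o.model_dump() for o in clean_offers], f, ensure_ascii=False, indent=2)
```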
Visualization-Specific Data Processing
Generated specialized datasets for each visualization component:
calculateJaccardIndex.js
Computes skill co-occurrence patterns using the Jaccard similarity index for the skill relationships network (see the sketch after this list)
calculateBoxplotData.js
Generates salary distribution statistics (quartiles, outliers) grouped by skill
and experience level
processExperienceLevel.js
Aggregates job offer counts and statistics by experience level (Junior, Mid,
Senior, Lead)
processContractType.js
Analyzes the distribution of contract types (B2B, UoP, etc.) across job offers
processWorkMode.js
Categorizes work arrangements (Remote, Hybrid, Office) for market trend
analysis
CategoriesCount.py
Counts job offers per technology category for treemap visualization
SkillToSalary.py
Correlates individual skills with salary ranges for skill value analysis
AverageSalary.py
Calculates average salaries segmented by experience level for career trajectory
insights
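For reference, here is the idea behind calculateJaccardIndex.js as a short Python sketch (the project script itself is JavaScript; this only illustrates the Jaccard computation, and the "skills" field name is an assumption): the Jaccard index of two skills is the number of offers mentioning both divided by the number of offers mentioning either.

```python
from itertools import combinations

def jaccard_index(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| -- 1.0 means the skills always co-occur, 0.0 means never."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def skill_cooccurrence(offers: list[dict]) -> dict[tuple[str, str], float]:
    """For every pair of skills, compare the sets of offers each skill appears in."""
    offers_per_skill: dict[str, set[int]] = {}
    for i, offer in enumerate(offers):
        for skill in offer.get("skills", []):      # field name is an assumption
            offers_per_skill.setdefault(skill, set()).add(i)

    return {
        (s1, s2): jaccard_index(offers_per_skill[s1], offers_per_skill[s2])
        for s1, s2 in combinations(sorted(offers_per_skill), 2)
    }

# Example: Python and SQL co-occur in 1 of the 3 offers mentioning either skill -> 1/3.
offers = [{"skills": ["Python", "SQL"]}, {"skills": ["Python"]}, {"skills": ["SQL", "AWS"]}]
print(skill_cooccurrence(offers)[("Python", "SQL")])   # 0.333...
```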
Data → Insights
The final transformation brings processed data to life through interactive visualizations. This entire pipeline was built with a clear purpose: to help us, as developers and data enthusiasts, make informed decisions about which skills and technologies to learn next.
By analyzing thousands of job offers, salary ranges, and skill combinations, we can now see clear patterns in the market. Which technologies are in highest demand? What skills command the best salaries? How do different experience levels affect compensation? What's the optimal career progression path?
These insights transform raw market data into actionable knowledge, empowering anyone to strategically plan their learning journey and career development based on real market trends rather than guesswork.