Showing 3 changed files with 134 additions and 213 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,106 +1,74 @@ | ||
# id-jobs: Your Ultimate Explosion of Indonesian Job Market Data! 💥🧙‍♀️ | ||
# id-jobs: Indonesian Job Market Data Aggregator 💼🇮🇩 | ||
|
||
[![Daily Explosion of Job Data](https://github.com/ceroberoz/id-jobs/actions/workflows/scrape.yml/badge.svg)](https://github.com/ceroberoz/id-jobs/actions/workflows/scrape.yml) | ||
[![Daily Job Data Update](https://github.com/ceroberoz/id-jobs/actions/workflows/scrape.yml/badge.svg)](https://github.com/ceroberoz/id-jobs/actions/workflows/scrape.yml) | ||
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) | ||
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/) | ||
![Powered by Scrapy](https://img.shields.io/badge/Powered%20by-Scrapy-green.svg) | ||
![Enhanced by Playwright](https://img.shields.io/badge/Enhanced%20by-Playwright-orange.svg) | ||
|
||
## 🎆 What's New in Our Latest Explosion? 🎆 | ||
## 🆕 Latest Updates | ||
|
||
- **Kredivo Integration**: Unleashed the power to scrape job listings from Kredivo's career portal with explosive precision! | ||
- **Karir.com Pagination Mastery**: Implemented an unstoppable pagination system that blasts through ALL available job opportunities on Karir.com! | ||
- **Enhanced Data Sanitization**: Improved our pre-upload cleansing rituals for job titles and types, ensuring purer, more potent data! | ||
- **Work Arrangement Detection**: Automatically identify Remote and On-site opportunities with magical accuracy! | ||
- **Job Level Extraction**: Implemented arcane algorithms to determine job levels from titles! | ||
- **Explosive Error Handling**: Fortified our spiders with robust error containment spells, ensuring smooth operation even in chaotic data environments! | ||
- Added Koltiva job listings | ||
- Improved Karir.com data collection | ||
- Enhanced data cleaning for job titles and types | ||
- Added work arrangement and job level detection | ||
- Improved error handling | ||
|
||
## 🌋 Overview | ||
## 📊 Overview | ||
|
||
id-jobs harnesses the explosive power of web scraping to gather job listings from a vast array of Indonesian job portals and company websites, always respecting each site's terms of service. It's like casting a wide-area Explosion spell on the job market! | ||
id-jobs collects job listings from Indonesian job portals and company websites, respecting each site's terms of service. | ||
|
||
📊 **Witness the Explosion of Job Data:** [https://s.id/id-jobs-v2](https://s.id/id-jobs-v2) | ||
**View the Data:** [https://s.id/id-jobs-v2](https://s.id/id-jobs-v2) | ||
|
||
🇮🇩 **Note:** id-jobs is specifically enchanted for the Indonesian job market. | ||
## 🎨 Job Age Colors | ||
|
||
## 🔥 Job Age Color Codex | ||
| Age | Time | Color | | ||
|-----|------|-------| | ||
| New | ≤ 1 day | ![#00CC00](https://via.placeholder.com/15/00CC00/000000?text=+) Bright Green | | ||
| Hot | 1-7 days | ![#FF6600](https://via.placeholder.com/15/FF6600/000000?text=+) Bright Orange | | ||
| Recent | 8-15 days | ![#FFFF00](https://via.placeholder.com/15/FFFF00/000000?text=+) Bright Yellow | | ||
| Aging | 16-21 days | ![#E6E6E6](https://via.placeholder.com/15/E6E6E6/000000?text=+) Light Gray | | ||
| Old | 22-30 days | ![#CCCCCC](https://via.placeholder.com/15/CCCCCC/000000?text=+) Medium Gray | | ||
| Expired | > 30 days | ![#B3B3B3](https://via.placeholder.com/15/B3B3B3/000000?text=+) Dark Gray | | ||
## 🔧 How It Works | ||
|
||
Quickly identify the freshness of job listings with our color-coded system, inspired by the varying intensities of magical explosions: | ||
id-jobs automatically collects job data from various websites, cleans the information, and compiles it into a single spreadsheet. | ||
|
||
| Job Age Category | Time Range | Color | Description | | ||
|------------------|------------|-------|-------------| | ||
| New | <= 1 day | ![#B3E6B3](https://via.placeholder.com/15/B3E6B3/000000?text=+) Bright Light Green | Fresh as a newly cast spell! | | ||
| Hot | 1 to 7 days | ![#FFCC66](https://via.placeholder.com/15/FFCC66/000000?text=+) Warm Light Orange | Still sizzling with opportunity! | | ||
| Recent | 8 to 15 days | ![#99CCFF](https://via.placeholder.com/15/99CCFF/000000?text=+) Light Blue | The magic lingers... | | ||
| Aging | 16 to 21 days | ![#F2F2F2](https://via.placeholder.com/15/F2F2F2/000000?text=+) Very Light Gray | The spell's power wanes... | | ||
| Old | 22 to 30 days | ![#E6E6E6](https://via.placeholder.com/15/E6E6E6/000000?text=+) Light Gray | Ancient arcana, approach with caution. | | ||
| Expired | > 30 days | ![#D9D9D9](https://via.placeholder.com/15/D9D9D9/000000?text=+) Medium Gray | The magic has dissipated. | | ||
![Scraping Process](how-scraper-works.gif) | ||
|
||
## 💥 How It Works | ||
## 👀 Preview | ||
|
||
id-jobs automatically casts its net wide, visiting Indonesian job websites with the precision of a perfectly aimed Explosion spell. It collects relevant information and organizes it into a single, powerful spreadsheet. The data undergoes rigorous magical cleansing and formatting before being uploaded, ensuring consistency and readability worthy of the finest spell books. | ||
![id-jobs Preview](screen-capture-dev.png) | ||
|
||
Our latest enhancements include: | ||
- Explosive pagination for comprehensive data collection | ||
- Advanced string sanitization for cleaner, more consistent job data | ||
- Intelligent work arrangement and job level detection | ||
## 🌟 Why Use id-jobs? | ||
|
||
![The Explosive Scraping Process](how-scraper-works.gif) | ||
id-jobs simplifies job searching by gathering information from multiple sources into one place, providing insights on work arrangements, job levels, and application deadlines. | ||
|
||
## 🔮 Preview | ||
## 📚 Data Sources | ||
|
||
Behold, a glimpse into the arcane power of id-jobs data: | ||
We collect data from various job portals and company websites, including: | ||
Blibli, Dealls, Evermos, Flip, GoTo, Jobstreet, Kalibrr, Karir.com, Kredivo, SoftwareOne, Tiket, and more. | ||
|
||
![id-jobs in Action](screen-capture-dev.png) | ||
## 🚀 Features | ||
|
||
## 🚀 Why Harness the Power of id-jobs? | ||
- Daily updates | ||
- Work arrangement identification | ||
- Job level detection | ||
- Application deadline calculation | ||
- Improved data accuracy | ||
- User-friendly Google Sheets interface | ||
- Job age tracking | ||
|
||
Navigating the labyrinth of job opportunities in Indonesia can be as challenging as mastering Explosion magic. id-jobs simplifies this quest by consolidating information from multiple realms (websites) into one central grimoire (spreadsheet), providing additional insights such as work arrangements, job levels, and application deadlines that even Megumin would approve of! | ||
## 🏁 Getting Started | ||
|
||
## 📚 Tomes of Knowledge (Data Sources) | ||
For a quick guide, see our [Quickstart Guide](QUICKSTART.md). | ||
|
||
We gather our arcane knowledge from a wide range of sources, each represented by a powerful spider in our magical arsenal: | ||
## ❓ FAQ | ||
|
||
- Blibli 🛒 | ||
- Dealls 🤝 | ||
- Evermos 🌟 | ||
- Flip 💳 | ||
- GoTo 🚗 | ||
- Jobstreet 💼 | ||
- Kalibrr 🎓 | ||
- Karir.com 🌐 | ||
- Kredivo 💰 (New!) | ||
- SoftwareOne 💻 | ||
- Tiket ✈️ | ||
- Various company career portals 🏢 | ||
Check our [FAQ](FAQ.md) for common questions. | ||
|
||
Each of these sources is a realm of opportunity, waiting to be explored by our job-seeking wizards. Our spiders weave through these portals, extracting valuable job data with the precision and power of a well-cast Explosion spell! | ||
## 📄 License | ||
|
||
🔮 Note: Our collection of magical spiders is ever-growing, as we continuously enhance our ability to scry the Indonesian job market. Keep an eye out for new additions to our arcane arsenal! | ||
id-jobs is open source under the GPL-3.0 license. You can use, modify, and share the code, as long as you keep it open source. | ||
|
||
## ✨ Magical Features | ||
|
||
- **Daily Explosions of Updates**: Automated daily updates through CI/CD pipelines that would make any archmage jealous. | ||
- **Work Arrangement Scrying**: Identify Remote, Hybrid, and On-site opportunities with crystal-clear clarity. | ||
- **Job Level Divination**: Automatically determine job levels from titles, providing deeper insights into career opportunities. | ||
- **Application Deadline Divination**: Calculated end dates for job applications, because timing is everything in both magic and job hunting. | ||
- **Optimized Data Collection Rituals**: Improved accuracy and coverage of job listings, leaving no stone unturned. | ||
- **User-Friendly Spell Interface**: Access job data through a Google Sheets interface so intuitive, even a novice wizard could use it. | ||
- **Comprehensive Information Gathering**: Data from multiple job boards and company websites, all in one place. | ||
- **Job Age Tracking**: Identify the freshest job listings with the precision of a finely tuned magical sensor. | ||
|
||
## 🧙‍♀️ Getting Started on Your Magical Journey | ||
|
||
For a quick guide on how to harness the power of id-jobs, consult our [Quickstart Grimoire](QUICKSTART.md). | ||
|
||
## 🔍 Frequently Asked Arcane Questions | ||
|
||
Have questions about our magical processes? Check out our [FAQ Scroll](FAQ.md) for answers to common queries from fellow wizards and job seekers. | ||
|
||
## 📜 Legal Incantations | ||
|
||
id-jobs is open source under the GPL-3.0 license. You're free to use, modify, and share the code, as long as you keep it open source too. Think of it as sharing the secrets of Explosion magic with the world! | ||
|
||
We always respect website terms of service when collecting data, because even the most powerful wizards need to follow the rules of the realms they visit. | ||
|
||
Now go forth and explode your job search with the power of id-jobs! 💥🎆 | ||
We respect website terms of service when collecting data. |
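
The job age table in the updated README can be read as a simple threshold lookup. The following is a minimal sketch, assuming age is measured in whole days and that an age of exactly one day counts as New (the table's "≤ 1 day" and "1-7 days" rows overlap at that boundary). The names `AGE_BANDS` and `job_age_category` are hypothetical and are not taken from the repository; the project's own `calculate_job_age` is not shown in this diff and may work differently.

```python
from typing import List, Tuple

# (upper bound in days, category, hex color), copied from the updated README table.
AGE_BANDS: List[Tuple[int, str, str]] = [
    (1, "New", "#00CC00"),
    (7, "Hot", "#FF6600"),
    (15, "Recent", "#FFFF00"),
    (21, "Aging", "#E6E6E6"),
    (30, "Old", "#CCCCCC"),
]
EXPIRED: Tuple[str, str] = ("Expired", "#B3B3B3")


def job_age_category(age_days: int) -> Tuple[str, str]:
    """Return (category, color) for a job age given in whole days."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    # Bands are sorted by upper bound, so the first match is the right one.
    for upper, name, color in AGE_BANDS:
        if age_days <= upper:
            return name, color
    return EXPIRED
```

Keeping the bands in one list means a change to the README table only needs a matching change in one place.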
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
import scrapy | ||
import logging | ||
import json | ||
from datetime import datetime | ||
from typing import Dict, Any, Optional | ||
from freya.pipelines import calculate_job_age | ||
from freya.utils import calculate_job_apply_end_date | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
class KoltivaSpider(scrapy.Spider): | ||
name = 'koltiva' | ||
BASE_URL = 'https://career.koltiva.com' | ||
API_URL = 'https://erp-api.koltitrace.com/api/v1/jobs?limit=100' | ||
|
||
def __init__(self, *args, **kwargs): | ||
super().__init__(*args, **kwargs) | ||
self.timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S") | ||
|
||
def start_requests(self): | ||
headers = { | ||
'accept': 'application/json, text/plain, */*', | ||
'authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJlbWFpbCI6InBzYW5qYXlhLndvcmtAZ21haWwuY29tIiwiaWF0IjoxNzI2MTI5NTIwfQ.4NDbf50RCpgcpQ8tz2oPBULtom0o-A5JgJOjDOHXtIY', | ||
'origin': 'https://career.koltiva.com', | ||
'user-agent': 'Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.188 Safari/537.36 CrKey/1.54.250320' | ||
} | ||
yield scrapy.Request(self.API_URL, headers=headers, callback=self.parse) | ||
|
||
def parse(self, response): | ||
data = json.loads(response.text) | ||
for job in data['data']['data']: | ||
yield self.parse_job(job) | ||
|
||
def parse_job(self, job_data: Dict[str, Any]) -> Dict[str, Any]: | ||
first_seen = self.timestamp | ||
last_seen = self.timestamp | ||
|
||
return { | ||
'job_title': self.sanitize_string(job_data['position_name'], is_title=True), | ||
'job_location': f"{self.sanitize_string(job_data['unit_name'])} - {self.sanitize_string(job_data['country_name'])}", | ||
'job_department': self.sanitize_string(job_data['unitsec_name']), | ||
'job_url': f"{self.BASE_URL}/list-job/{job_data['slug']}", | ||
'first_seen': first_seen, | ||
'base_salary': 'N/A', | ||
'job_type': self.sanitize_string(job_data['work_period_name'], is_job_type=True), | ||
'job_level': self.sanitize_string(job_data['level_name']), | ||
'job_apply_end_date': job_data['close_date'].split('T')[0], | ||
'last_seen': last_seen, | ||
'is_active': 'True', | ||
'company': 'Koltiva', | ||
'company_url': self.BASE_URL, | ||
'job_board': 'Koltiva Careers', | ||
'job_board_url': self.BASE_URL, | ||
'job_age': calculate_job_age(first_seen, last_seen), | ||
'work_arrangement': self.get_work_arrangement(job_data['jobs_benefits_perks']), | ||
} | ||
|
||
def get_work_arrangement(self, benefits: str) -> str: | ||
return 'Remote' if 'Work-from-home' in benefits else 'On-site' | ||
|
||
@staticmethod | ||
def sanitize_string(s: Optional[str], is_title: bool = False, is_job_type: bool = False) -> str: | ||
if s is None: | ||
return 'N/A' | ||
s = s.strip() | ||
s = s.replace(',', ' -') # Replace commas with hyphens for CSV compatibility | ||
if is_title: | ||
s = s.title() | ||
elif is_job_type: | ||
s = s.replace('Contract', '').strip() | ||
return ' '.join(s.split()) or 'N/A' |
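
The spider above indexes the API payload directly and embeds the bearer token in source. A more defensive variant might look like the sketch below. It assumes that `close_date` and `jobs_benefits_perks` can be missing or null (the diff does not say whether they can) and that the token is supplied through an environment variable. The helper names and the `KOLTIVA_API_TOKEN` variable are hypothetical and not part of the repository.

```python
import os
from typing import Dict, Mapping, Optional


def get_work_arrangement(benefits: Optional[str]) -> str:
    """Same rule as the spider, but tolerant of a missing benefits field."""
    return "Remote" if benefits and "Work-from-home" in benefits else "On-site"


def get_apply_end_date(close_date: Optional[str]) -> str:
    """Keep the date part of an ISO-style timestamp, or 'N/A' when absent."""
    if not close_date:
        return "N/A"
    return close_date.split("T")[0]


def build_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build request headers, reading the token from the environment."""
    env = os.environ if env is None else env
    token = env.get("KOLTIVA_API_TOKEN")  # hypothetical variable name
    if not token:
        raise RuntimeError("KOLTIVA_API_TOKEN is not set")
    return {
        "accept": "application/json, text/plain, */*",
        "authorization": f"Bearer {token}",
        "origin": "https://career.koltiva.com",
    }
```

In `start_requests`, the spider could then pass `headers=build_headers()` to `scrapy.Request`, which keeps the credential out of version control.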