Merge branch 'development'
ceroberoz committed Sep 12, 2024
2 parents 4395c3f + 7f9465e commit 6217901
Showing 3 changed files with 134 additions and 213 deletions.
122 changes: 45 additions & 77 deletions README.md
@@ -1,106 +1,74 @@
-# id-jobs: Your Ultimate Explosion of Indonesian Job Market Data! 💥🧙‍♀️
+# id-jobs: Indonesian Job Market Data Aggregator 💼🇮🇩

-[![Daily Explosion of Job Data](https://github.com/ceroberoz/id-jobs/actions/workflows/scrape.yml/badge.svg)](https://github.com/ceroberoz/id-jobs/actions/workflows/scrape.yml)
+[![Daily Job Data Update](https://github.com/ceroberoz/id-jobs/actions/workflows/scrape.yml/badge.svg)](https://github.com/ceroberoz/id-jobs/actions/workflows/scrape.yml)
 [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
 [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
 ![Powered by Scrapy](https://img.shields.io/badge/Powered%20by-Scrapy-green.svg)
 ![Enhanced by Playwright](https://img.shields.io/badge/Enhanced%20by-Playwright-orange.svg)

-## 🎆 What's New in Our Latest Explosion? 🎆
+## 🆕 Latest Updates

-- **Kredivo Integration**: Unleashed the power to scrape job listings from Kredivo's career portal with explosive precision!
-- **Karir.com Pagination Mastery**: Implemented an unstoppable pagination system that blasts through ALL available job opportunities on Karir.com!
-- **Enhanced Data Sanitization**: Improved our pre-upload cleansing rituals for job titles and types, ensuring purer, more potent data!
-- **Work Arrangement Detection**: Automatically identify Remote and On-site opportunities with magical accuracy!
-- **Job Level Extraction**: Implemented arcane algorithms to determine job levels from titles!
-- **Explosive Error Handling**: Fortified our spiders with robust error containment spells, ensuring smooth operation even in chaotic data environments!
+- Added Koltiva job listings
+- Improved Karir.com data collection
+- Enhanced data cleaning for job titles and types
+- Added work arrangement and job level detection
+- Improved error handling

-## 🌋 Overview
+## 📊 Overview

-id-jobs harnesses the explosive power of web scraping to gather job listings from a vast array of Indonesian job portals and company websites, always respecting each site's terms of service. It's like casting a wide-area Explosion spell on the job market!
+id-jobs collects job listings from Indonesian job portals and company websites, respecting each site's terms of service.

-📊 **Witness the Explosion of Job Data:** [https://s.id/id-jobs-v2](https://s.id/id-jobs-v2)
+**View the Data:** [https://s.id/id-jobs-v2](https://s.id/id-jobs-v2)

-🇮🇩 **Note:** id-jobs is specifically enchanted for the Indonesian job market.
+## 🎨 Job Age Colors

-## 🔥 Job Age Color Codex
+| Age | Time | Color |
+|-----|------|-------|
+| New | ≤ 1 day | ![#00CC00](https://via.placeholder.com/15/00CC00/000000?text=+) Bright Green |
+| Hot | 1-7 days | ![#FF6600](https://via.placeholder.com/15/FF6600/000000?text=+) Bright Orange |
+| Recent | 8-15 days | ![#FFFF00](https://via.placeholder.com/15/FFFF00/000000?text=+) Bright Yellow |
+| Aging | 16-21 days | ![#E6E6E6](https://via.placeholder.com/15/E6E6E6/000000?text=+) Light Gray |
+| Old | 22-30 days | ![#CCCCCC](https://via.placeholder.com/15/CCCCCC/000000?text=+) Medium Gray |
+| Expired | > 30 days | ![#B3B3B3](https://via.placeholder.com/15/B3B3B3/000000?text=+) Dark Gray |
+## 🔧 How It Works

-Quickly identify the freshness of job listings with our color-coded system, inspired by the varying intensities of magical explosions:
+id-jobs automatically collects job data from various websites, cleans the information, and compiles it into a single spreadsheet.

-| Job Age Category | Time Range | Color | Description |
-|------------------|------------|-------|-------------|
-| New | <= 1 day | ![#B3E6B3](https://via.placeholder.com/15/B3E6B3/000000?text=+) Bright Light Green | Fresh as a newly cast spell! |
-| Hot | 1 to 7 days | ![#FFCC66](https://via.placeholder.com/15/FFCC66/000000?text=+) Warm Light Orange | Still sizzling with opportunity! |
-| Recent | 8 to 15 days | ![#99CCFF](https://via.placeholder.com/15/99CCFF/000000?text=+) Light Blue | The magic lingers... |
-| Aging | 16 to 21 days | ![#F2F2F2](https://via.placeholder.com/15/F2F2F2/000000?text=+) Very Light Gray | The spell's power wanes... |
-| Old | 22 to 30 days | ![#E6E6E6](https://via.placeholder.com/15/E6E6E6/000000?text=+) Light Gray | Ancient arcana, approach with caution. |
-| Expired | > 30 days | ![#D9D9D9](https://via.placeholder.com/15/D9D9D9/000000?text=+) Medium Gray | The magic has dissipated. |
+![Scraping Process](how-scraper-works.gif)

-## 💥 How It Works
+## 👀 Preview

-id-jobs automatically casts its net wide, visiting Indonesian job websites with the precision of a perfectly aimed Explosion spell. It collects relevant information and organizes it into a single, powerful spreadsheet. The data undergoes rigorous magical cleansing and formatting before being uploaded, ensuring consistency and readability worthy of the finest spell books.
+![id-jobs Preview](screen-capture-dev.png)

-Our latest enhancements include:
-- Explosive pagination for comprehensive data collection
-- Advanced string sanitization for cleaner, more consistent job data
-- Intelligent work arrangement and job level detection
+## 🌟 Why Use id-jobs?

-![The Explosive Scraping Process](how-scraper-works.gif)
+id-jobs simplifies job searching by gathering information from multiple sources into one place, providing insights on work arrangements, job levels, and application deadlines.

-## 🔮 Preview
+## 📚 Data Sources

-Behold, a glimpse into the arcane power of id-jobs data:
+We collect data from various job portals and company websites, including:
+Blibli, Dealls, Evermos, Flip, GoTo, Jobstreet, Kalibrr, Karir.com, Kredivo, SoftwareOne, Tiket, and more.

-![id-jobs in Action](screen-capture-dev.png)
+## 🚀 Features

-## 🚀 Why Harness the Power of id-jobs?
+- Daily updates
+- Work arrangement identification
+- Job level detection
+- Application deadline calculation
+- Improved data accuracy
+- User-friendly Google Sheets interface
+- Job age tracking

-Navigating the labyrinth of job opportunities in Indonesia can be as challenging as mastering Explosion magic. id-jobs simplifies this quest by consolidating information from multiple realms (websites) into one central grimoire (spreadsheet), providing additional insights such as work arrangements, job levels, and application deadlines that even Megumin would approve of!
+## 🏁 Getting Started

-## 📚 Tomes of Knowledge (Data Sources)
+For a quick guide, see our [Quickstart Guide](QUICKSTART.md).

-We gather our arcane knowledge from a wide range of sources, each represented by a powerful spider in our magical arsenal:
+## ❓ FAQ

-- Blibli 🛒
-- Dealls 🤝
-- Evermos 🌟
-- Flip 💳
-- GoTo 🚗
-- Jobstreet 💼
-- Kalibrr 🎓
-- Karir.com 🌐
-- Kredivo 💰 (New!)
-- SoftwareOne 💻
-- Tiket ✈️
-- Various company career portals 🏢
+Check our [FAQ](FAQ.md) for common questions.

-Each of these sources is a realm of opportunity, waiting to be explored by our job-seeking wizards. Our spiders weave through these portals, extracting valuable job data with the precision and power of a well-cast Explosion spell!
+## 📄 License

-🔮 Note: Our collection of magical spiders is ever-growing, as we continuously enhance our ability to scry the Indonesian job market. Keep an eye out for new additions to our arcane arsenal!
+id-jobs is open source under the GPL-3.0 license. You can use, modify, and share the code, as long as you keep it open source.

-## ✨ Magical Features
-
-- **Daily Explosions of Updates**: Automated daily updates through CI/CD pipelines that would make any archmage jealous.
-- **Work Arrangement Scrying**: Identify Remote, Hybrid, and On-site opportunities with crystal-clear clarity.
-- **Job Level Divination**: Automatically determine job levels from titles, providing deeper insights into career opportunities.
-- **Application Deadline Divination**: Calculated end dates for job applications, because timing is everything in both magic and job hunting.
-- **Optimized Data Collection Rituals**: Improved accuracy and coverage of job listings, leaving no stone unturned.
-- **User-Friendly Spell Interface**: Access job data through a Google Sheets interface so intuitive, even a novice wizard could use it.
-- **Comprehensive Information Gathering**: Data from multiple job boards and company websites, all in one place.
-- **Job Age Tracking**: Identify the freshest job listings with the precision of a finely tuned magical sensor.
-
-## 🧙‍♀️ Getting Started on Your Magical Journey
-
-For a quick guide on how to harness the power of id-jobs, consult our [Quickstart Grimoire](QUICKSTART.md).
-
-## 🔍 Frequently Asked Arcane Questions
-
-Have questions about our magical processes? Check out our [FAQ Scroll](FAQ.md) for answers to common queries from fellow wizards and job seekers.
-
-## 📜 Legal Incantations
-
-id-jobs is open source under the GPL-3.0 license. You're free to use, modify, and share the code, as long as you keep it open source too. Think of it as sharing the secrets of Explosion magic with the world!
-
-We always respect website terms of service when collecting data, because even the most powerful wizards need to follow the rules of the realms they visit.
-
-Now go forth and explode your job search with the power of id-jobs! 💥🎆
+We respect website terms of service when collecting data.
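The job-age bands in the new README table (New ≤ 1 day through Expired > 30 days) can be read as a simple threshold lookup. The sketch below is only an illustration of that table: `classify_job_age` and `JOB_AGE_BANDS` are hypothetical names, ages are assumed to be whole days, and a listing that is exactly 1 day old is assumed to count as New. The project's own `calculate_job_age` in `freya.pipelines` is not part of this commit and may behave differently.

```python
from datetime import date

# (inclusive upper bound in whole days, label, hex color), copied from the README table.
JOB_AGE_BANDS = [
    (1, "New", "#00CC00"),
    (7, "Hot", "#FF6600"),
    (15, "Recent", "#FFFF00"),
    (21, "Aging", "#E6E6E6"),
    (30, "Old", "#CCCCCC"),
]
EXPIRED = ("Expired", "#B3B3B3")


def classify_job_age(first_seen: date, today: date) -> tuple[str, str]:
    """Return the (label, color) band for a listing first seen on `first_seen`."""
    age_days = (today - first_seen).days
    if age_days < 0:
        raise ValueError("first_seen is later than today")
    # Bands are sorted by upper bound, so the first match is the right one.
    for upper_bound, label, color in JOB_AGE_BANDS:
        if age_days <= upper_bound:
            return label, color
    return EXPIRED


if __name__ == "__main__":
    print(classify_job_age(date(2024, 9, 1), date(2024, 9, 12)))
```

Keeping the bands in one list means the labels and colors can be changed in a single place if the table is revised again.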
71 changes: 71 additions & 0 deletions freya/spiders/koltiva.py
@@ -0,0 +1,71 @@
+import scrapy
+import logging
+import json
+from datetime import datetime
+from typing import Dict, Any, Optional
+from freya.pipelines import calculate_job_age
+from freya.utils import calculate_job_apply_end_date
+
+logger = logging.getLogger(__name__)
+
+class KoltivaSpider(scrapy.Spider):
+    name = 'koltiva'
+    BASE_URL = 'https://career.koltiva.com'
+    API_URL = 'https://erp-api.koltitrace.com/api/v1/jobs?limit=100'
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+
+    def start_requests(self):
+        headers = {
+            'accept': 'application/json, text/plain, */*',
+            'authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJlbWFpbCI6InBzYW5qYXlhLndvcmtAZ21haWwuY29tIiwiaWF0IjoxNzI2MTI5NTIwfQ.4NDbf50RCpgcpQ8tz2oPBULtom0o-A5JgJOjDOHXtIY',
+            'origin': 'https://career.koltiva.com',
+            'user-agent': 'Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.188 Safari/537.36 CrKey/1.54.250320'
+        }
+        yield scrapy.Request(self.API_URL, headers=headers, callback=self.parse)
+
+    def parse(self, response):
+        data = json.loads(response.text)
+        for job in data['data']['data']:
+            yield self.parse_job(job)
+
+    def parse_job(self, job_data: Dict[str, Any]) -> Dict[str, Any]:
+        first_seen = self.timestamp
+        last_seen = self.timestamp
+
+        return {
+            'job_title': self.sanitize_string(job_data['position_name'], is_title=True),
+            'job_location': f"{self.sanitize_string(job_data['unit_name'])} - {self.sanitize_string(job_data['country_name'])}",
+            'job_department': self.sanitize_string(job_data['unitsec_name']),
+            'job_url': f"{self.BASE_URL}/list-job/{job_data['slug']}",
+            'first_seen': first_seen,
+            'base_salary': 'N/A',
+            'job_type': self.sanitize_string(job_data['work_period_name'], is_job_type=True),
+            'job_level': self.sanitize_string(job_data['level_name']),
+            'job_apply_end_date': job_data['close_date'].split('T')[0],
+            'last_seen': last_seen,
+            'is_active': 'True',
+            'company': 'Koltiva',
+            'company_url': self.BASE_URL,
+            'job_board': 'Koltiva Careers',
+            'job_board_url': self.BASE_URL,
+            'job_age': calculate_job_age(first_seen, last_seen),
+            'work_arrangement': self.get_work_arrangement(job_data['jobs_benefits_perks']),
+        }
+
+    def get_work_arrangement(self, benefits: str) -> str:
+        return 'Remote' if 'Work-from-home' in benefits else 'On-site'
+
+    @staticmethod
+    def sanitize_string(s: Optional[str], is_title: bool = False, is_job_type: bool = False) -> str:
+        if s is None:
+            return 'N/A'
+        s = s.strip()
+        s = s.replace(',', ' -')  # Replace commas with hyphens for CSV compatibility
+        if is_title:
+            s = s.title()
+        elif is_job_type:
+            s = s.replace('Contract', '').strip()
+        return ' '.join(s.split()) or 'N/A'
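To see what `sanitize_string` does to typical values, the helper can be exercised outside Scrapy. In the sketch below the function body is copied from the spider into a plain module-level function so it runs without the `freya` package. The sample inputs are made up for illustration and are not taken from the Koltiva API.

```python
from typing import Optional


def sanitize_string(s: Optional[str], is_title: bool = False, is_job_type: bool = False) -> str:
    """Standalone copy of KoltivaSpider.sanitize_string, for experimentation only."""
    if s is None:
        return 'N/A'
    s = s.strip()
    s = s.replace(',', ' -')  # commas would otherwise split a CSV field
    if is_title:
        s = s.title()
    elif is_job_type:
        s = s.replace('Contract', '').strip()
    # Collapse runs of whitespace; fall back to 'N/A' if nothing is left.
    return ' '.join(s.split()) or 'N/A'


if __name__ == "__main__":
    # Hypothetical sample values, not real API output.
    samples = [
        ("  senior data engineer, remote  ", {"is_title": True}),
        ("Full-time Contract", {"is_job_type": True}),
        ("Contract", {"is_job_type": True}),
        ("UI/UX designer", {"is_title": True}),
        (None, {}),
    ]
    for value, flags in samples:
        print(repr(value), "->", repr(sanitize_string(value, **flags)))
```

Two behaviors are worth knowing when reusing the helper: a job type that consists only of the word `Contract` collapses to `'N/A'`, and `str.title()` lowercases everything after the first letter of each word, so an acronym such as `UI/UX` comes out as `Ui/Ux`.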