Description

I am building a research-driven job market analytics system (US market). This is not just a development project; it requires designing, executing, and evaluating data science experiments. The goal is to analyze job data and produce validated insights, not just build a working pipeline.

Scope of Work

The project includes:
• Data processing of large job datasets (10k–50k+ rows)
• Extracting technical skills (Python, SQL, AWS, etc.) from job descriptions
• Salary analysis and estimation (validated using official BLS data)
• Job demand trend analysis and forecasting
• Building an interactive dashboard (Streamlit preferred)
• Designing and executing research experiments with clear evaluation metrics

Research Requirements (Core of Project)

You must treat this as a research problem, not just implementation.

1. Skill Extraction Reliability
• Extract skills from job descriptions using multiple approaches, such as:
  • Rule-based (regex / keyword matching)
  • NLP-based (spaCy / NER)
  • Embedding-based (semantic similarity)
• Compare methods systematically
• Evaluate performance using:
  • Precision
  • Recall
  • F1-score
• Provide:
  • Clear comparison of methods
  • Error analysis (false positives / false negatives)
  • Justification of best approach

2. Salary Estimation (with Validation)
• Build models to predict salary based on job attributes and extracted skills
• Use features such as:
  • Skills
  • Job title
  • Experience level
  • Location (if available)
• Validate predictions using BLS datasets (e.g., OEWS)
• Analyze:
  • Model performance (MAE / RMSE)
  • Bias and uncertainty
  • Differences between predicted and official salary ranges
• Provide reasoning for model choice and limitations

3. Job Demand Forecasting
• Analyze trends in job demand over time
• Build forecasting models (e.g., ARIMA, Prophet, or ML-based models)
• Compare at least two approaches
• Evaluate using:
  • Forecast error metrics (MAPE / RMSE)
  • Stability of predictions
• Provide insights such as:
  • Which skills are growing/declining
  • Reliability of forecasts

Dataset Requirements

You may use or source datasets, but they must:
• Include job descriptions
• Contain modern technical skills (Python, SQL, AWS, etc.)
• Have at least 10,000 rows
• Include timestamps (for forecasting)
• Include salary data OR be compatible with BLS mapping

You are responsible for:
• Data cleaning
• Preprocessing
• Documentation of dataset choices

Deliverables
• Clean and processed dataset
• Skill extraction pipeline (with comparison results)
• Salary prediction models + validation analysis
• Demand forecasting models + evaluation
• Streamlit dashboard for visualization
• Research report including:
  • Methodology
  • Experiments
  • Evaluation metrics
  • Results and insights
  • Limitations

Requirements
• Strong Python (pandas, numpy)
• NLP experience (regex, spaCy, embeddings)
• Machine learning experience
• Understanding of evaluation metrics (precision, recall, F1, RMSE, etc.)
• Experience with research-style work (preferred)
• GitHub usage with regular commits
• Must speak Hindi or Gujarati
• Must attend 2 meetings per week (30–45 min)

Timeline

2–3 weeks (tight deadline, structured milestones required)

Important Notes
• This is a research-focused academic-style project
• Work must include clear explanations and reasoning, not just code
• No reuse or redistribution of this work
• You must explain all decisions during meetings

Please Share
• Relevant past projects (especially NLP or research work)
• Your approach to:
  • Skill extraction (multiple methods)
  • Model evaluation
• Estimated timeline with milestones

Budget

USD $50 – $80
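
Illustrative example (skill extraction evaluation)

To make the evaluation expected under "Skill Extraction Reliability" more concrete, here is a minimal sketch of how a rule-based extractor could be scored against hand-labeled data. Everything in it is an assumption for illustration only: the three-skill vocabulary, the four sample descriptions, the gold labels, and the helper names `extract_skills` and `micro_prf` are all hypothetical, and micro-averaging is just one reasonable way to aggregate the counts. The same scoring function could be reused for the spaCy and embedding-based methods so that all approaches are compared on identical labels.

```python
import re

# Hypothetical skill vocabulary; a real project would use a much larger curated list.
SKILLS = ["Python", "SQL", "AWS"]

# One case-insensitive pattern per skill. The lookarounds stop a skill from
# matching inside a longer token, e.g. "SQL" inside "MySQL".
PATTERNS = {
    skill: re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(skill) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    for skill in SKILLS
}


def extract_skills(text):
    """Return the set of vocabulary skills mentioned in a job description."""
    return {skill for skill, pattern in PATTERNS.items() if pattern.search(text)}


def micro_prf(predicted, gold):
    """Micro-averaged precision, recall, and F1 over per-document skill sets."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))  # correctly extracted
    fp = sum(len(p - g) for p, g in zip(predicted, gold))  # extracted, not in gold
    fn = sum(len(g - p) for p, g in zip(predicted, gold))  # in gold, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up descriptions with made-up gold labels (illustration only).
descriptions = [
    "We need Python and SQL experience.",
    "Deploy services on Amazon Web Services.",            # AWS spelled out
    "Maintain MySQL databases on AWS.",
    "Reports to the SQL team lead; no coding required.",  # SQL is not a requirement
]
gold = [{"Python", "SQL"}, {"AWS"}, {"AWS"}, set()]

predicted = [extract_skills(d) for d in descriptions]
precision, recall, f1 = micro_prf(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.75 recall=0.75 f1=0.75
```

In this toy data the keyword method misses the spelled-out "Amazon Web Services" (a false negative) and picks up "SQL" where it is only a team name (a false positive). Cases like these are the kind of material the error analysis should document for each method.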