Real-Time Multimodal Emotion Detection

Client: AI | Published: 24.09.2025

Objective: Develop a multimodal emotion recognition system that integrates audio, video, and text modalities using advanced deep learning models, cross-modal fusion, and meta-learning techniques (MAML/Reptile).

Responsibilities:
- Implement feature extraction pipelines using pre-trained models (extraction sketch below):
  - Visual → Vision Transformer (ViT) for facial features
  - Audio → Wav2Vec 2.0 for speech features
  - Text → BERT for contextual embeddings
- Design and implement a Cross-Modal Transformer with cross-attention for fusion of the modalities (fusion sketch below).
- Integrate a meta-learning framework (MAML/Reptile) for few-shot adaptation (Reptile sketch below).
- Preprocess datasets (IEMOCAP, CMU-MOSEI, MELD, etc.) and handle data imbalance (sampling sketch below).
- Optimize the model for real-time, efficient processing: lightweight models, pruning, frame selection (pruning sketch below).
- Evaluate performance using metrics such as Accuracy, F1-score, and the Confusion Matrix (evaluation sketch below).
- Document the implementation with a clear explanation of the methodology, results, and novelty.

Required Skills:
- Strong knowledge of deep learning (PyTorch, TensorFlow preferred)
- Experience with Transformer architectures (ViT, BERT, Wav2Vec 2.0)
- Familiarity with meta-learning algorithms (MAML, Reptile)
- Hands-on experience with multimodal data preprocessing (OpenCV, Librosa, MediaPipe, dlib)
- Knowledge of speech/text/video emotion datasets (IEMOCAP, CMU-MOSEI, MELD)
- Ability to deploy and test on Google Colab (GPU-based training)
- Good documentation and reporting skills
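A minimal sketch of the per-modality feature extraction, assuming Hugging Face transformers and the checkpoints google/vit-base-patch16-224-in21k, facebook/wav2vec2-base-960h, and bert-base-uncased as placeholders for whatever pre-trained weights the project adopts; the pooling choices (CLS token, temporal mean) are illustrative, not the client's specification.

```python
import torch
from transformers import (BertModel, BertTokenizer, ViTImageProcessor, ViTModel,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

# Placeholder checkpoints; swap in the weights chosen for the project.
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()
w2v_proc = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
tok = BertTokenizer.from_pretrained("bert-base-uncased")

@torch.no_grad()
def extract_features(face_image, waveform_16k, utterance_text):
    """Return one 768-d embedding per modality: ViT CLS, mean-pooled Wav2Vec 2.0, BERT CLS."""
    v = vit(**vit_proc(images=face_image, return_tensors="pt")).last_hidden_state[:, 0]
    a = w2v(**w2v_proc(waveform_16k, sampling_rate=16000,
                       return_tensors="pt")).last_hidden_state.mean(dim=1)
    t = bert(**tok(utterance_text, return_tensors="pt",
                   truncation=True)).last_hidden_state[:, 0]
    return v, a, t
```

In practice the face crop would come from an OpenCV/MediaPipe/dlib detector and the waveform from Librosa resampled to 16 kHz, matching the preprocessing stack listed under Required Skills.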
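One way to read "Cross-Modal Transformer with cross-attention" is a block in which tokens from one modality attend over the others; the sketch below uses text tokens as queries over concatenated audio and visual tokens. The dimensions, head count, and 7-class output are assumptions, not requirements from the posting.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Single cross-attention block: text queries attend over audio + visual tokens."""
    def __init__(self, d_model=768, n_heads=8, n_classes=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_tokens, audio_tokens, visual_tokens):
        memory = torch.cat([audio_tokens, visual_tokens], dim=1)  # keys/values from the other modalities
        attended, _ = self.cross_attn(text_tokens, memory, memory)
        x = self.norm1(text_tokens + attended)                    # residual + norm
        x = self.norm2(x + self.ffn(x))
        return self.classifier(x.mean(dim=1))                     # pooled emotion logits

# Shape check with random token sequences (batch of 2; sequence lengths are arbitrary).
model = CrossModalFusion()
logits = model(torch.randn(2, 12, 768), torch.randn(2, 50, 768), torch.randn(2, 16, 768))
print(logits.shape)  # torch.Size([2, 7])
```

A fuller design would stack several such blocks and let every modality query every other one; this sketch shows only the fusion mechanism itself.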
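For the meta-learning requirement, Reptile is the simpler of the two named algorithms to sketch: adapt a copy of the model on one task's support set with ordinary SGD, then move the meta-parameters toward the adapted weights. The hyperparameters below are placeholders, and the task loader is assumed to yield at least inner_steps batches.

```python
import copy
import torch

def reptile_step(model, task_loader, loss_fn, inner_steps=5, inner_lr=1e-3, meta_lr=0.1):
    """One Reptile meta-update from a single sampled task (e.g. a few-shot emotion subset)."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    batches = iter(task_loader)
    for _ in range(inner_steps):                      # inner loop: plain SGD on the support set
        x, y = next(batches)
        opt.zero_grad()
        loss_fn(adapted(x), y).backward()
        opt.step()
    with torch.no_grad():                             # outer update: nudge toward adapted weights
        for p, p_adapted in zip(model.parameters(), adapted.parameters()):
            p += meta_lr * (p_adapted - p)
```

MAML would instead backpropagate through the inner-loop updates (second-order gradients), which is heavier to compute but follows the same task-sampling structure.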
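Class imbalance in IEMOCAP/MELD-style data can be handled at the sampler level; below is a minimal sketch using PyTorch's WeightedRandomSampler, where labels is assumed to be an integer array of emotion labels aligned with the dataset items.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, labels, batch_size=32):
    """Oversample minority emotions so each epoch sees the classes at roughly equal rates."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)
    sample_weights = 1.0 / counts[labels]             # rarer class -> higher sampling probability
    sampler = WeightedRandomSampler(torch.as_tensor(sample_weights, dtype=torch.double),
                                    num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

A class-weighted loss (e.g. torch.nn.CrossEntropyLoss(weight=...)) is an equally common alternative when resampling would distort the temporal structure of the dialogues.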
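Of the real-time techniques named above, pruning has a direct PyTorch utility; the sketch below applies L1 unstructured pruning to every linear layer. The 30% sparsity level is an arbitrary starting point, and latency and accuracy would need to be re-measured afterwards.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_linear_layers(model, amount=0.3):
    """Zero out the smallest-magnitude weights (by L1 norm) in each nn.Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")            # make the pruning mask permanent
    return model
```

Frame selection (keeping only informative video frames) and swapping in lightweight backbones are complementary optimizations not covered by this sketch.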
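The listed evaluation metrics map directly onto scikit-learn; a small helper is sketched below, assuming the model returns class logits and the loader yields (inputs, labels) pairs.

```python
import numpy as np
import torch
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

@torch.no_grad()
def evaluate(model, loader, n_classes):
    """Accuracy, weighted F1, and confusion matrix over a validation/test loader."""
    model.eval()
    y_true, y_pred = [], []
    for x, y in loader:
        y_pred.extend(model(x).argmax(dim=-1).cpu().tolist())
        y_true.extend(y.cpu().tolist())
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=np.arange(n_classes)),
    }
```

Weighted F1 is shown because the emotion classes in these corpora are imbalanced; macro F1 is the usual alternative when reporting results.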