I’m building a vision tool that watches two single-row industrial rack bays and tells us, in real time, exactly which slot an operator uses when placing a package. The camera is fixed, so every frame shares the same perspective. Your task is to detect the package as soon as it appears and map that event to the correct shelf index with minimal latency.

Key points you should know
• Incoming video arrives as 1080p MP4.
• Packages are the only class we care about.
• Any recent one-stage detector is fine (feel free to start from YOLOv5, YOLOv7, or another performant model), as long as the finished pipeline stays below ~100 ms per frame on an RTX-class GPU.
• I’ll supply sample footage, precise rack dimensions, and a simple ID scheme for every slot.

Deliverables
1. Well-commented Python code for training and inference.
2. Trained weights ready for deployment.
3. A concise README detailing environment setup, training steps, and how to run real-time inference.
4. Demo proof, either a short clip or a live session, showing the system correctly labeling each new placement with the right rack coordinate.

Acceptance
The project is complete when the demo consistently outputs the correct rack index in real time on my provided test set while meeting the latency target.

Let me know any questions and your estimated timeline to train and fine-tune the model.
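
To make the detection-to-slot mapping concrete, here is a minimal sketch of the kind of logic I have in mind. Since the camera is fixed, each slot can be pre-mapped to a pixel-space x-range once. All edge coordinates and slot IDs below are made-up placeholders; the real values would come from the rack dimensions and ID scheme I’ll supply.

```python
# Sketch only, not the final implementation: map a detector's bounding box
# to a slot ID using precomputed slot boundaries in image coordinates.
import bisect

# Hypothetical pixel x-coordinates of the slot dividers for one bay
# (left edge ... right edge), defining four slots.
SLOT_EDGES = [120, 360, 600, 840, 1080]
SLOT_IDS = ["A1", "A2", "A3", "A4"]  # placeholder ID scheme

def slot_for_detection(bbox):
    """Map a detection bbox (x1, y1, x2, y2) to a slot ID, or None if outside the bay."""
    x_center = (bbox[0] + bbox[2]) / 2
    i = bisect.bisect_right(SLOT_EDGES, x_center) - 1
    if 0 <= i < len(SLOT_IDS):
        return SLOT_IDS[i]
    return None

# A package whose box centers around x = 450 falls in the second slot.
print(slot_for_detection((400, 300, 500, 420)))  # -> A2
```

The same idea extends to two bays by adding a y-coordinate check or a second edge table per bay.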