AI Agent langchain / langgraph for SaaS App Classification -- 4

Customer: AI | Published: 30.09.2025
Budget: $250

AI Agent Task

foo is a SaaS security company. A key part of our work involves detecting and classifying new applications, enabling organizations to gain visibility and control over the SaaS tools their employees use.

Task Overview

In this exercise, you will design and implement a lightweight AI Agent that takes the name of an application and:
● Identifies its official website
● Determines whether the application incorporates AI features (boolean output)

The goal is to evaluate your ability to:
● Engineer effective prompts
● Design a practical evaluation methodology
● Compare model performance with and without access to an external tool (search)

You may use any modern LLM framework of your choice (OpenAI GPT, Anthropic, Gemini, etc.), but the implementation must be in Python 3.11. MCP or RAG setups are not required. The agent should rely on a single prompt, tested in two modes: with and without tool access. We recommend using the frameworks listed above directly (rather than LangChain/LangGraph) to make full use of native LLM features.

Requirements

1. AI Agent
   o Accepts an app name as input.
   o Returns the app’s official website URL and a boolean indicating whether the app uses AI.
   o Must be evaluated in two modes: without access to a search engine and with access to a search engine (tool). An illustrative agent sketch follows at the end of this brief.

2. Prompt Design
   o Experiment with at least two different prompt versions.

3. Evaluation Process
   o Create a test set of at least 15 app names (including both well-known and lesser-known SaaS apps).
   o Define a clear evaluation metric (e.g., accuracy, precision, recall, or success rate).
   o Run your AI Agent on the test set in both modes and document the process and the improvements you make along the way. An illustrative evaluation sketch follows at the end of this brief.
   o For each app, we accept multiple valid domains as correct answers. Example entries:
     o Microsoft Corporation → [“microsoft.com”]
     o Adobe → [“adobe.com”, “adobelogin.com”]
     o Zoom → [“zoomgov.com”, “zoom.us”, “zoom.com”]
     o Google → [“accounts.google.com”, “google.com”]
     o Dealcloud → [“intapp.com”, “dealcloud.com”]
     o Receptive IO [receptive.io] → [“receptive.io”, “pendo.io”]
     o Futuresimple | app [futuresimple.com] → [“zendesk.com”, “futuresimple.com”]

4. Comparison & Analysis
   o Summarize performance differences between the two modes.
   o Reflect on areas of success and failure.
   o Suggest potential improvements.

Deliverables
● Python code implementing the AI Agent
● A CSV file comparing prompt versions on the evaluation set
● The evaluation dataset and methodology
● Results and analysis
● Key takeaways and suggested improvements

Suggested Time Allocation
● Review the task, design the evaluation set, and select a framework
● Implement the AI Agent and prompt versions
● Run evaluations and collect results
● Analyze results
● Final review and polish of deliverables

What We’re Looking For
● A repo with the code and documentation of the solution process
● Clear reasoning and a structured approach
● Effective use of LLMs and prompt engineering
● Sound evaluation methodology
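
Illustrative agent sketch (not part of the brief): one way to structure the single-prompt agent with the two required modes, assuming the OpenAI Python SDK. The model name, the JSON output contract, and the optional `search_fn` callable are assumptions for illustration only; any framework named above would work the same way.

```python
"""Minimal sketch of the agent; the 'with tool' mode is simulated by injecting
search-result text into the prompt rather than using native function calling."""
import json
from typing import Callable, Optional

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prompt version 1: plain instruction with a strict JSON output contract.
SYSTEM_PROMPT = (
    "You identify SaaS applications. Given an app name, return JSON with two keys: "
    '"website" (the official domain, e.g. "zoom.us") and "uses_ai" (true or false).'
)

def classify_app(app_name: str, search_fn: Optional[Callable[[str], str]] = None) -> dict:
    """Return {'website': str, 'uses_ai': bool} for one app name.

    With search_fn=None this is the 'without tool' mode; pass any callable that
    returns search-result text to run the 'with tool' mode.
    """
    user_msg = f"App name: {app_name}"
    if search_fn is not None:
        # 'With tool' mode: append raw search results as extra context.
        user_msg += f"\n\nWeb search results:\n{search_fn(app_name + ' official website AI features')}"

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; any chat model supporting JSON mode works
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```

A second prompt version could, for example, add few-shot examples or ask the model to reason about ambiguous brand names before answering; the function signature stays the same so both versions can be swapped into the evaluation loop below.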
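
Illustrative evaluation sketch (not part of the brief): a minimal loop that scores one mode against a hand-built test set with multiple accepted domains per app, assuming the `classify_app` function above. The example entries, the AI labels, the `results.csv` file name, and the success-rate metric are illustrative assumptions.

```python
"""Sketch of the evaluation loop: exact-match on normalized domains plus the AI flag."""
import csv
from urllib.parse import urlparse

# Each entry lists every domain accepted as correct, as in the brief's examples.
# The uses_ai labels here are placeholders to be verified when building the real set.
TEST_SET = {
    "Zoom": {"domains": ["zoom.us", "zoom.com", "zoomgov.com"], "uses_ai": True},
    "Adobe": {"domains": ["adobe.com", "adobelogin.com"], "uses_ai": True},
    # ... extend to at least 15 apps, mixing well-known and lesser-known SaaS tools
}

def domain_of(url: str) -> str:
    """Normalize a URL or bare domain to 'example.com' for comparison."""
    netloc = urlparse(url if "//" in url else "//" + url).netloc.lower()
    return netloc.removeprefix("www.")

def evaluate(classify, mode: str, rows_out: str = "results.csv") -> float:
    """Run one mode over the test set, append per-app rows to a CSV, return success rate."""
    correct = 0
    with open(rows_out, "a", newline="") as fh:
        writer = csv.writer(fh)
        for app, gold in TEST_SET.items():
            pred = classify(app)
            website_ok = domain_of(pred["website"]) in gold["domains"]
            ai_ok = bool(pred["uses_ai"]) == gold["uses_ai"]
            correct += website_ok and ai_ok
            writer.writerow([mode, app, pred["website"], pred["uses_ai"], website_ok, ai_ok])
    return correct / len(TEST_SET)

# Example comparison run (search_web would be any search helper of your choice):
# evaluate(classify_app, mode="no_tool")
# evaluate(lambda app: classify_app(app, search_fn=search_web), mode="with_tool")
```

Running the loop once per mode and prompt version, with a mode/version column in each row, yields the requested CSV comparing prompt versions and tool access on the same test set.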