FullStack Onsite

Freelance Agent Evaluation Engineer

Mindrift
Morocco - Morocco | Listed 4 days ago | 5+ years | via Naukrigulf
python fastapi redis react typescript javascript docker kafka

Job Description

Roles & Responsibilities

We're building a dataset to evaluate AI coding agents - how well a model handles real-world developer tasks. You'll create challenging tasks and evaluation criteria within realistic simulated environments:

  • Build realistic developer environments - a virtual company with codebase, infrastructure, and context (tickets, docs, conversations) that forms a believable development history
  • Design tasks from intermediate states of these environments - craft the prompt, define what "solved" means, and ensure the task is solvable by an AI agent
  • Write tests that verify agent solutions - accept all valid approaches and reject incorrect ones, neither too strict nor too lenient
  • Iterate on tasks and tests based on QA feedback - review agent solutions, analyze failures, and refine until the evaluation is fair and robust

What this is NOT

  • Not data labeling
  • Not prompt engineering
  • Not writing code from scratch - the agent writes most of the code; you guide and evaluate

What we look for

  • 5+ years in software development
  • Core stack: Python (FastAPI), JavaScript/TypeScript (React), Docker, Postgres, Kafka, Redis
  • Experience writing tests (functional, integration)
  • English proficiency - B2+

Why this is hard

  • Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial.
  • You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution.
  • Tasks have many valid solutions - writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.

How it works

  1. Apply
  2. Pass qualification(s)
  3. Join a project
  4. Complete tasks
  5. Get paid

Effort estimate

Tasks for this project are estimated to take 20 hours to complete, depending on complexity. This is an estimate and not a schedule requirement; you choose when and how to work. Tasks must be submitted by the deadline and meet the listed acceptance criteria to be accepted.
Compensation

Up to $50/hr equivalent, depending on level and pace. Tasks are estimated at ~20 hours each; you set your own schedule.

Company Industry: Internet / E-commerce / Dotcom
Department / Functional Area: Engineering
Keywords: Freelance Agent Evaluation Engineer

You are viewing this role on JobSphere AI. Applications are completed on the original employer / source website.

  • Company: Mindrift
  • Location: Morocco - Morocco
  • Category: FullStack
  • Source: Naukrigulf
  • Listed: 4 days ago
