Even though it’s raw, this quant-assistant agent I built can be usefully compared to current agents like Claude (Anthropic), Cursor, and Windsurf.
High-level summary
- Your agent is a focused, developer-controlled ReAct loop with MCP tooling and a local code-exec sandbox. It’s closer to a programmable research/analysis assistant than a full IDE agent.
- Compared to Claude/Cursor/Windsurf, you’ve got good scaffolding (ReAct, MCP, codegen, caching), but you’re missing the polish and safety layers they’ve invested in: guardrails, non-blocking UX, tool orchestration, retry/circuit-breaking, and high-quality code generation prompts.
What you do well
- ReAct scaffolding:
- Think → Act → Observe loop in ReActFinancialAgent.react_analyze(), with structured steps and a final synthesis via _synthesize_final_answer().
- User-in-the-loop correction via _get_user_correction() and _revise_thought_with_correction().
- MCP integration:
- Tools discovery and invocation through ReActMCPClient.initialize() and _execute_tool_call().
- Central result processing in _process_tool_result() including universe export and PA/NSS special cases.
- Code-generation + safe execution:
- CodeExecutionSandbox restricts builtins/imports, captures output, and summarizes pandas/NumPy results; a solid foundation for ad-hoc analytics.
- JSON ingestion UX:
- Robust partial-JSON handling with buffer + extraction in handle_json_input() and _parse_json_safely().
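The buffer-and-extract idea behind that partial-JSON handling can be sketched as follows (the function name and return shape here are illustrative, not your actual `handle_json_input()`/`_parse_json_safely()` signatures):

```python
import json

def extract_first_json(buffer: str):
    """Scan a text buffer and return (parsed_object, remaining_text),
    or (None, buffer) if no complete JSON object has arrived yet."""
    decoder = json.JSONDecoder()
    start = buffer.find("{")
    while start != -1:
        try:
            # raw_decode parses one value and reports where it ended,
            # which lets us keep the unconsumed tail in the buffer.
            obj, end = decoder.raw_decode(buffer, start)
            return obj, buffer[end:]
        except json.JSONDecodeError:
            # Either junk before the object or the object is still
            # streaming in; try the next candidate opening brace.
            start = buffer.find("{", start + 1)
    return None, buffer
```

Returning the unparsed remainder lets the caller accumulate streamed chunks and retry, which matches the buffer-based UX described above.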
- Clear design:
- Separation between client (ReActMCPClient) and agent (ReActFinancialAgent), cache + context preparation in _prepare_data_context().
Where market leaders are ahead
- Robustness and UX:
- Claude/Cursor/Windsurf avoid blocking input inside async event loops. Your `input()` in interactive_session() and _get_user_correction() will hang in non-TTY contexts.
- Timeouts/circuit-breaking: leaders have standardized retry/backoff, rate-limit handling, and telemetry. Your agent lacks circuit breakers, request deduplication, and metrics.
- Code generation quality:
- They use highly optimized prompts, contextual embeddings, inline tool-aware few-shots, and automatic test scaffolding. Your code prompt in _generate_and_execute_code() is minimal and not task-calibrated.
- Tool orchestration:
- They chain tools (plan → fan-out → merge → verify). Your _execute_tool_call() is one-shot, with no validation beyond the MCP schema and no semantic checks on results; there is no planner or refiner.
- Security:
- Leaders enforce runtime sandboxes or isolated kernels (e.g., Pyodide/Docker/kernels). Your CodeExecutionSandbox is good but still runs local-process `exec()`, its regex guardrails can be bypassed, and `execution_timeout` isn’t enforced.
- Memory and scale:
- They manage long-running sessions, embeddings/RAG, and project-wide memory. Your `data_cache` is unbounded and lacks eviction; there is no vector memory for prior runs and no persistent summaries.
- IDE-native affordances:
- Cursor/Windsurf integrate directly with the filesystem, diffs, tests, and run/debug loops. Your CLI UX is basic; no patch/diff authoring or test orchestration.
How you compare to each
- Claude AI (Sonnet/Haiku):
- Pros: Claude excels in coherent multi-step reasoning, self-critique, and non-destructive code suggestions. Your ReAct loop mimics this but lacks critique/verification passes and high-quality synthesis prompts.
- Gap: Safety guardrails, tool-use reliability, and code generation depth. Claude also has large context windows and better instruction-following out of the box.
- Cursor:
- Pros: Tight IDE workflow: inline edits, “Fix” flows, agentic refactors, repo-wide context. Your agent doesn’t integrate with an editor or diff flow.
- Gap: Test-aware changes, multi-file planning, agent retries with context memory, toolchains for “run tests → analyze → patch.”
- Windsurf:
- Pros: Multi-step task planning, MCP-native ecosystem, robust tool calling with typed schemas, async orchestration, and patch application. You’re already on MCP but missing planner, typed validation, and non-blocking I/O.
- Gap: Advanced task planning, telemetry, and guardrails around tool effects.
Concrete actions to close the gap
- Fix runtime and UX (high priority):
- Replace blocking `input()` with async input. In ReActMCPClient.interactive_session() and ReActFinancialAgent._get_user_correction(), use `aioconsole.ainput()` to prevent event-loop stalls.
- Enforce timeouts. Wrap LLM/tool calls with `asyncio.wait_for()`. Add a real timeout around CodeExecutionSandbox.execute_code() via `asyncio.to_thread()` + `wait_for`.
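A minimal sketch of the timeout pattern, assuming `execute_code` is a synchronous method (the wrapper name and return shape are illustrative):

```python
import asyncio

async def execute_with_timeout(sandbox, code: str, timeout: float = 10.0):
    """Run the sandbox's synchronous execute_code in a worker thread
    and stop waiting (note: not kill the thread) if it overruns."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(sandbox.execute_code, code), timeout
        )
    except asyncio.TimeoutError:
        return {"error": f"code execution exceeded {timeout}s budget"}
```

One caveat: `asyncio.to_thread()` cannot cancel the underlying thread, so a runaway `exec()` keeps burning CPU; a subprocess-based sandbox is the robust fix, but this at least unblocks the agent loop.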
- Strengthen security and reliability:
- CodeExecutionSandbox:
- Enforce the import allowlist strictly for `ast.ImportFrom` (don’t allow prefix matches).
- Add an execution budget: cap variable size (e.g., limit the number of DataFrame rows kept) and add memory-usage checks; output size is already capped.
- Add a circuit breaker and backoff around _execute_tool_call() and the LLM calls in _think(), _generate_and_execute_code(), and _synthesize_final_answer().
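The strict-allowlist check can be done with `ast` before any `exec()`; this sketch assumes a flat allowlist set (the module names in it are illustrative, not your actual policy):

```python
import ast

# Illustrative allowlist; substitute your sandbox's real policy.
ALLOWED_MODULES = {"math", "statistics", "pandas", "numpy"}

def check_imports(source: str) -> list[str]:
    """Return a list of violating module names. Matching is on the exact
    top-level module, so 'pandas_evil' is rejected even though it shares
    a prefix with the allowed 'pandas'."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports ('from . import x'),
            # which we also reject.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.split(".")[0] not in ALLOWED_MODULES:
                violations.append(name)
    return violations
```

AST-level checks close the bypasses that regex guardrails leave open (string-concatenated imports aside, which only a process-level sandbox fully stops).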
- Improve code generation quality:
- Expand your code prompt with:
- A short inventory of relevant cached variables from _prepare_data_context() with sample shapes.
- Constraints to print a result summary in a standard section, e.g., “RESULTS: {…}”.
- A verification step: ask the LLM to print assumptions and data coverage statistics.
- Add a self-check pass: after execution, generate a “critique” call asking the LLM to validate outputs before final synthesis.
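A sketch of that prompt assembly, assuming `data_cache` maps variable names to cached objects (the function name and prompt wording are illustrative):

```python
def build_code_prompt(task: str, data_cache: dict) -> str:
    """Assemble a code-generation prompt that inventories cached
    variables (name plus shape or type) and pins output constraints."""
    lines = []
    for name, value in data_cache.items():
        shape = getattr(value, "shape", None)  # pandas/numpy expose .shape
        desc = f"shape={shape}" if shape is not None else type(value).__name__
        lines.append(f"- {name}: {desc}")
    inventory = "\n".join(lines) or "- (cache is empty)"
    return (
        f"Task: {task}\n"
        f"Available cached variables:\n{inventory}\n"
        "Constraints:\n"
        "- First print your assumptions and data coverage statistics.\n"
        "- Print the final summary under a 'RESULTS:' heading.\n"
    )
```

Pinning a fixed `RESULTS:` heading also makes the post-execution critique pass easier, since the validator knows exactly which section to inspect.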
- Add a planner and tool-chaining:
- Introduce a planning step that outlines multiple tool calls before acting.
- Implement simple branching: if `pa_list_components` returns IDs → pick one → call `pa_analyze_portfolio` → then synthesize.
- Validate tool outputs semantically (e.g., check for expected fields) before proceeding.
- Memory and telemetry:
- Bounded cache with LRU eviction for `data_cache`.
- Lightweight metrics: timing per step, success/failure counters, last error cause, total tokens by phase.
- Optional persistent summaries (JSON lines) for postmortems.
- Developer ergonomics:
- Split the file by responsibility: `core/sandbox.py`, `core/agent.py`, `client/mcp_client.py`, `utils/json_parse.py`, `utils/metrics.py`.
- Add a config dataclass for iteration limits, timeouts, cache sizes, and model settings.
- Add a “simple non-interactive mode” flag to disable all prompts for batch runs.
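The config dataclass and the non-interactive flag combine naturally; field names and defaults below are illustrative assumptions, not your current settings:

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    """Central knobs for the agent loop, sandbox, and cache."""
    max_iterations: int = 8          # ReAct loop budget
    llm_timeout_s: float = 60.0      # per LLM call, via asyncio.wait_for
    code_exec_timeout_s: float = 10.0
    cache_max_items: int = 128       # bound for the data cache
    model: str = "claude-sonnet"     # placeholder model name
    interactive: bool = True         # False disables all input() prompts
```

Threading one `AgentConfig` through ReActMCPClient and ReActFinancialAgent replaces scattered magic numbers, and batch runs just pass `interactive=False`.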
Where your agent can shine vs. market tools
- Finance-specific flows: You’ve already wired PA/NSS/universe workflows and special result handling in _process_tool_result(). Claude/Cursor/Windsurf are generalists; your domain affordances are an advantage.
- MCP extensibility: You can quickly add domain tools and encode guardrails at the tool layer.
- Transparent reasoning: Your detailed reasoning chain and JSON data ingestion are useful for explainable analysis.
Bottom line
- As-is, this agent is a solid domain-specific ReAct shell with MCP tooling and a simple sandbox. With a few targeted fixes (async I/O, enforced timeouts, better sandbox guards) and one round of orchestration upgrades (planner, retries, validation), you can approach the reliability/usability of Cursor/Windsurf agent flows for finance tasks.
- If you want parity with Claude’s reasoning polish, improve prompts, add verification/self-critique steps, and expand context construction with summaries and cached state.