NAURA Innovation LAB · Data Team

LLM Selection
— Data Team

Evaluating the best AI providers for our daily operations

🔍

Deep-Dive: Ranking & Analysis

Three evaluation pillars across two rounds of providers.

⚙️

Software Engineering & Architecture

Assessing code logic, refactoring capability, and agentic workflows.

📊

Data Analysis & Strategy

Evaluating complex reasoning, data synthesis, and strategic planning.

📋

Presentations & Deliverables

Measuring quality of output for slides, documents, and client-facing work.

Round 1 — US Providers

⚙️

Software Engineering & Architecture

✦ Pros

Leader in "System 2" thinking — highly reliable for architectural planning and identifying edge cases in distributed systems.

✦ Cons

Code style is more rigid than Claude's. A smaller context window limits full-repo analysis.

✦ Pros

Gold standard for code — 80.8% SWE-bench Verified. MCP is the industry standard for agent integrations. Excels at complex refactors and long-horizon tasks.

✦ Cons

Narrower context window than Gemini. Smaller native tooling ecosystem than Google or OpenAI.

✦ Pros

The "Context King" — 2M+ token window ingests entire repos. Unparalleled for cross-file dependency analysis and legacy code migration.

✦ Cons

Code quality trails Claude on complex refactors. Agent/tooling ecosystem less mature than Claude's MCP.
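
Whether a repository actually fits a given context window can be checked roughly before choosing a provider. The sketch below is an illustration only: it assumes about 4 characters per token (a rule of thumb, not a tokenizer-accurate count), and the function names, file extensions, and 20% reserve are hypothetical choices, not part of the evaluation.

```python
import os

# Assumption: ~4 characters per token. Real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_repo_tokens(root, extensions=(".py", ".md", ".sql")):
    """Estimate the token count of a source tree from its total character count."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN


def fits_in_window(token_estimate, window_tokens, reserve_ratio=0.2):
    """Return True if the estimate fits after reserving room for prompts and output."""
    usable = int(window_tokens * (1 - reserve_ratio))
    return token_estimate <= usable
```

For example, `fits_in_window(estimate_repo_tokens("."), 2_000_000)` tests the current directory against a 2M-token window.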

📊

Data Analysis & Strategy

✦ Pros

Superior for multimodal data — analyzes 10-hour technical recordings or massive PDF datasets natively, extracting insights text-only models miss.

✦ Cons

Reasoning depth not on par with GPT-5.4 for abstract strategy and multi-step planning.

✦ Pros

Leads in strategic depth — multi-step "thought-loops" make it the best for technical roadmaps and abstract business strategy.

✦ Cons

No native multimodal ingestion at Gemini's scale. Risk of lock-in to GPT-5.4 through API pricing.

✦ Pros

Focused on precision — preferred for legal and financial data where nuance and instruction-following are more critical than raw speed.

✦ Cons

Reasoning depth trails GPT-5.4 on abstract strategy. Less suited for multimodal data ingestion.

📋

Presentations & Deliverables

✦ Pros

Known for the most "human" and professional prose. Artifacts allow real-time preview of dashboards and UI components — top pick for rapid prototyping.

✦ Cons

No native workspace suite integration. Artifacts are great for prototyping but are not a full presentation tool.

✦ Pros

Unbeatable integration — native connection to Google Workspace (Slides, Docs, Sheets) means it generates full presentation decks directly in your Drive.

✦ Cons

Weaker prose quality for polished deliverables. Code quality trails Claude on complex work.

✦ Pros

Canvas is the most robust collaborative environment for iterative editing of whitepapers and executive summaries.

✦ Cons

No native workspace suite integration comparable to Gemini's. A smaller context window limits large-document ingestion.

🏆

Score Matrix

Weighted evaluation across all categories.

Category	Weight
Coding & Logic	45%	10	9	7
Cross-Repo Context	25%	8	6	10
Team Collaboration (Drive/Slides/Docs)	20%	7	7	10
Data Analysis & Scripting	10%	7	10	8
Total Score	100%	8.60	7.95	8.45
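
The total for each column is the weighted sum of its category scores. The following minimal sketch recomputes the totals from the weights and scores listed in the matrix so they can be checked against the total row. The matrix does not label its score columns, so `column_1` to `column_3` are placeholder names.

```python
# Weights (percent) and scores copied from the Round 1 matrix.
# Row order: Coding & Logic, Cross-Repo Context, Team Collaboration,
# Data Analysis & Scripting.
WEIGHTS_PCT = [45, 25, 20, 10]

# Placeholder column names: the matrix does not say which provider is which.
SCORES = {
    "column_1": [10, 8, 7, 7],
    "column_2": [9, 6, 7, 10],
    "column_3": [7, 10, 10, 8],
}

assert sum(WEIGHTS_PCT) == 100, "weights must add up to 100%"


def weighted_total(scores, weights_pct):
    """Return the weighted sum of scores, with weights given in percent."""
    return sum(w * s for w, s in zip(weights_pct, scores)) / 100


totals = {name: weighted_total(col, WEIGHTS_PCT) for name, col in SCORES.items()}

for name, total in totals.items():
    print(f"{name}: {total:.2f}")
# column_1: 8.60
# column_2: 7.95
# column_3: 8.45
```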

Round 2 — Chinese Providers

⚙️

Software Engineering & Architecture

✦ Pros

Leader in systems engineering — excels at understanding low-level hardware constraints and C++/Rust environments. Developed by Zhipu AI.

✦ Cons

Less proven on high-level application logic compared to MiniMax and DeepSeek. Smaller international community.

✦ Pros

Surprise contender — 80.2% SWE-bench rivals Claude Opus. Optimized for agentic tool use, highly effective for "self-healing" code pipelines.

✦ Cons

Single-agent logic can be outclassed by focused models. Less ecosystem maturity than US providers.

✦ Pros

Reasoning specialist — "Thinking" mode critiques plans and identifies logical fallacies in technical roadmaps.

✦ Cons

Not purpose-built for coding tasks. Slower response times due to complex reasoning chains.

📊

Data Analysis & Strategy

✦ Pros

Strongest at structured "B-side" (enterprise) business logic — used by Chinese corporations for internal knowledge graph synthesis.

✦ Cons

Less versatile for general-purpose data analysis. Weaker on abstract strategic reasoning.

✦ Pros

Reasoning specialist — "Thinking" mode designed for critiquing plans and spotting logical fallacies. Handles extremely long-context reasoning (1M tokens) with high stability.

✦ Cons

Optimized for deep analysis — not ideal for quick, lightweight queries. Premium pricing for enterprise tier.

✦ Pros

Highly favored for math-heavy data science and algorithmic optimization due to strong STEM benchmark performance.

✦ Cons

Strong on quantitative findings but weaker on strategic interpretation and business storytelling.

📋

Presentations & Deliverables

✦ Pros

Leads in multimodal "flair" — best native text-to-video and text-to-audio integration. Favorite for high-impact multimedia collateral.

✦ Cons

Creative focus can overshadow formal accuracy. Less suited for dry, data-heavy executive reports.

✦ Pros

Only provider that outputs polished .docx and .pptx files natively. High professional finish for enterprise deliverables.

✦ Cons

Output tone tends to be overly formal — less suited for agile or casual cultures.

✦ Pros

Known for "creative strategy" — output style is less dry than DeepSeek, suitable for project proposals and marketing collateral.

✦ Cons

Creative style can feel out of place in formal, conservative settings. Less structured than GLM for enterprise formats.

🏆

Score Matrix

Weighted evaluation across all categories.

Category	Weight
Software & Architecture	40%	9.0	8.5	10.0	8.5
Data & Strategy	30%	9.5	9.5	8.0	8.5
Deliverables & Collateral	15%	10.0	8.0	9.0	6.5
Cost & Latency (TPS)	15%	6.5	7.5	9.0	10.0
Total Score	100%	8.925	8.575	9.100	8.425
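
As a cross-check on the Round 2 ranking, the sketch below derives each column's weighted total from the listed weights and scores, then sorts the columns from highest to lowest. The four score columns are unlabeled in the matrix, so `column_1` to `column_4` are placeholder names.

```python
# Weights (percent) and scores copied from the Round 2 matrix.
# Row order: Software & Architecture, Data & Strategy,
# Deliverables & Collateral, Cost & Latency (TPS).
WEIGHTS_PCT = [40, 30, 15, 15]

# Placeholder column names: the matrix does not say which provider is which.
SCORES = {
    "column_1": [9.0, 9.5, 10.0, 6.5],
    "column_2": [8.5, 9.5, 8.0, 7.5],
    "column_3": [10.0, 8.0, 9.0, 9.0],
    "column_4": [8.5, 8.5, 6.5, 10.0],
}

assert sum(WEIGHTS_PCT) == 100, "weights must add up to 100%"


def weighted_total(scores, weights_pct):
    """Return the weighted sum of scores, with weights given in percent."""
    return sum(w * s for w, s in zip(weights_pct, scores)) / 100


totals = {name: weighted_total(col, WEIGHTS_PCT) for name, col in SCORES.items()}

# Highest weighted total first.
ranking = sorted(totals, key=totals.get, reverse=True)

for position, name in enumerate(ranking, start=1):
    print(f"{position}. {name}: {totals[name]:.3f}")
```

With the listed inputs, the order is `column_3`, `column_1`, `column_2`, `column_4`.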