🏆 Leaderboard — Harness Engineering

Top Repos by Harness Score

leaderboard · benchmark · open-source

How well prepared are the most popular open-source projects to be operated by AI agents? We used Harness Engineering Scanner v2 to measure 6 subsystems across top-starred GitHub repos.

#	Project	★ Stars	Stack	Score
1	harness-course iberi22/harness-course	—	HTML/CSS/JS	🟢 100.0%
2	synapse-trading iberi22/synapse-trading	—	Rust	🟢 100.0%
3	swal-skills iberi22/swal-skills	—	Skills	🔵 72.5%
4	agents-flows-recipes iberi22/agents-flows-recipes	—	POML	🔵 65.1%
5	skyvern Skyvern-AI/skyvern	⭐ 21K	Python	🟡 52.3%
6	agent-recipes-repo iberi22/agent-recipes-repo	—	Recipes	🟡 48.6%
7	zeroclaw zeroclaw-labs/zeroclaw	⭐ 31K	Rust	🟡 45.0%
8	context-mode mksglu/context-mode	⭐ 14K	TypeScript	🟡 44.0%
9	browser-use browser-use/browser-use	⭐ 93K	Python	🟡 43.1%
10	graphiti getzep/graphiti	⭐ 25K	Python	🟡 42.2%
11	OpenHands OpenHands/OpenHands	⭐ 73K	Python	🟡 41.3%
12	llama.cpp ggerganov/llama.cpp	⭐ 100K	C/C++	🟡 41.3%
13	DeepTutor HKUDS/DeepTutor	⭐ 23K	Python	🟡 40.4%
14	hyperframes heygen-com/hyperframes	⭐ 16K	TypeScript	🟡 40.4%
15	camel camel-ai/camel	⭐ 16K	Python	🟠 39.4%
16	claude-mem thedotmack/claude-mem	⭐ 74K	TypeScript	🟠 37.6%
17	freqtrade freqtrade/freqtrade	⭐ 50K	Python	🟠 37.6%
18	openfang RightNow-AI/openfang	⭐ 18K	Rust	🟠 37.6%
19	hummingbot hummingbot/hummingbot	⭐ 18K	Python	🟠 36.7%
20	goose aaif-goose/goose	⭐ 50K	Rust	🟠 35.8%
21	open-swe langchain-ai/open-swe	⭐ 10K	Python	🟠 35.8%
22	awesome-copilot github/awesome-copilot	⭐ 32K	Python	🟠 34.9%
23	gitnexus abhigyanpatwari/GitNexus	⭐ 44K	TypeScript	🟠 34.9%
24	cognee topoteretes/cognee	⭐ 17K	Python	🟠 33.9%
25	agenticSeek Fosowl/agenticSeek	⭐ 26K	Python	🟠 30.3%
26	aider Aider-AI/aider	⭐ 44K	Python	🟠 30.3%
27	learn-harness-engineering walkinglabs/learn-harness-engineering	—	TypeScript	🟠 29.4%
28	anomalyco/opencode anomalyco/opencode	⭐ 157K	TypeScript	🟠 27.5%
29	mattpocock-skills mattpocock/skills	⭐ 68K	Shell	🟠 27.5%
30	AI-Trader HKUDS/AI-Trader	⭐ 20K	Python	🟠 26.6%
31	kilocode Kilo-Org/kilocode	⭐ 19K	TypeScript	🟠 26.6%
32	pi earendil-works/pi	⭐ 47K	TypeScript	🟠 26.6%
33	skills iberi22/skills	—	Skills	🟠 25.7%
34	page-agent alibaba/page-agent	⭐ 25K	TypeScript	🟠 24.8%
35	langgraph langchain-ai/langgraph	⭐ 31K	Python	🟠 23.9%
36	crush charmbracelet/crush	⭐ 24K	Go	🟠 22.0%
37	12-factor-agents humanlayer/12-factor-agents	⭐ 19K	TypeScript	🟠 20.2%
38	nanobrowser nanobrowser/nanobrowser	⭐ 12K	TypeScript	🟠 20.2%
39	local-deep-researcher langchain-ai/local-deep-researcher	⭐ 9K	Python	🔴 16.5%
40	gemini-fullstack-langgraph-quickstart google-gemini/gemini-fullstack-langgraph-quickstart	⭐ 18K	Jupyter	🔴 15.6%
41	deep-research dzhng/deep-research	⭐ 18K	TypeScript	🔴 14.7%
42	awesome-harness-engineering walkinglabs/awesome-harness-engineering	—	Markdown	🔴 11.9%
43	opencode opencode-ai/opencode	⭐ 12K	Go	🔴 11.9%
44	UI-TARS bytedance/UI-TARS	⭐ 11K	Python	🔴 10.1%
45	deepresearch Alibaba-NLP/DeepResearch	⭐ 18K	Python	🔴 10.1%
46	genai-agents NirDiamant/GenAI_Agents	⭐ 21K	Python/Jupyter	🔴 10.1%
47	awesome-design-md VoltAgent/awesome-design-md	⭐ 74K	DESIGN	🔴 9.2%
48	imported-skills iberi22/imported-skills	—	Skills	🔴 4.6%
49	harness-course-site iberi22/harness-course-site	—	HTML/CSS/JS	🔴 1.8%
50	local-models iberi22/local-models	—	Python/CSV	🔴 0.0%
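
The colored badges in the Score column appear to follow score bands. The sketch below shows one way to map a score to a badge. The 40% and 20% cut-offs are consistent with the rows above (🟡 bottoms out at 40.4%, 🟠 at 20.2%), but the 🟢 and 🔵 cut-offs are assumptions, since the table only shows 🟢 at 100.0% and 🔵 at 65.1% and 72.5%. The names `TIERS` and `tier` are hypothetical and not part of the scanner.

```python
# Hypothetical tier cut-offs inferred from the table; the scanner's real
# thresholds are not given in the source.
TIERS = [
    (80.0, "🟢"),  # assumed: the table only shows 🟢 at 100.0%
    (60.0, "🔵"),  # assumed: 🔵 appears at 65.1% and 72.5%
    (40.0, "🟡"),  # 🟡 spans 40.4% to 52.3% in the table
    (20.0, "🟠"),  # 🟠 spans 20.2% to 39.4% in the table
    (0.0, "🔴"),   # 🔴 spans 0.0% to 16.5% in the table
]


def tier(score: float) -> str:
    """Return the badge for a Harness Score given as a percentage."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score out of range: {score}")
    for cutoff, badge in TIERS:
        if score >= cutoff:
            return badge
    return "🔴"  # unreachable for valid input; kept as a safe default


if __name__ == "__main__":
    for name, score in [("skyvern", 52.3), ("camel", 39.4), ("local-deep-researcher", 16.5)]:
        print(f"{name}: {tier(score)} {score:.1f}%")
```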

📊 Subsystem Comparison

The 6 subsystems evaluated show a clear pattern: Verification is the strongest area for almost every project, while Skills and State are the most neglected.
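
To make that pattern concrete, here is a small sketch that sorts the six subsystem averages reported in this section from weakest to strongest. The values are copied from the subsystem cards; the helper name `rank_gaps` is hypothetical.

```python
# Subsystem averages as reported in this section (percent, over 50 repos).
averages = {
    "Instructions": 39.0,
    "State": 19.6,
    "Verification": 46.0,
    "Scope": 23.0,
    "Lifecycle": 44.6,
    "Skills": 20.7,
}


def rank_gaps(avgs: dict[str, float]) -> list[tuple[str, float]]:
    """Sort subsystems from weakest to strongest average."""
    return sorted(avgs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for name, value in rank_gaps(averages):
        print(f"{name:<13}{value:5.1f}%")
```

Sorted this way, State and Skills come first, followed by Scope, with Verification last.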

📋 Instructions

∅ 39%

Average across 50 repos. Skyvern (55.0%, AGENTS.md + docs/), hyperframes (55.0%, AGENTS.md + docs/), and cognee (55.0%, AGENTS.md + docs/) were added, all with a basic AGENTS.md. Instructions rises from 38% to 39.0%.

💾 State

∅ 19.6%

Average across 50 repos. New: Skyvern (15.0%), hyperframes (10.0%), cognee (15.0%). State remains a universal gap: none of the new repos has a TASK.md or persistent memory. State drops from 20% to 19.6%.

✅ Verification

∅ 46%

Average across 50 repos. Skyvern (100.0%, 768 tests + CI/CD) lifts the average, while hyperframes (33.3%, 566 tests but no test directory detected) and cognee (50.0%, 357 tests) temper it. Verification rises from 45% to 46.0%.

🎯 Scope

∅ 23%

Average across 50 repos. New: Skyvern (12.5%), hyperframes (31.2%), cognee (25.0%). Scope remains one of the largest gaps, with the third-lowest average after State and Skills: none of the new repos has a DoD (Definition of Done) or formal milestones. Scope holds at 23.0%.

🔄 Lifecycle

∅ 44.6%

Average across 50 repos. New: Skyvern (65.0%, Docker + deps), hyperframes (45.0%, no Docker), cognee (50.0%, Docker + deps). Lifecycle rises from 44% to 44.6%.

🧠 Skills

∅ 20.7%

Average across 50 repos. hyperframes (73.3%, 20 skills with valid frontmatter) is the standout. Skyvern (66.7%) has its own skill. cognee (0.0%) has no skills/ directory. Skills rises from 19% to 20.7%, but 39/50 repos still score 0%.
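
The before-and-after averages quoted in the six cards above can be sanity-checked with a simple mean update. This sketch assumes the previous averages were taken over 47 repos (50 minus the 3 additions) and uses the rounded previous values as quoted, so small rounding differences are possible. The function name `updated_mean` is hypothetical.

```python
def updated_mean(prev_mean: float, prev_count: int, new_values: list[float]) -> float:
    """Combine a previous mean with new samples into the new overall mean."""
    total = prev_mean * prev_count + sum(new_values)
    return total / (prev_count + len(new_values))


# Previous (rounded) averages and the three new scores quoted in the cards:
# Skyvern, hyperframes, cognee, in that order.
updates = {
    "Instructions": (38.0, [55.0, 55.0, 55.0]),
    "State": (20.0, [15.0, 10.0, 15.0]),
    "Verification": (45.0, [100.0, 33.3, 50.0]),
    "Scope": (23.0, [12.5, 31.2, 25.0]),
    "Lifecycle": (44.0, [65.0, 45.0, 50.0]),
    "Skills": (19.0, [66.7, 73.3, 0.0]),
}

PREV_COUNT = 47  # assumption: 50 repos now minus the 3 newly added

if __name__ == "__main__":
    for name, (prev, new) in updates.items():
        print(f"{name}: {prev:.1f}% -> {updated_mean(prev, PREV_COUNT, new):.1f}%")
```

Under that assumption, each result rounds to the figure reported in its card: for example, Instructions works out to (47 × 38 + 165) / 50 = 39.02%.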

🔍 Conclusions

iberi22 projects dominate the top

iberi22 projects take the top four spots on the leaderboard, and five of the top six. harness-course (100.0%) and synapse-trading (100.0%) show that the Harness Engineering methodology produces repositories significantly better prepared for AI agents than the most popular open-source projects.

DeepTutor (40.4%, 23K⭐, Python) is the strongest new entry, with Verification at 83.3% (293 tests + CI/CD). anomalyco/opencode (27.5%, 157K⭐, TypeScript) is the most-starred repo on the leaderboard, with 671 tests but Skills at 0%. crush (22.0%, 24K⭐, Go) joins the group of agentic coding tools without Skills. kilocode (26.6%, 19K⭐, TypeScript) is a monorepo with 1056 tests but no root-level test directory. DeepResearch (10.1%, 18K⭐, Python) and 12-factor-agents (20.2%, 19K⭐, TypeScript) join as educational/agent projects with no harness.

The leaderboard now covers 50 repos and a broader range of stacks: TypeScript (page-agent), Rust (openfang), Jupyter (gemini-fullstack), and more. openfang (37.6%, 18K⭐, Rust) has Verification at 55.6% and Lifecycle at 60%. page-agent (24.8%, 25K⭐, TypeScript) stands out with Instructions at 55%. gemini-fullstack-langgraph-quickstart (15.6%, 18K⭐, Jupyter) has Lifecycle at 60% but Verification at 0%.

Gap: State, Skills, and the tail of the leaderboard

The State and Skills averages, at roughly 20% each (19.6% and 20.7%), remain the Achilles' heel. The 3 newcomers, kilocode (10%), DeepResearch (15%), and 12-factor-agents (10%), all have State at 15% or less. kilocode (26.6%, 19K⭐) has Instructions at 40% (complete AGENTS.md) but Skills at 0%. 12-factor-agents (20.2%) has a CLAUDE.md with personas but Skills at 0% and Scope at 0%. This confirms that Skills remains the most universal gap: 39/50 repos have Skills at 0%.

Scanner Scalability

The Harness evaluator works on any repo, regardless of language or stack. This means we can scale this leaderboard to 100+ repos and create a public ranking where any project can measure its "agent-readiness". With 50 repos scanned, including kilocode (TypeScript, 19K⭐), DeepResearch (Python, 18K⭐), 12-factor-agents (TypeScript, 19K⭐), anomalyco/opencode (TypeScript, 157K⭐), DeepTutor (Python, 23K⭐), crush (Go, 24K⭐), goose (Rust, 50K⭐), and gitnexus (TypeScript, 44K⭐), the leaderboard covers a wide range of stacks: TypeScript, Python, Rust, Go, Jupyter, C/C++, Shell, and more, and the scanner has run consistently across all of them.
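
One plausible reason a scanner like this can be stack-agnostic is that many of the signals named in this write-up (AGENTS.md, CLAUDE.md, docs/, TASK.md, skills/, Docker) are files and directories rather than language features. The sketch below only illustrates that idea and is not the Harness Engineering Scanner's implementation. The signal map, the use of `Dockerfile` as the Docker marker, and the names `SIGNALS` and `detect_signals` are all assumptions, and no scoring or weighting is attempted.

```python
from pathlib import Path

# Hypothetical signal map: file and directory names are taken from the
# write-up above, but the scanner's real rules and weights are not given.
SIGNALS = {
    "Instructions": ["AGENTS.md", "CLAUDE.md", "docs"],
    "State": ["TASK.md"],
    "Lifecycle": ["Dockerfile"],  # assumed marker for "Docker"
    "Skills": ["skills"],
}


def detect_signals(repo: Path) -> dict[str, list[str]]:
    """List which of the assumed marker paths exist at the repo root."""
    return {
        subsystem: [name for name in names if (repo / name).exists()]
        for subsystem, names in SIGNALS.items()
    }


if __name__ == "__main__":
    import sys

    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for subsystem, found in detect_signals(root).items():
        print(f"{subsystem}: {', '.join(found) or 'none found'}")
```

Because the check only looks at paths under the repository root, it behaves the same for a Rust, Python, or TypeScript project.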
