Bar chart showing cumulative ABAP understanding success rates by model and feedback round.

SAP’s ABAP-1 Loses Every ABAP Benchmark, Even “Explaining”

Previous post (code generation benchmark): Benchmarking LLMs for ABAP

Live benchmark results (old + new): abap-llm-benchmark.marianzeis.de

In my first evaluation (based on the TH Köln benchmark paper), I extended the original setup with additional models and focused on one concrete question: how well can LLMs generate ABAP code that actually compiles and passes ABAP Unit tests? I also tested SAP’s model ABAP-1, and it performed very poorly at code generation. To be fair, SAP states this in its documentation: ABAP-1 is primarily meant for explaining ABAP code, not for reliably generating full working implementations. ...

March 3, 2026 · Marian Zeis
Bar chart showing cumulative ABAP code generation success rates by model and feedback round.

Benchmarking LLMs for ABAP: Why ABAP-1 Isn't a Code Generator (Yet)

Live benchmark results: abap-llm-benchmark.marianzeis.de

In a lot of SAP webcasts and webinars, especially around AI, the question comes up very early: which model are you using, and which one do you recommend? For CAP and UI5 the answer is usually pretty simple: use the current best model from Anthropic. If you add good context via MCP servers from the community or SAP, you are basically fine. There is just a lot of public knowledge available, and most of it is in JavaScript/TypeScript, which LLMs handle extremely well. ...

February 9, 2026 · Marian Zeis