Recently, I participated in the MLSys 2026 - NVIDIA Track: FlashInfer AI Kernel Generation Contest (FlashInfer Contest, 2026a). This post is not a tutorial on CUDA kernel optimization, and I am not a GPU operator development expert. My main goal was to use a highly verifiable task environment with clear feedback to study how coding agents can continuously produce high-quality GPU kernels in a closed-loop workflow.

The full materials are split into two reports: Harness Engineering for LLM-Driven GPU Kernel Generation (Shui et al., 2026) and Full-Agent Kernel Generation for FlashInfer (Ma et al., 2026). The code is available in mlsys26-flashinfer-contest.

Research Background

The difficulty of generating GPU kernels with LLMs lies not merely in writing plausible-looking CUDA or Triton code. A candidate implementation must be semantically correct, compile successfully, cover the target input shapes, and run faster than existing implementations on real GPUs.

KernelBench (Ouyang et al., 2025) builds an evaluation framework where an LLM reads a PyTorch reference implementation, writes a custom kernel, and is then evaluated for compilation, correctness, and runtime performance.

Fig. 1. KernelBench evaluation workflow. A model receives a PyTorch workload, writes a custom kernel implementation, and is evaluated by both functional correctness and latency. (Image source: Ouyang et al., 2025)
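
To make the three gates concrete, here is a minimal sketch of a staged evaluator in plain Python. It is not KernelBench's actual code: `evaluate`, `EvalResult`, and the `build` callback are hypothetical names, "compilation" is modeled as constructing a callable, and wall-clock timing stands in for GPU timing (a real GPU harness would also need device synchronization, since kernel launches are asynchronous).

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


@dataclass
class EvalResult:
    compiled: bool
    correct: bool
    mean_latency_ms: Optional[float]  # None unless the candidate passed correctness


def evaluate(build: Callable[[], Callable[[Vector], Vector]],
             reference: Callable[[Vector], Vector],
             inputs: Sequence[Vector],
             rtol: float = 1e-5,
             atol: float = 1e-8,
             repeats: int = 5) -> EvalResult:
    # Stage 1: "compilation". Any exception while building counts as a failure.
    try:
        candidate = build()
    except Exception:
        return EvalResult(False, False, None)

    # Stage 2: correctness against the reference on every input.
    for x in inputs:
        want = reference(x)
        try:
            got = candidate(x)
        except Exception:
            return EvalResult(True, False, None)
        same_length = len(got) == len(want)
        if not same_length or not all(
            math.isclose(g, w, rel_tol=rtol, abs_tol=atol) for g, w in zip(got, want)
        ):
            return EvalResult(True, False, None)

    # Stage 3: latency is measured only for candidates that passed correctness.
    start = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            candidate(x)
    elapsed_ms = (time.perf_counter() - start) * 1e3
    return EvalResult(True, True, elapsed_ms / (repeats * len(inputs)))
```

The ordering is the point of the sketch: a fast but incorrect candidate never receives a latency number, so it cannot be mistaken for an improvement.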

FlashInfer-Bench (Xing et al., 2026) places this problem in the real workload distributions of LLM inference serving, emphasizing the closed loop among execution traces, evaluation, candidate implementations, and deployment. It is not just a standalone microbenchmark; it requires candidate kernels to be evaluated against realistic inference distributions, using a unified trace format and correctness checks.

Fig. 2. FlashInfer-Bench architecture. FlashInfer Trace connects kernel definitions, serving workloads, candidate solutions, benchmark results, and deployment paths. (Image source: Xing et al., 2026)

From a method-taxonomy perspective, this post follows the survey Towards Automated Kernel Generation in the Era of LLMs (Yu et al., 2026) and summarizes related work into two routes: LLM4Kernel and Agent4Kernel.

LLM4Kernel starts from high-quality domain data and uses training techniques such as continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL) to improve the model itself, making it better at understanding kernel-development contexts and generating high-quality kernel code.

Fig. 3. LLM4Kernel focuses on improving model-side kernel generation capability through data construction, supervised fine-tuning, reinforcement learning, and domain adaptation. (Image source: Yu et al., 2026)

This route can internalize kernel knowledge into model parameters, but it usually depends on high-quality training data, stable reward design, and substantial training cost.

Agent4Kernel instead emphasizes iterative search, external memory, multi-agent orchestration, and automated evaluation. My contest solution is closer to this route: I did not train a new model; instead, I designed a workflow where existing coding agents could try candidates, record feedback, and improve the experimental harness.

Fig. 4. Agent4Kernel emphasizes iterative refinement, evolutionary search, external memory, hardware profiling, and multi-agent orchestration for kernel optimization. (Image source: Yu et al., 2026)

This direction also echoes industrial systems such as Meta’s KernelEvolve (Liao et al., 2025). KernelEvolve emphasizes persistent knowledge bases, retrieval-augmented prompt construction, cross-hardware programming abstractions, and continuous optimization for production operators.

Fig. 5. KernelEvolve system overview. Persistent memory, retrieval, evolutionary search, and hardware-aware evaluation are combined to scale agentic kernel coding to production operators. (Image source: Liao et al., 2025)

The core challenge for GPU-kernel agents is preserving useful experience under complex hardware, complex workloads, and strict validation constraints, then compressing failure feedback into executable search constraints for the next iteration.

System Architecture

The LLM CUDA team submitted two routes: Agent-Assisted and Full-Agent. Their main difference is not whether they use LLMs, but whether humans continuously intervene in the search process.

| Dimension | Agent-Assisted | Full-Agent |
| --- | --- | --- |
| Human involvement | Humans continuously design strategies, provide reference implementations, select optimization directions, and maintain promotion rules. | Humans provide only the initial task, constraints, and automation tools; the agent performs the subsequent search. |
| Search style | Uses profiling results and experience to choose candidate implementation families, focusing on high-confidence local optimizations. | Runs long-horizon search through a plan-execute-evaluate-summarize-store loop. |
| State management | Mainly relies on human-maintained notes, skills, and experiment archives. | Uses a LoongFlow-style database to record candidate origins, result summaries, current best records, and failure modes. |

Agent-Assisted

In Self-Evolving Agents, I discussed the core paradigm of Harness Engineering (Lopopolo, 2026; Rajasekaran, 2026): humans design constraints, build feedback mechanisms, and define evaluation criteria, while agents iteratively generate higher-quality code inside a controlled environment.

Fig. 6. Agent-Assisted closed-loop harness for B200 kernel optimization. The workflow grounds agents in operator definitions, workload distributions, references, profile signals, and explicit promotion policies. (Image source: Shui et al., 2026)

The Agent-Assisted harness is organized into four layers:

  • Grounding inputs: required context, such as operator definitions, reference implementations, and workload JSON files.
  • Shape discovery: group workloads by parameters such as batch size and sequence length, then sample representative dimensions from each group so that agents can quickly evaluate and iterate on candidates without running full validation every time.
  • Closed-loop optimization: generate candidates along the baseline -> profile -> diagnose -> generate -> evaluate -> archive loop. Each candidate is checked for successful compilation and correct results before its performance is measured; Torch Profiler and NVIDIA Nsight Compute (NCU) are used to analyze operator bottlenecks.
  • Outputs: archive related code, performance metrics, and other files so they can be provided back to the agent for later iterations.
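
The shape-discovery layer can be sketched as follows. This is an illustration under assumptions, not the harness code: the field names `batch_size` and `seq_len`, the power-of-two bucketing, and the function names are all hypothetical choices for the example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def bucket(value: int) -> int:
    """Round up to the next power of two so that nearby sizes share a group."""
    return 1 << max(value - 1, 0).bit_length()


def representative_workloads(workloads: List[Dict[str, int]],
                             per_group: int = 1,
                             seed: int = 0) -> List[Dict[str, int]]:
    # Group workloads by their (batch size, sequence length) bucket.
    groups: Dict[Tuple[int, int], List[Dict[str, int]]] = defaultdict(list)
    for w in workloads:
        groups[(bucket(w["batch_size"]), bucket(w["seq_len"]))].append(w)

    # Sample a few members per group; a fixed seed keeps runs comparable.
    rng = random.Random(seed)
    picked: List[Dict[str, int]] = []
    for key in sorted(groups):
        members = groups[key]
        picked.extend(rng.sample(members, min(per_group, len(members))))
    return picked
```

A candidate that looks good on this reduced set would still need to pass the full workload validation before being retained; the subset only shortens the inner iteration loop.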

Humans wrote optimization skills and built evaluation scaffolding. These practices share the same engineering motivation as Agent Skills (OpenAI, 2026a) and Subagents (OpenAI, 2026b): reusable context, packaged tools, and parallel exploration. In this contest setting, they also helped keep the search process inside a verifiable closed loop.

Full-Agent

Fig. 7. Modified LoongFlow Full-Agent stack. The agent iterates through planning, code generation, evaluation, summarization, and database updates, turning failed candidates into searchable context for later iterations. (Image source: Ma et al., 2026)

The framework I used for the Full-Agent route follows a LoongFlow-like (Wan et al., 2025) plan-execute-summarize paradigm and is also similar to OpenEvolve-style (Sharma, 2025) evolutionary search systems. It decomposes a kernel search into planning, execution, evaluation, summarization, and storage, then writes each candidate’s provenance, performance results, and failure summary into a persistent database.
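
As a rough illustration of the storage step, the sketch below keeps candidate records in SQLite using Python's standard `sqlite3` module. The schema, column names, and helper functions are assumptions made for this example and do not reflect LoongFlow's actual database layout.

```python
import sqlite3
from typing import List, Optional, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS candidates (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    parent_id  INTEGER,            -- provenance: which candidate this one was derived from
    plan       TEXT NOT NULL,      -- what the planner intended to try
    passed     INTEGER NOT NULL,   -- 1 if the correctness check passed, else 0
    latency_ms REAL,               -- NULL for failed candidates
    summary    TEXT NOT NULL       -- result or failure summary written after evaluation
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store(conn: sqlite3.Connection, plan: str, passed: bool,
          latency_ms: Optional[float], summary: str,
          parent_id: Optional[int] = None) -> int:
    cur = conn.execute(
        "INSERT INTO candidates (parent_id, plan, passed, latency_ms, summary) "
        "VALUES (?, ?, ?, ?, ?)",
        (parent_id, plan, int(passed), latency_ms, summary),
    )
    conn.commit()
    return cur.lastrowid


def current_best(conn: sqlite3.Connection) -> Optional[Tuple[int, str, float]]:
    # Only correctness-passing candidates are eligible to be the best record.
    return conn.execute(
        "SELECT id, plan, latency_ms FROM candidates "
        "WHERE passed = 1 ORDER BY latency_ms ASC LIMIT 1"
    ).fetchone()


def failure_summaries(conn: sqlite3.Connection, limit: int = 5) -> List[str]:
    # Most recent failures first, so they can be fed into the next planning prompt.
    rows = conn.execute(
        "SELECT summary FROM candidates WHERE passed = 0 ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping failed candidates in the same table as successful ones is deliberate in this sketch: the planner can read recent failure summaries before proposing the next candidate instead of rediscovering the same dead ends.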

Experimental Results

On the official Top-3 leaderboard, our team's results were:

| Official track | Method | Rank |
| --- | --- | --- |
| Track A Fused MoE | Agent-Assisted | 3rd |
| Track C Gated Delta Net | Agent-Assisted | 3rd |
| Track C Gated Delta Net | Full-Agent | 2nd |

The data below comes from local evaluation on Modal B200 GPUs. It should be treated as reference only, because the development environment did not fully lock GPU clock frequencies and the final evaluation ran on bare-metal machines. The evaluation protocol follows the correctness-gated benchmark setting of FlashInfer (Ye et al., 2025) and FlashInfer-Bench (Xing et al., 2026). The table uses the matched comparison convention from the report: latency is mean milliseconds, PyTorch speedup uses the corresponding PyTorch reference mean, and FlashInfer speedup uses the official FlashInfer baseline. In simplified form:

\[ \mathrm{Speedup} = \frac{\mathrm{mean\ baseline\ latency}}{\mathrm{mean\ solution\ latency}}. \]
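
As a quick worked check of this definition, the snippet below recomputes one entry of the local table that follows. The function name `speedup` is a hypothetical helper; the two latency values are the DSA Attention means for the FlashInfer baseline and the Agent-Assisted solution.

```python
from statistics import mean
from typing import Sequence


def speedup(baseline_latencies_ms: Sequence[float],
            solution_latencies_ms: Sequence[float]) -> float:
    """Ratio of mean baseline latency to mean solution latency."""
    return mean(baseline_latencies_ms) / mean(solution_latencies_ms)


# DSA Attention: FlashInfer baseline mean vs. Agent-Assisted mean (both in ms).
# The table reports means only, so each side is passed as a single value.
print(f"{speedup([0.331650], [0.011175]):.2f}x")  # 29.68x
```

Because the ratio is taken between means, workloads with large absolute latency weigh more heavily than they would under a mean of per-workload ratios.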

This table is for local analysis only, not the official per-operator or per-track contest score:

| Operator definition | Method | Mean latency (ms) | Speedup vs. PyTorch reference mean | Speedup vs. FlashInfer baseline |
| --- | --- | --- | --- | --- |
| DSA Attention | Agent-Assisted | 0.011175 | 217.17× | 29.68× |
| DSA Attention | Full-Agent | 0.022811 | 106.39× | 14.54× |
| DSA Attention | FlashInfer baseline | 0.331650 | 7.32× | 1.00× |
| DSA Indexer | Agent-Assisted | 0.006893 | 494.13× | 18.05× |
| DSA Indexer | Full-Agent | 0.032659 | 104.29× | 3.81× |
| DSA Indexer | FlashInfer baseline | 0.124420 | 27.38× | 1.00× |
| GDN Prefill | Agent-Assisted | 0.051992 | 21,078× | 13.70× |
| GDN Prefill | Full-Agent | 0.688875 | 1,591× | 1.03× |
| GDN Prefill | FlashInfer baseline | 0.712166 | 1,539× | 1.00× |
| MoE FP8 | Agent-Assisted | 0.286340 | 63.78× | 1.62× |
| MoE FP8 | FlashInfer baseline | 0.463874 | 39.37× | 1.00× |
| MoE FP8 | Full-Agent | 1.742630 | 10.48× | 0.27× |
| GDN Decode | Agent-Assisted | 0.006201 | 7,970× | 1.12× |
| GDN Decode | FlashInfer baseline | 0.006940 | 7,121× | 1.00× |
| GDN Decode | Full-Agent | 0.008366 | 5,907× | 0.83× |

Fig. 8. Final retained speedups over the supplied FlashInfer baseline, measured using mean latency on local Modal B200 runs. (Image source: Shui et al., 2026)

Agent-Assisted outperforms the FlashInfer baseline on all five operators, with speedups ranging from 1.12× on GDN Decode to 29.68× on DSA Attention. Full-Agent also finds effective candidates for DSA Attention, DSA Indexer, and GDN Prefill, but remains below the baseline on MoE FP8 and GDN Decode.

Agent-Assisted Optimization Trajectory

Fig. 9. Retained speedup trajectories over the supplied FlashInfer baseline. The curves show long plateaus and discrete jumps rather than smooth monotonic progress. (Image source: Shui et al., 2026)

The trajectories show that performance improvements did not happen smoothly. Instead, a small number of large jumps occurred after long plateaus. Effective Agent-Assisted kernel optimization does not simply rely on prompts; it depends on a measurable systems loop: organizing operator constraints, evaluation scaffolding, performance-analysis feedback, and historical trajectories into reusable workflows, then letting agents generate, validate, and retain candidates inside that loop. This process requires humans to continuously design and maintain the harness.

Full-Agent Optimization Trajectory

Fig. 10. Full-Agent optimization trajectories from LoongFlow trace logs. Gray dots are correctness-passing candidates, solid lines are running best, and dashed lines mark the FlashInfer baseline. (Image source: Ma et al., 2026)

The Full-Agent trajectories come from automatic search logs. Full-Agent can still find effective candidates for some operators, such as DSA Attention at 14.54×, but it remains slower than Agent-Assisted overall and even falls below the FlashInfer baseline on MoE FP8 and GDN Decode. This gap suggests that fully automated agent search is still difficult. High-quality human-provided reference implementations and continuously accumulated trajectory memory are often more efficient than asking agents to explore from scratch. Future systems need to incorporate controller state and historical memory into the harness while preserving strict final validation.
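
For readers who want to derive a running-best curve like the one in Fig. 10 from their own search logs, here is a minimal sketch. It assumes each log entry reduces to a hypothetical `(passed_correctness, latency_ms)` pair in search order; the real LoongFlow trace format is not shown in this post.

```python
from typing import Iterable, List, Optional, Tuple


def running_best(candidates: Iterable[Tuple[bool, float]]) -> List[Optional[float]]:
    """Best correctness-passing latency seen so far, one entry per candidate.

    Entries are None until the first candidate passes correctness.
    """
    best: Optional[float] = None
    trajectory: List[Optional[float]] = []
    for passed, latency_ms in candidates:
        # Candidates that fail correctness never update the best record,
        # no matter how fast they appear to be.
        if passed and (best is None or latency_ms < best):
            best = latency_ms
        trajectory.append(best)
    return trajectory


log = [(False, 0.1), (True, 0.9), (True, 1.2), (True, 0.5), (False, 0.2)]
print(running_best(log))  # [None, 0.9, 0.9, 0.5, 0.5]
```

Plateaus appear as repeated values in the output, and each strict decrease corresponds to one retained improvement.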

Future Work

  • Model-level optimization loop: Following the direction of AutoKernel (Jaber et al., 2026), kernel optimization can be extended from single operators to a model-level profile -> extract -> optimize -> verify workflow. The system should first use profilers to locate GPU bottlenecks in the model, then extract standalone Triton/CUDA kernels, and use Amdahl’s law to decide which kernel should be optimized next.

  • Experiment management and independent verification: Based on the publicly released contest writeups (FlashInfer Contest, 2026b), future harnesses should standardize benchmarks, correctness checks, input-shape scans, numerical stability tests, determinism checks, roofline analysis, promotion/rollback decisions, and artifact-structure constraints, while using an independent verifier to review candidate implementations.

  • Workload specialization and retrievable memory: The common pattern in high-scoring solutions is not blindly trying more kernels, but first understanding the workload distribution and then choosing implementation strategies that fit it. Future systems can structure common workload profiles, reusable optimization templates, successful candidates, and failure reasons, so that agents retrieve similar cases before generating code and know the applicable input shapes, known bottlenecks, and directions that should not be repeated.
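
The Amdahl's-law prioritization mentioned in the first item can be sketched as below. This is not AutoKernel's implementation: the function names, the profile fractions, and the expected per-kernel speedups are hypothetical values chosen for the example. Amdahl's law itself states that if a fraction p of runtime is accelerated by a factor s, the end-to-end speedup is 1 / ((1 - p) + p / s).

```python
from typing import Dict


def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when `fraction` of runtime gets `kernel_speedup` faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def next_kernel(profile: Dict[str, float], expected_speedup: Dict[str, float]) -> str:
    """Pick the kernel whose optimization yields the largest end-to-end gain.

    profile: kernel name -> fraction of total model runtime (from a profiler)
    expected_speedup: kernel name -> speedup we guess is achievable for that kernel
    """
    return max(profile, key=lambda k: amdahl_speedup(profile[k], expected_speedup[k]))


# Hypothetical profile: attention dominates runtime, but the MoE kernel
# is assumed to have more headroom.
profile = {"attention": 0.50, "moe": 0.30, "norm": 0.05}
expected = {"attention": 1.5, "moe": 3.0, "norm": 10.0}
print(next_kernel(profile, expected))  # moe
```

In this example the largest kernel is not the best target: a 1.5× gain on 50% of runtime gives 1.20× end to end, while a 3× gain on 30% gives 1.25×, and even a 10× gain on a 5% kernel gives under 1.05×.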

References

[1] FlashInfer Contest. “FlashInfer AI Kernel Generation Contest.” MLSys 2026 Competition, NVIDIA Track (2026a).

[2] Shui, Yue, Chenyu Ma, Hangfei Xu, Shengzhao Wen, and Yanpeng Wang. “Harness Engineering for LLM-Driven GPU Kernel Generation.” Technical Report (2026).

[3] Ma, Chenyu, Yue Shui, Hangfei Xu, Shengzhao Wen, and Yanpeng Wang. “Full-Agent Kernel Generation for FlashInfer @ MLSys 2026.” Technical Report (2026).

[4] Ouyang, Anne, et al. “KernelBench: Can LLMs Write Efficient GPU Kernels?” arXiv preprint arXiv:2502.10517 (2025).

[5] Xing, Shanli, et al. “FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems.” arXiv preprint arXiv:2601.00227 (2026).

[6] Yu, Yang, et al. “Towards Automated Kernel Generation in the Era of LLMs.” arXiv preprint arXiv:2601.15727 (2026).

[7] Liao, Gang, et al. “KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta.” arXiv preprint arXiv:2512.23236 (2025).

[8] Lopopolo, Ryan. “Harness Engineering: Leveraging Codex in an Agent-First World.” OpenAI Blog (2026).

[9] Rajasekaran, Prithvi. “Harness Design for Long-Running Application Development.” Anthropic Engineering Blog (2026).

[10] OpenAI. “Agent Skills.” OpenAI Developers (2026a).

[11] OpenAI. “Subagents.” OpenAI Developers (2026b).

[12] Wan, Chunhui, et al. “LoongFlow: Directed Evolutionary Search via a Cognitive Plan-Execute-Summarize Paradigm.” arXiv preprint arXiv:2512.24077 (2025).

[13] Sharma, Asankhaya. “OpenEvolve: Open-source Implementation of AlphaEvolve.” GitHub Repository (2025).

[14] Ye, Zihao, et al. “FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.” Proceedings of Machine Learning and Systems (2025).

[15] Jaber, Jaber, and Osama Jaber. “AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search.” arXiv preprint arXiv:2603.21331 (2026).

[16] FlashInfer Contest. “MLSys 2026 Contest Writeups.” GitHub Repository (2026b).

Citation

Citation: Please cite the original author and source when reposting or citing this post.

Cited as:

Yue Shui. (May 2026). GPU Kernel Generation and Optimization with Coding Agents: MLSys 2026 FlashInfer Contest Summary.
https://syhya.github.io/posts/2026-05-18-flashinfer-contest

Or

@article{syhya2026-mlsys26-flashinfer-contest,
  title   = "GPU Kernel Generation and Optimization with Coding Agents: MLSys 2026 FlashInfer Contest Summary",
  author  = "Yue Shui",
  journal = "syhya.github.io",
  year    = "2026",
  month   = "May",
  url     = "https://syhya.github.io/posts/2026-05-18-flashinfer-contest"
}