GPU Kernel Generation and Optimization with Coding Agents: MLSys 2026 FlashInfer Contest Summary

Recently, I participated in the MLSys 2026 - NVIDIA Track: FlashInfer AI Kernel Generation Contest (FlashInfer Contest, 2026a). This post is not a tutorial on CUDA kernel optimization, and I am not an expert in GPU operator development. My main goal was to use a highly verifiable task environment with clear feedback to study how coding agents can keep producing high-quality GPU kernels in a closed-loop workflow. The full materials are split into two reports: Harness Engineering for LLM-Driven GPU Kernel Generation (Shui et al., 2026) and Full-Agent Kernel Generation for FlashInfer (Ma et al., 2026). The code is available in mlsys26-flashinfer-contest. ...

Created: 2026-05-18 · Updated: 2026-05-25 · 10 min · 2046 words · Yue Shui

DeepSeek-V3.2 Series

By introducing DeepSeek Sparse Attention (DSA), a scalable reinforcement learning framework, and a large-scale agentic task synthesis pipeline, DeepSeek-V3.2 (DeepSeek-AI, 2025) achieves reasoning capabilities and agent performance comparable to GPT-5.
Fig. 1. Benchmark of DeepSeek-V3.2 and its counterparts. (Image source: DeepSeek-AI, 2025)
DeepSeek Sparse Attention
Fig. 2. Attention architecture of DeepSeek-V3.2, where DSA is instantiated under MLA. (Image source: DeepSeek-AI, 2025) ...

Created: 2025-12-31 · Updated: 2025-12-31 · 14 min · 2917 words · Yue Shui

Scaling Laws

From the evolution of the GPT series, researchers have gradually realized that as long as model parameters, training data, and compute resources are continuously scaled up, the performance of large models improves along a stable and predictable path. This predictability is characterized by Scaling Laws, which provide the theoretical foundation and practical confidence for high-cost pre-training. As model scale, alignment techniques, and inference-time compute co-evolve, the boundaries of AI capabilities are being systematically pushed. Scaling laws are not only the foundation for building next-generation models but also a key methodology for continuously improving model capabilities under compute constraints. ...

Created: 2025-11-19 · Updated: 2025-12-03 · 12 min · 2365 words · Yue Shui

Agentic RL

As Large Language Models (LLMs) achieve breakthroughs in natural language processing, their applications continue to expand. However, they also exhibit limitations such as knowledge cutoffs, hallucinations, and deficiencies in complex computation and logical reasoning. To address these challenges, Agentic RL, which combines agents with Reinforcement Learning (RL), is emerging as a key research direction. Agentic RL equips LLMs with capabilities such as autonomous planning, decision-making, tool use, and environmental interaction by placing them in a closed loop with the external world (e.g., search engines, code interpreters, databases, browsers) and continuously optimizing them through reward signals. In practical applications, such an agent not only understands requirements and plans autonomously but also continually corrects and improves its behavior within an execution-feedback loop. ...

Created: 2025-09-30 · Updated: 2025-09-30 · 24 min · 5072 words · Yue Shui

Large Language Model Inference

In recent years, Large Language Models (LLMs) have achieved revolutionary breakthroughs in fields such as natural language processing, code generation, and even multimodal interaction. However, the powerful capabilities of these models come at the cost of enormous computational and memory overhead, especially during the inference stage. Efficiently deploying and running these models, which have billions or even trillions of parameters, has become a core challenge in scaling LLM technology for real-world applications. ...

Created: 2025-06-29 · Updated: 2025-06-29 · 43 min · 9025 words · Yue Shui