Encoder vs Decoder LLM

XDA Developers on MSN

I tested Google's new Gemma 4 12B on my 8GB GPU, and now I don't want to go back to smaller models

Not bad for limited hardware ...

Tensordyne Claims Massive Speed and Power Improvement Over Nvidia

If simulations are to be believed, startup Tensordyne's new AI chip could crush the performance of market leader Nvidia in terms of energy efficiency and latency for inferencing. The company just sent ...

When Your LLM API Is Slow: Stop Guessing, Start Diagnosing

Your LLM endpoint is slow. P95 latency is spiking. Users are complaining. You open a terminal and... type nvidia-smi. Nothing looks obviously broken. You tweak max_num_seqs. Maybe better? Hard to say.

GitHub

Training-free sparse attention for long-context LLM decode

Training-free KV-cache routing and sparse attention for long-context decode on frozen pretrained LLMs: a from-scratch Triton sparse-decode kernel, a Blackwell wall-clock replication of ClusterKV-style ...

VentureBeat

Google's new open source Gemma 4 12B analyzes audio, video — and runs entirely locally on a typical 16GB enterprise laptop

While many AI open source model providers are pursuing larger and more powerful models, Google is still giving attention to the smaller, more ...

GitHub

A Lightweight LLM Inference Framework Based on Triton Kernels

Lumen is a lightweight, high-performance inference framework for large language models, built from the ground up using OpenAI Triton kernels. It achieves up to 4x speedup over HuggingFace Transformers ...

LLM Inference Challenges - Understanding Prefill vs Decode

At this point, the infrastructure picture starts becoming much clearer. But something important still feels confusing. Even after building massive GPU clusters: why does inference still become ...
