Measured by raw parallel throughput, your graphics card is the most powerful processor in your computer, and until recently, web browsers couldn't really use it. WebGL gave us 3D graphics, but the GPU's real potential—massive parallel computation—remained locked away.
WebGPU changes that. It's not just a better WebGL. It's a fundamental shift in what web applications can do.
This matters because the same GPU that renders video games can also run machine learning inference, physics simulations, video encoding, and cryptocurrency mining. When browsers gain full GPU access, web apps gain capabilities that previously required native code.
Let's understand what WebGPU is, why it exists, and what it enables.
The GPU: A Different Kind of Processor
Before diving into WebGPU, we need to understand what makes GPUs special.
CPU vs. GPU: Different Tools for Different Jobs
CPU (Central Processing Unit):
- Few powerful cores (4-16 typical)
- Optimized for sequential tasks
- Great at complex, branching logic
- Handles diverse workloads
GPU (Graphics Processing Unit):
- Many smaller cores (hundreds to thousands)
- Optimized for parallel tasks
- Great at doing the same thing to lots of data
- Handles specific workloads extremely fast
The Parallelism Advantage
Consider calculating brightness for every pixel in a 1920x1080 image:
CPU approach (sequential):
for (let i = 0; i < 2_073_600; i++) {
  // calculateBrightness is a placeholder for any per-pixel function
  pixels[i] = calculateBrightness(pixels[i]);
}
// ~2 million iterations, one at a time
GPU approach (parallel):
// All ~2 million pixels calculated simultaneously
// Each GPU core handles different pixels
The GPU might be 10-100x faster for this workload, not because each core is faster, but because thousands of cores work simultaneously.
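To make the parallel version concrete, here's a rough sketch of that per-pixel kernel as a WebGPU compute shader (WGSL, covered later in this post; the 20% brighten factor and flat buffer layout are illustrative assumptions):

// Hypothetical WGSL kernel: one invocation per pixel value
@group(0) @binding(0) var<storage, read_write> pixels: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3u) {
  // Each of the ~2 million invocations handles one element
  pixels[id.x] = clamp(pixels[id.x] * 1.2, 0.0, 1.0);
}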
What GPUs Excel At
- Graphics rendering: The original use case—transforming 3D geometry into 2D images
- Image/video processing: Filters, transformations, encoding/decoding
- Machine learning inference: Matrix multiplications, neural network layers
- Physics simulation: Particle systems, fluid dynamics
- Scientific computing: Financial modeling, cryptography, simulations
All of these involve doing similar operations on large amounts of data—perfect for parallelization.
The History: From WebGL to WebGPU
WebGL (2011): Graphics for the Web
WebGL brought hardware-accelerated 3D graphics to browsers. It was based on OpenGL ES 2.0, a graphics API designed in 2007 for mobile devices.
// WebGL: Verbose, state-machine based
const gl = canvas.getContext('webgl');
gl.clearColor(0.0, 0.0, 0.0, 1.0);
gl.enable(gl.DEPTH_TEST);
gl.useProgram(shaderProgram);
gl.bindBuffer(gl.ARRAY_BUFFER, vertexBuffer);
gl.drawArrays(gl.TRIANGLES, 0, vertexCount);
WebGL achievements:
- 3D games in the browser (HexGL, QuakeJS)
- Data visualization (Three.js, deck.gl)
- Creative tools (Figma's rendering)
- Maps (Google Maps 3D, Mapbox)
WebGL limitations:
- State machine API: Error-prone, hard to optimize
- Old design paradigm: Based on 2007-era graphics concepts
- Single-threaded: Can't prepare work on multiple threads
- Limited compute: No general-purpose GPU computing
- Driver overhead: High CPU cost for draw calls
By the mid-2010s, native graphics APIs had evolved dramatically: Metal (2014), DirectX 12 (2015), and Vulkan (2016). WebGL, meanwhile, was stuck in the OpenGL ES 2.0 era.
The Problem With Modernizing WebGL
Browsers couldn't just "update" WebGL because:
- Different native APIs: Windows uses DirectX, macOS uses Metal, Linux uses Vulkan
- Breaking changes: Modern paradigms are fundamentally different from WebGL's model
- Security requirements: Browsers need extra safety layers native apps don't
- Backward compatibility: Millions of WebGL sites must keep working
A new API was needed.
WebGPU: The Modern Solution
In 2017, Apple proposed an API it called WebGPU, while Google had been developing a similar effort called NXT (which became Dawn, Chromium's implementation). These efforts merged into a single specification developed through the W3C's GPU for the Web group.
Design goals:
- Abstract over Vulkan, Metal, and DirectX 12
- Enable general-purpose GPU computing
- Reduce CPU overhead (more draw calls, less state management)
- Support multi-threading (prepare work off main thread)
- Maintain browser security model
In 2023, Chrome shipped WebGPU. Firefox and Safari implementations are in progress.
WebGPU Architecture: The Key Concepts
Adapter and Device
// 1. Request adapter (physical GPU)
const adapter = await navigator.gpu.requestAdapter();
// 2. Request device (logical connection)
const device = await adapter.requestDevice();
Adapter: Represents the physical GPU. You can query capabilities and limits.
Device: Your application's connection to the GPU. All work goes through this.
This separation matters because:
- Multiple tabs/apps share the same physical GPU
- Each gets isolated logical access
- Crashes in one app don't affect others
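The adapter is also where you check what the hardware can do before asking for a device. A quick sketch (the specific feature and limits queried here are just examples):

// Optional features and hardware limits vary by GPU
console.log([...adapter.features]); // e.g. ['timestamp-query', ...]
console.log(adapter.limits.maxBufferSize);
console.log(adapter.limits.maxComputeWorkgroupSizeX);

// Opt into a feature only if the adapter supports it
const device = await adapter.requestDevice({
  requiredFeatures: adapter.features.has('timestamp-query')
    ? ['timestamp-query']
    : [],
});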
Buffers: GPU Memory
// Create buffer for vertex data
const vertexBuffer = device.createBuffer({
  size: vertices.byteLength,
  usage: GPUBufferUsage.VERTEX | GPUBufferUsage.COPY_DST,
});

// Copy data to GPU
device.queue.writeBuffer(vertexBuffer, 0, vertices);
Buffers are explicitly allocated and typed. You declare upfront how they'll be used—vertex data, uniform data, storage, etc. This explicitness enables GPU driver optimizations.
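For example, per-frame parameters typically live in a small uniform buffer that you overwrite each frame rather than reallocate. A sketch (the particular parameters are placeholders):

// A 16-byte uniform buffer, e.g. one vec4f of per-frame parameters
const uniformBuffer = device.createBuffer({
  size: 16,
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});

// Cheap to update in place every frame
device.queue.writeBuffer(uniformBuffer, 0, new Float32Array([time, width, height, 0]));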
Shaders: GPU Programs
// WGSL shader (WebGPU Shading Language)
const shaderModule = device.createShaderModule({
  code: `
    @vertex
    fn vertexMain(@location(0) position: vec3f) -> @builtin(position) vec4f {
      return vec4f(position, 1.0);
    }

    @fragment
    fn fragmentMain() -> @location(0) vec4f {
      return vec4f(1.0, 0.0, 0.0, 1.0); // Red
    }
  `,
});
WGSL (WebGPU Shading Language) is new. Unlike GLSL (WebGL's shader language), WGSL was designed for the modern GPU model with safety and portability in mind.
Pipelines: Execution Configuration
const pipeline = device.createRenderPipeline({
  layout: 'auto',
  vertex: {
    module: shaderModule,
    entryPoint: 'vertexMain',
    buffers: [
      /* vertex buffer layout */
    ],
  },
  fragment: {
    module: shaderModule,
    entryPoint: 'fragmentMain',
    targets: [{ format: canvasFormat }],
  },
  primitive: {
    topology: 'triangle-list',
  },
});
Pipelines bundle all the state needed for a draw call:
- Which shaders to use
- How to interpret vertex data
- Blending modes, depth testing, etc.
Creating pipelines is expensive, but using them is cheap. You create pipelines once, reuse many times.
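A common pattern is to build every pipeline up front, or cache them by configuration key. A sketch, assuming a makePipeline(device, key) helper you'd write yourself:

// Cache pipelines by a configuration key
const pipelineCache = new Map();

function getPipeline(device, key) {
  let pipeline = pipelineCache.get(key);
  if (!pipeline) {
    pipeline = makePipeline(device, key); // expensive: compile and validate
    pipelineCache.set(key, pipeline);
  }
  return pipeline; // cheap: reuse on every subsequent draw
}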
Command Encoding: Recording Work
// Create command encoder
const encoder = device.createCommandEncoder();

// Begin render pass
const pass = encoder.beginRenderPass({
  colorAttachments: [
    {
      view: context.getCurrentTexture().createView(),
      clearValue: { r: 0, g: 0, b: 0, a: 1 },
      loadOp: 'clear',
      storeOp: 'store',
    },
  ],
});

// Record drawing commands
pass.setPipeline(pipeline);
pass.setVertexBuffer(0, vertexBuffer);
pass.draw(3); // Draw 3 vertices

// End pass
pass.end();

// Submit to GPU
device.queue.submit([encoder.finish()]);
Key insight: You're not drawing immediately. You're recording commands that the GPU will execute later. This enables:
- Batching work efficiently
- Recording on background threads
- Optimizing command sequences
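In a real application this becomes a per-frame loop: the expensive objects (pipelines, buffers) persist, while command encoders are cheap and re-recorded every frame. A minimal sketch reusing the pipeline and vertex buffer from above:

function frame() {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginRenderPass({
    colorAttachments: [
      {
        // The canvas hands you a fresh texture each frame
        view: context.getCurrentTexture().createView(),
        clearValue: { r: 0, g: 0, b: 0, a: 1 },
        loadOp: 'clear',
        storeOp: 'store',
      },
    ],
  });
  pass.setPipeline(pipeline);
  pass.setVertexBuffer(0, vertexBuffer);
  pass.draw(3);
  pass.end();
  device.queue.submit([encoder.finish()]);
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);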
The Game Changer: Compute Shaders
Here's what truly sets WebGPU apart from WebGL: compute shaders.
What Are Compute Shaders?
Graphics shaders (vertex, fragment) are designed for rendering pipelines. Compute shaders are general-purpose—they just crunch data.
// WGSL compute shader: double every number
const computeShader = device.createShaderModule({
  code: `
    @group(0) @binding(0) var<storage, read_write> data: array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) id: vec3u) {
      data[id.x] = data[id.x] * 2.0;
    }
  `,
});
This shader runs across thousands of GPU cores simultaneously. Each core processes a different array element.
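The @workgroup_size attribute and the dispatch count together determine how many invocations run. The arithmetic, as a quick worked example:

// @workgroup_size(64) + dispatchWorkgroups(3)
// → 3 workgroups × 64 invocations = 192 invocations total
// → id.x ranges from 0 to 191
// → invocation id.x = 70 is local index 6 of workgroup 1 (64 + 6)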
Creating a Compute Pipeline
const computePipeline = device.createComputePipeline({
  layout: 'auto',
  compute: {
    module: computeShader,
    entryPoint: 'main',
  },
});

// Create buffer and upload data
const dataBuffer = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(dataBuffer, 0, data);

// Bind buffer to shader
const bindGroup = device.createBindGroup({
  layout: computePipeline.getBindGroupLayout(0),
  entries: [
    {
      binding: 0,
      resource: { buffer: dataBuffer },
    },
  ],
});

// Dispatch compute work
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(computePipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(data.length / 64));
pass.end();
device.queue.submit([encoder.finish()]);
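One subtlety: when the element count isn't a multiple of the workgroup size, the final workgroup includes invocations past the end of the data. The usual guard is a bounds check against the array's runtime length, sketched here:

// Guard against out-of-range invocations in the shader
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3u) {
  // arrayLength() returns the runtime size of the bound buffer
  if (id.x >= arrayLength(&data)) {
    return;
  }
  data[id.x] = data[id.x] * 2.0;
}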
Why Compute Shaders Enable ML
Machine learning inference is fundamentally about matrix math:
output = activation(weights * input + bias)
This involves:
- Matrix multiplication (massive parallelism opportunity)
- Element-wise operations (add, multiply)
- Activation functions (apply same function to each element)
All perfect for GPU parallelization. A neural network layer that takes 100ms on CPU might take 1ms on GPU.
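To make this concrete, here's a minimal sketch of a naive matrix multiply as a compute shader: one invocation per output element. Production ML runtimes use heavily tiled and vectorized variants, and the vec4u dims layout here is an assumption for the sketch:

// Naive matmul: a is M×K, b is K×N, out is M×N
const matmulShader = device.createShaderModule({
  code: `
    @group(0) @binding(0) var<storage, read> a: array<f32>;
    @group(0) @binding(1) var<storage, read> b: array<f32>;
    @group(0) @binding(2) var<storage, read_write> out: array<f32>;
    @group(0) @binding(3) var<uniform> dims: vec4u; // (M, N, K, unused)

    @compute @workgroup_size(8, 8)
    fn main(@builtin(global_invocation_id) id: vec3u) {
      let row = id.y;
      let col = id.x;
      if (row >= dims.x || col >= dims.y) {
        return;
      }
      var sum = 0.0;
      for (var k = 0u; k < dims.z; k++) {
        sum += a[row * dims.z + k] * b[k * dims.y + col];
      }
      out[row * dims.y + col] = sum;
    }
  `,
});
// Dispatched as dispatchWorkgroups(ceil(N / 8), ceil(M / 8))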
Before WebGPU: Browser ML used WebGL hacks—encoding matrices as textures, using fragment shaders for computation. It worked, but was awkward and limited.
With WebGPU: Compute shaders directly express ML operations. Libraries like ONNX Runtime Web and Transformers.js can target WebGPU naturally.
What WebGPU Enables
1. Browser-Based ML Inference
// Conceptual: run inference with a WebGPU backend
// (Transformers.js v3, published as @huggingface/transformers,
// exposes this as the `device` option; exact API may change)
import { pipeline } from '@huggingface/transformers';

const classifier = await pipeline('sentiment-analysis', null, {
  device: 'webgpu', // use GPU acceleration
});

const result = await classifier('WebGPU is amazing!');
// [{ label: 'POSITIVE', score: 0.9998 }]
Real ML models running at near-native speed in a browser tab:
- Image classification
- Object detection
- Text generation (LLMs!)
- Speech recognition
- Translation
2. Advanced Graphics
// Clustered forward rendering, GPU-driven culling, etc.
// Techniques that were too expensive in WebGL
Modern rendering techniques become practical:
- Deferred rendering
- GPU particle systems
- Volumetric effects
- Real-time global illumination
3. Scientific Simulation
// Fluid simulation with compute shaders
// Each cell updated in parallel
Physics simulations that were previously server-side or native-only:
- Fluid dynamics
- N-body gravity
- Molecular dynamics
- Weather modeling (simplified)
4. Video Processing
// Real-time video effects
// Background removal, style transfer, upscaling
Effects applied per-frame in real-time:
- Background blur (like Zoom/Meet)
- Style transfer
- Super-resolution upscaling
- Real-time color grading
5. Cryptography
// Parallel hash computation
// (Note: Responsible use required)
Cryptographic operations benefit from parallelism:
- Proof-of-work (yes, browser mining exists)
- Bulk encryption/decryption
- Hash verification
WebGPU vs. WebGL: The Real Differences
| Aspect | WebGL | WebGPU |
|---|---|---|
| API Model | State machine | Object-oriented |
| Compute Shaders | No (hacks only) | Yes, first-class |
| Multi-threading | Limited | Full support |
| Draw Call Overhead | High | Low |
| Shader Language | GLSL | WGSL |
| Error Handling | Silent failures | Explicit errors |
| Pipeline State | Global, mutable | Pre-baked objects |
| Browser Support | Universal | Chromium-based (others coming) |
Code Comparison: Drawing a Triangle
WebGL (abridged; a production version with error handling runs considerably longer):
// Initialize WebGL context
const gl = canvas.getContext('webgl');

// Compile vertex shader
const vertexShader = gl.createShader(gl.VERTEX_SHADER);
gl.shaderSource(
  vertexShader,
  `
    attribute vec4 position;
    void main() { gl_Position = position; }
  `
);
gl.compileShader(vertexShader);

// Compile fragment shader
const fragmentShader = gl.createShader(gl.FRAGMENT_SHADER);
gl.shaderSource(
  fragmentShader,
  `
    precision mediump float;
    void main() { gl_FragColor = vec4(1, 0, 0, 1); }
  `
);
gl.compileShader(fragmentShader);

// Link program
const program = gl.createProgram();
gl.attachShader(program, vertexShader);
gl.attachShader(program, fragmentShader);
gl.linkProgram(program);

// Set up buffer
const buffer = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
gl.bufferData(
  gl.ARRAY_BUFFER,
  new Float32Array([0, 0.5, -0.5, -0.5, 0.5, -0.5]),
  gl.STATIC_DRAW
);

// Draw
gl.useProgram(program);
const positionLocation = gl.getAttribLocation(program, 'position');
gl.enableVertexAttribArray(positionLocation);
gl.vertexAttribPointer(positionLocation, 2, gl.FLOAT, false, 0, 0);
gl.drawArrays(gl.TRIANGLES, 0, 3);
WebGPU (still verbose, but more explicit):
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
const context = canvas.getContext('webgpu');
const format = navigator.gpu.getPreferredCanvasFormat();
context.configure({ device, format });

const shaderModule = device.createShaderModule({
  code: `
    @vertex fn vs(@builtin(vertex_index) i: u32) -> @builtin(position) vec4f {
      var pos = array<vec2f, 3>(
        vec2f(0, 0.5), vec2f(-0.5, -0.5), vec2f(0.5, -0.5)
      );
      return vec4f(pos[i], 0, 1);
    }
    @fragment fn fs() -> @location(0) vec4f {
      return vec4f(1, 0, 0, 1);
    }
  `,
});

const pipeline = device.createRenderPipeline({
  layout: 'auto',
  vertex: { module: shaderModule, entryPoint: 'vs' },
  fragment: { module: shaderModule, entryPoint: 'fs', targets: [{ format }] },
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginRenderPass({
  colorAttachments: [
    {
      view: context.getCurrentTexture().createView(),
      loadOp: 'clear',
      storeOp: 'store',
    },
  ],
});
pass.setPipeline(pipeline);
pass.draw(3);
pass.end();
device.queue.submit([encoder.finish()]);
WebGPU isn't necessarily shorter, but it's more explicit about what's happening—and that explicitness enables optimizations.
Current Browser Support
As of late 2024:
| Browser | Status |
|---|---|
| Chrome | Shipped (113+) |
| Edge | Shipped (113+) |
| Opera | Shipped |
| Firefox | In development (Nightly) |
| Safari | In development (Technology Preview) |
Practical advice: feature-detect WebGPU and fall back to WebGL for broader compatibility. Note that the presence of navigator.gpu doesn't guarantee a usable GPU; requestAdapter() can still return null.

if ('gpu' in navigator) {
  const adapter = await navigator.gpu.requestAdapter();
  if (adapter) {
    // Use WebGPU
  } else {
    // navigator.gpu exists, but no suitable adapter: fall back to WebGL
  }
} else if ('WebGLRenderingContext' in window) {
  // Fall back to WebGL
} else {
  // No GPU support
}
Performance: Real Numbers
Benchmark comparisons (representative; your mileage may vary):
ML Inference (BERT-base)
- CPU (JavaScript): ~800ms per inference
- WebGL backend: ~150ms
- WebGPU backend: ~30ms
Particle Simulation (1M particles)
- CPU: 15 FPS
- WebGL: 45 FPS
- WebGPU: 60 FPS (compute shader)
Image Processing (4K image, blur)
- CPU: 500ms
- WebGL: 80ms
- WebGPU: 20ms
The gains are most dramatic for compute-heavy workloads that parallelize well.
The ML Connection: Why This Matters for AI in the Browser
WebGPU is the foundation that makes browser-based AI practical:
- LLM inference: Running language models requires matrix operations that benefit from GPU parallelism
- Image AI: Vision models (classification, segmentation, generation) need GPU compute
- Real-time AI: Video effects, pose detection, background removal—all need low-latency GPU access
- Private AI: Running models locally (no server round-trip) requires efficient GPU use
Without WebGPU, browser AI is limited to:
- Small models only
- High latency
- Battery drain (CPU is less efficient)
- Janky user experience
With WebGPU:
- Larger models become practical
- Near-real-time inference
- Efficient power usage
- Smooth 60fps integration
This is why WebGPU is a prerequisite for browser-native AI APIs. You can't have navigator.llm without the GPU infrastructure to make it fast enough to be useful.
Getting Started with WebGPU
Learn the Fundamentals
1. Raw WebGPU: Start with the basics
2. Use a Library: For practical projects
   - Three.js: Now has a WebGPU renderer
   - Babylon.js: WebGPU support
   - wgpu-matrix: Math utilities
Simple Compute Example
Here's a minimal compute shader that adds two arrays:
// Full working example: Add two arrays on GPU
async function gpuAdd(a, b) {
  const adapter = await navigator.gpu.requestAdapter();
  const device = await adapter.requestDevice();

  const shader = device.createShaderModule({
    code: `
      @group(0) @binding(0) var<storage, read> a: array<f32>;
      @group(0) @binding(1) var<storage, read> b: array<f32>;
      @group(0) @binding(2) var<storage, read_write> result: array<f32>;

      @compute @workgroup_size(64)
      fn main(@builtin(global_invocation_id) id: vec3u) {
        result[id.x] = a[id.x] + b[id.x];
      }
    `,
  });

  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module: shader, entryPoint: 'main' },
  });

  // Create buffers
  const size = a.byteLength;
  const bufferA = device.createBuffer({
    size,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  const bufferB = device.createBuffer({
    size,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  const bufferResult = device.createBuffer({
    size,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });
  const bufferRead = device.createBuffer({
    size,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });

  device.queue.writeBuffer(bufferA, 0, a);
  device.queue.writeBuffer(bufferB, 0, b);

  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: bufferA } },
      { binding: 1, resource: { buffer: bufferB } },
      { binding: 2, resource: { buffer: bufferResult } },
    ],
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(a.length / 64));
  pass.end();
  encoder.copyBufferToBuffer(bufferResult, 0, bufferRead, 0, size);
  device.queue.submit([encoder.finish()]);

  await bufferRead.mapAsync(GPUMapMode.READ);
  const result = new Float32Array(bufferRead.getMappedRange().slice(0));
  bufferRead.unmap();
  return result;
}
// Usage
const a = new Float32Array([1, 2, 3, 4]);
const b = new Float32Array([5, 6, 7, 8]);
const result = await gpuAdd(a, b);
console.log(result); // Float32Array [6, 8, 10, 12]
Yes, it's verbose for adding numbers. But when you're processing millions of values, the parallelism makes it worthwhile.
Conclusion
WebGPU represents a fundamental expansion of web platform capabilities. By exposing modern GPU features—especially compute shaders—it enables application categories that were previously native-only:
- Real-time machine learning inference
- Advanced graphics and visualization
- Scientific simulation
- High-performance media processing
For AI in the browser specifically, WebGPU is foundational. The same infrastructure that enables WebGPU-accelerated ML inference will eventually enable native browser AI APIs.
When you hear about running LLMs in the browser or AI-powered web apps, WebGPU is what makes it possible. It's not just a graphics API—it's the compute layer that the next generation of web applications will be built on.
