Morph Models

Fast general models for agent loops

Run the primary agent loop on fast, OpenAI-compatible coding models served on Morph's custom kernels. One API for chat, code generation, and reasoning.

Same coding job on two stacks: baseline serving grinds line by line while Kimi K3 on Morph snaps every edit in and settles at 100 tok/s

Frontier coding models, served on custom kernels

Output speed

Codegen-specific optimizations and custom GPU kernels. Up to 150 tok/s on DeepSeek V4 Flash, measured on private deployments with custom speculators. Public endpoints run on shared capacity.
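To get a feel for what these rates mean inside an agent loop, the arithmetic is simply output tokens divided by tokens per second. The sketch below is illustrative only: the 3,000-token edit is a made-up size, `generationSeconds` is a hypothetical helper name, and the estimate ignores time to first token, queueing, and network overhead.

```typescript
// Rough decode-time estimate: output tokens divided by output rate.
// Ignores time to first token, queueing, and network latency.
function generationSeconds(outputTokens: number, tokensPerSecond: number): number {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  return outputTokens / tokensPerSecond;
}

// Hypothetical 3,000-token edit at the two rates quoted on this page.
const editTokens = 3000;
const secondsAt100 = generationSeconds(editTokens, 100); // 30 seconds
const secondsAt150 = generationSeconds(editTokens, 150); // 20 seconds
console.log({ secondsAt100, secondsAt150 });
```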

One OpenAI-compatible API

Point your existing client at api.morphllm.com. Switch models by changing one string.

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY,
});

const res = await client.chat.completions.create({
  model: "morph-glm52-744b",
  messages: [{ role: "user", content: "Refactor this function..." }],
});
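Because the model is just a string in the request, switching models can live in one small helper. The following is a minimal sketch, assuming the endpoint accepts the standard OpenAI chat-completions request body at `/chat/completions` under the base URL above and returns the usual `choices[0].message.content` shape. `buildChatRequest` and `complete` are hypothetical helper names, and the sketch uses `fetch` directly so it runs without the SDK.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
}

// Base URL taken from the snippet above.
const MORPH_BASE_URL = "https://api.morphllm.com/v1";

// Pure helper: the only field that changes between models is `model`.
function buildChatRequest(model: string, prompt: string): ChatRequest {
  return { model, messages: [{ role: "user", content: prompt }] };
}

// Assumption: the response follows the OpenAI chat-completions shape.
// Nothing is sent until this function is called.
async function complete(model: string, prompt: string, apiKey: string): Promise<string> {
  const res = await fetch(`${MORPH_BASE_URL}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildChatRequest(model, prompt)),
  });
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  const data = (await res.json()) as {
    choices: { message: { content: string } }[];
  };
  return data.choices[0]?.message.content ?? "";
}
```

Swapping `"morph-glm52-744b"` for `"morph-qwen36-27b"` in a call to `complete` changes the model and nothing else in the request.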

The lineup

Open-weight frontier models with long context, served and billed per token. No per-seat fees.

// Available general models
morph-glm52-744b       // 744B MoE, 1M context
morph-minimax3-428b    // 428B MoE, 256k context
morph-minimax27-230b   // 230B MoE, agentic workflows
morph-dsv4flash        // 1M context, fast
morph-qwen36-27b       // dense, low latency
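For agents that pick a model per task, the list above can be captured as a small typed catalog. This is a sketch, not an official schema: the `ModelInfo` shape and `modelsWithContext` helper are hypothetical, "1M" and "256k" are read here as 1,000,000 and 256,000 tokens, and context windows the list does not state are left undefined rather than guessed.

```typescript
interface ModelInfo {
  id: string;
  note: string;
  // Context window in tokens; undefined when the list above does not state one.
  contextTokens?: number;
}

const MODELS: ModelInfo[] = [
  { id: "morph-glm52-744b", note: "744B MoE", contextTokens: 1000000 },
  { id: "morph-minimax3-428b", note: "428B MoE", contextTokens: 256000 },
  { id: "morph-minimax27-230b", note: "230B MoE, agentic workflows" },
  { id: "morph-dsv4flash", note: "fast", contextTokens: 1000000 },
  { id: "morph-qwen36-27b", note: "dense, low latency" },
];

// Return the IDs of models whose stated context window fits the prompt.
// Models without a stated window are skipped, since fit cannot be confirmed.
function modelsWithContext(minTokens: number): string[] {
  return MODELS
    .filter((m) => m.contextTokens !== undefined && m.contextTokens >= minTokens)
    .map((m) => m.id);
}

console.log(modelsWithContext(500000));
```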

Built for production agent workloads

Low latency

Custom GPU kernels and speculative decoding tuned for the code-generation workload.

High throughput

Batched serving across a GPU fleet. Private deployments serve over 100 billion tokens per day.

Private deployments

Dedicated capacity with custom speculators and caching, in our cloud or yours, at large discounts over public pricing.

Inference optimized for coding agents

Every agent will write code. We bet the stack on it.

So we tune every layer for that one workload: custom GPU kernels, speculative decoding shaped around code, and serving built for the agent loop instead of general chat. Not general infrastructure with code bolted on.
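For readers unfamiliar with speculative decoding, the toy below shows the general idea in its simplest greedy form: a cheap draft model guesses several tokens, the target model checks them, and matching guesses are kept. It is a teaching sketch under stated assumptions, not a description of Morph's serving stack. The `NextToken` type, `speculativeStep`, and both toy models are hypothetical, and production systems that sample use a probabilistic accept/reject rule rather than exact matching.

```typescript
// A "model" here is just a function from the tokens so far to the next token.
type NextToken = (context: readonly string[]) => string;

// One speculative step with greedy verification.
function speculativeStep(
  target: NextToken,
  draft: NextToken,
  context: readonly string[],
  draftLength: number,
): { tokens: string[]; accepted: number } {
  // 1. The cheap draft model proposes `draftLength` tokens autoregressively.
  const proposed: string[] = [];
  for (let i = 0; i < draftLength; i++) {
    proposed.push(draft([...context, ...proposed]));
  }

  // 2. The target checks each position. A real server scores all positions
  //    in one batched forward pass; this toy calls the target per position.
  const tokens: string[] = [];
  let accepted = 0;
  for (const guess of proposed) {
    const truth = target([...context, ...tokens]);
    tokens.push(truth);
    if (truth !== guess) {
      return { tokens, accepted }; // first mismatch: keep the target's token and stop
    }
    accepted++;
  }

  // 3. Every guess matched, so the target contributes one more token.
  tokens.push(target([...context, ...tokens]));
  return { tokens, accepted };
}

// Toy target that replays a fixed script, and a draft that errs at position 2.
const script = ["return", "a", "+", "b", ";"];
const toyTarget: NextToken = (ctx) => script[ctx.length] ?? "<eos>";
const toyDraft: NextToken = (ctx) => (ctx.length === 2 ? "-" : toyTarget(ctx));

console.log(speculativeStep(toyTarget, toyDraft, [], 4));
```

The output always matches what the target alone would have produced with greedy decoding; the draft only changes how many tokens each verification pass yields.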

Private deployments

The fastest endpoints are private deployments

The headline speeds on this page come from dedicated deployments, not the shared public endpoints. At scale we build the deployment around your traffic: speculators trained on it, caching tuned to it, and pricing well under public rates.

100B+

tokens per day served across private deployments

Custom speculators

trained on your traffic for higher acceptance rates

Custom caching

prefix and KV caching tuned to your workload

Volume pricing

large discounts over public per-token rates
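Why acceptance rate matters can be shown with the standard back-of-envelope model for speculative decoding. The sketch below assumes each drafted token is accepted independently with the same probability, which real traffic only approximates. The function name is hypothetical and the rates in the loop are illustrative, not measured figures.

```typescript
// Expected tokens emitted per target-model pass, assuming each drafted token
// is accepted independently with probability `acceptanceRate`.
// Formula: (1 - a^(k+1)) / (1 - a), which includes the one token the target
// always contributes on each pass.
function expectedTokensPerPass(acceptanceRate: number, draftLength: number): number {
  if (acceptanceRate < 0 || acceptanceRate > 1) {
    throw new RangeError("acceptanceRate must be in [0, 1]");
  }
  if (acceptanceRate === 1) {
    return draftLength + 1; // every draft token accepted, plus the target's token
  }
  return (1 - Math.pow(acceptanceRate, draftLength + 1)) / (1 - acceptanceRate);
}

// Illustrative acceptance rates with a draft length of 4.
for (const rate of [0.6, 0.8, 0.9]) {
  console.log(rate, expectedTokensPerPass(rate, 4).toFixed(2));
}
```

Under this model, a speculator that matches the workload more closely yields more tokens per target pass at the same draft length.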

Talk to us about a private deployment

Sign up and start free. No card required.

Add a billing method for $10 in free credits and full rate limits.

