Over 9000 t/s! Ultra fast JSON LLMs

Today, we benchmarked a checkpoint of inversion-xs, a smaller Inversion structured language model we've been working on. It is remarkably good at some tasks despite its size, and reached over 9000 tokens per second of inference (counted with the GPT-4 tokenizer).

See below for an example of perfect typed extraction in milliseconds.

[Figure: Speed. Inference speed in characters per second at the 33rd, 66th, and 99th percentiles across 600 extraction & reasoning tests (higher is better).]

Our first generation models are state of the art in structured tasks such as extraction and function calling while running up to 100× faster, with 10× lower latency, outputting 100% reliable structure with 10,000× less overhead than the best alternatives, and boasting the deepest support for typed JSON output available anywhere.*

Inversion models do more with less: they use less compute, less time, and less data to produce outputs with higher quality, reliability, and reasoning.

Example: typed JSON extraction

Here's a fun example of a typed JSON extraction task we wrote a few weeks ago and just tested on the inversion-xs model today. Say we're building an app that needs to recognize people and their details within unstructured text. We want to extract data from the following input prompt:

"hey so let me spill on everyone real quick: first off, ryan smith, total older dude. dude's 28 but still lit, y'know? love his blue thing, everything's gotta be blue. jessica brown, she's our age, 22, and she's totally into yellow. can't get enough of it. it's like sunshine, y'know? in her soul or something, lol. then there's jackie johnson, omg she's such a red chick, always fiery. she's 26 by the way. oh, and i almost forgot about sam baker, 24, quieter type, kinda weird but cool. he's all about greens. i don't know, guess it's chill. and bob, bob jones lol. he's 30, bit older but totally rad. pink's his thing. pretty rare for a guy huh? and last there's lily thomas, 21, smallest among us. to match, she's black all the way. total goth. so that's all i got for now. TTYL xoxo."

We want to use the following TypeScript type to extract the data reliably:

type People = {
  name: {
    first: string
    last: string
  }
  age: number
  favoriteColor: string
}[]

To solve this, we just need to make a typed completion with the Inversion API using the prompt and a corresponding JSON schema or regular expression.

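import { z } from 'zod'

// `ai` is an initialized Inversion API client and `prompt` holds the
// input text shown above (client setup not shown here).
await ai.completions.create({
  prompt,
  // Zod schema mirroring the People type: an array of person objects.
  schema: z.array(z.object({
    name: z.object({
      first: z.string(),
      last: z.string(),
    }),
    age: z.number().int(), // whole numbers only
    favoriteColor: z.string(),
  }))
})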
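
For readers who would rather pass a plain JSON schema than a Zod object, here is a minimal sketch of what an equivalent schema for the People type could look like. The constant name `peopleJsonSchema` is made up for illustration, `additionalProperties: false` is an extra strictness choice of this sketch, and whether the Inversion API accepts this exact object shape is an assumption, so check the docs before relying on it.

```typescript
// Hypothetical: a plain JSON Schema mirroring the People type.
// Whether the Inversion API accepts this exact object is an assumption.
const peopleJsonSchema = {
  type: "array",
  items: {
    type: "object",
    properties: {
      name: {
        type: "object",
        properties: {
          first: { type: "string" },
          last: { type: "string" },
        },
        required: ["first", "last"],
        additionalProperties: false, // sketch-only choice: reject extra keys
      },
      age: { type: "integer" }, // matches z.number().int()
      favoriteColor: { type: "string" },
    },
    required: ["name", "age", "favoriteColor"],
    additionalProperties: false, // sketch-only choice: reject extra keys
  },
} as const;

// Print the schema as it would be serialized in a request body.
console.log(JSON.stringify(peopleJsonSchema, null, 2));
```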

Like magic, Inversion returns the following JSON output with perfectly correct answers in just 28.2 ms plus network latency with inversion-txs (turbo mode), equivalent to over 9000 tokens per second using the GPT-4 tokenizer:

{
  "people": [
    {
      "name": {
        "first": "ryan",
        "last": "smith"
      },
      "age": 28,
      "favoriteColor": "blue"
    },
    {
      "name": {
        "first": "jessica",
        "last": "brown"
      },
      "age": 22,
      "favoriteColor": "yellow"
    },
    {
      "name": {
        "first": "jackie",
        "last": "johnson"
      },
      "age": 26,
      "favoriteColor": "red"
    },
    {
      "name": {
        "first": "sam",
        "last": "baker"
      },
      "age": 24,
      "favoriteColor": "green"
    },
    {
      "name": {
        "first": "bob",
        "last": "jones"
      },
      "age": 30,
      "favoriteColor": "pink"
    },
    {
      "name": {
        "first": "lily",
        "last": "thomas"
      },
      "age": 21,
      "favoriteColor": "black"
    }
  ]
}

The cheaper inversion-xs mode completes the same request in 316.2 ms, which is still quite fast at over 800 tokens per second, and very affordable.
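
As a quick sanity check on the two speed figures, here is a small arithmetic sketch. The helper names are made up for illustration, and the exact output token count is not stated in this post, so the code only works backwards from the quoted rates and times.

```typescript
// Tokens per second = output tokens / generation time in seconds.
function tokensPerSecond(outputTokens: number, elapsedMs: number): number {
  return (outputTokens * 1000) / elapsedMs;
}

// Inverse: the output token count implied by a quoted rate and time.
function impliedTokens(rate: number, elapsedMs: number): number {
  return (rate * elapsedMs) / 1000;
}

// Figures quoted in the post: 9000 t/s at 28.2 ms, 800 t/s at 316.2 ms.
const turboTokens = impliedTokens(9000, 28.2); // roughly 254 tokens
const xsTokens = impliedTokens(800, 316.2); // roughly 253 tokens

// How many times longer the cheaper mode took on the same request.
const slowdown = 316.2 / 28.2; // roughly 11x

console.log(turboTokens.toFixed(1), xsTokens.toFixed(1), slowdown.toFixed(1));
```

Both figures imply an output of a little over 250 GPT-4 tokens, which fits the two modes returning the same JSON at different speeds.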

The best part is that the models are forced to return exactly the type we asked for, with a 0% type error rate across all of our JSON schema completion tests, even for the smallest Inversion models. They can still hallucinate content and fail a lot, just like any contemporary model, but at least the output is always valid against whatever type you ask for.
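
To make the difference between a valid type and a correct answer concrete, here is a minimal sketch of a client-side check against the People type, plus a type error rate helper in the sense used in this post (the percentage of outputs that fail to parse against the schema). All function names are hypothetical and this is not how Inversion enforces types internally.

```typescript
type Person = {
  name: { first: string; last: string };
  age: number;
  favoriteColor: string;
};

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Structural check only: says nothing about whether the values are true.
function isPerson(v: unknown): v is Person {
  if (!isRecord(v)) return false;
  const name = v["name"];
  const age = v["age"];
  if (!isRecord(name)) return false;
  return (
    typeof name["first"] === "string" &&
    typeof name["last"] === "string" &&
    typeof age === "number" &&
    Number.isInteger(age) &&
    typeof v["favoriteColor"] === "string"
  );
}

function isPeople(v: unknown): v is Person[] {
  return Array.isArray(v) && v.every(isPerson);
}

// Percentage of raw outputs that are not valid JSON of type People.
function typeErrorRate(rawOutputs: string[]): number {
  if (rawOutputs.length === 0) return 0;
  let failures = 0;
  for (const raw of rawOutputs) {
    try {
      if (!isPeople(JSON.parse(raw))) failures++;
    } catch {
      failures++; // not even parseable JSON
    }
  }
  return (failures / rawOutputs.length) * 100;
}

// Structurally valid, yet the age contradicts the input text (28, not 82).
const wrongButValid =
  '[{"name":{"first":"ryan","last":"smith"},"age":82,"favoriteColor":"blue"}]';
console.log(isPeople(JSON.parse(wrongButValid)));
```

The last example passes the type check even though its content is wrong, which is why content accuracy still has to be evaluated separately from type validity.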

For some tasks like the above extraction task, an ultra-fast and ultra-small model works perfectly fine, so there's no reason to use a large and expensive model for the job.

Read more about the supported types and constraints in the docs.

The developer experience of knowing you're going to get exactly the type you ask for is bliss.

Always-valid outputs are a game changer for structured workloads, dramatically improving the reliability and reasoning level of LLMs across most tasks. Inversion's reliable typed output is powered by our best-in-class compiler & runtime constraint system - read more about it here.

What's next?

We're expanding access to the first generation of Inversion models shortly as we scale capacity, and we're excited to get all of our models into your hands to help you build whatever you can dream up with more reliability, efficiency, and speed than ever before.

Join us on this journey as we share insights & access to the technology we're building.

Reach us on X @RysanaAI.

Summary

We've created Inversion - a family of structured language models designed to solve the speed, reliability, and reasoning issues in traditional AI systems.

Be among the first to try Inversion. Sign up for early access.

*All numbers are approximate and based on 1000 tests run March 18-20, 2024. Results may vary. The models are experimental and not yet finalized. Models are tested by a single client making sequential requests to each model API server in rapid succession. Inference speed is calculated as the total output characters divided by the total request time. Throughput is calculated as the total output characters divided by the total request time minus the time to first tokens. Latency is calculated as the time from the start of the request to receiving the first tokens on the requesting client. Type error rate is the percentage of outputs that fail to parse according to the requested schema.
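
The metric definitions above can be sketched in a few lines. The `RequestSample` shape and its field names are hypothetical, and the nearest-rank percentile method is an assumption, since the post does not say how percentiles were computed.

```typescript
// Hypothetical per-request measurement; field names are illustrative.
interface RequestSample {
  outputChars: number; // characters in the completed output
  totalMs: number; // start of request to end of response
  firstTokenMs: number; // start of request to first tokens (latency)
}

// Inference speed: output characters divided by total request time.
function inferenceSpeed(s: RequestSample): number {
  return (s.outputChars * 1000) / s.totalMs; // chars per second
}

// Throughput: output characters divided by time after the first tokens.
// Assumes totalMs is strictly greater than firstTokenMs.
function throughput(s: RequestSample): number {
  return (s.outputChars * 1000) / (s.totalMs - s.firstTokenMs);
}

// Nearest-rank percentile (an assumed method) over a non-empty list.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Made-up sample, not a measurement from the benchmark.
const sample: RequestSample = { outputChars: 1000, totalMs: 500, firstTokenMs: 100 };
console.log(inferenceSpeed(sample), throughput(sample), sample.firstTokenMs);
```

Throughput is always at least as high as inference speed for the same request, because it excludes the wait before the first tokens arrive.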