Input Rules as Intent: Streaming AI into a Rich Text Editor

An agent streams a response into a document. On the other side, your collaborator watches it happen: characters appear, and then # Hello flickers into a proper heading mid-sentence. The markdown disappears. A formatted heading sits where raw text was a moment ago.

That works because of input rules. In ProseMirror, you can set up patterns that watch what's being typed and transform the document when they match. Users already have these: you type # and the paragraph becomes a heading, you type 1. and a numbered list starts. It's been a solved problem for years.

So when we got to "okay, the LLM needs to create headings too," the answer was kind of obvious. The agent is just typing characters into the same editor. If a user typing # triggers the heading rule, and the agent also types # ... just let the same rule fire. And it works! The agent types, the input rules pick it up, and your collaborator sees structured content appearing in real time. No parsing, no intermediate format between the model and the editor.
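If you haven't worked with input rules before, here is a rough sketch of the idea: a pattern checked against the block's text after every keystroke, plus a transform that runs when it matches. This is a toy model, not ProseMirror's API. The names Block, InputRule, typeChar, and typeAll are made up for this example, and a real editor works on a document tree rather than a single block.

```typescript
// Toy model of an input rule. All names here are hypothetical, not ProseMirror's API.
type Block = { type: "paragraph" | "heading" | "ordered_list_item"; text: string };

interface InputRule {
  pattern: RegExp; // checked against the block's text after each keystroke
  apply(block: Block): Block; // transform to run when the pattern matches
}

const rules: InputRule[] = [
  // "# " at the start of a paragraph turns it into a heading.
  { pattern: /^# $/, apply: () => ({ type: "heading", text: "" }) },
  // Digits, a dot, and a space at the start of a paragraph start a numbered list item.
  { pattern: /^\d+\. $/, apply: () => ({ type: "ordered_list_item", text: "" }) },
];

// Append one character, then let the first matching rule rewrite the block.
function typeChar(block: Block, ch: string): Block {
  const next: Block = { type: block.type, text: block.text + ch };
  if (next.type !== "paragraph") return next; // in this sketch, rules only fire in paragraphs
  for (const rule of rules) {
    if (rule.pattern.test(next.text)) return rule.apply(next);
  }
  return next;
}

// The caller can be a keyboard handler or a loop over a model's stream: same path either way.
function typeAll(text: string): Block {
  let block: Block = { type: "paragraph", text: "" };
  for (const ch of text) block = typeChar(block, ch);
  return block;
}
```

Feeding it "# Hello" one character at a time ends with a heading whose text is "Hello", and the "# " is gone from the text, which is the flicker described at the top.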

The problem shows up the moment the agent tries to create a list.

Suppose the agent is responding to something you asked, and part of its completion (that's the full response the model streams back, token by token) looks like this:

Here is a list!
1. One
2. Two
And it's done!

Simple enough. A line of text, a two-item numbered list, and a closing line. Let's walk through what happens as the agent types this out, character by character, into the editor.

The first line is just text. Nothing structural going on. The only thing that matters is the \n at the end, because that's how the agent signals "I'm moving to a new line." So we need a rule for that:

Whenever the agent types \n, go to the next line.

(And if you're the kind of person who immediately wonders "what does 'go to the next line' actually mean in a structured document?"... hold that thought. We'll get there.)

So after the first line, the document looks like this:

Here is a list!
|

Nothing interesting yet. A paragraph with the cursor sitting on an empty line below it. Doing exactly what it should.

Now the agent starts typing the second line: 1. One. The 1. part is a valid input rule that users already have access to. When a user types 1. at the start of a line, the editor converts that line into a numbered list. So let's add that:

Whenever the agent types \d\. followed by a space at the start of a line, a new numbered list starts.

The rule fires, the paragraph becomes a list item, and the agent continues typing One. Then it hits \n again, which means "next line." The next line rule fires, and we get a new empty list item below. After line two, the document looks like this:

Here is a list!
1. One
2. |

That empty second list item with the cursor in it? That's fine. The agent is about to type into it. Everything still looks correct.

Now line three. The agent types 2. Two\n. And here's where things get interesting.

The agent is on that empty second list item. It starts typing 2. , and the input rule fires. The rule sees \d\. at the start of a line and does what it always does: creates a new numbered list. So now there's a list inside a list. Then the agent types Two, hits \n, and we get another empty list item. The result:

Here is a list!
1. One
2. 2. Two
   3. |

Makes sense, no? The rule did exactly what we told it to do. It saw a number followed by a dot, it started a new list. It doesn't know that the agent intended to continue the existing list. It just saw the pattern and fired.

Let's finish the example anyway. Line four: And it's done!. The agent types that into the nested list item, and we end up with:

Here is a list!
1. One
2. 2. Two
   3. And it's done!

That's not what we wanted at all. What we wanted was this:

Here is a list!
1. One
2. Two
And it's done!

A flat two-item list, and then a regular paragraph. Instead we got a nested list with the closing line trapped inside it.

So what happened? The input rules worked perfectly. Every single rule fired correctly. The problem is that we're processing the stream one piece at a time, and we have no idea what comes next. When the agent typed \n after One, we didn't know whether the next line would be another list item, a paragraph, a heading, or anything else. We just knew the agent hit enter. And when it typed 2. on the next line, the rule couldn't tell the difference between "continue the list I'm already in" and "start a brand new list." It only sees the pattern. It doesn't have the intent.
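To make the failure concrete, here is a toy replay of the walkthrough. It is a sketch under heavy simplification: each block is just a nesting depth plus text, \n behaves like a user pressing Enter (new block at the same depth), and the list rule wraps the current block one level deeper. Block and streamNaive are hypothetical names, not anything from ProseMirror.

```typescript
// Toy replay of the naive rules. Names are hypothetical, not ProseMirror's API.
// depth 0 = plain paragraph, depth n = sitting inside n nested lists.
type Block = { depth: number; text: string };

function streamNaive(completion: string): Block[] {
  const doc: Block[] = [{ depth: 0, text: "" }];
  for (const ch of completion) {
    const current = doc[doc.length - 1];
    if (ch === "\n") {
      // "Next line" behaves like Enter: a new block at the same depth.
      doc.push({ depth: current.depth, text: "" });
      continue;
    }
    current.text += ch;
    if (/^\d+\. $/.test(current.text)) {
      // The list rule only sees the pattern, so it wraps again, even inside a list.
      current.depth += 1;
      current.text = "";
    }
  }
  return doc;
}

const naive = streamNaive("Here is a list!\n1. One\n2. Two\nAnd it's done!");
// Depths come out as 0, 1, 2, 2. The intended document would be 0, 1, 1, 0.
```

Neither rule is buggy on its own. The depths go wrong because nothing in the stream tells the rule which of the two meanings of 2. the agent had in mind.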

One approach is to buffer. If the agent is inside a list and types \n, maybe we could wait and see what comes next. If the next tokens are 2. , we know to continue the list. If it's something else, we exit. Hold a few tokens, figure out the intent, act on it.

For simple lists, that actually works fine. But there's a problem with it that bothered me. Think about what happens with a deeply nested list. The agent might stream something like:

1. List
  1. Nested List
            1. A deeply nested item

That's a bunch of spaces followed by 1. . The number of spaces determines the nesting level. So now you're buffering \n, then a space, then another space, then another... and you still don't know how deep it goes. The user is staring at a cursor that isn't doing anything. For a two-item list the pause is barely noticeable, but for a complex document with nested structures the lag adds up, and the whole "real-time streaming" thing starts feeling like a loading spinner.
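One way to put a number on that lag is to count how many characters a buffering approach would have to hold after \n before it can commit to anything. The sketch below assumes a list marker has the shape "spaces, digits, dot, space"; charsHeldBeforeDecision and the two patterns are names invented for this illustration.

```typescript
// Hypothetical helper: how many characters of a line must be held back
// before we know whether it starts a list item (and at what indentation).
const VIABLE_PREFIX = /^ *(\d+\.?)?$/; // could still grow into something like "    1. "
const FULL_MARKER = /^ *\d+\. $/; // spaces, digits, dot, space

function charsHeldBeforeDecision(line: string): number {
  for (let i = 1; i <= line.length; i++) {
    const held = line.slice(0, i);
    if (FULL_MARKER.test(held)) return i; // decided: this is a list item
    if (!VIABLE_PREFIX.test(held)) return i; // decided: this is not a list item
  }
  return line.length; // the line ended while we were still undecided
}

charsHeldBeforeDecision("And it's done!"); // 1: the first letter rules out a marker
charsHeldBeforeDecision("2. Two"); // 3: the whole "2. " has to arrive first
```

A marker indented by twelve spaces holds fifteen characters before anything can be shown, and the count grows with every level of nesting.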

Now, you might say that's a reasonable tradeoff. A little buffering, a little delay, whatever. And maybe you're right for lists. But have you thought about tables?

A markdown table looks like this:

| Name  | Score |
|-------|-------|
| Alice | 42    |
| Bob   | 17    |

Every cell is delimited by |. Every row ends with |\n. When do you render? You can't wait for the whole table because the agent might stream a 50-row table and the user sees nothing for ten seconds. You can't dump raw markdown into the editor because the user is watching and they should see a real table forming, row by row, cell by cell. You need to handle each | as it arrives, create table cells on the fly, extend the table when |\n signals a new row. There's no amount of buffering that makes this comfortable. The structure is the stream.

So we went with no buffering at all. Instead, each block type defines what \n means in its own context, and we handle reconstruction after the fact.

In a list, \n means "exit the list." Every time. When 2. shows up on the next line, the input rule fires and creates a new list. Now you have two lists sitting next to each other, so you just merge them. The result is the same as if the agent had somehow "pressed Enter to continue the list." You just get there differently: exit, re-enter, merge.
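Here is a minimal sketch of exit, re-enter, merge for a flat numbered list. It is a toy model with made-up names (Block, streamExitAndMerge) and it assumes the cursor always sits at the end of the last block. In a real editor the merge would be a join of two adjacent list nodes in the document tree.

```typescript
// Toy exit-and-merge for flat numbered lists. Names are hypothetical, not ProseMirror's API.
type Paragraph = { type: "paragraph"; text: string };
type OrderedList = { type: "ordered_list"; items: string[] };
type Block = Paragraph | OrderedList;

const LIST_MARKER = /^\d+\. $/;

function streamExitAndMerge(completion: string): Block[] {
  const doc: Block[] = [{ type: "paragraph", text: "" }];
  for (const ch of completion) {
    const current = doc[doc.length - 1];
    if (ch === "\n") {
      // Exit: whether we were in a list or not, the cursor lands in a fresh paragraph.
      doc.push({ type: "paragraph", text: "" });
      continue;
    }
    if (current.type === "ordered_list") {
      current.items[current.items.length - 1] += ch; // typing into the last item
      continue;
    }
    current.text += ch;
    if (LIST_MARKER.test(current.text)) {
      // Re-enter: the paragraph holding the marker becomes a list item...
      const previous = doc[doc.length - 2];
      doc.pop();
      if (previous !== undefined && previous.type === "ordered_list") {
        previous.items.push(""); // ...and merges into the list right above it
      } else {
        doc.push({ type: "ordered_list", items: [""] });
      }
    }
  }
  return doc;
}

const merged = streamExitAndMerge("Here is a list!\n1. One\n2. Two\nAnd it's done!");
// A paragraph, one list with items "One" and "Two", and a closing paragraph.
```

The closing line needs no rule of its own. It starts in the paragraph that \n already created, no marker shows up, so it simply stays a paragraph.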

Blockquotes work the same way. In markdown every line in a blockquote starts with > . Each \n exits the blockquote, each > creates a new one, and adjacent blockquotes get merged back together.

In a code block, \n is just a newline. It's content. Nothing to exit. You stay inside until the agent types ``` , which is the actual closing signal. The markdown format tells you exactly when a code block ends, so \n gets to just be \n.
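A rough sketch of that rule, under a few assumptions of mine: a line containing only three backticks opens the block, three backticks at the start of a line inside the block close it, and there is no language tag. Block, FENCE, and streamCode are hypothetical names for this toy model.

```typescript
// Toy model of "\n is content inside code". Names are hypothetical, not ProseMirror's API.
type Block = { type: "paragraph" | "code_block"; text: string };

// Three backticks, built by concatenation so this example can sit inside a fenced block.
const FENCE = "`" + "`" + "`";

function streamCode(completion: string): Block[] {
  const doc: Block[] = [{ type: "paragraph", text: "" }];
  let skipNewline = false; // true right after a closing fence
  for (const ch of completion) {
    if (skipNewline) {
      skipNewline = false;
      if (ch === "\n") continue; // the "\n" that ends the fence line is not content
    }
    const current = doc[doc.length - 1];
    if (current.type === "code_block") {
      current.text += ch; // "\n" included: inside code it is just another character
      const closed = current.text === FENCE || current.text.slice(-4) === "\n" + FENCE;
      if (closed) {
        // Drop the fence (and the newline before it), then leave the block.
        current.text = current.text.slice(0, Math.max(0, current.text.length - 4));
        doc.push({ type: "paragraph", text: "" });
        skipNewline = true;
      }
      continue;
    }
    if (ch === "\n") {
      if (current.text === FENCE) {
        current.type = "code_block"; // the opening fence line is complete
        current.text = "";
      } else {
        doc.push({ type: "paragraph", text: "" });
      }
      continue;
    }
    current.text += ch;
  }
  return doc;
}
```

Blank lines inside the code survive untouched, because nothing in this branch treats \n as structure. In this sketch the first two backticks of the closing fence are briefly part of the code text until the third one arrives.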

Tables are their own thing entirely. | means "create a new cell" (or if the content before it is already a table, extend it). |\n means "this row is done, exit the table." And if the next line starts with |, a new table row is created and merged with the table above. Same exit-and-merge idea as lists, just with | as the structural marker instead of 1. .
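The same idea as a toy sketch for tables, with a few assumptions that are mine: every row ends with |\n as described above, cell text is trimmed when the row closes, and the |-------| separator row is dropped because it carries no content. Block and streamTable are hypothetical names.

```typescript
// Toy exit-and-merge for tables. Names are hypothetical, not ProseMirror's API.
type Paragraph = { type: "paragraph"; text: string };
type Table = { type: "table"; rows: string[][] };
type Block = Paragraph | Table;

const SEPARATOR_CELL = /^:?-+:?$/; // cells of the "|-------|" row under the header

function streamTable(completion: string): Block[] {
  const doc: Block[] = [{ type: "paragraph", text: "" }];
  for (const ch of completion) {
    const current = doc[doc.length - 1];
    if (current.type === "table") {
      const row = current.rows[current.rows.length - 1];
      if (ch === "|") {
        row.push(""); // every "|" opens the next cell right away
      } else if (ch === "\n") {
        row.pop(); // the cell opened by the trailing "|" was never used
        for (let i = 0; i < row.length; i++) row[i] = row[i].trim();
        if (row.every((cell) => SEPARATOR_CELL.test(cell))) current.rows.pop();
        doc.push({ type: "paragraph", text: "" }); // row done: exit the table
      } else {
        row[row.length - 1] += ch;
      }
      continue;
    }
    if (ch === "\n") {
      doc.push({ type: "paragraph", text: "" });
      continue;
    }
    if (ch === "|" && current.text === "") {
      // Re-enter: a "|" at the start of a line creates a row...
      const previous = doc[doc.length - 2];
      doc.pop();
      if (previous !== undefined && previous.type === "table") {
        previous.rows.push([""]); // ...and merges it into the table right above
      } else {
        doc.push({ type: "table", rows: [[""]] });
      }
      continue;
    }
    current.text += ch;
  }
  return doc;
}
```

At every point in the stream the document holds a real table with however many cells have arrived so far, which is the "row by row, cell by cell" behavior the reader is supposed to see.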

The pattern that falls out of all of this is: you don't buffer, you don't predict. Each block type knows what its tokens mean in context, and when something exits, the next input rule handles re-entry and merges things back together.

What I find interesting about all of this is that the agent never needed special handling. Not really. The whole time, the problem looked like "how do we interpret what the agent is trying to do," but the answer was already sitting in the format it was writing. 2. after a list means "continue the list." > means "I'm quoting." ``` means "I'm done with code." |\n means "this row is done." The intent was never missing. It was always there, encoded in the markdown, character by character.

And once you see it that way, the agent stops being a special system that needs its own parsing pipeline, its own validation, its own injection path into the document. It's just someone typing. It writes into the same editor, through the same collaborative layer, triggering the same input rules as any other person on the document. No privileged path. No intermediate format. A peer.

The funny thing is that this constraint (refusing to give the agent special treatment) is what pushed us to build the exit-and-merge pattern in the first place. And that pattern turned out to be more resilient than anything we would've built with a dedicated parsing step. When you can't give the agent special tools, you have to make the editor smart enough to handle whatever any collaborator might type. That's just a better editor.

Now, this all works because markdown is a well-known format with clear conventions. But what happens when the agent needs to create something markdown doesn't have? Like an Application, a live, embedded, interactive thing inside the document. You can't show a partially built, broken application to the user while the agent is still streaming it. The input rules need to understand when something starts, when it's still being built, and when it's ready to render. Turns out the same idea applies: you match HTML-like tags to create and exit operations, and you handle the intermediate state. But that's a whole other post.