A single talking head is a solved problem. Two people having a conversation is where AI video still face-plants, and it falls in three places at once: the two faces drift apart across cuts, the lips sync to the wrong person, and the shot-reverse-shot does not match, so the eyelines miss and you get two clips stapled together instead of a scene.

Most guides tell you to fix this with a more structured prompt. That is half the answer. A shot list does beat a paragraph. But the real reason a two-hander comes apart sits upstream of the prompt: the model never gets the same face, or the same voice, twice. Every shot, you hand it a fresh description or a re-uploaded still, and "the same woman" quietly becomes a slightly different woman. Consistency you re-type is consistency you will lose.

So this is less a prompting trick and more a way of working. Stop describing your characters and start pointing at them. Build each one once, as a node on your canvas, then reference that exact node (its face, its voice, even its previous shot) by name, inside a single prompt. That one move is what makes a dialogue scene hold together, and it is also what makes it fast.

Why "the same woman" keeps drifting

A prose prompt describes a vibe. A conversation needs the same two people and the same two voices, take after take, plus a schedule for who speaks when. Prose pins none of that. "Two friends at a cafe, the camera cuts between them" leaves the model to invent the faces, the voices, the cut, and the eyeline, fresh, on every generation. That is why reverse shots never match. You asked for an improv, and you asked for it twice.

A tidy shot list helps, but it still breaks if your references are scattered across tools. You build a character in an image app, re-upload her into a video app, describe her voice in a third, and at every handoff she drifts a little. Nothing is carrying the exact asset from one shot to the next.

What carries it is a reference you can name.

Build each character as a node, once

Before any shot, make each character a node on the canvas and annotate it: a plain name, a one-line summary, the locked face, and a voice sample. Name them so you can tag them precisely, like CHAR_A and CHAR_B, and let the summary do the documenting: "CHAR_A: lead, mid-30s, navy coat, low gravel voice."

That annotation is not decoration. It is what makes the node addressable. Once a node has a name, every part of it has a handle:

@CHAR_A.image1   her locked face
@CHAR_A.audio1   her voice
@CHAR_A.summary  the one-line description you wrote

You will not memorize that syntax, and you do not have to. Type @, a menu lists every node on the canvas, you pick one, and it drops into your prompt as a clean tag. The keystrokes are not the point. The idea is: each character is now a variable you can reuse anywhere, and changing the source updates every shot that points at it.
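To make the "variable" idea concrete, here is a minimal sketch of how @ tags could resolve against named nodes. It is an illustration only, not Vilva's implementation: the canvas dictionary, the asset paths, CHAR_B's summary, and the resolve function are all hypothetical. Only the tag shape (@CHAR_A.image1) comes from the text above.

```python
import re

# Hypothetical stand-in for a canvas: node name -> named slots.
# The paths and CHAR_B's summary are placeholders, not real assets.
canvas = {
    "CHAR_A": {
        "summary": "lead, mid-30s, navy coat, low gravel voice",
        "image1": "assets/char_a_face.png",
        "audio1": "assets/char_a_voice.wav",
    },
    "CHAR_B": {
        "summary": "placeholder summary for the second character",
        "image1": "assets/char_b_face.png",
        "audio1": "assets/char_b_voice.wav",
    },
}

# Matches tags of the form @Node.slot, e.g. @CHAR_A.image1
TAG = re.compile(r"@(\w+)\.(\w+)")


def resolve(prompt, nodes):
    """Return (tag, asset) pairs for every @Node.slot tag, in prompt order."""
    resolved = []
    for node, slot in TAG.findall(prompt):
        if node not in nodes or slot not in nodes[node]:
            raise KeyError(f"unknown reference @{node}.{slot}")
        resolved.append((f"@{node}.{slot}", nodes[node][slot]))
    return resolved


shot1 = 'medium close on @CHAR_A.image1, she says "you actually did it?" in @CHAR_A.audio1'
refs = resolve(shot1, canvas)

# Change the source once; the prompt text itself is untouched.
canvas["CHAR_A"]["image1"] = "assets/char_a_face_v2.png"
refs_after = resolve(shot1, canvas)
```

The prompt never changed between the two calls. Only the node did, and the second resolution picks up the new face, which is the behavior the "variable" framing describes.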

Build two nodes. That is your cast. You will never describe these people again.

Write each shot as one prompt that points at the cast

Now write the shot list, except every "who" is a reference instead of a sentence. Each shot pins the face and the voice in a single line:

Shot 1  medium close on @CHAR_A.image1, she says "you actually did it?"
        in @CHAR_A.audio1, slow push in, warm window light

Shot 2  reverse medium on @CHAR_B.image1, he says "told you i would"
        in @CHAR_B.audio1, static, same light

Shot 3  two-shot, @CHAR_A.image1 seated left and @CHAR_B.image1 seated right,
        a beat of silence, slow pull out

Read Shot 3 again. Two faces, both pinned, in one prompt. That two-shot is exactly where dialogue scenes usually fall apart, because the model has to hold both identities at once. Here it is not holding a description of two people; it is holding the two actual faces you already locked. One prompt, both references, no re-upload.

This is the "one easy prompt" doing the heavy lifting. The framing and the camera move are yours to direct. The identities are pinned, so the model spends its effort on the shot instead of re-guessing who these people are.

Speaker mapping is just referencing the right voice

The classic two-hander failure is the wrong mouth moving. You fix it by referencing, per line, the speaker's own voice node. Shot 1 carries @CHAR_A.audio1, Shot 2 carries @CHAR_B.audio1. The line and the voice travel together, so the lip movement lands on the person actually talking and the other face stays in reaction.

Do not leave the model to infer the speaker from context. In a two-shot it will guess, and a line coming out of the listener's mouth is the most uncanny tell in the scene. The reference is the instruction: whoever's .audio1 you attached is who speaks. That is audio referencing earning its keep: the voice is pinned to a person, not picked by vibe.
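If you keep your shot list as text, the speaker rule is easy to check before you generate anything. The sketch below is a hypothetical pre-flight check, not a Vilva feature: it assumes spoken lines are wrapped in double quotes and that slots are named image1, audio1 and so on, as in the shots above.

```python
import re

TAG = re.compile(r"@(\w+)\.(\w+)")


def check_speaker(prompt):
    """Return a list of speaker-mapping problems found in one shot prompt."""
    tags = TAG.findall(prompt)
    voices = [node for node, slot in tags if slot.startswith("audio")]
    faces = {node for node, slot in tags if slot.startswith("image")}
    has_line = '"' in prompt  # assumption: dialogue is written in double quotes

    problems = []
    if has_line and not voices:
        problems.append("spoken line with no voice reference")
    if len(set(voices)) > 1:
        problems.append("more than one voice attached to a single line")
    for node in sorted(set(voices)):
        if node not in faces:
            problems.append(f"voice of {node} attached but {node}'s face is not pinned")
    return problems


good = 'medium close on @CHAR_A.image1, she says "you actually did it?" in @CHAR_A.audio1'
unvoiced = ('two-shot, @CHAR_A.image1 seated left and @CHAR_B.image1 seated right, '
            'she says "you actually did it?"')
mismatched = 'reverse medium on @CHAR_B.image1, he says "told you i would" in @CHAR_A.audio1'

for shot in (good, unvoiced, mismatched):
    print(check_speaker(shot))
```

An empty list means the line and the voice travel together. Anything else names the exact reference to fix.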

Reverse shots and continuity, by reference

Two more things naming buys you.

Eyelines. Shot-reverse-shot only reads if the geometry agrees. If CHAR_A looks camera-right, CHAR_B has to look camera-left, or they are both addressing the wall. Because you reference the same locked face in every single, the only variable left is direction. So you state it once per character, for example "CHAR_A looks camera-right, CHAR_B looks camera-left" to match the two-shot where CHAR_A sits on the left and CHAR_B on the right, and it stays put, instead of fighting a face that also changed between cuts.
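The eyeline rule reduces to two conditions: each character keeps one look direction across all of their singles, and the two characters look in opposite directions. A minimal sketch of that check follows. The function name and the (character, direction) input format are assumptions for illustration; nothing here is part of the canvas itself.

```python
OPPOSITE = {"camera-left": "camera-right", "camera-right": "camera-left"}


def check_eyelines(singles):
    """singles: (character, direction) pairs, one per single shot, in order.

    Returns a list of problems; an empty list means the eyelines connect.
    """
    stated = {}
    problems = []
    for character, direction in singles:
        if direction not in OPPOSITE:
            raise ValueError("direction must be 'camera-left' or 'camera-right'")
        previous = stated.setdefault(character, direction)
        if previous != direction:
            problems.append(f"{character} flips from {previous} to {direction}")

    if len(stated) == 2:
        (a, look_a), (b, look_b) = stated.items()
        if OPPOSITE[look_a] != look_b:
            problems.append(f"{a} and {b} both look {look_a}")
    return problems


# Both characters addressing the wall:
print(check_eyelines([("CHAR_A", "camera-left"), ("CHAR_B", "camera-left")]))

# One character changing direction between cuts:
print(check_eyelines([
    ("CHAR_A", "camera-left"),
    ("CHAR_B", "camera-right"),
    ("CHAR_A", "camera-right"),
]))
```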

Continuity across the cut. When one shot has to flow out of the last, do not re-describe the room. Reference the clip. Pull the previous shot's video in and anchor to its final frame:

Shot 4  continue from the last frame of @Shot3.video1,
        @CHAR_A.image1 glances down, holds the beat

Notice Shot 4 carries both: @Shot3.video1 for the room and the light, and @CHAR_A.image1 to re-pin her face to the source. That second tag matters. You take continuity from the clip, but you take identity from the node, never from the outgoing frame, because frames accumulate drift and the node does not. Continuity from the video, the face from the source. That is image and video referencing working in the same line.
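The "continuity from the clip, identity from the node" rule can be written down as a small helper that always emits both tags, plus a check that flags a prompt leaning on a clip alone. This is a sketch under assumptions: continue_shot and takes_identity_from_node are hypothetical names, and the prompt wording simply mirrors Shot 4.

```python
import re

TAG = re.compile(r"@(\w+)\.(\w+)")


def continue_shot(previous_shot, character, action):
    """Compose a follow-on prompt: room and light from the clip, face from the node."""
    return (
        f"continue from the last frame of @{previous_shot}.video1, "
        f"@{character}.image1 {action}"
    )


def takes_identity_from_node(prompt):
    """False only when the prompt references a clip without re-pinning any face."""
    slots = [slot for _, slot in TAG.findall(prompt)]
    uses_clip = any(slot.startswith("video") for slot in slots)
    pins_face = any(slot.startswith("image") for slot in slots)
    return pins_face or not uses_clip


shot4 = continue_shot("Shot3", "CHAR_A", "glances down, holds the beat")
drifting = "continue from the last frame of @Shot3.video1, she glances down"

print(shot4)
print(takes_identity_from_node(shot4), takes_identity_from_node(drifting))
```

Building the prompt through a helper like this makes the second tag hard to forget, which is the whole point of re-pinning.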

Why this is faster, more accurate, and higher quality

Those are three different wins, so worth separating.

Faster. You build the cast once and every shot becomes a one-liner. No re-uploading a face into each tool, no re-typing a voice description, no dragging reference wires across the canvas for every shot. A new scene with the same cast is mostly just a new shot list, because the references already exist.

More accurate. The model gets the exact pixels and the exact voice, not its interpretation of a sentence. "The same woman" stops being a hope and becomes a pin. Lips sync to the right person because you attached the right voice, not because the model guessed well. And if you reference the same face twice in one prompt (once on its own, once inside the two-shot), Vilva counts it once, so you never muddy a generation by double-feeding it.
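How Vilva collapses repeated references internally is not described here, but the idea of counting a repeated tag once is simple to sketch. The following assumes tags are plain @Node.slot strings and keeps the first occurrence of each, in order; unique_references is a hypothetical name.

```python
import re

TAG = re.compile(r"@\w+\.\w+")


def unique_references(prompt):
    """Return each distinct @Node.slot tag once, in first-seen order."""
    # dict.fromkeys preserves insertion order and drops later duplicates.
    return list(dict.fromkeys(TAG.findall(prompt)))


prompt = (
    "medium close on @CHAR_A.image1, then a two-shot with "
    "@CHAR_A.image1 seated left and @CHAR_B.image1 seated right"
)
print(unique_references(prompt))
```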

Higher quality, as a result. When identity, voice, and eyeline are all pinned and consistent, the cut stops reading as clips and starts reading as a scene. The quality is not a prettier single render; it is the through-line surviving the cut. Holding that through-line is the entire job of a dialogue scene, and it is exactly what a re-typed reference loses.

Check the scene as a scene, not as clips

Watch it end to end and judge the conversation, not the individual shots. One pass, three checks: the faces are the same people throughout, every line comes from the right mouth, and the eyelines connect across the cuts. Each one maps to a reference (identity to the image node, speaker to the audio node, continuity to the video node), so when something is off, you know which reference to fix instead of which of fifty prompt words to reword.

Then check the rhythm. Real conversation has beats: a pause before a reply, a reaction held a moment. If every shot is the same length and every reply is instant, it reads like lines, not people. Vary the durations to give it air. Pacing is the one thing referencing will not do for you. That part is still your call as the director.
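A rough way to spot flat pacing before you watch the cut is to compare shot lengths. The sketch below is only a heuristic: the durations are made-up values in seconds, and the 0.25 second tolerance is an arbitrary assumption, not a recommended setting.

```python
def pacing_is_flat(durations, tolerance=0.25):
    """True when every shot runs about the same length (values in seconds)."""
    if len(durations) < 2:
        return False  # a single shot has no rhythm to judge
    return max(durations) - min(durations) <= tolerance


even_cut = [4.0, 4.0, 4.0, 4.0]    # hypothetical: every shot the same length
varied_cut = [3.0, 2.0, 5.5, 4.0]  # hypothetical: a held reaction and a quick reply

print(pacing_is_flat(even_cut), pacing_is_flat(varied_cut))
```

A check like this only tells you the lengths are uniform. Whether a given pause feels right is still a judgment made by watching the scene.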

Where this runs

All of it (the character nodes, the @ mentions, and the image, audio, and video tags) lives on one Vilva canvas, which is the whole point. The reason a dialogue scene drifts is that the pieces normally live in several separate tools and nothing carries the exact asset between them. When the cast, the voices, and every shot are nodes on the same board, a reference is just a name, and the scene holds because it never left the canvas. Free to try at vilva.ai.

Troubleshooting

The reverse shot is a different person. Identity leaked in from a previous frame, not the node. Reference @CHAR.image1 in every shot.
The wrong character's mouth moves. You did not attach the speaker's voice. Carry @CHAR.audio1 with the line.
They seem to look past each other. Eyelines are not opposed. Set explicit left/right directions per character. The face is already pinned, so direction is the only thing left to fix.
The cut jumps. No continuity handoff. Pull @PrevShot.video1 for the room and re-pin the face with @CHAR.image1.
It feels like read lines, not talk. Pacing is too even. Vary shot durations and add reaction beats.

Next step: once a two-hander works, the same nodes-plus-references method scales to three or more characters. The cast is reusable, so a new scene with the same people is mostly a new shot list.

The takeaway

A two-character conversation is not one prompt of prose; it is a small cast of named nodes and a shot list that points at them. Build each character once, reference the face, the voice, and the last frame by name, and the model stops improvising the things that have to stay the same.

The references are the direction. Hand the model the exact pieces, in one prompt, and it will shoot you a conversation instead of two clips that happen to share a room.