
I have a daily YouTube series where I build in public and share what I am learning (I am on day 120!).
Lately a lot of that work has been going into PailFlow, the internal agent I use at Lunch Pail Labs to help with project work. I wanted to see if she could help make her own build-in-public videos: write the script, generate the visuals, narrate the story, and show up on screen as a little character.
I had a few constraints to refine the approach:
- I wanted Pai to feel like the same character every time, not a new blue robot in every video.
- I wanted the video to come from a few reusable ingredients, ideally assembled in code.
- I wanted the workflow to be simple enough that PailFlow could eventually run most of it without me babysitting every step.
Basically, I wanted to give her a system and get back a tiny video factory.
The first part was the character animation.
Creating the character
I started with the PailFlow bot icon from Slack and asked ChatGPT Images to make it feel more like a character for a YouTube video:

Can we make a version of this but in a more startup office still for a YouTube so not busy background the environment should feel real but the animation itself is claymated
That got me to the right neighborhood: blue, soft, toy-like, seated at a desk, with a cozy office background.
Breaking the character into parts
My first instinct was to generate complete animation frames: same 1080x1920 canvas, same camera angle, same lighting, no object movement, and a loopable talking cycle. The images looked fine one by one. Then I put them in motion and immediately saw the problem. The head shifted, the face shape changed, the background wobbled, and every frame switch made Pai jump around in a way I could not unsee.
So I stopped asking for finished frames and asked for parts instead.
I wanted the background, body, head, eyes, and mouth shapes as separate images on the exact same 1080x1920 canvas. If every layer stacked at x=0, y=0, Remotion would not have to do anything clever. The background could stay still. The body could stay still. Only the eyes and mouth had to change.
Here is the cleaned-up prompt I used:
Create a reusable 2D puppet rig asset pack for Pai, a friendly PailFlow mascot based on the attached image.
Generate 10 separate images. Do not combine the assets into one image.
Every image must follow these rules:
- 1080x1920 vertical 9:16 canvas
- full-canvas PNG export
- transparent background for every image except background.png
- do not crop any layer to the visible artwork
- every layer must use the exact same 1080x1920 canvas
- every layer must align perfectly when stacked at x=0, y=0 in Figma or code
- keep the character scale, camera angle, lighting, and position identical across all character layers
- front-facing character
- no text or labels inside the images
- no extra duplicate parts on any layer
- do not change the art style between layers
Character style:
- friendly blue AI assistant mascot for PailFlow
- soft claymation illustrated style
- premium SaaS brand feel
- rounded shapes
- clean, modern, expressive, warm
- cute desktop assistant mascot energy
- simple enough to animate by swapping eye and mouth layers
Generate exactly these 10 images:
1. background.png: cozy office/desk background, no character, no text
2. body.png: shoulders, torso, arms, hands, desk interaction, and neck connector; no head, eyes, or mouth
3. head-shell.png: head shell only, no eyes or mouth
4. eyes-open.png: eyes open only
5. eyes-half-blink.png: half-blink eyes only
6. mouth-closed-smile.png: closed smile mouth only
7. mouth-flat-rest.png: neutral resting mouth only
8. mouth-small-open.png: small speaking mouth only
9. mouth-medium-open.png: medium speaking mouth only
10. mouth-wide-open.png: wide speaking mouth only
Quality check:
The images must stack perfectly in this order:
background.png
body.png
head-shell.png
eyes-open.png
mouth-closed-smile.png
When stacked, the result should look like one complete Pai character sitting naturally at the desk.
Alignment matters more than creativity.
This is where the workflow started to make sense. Before, I had a folder of nice-looking images that fell apart the second they became a sequence. After, I had a little puppet. Remotion could swap the mouth and eyes without regenerating the whole character every time.
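To make the puppet idea concrete, here is a minimal sketch of the layer stack as data. The file names come from the prompt above; the type names and the `layerStack` function are hypothetical, not taken from the actual project. It only shows that swapping the eyes or mouth means changing two file names while everything else stays put.

```typescript
// Hypothetical names: only the file names come from the prompt above.
type EyeState = "open" | "half-blink";
type MouthState =
  | "closed-smile"
  | "flat-rest"
  | "small-open"
  | "medium-open"
  | "wide-open";

// Every layer is a full-canvas image, so each one is drawn at x=0, y=0.
const CANVAS = { width: 1080, height: 1920 } as const;

// Returns the files to draw, bottom to top. Only the last two entries
// change between frames; background, body, and head shell never move.
function layerStack(eyes: EyeState, mouth: MouthState): string[] {
  return [
    "background.png",
    "body.png",
    "head-shell.png",
    `eyes-${eyes}.png`,
    `mouth-${mouth}.png`,
  ];
}
```

Calling `layerStack("open", "closed-smile")` gives the same five files, in the same order, as the quality check at the end of the prompt.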

It still was not clean. ChatGPT Images did not reliably give me true transparent PNG layers, so I had to remove backgrounds myself.
QA in Figma
Next I moved into Figma. ChatGPT Images gave me the parts, but the sizing and placement were not perfect. I layered everything on top of everything else to see what was usable and what was going to look cursed once it moved.
The boring detail that mattered most: every layer needed to stay 1080x1920. Do not crop to the object. Do not let a background removal step resize the image. Once the canvas changes, every layer needs custom positioning, and now you have a tiny alignment nightmare.
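The canvas rule can also be checked in code before anything reaches Figma. This is a sketch, not part of the original workflow: it reads the width and height from the PNG header (the IHDR chunk, which the PNG format places right after the 8-byte signature) and compares them to 1080x1920. The function names `readPngSize` and `isFullCanvas` are made up for illustration.

```typescript
// The fixed 8-byte signature that starts every PNG file.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Reads the pixel dimensions from the IHDR chunk without decoding the image.
// Layout: 8-byte signature, 4-byte chunk length, 4-byte "IHDR" tag,
// then width and height as big-endian 32-bit integers.
function readPngSize(bytes: Uint8Array): { width: number; height: number } {
  if (bytes.length < 24) throw new Error("Too short to be a PNG");
  for (let i = 0; i < PNG_SIGNATURE.length; i++) {
    if (bytes[i] !== PNG_SIGNATURE[i]) throw new Error("Not a PNG file");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // DataView reads big-endian by default, which matches the PNG format.
  return { width: view.getUint32(16), height: view.getUint32(20) };
}

// True only if the layer kept the full canvas (hypothetical helper).
function isFullCanvas(bytes: Uint8Array, width = 1080, height = 1920): boolean {
  const size = readPngSize(bytes);
  return size.width === width && size.height === height;
}
```

In Node, the bytes could come from `fs.readFileSync(path)`, since a `Buffer` is a `Uint8Array`. Running this over every layer after a background removal step would flag any file that got cropped or resized along the way.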
Figma became the QA step. I toggled through the mouth shapes and eye states by hand. The test was very simple: does mouth-wide-open feel like the same mouth opening, or does it feel like a new face just appeared on top of the old one?
Animating the puppet with Remotion
Finally, I brought the aligned layers into Remotion. The background, body, and head stay visible. The eyes switch between open and half-blink. The mouth switches between rest, small, medium, and wide while the narration plays. I tried Kokoro for the voice and timed the mouth loop against the audio.
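The switching logic can be sketched as two pure functions of the frame number. Everything here is an assumption for illustration: the loop order, the 10 shapes per second, the 4-second blink interval, and the `speaking` flag are not from the original project. In a Remotion component, the frame would come from `useCurrentFrame()` and the fps from `useVideoConfig()`.

```typescript
type Mouth = "flat-rest" | "small-open" | "medium-open" | "wide-open";
type Eyes = "open" | "half-blink";

// Assumed loop: open the mouth step by step, then close it again.
const MOUTH_LOOP: Mouth[] = [
  "flat-rest",
  "small-open",
  "medium-open",
  "wide-open",
  "medium-open",
  "small-open",
];

// Picks a mouth shape for a frame. This is a fixed loop, not real lip sync:
// it ignores what is actually being said and only cycles while speaking.
function mouthForFrame(
  frame: number,
  fps: number,
  speaking: boolean,
  shapesPerSecond = 10, // assumed rate
): Mouth {
  if (!speaking) return "flat-rest";
  const framesPerShape = Math.max(1, Math.round(fps / shapesPerSecond));
  const step = Math.floor(frame / framesPerShape);
  return MOUTH_LOOP[step % MOUTH_LOOP.length];
}

// Shows the half-blink layer for the last few frames of every blink period.
function eyesForFrame(
  frame: number,
  fps: number,
  blinkEverySeconds = 4, // assumed interval
  blinkFrames = 3, // assumed blink length
): Eyes {
  const period = Math.max(1, Math.round(fps * blinkEverySeconds));
  return frame % period >= period - blinkFrames ? "half-blink" : "open";
}
```

Keeping these as plain functions means the timing can be tweaked and tested without rendering a video. The result of each call maps straight onto a file name such as mouth-wide-open.png or eyes-half-blink.png.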
This is not real lip sync. It is barely lip sync. But it gives the impression of talking, and that was enough for this version. I can add arm poses, a head bob, a thinking expression, or a pointing pose later. Starting there would have been a mistake.

Conclusion
The result is still a little uncanny. The mouth is not real lip sync. The character does not have natural body language yet. But, with this method, I can add more states: more mouth shapes, more eye positions, a hand pose, a nod, a thinking face, a pointing moment.
Once the character exists as a small system, I can make more videos from the same ingredients instead of starting over every time.