very cool, love to see some C code. #c-langgang
Synthesizing Martian Speech
Over the course of this project, I've made three separate attempts to generate synthesized speech for the martians.
Classic Speech Synthesis
Before there were neural networks and machine learnings to do all our computer talking, speech was generally synthesized by modeling the vocal cords and resonant cavities of the mouth. There's decades of research into this and probably the most well-known classic example is SAM the Software Automatic Mouth from the early 80's.
Even though I love the way SAM sounds, and just look at him, it wasn't quite what I had in mind for the martians in this game. Something less familiar would be nice for a start. Also, speech synths in this form are finely honed masterpieces of software design and who's got time for that.
First Attempt: Amateur Hour
So let's tumble down a rabbit hole and roll our own speech synth system. Start by blasting some vowels into a microphone.
Beautiful. Now as a first stab at manipulating these sounds into something more interesting, try crossfading between them to get an AHYEEYAH sound.
Whoops it doesn't work. Here's what it should sound like.
The human voice doesn't crossfade. Vowels are blended by adjusting the jaw, lips, tongue, etc. That sounds hard to simulate, surely there's a shortcut.
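For reference, a sample-level crossfade is just a per-sample weighted mix of the two clips. This is a minimal sketch with a made-up function name (`crossfade`), not the code used for the clips in this post. Halfway through the fade both vowels are playing at half volume on top of each other, which is why it sounds like two voices rather than one mouth changing shape.

```c
#include <stddef.h>

/* Linear crossfade: out[i] slides from a[i] to b[i] across n samples.
 * Hypothetical helper for illustration only. */
void crossfade(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* t runs from 0.0 at the first sample to 1.0 at the last one */
        float t = (n > 1) ? (float)i / (float)(n - 1) : 0.0f;
        out[i] = (1.0f - t) * a[i] + t * b[i];
    }
}
```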
Maybe the problem is that we're working with raw samples in the time domain. Let's switch to the frequency domain and try interpolating there. We want this to work on portable hardware so first break the audio clips down to a limited set of important frequencies.
Waveform, spectrum, and detected peaks for 'AH' clip
Same thing for the 'EE' clip
I put together some numpy/scipy python code to do this part. The result is a set of frequencies and amplitudes that can be used to reconstruct the sound with sine waves. Baby's first MP3. Maybe there's something interesting we can do once everything is driven by sine waves instead of individual samples.
Original 'AH' recording (top) and sinewave-reconstructed version (bottom)
Same thing for the 'EE' clip
Sounds basically right, if a little muffled. I can hear the martians already. Now try blending between the reconstructed clips by interpolating the sine frequencies. This should hopefully be an improvement on crossfading the samples. Waveform (top) & spectrogram (bottom):
Nope! That makes a good siren but it doesn't sound like a voice. The peak frequencies can be pretty far apart for each vowel and there's a "whipping" effect when interpolating between them. I suspect the spectrum is just too sparse and it would need a lot more sine waves to avoid sounding artificial. This rabbit hole sucks. With basically no DSP or speech theory experience this was a lost cause and after a fair bit of flailing I put the whole thing down and moved on to other non-speech stuff.
Second Attempt: Audio Clips
Almost a year later, I came back to the speech synthesis task. Having forgotten most of the first try, my plan this time was to go even simpler. Why synthesize anything at all? Just string together a bunch of audio clips and call it a day.
I recorded myself making some vocalizations, mixed up their speed, and played them back. There should be no surprise at how this sounds.
Ok. Adding alien-sounding clicks and pops is trivial. With the right source clips and offline processing, throw in a simple runtime vocoder, I could probably get some decent results.
Well, maybe. I didn't give it much chance. This technique just wasn't grabbing me. One for being too simple and two for having so few limitations. There are infinite ways to prepare and process audio clips for sequential playback. I need constraints, the more the better.
Strike two. I put it away again for another year.
Third Attempt: Surrender
This was finally the point for a more solid think about what I really wanted from the speech synthesizer. No more experimenting with carefree wonder. What should the martians sound like?
It came down to three things:
- Be voice-like.
- Be easy to control/vary.
- Be funny.
If, like me, you've ever crossed paths with The Talking Moose on a classic Mac then you know all three requirements are handily met by a traditional speech synthesizer. The Moose is based on MacinTalk which as far as I can tell is implemented similarly to SAM.
The Talking Moose. All we ever wanted from computer speech.
I trashed all my previous code and researched the details of how these synths actually work. The seminal model here is the Klatt speech synthesizer. Dennis Klatt worked out a system of cascading and parallel filters applied to the fundamental waveform + noise, and all the complicated articulations necessary to sound like human voice. Back in 1980.
Brief summary. The human vocal system has cavities that amplify & resonate the vocal cords' vibrations at certain frequency bands. These bands are called formants and they change based on the shape of the mouth, tongue, lips, palate, etc. Each vowel sound has a different set of formants.
Formants are different from the peaks I was detecting in my first try in that they're independent of the fundamental pitch of the voice, and they have a bandwidth. To get the formant frequencies and bandwidths I ditched my custom python code and switched to using Praat, which is laser focused on this exact task.
Praat-detected formants in 'AH' clip
Note that these are not marking sharp spikes as much as broad hilltops in the spectrum. It's somewhat surprising (to me) that these vowel formant frequencies don't vary much from person to person. From Synthesizing static vowels and dynamic sounds:
Measured formants for random male subject, closely matching my voice
With this formant model, I wrote C code for synthesizing vowels using a sawtooth wave passed through a series of resonating filters. I'm familiar with using these kinds of filters in hardware and DAW synths. What do they look like in actual code? Basically just a weighted sum of the current sample and previous outputs. The weights are calculated from your desired filter frequency, resonance, and bandwidth. MusicDSP.org was a great resource when working on this.
After days of stumbling and tweaking, the ungarnished result sounds fairly standard.
Pretty much exactly like SAM and glad for it. Nothing like a desperate third attempt to slide the goalposts right up. The AH-EE-AH blend that didn't work before sounds fine now when interpolating filter frequencies:
The next step was to integrate a noise source to synthesize the 't', 'ch', 's', 'k', and other non-voiced sounds. Unlike with the vowels, these use a separate parallel filter bank. I made some good progress here before realizing that (A) this part is much trickier since it requires careful modulation to sound right and (B) the end result would be better sounding human speech, which I wasn't really after.
So I stopped there and decided to make the vowels slightly more interesting, with enough variability to match the wide gamut of procedurally-generated faces.
Vowel Synth Features
The final vowel synth has parameters for overall speed, input waveform, fundamental pitch (including sub-oscillator, vibrato, and randomized LFO), and formant frequencies. A set of these parameters defines the basic sound of a voice.
Vowel synth block diagram
Along with the A, E, I, O, and U vowel formant sets, I also added M and R support. The Klatt model has special handling for these nasally sounds that integrates with the normal vowels' resonators.
One trick I found to get slightly less muffled output was to fix the two highest-frequency formants at 4kHz and 6kHz. This is essentially a HF boost and adds a noticeable crispness when running at the synth's relatively low 11kHz samplerate.
To get usable formant data, I process audio clips of me saying the vowels with Praat, then pass that to a python script that cleans up the results and writes them out as C struct data.
For controlling the synth, I designed a simple command string format:
Each aeioumr character activates a formant set and the rest are inline commands for modulation. ~ & _ enable or disable vibrato, / & | denote pauses, + & - add pitch changes in semitones. There's a default vowel token duration which can be sped up or slowed down with > & <. Pitch and speed commands can be stacked to go higher or lower.
When processing a command string, the synth first breaks it into words at / & |, then sets up envelopes to blend between pitches, volumes, speeds, and formants.
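The first pass over a command string could look something like the sketch below. This is a guess at the structure, not the game's parser: `VowelToken` and `parse_commands` are made-up names, and it assumes that `~` turns vibrato on and `_` turns it off, that `>` speeds up and `<` slows down, and that pitch and speed changes stay in effect until changed again. The envelope setup that blends between tokens is left out.

```c
#include <stddef.h>
#include <string.h>

/* One voiced token plus the modifier state in effect when it was read. */
typedef struct {
    char vowel;      /* one of a e i o u m r */
    int  semitones;  /* net '+' minus '-' so far */
    int  speed;      /* net '>' minus '<' so far */
    int  vibrato;    /* 1 if vibrato is on */
    int  word;       /* word index, bumped at each '/' or '|' */
} VowelToken;

/* Walk the command string and emit up to max_out tokens. Returns the count. */
size_t parse_commands(const char *cmd, VowelToken *out, size_t max_out)
{
    int semitones = 0, speed = 0, vibrato = 0, word = 0;
    size_t count = 0;

    for (const char *p = cmd; *p != '\0'; p++) {
        switch (*p) {
        case '~': vibrato = 1; break;
        case '_': vibrato = 0; break;
        case '+': semitones++; break;
        case '-': semitones--; break;
        case '>': speed++; break;
        case '<': speed--; break;
        case '/':
        case '|': word++; break;
        default:
            /* Anything else is either a formant set or ignored. */
            if (strchr("aeioumr", *p) != NULL && count < max_out) {
                out[count].vowel = *p;
                out[count].semitones = semitones;
                out[count].speed = speed;
                out[count].vibrato = vibrato;
                out[count].word = word;
                count++;
            }
            break;
        }
    }
    return count;
}
```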
A few more examples of "m~ee_e/i+uu|u<<+a--u" with randomized voice properties:
An interesting discovery is that creating intelligible words is still possible with this limited vowel-only synth. Quick starts and stops are almost enough for faking a few consonants. The brain fills in the rest I guess.
"ao~o/+aa-r/<++i-u-u" (How are you?)
"++<ai/-emm|++<o/-ei" (I am okay)
Even in its simplified vowel-only form, the synth still has a fair bit of complex C code. I wrote everything initially using floating point math, as a sane person would. Unfortunately the performance requirements to run at 11kHz were too much for the hardware, mostly because of the floats. Once it was working okay I refactored the inner loops to use fixed point math.
Testing on the Playdate's ARM hardware I found that int64 arithmetic is the fastest, then int32, then float. To keep the memory cache unstressed I settled on int32 with S.15.16 fixed point format. Ideally, you'd want more bits after the decimal when dealing with mostly -1.0 -> 1.0 audio samples. In this case the range needed to also cover several multiples of the sample rate so with a unified format I couldn't slide the decimal point very far left.
Rewriting the floating point math to fixed point was mostly straightforward, with special care to not overflow the resonators. A pure engineering task like this is a nice break sometimes.
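For anyone unfamiliar with the format, S.15.16 means one sign bit, 15 integer bits, and 16 fractional bits packed into an int32. A minimal sketch of the basic operations follows. The names (`fix16`, `fix_mul`, and so on) are made up, and using a 64-bit intermediate for the multiply is an assumption, not necessarily how the game's inner loops do it.

```c
#include <stdint.h>

typedef int32_t fix16;                 /* S.15.16: range about +/-32768, step 1/65536 */
#define FIX_SHIFT 16
#define FIX_ONE   ((fix16)1 << FIX_SHIFT)

static inline fix16 fix_from_float(float f) { return (fix16)(f * (float)FIX_ONE); }
static inline float fix_to_float(fix16 x)   { return (float)x / (float)FIX_ONE; }

/* Multiply two S.15.16 values. The product of two 32-bit values needs
 * 64 bits before shifting back down. Right-shifting a negative value is
 * implementation-defined in C, but is an arithmetic shift on common compilers. */
static inline fix16 fix_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * (int64_t)b) >> FIX_SHIFT);
}
```

Adds and subtracts work directly on the raw int32 values. The multiply is where overflow bites: the shifted result still has to fit in 15 integer bits, which is the kind of thing that needs watching in a resonator with high gain.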
Speaking of performance, one might ask how SAM and MacinTalk could run perfectly fine on 40-year-old computers. Based on Tyomitch's reverse engineering it seems that, besides lots of clever optimisations, they summed pre-baked formant waveforms instead of running the math-heavy filter code. Sounds a bit like my first attempt up there.
Once I was happy with the audio output the next step was to hook it into the game's visuals. Knowing I'd want the martians to eventually talk, I've been drawing three frames of animation for each mouth since the beginning:
- Lips closed
- Open "Ah" sound
- Open "Oh" sound
You'd want more frames if you had the resources to draw them. I'm settling for the bare minimum here to save some work.
A few mouths and their speech animation frames
When generating a martian face, the mouth frames are tucked away in the image atlas and can be swapped in at runtime.
Martian image atlas with mouth frames in the bottom left there
In-game, the vowel synth keeps track of which formant sets are used in a word, and the game logic can query which one is currently playing. The full voiced list of aeioumr is reduced down to aom for the mouth frame selection. Add a little shake and bob's your uncle.
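The reduction from seven formant sets to three mouth frames can be as small as one switch statement. The grouping below is a guess for illustration (the post doesn't say which vowels map to which frame), and `MouthFrame` and `mouth_frame_for` are made-up names.

```c
typedef enum { MOUTH_CLOSED, MOUTH_AH, MOUTH_OH } MouthFrame;

/* Map the currently playing formant set to one of the three drawn frames.
 * The a/e/i vs o/u/r split is an assumption, not taken from the game. */
MouthFrame mouth_frame_for(char formant_set)
{
    switch (formant_set) {
    case 'a': case 'e': case 'i':
        return MOUTH_AH;
    case 'o': case 'u': case 'r':
        return MOUTH_OH;
    case 'm':
    default:
        return MOUTH_CLOSED;   /* 'm' and silence both close the lips */
    }
}
```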
I spent too long working on all of this, really. Anyone paying for 'talking martians' would have a few questions. Luckily, it's just me using up my own energy here, and spending ages learning about things is 90% of why I make games.
STILL, wouldn't it be nice if the speech synth could be used for more than a few martian dialog bleeps and bloops? Can I take this minor side feature and expand it into something more important, papering over the embarrassingly long dev time to make it all look intentional?
Well it's hard to turn a vowel-only speech synth into a core feature in an already-weird game like this but that'll hardly stop me from trying. And hey I got an idea while implementing the speech bubble animations.
In-game speech bubble
This was my plan for the talking martians. They'd blurt out some hilarious-sounding unintelligible speech and a helpful dialog bubble would show a single glyph translation.
How is a single glyph enough? I don't know, I hadn't figured that part out. It should be possible to reduce the conversation surface enough to use emotes or other solitary symbols. I just know that I didn't want to put straight text translations in the bubble and adding some ridiculous restriction here might be fruitful.
The taste of the fruit is debatable but what I've ended up doing is adding a whole other system to the game called the BLAB-O-DEX.
A handy guide
When martians say something translatable, their dialog bubble shows only a reference number.
Speech bubble showing BLAB-O-DEX reference number
This number is an index into the BLAB-O-DEX, accessible from the Playdate's in-game system menu. Players can refer to the translations here after they've heard the phrase at least once. Unheard entries are present in the list as a blank dash.
The thought is that players will hear some of these phrases enough times to recognize either the speech itself or the reference number.
Treating dialog as a trackable collectible like this has the potential to solve several design problems I've been facing. For one, how to make the background street scene more engaging.
Little conversation going on back there
I have a few reservations about the BLAB-O-DEX at the moment so we'll see how it actually plays out. The speech synth is fine though for sure.
Thanks for these updates, can't wait for final game! I just wanted to add another vote for the idea to use Galactic Alphabet for speech bubbles. It would look more.. dunno, appropriate I guess, because the numbers are looking more like placeholder. And blab-o-dex where you decipher those symbols will look more organic in this world, IMO.
When Lukas does something, it's always interesting, original and amazing. Can't wait! :)
I love these peeks behind the curtain. And I'm also grateful that you're (I hope!) not setting the price of this game based on the millions of hours you've spent making it. :)
This series has been the most fascinating dev diary of all time, imho! I really want to get a Playdate just for this one game!
This was a really fascinating read, thank you for detailing your process! I am a sound designer by trade - I edit voices every single day but I don't know where I would even start when it comes to the synthesizing the human voice. I'm no programmer but I think you ultimately ended up in a really cool spot. It's amazing how much range you get even without consonants. Well done!
This was an excellent and fascinating read!
I love how you outlined your naive approaches to speech synthesis and where/why they didn't work.
I think I love you. Many many thanks for sharing this. I always struggled with sound implementation and this adds such interesting conclusions! I admire your hunger for new knowledge and, in fact, yours helped mine here.
Keep rocking, all I wish for now is having MAM in my Playdate.
I'm honestly speechless (ha!) thank you for sharing the process.
Just when I thought the whole concept couldn’t become any cooler, in a vintage 8-bit kind of way, now the martians babble like adorable Speak and Spells. So appropriate. You’re killing me.
Do you reckon replacing the numbers with wacky martian symbols might be cool? I like the idea of translating things, but seeing martians talk saying things like "2!" "3." "1" is a bit weird, if not funny in its own way
I tried this first, but found it really hard to recognize/remember [random-glyph] as "hello". Better if it's readable, even if unrelated, to make looking it up in the blab-o-dex easier. I may add a letter to separate conversations, so "H2" instead of "31". Still figuring it out.
You could do a bit of both and make a glyph numbering system.
Something like the Mayan numeral system, but a Marsian numeral system.
I'd consider trying it with short nonsense phrases written in the Standard Galactic Alphabet, of Keen and Minecraft fame. It was designed as a genuine font with the intention of players being able to legitimately read it, so it might work better, and it's already got a storied history of being used on mars.
You may have played it already but Chants of Sennaar is a (very good) recent game that tackled the same kinds of problems
So cool!! 👽💬