Removing 'um' from a recording is harder than it sounds

(doug.sh)

144 points

by dougcalobrisi17 hours ago |

by alyssamazz3 hours ago|

Doug is a friend, but I actually use this so figured I’d chime in.

I make online course content and used to lose close to a full day cutting filler out of every hour or so of recording. This gets me maybe 70% of that time back. On whether you should even cut them, I don’t think it’s clear cut. With non-native English speakers especially, the um is usually a real pause before they say something that matters, and cutting it makes them choppy or changes what they meant. Most of the time though it’s just padding. That matters more for courses than it sounds like it should, because a common complaint I get is how long courses are, so any dead air I can pull out is time I give back to people.

Anyway this is in my workflow now. Still messing with the settings to get it right, but I like to mess with my stack and this focuses on this step for me.

by wzdd14 hours ago|

It’s a nice engineering approach, but I’m interested in the motivation. Um and ah are distracting in a transcript, where you can naturally pause to take in information; in speech however it can serve as a focusing point to indicate the next part is important. See https://medium.com/better-humans/dont-worry-about-saying-um-... for example. The weirdly obsessive zeal that orgs like Toastmasters have about eliminating them is weird.

Disfluencies aren’t necessarily bad even if the word starts with “dis”!

by toast012 hours ago|

Having heard radio interviews with and without 'internal editing' to remove ums and ahs, most of the time I'd rather the edited version. It's more concise and focused, and I find it easier to comprehend. Too many ums and ahs and my mind wanders, and if it's radio, I can't easily go back to try again. When I've listened to podcasts or audiobooks, I could never easily go back a little to try again either, and I gave up on them (even though I have some content I really want to listen to, it's too frustrating, so it's not happening). But I'm sure other people have different preferences.

I also don't care for writing that could have been made a lot more concise. It's a lot of work to make things shorter, but I think it's worthwhile.

by venzaspa9 hours ago|

It just goes to show that people have very different views. I think when I hear people thinking out loud (ums and ahs) it's a marker that they are actually engaging with the question, thinking through an answer and not bullshitting without thinking.

by td67 hours ago|

I agree with you, when it's in person. I think what you're describing is mostly the beginning of an answer.

Just random "um"s in between because you're struggling to build sentences can get annoying both in person and online.

by inopinatus4 hours ago|

Just sit there in silence whilst you cogitate.

by gegtik4 hours ago|

this is the move

by macintux3 hours ago|

Space fillers are sadly important for group settings where you need to finish a thought before someone interjects.

But hearing them from an interviewee drives me crazy, along with "sort of", "kind of", etc. I once counted all of the "sorta"s in an NPR interview, it was brutal.

by doubled1125 hours ago|

"Ummm, I think I agree with this description" vs "I, think, umm, I agree with, umm, this description"

The first one indicates something along the lines of "thinking, please stand by". The second one is a struggle.

by bluebarbet8 hours ago|

The most popular academic theory (IIRC) is that "um" and "uh" are conversational placeholders that say, "don't talk, I'm not finished speaking yet". Which obviously serves no purpose in a monologue.

To me they just indicate lack of confidence on the part of the speaker.

by skrebbel8 hours ago|

There's a correlation between speaking with confidence and bullshitting / corner cutting. Hard, nuanced questions require more thinking time to produce a nuanced answer. But a bullshitter will just confidently answer subtly wrong stuff. But they won't say "uh"! Is that really better?

by bluebarbet6 hours ago|

Sure, that figures. Much of this is surely subjective.

by NooneAtAll310 hours ago|

> in speech however it can serve as a focusing point to indicate the next part is important

it's... exact opposite?

the main (attempted) use for ummms is to keep continuation of speech despite the pause. And the main complaint is exactly that it ruins the focus and doesn't give respite

by RobotToaster7 hours ago|

It can be a focusing point when someone wants to highlight the deliberate use of euphemism, removing those would be, um, unwise.

Although that is probably the less common use.

by latexr7 hours ago|

I think you’re both right. But you’re right regarding writing and your parent comment is right regarding speech.

by bongoman422 hours ago|

A part of saying something like um is to continue your speech and prevent the other person or someone else in the group from interjecting.

by goalieca6 hours ago|

The younger generation seems to love listening at 1.2x or faster. I think it’s a preference for a fast information dopamine hit. I may argue it’s even a shallow approach that prefers against pausing and time for careful reflection. Meanwhile, book reading is at an all time low seemingly because no one has a preference or patience for careful study and reflection.

by ordu10 minutes ago|

> The younger generation seems to love listening at 1.2x or faster.

I do not belong to the younger generation. I used to refuse to watch videos because it takes too long compared with reading. But now I'm watching them at 2x. You can watch a 40 min video in 20 minutes. I'd like to compress it further to 10 min or so, but 3x is a paid option on YouTube and I'm not sure I could digest English (which is a foreign language to me) at 3x.

> Meanwhile, book reading is at an all time low seemingly because no one has a preference or patience for careful study and reflection.

Oh, I read books too. But the content is different. You can't read some books at 2x. You can't listen to them at such a speed. In any book I think there are stretches of text you can consume at any speed, but sometimes you hit densely packed information you need to think through. It happens with videos too. Like, try to watch Veritasium at 2x, you'll be forced to slow things down at least sometimes, because to get the message you need to learn how to think at 2x speed too, not just to listen.

In any case most videos dilute their message over tens of minutes and you can speed things up and have plenty of time to think things through while watching.

by ralferoo5 hours ago|

I'm not in the younger generation, but I listen to most of youtube (apart from songs and comedy) at 2x speed, and wish it could be even faster most of the time (that's a feature of premium, but I'm not paying for that).

The problem is that people are producing longer videos because that earns them more advertising revenue. Many creators now speak so mind-numbingly slowly, that even at 2x speed it feels like it's about a normal presentation speed.

In almost all cases, even at 2x speed, it would be quicker to just read a transcript (if that was available). The problem is really that people are incentivised to make everything into at least a 10 minute youtube video, when a short blog post that could have taken only a minute to read would have been sufficient to convey all the same information, and probably more useful as you could easily refer back to specific sections if you wanted.

by yummybrainz42 minutes ago|

FYI NewPipe allows up to 4x playback; PipePipe up to 10x! And both block ads, while PipePipe also integrates Sponsorblock.

by landl0rd5 hours ago|

Podcasts and other media to which people often listen at faster speeds aren't produced with the professional fluency of a news broadcast from the fifties. The bitrate of information is relatively low. Of course many speed them up.

The democratization of media created a lot of folks who've no idea how to disseminate information in a structured format and at an optimal rate.

by red-iron-pine3 hours ago|

i'm not a gen z but I routinely do that. a habit picked up from grad school work and having to assimilate several frameworks and techniques quickly.

arguably clickbait is the reason: i'm not here to listen to the video or all of the other fluff, i'm here to get the point as quickly as possible. it's a 'meeting could have been an email' sort of thing where lots of videos could really just be several bulletpoints.

AI YouTube summarizers are great in that regard.

by burkaman5 hours ago|

I listen to podcasts and videos at 2x speed or faster, I can still understand everything and it brings listening time about equal to what my reading time would be if I were reading an article or transcript. Average reading speed is generally about twice as fast as average speaking speed, and in produced media people tend to speak even slower. I realize it sounds insane to hear 2x speed audio if you aren't used to it, but I promise if you were to ramp up the speed over a couple weeks or so, you would have absolutely no trouble with it. There's no need to if you don't want to, I'm just saying that your first impression is not giving you an accurate experience of what it's actually like.

For audiobooks I usually want to have time to hear and process every word, so I still speed it up but usually more like 1.5x, it depends on the narrator and the book. For podcasts I'm not there to appreciate the prose, so I go as fast as I can while still understanding them. I don't think it's about dopamine, I just find I don't gain anything by getting the same amount of information slower.

by dyauspitr4 hours ago|

That reminds me of the blind Microsoft developer that uses a screen reader at a very high speed to code

https://youtu.be/wKISPePFrIs?is=K3nKVrpH-vOSem54

by tech_hutch3 hours ago|

In my limited experience, it seems a high reading speed is common among users of screen readers.

by siriaan13 hours ago|

Occasional ums and ahs are fine but when every other phrase starts with a long aaaaah it can be pretty unpleasant to listen to.

by sans_souse13 hours ago|

So, if this project's source Audio were Beavis and Butthead, you would be enthused?

by amelius10 hours ago|

As with all things ... Don't be opinionated and make it an option for the user.

by mrob11 hours ago|

>The weirdly obsessive zeal that orgs like Toastmasters have about eliminating them is weird.

If you speak with disfluencies, you probably didn't sufficiently rehearse your speech. If you didn't rehearse enough, you probably didn't put much effort into writing it either, so why should I put much effort into listening? It's the same principle as AI slop.

by kaashif9 hours ago|

Not necessarily true, more rehearsal isn't the key to fluent oratory.

Many people can speak off the cuff fluently and confidently, avoiding "like", "um", and other filler words. And even if you're not speaking fluently, leaving silences as punctuation is more effective, IMO.

Many impressive speakers I've met actually cite Toastmasters! So their obsessive zeal actually does work.

More rehearsal does work too sometimes, but it does sometimes lead to speeches "sounding too rehearsed".

by cubefox8 hours ago|

> Many people can speak off the cuff fluently and confidently, avoiding "like", "um", and other filler words.

I don't think that's true, we usually just don't notice filler words in the same way we are surprised that people usually don't even talk in whole sentences, in contrast to written text or movies (which also use written text).

by heroprotagonist15 hours ago|

Not to promote something, but Wispr Flow does that for me automatically if I trigger a setting for it..

While it's a commercial product with a subscription, I spent a long time on the free tier not even hitting their limits until I started using it so extensively that I wanted to pay for it.

And I've used Whisper in the past, mostly for tinkering. I tried it for a couple of use cases but haven't touched the base project in a while. But I do regularly use Faster-Whisper-XXL, an open source project based on Whisper, for subtitle generation.

Though, for subtitle generation, I decided to support the project and mainly use the non-public build of Faster-Whisper-XXL Pro built for donators to the open source project.

The extra features smooth out the subtitle editing process very substantially. Toss in "--roformer_overlap 0.125 --roformer_vram 16 --best_of 15 --ff_vocal_extract mb-roformer --vad_method pyannote_v3" to the cli parameters (and sometimes --realign) and you have much less work to do in SubtitleEdit or Tero Subtitler afterwards to clean it up.

by iib9 hours ago|

Surprisingly, it's the whisper model itself that does that. I find that it's also good with false starts, often correcting something like: "uhm, we could...we can go there" to just "we can go there", if spoken rapidly enough.

by dotancohen11 hours ago|

I'd love to hear more about subtitle generation. Specifically, can you label different speakers? I'd be using this for meeting transcription. Thank you.

by 13176 hours ago|

Looks interesting, would be a nicer article though if there was a demo with before/after to show the results, and why the previous ideas didn't work

for something dealing with audio you do need to play the audio really

by supernes13 hours ago|

This approach seems kind of backwards to me. Why try to detect everything except the thing you're trying to remove instead of either sampling a few uhs and ums and treating them as noise to be silenced (with a sharp crossfade to the noise floor that doesn't interrupt speech flow) or finetuning a model to detect them specifically for full automation?

by pdpi6 hours ago|

> instead of either sampling a few uhs and ums and treating them as noise to be silenced

If you're not paying ttention, ctting out specific sounds can easily cause more trouble. I for one would be quite pset if I couldn't hear the pire's reasoning for calling a foul.

by ghaff6 hours ago|

When I was doing podcasts regularly, it made me acutely aware of various people's speech mannerisms. (Somewhat similarly, recording a lot of videos during COVID made me very aware of a variety of my own mannerisms--especially overactive hand motions.)

by chrismorgan9 hours ago|

I think the “What it won’t touch” section shows why the entire concept is unsound. Here it is with a different first sentence, and (other than the third sentence no longer matching erm’s reality) it’s perfectly coherent:

> It leaves um, uh, er and elongated versions (ummmm, uhhhhh) alone. Those sound like fillers but they’re doing real work in the sentence, and cutting them automatically would change what someone said. The rule erm follows: only remove things that are sound, not language.

> It also doesn’t touch repeated words, false starts, or long thinking pauses. Those aren’t noise on top of the speech; they are the speech, just messier than the speaker would like. Cleaning them up is an editorial decision about which take to keep, and erm doesn’t have an opinion about that.

Think about it. Cleaning these things-that-can-be-just-sounds-but-can-also-very-much-be-load-bearing up is an editorial decision. At the very least, you need to judge based on the surrounding content whether the removal of an um would change the meaning at all; and I don’t think text alone is adequate for that.

by thaumasiotes9 hours ago|

>> It leaves um, uh, er and elongated versions (ummmm, uhhhhh) alone.

Something's already gone wrong here. Uh and er refer to the same sound. Uh is the American spelling. Er is British; to them a following "r" like that is just a kind of vowel.

by Silamoth31 minutes ago|

Regardless of American vs. British spellings, those are not the same sound. Some British people may pronounce them the same. Americans definitely pronounce them differently, though. For instance, the word “water” has a hard “r” sound at the end; Americans don’t pronounce it “watuh” like some British people do.

by chrismorgan9 hours ago|

Um… no. Quite different vowel sounds.

(Also, in case it wasn’t clear: I was quoting from the start of the article in that sentence.)

by thaumasiotes8 hours ago|

They're quite different vowel sounds in the same sense that "back" and "back" use "quite different vowel sounds" when pronounced by American vs British speakers.

But not in any other sense.

> in case it wasn’t clear: I was quoting from the start of the article in that sentence.

You don't seem to be quoting from the article at all, actually. You've combined two different sentences in a way that grossly misrepresents what the article says. But that's not really relevant to the point here.

by rbbydotdev9 hours ago|

I wonder if with enough input data and transcription you could “fingerprint” where a speaker personality has habits of interjecting “ums” leading to more hardy analysis. Novel approach, but gets me thinking

by rindalir16 hours ago|

This is fascinating! I'm going to try this on a certain clip from Jurassic Park.

by boodleboodle7 hours ago|

This resonates with our crusade to eradicate Ums once and for all.

- Ums Considered Harmful: https://hamanlp.org/research/ums/

- Related paper: https://hamanlp.org/SIGBOVIK_2026.pdf

by ralferoo5 hours ago|

The title of the article is wrong. It's not that removing 'um' from a recording is hard, it's that not removing everything else in the recording while doing so is.

by dougcalobrisi1 hours ago|

You’re right. I may borrow that if I do a follow up at some point :-)

by alok-g14 hours ago|

I would love to see support for videos and removal of custom filler words (I say 'basically' and 'like' a lot and have so far failed to improve myself on this).

by dougcalobrisi1 hours ago|

It does take videos (like mp4) as input but will only output the stripped audio track.

I might add the custom filler word functionality and/or perhaps just make the filler word list configurable.

by lavaman13112 hours ago|

This is great, I've tried out automated podcast editing tools before and they cut too aggressively in my experience. What are you thinking about doing next with this now that you've gotten the alignment snapping working cleanly for 'um' and 'ah', are you thinking of expanding the tool?

by BugsJustFindMe2 hours ago|

I find the crusade against 'um' to be annoyingly misplaced. It frustrates the shit out of me that iOS speech-to-text dictation refuses to write my 'um's and 'uh's with no way to change that behavior. If a person asks to remove them, fine, but don't fucking alter my speech patterns when I'm sending messages to people.

by cadamsdotcom15 hours ago|

What an awesome tool and idea. I’d be keen to see if it can integrate with video editing tools.

Ideally it would slice the video in the timeline without actually removing anything, so you can scrub through your video and try with and without each disfluency (thank you - awesome word) & decide case by case which to keep!

by AaronAPU4 hours ago|

I accidentally learned how disgusting people’s mouth noises are while developing an audio leveler. The lip smacking and snot noises between sentences are the stuff of nightmares if you don’t do anything to exclude them from amplification.

The best approach I could come up with was to maintain a sliding histogram of loudness and exclude the low-level outliers.

You can do more in the noise/frequency domain but those were outside the scope of this tool.
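One way to read the sliding-histogram idea above, as a rough sketch rather than the commenter's actual code: keep a histogram of recent per-frame loudness and skip frames that fall in the quietest slice of it. The class name `LoudnessGate`, the 1 dB bins, the -90 dB floor, and the 10% cutoff are all assumptions made for illustration.

```python
from collections import deque


class LoudnessGate:
    """Sliding histogram of frame loudness (hypothetical helper, not from the thread).

    push() returns False for frames that land among the quietest
    `low_fraction` of recent frames, so a leveler could ignore them.
    """

    def __init__(self, window=200, bin_db=1.0, floor_db=-90.0, low_fraction=0.1):
        self.size = window
        self.bin_db = bin_db
        self.floor_db = floor_db
        self.low_fraction = low_fraction
        self.recent = deque()   # bin index of each frame still in the window
        self.hist = {}          # bin index -> number of frames in the window

    def _bin(self, db):
        # Clamp to the floor, then quantize into fixed-width dB bins.
        return int((max(db, self.floor_db) - self.floor_db) // self.bin_db)

    def push(self, db):
        b = self._bin(db)
        self.recent.append(b)
        self.hist[b] = self.hist.get(b, 0) + 1
        if len(self.recent) > self.size:
            old = self.recent.popleft()
            self.hist[old] -= 1
            if self.hist[old] == 0:
                del self.hist[old]

        # Walk the histogram from quiet to loud until more than
        # `low_fraction` of the window has been covered.
        budget = self.low_fraction * len(self.recent)
        seen = 0
        threshold_bin = 0
        for k in sorted(self.hist):
            seen += self.hist[k]
            threshold_bin = k
            if seen > budget:
                break
        # Frames in bins strictly below the threshold bin count as outliers.
        return b >= threshold_bin


gate = LoudnessGate(window=100)
speech = [gate.push(-20.0) for _ in range(95)]  # steady speech level
smack = gate.push(-60.0)                        # a single quiet mouth noise
```

This is a percentile gate, which is only one interpretation of "exclude the low-level outliers"; a real leveler would likely also need attack and release smoothing.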

by stavros4 hours ago|

Misophonia sufferers unite!

by __mharrison__5 hours ago|

Interesting. I make a bunch of video content and I went another way.

When I want to redo a section, I say it again. But, I have a magic word — "mistake" — that I insert before. Previously I transcribed and just removed the sentence (or section) before mistake.

I recently automated this and used AI to determine what to cut and to drive davinci resolve to make the edit. Saves a lot of time in my workflow.
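The earlier, non-AI version of this workflow (drop whatever came right before the magic word) can be sketched in a few lines. This is a guess at the shape of it, assuming the transcript is already split into sentences and the marker shows up as its own sentence; `drop_retakes` is a hypothetical name, and nothing here talks to DaVinci Resolve.

```python
import re


def drop_retakes(sentences, marker="mistake"):
    """Remove each sentence that is immediately followed by the marker word.

    Assumes the marker is transcribed as a standalone sentence such as
    "Mistake." (an assumption about the transcript, not a given).
    """
    pattern = re.compile(r"\W*" + re.escape(marker) + r"\W*", re.IGNORECASE)
    kept = []
    for sentence in sentences:
        if pattern.fullmatch(sentence):
            if kept:
                kept.pop()  # discard the take the speaker wants redone
            continue        # the marker itself is never kept
        kept.append(sentence)
    return kept


takes = [
    "Welcome back.",
    "Today we cover pandas.",
    "Mistake.",
    "Today we cover polars.",
]
cleaned = drop_retakes(takes)
```

Dropping a whole section rather than one sentence would need some extra signal about where the section starts, which is presumably where the AI step helps.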

by HeavyStorm9 hours ago|

What a very cool utility.

by josefritzishere4 hours ago|

I used to do this with a razor and an aluminum cutting block.

by npodbielski13 hours ago|

I think it is harder to remove those from your own speech. I have been doing that for a few months now and I still slip back into it when I am in a hurry or stressed.

by ifwinterco6 hours ago|

In my experience native English speakers are particularly bad, generally when speaking a second language people are less likely to add random filler words.

Also the type of filler word for some reason is often different between UK and US: British people tend to be "umm"-ers and Americans are more likely to add "you know" (although "umm" is also common).

Once you notice it it's impossible to ignore and many, many native English speakers are actually terrible at speaking and add filler words to the point where it's very distracting

by sciencesama15 hours ago|

there is an aah counter in Toastmasters!! this is the software that helps!!

by cryptoz15 hours ago|

Really cool stuff and definitely going to try it; I’m also finding it wild that Google put effort into adding ums and erms into their text to speech model a while back. AI puts it in, AI helps take it out.

by cyberax9 hours ago|

BTW, any recommendations for AI tools that remove the laugh track? I don't even mind the awkward acting without the missing laughter.

by fragmede5 hours ago|

...

No, you run an entire second pass LLM over the output of Whisper. "no uhhh three no four." should just output four the numeral not even f.o.u.r.

Hi, my name is fragmede. Judging by the date on my computer it's been four months since it's since I've t touched the transcription directory on computer and tried to improve on the state of wisprflow. Mines pretty good but it just doesn't... ah you can't drag me back in.

by sublinear15 hours ago|

Disfluencies are not necessarily "filler". They can convey mood or hesitation. Cutting them can change the meaning.

A trivial example is "umm... well... (sigh) okay" versus just "okay". Not okay!

by slhck6 hours ago|

> Two small fixes, in order. First, each cut endpoint is allowed to slide a tiny bit (up to 60ms) to land in the quietest spot nearby. If there’s a momentary lull in the audio just before or after the original cut point, slide there. The slide is bounded so it can’t cross into a neighboring word, otherwise you’d chew off real speech. Second, from that quiet spot, the endpoint snaps to the nearest moment when the waveform is exactly crossing zero.

Oh, Claudish striking again.
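For anyone curious what the two quoted fixes might look like in code, here is a minimal sketch, assuming mono audio as a plain list of floats. It is not erm's implementation: `refine_cut`, the 5 ms energy window, and the `lo`/`hi` neighbor-word bounds are illustrative assumptions. Only the 60 ms slide limit and the zero-crossing snap come from the quote.

```python
def refine_cut(samples, cut, sample_rate, lo, hi, max_slide_ms=60, window_ms=5):
    """Move a cut index to a quiet zero crossing near `cut`.

    lo and hi are sample indices of the neighboring word edges; the
    result never leaves [lo, hi], so real speech is not chewed off.
    """
    max_slide = int(sample_rate * max_slide_ms / 1000)
    half = max(1, int(sample_rate * window_ms / 1000) // 2)
    start = max(lo, cut - max_slide, 0)
    stop = min(hi, cut + max_slide, len(samples) - 1)

    def energy(i):
        # Mean squared amplitude in a small window centered on i.
        a = max(0, i - half)
        b = min(len(samples), i + half + 1)
        return sum(s * s for s in samples[a:b]) / (b - a)

    # Fix 1: slide to the quietest spot in the bounded range,
    # preferring the spot closest to the original cut on ties.
    quiet = min(range(start, stop + 1), key=lambda i: (energy(i), abs(i - cut)))

    # Fix 2: snap to the nearest zero crossing (a sign change between
    # neighboring samples, or a sample that is exactly zero).
    crossings = [
        i for i in range(max(start, 1), stop + 1)
        if samples[i] == 0.0 or (samples[i - 1] < 0) != (samples[i] < 0)
    ]
    if not crossings:
        return quiet
    return min(crossings, key=lambda i: abs(i - quiet))


# Synthetic signal at 1 kHz: a loud square wave with a quiet patch at 120..130.
signal = [
    (1.0 if i % 4 < 2 else -1.0) * (0.01 if 120 <= i <= 130 else 1.0)
    for i in range(200)
]
new_cut = refine_cut(signal, cut=100, sample_rate=1000, lo=0, hi=199)
```

Searching for the zero crossing inside the same bounded range is a simplification; it keeps the cut from wandering into a neighboring word after the snap.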

by Retr0id5 hours ago|

I call it claudeslop but I suppose claudish is slightly less inflammatory.

by dougcalobrisi17 hours ago|

This post is mostly about how surprisingly hard it is to cut filler words out of speech cleanly. Apparently, stripping ums isn't a find and replace type thing, because Whisper's timestamps are off by up to a few hundred ms and cutting on them chops syllables or leaves stutters. So, I built a tool, erm, that starts from Whisper's guess, finds where each word actually starts and stops in the audio, and snaps the cuts to silence so there's no click, with ffmpeg doing the splicing.

https://github.com/dougcalobrisi/erm
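Once the filler boundaries are trusted, the splice itself reduces to keeping everything between them. A small sketch of that bookkeeping, under the assumption that fillers arrive as `(start, end)` times in seconds; `keep_segments` is a hypothetical helper and not taken from the erm repository, and the ffmpeg side is left out entirely.

```python
def keep_segments(filler_spans, duration):
    """Return the (start, end) spans of audio to keep, in order.

    filler_spans: iterable of (start, end) times to cut, possibly
    overlapping or unsorted. duration: total length of the recording.
    """
    kept = []
    cursor = 0.0
    for start, end in sorted(filler_spans):
        start, end = max(0.0, start), min(duration, end)
        if start > cursor:
            kept.append((cursor, start))  # speech between two cuts
        cursor = max(cursor, end)         # overlapping cuts are merged
    if cursor < duration:
        kept.append((cursor, duration))   # tail after the last cut
    return kept


segments = keep_segments([(1.0, 1.5), (3.0, 3.2)], duration=5.0)
```

Each kept span could then be handed to whatever tool does the cutting and joining.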

by monster_truck9 hours ago|

It takes about 30 seconds in Audacity and will give an infinitely better result. Also works on any other sound

by alyssamazz1 hours ago|

I’ve done this in Audacity many times; it doesn’t work as well. The umm patterns don’t all match exactly. I’ve had better overall results with erm. I haven’t used Audacity in years for this, so maybe they’ve improved the feature.

by HeavyStorm8 hours ago|

Doesn't sound true. Unless Audacity already has a tool for exactly this... How would you do it in 30 seconds or less?

by ghaff6 hours ago|

It doesn't and ums aren't the only consistent tic you often want to clean up--"you know," long pauses, etc.
