I make online course content and used to lose close to a full day cutting filler out of every hour or so of recording. This gets me maybe 70% of that time back. On whether you should even cut them, I don’t think it’s clear cut. With non-native English speakers especially, the um is usually a real pause before they say something that matters, and cutting it makes them choppy or changes what they meant. Most of the time though it’s just padding. That matters more for courses than it sounds like it should, because a common complaint I get is how long courses are, so any dead air I can pull out is time I give back to people.
Anyway this is in my workflow now. Still messing with the settings to get it right, but I like to mess with my stack, and this handles exactly this step for me.
Disfluencies aren’t necessarily bad even if the word starts with “dis”!
I also don't care for writing that could have been made a lot more concise. It's a lot of work to make things shorter, but I think it's worthwhile.
Just random "um"s in between because you're struggling to build sentences can get annoying, both in person and online.
But hearing them from an interviewee drives me crazy, along with "sort of", "kind of", etc. I once counted all of the "sorta"s in an NPR interview, it was brutal.
The first one indicates something along the lines of "thinking, please stand by". The second one is a struggle.
To me they just indicate lack of confidence on the part of the speaker.
it's... exact opposite?
The main (attempted) use for ummms is to keep the speech going despite the pause. And the main complaint is exactly that it ruins the focus and doesn't give respite.
Although that is probably the less common use.
I do not belong to the younger generation. I used to refuse to watch videos because they take too long compared with reading. But now I'm watching them at 2x. You can watch a 40 min video in 20 minutes. I'd like to compress it further to 10 min or so, but 3x is a paid option on YouTube and I'm not sure I could digest English (which is a foreign language to me) at 3x.
> Meanwhile, book reading is at an all time low seemingly because no one has a preference or patience for careful study and reflection.
Oh, I read books too. But the content is different. You can't read some books at 2x. You can't listen to them at such a speed. In any book I think there are stretches of text you can consume at any speed, but sometimes you hit densely packed information that you need to think through. It happens with videos too. Like, try to watch Veritasium at 2x, you'll be forced to slow things down at least sometimes, because to get the message you need to learn how to think at 2x speed too, not just to listen.
In any case, most videos dilute their message over tens of minutes, so you can speed things up and still have plenty of time to think things through while watching.
The problem is that people are producing longer videos because that earns them more advertising revenue. Many creators now speak so mind-numbingly slowly, that even at 2x speed it feels like it's about a normal presentation speed.
In almost all cases, even at 2x speed, it would be quicker to just read a transcript (if that was available). The problem is really that people are incentivised to make everything into at least a 10 minute youtube video, when a short blog post that could have taken only a minute to read would have been sufficient to convey all the same information, and probably more useful as you could easily refer back to specific sections if you wanted.
The democratization of media created a lot of folks who've no idea how to disseminate information in a structured format and at an optimal rate.
arguably clickbait is the reason: i'm not here to listen to the video or all of the other fluff, i'm here to get the point as quickly as possible. it's a 'meeting could have been an email' sort of thing where lots of videos could really just be several bullet points.
AI YouTube summarizers are great in that regard.
For audiobooks I usually want to have time to hear and process every word, so I still speed it up but usually more like 1.5x, it depends on the narrator and the book. For podcasts I'm not there to appreciate the prose, so I go as fast as I can while still understanding them. I don't think it's about dopamine, I just find I don't gain anything by getting the same amount of information slower.
If you speak with disfluencies, you probably didn't sufficiently rehearse your speech. If you didn't rehearse enough, you probably didn't put much effort into writing it either, so why should I put much effort into listening? It's the same principle as AI slop.
Many people can speak off the cuff fluently and confidently, avoiding "like", "um", and other filler words. And even if you're not speaking fluently, leaving silences as punctuation is more effective, IMO.
Many impressive speakers I've met actually cite Toastmasters! So their obsessive zeal actually does work.
More rehearsal does work too sometimes, but it does sometimes lead to speeches "sounding too rehearsed".
I don't think that's true. We usually just don't notice filler words, in the same way that we're surprised to find people usually don't even talk in whole sentences, in contrast to written text or movies (which also use written text).
While it's a commercial product with a subscription, I spent a long time on the free tier not even hitting their limits until I started using it so extensively that I wanted to pay for it.
And I've used Whisper in the past, mostly for tinkering. I tried it for a couple of use cases but haven't touched the base project in a while. But I do regularly use Faster-Whisper-XXL, an open source project based on Whisper, for subtitle generation.
Though, for subtitle generation, I decided to support the project and mainly use the non-public build of Faster-Whisper-XXL Pro built for donators to the open source project.
The extra features smooth out the subtitle editing process very substantially. Toss in "--roformer_overlap 0.125 --roformer_vram 16 --best_of 15 --ff_vocal_extract mb-roformer --vad_method pyannote_v3" to the cli parameters (and sometimes --realign) and you have much less work to do in SubtitleEdit or Tero Subtitler afterwards to clean it up.
for something dealing with audio you do need to play the audio really
If you're not paying ttention, ctting out specific sounds can easily cause more trouble. I for one would be quite pset if I couldn't hear the pire's reasoning for calling a foul.
> It leaves um, uh, er and elongated versions (ummmm, uhhhhh) alone. Those sound like fillers but they’re doing real work in the sentence, and cutting them automatically would change what someone said. The rule erm follows: only remove things that are sound, not language.
> It also doesn’t touch repeated words, false starts, or long thinking pauses. Those aren’t noise on top of the speech; they are the speech, just messier than the speaker would like. Cleaning them up is an editorial decision about which take to keep, and erm doesn’t have an opinion about that.
Think about it. Cleaning these things-that-can-be-just-sounds-but-can-also-very-much-be-load-bearing up is an editorial decision. At the very least, you need to judge based on the surrounding content whether the removal of an um would change the meaning at all; and I don’t think text alone is adequate for that.
Something's already gone wrong here. Uh and er refer to the same sound. Uh is the American spelling. Er is British; to them a following "r" like that is just a kind of vowel.
(Also, in case it wasn’t clear: I was quoting from the start of the article in that sentence.)
But not in any other sense.
> in case it wasn’t clear: I was quoting from the start of the article in that sentence.
You don't seem to be quoting from the article at all, actually. You've combined two different sentences in a way that grossly misrepresents what the article says. But that's not really relevant to the point here.
- Ums Considered Harmful: https://hamanlp.org/research/ums/
- Related paper: https://hamanlp.org/SIGBOVIK_2026.pdf
I might add the custom filler word functionality and/or perhaps just make the filler word list configurable.
Ideally it would slice the video in the timeline without actually removing anything, so you can scrub through your video and try with and without each disfluency (thank you - awesome word) & decide case by case which to keep!
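To make the "slice but don't remove" idea concrete, here's a rough sketch of how I'd picture it, not how the tool actually works. It assumes you already have word-level timestamps from a transcriber; `Word`, `Candidate`, `mark_fillers` and `segments_to_render` are all names I made up, and the filler list passed in is a placeholder, not the tool's real defaults.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Candidate:
    start: float
    end: float
    text: str
    keep: bool = True  # nothing is cut until the editor flips this


def mark_fillers(words, fillers):
    """Return candidate cuts for words in a user-supplied filler list."""
    normalized = {f.lower() for f in fillers}
    return [
        Candidate(w.start, w.end, w.text)
        for w in words
        if w.text.lower().strip(".,!?") in normalized
    ]


def segments_to_render(duration, candidates):
    """Turn the rejected candidates into the list of segments to keep."""
    cuts = sorted((c.start, c.end) for c in candidates if not c.keep)
    segments, cursor = [], 0.0
    for start, end in cuts:
        if start > cursor:
            segments.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        segments.append((cursor, duration))
    return segments


words = [
    Word("So", 0.0, 0.3),
    Word("um,", 0.3, 0.8),
    Word("loops", 0.8, 1.4),
    Word("uh", 1.4, 1.9),
    Word("repeat", 1.9, 2.5),
]
candidates = mark_fillers(words, {"um", "uh"})  # placeholder filler list
candidates[0].keep = False                      # editor decides to drop the first one only
print(segments_to_render(2.5, candidates))
```

The source file is never touched; the decision list is the only thing that changes, so you can flip any single disfluency back on and re-render.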
The best approach I could come up with was to maintain a sliding histogram of loudness and exclude the low-level outliers.
You can do more in the noise/frequency domain but those were outside the scope of this tool.
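For anyone curious what a sliding loudness histogram might look like, here's a minimal sketch. It is my own guess at the shape of it, not the actual code: it assumes per-frame loudness in dB is already computed, and the function name, window size, bin width and the 10% cutoff are all made-up values.

```python
from collections import deque


def quiet_frames(levels_db, window=50, bin_width=3.0, low_fraction=0.1):
    """Flag frames that sit in the quietest bins of a sliding loudness histogram.

    levels_db: per-frame loudness in dB (assumed precomputed, e.g. RMS per frame).
    """
    recent = deque(maxlen=window)
    flags = []
    for level in levels_db:
        recent.append(level)

        # Histogram of the recent window, keyed by bin index.
        counts = {}
        for value in recent:
            b = int(value // bin_width)
            counts[b] = counts.get(b, 0) + 1

        # Walk up from the quietest bin while the bins still fit
        # inside the low_fraction budget; those bins are the outliers.
        budget = low_fraction * len(recent)
        covered = 0
        cutoff_bin = None
        for b in sorted(counts):
            if covered + counts[b] > budget:
                break
            covered += counts[b]
            cutoff_bin = b

        flags.append(cutoff_bin is not None and int(level // bin_width) <= cutoff_bin)
    return flags


levels = [-20.0] * 30 + [-60.0] * 2 + [-20.0] * 30
flags = quiet_frames(levels)
print([i for i, quiet in enumerate(flags) if quiet])
```

Because the threshold comes from the recent window rather than a fixed dB value, it adapts if the recording level drifts over time.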
When I want to redo a section, I say it again. But, I have a magic word — "mistake" — that I insert before. Previously I transcribed and just removed the sentence (or section) before mistake.
I recently automated this and used AI to determine what to cut and to drive davinci resolve to make the edit. Saves a lot of time in my workflow.
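The older transcribe-and-remove version of this is simple enough to sketch. This is only an illustration under assumptions: the transcript is already split into timed sentences, the marker comes out as its own sentence, and `Sentence` and `drop_retakes` are hypothetical names. The AI and DaVinci Resolve part is not shown.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    start: float  # seconds
    end: float    # seconds
    text: str


def drop_retakes(sentences, marker="mistake"):
    """Drop the marker sentence and the sentence spoken just before it."""
    kept = []
    for sentence in sentences:
        if sentence.text.strip().strip(".,!?").lower() == marker:
            if kept:
                kept.pop()  # discard the flubbed take
            continue        # discard the marker itself
        kept.append(sentence)
    return kept


transcript = [
    Sentence(0.0, 2.0, "Welcome to the course."),
    Sentence(2.0, 5.0, "Today we cover, um, loops and stuff."),
    Sentence(5.0, 5.8, "Mistake."),
    Sentence(5.8, 8.0, "Today we cover loops."),
]
for sentence in drop_retakes(transcript):
    print(f"{sentence.start:.1f}-{sentence.end:.1f} {sentence.text}")
```

The surviving start and end times are what you would hand to the editor as the ranges to keep. Removing a whole section instead of one sentence would need some extra rule for where the section begins, which this sketch doesn't attempt.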
Also the type of filler word for some reason is often different between UK and US: British people tend to be "umm"-ers and Americans are more likely to add "you know" (although "umm" is also common).
Once you notice it, it's impossible to ignore, and many, many native English speakers are actually terrible at speaking and add filler words to the point where it's very distracting.
No, you run an entire second pass LLM over the output of Whisper. "no uhhh three no four." should just output four the numeral not even f.o.u.r.
Hi, my name is fragmede. Judging by the date on my computer it's been four months since it's since I've t touched the transcription directory on computer and tried to improve on the state of wisprflow. Mines pretty good but it just doesn't... ah you can't drag me back in.
A trivial example is "umm... well... (sigh) okay" versus just "okay". Not okay!
Oh, Claudish striking again.