Log to a memory ring buffer (if you need extreme precision, prefetch everything and write binary fixed size "log entries"), flush asynchronously at some point when you don't care about timing anymore. Really helpful in kernel debugging.
You don't need to format the log on-device. You can push a binary representation and format when you need to display it. Look a 'defmt' for an example of this approach. Logging overhead in the path that emits the log messages can be tens of instructions.