I don't understand what kind of use case they were optimizing for when they designed this system. Don't think they were optimizing only for embedded or similar applications where they don't use a runtime at all.
Using stackfull coroutines, having a trait in std for runtimes and passing that trait around into async functions would be much better in my opinion instead of having the compiler transform entire functions and having more and more and more complexity layered on top of it solve the complexities that this decision created.