More reference images from different angles will always give more accurate 3D information. A single 2D image leaves a lot of ambiguity: several different 3D shapes can project to exactly the same 2D image. Additional cues like lighting and shadows help, but more real signal from more images will always be better. That said:
1. There are many use cases where only a single photo is available.
2. There are many models similar to Sharp that do accept multiple photos, but Sharp is trying to solve a specific problem. If you have multiple photos, don't use Sharp.
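
To make the single-image ambiguity above concrete, here is a minimal sketch assuming an idealized pinhole camera. It has nothing to do with how Sharp works internally, and the names `project`, `project_from_shifted_camera`, and the baseline value are made up for illustration. Two different 3D points on the same viewing ray land on the same pixel in one view, while a second, shifted camera tells them apart.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a 3D point (x, y, z) onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


def project_from_shifted_camera(point, baseline, focal_length=1.0):
    """Project into a second camera moved `baseline` units along the x axis."""
    x, y, z = point
    return project((x - baseline, y, z), focal_length)


# Two different 3D points on the same ray through the first camera's center.
near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)  # `near` scaled by 2: twice as far, twice as large

# One view: both points hit the same pixel, so depth is not recoverable.
same_in_one_view = project(near) == project(far)

# Second view (hypothetical baseline of 1 unit): the pixels now differ,
# and that difference is the extra signal a multi-view model gets.
differs_in_second_view = (
    project_from_shifted_camera(near, 1.0)
    != project_from_shifted_camera(far, 1.0)
)

print(same_in_one_view, differs_in_second_view)
```

A single-image model has to resolve that ambiguity from learned priors and cues like shading, which is exactly the trade-off described above.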