The user's conversation happens at level 0. Any actual tool use is only permitted at stack depths > 0. When the model calls the Return tool at stack depth 0 we end that logical turn of conversation and the argument to the tool is presented to the user. The user can then continue the conversation if desired with all prior top level conversation available in-scope.
It's effectively the exact same experience as ChatGPT, but each time the user types a message an entire depth-first search process kicks off that can take several minutes to complete each time.