I've seen this happen when the backend runs an image search on a picture, gets a description of what is in the picture, and adds that description to the bag of things it will summarize. The whole 'put some text in the image frame that misleads the AI' trick led to some hilarious results: a man holding a puppy with a Post-it stuck to it saying "Siamese kitten", for example, produced results saying "this man is holding a Siamese kitten."
That led to some changes, but it would be interesting to see whether you could still poison results that way.
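
To make the failure mode concrete, here is a toy sketch of the kind of pipeline I mean. Everything in it is hypothetical: `Image`, `describe_image`, and `build_summary_context` are made-up names standing in for whatever the real backend does, and the "captioner" is just a stub that either trusts text visible in the frame or ignores it.

```python
from dataclasses import dataclass


@dataclass
class Image:
    visual_label: str       # what is actually pictured, e.g. "puppy"
    visible_text: str = ""  # text inside the frame, e.g. on a Post-it


def describe_image(img: Image, trust_visible_text: bool = True) -> str:
    """Stub captioner (hypothetical, not a real API).

    If it trusts in-frame text, that text overrides what is pictured.
    """
    if trust_visible_text and img.visible_text:
        return f"a man holding a {img.visible_text}"
    return f"a man holding a {img.visual_label}"


def build_summary_context(page_text: str, images: list[Image],
                          trust_visible_text: bool = True) -> list[str]:
    """Collect the 'bag of things' the summarizer will later condense."""
    bag = [page_text]
    bag += [describe_image(img, trust_visible_text) for img in images]
    return bag


photo = Image(visual_label="puppy", visible_text="Siamese kitten")
poisoned = build_summary_context("Local man adopts pet.", [photo])
guarded = build_summary_context("Local man adopts pet.", [photo],
                                trust_visible_text=False)
```

With the default settings the Post-it text ends up in the bag as if it were a fact about the picture; with `trust_visible_text=False` the description falls back to what is actually pictured. Real systems are obviously messier than a boolean flag, so treat this only as an illustration of where the poisoning enters.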