Did you see the graph benchmark? I found it quite interesting. The task was to traverse a graph given as a natural-text description, which is pretty much your problem.
Update: I took a corpus of personal chat data (so it couldn't have been seen in training) and tried asking it some paraphrased questions. It performed quite poorly.