Understanding Transformer reasoning capabilities via graph algorithms


Transformers and MPNNs are not the only ML approaches for the structural analysis of graphs, so we also compared the analytical capabilities of a wide variety of other GNN- and transformer-based architectures. On the GNN side, we compared both transformers and MPNNs to models like graph convolutional networks (GCNs) and graph isomorphism networks (GINs).
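For readers less familiar with these baselines, here is a minimal NumPy sketch contrasting the standard per-layer update rules of a GCN and a GIN. It is purely illustrative and not the training code used in our experiments; the `mlp`, `eps`, and weight matrices stand in for learned components.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                     # degrees of the self-looped graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

def gin_layer(A, H, mlp, eps=0.0):
    """One GIN layer: H' = MLP((1 + eps) * H + A H), i.e., sum over neighbors."""
    return mlp((1.0 + eps) * H + A @ H)

# Toy usage on a 3-node path graph with random features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 4))
mlp = lambda X: np.maximum(X @ W, 0.0)        # stand-in for a learned MLP
print(gcn_layer(A, H, W).shape, gin_layer(A, H, mlp).shape)  # (3, 4) (3, 4)
```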

Additionally, we compared our transformers with much larger language models. Language models are transformers as well, but with many orders of magnitude more parameters. Specifically, we used the language modeling approach described in "Talk Like a Graph", which encodes the graph as text, using natural language to describe relationships instead of treating an input graph as a collection of abstract tokens.
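As a rough illustration of this kind of encoding, the hypothetical helper below renders an edge list as plain English; the actual prompt templates used in "Talk Like a Graph" differ and are more varied.

```python
def graph_to_text(num_nodes, edges):
    """Describe an undirected graph in natural language (hypothetical template,
    not the exact wording from 'Talk Like a Graph')."""
    lines = [f"G describes a graph among nodes 0 through {num_nodes - 1}.",
             "In this graph:"]
    lines.extend(f"Node {u} is connected to node {v}." for u, v in edges)
    return "\n".join(lines)

print(graph_to_text(4, [(0, 1), (1, 2), (2, 3)]))
```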

We asked a trained language model to solve various retrieval tasks using a variety of prompting approaches (a sketch of how such prompts can be assembled follows the list):

  • Zero-shot, which provides only a single prompt and asks for the solution without further hints.
  • Few-shot, which provides several examples of solved prompt–response pairs before asking the model to solve a task.
  • Chain-of-thought (CoT), which provides a collection of examples (similar to few-shot), each of which contains a prompt, a response, and an explanation before asking the model to solve a task.
  • Zero-shot CoT, which asks the model to show its work, without including additional worked-out examples as context.
  • CoT-bag, which extends CoT prompting by asking the LLM to first construct the graph (its nodes and edges) from the provided description before answering.
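To make the differences between these strategies concrete, here is a hypothetical sketch of how such prompts could be assembled; the exact wording used in our experiments is not reproduced here.

```python
# Hypothetical prompt builders for the strategies above (wording is illustrative).

EXAMPLE_GRAPH = (
    "G describes a graph among nodes 0 through 3.\n"
    "Node 0 is connected to node 1. Node 1 is connected to node 2. "
    "Node 2 is connected to node 3."
)
QUESTION = "Q: Is there a path from node 0 to node 3?"

def zero_shot(graph_text, question):
    """Single prompt, no worked examples."""
    return f"{graph_text}\n{question}\nA:"

def few_shot(examples, graph_text, question):
    """Prepend solved (graph, question, answer) triples."""
    shots = "\n\n".join(f"{g}\n{q}\nA: {a}" for g, q, a in examples)
    return f"{shots}\n\n{graph_text}\n{question}\nA:"

def chain_of_thought(examples, graph_text, question):
    """Like few-shot, but each example also carries an explanation."""
    shots = "\n\n".join(
        f"{g}\n{q}\nA: {e} So the answer is {a}." for g, q, e, a in examples
    )
    return f"{shots}\n\n{graph_text}\n{question}\nA:"

def zero_shot_cot(graph_text, question):
    """Ask the model to show its work without worked examples."""
    return f"{graph_text}\n{question}\nA: Let's think step by step."

print(zero_shot(EXAMPLE_GRAPH, QUESTION))
```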

For the theoretical part of the experiment, we created a task difficulty hierarchy to assess which tasks small transformers can solve.

We only considered graph reasoning tasks that apply to undirected and unweighted graphs of bounded size: node count, edge count, edge existence, node degree, connectivity, node connectivity (for undirected graphs), cycle check, and shortest path.

In this hierarchy, we categorized graph task difficulty based on depth (the number of self-attention layers in the transformer, computed sequentially), width (the dimension of the vectors used for each graph token), and the number of blank tokens, grouping the tasks into three classes (a toy grouping follows the list):

  • Retrieval tasks: easy tasks that require only local aggregation.
  • Parallelizable tasks: tasks that benefit greatly from parallel operations.
  • Search tasks: tasks that benefit only a little from parallel operations.
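As a rough guide, one plausible grouping of the tasks listed above into these classes is sketched below; the formal placements, along with the associated depth and width bounds, come from our theoretical analysis and should be taken from the paper rather than this illustrative snippet.

```python
# Hypothetical bucketing of the graph reasoning tasks into the three classes;
# the paper's formal hierarchy gives the precise placements and resource bounds.
TASK_CLASSES = {
    "retrieval": ["node count", "edge count", "edge existence", "node degree"],
    "parallelizable": ["connectivity", "node connectivity", "cycle check"],
    "search": ["shortest path"],
}

def task_class(task):
    """Look up which class a task falls into under this illustrative grouping."""
    for cls, tasks in TASK_CLASSES.items():
        if task in tasks:
            return cls
    raise ValueError(f"unknown task: {task}")

print(task_class("shortest path"))  # -> "search"
```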
