Beyond Transformers with PyNeuraLogic
TOWARDS DEEP RELATIONAL LEARNING
Demonstrating the power of neuro-symbolic programming

In the last few years, we have seen a rise of Transformer¹-based models, with successful applications in many fields such as Natural Language Processing and Computer Vision. In this article, we will explore a concise, explainable, and extensible way to express deep learning models, specifically Transformers, as a hybrid architecture, i.e., by marrying deep learning with symbolic artificial intelligence. To do so, we will implement the models in a Python neuro-symbolic framework called PyNeuraLogic (the author is a co-author of the framework).
"We cannot construct rich cognitive models in an adequate, automated way without the triumvirate of hybrid architecture, rich prior knowledge, and sophisticated techniques for reasoning."
- Gary Marcus²
Combining symbolic representations with deep learning fills the gaps in current deep learning models, such as the lack of out-of-the-box explainability and of techniques for reasoning. Perhaps raising the number of parameters is not the soundest approach to achieving these desired results, just as increasing the number of camera megapixels does not necessarily yield better photos.

The PyNeuraLogic framework is based on logic programming with a twist – logic programs hold differentiable parameters. The framework is well-suited for smaller structured data, such as molecules, and complex models, such as Transformers and Graph Neural Networks. On the other hand, PyNeuraLogic is not the best choice for non-relational and large tensor data.
The key component of the framework is a differentiable logic program³ that we refer to as a template. A template consists of logic rules that define the structure of neural networks in an abstract way; we can think of a template as a blueprint of the model's architecture. The template is then applied to each input data instance to produce (via grounding and neuralization) a neural network unique to that input sample. This process is entirely different from other frameworks, whose predefined architectures cannot adjust themselves to different input samples. For a closer introduction to the framework, see, e.g., a previous article on PyNeuraLogic from the perspective of Graph Neural Networks.
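As a toy illustration of the idea (not yet the Transformer), consider the following hypothetical single-rule template; the relation names (h, feature, edge) and the weight placeholders W, W_f are made up for demonstration, and, as in all listings in this article, the R, V, and F aliases are assumed to come from the framework. When the rule is applied to a concrete input graph, it gets grounded once for every edge present in that sample, so differently sized samples give rise to differently shaped networks.
(R.h(V.I)[W] <= (R.feature(V.J)[W_f], R.edge(V.J, V.I))) | [F.relu],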
Symbolic Transformers

We generally tend to implement deep learning models as tensor operations over input tokens batched into one large tensor. This makes sense because deep learning frameworks and hardware (e.g., GPUs) are typically optimized for processing large tensors rather than multiple tensors of diverse shapes and sizes. Transformers are no exception, and it is common to batch the individual token vector representations into one large matrix and to represent the model as operations over such matrices. Nevertheless, such implementations hide how the individual input tokens relate to each other, as we will demonstrate on the Transformer's attention mechanism.
The Attention Mechanism
The attention mechanism forms the very core of all the Transformer models. Specifically, its classic version makes use of a so-called multi-head scaled dot-product attention. Let us decompose the scaled dot-product attention with one head (for clarity) into a simple logic program.

The purpose of the attention is to decide which parts of the input the network should focus on. The attention does that by computing a weighted sum of the values V, where the weights represent the compatibility of the input keys K and queries Q. In this specific version, the weights are computed by applying the softmax function to the dot product of the queries Q and keys K, divided by the square root of the input feature vector dimensionality d_k.
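For reference, the same computation in the usual tensor view might look like the following minimal NumPy sketch (single head, no masking); the function name and shapes here are illustrative only.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (num_tokens, d_k) matrices with all tokens batched together.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # pairwise compatibilities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # weighted sum of the values

The pairwise relationships between the individual tokens stay buried inside the matrix products here; the two rules below make them explicit.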
(R.weights(V.I, V.J) <= (R.d_k, R.k(V.J).T, R.q(V.I))) | [F.product, F.softmax_agg(agg_terms=[V.J])],
(R.attention(V.I) <= (R.weights(V.I, V.J), R.v(V.J))) | [F.product],
In PyNeuraLogic, we can fully capture the attention mechanism with the above logical rules. The first rule expresses the computation of the weights; it calculates the product of the inverse square root of the dimensionality with the transposed j-th key vector and the i-th query vector. Then we aggregate all the results for a given i over all possible j's with softmax.
The second rule then calculates the product between this weight vector and the corresponding j-th value vector and sums up the results across the different j's for each respective i-th token.
Attention Masking
During training and evaluation, we usually restrict which tokens an input token can attend to. For example, we want to prevent tokens from looking ahead and attending to upcoming words. Popular frameworks, such as PyTorch, implement this via masking, that is, by setting a subset of the elements of the scaled dot-product result to some very low negative number. Those numbers force the softmax function to assign (nearly) zero weight to the corresponding token pairs.
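For comparison, such tensor-style causal masking might be applied roughly as in the following NumPy sketch (not PyNeuraLogic code; the function name is illustrative only).

import numpy as np

def causal_attention_weights(Q, K):
    # Q, K: (num_tokens, d_k) matrices.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Positions with j > i receive a very low score, so softmax assigns them ~zero weight.
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)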
(R.weights(V.I, V.J) <= (
    R.d_k, R.k(V.J).T, R.q(V.I), R.special.leq(V.J, V.I)
)) | [F.product, F.softmax_agg(agg_terms=[V.J])],
With our symbolic representation, we can implement this by simply adding one body relation serving as a constraint. When calculating the weights, we restrict the j index to be less than or equal to the i index. In contrast to the masking, we compute only the needed scaled dot products.

Beyond standard Attention aggregation
Of course, the symbolic "masking" can be completely arbitrary. Most of us have heard of GPT-3⁴ (or its applications, such as ChatGPT), which is based on the Sparse Transformer.⁵ The Sparse Transformer's attention (the strided version) has two types of attention heads:
- One that attends only to previous n tokens (0 ≤ i − j ≤ n)
- One that attends only to every n-th previous token ((i − j) % n = 0)
Implementing both types of heads again requires only minor changes (shown here for n = 5).
(R.weights(V.I, V.J) <= (
    R.d_k, R.k(V.J).T, R.q(V.I),
    R.special.leq(V.D, 5), R.special.sub(V.I, V.J, V.D),
)) | [F.product, F.softmax_agg(agg_terms=[V.J])],
(R.weights(V.I, V.J) <= (
    R.d_k, R.k(V.J).T, R.q(V.I),
    R.special.mod(V.D, 5, 0), R.special.sub(V.I, V.J, V.D),
)) | [F.product, F.softmax_agg(agg_terms=[V.J])],
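To make the two attention patterns concrete, here is a small plain-Python check (illustrative only, not part of the template) of which positions j a token at position i = 10 may attend to under each head, again for n = 5.

n = 5

def local_head(i, j):
    # attends only to the previous n tokens: 0 <= i - j <= n
    return 0 <= i - j <= n

def strided_head(i, j):
    # attends only to every n-th previous token: (i - j) % n == 0
    return j <= i and (i - j) % n == 0

print([j for j in range(16) if local_head(10, j)])    # [5, 6, 7, 8, 9, 10]
print([j for j in range(16) if strided_head(10, j)])  # [0, 5, 10]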

We can go even further and generalize the attention for graph-like (relational) inputs, just like in Relational Attention.⁶ This type of attention operates on graphs, where nodes attend only to their neighbors (nodes connected by an edge). Queries Q, keys K, and values V are then edge embeddings summed with node vector embeddings.
(R.weights(V.I, V.J) <= (R.d_k, R.k(V.I, V.J).T, R.q(V.I, V.J))) | [F.product, F.softmax_agg(agg_terms=[V.J])],
(R.attention(V.I) <= (R.weights(V.I, V.J), R.v(V.I, V.J))) | [F.product],
R.q(V.I, V.J) <= (R.n(V.I)[W_qn], R.e(V.I, V.J)[W_qe]),
R.k(V.I, V.J) <= (R.n(V.J)[W_kn], R.e(V.I, V.J)[W_ke]),
R.v(V.I, V.J) <= (R.n(V.J)[W_vn], R.e(V.I, V.J)[W_ve]),
This type of attention is, in our case, again almost the same as the previously shown scaled dot-product attention. The only difference is the addition of extra terms to capture the edges. Feeding a graph as input into the attention mechanism seems quite natural, which is not entirely surprising, considering that the Transformer is a type of Graph Neural Network, acting on fully-connected graphs (when no masking is applied). In the traditional tensor representation, this is not that obvious.
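To see what such relational input might look like, here is a hypothetical example graph expressed as plain facts over the n and e relations used above (a made-up sample; the feature values that would normally be attached to these facts are omitted for brevity).

example_graph = [
    # four nodes connected in a cycle; only the connected (I, J) pairs
    # can ever ground the R.weights(V.I, V.J) rule above
    R.n(0), R.n(1), R.n(2), R.n(3),
    R.e(0, 1), R.e(1, 2), R.e(2, 3), R.e(3, 0),
]

With a fully connected edge set (an R.e(i, j) fact for every pair of nodes), the very same rules reduce back to the ordinary, unmasked attention over all token pairs.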
The Transformer Encoder
Now that we have showcased the implementation of the attention mechanism, the missing pieces for constructing an entire Transformer encoder block are relatively straightforward.
Embeddings
We have already seen in the relational attention above how one can implement embeddings. For the traditional Transformer, the embeddings will be pretty similar. We project the input vector into three embedding vectors: keys, queries, and values.
R.q(V.I) <= R.input(V.I)[W_q],
R.k(V.I) <= R.input(V.I)[W_k],
R.v(V.I) <= R.input(V.I)[W_v],
Skip connections, Normalization, and Feed-forward Network
Query embeddings are summed with the attention's output via a skip connection. The resulting vector is then normalized and passed into a multilayer perceptron (MLP).
(R.norm1(V.I) <= (R.attention(V.I), R.q(V.I))) | [F.norm],
For the MLP, we will implement a fully connected neural network with two hidden layers, which can be elegantly expressed as one logic rule.
(R.mlp(V.I)[W_2] <= (R.norm1(V.I)[W_1])) | [F.relu],
The last skip connection with normalization is then identical to the previous one.
(R.norm2(V.I) <= (R.mlp(V.I), R.norm1(V.I))) | [F.norm],
Putting it all together
We have built all the necessary parts to construct a Transformer encoder. The decoder utilizes the same components; therefore, its implementation would be analogous. Let us now combine all the blocks into one differentiable logic program that can be embedded into a Python script and compiled into neural networks with PyNeuraLogic.
R.q(V.I) <= R.input(V.I)[W_q],
R.k(V.I) <= R.input(V.I)[W_k],
R.v(V.I) <= R.input(V.I)[W_v],
R.d_k[1 / math.sqrt(embed_dim)],
(R.weights(V.I, V.J) <= (R.d_k, R.k(V.J).T, R.q(V.I))) | [F.product, F.softmax_agg(agg_terms=[V.J])],
(R.attention(V.I) <= (R.weights(V.I, V.J), R.v(V.J))) | [F.product],
(R.norm1(V.I) <= (R.attention(V.I), R.q(V.I))) | [F.norm],
(R.mlp(V.I)[W_2] <= (R.norm1(V.I)[W_1])) | [F.relu],
(R.norm2(V.I) <= (R.mlp(V.I), R.norm1(V.I))) | [F.norm],
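How exactly this listing gets wrapped into a runnable template depends on the PyNeuraLogic version at hand; the following is only a hedged sketch, where the import path, the Template class with its add_rules method, the [embed_dim, embed_dim] weight-shape notation, and the embed_dim constant are all assumptions to be checked against the framework's documentation.

import math

from neuralogic.core import Template, R, V  # assumed import path

embed_dim = 128  # illustrative embedding dimensionality

template = Template()
template.add_rules([
    # Learnable projections; the [embed_dim, embed_dim] brackets declare the weight
    # shapes, standing in for the symbolic W_q / W_k / W_v placeholders used above.
    R.q(V.I) <= R.input(V.I)[embed_dim, embed_dim],
    R.k(V.I) <= R.input(V.I)[embed_dim, embed_dim],
    R.v(V.I) <= R.input(V.I)[embed_dim, embed_dim],

    # Constant scaling factor 1 / sqrt(d_k) stored as a valued fact.
    R.d_k[1 / math.sqrt(embed_dim)],

    # ... followed by the attention, normalization, and MLP rules listed above ...
])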
Conclusion
In this article, we analysed the Transformer architecture and demonstrated its implementation in a neuro-symbolic framework called PyNeuraLogic. With this approach, we were able to implement various types of Transformers with only minor changes in the code, illustrating how anyone can quickly pivot and develop novel Transformer architectures. It also points out the unmistakable resemblance between the various versions of Transformers, and between Transformers and GNNs.
[1]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need.
[2]: Marcus, G. (2020). The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence.
[3]: Šourek, G., Železný, F., & Kuželka, O. (2021). Beyond Graph Neural Networks with Lifted Relational Neural Networks. Machine Learning, 110(7), 1695–1738.
[4]: Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language Models are Few-Shot Learners.
[5]: Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers.
[6]: Diao, C., & Loynd, R. (2022). Relational Attention: Generalizing Transformers for Graph-Structured Tasks.
The author would like to thank Gustav Šír for proofreading this article and giving valuable feedback. If you want to learn more about combining logic with deep learning, head to Gustav's article series.