Optimizing Retrieval-Augmented Generation (RAG) by Selective Knowledge Graph Conditioning
Artificial intelligence software was used to enhance the grammar, flow, and readability of this article's text.
Generative pre-trained models have shown impressive fluency and coherence when used as dialogue agents. However, they suffer from a key limitation: a lack of grounding in external knowledge. Left to their pre-trained parameters alone, these models often generate plausible-sounding but factually incorrect responses, also known as hallucinations.
Prior approaches to mitigate this have involved augmenting the dialogue context with entire knowledge graphs associated with entities mentioned in the chat. However, this indiscriminate conditioning on large knowledge graphs brings its own problems:
Limitations of Naive Knowledge Graph Augmentation:
- Much of the 1-hop context may be irrelevant to the dialogue, inserting unnecessary noise
- Encoding entire knowledge subgraphs strains sequence length limits
- No guarantee the model will use the relevant facts during generation
- Risk of hallucination still exists despite knowledge grounding
To overcome this, Kang et al. (2023) propose the SUbgraph Retrieval-augmented GEneration (SURGE) framework, with three key innovations:
- Context-Relevant Subgraph Retriever: Retrieving the most relevant knowledge graph facts to the dialogue context using a graph neural network retriever.
- Efficient Graph Encoding: Perturbing token embeddings based on relations while encoding just subgraph entities instead of all triplets. Maintains permutation and inversion invariance.
- Graph-Text Contrastive Learning: Ensuring consistency between retrieved knowledge graph and generated response via contrastive loss.
This provides the dialogue with precisely the factual context it needs, without dilution from irrelevant facts or strain on the model's input capacity. Experiments show SURGE reduces hallucination and improves grounding.
The key insight is that selective conditioning on personalized subgraphs provides focused knowledge grounding without overwhelming pre-trained models.

Plan:
- Context-Relevant Knowledge Retrieval
- Invariant Knowledge Encoding
- Enforcing Knowledge Consistency
- Results
- Conclusion
Context-Relevant Knowledge Retrieval:
- Retrieval distribution modeled using similarity of context and triplet embeddings
- Triplet embeddings obtained from Graph Neural Networks to capture relational structure
- Enables focusing on most relevant facts instead of all knowledge graph facts
The key challenge SURGE addresses is retrieving only the most relevant facts from the knowledge graph rather than overwhelming the generator with every contextually associated entity. To enable this context-specific selection, the paper proposes modeling retrieval as a distribution over knowledge graph triplets conditioned on the dialogue history.
Mathematically, this context-conditional retrieval distribution is defined as:
pφ(z|x) ∝ exp(d(z)^T s(x))
Where:
- x is the dialogue context
- z is a knowledge graph triplet
- s(x) generates dense embeddings for the dialogue context
- d(z) generates dense embeddings for the triplets
The key insight here is using the similarity between dialogue and triplet embeddings to model relevance.
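As a rough sketch of this idea (not the paper's implementation), the retrieval step can be pictured as a dot-product similarity followed by a softmax over candidate triplets; the function name and toy dimensions below are purely illustrative:

```python
import torch
import torch.nn.functional as F

def retrieval_distribution(context_emb: torch.Tensor,
                           triplet_embs: torch.Tensor) -> torch.Tensor:
    """p(z|x) ∝ exp(d(z)^T s(x)) over a set of candidate triplets.

    context_emb:  (dim,)            -- s(x), the encoded dialogue context
    triplet_embs: (n_triplets, dim) -- d(z) for each candidate triplet
    """
    scores = triplet_embs @ context_emb     # dot-product relevance scores
    return F.softmax(scores, dim=0)         # normalize into a distribution over triplets

# Toy usage: 5 candidate triplets, 16-dimensional embeddings.
p_z = retrieval_distribution(torch.randn(16), torch.randn(5, 16))
top2 = torch.topk(p_z, k=2)                 # keep only the most relevant facts
```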
Since the triplets contain both entities and relations structured as a graph, plain language model encoders are insufficient. Graph Neural Networks (GNNs), by contrast, are well suited to capturing both nodes and edges: they represent the relational dependencies between entities by propagating embeddings across neighbouring nodes.
Specifically, node embeddings are generated using Graph Convolutional Networks:
e = GNN(e0; G)
While relation embeddings use Edge Hypergraph Networks:
r = GNN(r0; G*)
Where G* denotes the dual hypergraph.
By combining node and edge embeddings, full triplet embeddings capture both semantic relations and structural proximity. The similarity of these triplet embeddings with the dialogue context vectors from the encoder then provides the foundation for a context-relevant retrieval distribution.
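To make this concrete, here is a minimal, illustrative sketch of relation-aware triplet embeddings. A single mean-aggregation layer stands in for the paper's graph convolutional and edge-hypergraph networks, and all names and shapes are assumptions for illustration:

```python
import torch

def gnn_layer(node_embs, edges, weight):
    """One mean-aggregation message-passing layer, a simplified stand-in for the
    paper's GCN (nodes) and edge-hypergraph network (relations)."""
    agg = node_embs.clone()
    counts = torch.ones(node_embs.size(0), 1)
    for head, _, tail in edges:               # propagate neighbouring embeddings
        agg[head] = agg[head] + node_embs[tail]
        agg[tail] = agg[tail] + node_embs[head]
        counts[head] += 1
        counts[tail] += 1
    return torch.relu((agg / counts) @ weight)

def triplet_embedding(node_embs, rel_embs, triplet):
    """Combine head, relation, and tail embeddings into a triplet embedding d(z)."""
    head, rel, tail = triplet
    return torch.cat([node_embs[head], rel_embs[rel], node_embs[tail]])

# Toy subgraph: 3 entities, 2 relation types, 2 triplets (head, relation, tail).
dim = 8
nodes, rels = torch.randn(3, dim), torch.randn(2, dim)
edges = [(0, 0, 1), (1, 1, 2)]
nodes = gnn_layer(nodes, edges, torch.randn(dim, dim))
d_z = torch.stack([triplet_embedding(nodes, rels, e) for e in edges])   # (2, 3*dim)
```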
Invariant Knowledge Encoding:
- Encodes retrieved subgraph into generator transformer efficiently
- Ensures encoding is invariant to order and direction of relations
- Uniquely encodes entities and perturbs embeddings based on relations
The context-relevant subgraph retrieved in the previous stage needs to be encoded into the generator transformer model that will produce the dialogue response. However, naively encoding the symbolic triplets runs into issues around stability of the representations.
Specifically, there are two desired invariance properties:
- Permutation invariance: Order of triplets should not change overall meaning
- Relation inversion invariance: Forward and backward relations equivalent
When encoding knowledge graphs into pre-trained language models for dialogue, a few practical problems come up:
- Long sequences: Encoding every single triplet fact as words results in extremely long input sequences. This strains the model's context capacity.
- Order dependence: Shuffling the order of triplets changes the meaning seen by models like GPT-3, since they rely so much on word order and positioning. But triplets are by nature unordered – shuffling facts shouldn't change overall meaning.
- Directional difference: Relations can be inverted without changing the core meaning (X is-wife-of Y == Y has-husband X). But linearized text makes these look like completely different facts.
The problems above cause unnecessary stress on the language models when encoding structured knowledge. The models get overwhelmed by huge numbers of tokens, and they struggle to grasp that jumbled or inverted triplets still convey the same concepts.
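To make the order- and direction-sensitivity concrete, here is a toy illustration (not from the paper): the linearized text of a subgraph changes under shuffling and relation inversion, while its set of unique entities does not.

```python
# Toy illustration (not from the paper): linearized triplet text is sensitive to
# order and relation direction, while the set of unique entities is not.
triplets = [("Alice", "is_wife_of", "Bob"), ("Bob", "lives_in", "Paris")]
shuffled = list(reversed(triplets))
inverted = [("Bob", "has_husband", "Alice"), ("Bob", "lives_in", "Paris")]

def linearize(ts):
    return " ".join(f"{h} {r} {t}" for h, r, t in ts)

def entity_set(ts):
    return {e for h, _, t in ts for e in (h, t)}

print(linearize(triplets) == linearize(shuffled))    # False: order changes the text
print(entity_set(triplets) == entity_set(shuffled))  # True: the entity set is unchanged
print(entity_set(triplets) == entity_set(inverted))  # True: inversion keeps the entities
```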
So ideally, we need a way to encode knowledge compactly yet stably. The encoding should be:
- Efficient: Shouldn't result in 1000s of prepended tokens blowing context space.
- Order-invariant: Shuffling subgraphs shouldn't drastically alter meaning.
- Direction-invariant: Forward and backward relations should be treated equivalently.
SURGE solves this by uniquely encoding only entities, then judiciously perturbing their embeddings based on relations detected via graph neural networks. This provides a compact, stable form for assimilation by the decoder.
A two-step embed and perturb approach is introduced:
Unique entity embedding:
- Extract set of unique entities ENT(Z) from triplets
- Embed these entities using dialogue encoder
- Treating the subgraph as a set of unique entities, rather than an ordered sequence of triplets, provides permutation invariance
Perturbation using relations:
- Use Graph Neural Network over triplets
- GNN provides relation-aware node embeddings
- Apply transformation β to entity embeddings:
β(f(a), Z) = (1 + γ) ∗ f(a) + δ
where γ, δ are learned perturbation factors based on relations.
This step uses the relational information to directly influence the entity vector spaces while still keeping the efficient unique entity based encoding.
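A minimal sketch of this embed-and-perturb step might look as follows; the module name, the linear layers producing γ and δ, and the tensor shapes are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

class EntityPerturb(nn.Module):
    """Sketch of the embed-and-perturb idea: encode unique entities once, then
    scale and shift their embeddings with relation-aware factors produced from
    GNN node embeddings. Names and shapes are illustrative assumptions."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(dim, dim)   # produces γ from relation-aware embeddings
        self.to_delta = nn.Linear(dim, dim)   # produces δ from relation-aware embeddings

    def forward(self, entity_embs: torch.Tensor, gnn_embs: torch.Tensor) -> torch.Tensor:
        """entity_embs: f(a) from the text encoder, (n_entities, dim)
        gnn_embs:    relation-aware embeddings of the same entities, (n_entities, dim)
        """
        gamma = self.to_gamma(gnn_embs)
        delta = self.to_delta(gnn_embs)
        # β(f(a), Z) = (1 + γ) * f(a) + δ
        return (1 + gamma) * entity_embs + delta

# Toy usage: 5 unique entities with 64-dimensional embeddings.
perturb = EntityPerturb(64)
out = perturb(torch.randn(5, 64), torch.randn(5, 64))
```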
Benefits:
- Vector space encoding fits generator requirements
- Invariance provides stability and consistency
The insight is generating invariance through sets and perturbations rather than variable sequence encodings.
Enforcing Knowledge Consistency:
- Contrastive loss between knowledge graph and generated response
- Pulls relevant knowledge representations closer to response representations
- Improves grounding of responses in retrieved facts
Even after context-relevant retrieval and efficient encoding, there is no guarantee the generator will actually utilize the relevant knowledge provided to it. The risk of hallucination persists.
To actively incorporate the encoded subgraph, the authors propose adding a cross-modal contrastive loss between graph and response representations:
Lcont = -(1/2) * log [ exp(sim(ζ(z), ξ(h))) / Σ_h' exp(sim(ζ(z), ξ(h'))) ]
        -(1/2) * log [ exp(sim(ζ(z), ξ(h))) / Σ_z' exp(sim(ζ(z'), ξ(h))) ]
Where:
- z is the encoded knowledge subgraph
- h is the decoder hidden state
- ζ and ξ are projection functions mapping graph and decoder representations into a shared space
- h' and z' range over the other responses and subgraphs in the batch, which serve as negatives
Intuitively, this loss pulls an encoded knowledge graph closer to its corresponding response representation, while pushing it away from other random responses or knowledge graphs.
This trains the model to actively distinguish between relevant knowledge-response pairs versus irrelevant ones. This discriminative pressure incentivizes the model to ground its responses in the encoded facts.
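A simplified sketch of such a symmetric graph-text contrastive loss is shown below; it treats matching (subgraph, response) pairs within a batch as positives and everything else as negatives, and is an assumption-laden stand-in rather than the paper's code:

```python
import torch
import torch.nn.functional as F

def graph_text_contrastive_loss(z_proj: torch.Tensor, h_proj: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive loss between projected subgraph representations ζ(z)
    and projected response representations ξ(h). Row i of each tensor is assumed
    to be a matching (knowledge, response) pair; other rows act as negatives.

    z_proj, h_proj: (batch_size, dim)
    """
    z_proj = F.normalize(z_proj, dim=-1)
    h_proj = F.normalize(h_proj, dim=-1)
    logits = z_proj @ h_proj.T                       # pairwise similarities
    targets = torch.arange(z_proj.size(0))
    loss_g2t = F.cross_entropy(logits, targets)      # denominator sums over responses h'
    loss_t2g = F.cross_entropy(logits.T, targets)    # denominator sums over subgraphs z'
    return 0.5 * (loss_g2t + loss_t2g)

# Toy usage with a batch of 4 projected pairs of dimension 32.
loss = graph_text_contrastive_loss(torch.randn(4, 32), torch.randn(4, 32))
```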
Benefits:
- Improves factual consistency
- Reduces unsupported assertions
- Allows tracing hallucinations to retrieval errors
The key insight is that without an explicit alignment objective, the vector spaces of both modalities may drift apart, limiting fact grounding. The contrastive loss acts as an inductive bias towards consistency.
Training End-to-End:
Objective Function: The overall training objective is to maximize the log-likelihood of generating the correct response, marginalized over the latent knowledge subgraphs:
L = log Σ_Z pφ(Z|x) pθ(y|x,Z)
Where pφ(Z|x) is the context-based retrieval distribution and pθ(y|x,Z) is the generator distribution.
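As a toy numerical illustration (values invented), the marginal can be approximated using only the top-k retrieved subgraphs:

```python
import torch

# Toy numerical illustration (values invented): approximating
# L = log Σ_Z pφ(Z|x) pθ(y|x,Z) using only the top-k retrieved subgraphs.
p_z_given_x = torch.tensor([0.5, 0.3, 0.2])           # retrieval probabilities pφ(Z_i|x)
log_p_y_given_xz = torch.tensor([-2.1, -3.4, -5.0])   # generator log-likelihoods log pθ(y|x,Z_i)
L = torch.logsumexp(torch.log(p_z_given_x) + log_p_y_given_xz, dim=0)
```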
Training Process:
- Encode dialogue context x using encoder network
- Retrieve top-k subgraphs Z_i ~ p(Z|x) via similarity search
- Encode Z_i invariantly using GNN + perturbation
- Maximize p(y|x,Z_i) for each sample via decoder
- Additionally minimize contrastive loss between Z_i and decoder states
So jointly across batches of dialogue, retrieval distribution and generation distribution are optimized through shared parameters.
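Putting the pieces together, a hypothetical training step might look like the sketch below. Here `retriever`, `graph_encoder`, `generator`, and `contrastive_loss` are stand-in callables for the components described above, and the per-subgraph weighted negative log-likelihood is a simple surrogate for the marginal objective; none of this is the paper's actual code.

```python
import torch

def surge_training_step(batch, retriever, graph_encoder, generator,
                        contrastive_loss, optimizer, k=3, alpha=1.0):
    """One hypothetical end-to-end training step (illustrative, not the paper's API).

    retriever(x, candidates, top_k)  -> (probs, subgraphs): top-k p(Z|x) and subgraphs
    graph_encoder(Z)                 -> invariant knowledge embeddings for subgraph Z
    generator(x, knowledge, labels)  -> (log p(y|x,Z), decoder hidden states)
    contrastive_loss(knowledge, h)   -> graph-text alignment loss
    """
    x, y, candidate_triplets = batch

    # 1. Retrieve top-k subgraphs Z_i ~ p(Z|x) via similarity search.
    probs, subgraphs = retriever(x, candidate_triplets, top_k=k)

    nll = torch.tensor(0.0)
    cont = torch.tensor(0.0)
    for p_zi, z_i in zip(probs, subgraphs):
        # 2. Encode Z_i invariantly (unique entities + relation-aware perturbation).
        knowledge = graph_encoder(z_i)

        # 3. Weighted negative log-likelihood, a simple surrogate for the marginal
        #    objective L = log Σ_Z p(Z|x) p(y|x,Z).
        log_py, decoder_states = generator(x, knowledge, labels=y)
        nll = nll - p_zi * log_py

        # 4. Contrastive alignment between the encoded graph and the response.
        cont = cont + contrastive_loss(knowledge, decoder_states)

    loss = nll + alpha * cont
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```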
Model Choice:
In principle, any sequence-to-sequence language model such as T5, BART, or even GPT-3 can serve as the generator by appending the encoded knowledge to the input context. The paper uses a T5 model in its experiments, but this can be substituted.
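For instance, with Hugging Face's T5 one can prepend knowledge embeddings to the token embeddings via `inputs_embeds`; in this hedged sketch the `knowledge_embs` tensor is a random placeholder standing in for the invariantly encoded subgraph:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

context = "User: Who directed Inception?"
inputs = tokenizer(context, return_tensors="pt")

# Token embeddings of the dialogue context.
token_embs = model.get_input_embeddings()(inputs.input_ids)      # (1, seq_len, d_model)

# Placeholder for the invariantly encoded subgraph: random vectors with the
# model's hidden size, purely for illustration.
knowledge_embs = torch.randn(1, 4, model.config.d_model)

# Prepend the knowledge embeddings to the token embeddings and extend the mask.
inputs_embeds = torch.cat([knowledge_embs, token_embs], dim=1)
attention_mask = torch.cat(
    [torch.ones(1, knowledge_embs.size(1), dtype=inputs.attention_mask.dtype),
     inputs.attention_mask], dim=1)

labels = tokenizer("Christopher Nolan directed it.", return_tensors="pt").input_ids
outputs = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels)
loss = outputs.loss
```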
Benefits:
- Unified end-to-end training tying components
- Marginal likelihood aggregates generation quality over the retrieved subgraphs
- Modular architecture allows model extensibility
Results:
- Outperforms baselines in metrics measuring knowledge relevance
- Qualitative examples show more factual responses grounded in relevant knowledge
- Ablations validate importance of each component
The authors evaluate SURGE on the OpenDialKG and KOMODIS dialogue datasets, which provide knowledge graphs paired with dialogues.
Quantitative improvements:
- SURGE outperforms all baselines in knowledge-relevance metrics like the proposed KQA (Knowledge-Verifying QA) metric which measures factual correctness through an extractor.
- Achieves new state-of-the-art results on existing automatic metrics like BLEU, ROUGE, and F1, which assess language fluency.
Qualitative impacts:
- Examples show SURGE generates more informative, factual responses grounded in relevant knowledge from selectively retrieved subgraphs.
- Baselines often omit key facts or even hallucinate irrelevant statements despite having access to the full context.
Ablation studies:
- Removing components like contrastive learning significantly drops knowledge consistency metrics, showing the necessity of each module.
SURGE substantially improves knowledge relevance through targeted augmentation while retaining language fluency. The gains over both knowledge-unaware and knowledge-intensive baselines validate the benefits of selective subgraph retrieval and grounding.