Achieving Greater Self-Consistency in Large Language Models

Author: Murphy

Artificial intelligence software was used to enhance the grammar, flow, and readability of this article's text.

When LLMs are used to evaluate qualities like the correctness, accuracy, or relevance of a piece of text, consistency is paramount. If an LLM exhibits inconsistent judgements, then its evaluations become unreliable and untrustworthy.

If an LLM evaluates the reasoning quality of arguments, but contradicts itself by rating an invalid argument as more logically sound than a perfectly valid one, then it fails as an arbiter of reason. Its evaluations lose credibility due to the model's own lack of logical consistency.

When such inconsistencies appear, there is no stable basis for comparison between the LLM's assessments of different pieces of text. If the model arbitrarily contradicts itself, then sentences cannot be reliably ranked against one another based on the model's inconsistent scorings.

In essence, inconsistency destroys the grounds for comparison that evaluations aim to provide in the first place. If an LLM cannot demonstrate consistent application of assessment criteria, then using it to evaluate text loses all effectiveness and utility.

So, consistency in judgement and evaluation is mandatory for LLMs employed to score or judge textual qualities and features. Without a high level of stability in its assessments, grounded in a consistent understanding of concepts being evaluated, the basis for comparison falls apart when leveraging LLM output as a form of evaluation or scoring.

Sampling multiple solutions reveals that consistency between outputs strongly correlates with quality. However, existing consistency techniques rely on extracting and matching closed-form answers, restricting their applicability. This article explores methods to enhance self-consistency without such constraints, while also grounding decisions in real-world knowledge.


The Need for Self-Consistency

Despite rapid progress, logical failures and falsehoods continue hindering reliable reasoning in state-of-the-art models. For complex multi-step analysis or free-form generation, models often contradict themselves or invent unsupported facts.

This manifests in two key ways: inconsistent open-ended generation and incoherent inferences. When performing open-ended tasks, models generate conflicting outputs when sampled multiple times on the same input. Meanwhile, in chained reasoning, models draw irrational conclusions that violate basic transitive properties.

For example, a model may determine A > B and B > C in a ranking comparison, but then inaccurately assess C > A, resulting in circular contradictions. Such failures to preserve transitive coherence fundamentally undermine reliability.
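To make the issue concrete, a few lines of Python can detect such circular contradictions in a set of pairwise judgements. This is a minimal sketch that assumes the judgements are collected as (winner, loser) pairs; the data format and function name are illustrative only.

# Minimal sketch: detect circular contradictions in pairwise LLM judgements.
# Assumes judgements are collected as (winner, loser) pairs.
def find_contradictions(preferences):
    """Return items that end up transitively 'beating' themselves."""
    beats = {}
    for winner, loser in preferences:
        beats.setdefault(winner, set()).add(loser)

    # Compute the transitive closure with a simple fixed-point iteration
    changed = True
    while changed:
        changed = False
        for losers in beats.values():
            for loser in list(losers):
                for indirect in beats.get(loser, set()):
                    if indirect not in losers:
                        losers.add(indirect)
                        changed = True

    # An item that transitively beats itself sits inside a contradiction cycle
    return [item for item, losers in beats.items() if item in losers]

# Example: the model judged A > B and B > C, but also C > A
print(find_contradictions([("A", "B"), ("B", "C"), ("C", "A")]))  # ['A', 'B', 'C']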

Techniques like self-consistency help address inconsistency by sampling multiple candidate solutions and selecting outputs that display consensus. The key insight is that consensus acts as an effective proxy measure for quality and coherence.

However, existing self-consistency approaches rely on strict answer formats to allow extraction and comparison across responses. This significantly restricts their applicability to open-ended tasks.

Advancing internal self-consistency requires a two-pronged approach. First, consistency techniques like Universal Self-Consistency (USC, introduced below) eliminate format constraints to enable open-ended selection. Meanwhile, contrastive ranking methods enforce logical constraints on latent representations, ensuring models preserve ordering relationships during multi-step inference.

However, solely enhancing internal consistency without grounding in structured external knowledge remains insufficient for accurate reasoning. Language models lack the dynamic updates, logical expressiveness, and empirical verifiability needed to inform their interpretations.

Advancing reliability therefore necessitates improving self-consistency across diverse tasks while integrating structured repositories of world knowledge. Consistency techniques should operate flexibly across free-form outputs, coupled with frameworks leveraging rich semantic representations to inform decisions with logic-driven, externally-verified facts.

Pairing model-internal consensus with structured external grounding draws on the strengths of each approach to replicate multifaceted human reasoning. Only their orchestration can overcome the inherent limitations of either one, incrementally advancing AI capabilities towards diverse and fluid cognition.

Introducing Universal Self-Consistency

Universal Self-Consistency for Large Language Model Generation

Chen et al. (2023) propose Universal Self-Consistency (USC) to enable self-consistency without answer extraction across diverse applications. USC relies on the language model's own capability to select the most consistent response from the multiple candidate solutions it originally produced.

Specifically, USC concatenates all sampled responses into a single context, then constructs a prompt asking the model to choose the one with the highest consensus. Because this eliminates specialized aggregation logic, USC is applicable even to free-form generation tasks.
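As a rough illustration, the sketch below implements both USC steps with the openai Python client (v1+): sample several candidates at a non-zero temperature, then prompt the model to pick the most consistent one. The model name, sampling settings, and selection wording are illustrative placeholders rather than the exact choices from the paper.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def usc_select(question, n_samples=5, model="gpt-3.5-turbo-1106"):
    """Sample candidate answers, then ask the model for the most consistent one."""
    # Step 1: sample multiple candidate responses
    candidates = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model=model,
            temperature=0.8,
            messages=[{"role": "user", "content": question}],
        )
        candidates.append(resp.choices[0].message.content)

    # Step 2: concatenate the candidates and ask for the one with highest consensus
    numbered = "\n\n".join(f"Response {i + 1}:\n{c}" for i, c in enumerate(candidates))
    selection_prompt = (
        f"{numbered}\n\n"
        "Evaluate these responses and select the most consistent one "
        "based on majority consensus. Reply with the response number only."
    )
    choice = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": selection_prompt}],
    )
    return candidates, choice.choices[0].message.content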

Experiments demonstrate that USC matches the performance of standard self-consistency on mathematical reasoning and code-generation benchmarks that permit answer extraction. Crucially, USC also improves open-ended question answering, summarization, and creative writing, where existing techniques falter.

Augmenting Reasoning with External Knowledge

While USC offers framework-agnostic consistency, external knowledge remains vital for accurate and robust reasoning. Language models lack the dynamic updates, factual integrity, and logical expressiveness of curated knowledge repositories.

Knowledge graphs allow incorporating empirically-grounded details and interlinked factual relations, enabling more consistent interpretation of situations per the latest information. They facilitate accessing contextual empirical evidence to substantiate decisions instead of relying solely on innate biases embedded in model weights.

Additionally, knowledge graphs manage the evolution of facts over time, ensuring reasoning relies on the most up-to-date information. They also encapsulate domain logic, explicitly encoding rules, constraints, and ontological taxonomies. This enables sound deductive reasoning that cannot be derived from model training data alone.

Empirically, retrieval-augmented generation using external knowledge graphs demonstrates more consistent output by grounding responses in verified facts rather than hallucinations. Parallel querying of knowledge graphs – even duplicate copies – further improves consistency by accumulating evidence from multiple perspectives.

Orchestrating USC's structured reasoning with a retrieval-augmented system that leverages external knowledge yields a modular hybrid architecture. USC contributes the reasoning "backbone", while parallel retrieval supplies relevant factual details from trustworthy knowledge repositories, enhancing the consistency of interpretations.

Augmenting Consistency with Seeded Sampling

openai-cookbook/examples/Deterministic_outputs_with_the_seed_parameter.ipynb

In open-ended generation tasks, consistency between solutions strongly correlates with quality. However, large language models exhibit inherent randomness that produces variation across outputs.

To mitigate this, OpenAI's Chat Completions API provides a seed parameter that seeds the random number generator for deterministic sampling. Using the same seed and parameters will yield identical or very similar outputs each time.

This allows combining Universal Self-Consistency, which has the model select the most consistent response from candidates, with seeded sampling to evaluate the exact same responses across requests.
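A minimal sketch of that pairing, again with the openai Python client, is shown below. The seed values, model name, and prompt are arbitrary placeholders, and usc_select refers to the earlier sketch.

from openai import OpenAI

client = OpenAI()

def sample_with_seed(question, seed, model="gpt-3.5-turbo-1106"):
    """Request a (mostly) reproducible completion by fixing the seed parameter."""
    resp = client.chat.completions.create(
        model=model,
        seed=seed,          # same seed + same parameters -> repeatable sampling
        temperature=0.8,
        messages=[{"role": "user", "content": question}],
    )
    # system_fingerprint changes when the backend changes, which can break determinism
    print("system_fingerprint:", resp.system_fingerprint)
    return resp.choices[0].message.content

# Build a re-creatable pool of candidates that a USC-style selection step
# (e.g. usc_select above) can then evaluate across runs
question = "Summarize the argument for grounding LLM reasoning in external knowledge."
candidates = [sample_with_seed(question, seed=s) for s in (11, 22, 33)]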

There remains a small chance of variation due to non-determinism in the underlying systems. Additionally, a changed system_fingerprint value indicates backend updates that can affect determinism across requests.

Besides seeding, factors like lower temperature and careful prompt engineering also reduce variability in model-generated text. Overall, seeded sampling lets us harness consistency for improved reliability.

By choreographing seeded candidates with USC-based selection, we construct solutions methodically grounded in both external knowledge and sampling determinism, yielding emergent capabilities that surpass either technique on its own.

Contrast-Consistent Ranking (CCR)

While previous sections focus on classification and open-ended generation, consistency in rankings elicited from language models is also vital. Recent work explores contrast-consistent ranking (CCR), which adapts Contrast-Consistent Search (CCS) to impose logical constraints on mapping item representations to a consistent scale.

CCR probes item vectors to find this consistent ranking direction without supervision. Experiments on prompting baselines and CCR variants demonstrate improved consistency and generalization.

By extending consistency-based techniques like USC to rankings, CCR offers a way to limit unpredictability and align model outputs across diverse tasks. Both aim to improve reliability by selecting solutions or rankings judged internally coherent by the model itself.

Unsupervised Contrast-Consistent Ranking with Language Models

CCR probing trains an additional "probe" model to find a latent ranking direction in the vector space of a fixed language model.

  • It does not directly prompt or finetune the language model itself.
  • The language model is simply used to generate vector representations of items through input prompts.
  • These item vectors are then fed into the CCR probe, which is trained on an unsupervised loss function to map representations onto a consistent ranking scale.

So in summary, CCR probing introduces an external probe model that wraps the language model to perform ranking in its vector space. The language model remains static. Only the additional probe model is trained with contrastive objectives to uncover inherent rankings.
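To make the setup more tangible, here is a heavily simplified PyTorch sketch of a probe trained with a CCS-style consistency-plus-confidence loss, the building block that CCR adapts for rankings. The actual CCR objectives in the paper are more elaborate, and pos_reps / neg_reps are assumed to be precomputed hidden states of each item under two contrasting prompt framings; the language model itself stays frozen.

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Maps frozen LLM representations to a scalar score in (0, 1)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_style_loss(p_pos, p_neg):
    # Consistency: the two contrasting framings should make complementary predictions
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourage the degenerate p_pos = p_neg = 0.5 solution
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_probe(pos_reps, neg_reps, epochs=100, lr=1e-3):
    """Train only the probe; the underlying language model is never updated."""
    probe = LinearProbe(pos_reps.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_style_loss(probe(pos_reps), probe(neg_reps))
        loss.backward()
        opt.step()
    return probe  # the probe's scores induce a ranking over items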

Technical Architecture

Achieving Structured Reasoning with LLMs in Chaotic Contexts with Thread of Thought Prompting and…

We implement a RAG system accessing multiple knowledge graphs indexed with vector search for fast fact retrieval. Query engines interface with the indexes and encapsulate passage search. Helper tools wrap the engines, facilitating integration. Separate agents house the tools and interface with the LLM. A super-agent oversees tool coordination.

The system leverages Thread-of-Thought (ToT) prompting for structured reasoning over retrieved passages. ToT guides the model through step-wise analysis, enhancing understanding. Parallel asynchronous retrieval allows simultaneously querying all graphs, accelerating context accumulation.
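As a small sketch of the parallel retrieval step, the snippet below fires all knowledge-graph queries concurrently with asyncio. It assumes each graph is exposed through a LlamaIndex query engine supporting aquery, as in the code later in this section; the engine names and question are placeholders.

import asyncio

async def query_all_graphs(query_engines, question):
    """Query every knowledge graph concurrently and collect their responses."""
    tasks = [engine.aquery(question) for engine in query_engines.values()]
    responses = await asyncio.gather(*tasks)
    # Accumulate the retrieved context from all graphs as one pool of evidence
    return {name: str(resp) for name, resp in zip(query_engines.keys(), responses)}

# Example usage (inside an async context):
# contexts = await query_all_graphs({"kg1": engine_1, "kg2": engine_2}, "What are assets?")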

Multimodal knowledge graphs using diverse algorithms and embeddings provide varied perspectives. Personalized PageRank traversal supports flexible inference along indirect connections. Approximate nearest neighbor search enables efficient lookups. Embeddings enable analogical reasoning via vector arithmetic.

We implement a staged retrieval approach, first querying a domain ontology to establish basic concepts and terminology related to the question. Ontologies provide formal definitions and high-level relationships between entities, allowing grounded understanding before retrieving full knowledge graphs.

The ontology results are appended to the original question to provide context. This enhanced query is then used to retrieve passages from the multiple knowledge graphs. Seeding retrieval with ontology information primes the RAG system for more consistent and relevant passages tailored to the grounded aspects.

This technique combines top-down hierarchical reasoning from the ontology with complementary bottom-up factual details from knowledge graphs. The ontology clarifies ambiguous or broad terminology, while knowledge graphs provide in-depth relational information on refined query focuses.

Choreographing different knowledge sources in this staged manner allows smoothly transitioning between levels of abstraction for reinforced consistency. Retrieval flows from high-level ontology concepts down to specific contextual passages pertinent for the question.

from llama_index.agent import OpenAIAgent
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.llms import OpenAI
from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1000)

# Your knowledge graphs (kg_index and kg_index2 are assumed to be KG indexes built earlier)
my_kgs = {'kg1': kg_index, 'kg2': kg_index2}

# Dictionary to store the agents
kg_agents = {}

# List to store the tools
kg_tools = []

for kg_name, kg in my_kgs.items():
    # Create a query engine for the KG
    query_engine = kg.as_query_engine(
        include_text=True,
        response_mode="tree_summarize",
        embedding_mode="hybrid",
        similarity_top_k=20,
    )

    # Create a tool for the query engine
    tool = QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name=f"tool_{kg_name}",
            description=f"Useful for questions related to {kg_name}",
        ),
    )

    # Add the tool to the list of KG tools
    kg_tools.append(tool)

    # Create an agent for the tool
    agent = OpenAIAgent.from_tools(
        [tool],
        system_prompt=(
            "Walk me through this context in manageable parts step by step, "
            "summarizing and analyzing as we go."
        ),
    )

    # Add the agent to the dictionary of KG agents
    kg_agents[kg_name] = agent

# Create the super agent
llm = OpenAI(model="gpt-3.5-turbo-1106")
super_agent = OpenAIAgent.from_tools(
    kg_tools,
    llm=llm,
    verbose=True,
    memory=memory,
    system_prompt="Therefore, the answer.",
)

# Initial question 
initial_question = "Define assets? And optimize them"

# Query the ontology with the initial question (ontology_engine is assumed to be a query engine built over the domain ontology)
ontology_results = ontology_engine.query(initial_question)
# Combine the initial question with the ontology results
combined_question = f"{initial_question}. {ontology_results}"
# Query the super agent with the combined question (run inside an async context)
response = await super_agent.astream_chat(combined_question)
# Collect all streamed tokens into a string
response_text = ""
async for token in response.async_response_gen():
    response_text += token
# Print the response text
print(response_text)

Impact

Combining USC and RAG complements consistency-based thinking with grounding in external knowledge. USC contributes the reasoning structure while RAG expands information breadth. Together these offset LLM limitations to better replicate human cognition.

This orchestration also enhances accuracy, speed, and coverage. Retrieved facts fill knowledge gaps for sound decisions. Parallel knowledge access accelerates understanding. Different knowledge graphs broaden conceptual connections considered.

Through modular augmentation, we gracefully scale emergent capabilities beyond inherent model aptitudes. As LLMs and knowledge bases mature, this composable paradigm will facilitate progressively advancing AI reasoning abilities.
