The Long and Short of It: Proportion-Based Relevance to Capture Document Semantics End-to-End



Dominant search methods today typically rely on keyword matching or vector-space similarity to estimate relevance between a query and documents. However, these techniques struggle when the search query is itself an entire file, paper, or even book.

Keyword-based Retrieval

While keyword searches excel at short lookups, they fail to capture the semantics critical for long-form content. A document discussing "cloud platforms" may be missed entirely by a query seeking expertise in "AWS". Exact term matching frequently runs into vocabulary mismatch in lengthy texts.

Vector Similarity Search

Modern embedding models like BERT condense meaning into hundreds of numerical dimensions, enabling accurate estimates of semantic similarity. However, transformer architectures with full self-attention don't scale beyond 512–1024 tokens because computation grows quadratically with sequence length.

Without the capacity to ingest full documents, the resulting truncated, "bag-of-words"-like embeddings lose the nuances of meaning spread across sections. The context gets lost in abstraction.

The prohibitive compute complexity also restricts fine-tuning on most real-world corpora, limiting accuracy. Unsupervised learning offers an alternative, but solid techniques are lacking.

In a recent paper, researchers address exactly these pitfalls by re-imagining relevance for ultra-long queries and documents. Their innovations unlock new potential for AI document search.

The Trouble with Long Documents

Dominant search paradigms are ineffective for queries that run to thousands of words. Key issues include:

  • Transformers like BERT have quadratic self-attention complexity, making them infeasible for sequences beyond 512–1024 tokens. Sparse-attention alternatives compromise on accuracy.
  • Lexical models that match on exact term overlap cannot infer the semantic similarity critical for long-form text.
  • A lack of labelled training data for most domain collections necessitates unsupervised or minimally tuned approaches.
  • Long documents covering multiple sub-topics require models that can factor document structure into relevance judgments.

The RPRS method aims to tackle these weaknesses in current retrieval architectures.

Introducing the RPRS Model

The RPRS model computes relevance between a long query document and each candidate document using the proportions of matching sentences across the two texts.

The critical insight is that documents containing a relatively higher proportion of sentences similar to sentences from the query are likely more pertinent overall.

The approach consists of three key stages:

1. Sentence Encoding

  • Sentences from queries and candidate documents are encoded into vectors using SBERT, an efficient transformer architecture for sentence embeddings.
  • Because each sentence is encoded independently, SBERT sidesteps the quadratic attention cost, allowing full document lengths to be incorporated.
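
To make stage 1 concrete, here is a minimal sketch using the open-source sentence-transformers library; the model name and example sentences are illustrative choices, not necessarily those used in the paper.

```python
from sentence_transformers import SentenceTransformer

# Illustrative SBERT checkpoint; the paper may use a different variant.
model = SentenceTransformer("all-MiniLM-L6-v2")

query_sentences = [
    "The contract was terminated without notice.",
    "The court found the termination clause unenforceable.",
]
doc_sentences = [
    "Termination requires thirty days of written notice.",
    "The agreement covers software licensing terms.",
    "Either party may terminate for material breach.",
]

# Each sentence is encoded independently, so total document length
# never hits the transformer's per-input token limit.
query_emb = model.encode(query_sentences, convert_to_tensor=True)
doc_emb = model.encode(doc_sentences, convert_to_tensor=True)
```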

2. Most Relevant Sentence Sets

  • For each query sentence, find the n most similar candidate-document sentences based on their vector embeddings.
  • This yields, for every query sentence, a set of the most relevant document sentences.
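
A minimal sketch of this stage, continuing from the embeddings above; the value of n here is just a placeholder for what the paper treats as a tuned hyperparameter.

```python
from sentence_transformers import util

n = 3  # placeholder; the paper tunes this cut-off as a hyperparameter

# Cosine similarity between every query sentence and every document sentence.
sim = util.cos_sim(query_emb, doc_emb)  # shape: (num_query_sents, num_doc_sents)

# For each query sentence, the indices of its n most similar document sentences.
top_n_idx = sim.topk(k=min(n, sim.shape[1]), dim=1).indices
```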

3. Proportion-based Relevance Scoring

  • Define the Query Proportion (QP): the relative proportion of query sentences that are similar to some document sentence.
  • Define the Document Proportion (DP): the relative proportion of document sentences that are similar to some query sentence.
  • Combine QP and DP into a final relevance score estimating the inter-relatedness of the two texts.

The proportional relevance concept intrinsically accounts for document structure within long-form text.

An extension called RPRS w/freq additionally factors in term frequency and length normalization, inspired by BM25, to handle repetition and length bias.
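
The paper's exact w/freq formulation is beyond this summary, but a BM25-style correction might look roughly like the sketch below, where repeated matches are saturated and normalized by length; the function form and parameter values are illustrative assumptions, not the paper's formula.

```python
def bm25_style_weight(match_count: float, doc_len: int, avg_len: float,
                      k1: float = 1.2, b: float = 0.75) -> float:
    """Illustrative BM25-style saturation with length normalization.

    An assumption about how repeated sentence matches and document
    length might be damped in RPRS w/freq, not the paper's exact formula.
    """
    norm = 1 - b + b * (doc_len / avg_len)
    return (match_count * (k1 + 1)) / (match_count + k1 * norm)
```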

The Proportional Relevance Formulation

At the heart of the RPRS method lies a simple yet powerful relevance scoring formula between a query document q and candidate document d:

RPRS_q(d) = QP(q, d) × DP(q, d)

Where:

  • QP(q, d) is the Query Proportion
  • DP(q, d) is the Document Proportion
  • RPRS_q(d) is the Proportional Relevance Score

These proportion factors aim to quantify the inter-relatedness between the query and candidate document texts from both perspectives.

Query Proportion

The Query Proportion determines what percentage of the query content is similar to some part of the document.

For each sentence q_s in query q, retrieve the top n most similar sentences d_si from document d. Then:

  • Count the number of query sentences that have at least one similar d_si
  • Divide by the total number of query sentences

QP(q, d) = matching query sentences / total query sentences

A higher QP indicates more of the query finds matches in the document.
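
As a sketch, QP can be computed from the similarity matrix built earlier; the threshold deciding when a top-n neighbour counts as "similar" is an illustrative assumption standing in for the paper's exact matching criterion.

```python
import torch

def query_proportion(sim: torch.Tensor, n: int = 3, threshold: float = 0.5) -> float:
    """QP: fraction of query sentences with at least one similar document
    sentence among their top-n neighbours. The threshold is an
    illustrative assumption, not the paper's exact criterion."""
    top_vals = sim.topk(k=min(n, sim.shape[1]), dim=1).values
    has_match = (top_vals >= threshold).any(dim=1)  # one flag per query sentence
    return has_match.float().mean().item()
```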

Document Proportion

The Document Proportion conversely determines what percentage of the document is similar to some part of the query.

For each sentence d_s in document d, check whether d_s occurs among the top n matches d_si for any query sentence q_s. Then:

  • Count the number of document sentences that match some q_s
  • Divide by the total number of document sentences

DP(q, d) = matching document sentences / total document sentences

A higher DP indicates more of the document content matches the query.
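
DP admits a mirror-image sketch over the same similarity matrix, with the same illustrative threshold assumption as above.

```python
import torch

def document_proportion(sim: torch.Tensor, n: int = 3, threshold: float = 0.5) -> float:
    """DP: fraction of document sentences that appear, above the same
    illustrative threshold, in some query sentence's top-n list."""
    top_vals, top_idx = sim.topk(k=min(n, sim.shape[1]), dim=1)
    matched = torch.zeros(sim.shape[1], dtype=torch.bool)
    matched[top_idx[top_vals >= threshold]] = True  # doc sentences hit by any q_s
    return matched.float().mean().item()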

Combining the Factors

The Proportional Relevance Score is then simply the product of Query Proportion and Document Proportion:

RPRS_q(d) = QP(q, d) × DP(q, d)

The net effect is to reward documents that both cover the query well and are well covered by it, indicating comprehensive semantic similarity from both directions.
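
Putting the two sketches together, the final score is just their product; continuing from the earlier snippets, here we score the example document against the long query.

```python
from sentence_transformers import util

def rprs(sim: torch.Tensor, n: int = 3, threshold: float = 0.5) -> float:
    """Proportional relevance score: QP x DP."""
    return (query_proportion(sim, n, threshold)
            * document_proportion(sim, n, threshold))

# Score one candidate document against the long query.
score = rprs(util.cos_sim(query_emb, doc_emb))
```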

Results on Legal, Patent and Wikipedia Datasets

The researchers comprehensively evaluated RPRS on five long-document datasets spanning legal case retrieval, patent search, and Wikipedia document similarity, with queries and documents running to thousands of words.

On all datasets, RPRS significantly outperformed the previous state of the art as well as lexical and neural baselines while using just three tuned parameters, demonstrating its effectiveness. A component-importance analysis further validated the proportional scoring approach.

The method combines the semantic matching capability of vector embeddings with an intuitive notion of topical relevance across sentences, yielding interpretable, high-accuracy retrieval.

Addressing Long-Standing Limitations

The RPRS model highlights that key ideas from classic retrieval, augmented with modern NLP representations, can push boundaries in challenging domains such as legal corpora and scientific literature, which have so far resisted high-performance automation.

In doing so, it also expands the scope of neural search to ultra-long text, where most mature models face limitations today. More broadly, designing architectures around basic principles of relevance, tailored to complex document collections, remains a fertile area for innovation in search technology.

The paper provides a compelling blueprint, but much room remains for integrating this approach with large language models, improving explainability, and advancing the user experience of commercial document search solutions.

