Case-Study: Multilingual LLM for Questionnaire Summarization

Author:Murphy  |  View: 24247  |  Time: 2025-03-22 20:36:53

An LLM Approach to Summarizing Students' Responses for Open-ended Questionnaires

Madrasa (מדרסה in Hebrew) is an Israeli NGO dedicated to teaching Arabic to Hebrew speakers. Recently, while learning Arabic, I discovered that the NGO has unique data and that the organization might benefit from a thorough analysis. A friend and I joined the NGO as volunteers, and we were asked to work on the summarization task described below.

What makes this summarization task so interesting is the unique mix of documents in three languages – Hebrew, Arabic, and English – while also dealing with imprecise transliterations between them.

A word on privacy: The data may include PII and therefore cannot be published at this time. If you believe you can contribute, please contact us.

Context of the Problem

As part of its language courses, Madrasa distributes questionnaires to students, which include both quantitative questions requiring numeric responses and open-ended questions where students provide answers in natural language.

In this blog post, we will concentrate on the open-ended natural language responses.

The Problem

The primary challenge is managing and extracting insights from a substantial volume of responses to open-ended questions. Specifically, the difficulties include:

Multilingual Responses: Student responses are primarily in Hebrew but also include Arabic and English, creating a complex multilingual dataset. Additionally, since transliteration is commonly used in Spoken Arabic courses, we found that students sometimes answered questions using both transliteration and Arabic script. We were surprised to see that some students even transliterated Hebrew and Arabic into Latin letters.

Nuanced Sentiments: The responses vary widely in sentiment and tone, including humor, suggestions, gratitude, and personal reflections.

Diverse Topics: Students touch on a wide range of subjects, from praising teachers to reporting technical issues with the website and app, to personal aspirations.

The Data

There are a couple of courses. Each course includes three questionnaires administered at the beginning, middle, and end of the course. Each questionnaire contains a few open-ended questions.

The tables below provide examples of two questions along with a curated selection of student responses.

Example of a question and student responses. LEFT: Original question and student responses. RIGHT: Translation into English for the blog post reader. Note the mix of languages, including Arabic-to-Hebrew transliteration, the variety of topics even within the same sentences, and the different language registers. Credit: Sria Louis / Madrasa
Example of a question and student responses. LEFT: Original question and student responses. RIGHT: Translation into English for the blog post reader. Note the mix of languages and transliterations, including both English-to-Hebrew and Hebrew-to-English. Credit: Sria Louis / Madrasa

There are tens of thousands of student responses for each question, and after splitting into sentences (as described below), there can be up to around 100,000 sentences per column. This volume is manageable, allowing us to work locally.

Our goal is to summarize student opinions on various topics for each course, questionnaire, and open-ended question. We aim to capture the "main opinions" of the students while ensuring that "niche opinions" or "valuable insights" provided by individual students are not overlooked.

The Solution

To tackle the challenges mentioned above, we implemented a multi-step natural language processing (NLP) solution.

The process pipeline involves:

  1. Sentence Tokenization (using NLTK Sentence Tokenizer)
  2. Topic Modeling (using BERTopic)
  3. Topic representation (using BERTopic + LLM)
  4. Batch summarization (LLM, with mini-batches sized to fit the context window)
  5. Re-summarizing the batches to create a final comprehensive summary.
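Steps 4 and 5 can be sketched as a map-reduce-style summarization: pack sentences into batches that fit the model's context budget, summarize each batch, then re-summarize the partial summaries. This is a minimal sketch under assumed names – `call_llm` is a hypothetical function wrapping whatever LLM API is used, and the character budget stands in for a real token budget.

```python
def chunk_by_budget(sentences, max_chars=8000):
    """Greedily pack sentences into batches that fit a context budget.

    max_chars is a crude stand-in for a token budget; a real pipeline
    would count tokens with the model's tokenizer.
    """
    batches, current, size = [], [], 0
    for s in sentences:
        if current and size + len(s) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(s)
        size += len(s)
    if current:
        batches.append(current)
    return batches


def summarize_topic(sentences, call_llm, max_chars=8000):
    """Map-reduce summarization: summarize each batch, then re-summarize.

    call_llm is a hypothetical callable taking a prompt string and
    returning the model's text response.
    """
    partials = [
        call_llm("Summarize the following student responses:\n" + "\n".join(batch))
        for batch in chunk_by_budget(sentences, max_chars)
    ]
    if len(partials) == 1:
        return partials[0]
    return call_llm("Combine these partial summaries into one:\n" + "\n".join(partials))
```

Running this per topic (rather than over all responses at once) is what keeps niche topics from being drowned out in the final summary.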

Sentence Tokenization: We use NLTK to divide student responses into individual sentences. This process is crucial because student inputs often cover multiple topics within a single response. For example, a student might write, "The teacher used day-to-day examples. The games on the app were very good." Here, each sentence addresses a different aspect of their experience. While sentence tokenization sometimes results in the loss of context due to cross-references between sentences, it generally enhances the overall analysis by breaking down responses into more manageable and topic-specific units. This approach has proven to significantly improve the end results.

NLTK's Sentence Tokenizer ([nltk.tokenize.sent_tokenize](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html)) splits documents into sentences using linguistic rules and models to identify sentence boundaries. The default English model worked well for our use case.

Topic Modeling with BERTopic: We utilized BERTopic to model the topics of the tokenized sentences, identify underlying themes, and assign a topic to each sentence. This step is crucial before Summarization for several reasons. First, the variety of topics within the student responses is too vast to be handled effectively without topic modeling. By splitting the students' answers into topics, we can manage and batch the data more efficiently, leading to improved performance during analysis. Additionally, topic modeling ensures that niche topics, mentioned by only a few students, do not get overshadowed by mainstream topics during the summarization process.

BERTopic is an elegant topic-modeling tool that embeds documents into vectors, clusters them, and models each cluster's representation. Its key advantage is modularity, which we utilize for Hebrew embeddings and hyperparameter tuning.

The BERTopic configuration was meticulously designed to address the multilingual nature of the data and the specific nuances of the responses, thereby enhancing the accuracy and relevance of the topic assignment.

Specifically, note that we used a Hebrew sentence-embedding model. We considered word-level embeddings, but sentence embeddings proved to capture the information we needed.

For dimensionality reduction and clustering we used BERTopic's standard models, UMAP and HDBSCAN, respectively; after some hyperparameter tuning, the results were satisfactory.

Here's a fantastic talk on HDBSCAN by John Healy, one of the authors. It's not just very educational; the speaker is really funny and witty! Definitely worth a watch.

Tags: Hebrew Llm Applications Multilingual Summarization Topic Modeling
