Use GPT Models to Generate Text Data for Training Machine Learning Models

Author: Murphy  |  2025-03-23
Photo by Claudio Schwarz on Unsplash

Motivation

Data are fundamental to building Machine Learning models, yet text data for training Machine Learning models are difficult to collect for the following reasons:

  • Open-source text datasets are limited. Privacy rules and commercial confidentiality often restrict distribution of privileged data. In addition, publicly available datasets may not be licensed for commercial use, or, more critically, may not be contextually relevant. For example, IMDB movie reviews are unlikely to be meaningful for analysing customer sentiment towards Banking products.
  • Machine Learning models typically need a large amount of training data to perform well. It may take a company, particularly a start-up, considerable time to collect a credible volume of text data. In addition, these data may not have been labelled with a response variable for a specific Machine Learning task. For example, a company may have been collecting customer complaints verbatim, but may not necessarily have a granular understanding of the topics or sentiments of these complaints.

How can we overcome the above constraints and generate fit-for-purpose text data in a scalable and cost-effective way? Given the recent advances in Large Language Models and Generative AI, this article* provides a tutorial on generating synthetic text data by calling OpenAI's GPT model suites in Python.

To demonstrate, let's explore a use case of generating customer complaints data for an insurance company. With enriched text data for training language models, the company could potentially achieve better customer outcomes through stronger performance on Natural Language Understanding tasks such as categorising complaints into topics or scoring complainant sentiment.

*This article is 100% ChatGPT-free.

Prerequisite: Setting up an OpenAI API key

To be able to call the GPT models, simply register an account with OpenAI and access the API key under User Settings. Make sure to keep this key private.

Note that depending on usage, accessing GPT models comes with a cost, although this hasn't been material for myself (<$0.08 USD for preparing this tutorial).
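Rather than hardcoding the key in a script, a safer pattern is to read it from an environment variable. A minimal sketch, assuming the key has been exported under the (conventional, not mandatory) name OPENAI_API_KEY:

```python
import os

# Read the API key from an environment variable so it never appears
# in source code or notebooks committed to a repository.
api_key = os.environ.get("OPENAI_API_KEY", "")

if not api_key:
    print("OPENAI_API_KEY is not set; set it before calling the API.")
```

The key can then be passed to the openai library instead of pasting the string into the script directly.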

Step 1: Load the required Python packages

# !pip install --upgrade openai
# if you haven't already, make sure you install the package
# and upgrade to the latest version

import os
import openai

import pandas as pd
import numpy as np

Step 2: Generate a single customer complaint

Let's start by generating one customer complaint data point using the text-davinci-003 model under the GPT-3.5 model suites.

openai.api_key = "INSERT YOUR API KEY STRING HERE"

prompt_text = ("You are a customer of a life insurance company."
               "\n\nYou are extremely dissatisfied with the time it takes to assess your life insurance claim. It has been a horrible customer experience."
               "\n\nWhat would you say in a complaint you are going to raise against the life insurance company?")

complaint = openai.Completion.create(

    engine = "text-davinci-003",
    prompt =  prompt_text,
    temperature = 0.6,
    top_p = 1,
    max_tokens = 1000,
    frequency_penalty = 0,
    presence_penalty = 0

    )

print(complaint.choices[0].text)

Note that with respect to the code above:

  • You need to insert your private API key string in the first line of the code.
  • The prompt_text helps the GPT model understand its role and thereby generate a customer complaint by responding to the question prompted.
  • Other parameters referenced in the code (temperature, top_p, max_tokens, frequency_penalty and presence_penalty) are explained in more detail in this section of my article.

The customer complaints generated by the above code are:

I am writing to express my extreme dissatisfaction with the life insurance company I am a customer of. The time it has taken to assess my life insurance claim has been unacceptable and has had a horrible impact on my customer experience. I find it unacceptable that I have been waiting for so long to receive a response to my claim. I urge the company to take immediate action to address this issue and to ensure that all customers receive prompt and satisfactory responses in the future.

This seems reasonable and coherent at first glance.

Step 3: Generate customer complaints at scale

You may argue that you can simply replicate Step 2 by entering the prompt_text in ChatGPT. You are definitely right if you simply wish to generate a limited number of data points. However, it's not feasible to repeat the exercise manually on the ChatGPT front-end for large-scale text data generation. How do we then automate this task (ultimately scaling up the operation of generating customer complaints)? The answer lies in just a slight tweak to the code in Step 2.

By design of the GPT models as well as the nature of the temperature parameter which determines the creativity and diversity of the generated texts, each run of the code in Step 2 generates a different customer complaint. Given this, we just need to set up a loop for running the code in Step 2 n times and storing the output of each of these runs.

To demonstrate, the code below creates a loop for generating n = 3 complaints and storing the output in a dataframe:

prompt_text = ("You are a customer of a life insurance company."
               "\n\nYou are extremely dissatisfied with the time it takes to assess your life insurance claim. It has been a horrible customer experience."
               "\n\nWhat would you say in a complaint you are going to raise against the life insurance company?")

text_gen = []

for i in range(0, 3):
  completion = openai.Completion.create(
      engine = "text-davinci-003",
      prompt = prompt_text,
      max_tokens = 120,
      temperature = 0.6,
      #top_p = 1,
      frequency_penalty = 0,
      presence_penalty = 0

      )

  text_gen.append(completion.choices[0]['text'])
  print('Generating complaints number %i'%(i))

text_gen

The snippet below shows the 3 customer complaints generated. Needless to say, the parameter n can be set to a number of your choice.
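To actually store the generated complaints in a dataframe as described, the list populated by the loop can be wrapped in a pandas DataFrame once the loop finishes. A minimal sketch, using placeholder strings in place of real API output:

```python
import pandas as pd

# Placeholder for the text_gen list populated by the generation loop above;
# in practice this list would hold the API responses.
text_gen = [
    "Complaint about claim assessment delays.",
    "Complaint about poor communication.",
    "Complaint about premium increases.",
]

# One row per generated complaint, ready for labelling or model training
df_complaints = pd.DataFrame({"complaint_text": text_gen})
```

From here the dataframe can be saved to CSV or fed straight into a downstream training pipeline.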

Image 1: Generated customer complaints. Image by author.

More Advanced Use Cases

Zero-shot vs. Few-shot training for GPT models

Providing standalone text prompts to GPT models per the use case above is considered Zero-shot training. Text data generated via Zero-shot training can be slightly generic for a specific task.

In a scenario where we already have limited but meaningful training data and would like to generate additional training data that resembles the existing data, we can point the prompt_text input to the existing data. This provides the GPT model with Few-shot training.

For example, suppose we would like to Few-shot train a GPT model to generate texts resembling the data from the IMDB movie review dataset (hypothetically already stored by you in the df_imdb variable):

text_gen = []

for i in range(len(df_imdb)):

    prompt_text = f"The following are movie reviews with a positive sentiment. REVIEW: {df_imdb[i]} REVIEW:"

    review = openai.Completion.create(

        engine = "text-davinci-003",
        prompt =  prompt_text,
        temperature = 0.6,
        top_p = 1,
        max_tokens = 1000,
        frequency_penalty = 0,
        presence_penalty = 0

        )

    text_gen.append(review.choices[0].text)

Note that this code also implicitly creates a label for the generated text data. In this instance, we are creating ‘positive' reviews. Text data with a ‘negative' or ‘neutral' label can be generated in a similar manner. This gives us a labelled dataset without any manual effort!
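The implicit labelling described above can be made explicit by tagging each generated text with the sentiment used in its prompt. A minimal sketch with placeholder generated reviews standing in for real API output (the variable names here are illustrative):

```python
import pandas as pd

# Placeholder generated reviews per sentiment label; in practice each
# list would be populated by a generation loop using the matching prompt.
generated = {
    "positive": ["Loved every minute of it.", "A beautifully shot film."],
    "negative": ["The plot made no sense.", "Two hours I will never get back."],
}

# Flatten into a labelled dataset: one row per (text, label) pair
rows = [
    {"text": text, "label": label}
    for label, texts in generated.items()
    for text in texts
]
df_labelled = pd.DataFrame(rows)
```

The resulting dataframe is immediately usable as supervised training data for a sentiment classifier.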

Moreover, this code demonstrates that GPT models are able to generate data by recognising the pattern in the text prompt, without needing to be prompted with a question. In this instance, the model gives a movie review after being given the "REVIEW:" cue.

To improve resemblance to existing data, we can even Few-shot train the GPT models with more than one data point from the existing data. This can be easily done by updating the prompt_text input to:

prompt_text = f"The following are movie reviews with a positive sentiment. REVIEW: {df_imdb[i]} REVIEW: {df_imdb[i-1]} REVIEW:"
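More generally, the few-shot prompt can be assembled from any number of existing examples. A small helper along these lines (the function name and example strings are illustrative, not part of any library):

```python
def build_few_shot_prompt(examples, instruction):
    """Concatenate existing examples into a few-shot prompt, ending with
    an open 'REVIEW:' cue for the model to complete."""
    shots = " ".join(f"REVIEW: {ex}" for ex in examples)
    return f"{instruction} {shots} REVIEW:"

# Hypothetical existing reviews standing in for rows of df_imdb
examples = ["Great acting and pacing.", "A warm, funny story."]
prompt_text = build_few_shot_prompt(
    examples,
    "The following are movie reviews with a positive sentiment.",
)
```

Note that each extra example consumes tokens from the model's context window, so there is a practical ceiling on how many shots can be packed into one prompt.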

Other GPT models

So far, the text-davinci-003 GPT model has been used to demonstrate the use cases in this article. Other GPT models can also be called via the engine parameter as needed. For instance, some GPT models such as gpt-3.5-turbo are more powerful than others, but may need to be called differently in Python as they take a ‘dialogue' as input as opposed to a text string.

The code below shows a call to the gpt-3.5-turbo model for generating complaints data.


dialogue = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My life insurance company won't let me reinstate my policy after a couple of dishonours?"},
        {"role": "assistant", "content": "I'm sorry to hear that, is there anything I can help you with."},
        {"role": "user", "content": "Yes, you can draft me a complaint directed at the life insurance company, in no more than 100 words"}
    ]

text_gen_chat = []

for j in range(0, 3):

    res = openai.ChatCompletion.create(
        model = "gpt-3.5-turbo",
        messages = dialogue,
        temperature = 0.6
    )

    text_gen_chat.append(res.choices[0].message.content)
    print('Generating complaints data number %i'%(j))

text_gen_chat

And the output:

Image 2: Generated customer complaints. Image by author.

Risks

Whilst the use cases above demonstrate the ready-set-go implementation of GPT models (or, more broadly, Large Language Models), these models are relatively new, and users should be cautious about the known and unknown risks of integrating them into real-life data and ways of working. Specific to generating text data for training Machine Learning models, and putting my Risk and Governance hat on, below are the key (known) risks in my view when implementing the technology in practice:

  • There may be Privacy concerns with respect to providing certain types of input to GPT models via OpenAI's API endpoints. This is relevant when companies are leveraging proprietary data to generate augmenting text data (as discussed in the Advanced Use Cases above), particularly when it comes to providing input such as customer complaints which contain personal information. One potential mitigation is to de-identify personal information in a pre-processing step to ensure complaints data are used on an anonymous basis. In addition, companies using this kind of technology need to develop guardrails and policies governing the use of private and sensitive information.
  • For certain types of use cases, the text data generated by the GPT models may introduce Biases. For example, as GPT models were pre-trained on a large corpus of publicly available texts on the internet, for the task of generating both positive and negative customer feedback, they may be inherently biased towards the latter (i.e. customer complaints) under the assumption that these are more prominent on the internet. In addition, the texts generated may not be ‘fine-tuned' enough to cater for a specific product feature offered by a company, as related texts may not be available on the internet. This ultimately presents a trade-off between efficient text data generation and usability of such data.
  • With respect to Recency, although it's expected that the GPT models will continue to receive updates, at the time of writing most models were trained on internet data up to September 2021 (i.e. almost two years ago).
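The de-identification mitigation mentioned above can be approximated with a simple pre-processing step before any text leaves the company's environment. A minimal regex-based sketch that masks email addresses and phone-like numbers (illustrative only; a production system would use a dedicated PII-detection tool):

```python
import re

def deidentify(text):
    """Mask common PII patterns before sending text to an external API.
    This simple regex pass is a sketch; real deployments should use a
    purpose-built PII-detection library."""
    # Replace email addresses with a placeholder token
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    # Replace phone-like digit sequences (9+ digits, optional separators)
    text = re.sub(r"\+?\d[\d\s-]{7,}\d", "[PHONE]", text)
    return text

complaint = "Contact me at jane.doe@example.com or 0412 345 678 about my claim."
clean = deidentify(complaint)
```

Only the masked text would then be included in any prompt sent to the API.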

Other known risks associated with Large Language Models such as Hallucinations are less relevant in this context.

Concluding Thoughts

This article offers a practical (and generative) way around some of the constraints on accessing textual data for training Machine Learning models.

Taking a step back, with proper attention and care around risks, this article is an example of how users can access OpenAI's GPT models from the ‘back-end'. This allows users to unlock commercial opportunities for the Large Language Models backing ChatGPT, which was originally designed for individual ad-hoc use cases.


Are you a fan of these practical tutorials on Machine Learning related topics? As I ride the AI/ML wave, I enjoy writing and sharing step-by-step guides and how-to tutorials in plain language with ready-to-run code. If you would like to access all my articles (and articles from other practitioners/writers on Medium), you can sign up using the link here!

Tags: AI Data Science Machine Learning OpenAI
