How to Create a Powerful AI Email Search for Gmail with RAG
In this article, I will show you how to develop the MailDiscoverer application to search Gmail emails using RAG. First, I will show you how to set up the authentication pipeline to access users' emails (if consent is given). The emails are then embedded using an OpenAI text embedding model and stored in a Pinecone vector database. This allows a user to ask questions about their emails, and the RAG system will retrieve the most relevant emails and provide an answer to the question.

The application developed in this article can be found on Streamlit. My GitHub repository for this code is also available.
The video below showcases how the application works. After logging in and uploading your emails, you can ask a question about them, and the application will provide an answer and showcase the most relevant emails used to provide the answer.
Motivation
My motivation for this article is that I often search for old emails and spend a lot of time trying to find them. A solution to this problem is to develop a RAG system where you upload your emails, embed their text contents, and prompt the system to find specific emails. For example, I can ask: "Have I received an email from Microsoft lately?" and the system will find the most relevant emails and respond using the text in those emails. Additionally, each answer should include a link to open the relevant emails.
I have previously written an article on developing a similar tool, showing how to create a RAG system to access your data. This article differs in that I am directly accessing emails through the Google Gmail API (which is much more user-friendly than downloading and then uploading your emails). I am also using Pinecone for my vector database, which allows you to host the vector database on a server for free (up to a natural maximum usage limit).
Table of contents
· Motivation
· Table of contents
· Plan
· Add Gmail integration to access emails
∘ Gain access to Gmail API
∘ Logging in a user and obtaining consent
∘ Retrieving emails
∘ Addressing privacy concerns
· Storing the emails in a vector database
· Application performance
· Conclusion
Plan
The list below is my plan for developing this Streamlit application:
- Add Gmail integration so I can access users' emails
- Embed all emails and store them in a vector database for the user
- Add a prompt field for the user
- Respond to the user prompt with an answer, and the most relevant emails used to provide the answer
To follow this tutorial, you can download the following packages with pip:
openai
pinecone-client
google-api-python-client
google-auth
google-auth-httplib2
google-auth-oauthlib
streamlit
python-dotenv
tqdm
langchain
langchain-openai
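If you prefer a single command, the following should install everything listed above:
pip install openai pinecone-client google-api-python-client google-auth google-auth-httplib2 google-auth-oauthlib streamlit python-dotenv tqdm langchain langchain-openai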
Add Gmail integration to access emails
Gain access to Gmail API
The first step in developing the MailDiscoverer application is gaining access to a user's emails. For this, you can use the Google Gmail API, which lets users log into their Google account and consent to the application accessing their emails. Logging in through Google adds trust, and you do not have to develop the authentication pipeline yourself. Furthermore, the Google Gmail API is free up to a relatively large usage limit, making it a good option for accessing users' emails.
To have access to the Gmail API, you must:
- Go to the Google Cloud Console
- Activate the Gmail API
- Go to the "Credentials" page and add an OAuth 2.0 Client ID. Make it a web application, and set Authorized redirect URIs to http://localhost:8080 (note that the port here is important. Furthermore, when hosting your application, you will need to change this URI to match your application URI). NOTE: make sure to run Streamlit on the same port you write here (for example, to run Streamlit on port 8080: streamlit run main.py – server.port 8080)
- Download your credentials.json file (you will need it in your programming folder later).
- Go to the OAuth consent screen page and create your app. Fill in all the required information, and add the two scopes: https://www.googleapis.com/auth/userinfo.email (access the user's email address) and https://www.googleapis.com/auth/gmail.readonly (read the user's emails)
You should now have access to the Gmail API, and I will show you how to utilize it with Python.
Logging in a user and obtaining consent
To authenticate a user, I use the following code. First, define your imports:
import streamlit as st
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from google.auth.transport.requests import Request
import os
Then, define your constants. Remember to change the last four constants to match those in the credentials.json file you downloaded earlier.
MAIN_REDIRECT_URI = 'http://localhost:8080/'
SCOPES = ["https://www.googleapis.com/auth/gmail.readonly", "https://www.googleapis.com/auth/userinfo.email"]
PROJECT_ID = "xxx"
AUTH_URI = "xxx"
TOKEN_URI = "xxx"
AUTH_PROVIDER_X509_CERT_URL = "xxx"
# These two lists are referenced in the client config below; adjust them to match the
# redirect URIs and JavaScript origins you registered in the Google Cloud Console
ALL_REDIRECT_URIS = [MAIN_REDIRECT_URI]
ALL_JAVASCRIPT_ORIGINS = ["http://localhost:8080"]
I also import my Client ID and Client secret (note that these are sensitive variables, so I store them in the .streamlit/secrets.toml file, and when hosting my streamlit application, I make sure to define the variables in the secret variables section):
CLIENT_ID = st.secrets["GMAIL_API_CREDENTIALS"]["CLIENT_ID"]
CLIENT_SECRET = st.secrets["GMAIL_API_CREDENTIALS"]["CLIENT_SECRET"]
Then you can define the Client Config you need to authenticate a user (this config essentially verifies that you are the author of the application you registered on Google Gmail API):
CLIENT_CONFIG = {
    "web": {
        "client_id": CLIENT_ID, "project_id": PROJECT_ID, "auth_uri": AUTH_URI,
        "token_uri": TOKEN_URI, "auth_provider_x509_cert_url": AUTH_PROVIDER_X509_CERT_URL,
        "client_secret": CLIENT_SECRET, "redirect_uris": ALL_REDIRECT_URIS,
        "javascript_origins": ALL_JAVASCRIPT_ORIGINS,
    }
}
With Streamlit, you can now verify a user with (explanation of code is given below):
def get_user_info(creds):
    # Build the OAuth2 service to get user info
    oauth2_service = build('oauth2', 'v2', credentials=creds)
    # Get user info
    user_info = oauth2_service.userinfo().get().execute()
    return user_info.get('email')

def authorize_gmail_api():
    """Checks for saved credentials and, if needed, shows a Google authorization link."""
    creds = None
    if os.path.exists("token.json"):
        creds = Credentials.from_authorized_user_file("token.json", SCOPES)
        st.info("Already logged in")
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_config(CLIENT_CONFIG, SCOPES)
            flow.redirect_uri = MAIN_REDIRECT_URI
            authorization_url, state = flow.authorization_url(
                access_type='offline',
                include_granted_scopes='true',
                prompt='consent')
            # this is just a nice button with streamlit: a styled link that opens
            # the Google consent page (add your own CSS for the green styling)
            st.markdown(
                f"""<a href="{authorization_url}" target="_self">Authorize with Google</a>""",
                unsafe_allow_html=True
            )

def authenticate_user():
    """After logging in with Google, a code is present in the URL. This function retrieves
    the code, fetches the credentials, and authenticates the user."""
    auth_code = st.query_params.get('code', None)
    if auth_code is not None:
        # make a new flow to fetch tokens
        flow = InstalledAppFlow.from_client_config(CLIENT_CONFIG, SCOPES)
        flow.redirect_uri = MAIN_REDIRECT_URI
        flow.fetch_token(code=auth_code)
        st.query_params.clear()
        creds = flow.credentials
        if creds:
            st.session_state.creds = creds
            # Save the credentials for future runs
            with open("token.json", "w") as token_file:
                token_file.write(creds.to_json())
            st.success("Authorization successful! Credentials have been saved.")
            # get user email
            user_email = get_user_info(creds)
            st.session_state.user_email = user_email
            st.rerun()
        else:
            st.error("Could not log in user")

if st.button("LOGIN"):
    authorize_gmail_api()
if st.query_params.get('code', None):
    authenticate_user()
This was a lot of code, so let me explain. First, the get_user_info function returns the user's email address if they are logged in with valid credentials. Secondly, the authorize_gmail_api function first checks whether the user already has credentials (a token.json file). If so, we do not need to log the user in. Otherwise, I initialize a flow with my client config, where the flow is used to authenticate users against Google's authentication system. Using the flow, you retrieve the authorization URL, which takes the user to a Google page where they can log in and consent to the requested scopes (reading their email address and reading their emails). The st.markdown call at the end of this function renders a nice button in Streamlit that links to the authorization URL. After the user opens the authorization URL and logs in, they are returned to your app (at the redirect URI) with a code in the URL; this code is used to verify that the user is who they say they are.
The authenticate_user function retrieves the code from the URL (which appears after the user logs in and is redirected back to the application) and sends it to Google for verification. If Google accepts the code, you receive credentials in return, which you can store, and your user is authenticated. This allows your application to access the data they consented to share. The last two if statements in the code trigger the functions explained above.

You can log a user out with:
def logout(is_from_login_func=False):
    """Logs the user out by deleting the token and clearing session data."""
    st.query_params.clear()
    st.session_state.user_email = None
    st.session_state.creds = None
    if os.path.exists("token.json"):
        os.remove("token.json")
    if not is_from_login_func:
        st.success("Logged out successfully!")
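A minimal way to expose this in the UI is a simple button that calls the function (a sketch, assuming the session state set by the login code above):
if st.session_state.get("user_email"):
    if st.button("LOGOUT"):
        logout()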
Implementing this authentication flow was quite difficult, especially for production use (running it locally is easier, but the code above works in both local testing and production). But using the code above, you should be able to authenticate your users using their Gmail. Let me know if you have any questions regarding this part!
Retrieving emails
After authenticating the user, you can access their emails since you have read permissions. The functions below allow you to access a user's emails:
import base64
from tqdm import tqdm

MAX_CHARACTER_LENGTH_EMAIL = 10000  # max characters kept per email (pick a value that fits your embedding model)

def _get_email_body(msg):
    if 'parts' in msg['payload']:
        # The email has multiple parts (possibly plain text and HTML)
        for part in msg['payload']['parts']:
            if part['mimeType'] == 'text/plain':  # Look for plain text
                body = part['body']['data']
                return base64.urlsafe_b64decode(body).decode('utf-8')
    else:
        # The email might have a single part, like plain text or HTML
        body = msg['payload']['body'].get('data')
        if body:
            return base64.urlsafe_b64decode(body).decode('utf-8')
    return None  # In case no plain text is found

# Function to list emails with a max limit and additional details
def _list_emails_with_details(service, max_emails=100):
    all_emails = []
    # Fetch the first page of messages
    results = service.users().messages().list(userId='me', maxResults=max_emails).execute()
    messages = results.get('messages', [])
    all_emails.extend(messages)
    # Keep fetching emails until we reach the max limit or there are no more pages
    while 'nextPageToken' in results and len(all_emails) < max_emails:
        page_token = results['nextPageToken']
        results = service.users().messages().list(userId='me', pageToken=page_token).execute()
        messages = results.get('messages', [])
        all_emails.extend(messages)
        # Break if we exceed the max limit
        if len(all_emails) >= max_emails:
            all_emails = all_emails[:max_emails]  # Trim to max limit
            break

    progress_bar2 = st.progress(0)
    status_text2 = st.text("Retrieving your emails...")
    email_details = []
    for idx, email in tqdm(enumerate(all_emails), total=len(all_emails), desc="Fetching email details"):
        # Fetch full email details
        msg = service.users().messages().get(userId='me', id=email['id']).execute()
        headers = msg['payload']['headers']
        email_text = _get_email_body(msg)
        if email_text is None or email_text == "":
            continue
        if len(email_text) >= MAX_CHARACTER_LENGTH_EMAIL:
            email_text = email_text[:MAX_CHARACTER_LENGTH_EMAIL]  # Truncate long emails
        # Extract date, sender, and subject from headers
        email_data = {
            "text": email_text,
            'id': msg['id'],
            'date': next((header['value'] for header in headers if header['name'] == 'Date'), None),
            'from': next((header['value'] for header in headers if header['name'] == 'From'), None),
            'subject': next((header['value'] for header in headers if header['name'] == 'Subject'), None),
            "email_link": f"https://mail.google.com/mail/u/0/#inbox/{email['id']}"
        }
        email_details.append(email_data)
        progress_bar2.progress((idx + 1) / len(all_emails))  # Progress bar update
        status_text2.text(f"Retrieving email {idx + 1} of {len(all_emails)}")
    return email_details
The first function, _get_email_body, is a helper that retrieves the full text contents of an email (which may look confusing since the body has to be base64-decoded), but it works as intended. The second function first fetches a page of message IDs with the list call at the top of the function. Since the results are paginated, a nextPageToken is used to keep fetching until the maximum number of emails is reached. After the message IDs are stored in the all_emails variable, the function extracts the desired contents from each email: the text content, ID, date, sender, subject, and a link to open the mail. Additionally, I use a progress bar in Streamlit so users can see how long retrieving the emails will take.
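To call these functions, you need a Gmail service object built from the credentials obtained during login. A minimal sketch, assuming the credentials were stored in st.session_state.creds as in the authentication code above (build is the function imported earlier from googleapiclient.discovery):
creds = st.session_state.creds
# Build the Gmail API client from the user's OAuth credentials
service = build('gmail', 'v1', credentials=creds)
# Retrieve up to 100 emails with text, sender, date, subject, and a direct link
emails = _list_emails_with_details(service, max_emails=100)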
Addressing privacy concerns
Considering the privacy of users is critical when developing an application like this. When you want to deploy an application that utilizes users' emails, Google must verify your application, which includes showing how your application uses users' data, adding a privacy policy (mine is [here](https://maildiscoverer.streamlit.app/terms_of_service)), adding terms of service (mine is here), and guaranteeing that you will not misuse users' data. It is important to take privacy concerns seriously and treat users' data with care when developing an application such as the one discussed in this article. I have made sure to follow all of Google's requirements for handling sensitive information, and I naturally guarantee I will not misuse any of the data stored with this application.
In the next section, I will discuss storing the emails in a vector database. There are several elements you should take into consideration when doing this. First of all, disclose to your user that you are collecting and storing the information (for example, in the privacy policy). Secondly, make sure you safely store the information by either using a reputable database (I use Pinecone, a service I trust will treat the data properly), or if you are storing the data in your own hosted database, make sure the database is secure.
Storing the emails in a vector database
Since I have previously written about storing data in a vector database to quickly access it with a RAG system, I will not go in-depth on that topic in this article. I will, however, summarize how I approached the problem.
First, I decided to use Pinecone since it is a tool I am already familiar with. Plenty of other vector databases are out there, so feel free to choose the one you prefer. Pinecone is quite nice if you want to easily deploy your app, as you can quickly store your data in the cloud (with an index stored on the cloud). Furthermore, Pinecone has a good free tier, meaning you can use Pinecone a lot before you have to pay for it.
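If you go with Pinecone, creating the cloud-hosted index is a one-time step. Below is a minimal sketch, not the exact code from my repository: the index name "emails", the serverless region, and the secret name PINECONE_API_KEY are assumptions for illustration, and the dimension 1536 matches the text-embedding-3-small model used later.
from pinecone import Pinecone, ServerlessSpec
import streamlit as st

pc = Pinecone(api_key=st.secrets["PINECONE_API_KEY"])  # assumed secret name

# Create the index once; 1536 dimensions matches OpenAI's text-embedding-3-small
if "emails" not in pc.list_indexes().names():
    pc.create_index(
        name="emails",
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index("emails")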

After retrieving the emails as described in the previous section, I embedded them using the text-embedding-3-small embedding model from OpenAI. Again, you can use any model you desire here, but the one I used has solid results and is cheap. You then gather all the information you want to store in the vector database, which in this case is the email information (date, sender, subject, text) and the embedding.
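As a rough sketch of this step (again, not the exact code from my repository), embedding the retrieved emails and upserting them into the index created above could look like the following; the metadata layout, batch size, and helper name embed_and_store_emails are my own assumptions:
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_and_store_emails(email_details, user_email, batch_size=50):
    # Embed and upsert the emails in small batches to stay within request limits
    for start in range(0, len(email_details), batch_size):
        batch = email_details[start:start + batch_size]
        texts = [email["text"] for email in batch]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small", input=texts
        )
        vectors = []
        for email, item in zip(batch, response.data):
            vectors.append({
                "id": email["id"],
                "values": item.embedding,
                "metadata": {
                    "user_email": user_email,  # lets us restrict searches to this user's emails
                    "date": email["date"] or "",
                    "from": email["from"] or "",
                    "subject": email["subject"] or "",
                    "text": email["text"],
                    "email_link": email["email_link"],
                },
            })
        # index is the Pinecone index object created earlier
        index.upsert(vectors=vectors)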
To find relevant emails for a given query, you embed the query using the same embedding model you used for the emails. This embedding is then compared against the stored email embeddings in the vector database, and the top k most similar emails are returned, where k is a number you decide yourself. After retrieving the most relevant emails, you take the text from those emails and feed that text, together with the user's question, into an LLM such as GPT-4o mini, which will answer the user's question given the most relevant emails as context.
It is also important to ensure that when you store vectors in the database, you can separate information per user. In this case, the user's email address serves as a unique identifier (which is why we retrieved it during login), so make sure to store each email in the database together with the email address of its owner. When searching for the most relevant emails for a given query, you then only search among emails belonging to that user.
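Putting the retrieval and answering step together, here is a hedged sketch reusing the openai_client and index objects from the previous snippets; the prompt wording, function name, and top_k value are illustrative rather than the exact code from the repository:
def answer_question(question, user_email, top_k=5):
    # Embed the question with the same model used for the emails
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=[question]
    ).data[0].embedding

    # Retrieve the top-k most similar emails, restricted to this user's emails
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        filter={"user_email": {"$eq": user_email}},
    )

    # Build a context string from the retrieved emails
    context = "\n\n".join(
        f"From: {match['metadata']['from']}\nDate: {match['metadata']['date']}\n"
        f"Subject: {match['metadata']['subject']}\n{match['metadata']['text']}"
        for match in results["matches"]
    )

    # Ask GPT-4o mini to answer using only the retrieved emails as context
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer the user's question using only the provided emails."},
            {"role": "user", "content": f"Emails:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content, results["matches"]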

Application performance
I also want to include a section on how well the application performs. To do this, I will show you a couple of concrete examples of how the application can answer prompts from users. Unfortunately, I cannot share all my emails to verify the results, but I will show how the application correctly answers questions from the user. You are naturally welcome to try out the application yourself to verify that it works.
Test 1
For my first test, I prompted MailDiscoverer to see whether I had received any security alerts for my Google account lately (for example, whether a new device had logged in to my account). I entered the following prompt:
Have I got any security alerts lately?
The application responds with the following:
Yes, you have received several security alerts recently:
X (Twitter) Login Alerts:
On September 1, 2024, you received two alerts about logins to your account from new devices:
One from ChromeDesktop on Windows in Oslo, Norway.
Another from on iPhone in Oslo, Norway.
Google Account Access:
On September 7, 2024, you received a security alert from Google indicating that an application named maildiscoverer was granted access to your Google Account. If you did not grant this access, it is advised to check the activity and secure your account.
These alerts indicate potential security concerns that you should review and address if necessary.
This is correct; I got these alerts on those exact dates. I will note, however, that I had also received some GitHub Dependabot alerts (about outdated packages in my repositories), which the language model did not include in its response. However, these emails were included among the most relevant emails.
Test 2
I also want to test how well the application works to answer specific questions. AWS recently invited me to apply for 300 dollars in credits on their website. I thus prompted the application to find out how many dollars I could get from applying to AWS.
I prompt the model with the following:
How many dollars of credits on AWS have I been offered to apply for?
The model responds with:
You have been offered to apply for $300 in AWS credits.
Which is the correct answer.
Thoughts on tests
I applied two simple tests to verify the system's performance, and it performed well on both. Overall, from these tests and from testing the application during development, the application performs well and solves the issue I want to solve: accessing information in emails faster. Since I do not want to share all of my emails, it is difficult to make the tests fully transparent, but if you want to verify the application's performance, you are more than welcome to try it out yourself on the MailDiscoverer website.
Conclusion
In this article, I have discussed my application, MailDiscoverer, which makes emails more accessible to users. With the application, you can ask questions about your emails, and it will respond with an answer and links to the most relevant emails used to answer the question. First, I discussed my motivation for creating this application: I often spend a lot of time searching for old emails. Furthermore, I discussed how you can set up the Google Gmail API to access users' emails with their consent, and I showed how to retrieve those emails. I also discussed using a vector database like Pinecone to store the emails and quickly search for the most relevant ones for a given query. This can be used in a RAG system to provide users with answers to questions about their emails. Lastly, I applied two simple tests to verify the system's performance, and on the prompts I entered, the application successfully extracted the correct answers.