Using GPT-4 with Vision as an Art Critic

Minimalist painting of an AI art critic; image created using Midjourney and edited by the author

About two weeks ago, OpenAI started rolling out access to their GPT-4 model with vision (GPT-4V) [1]. As a subscriber to ChatGPT Plus, which costs US$20 per month, I got early access to the new model. The system is available here: chat.openai.com.

With GPT-4V, I can submit a text prompt with up to four images. The system will analyze the prompt and images and respond accordingly. Since I have been writing about the creative uses of AI and ML, my first instinct was to see if GPT-4V can effectively function as an art critic.
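The experiments in this article use the ChatGPT web interface, but the same kind of mixed text-and-image prompt can be assembled programmatically. Here is a minimal sketch, assuming the OpenAI Python SDK; the model name and image URL below are placeholders, and the helper just builds the multimodal message payload.

```python
# Sketch: assembling a text prompt plus images for a vision-capable GPT model.
# The URL and model name are placeholders, not real endpoints from the article.

def build_vision_message(prompt: str, image_urls: list[str]) -> dict:
    """Assemble a single user message mixing one text part and image parts."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls[:4]:  # GPT-4V accepts up to four images per prompt
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}

message = build_vision_message(
    "Based on the photo of this painting, write a review with "
    "constructive criticism.",
    ["https://example.com/warren-park-in-autumn.jpg"],
)

# The actual request would look roughly like this (requires an API key):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(model="gpt-4o", messages=[message])
# print(response.choices[0].message.content)
```

The commented-out call shows the general shape of the request; consult OpenAI's current API documentation for the exact model names and image-input formats supported.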

Here is my first experiment. I had the system review an acrylic painting of a New England landscape that I made back in 2002 with the help of a reference image, some custom software, and an overhead projector. I took a photo of my painting and sent it to GPT-4V for review.

ChatGPT User Interface (left), Warren Park in Autumn (right), Screenshot and Painting by Author

Here's my prompt, followed by GPT-4V's review.

RobG: Based on the photo of this painting, write a review with constructive criticism. Warren Park in Autumn Robert A. Gonsalves Acrylic on Canvas Board 24" x 32"

GPT-4V: "Warren Park in Autumn" by Robert A. Gonsalves is a mesmerizing depiction of nature's beauty in its prime. The artist's choice of vibrant autumnal colors captures the essence of the season, making the scene come alive with fiery oranges and rich browns. The pointillist technique employed lends a dreamy, almost ethereal quality to the piece, allowing viewers to feel as if they are being engulfed in a whirlwind of leaves.

Constructive Criticism:

  1. While the pointillism creates a captivating texture, a bit more variation in dot sizes might add further depth and realism to the foliage.
  2. The sky, although beautifully rendered with its cool blues, could benefit from a smoother transition or gradient to give it a more natural feel.
  3. The signature, "Robert Arthur 11/02", while important, appears slightly prominent. It might be more aesthetically pleasing if it were integrated more subtly into the painting.

Overall, this artwork is a brilliant representation of autumn, capturing the viewer's attention and transporting them to the heart of Warren Park.

The first part is quite flattering. I've noticed that ChatGPT generally offers upbeat responses. It pointed out some objective characteristics of my painting, like the pointillist style and autumnal colors. And I liked the enthusiasm when it said the painting gave viewers the feeling of "being engulfed in a whirlwind of leaves."

Details of Warren Park in Autumn, Foliage (left), Sky (middle), and Signature (right), Images by Author

In my prompt, I asked for some "constructive criticism," and it provided some. The note about the dot sizes was fair, but suggesting a smooth gradient in the sky seemed off, as pointillist paintings are typically made with only dots of color. However, I was impressed with the system's ability to read my signature and date since I chose colors with barely any contrast. And, yes, my signature is unusually large for the painting.


In this article, I will provide some background on the GPT-4V model and see how well it works when reviewing other forms of visual art, including multimedia installations and AI-generated art.

The GPT-4V Model

In March 2023, OpenAI launched GPT-4, with the non-vision variant offered to ChatGPT Plus users. Here's what the authors said about the vision feature.

GPT-4 accepts prompts consisting of both images and text, which – parallel to the text-only setting – lets the user specify any vision or language task. Specifically, the model generates text outputs given inputs consisting of arbitrarily interlaced text and images. Over a range of domains – including documents with text and photographs, diagrams, or screenshots – GPT-4 exhibits similar capabilities as it does on text-only inputs. – OpenAI [1]

Model Size and Architecture

Unlike with its previous model releases, OpenAI did not disclose much information about the size and architecture of its GPT-4 models. However, leaked reports suggest that GPT-4 has about 1.8 trillion parameters and uses a mixture-of-experts (MoE) architecture with 16 experts, each with about 111 billion parameters. For reference, the largest of the earlier GPT-3 models used 175 billion parameters [3].
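The mixture-of-experts idea can be illustrated with a toy example. In the sketch below (pure Python, purely illustrative, not the leaked GPT-4 design), a gating function scores each expert for a given input, and only the top-k scoring experts are actually run, so most of the model's parameters stay idle on any one token.

```python
import math

# Toy mixture-of-experts layer: route each input to the top-k of n experts.
# Illustrative only -- the actual GPT-4 architecture has not been published.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """Score experts with a linear gate, run only the top-k, mix their outputs."""
    scores = softmax([sum(w * xi for w, xi in zip(ws, x)) for ws in gate_weights])
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    norm = sum(scores[i] for i in top)
    # Weighted sum of the selected experts' outputs; the rest are never run.
    return sum(scores[i] / norm * experts[i](x) for i in top)

# Four tiny "experts," each just scaling the sum of the input vector.
experts = [lambda x, s=s: s * sum(x) for s in (0.5, 1.0, 2.0, 4.0)]
gate_weights = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.5, 0.5]]

y = moe_forward([1.0, 2.0], experts, gate_weights, k=2)
```

The key design point is sparsity: with 16 experts and top-k routing, only a fraction of the 1.8 trillion parameters participate in any single forward pass, which keeps inference cost far below what the total parameter count suggests.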

Training Process

OpenAI did not disclose much about what data they used to train the GPT-4 models. On their research site, they said that their GPT-4 base model was trained with a corpus of public and licensed data, encompassing diverse reasoning abilities, statements, and ideologies. To ensure the model's responses align with user intent, they performed fine-tuning via Reinforcement Learning with Human Feedback (RLHF).
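The first stage of RLHF is training a reward model on human preference pairs. A minimal sketch of that objective (pure Python, with made-up scalar scores): given the reward model's scores for a preferred and a rejected response, the pairwise loss pushes the preferred score above the rejected one.

```python
import math

# Pairwise preference loss used to train a reward model, the first stage of
# RLHF. Illustrative sketch only; the reward scores here are made-up scalars.

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): small when chosen >> rejected."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward model ranks the chosen response higher.
loose = preference_loss(0.1, 0.0)      # barely prefers the chosen response
confident = preference_loss(3.0, 0.0)  # strongly prefers it
```

In the second stage, the language model is fine-tuned with reinforcement learning to maximize this learned reward, which is what aligns the model's responses with user intent.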

Experimental Results

To test the models, OpenAI ran a diverse set of benchmarks, including simulating exams initially designed for humans. They did not run specific training for these exams. The models saw a minority of the problems in the exams during training; for each exam, they ran a variant with these questions removed and reported the lower score [1]. Here are some of the results.

Test Results of GPT Models, Data from OpenAI, Graph by Author

You can see that the GPT-4 models significantly outperformed the GPT-3.5 model on most of the tests. It's interesting that GPT-4V did better than the standard GPT-4 on some tests, like the GRE Quantitative test. The paper notes that OpenAI's researchers transcribed any images and diagrams in the tests as objectively as possible so that both the vision and non-vision models could use them [1]. However, they offered no insight into why GPT-4 with vision did better on these tests than the standard GPT-4 model. My hypothesis is that GPT-4V was better equipped to answer "word" problems that contained descriptions of images.

Limitations of the Models

The authors describe the shortcomings of GPT-4 and GPT-4V. They report that the models, despite their advancements, share some weaknesses with their predecessors: they aren't always reliable, can produce "hallucinations," have a restricted context window, and don't learn from experience. These limitations warrant caution, especially in high-stakes contexts. The models' capabilities and constraints introduce new safety concerns, emphasizing the need for research into their potential societal impact. The paper delves into potential risks, including bias, disinformation, and more, and outlines measures taken to minimize these risks, such as adversarial testing and a safety pipeline [1].

Initial Release of GPT-4V

Initially, the visual version, GPT-4V, was exclusively accessible through Be My Eyes, an app created by a Danish startup. This app aids individuals who are blind or have low vision by connecting them with volunteers for tasks like product identification and airport navigation. Harnessing the capabilities of GPT-4V, the startup introduced a "Virtual Volunteer" feature in the app, designed to match the understanding level of a human assistant [2].

OpenAI's Data Collection Policy

OpenAI changed its data collection policy. By default, your prompts and responses can be used by OpenAI to train future models. They previously used an online form where users could opt out of the data collection. With their new policy, data collection is now tied to chat history. If you turn off chat history, they will not use your prompts and responses to train their models. You can read about their new policy here.


Next, I'll show more experiments using GPT-4V as an art critic.

ChatGPT User Interface, Screenshot by Author

GPT-4V as an Art Critic

Can GPT-4V act as an art critic? I asked it this very question in the screenshot above, and it answered, "Absolutely!" and explained why it could. Let's see how it does in a few examples.

Swarming by Anna Kristina Goransson and Rob Gonsalves, Images by Author

Installation Art

The first example is a review of an interactive installation called "Swarming," which I co-created with my friend, the artist Anna Kristina Goransson. The installation showcases felted and dyed forms that act as homes for "boids," digital creatures that appear, via projection, to fly out of the forms, swarm around, and eventually return to them. Viewers can disrupt and redirect the boids through a Microsoft Kinect that tracks their movements.
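The boids behavior in the installation is a classic flocking algorithm (Craig Reynolds, 1987): each boid steers by three local rules, cohesion, alignment, and separation. Below is a minimal 2D sketch of the technique, not the installation's actual code; all parameter values are illustrative.

```python
import random

# Minimal 2D boids update: cohesion, alignment, and separation rules.
# A sketch of the flocking technique; not the installation's actual code.

def step(boids, cohesion=0.01, alignment=0.05, separation=0.05, min_dist=1.0):
    """Advance every boid one tick; boids are dicts with position and velocity."""
    n = len(boids)
    cx = sum(b["x"] for b in boids) / n   # flock center of mass
    cy = sum(b["y"] for b in boids) / n
    avx = sum(b["vx"] for b in boids) / n  # average velocity
    avy = sum(b["vy"] for b in boids) / n
    for b in boids:
        # Cohesion: steer toward the flock's center of mass.
        b["vx"] += (cx - b["x"]) * cohesion
        b["vy"] += (cy - b["y"]) * cohesion
        # Alignment: nudge heading toward the flock's average velocity.
        b["vx"] += (avx - b["vx"]) * alignment
        b["vy"] += (avy - b["vy"]) * alignment
        # Separation: push away from any boid that is too close.
        for other in boids:
            if other is not b:
                dx, dy = b["x"] - other["x"], b["y"] - other["y"]
                if dx * dx + dy * dy < min_dist * min_dist:
                    b["vx"] += dx * separation
                    b["vy"] += dy * separation
    for b in boids:
        b["x"] += b["vx"]
        b["y"] += b["vy"]

random.seed(0)
boids = [{"x": random.uniform(-10, 10), "y": random.uniform(-10, 10),
          "vx": 0.0, "vy": 0.0} for _ in range(20)]
for _ in range(50):
    step(boids)
```

In an installation like "Swarming," viewer positions from the Kinect would presumably feed in as an extra repulsion term, letting people scatter the flock by moving through the space.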

I sent GPT-4V the three images and a description of the installation to get a review. Here's how the interaction went.

RobG: Based on the photos and description of this interactive installation, write a review with constructive criticism. [description]

GPT-4V: Review of "Swarming" by Anna Kristina Goransson and Rob Gonsalves

"Swarming" serves as an ambitious testament to the possibilities of blending tactile artistry with the immersive capabilities of modern technology. The vividly hued felt sculptures, designed by Anna Kristina Goransson, lend a whimsical texture to the exhibit. These organic forms juxtaposed against their animated counterparts offer a surreal escape from the mundane, channeling a world where art literally comes to life.

The four modes of interaction – Sleeping, Scouting, Swarming, and Interacting – have been thoughtfully designed to give visitors a layered experience. The mere presence of an audience can change the installation's behavior, bridging the gap between passive viewing and active participation. The use of the Microsoft Kinect interface to track visitors' movements and trigger visual and auditory responses is both innovative and engaging. This immediate feedback loop allows visitors to feel truly integrated within the artwork.

However, there are areas that might benefit from further refinement:

  1. Ambient Sound: The silent modes, specifically Sleeping and Scouting, could potentially utilize subtle ambient sounds. This would enhance the experience by ensuring that the atmosphere doesn't feel too stark or empty during these quieter moments.
  2. Clarity on Interaction: A simple guide or indication on how visitors can transition the exhibit from one mode to another might be beneficial. This can help ensure that visitors can fully explore the depth of interactivity that "Swarming" offers.
  3. Lighting: Considering the importance of video projectors in this installation, special attention should be given to the ambient lighting of the exhibit room. Proper lighting can enhance the visibility of projections and ensure the felt sculptures are appropriately illuminated.
  4. Diversity in Animation: While the "boids" offer a dynamic element to the exhibit, introducing varied patterns or behaviors could further enrich the viewer's experience. This could be as simple as diversifying the colors or sizes of the boids.

In conclusion, "Swarming" is an evocative blend of traditional art and cutting-edge technology, offering a multi-sensory experience that is both innovative and nostalgic. With a few adjustments, this installation has the potential to captivate audiences even further, making each visit a unique journey of discovery.

GPT-4V's review of the piece was very positive. It started with a highly favorable assessment of the installation using words like "ambitious," "whimsical," "innovative," and "engaging." It's almost as if I paid it to write the review. Oh, wait. I did.
