The rapid evolution of artificial intelligence has brought us to the latest iteration of the GPT model, GPT-4o, promising significant advancements over its predecessors. Is it significantly better than GPT-4 and GPT-3.5? Let’s see what the results of the GPT-4o vs. GPT-4 vs. GPT-3.5 battle are!

GPT-4o, launched in May 2024, is OpenAI’s latest and most advanced LLM. The “o” in GPT-4o stands for “omni,” highlighting its ability to accept a diverse range of input types, including text, audio, images, and video. This multimodal capability enables a much more natural and seamless human-computer interaction. Remarkably, the model is capable of responding to voice inputs in as little as 232 milliseconds, averaging around 320 milliseconds, which is on par with human conversational response times – making it the fastest model available. Besides its enhanced model capabilities, GPT-4o is designed to be both faster and more cost-effective.

Benchmarks of GPT-4o vs. other models in text evaluation

Benchmarks suggest that this new version of the GPT outperforms previous models in various metrics, but evaluating its true capabilities requires more than just numbers. To understand its performance, we are testing it through a series of diverse and complex tasks. This hands-on approach will allow us to see how well the new model handles the specific use case examples and check if improvements on paper translate to practical benefits.

This comparison will provide valuable insights into whether the latest iteration of GPT truly represents a leap forward or if the hype around new releases sometimes overshadows actual performance improvements. It will help you assess which model is the best for your use case and, hopefully, inspire you to explore the nuanced strengths and potential limitations of the latest GPT model compared to its predecessors.

GPT-4o vs. Previous Models – the Experiments

Our evaluation focuses on several key areas: multimodality, understanding and reasoning capabilities, and practical applications in daily tasks such as writing emails or coding. By examining these aspects, we aim to determine whether the new model’s capabilities genuinely improve our workflows and problem-solving efficiency. 

OpenAI Models for Text Search (Complex Data Extraction Tasks)

In this experiment, we evaluated how different models handled a prompt requiring the extraction of a specific quote from an example text. The task was to determine why the tippler was drinking, based on the text from “The Little Prince,” and include the exact quote and its page number. The original input included a PDF document with the whole book. As GPT-3.5 is not able to analyze files in such a format, we attempted to copy-paste the text, but it exceeded the allowed context window.

Fragment of "The Little Prince" - a task for GPT-4o vs. previous GPT models in text search

GPT-3.5 was unable to load the document. When the content was copied and pasted instead, an error message indicated that the input exceeded the context length.

Accuracy: – (Unable to provide a response.)
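Before pasting a long document into a prompt, it can help to estimate whether it will fit at all. The sketch below is a rough check only: the ratio of about four characters per token is a rule of thumb for English text, not a real tokenizer, and the 16,000-token window and 500-token answer reserve are illustrative assumptions rather than the limits of any specific model.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate the token count of English text.

    The characters-per-token ratio is a rule of thumb; real counts
    depend on the model's tokenizer.
    """
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_window: int, reserved_for_answer: int = 500) -> bool:
    """Return True if the estimated prompt size leaves room for the answer."""
    return estimate_tokens(text) + reserved_for_answer <= context_window


# Placeholder input: 90,000 characters standing in for a long pasted document.
book_text = "x" * 90_000
print(estimate_tokens(book_text))          # 22500
print(fits_in_context(book_text, 16_000))  # False
```

When the check fails, the usual options are to split the document into chunks or to send only the relevant pages.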

GPT-4 was able to load the document. Here is its response:

Answer:

-> The tippler was drinking to forget that he was ashamed of his drinking habit.

Quote:

-> "So that I may forget," replied the tippler. "Forget what?" inquired the little prince, who already was sorry for him. "Forget that I am ashamed," the tippler confessed, hanging his head.

Page number:

-> 30

Accuracy: Correctly identified the reason for drinking and provided an accurate quote.
Metadata: Provided the approximate page number.

GPT-4o had no problem loading and analyzing the document. Here is its response to the given prompt:

Answer:

-> The tippler was drinking to forget that he was ashamed of drinking.

Quote:

-> "I am drinking," replied the tippler, with a lugubrious air. 

"Why are you drinking?" demanded the little prince.

 "So that I may forget," replied the tippler. 

"Forget what?" inquired the little prince, who already was sorry for him. 

"Forget that I am ashamed," the tippler confessed, hanging his head. 

"Ashamed of what?" insisted the little prince, who wanted to help him. 

"Ashamed of drinking!" The tippler brought his speech to an end, and shut himself up in an impregnable silence.

Page number:

-> Page 29

Accuracy: Provided a comprehensive quote that included additional context, which made the answer more complete.
Metadata: Correctly identified the page number.

Which GPT model is better for the text search?

| | GPT-3.5 | GPT-4 | GPT-4o |
|---|---|---|---|
| Summary | Failed to complete the task. | Successfully provided the response and quote but had a minor discrepancy in the page number. | Provided a detailed and accurate response with the correct metadata. |
| Accuracy | – (Unable to provide a response.) | 5/5 | 5/5 |
| Metadata | – | 4/5 | 5/5 |

GPT-4 and GPT-4o both provided correct answers and accurate quotes, but GPT-4o was slightly more comprehensive and consistent with the metadata. Therefore, GPT-4o is the top performer in this comparison.
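The accuracy and metadata checks in this experiment can also be done programmatically. The sketch below assumes the book's text is already available as a list of per-page strings (the `pages` list here is a two-page placeholder, not the real pagination) and looks for the page that contains a quoted passage after normalizing quotation marks and whitespace. It is one possible way to verify a model's answer, not the procedure used in the experiment.

```python
import re
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase, unify curly quotes, and collapse whitespace for comparison."""
    text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip().lower()


def find_quote_page(pages: List[str], quote: str) -> Optional[int]:
    """Return the 1-based number of the page containing the quote, else None."""
    needle = normalize(quote)
    for number, page_text in enumerate(pages, start=1):
        if needle in normalize(page_text):
            return number
    return None


# Placeholder pages for illustration only.
pages = [
    "The next planet was inhabited by a tippler.",
    'He asked why. "So that I may forget," replied the tippler.',
]
print(find_quote_page(pages, "\u201cSo that I may forget,\u201d  replied the tippler."))  # 2
print(find_quote_page(pages, "A sentence that is not in the book."))                     # None
```

A quote that spans a page break would not be found by this simple version; joining neighboring pages before searching is one way to handle that.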

OpenAI Models for Picture Analysis

In this experiment, we evaluated how different versions of GPT handled the task of identifying the number of people in an example picture.

GPT models' capabilities in image analysis tasks

GPT-3.5 with Image Inputs

GPT-3.5 was not capable of loading or providing any information regarding the picture.

Accuracy: – (Unable to provide a response.)

GPT-4 with Image Inputs

GPT-4’s expanded capabilities allow it to work with image inputs, with varying degrees of success. In this case, it was slightly mistaken.

-> There are five people in the picture: a man, a woman, two children, and a small dog walking along the beach.

Accuracy: This response indicates five people plus a dog. It provides a detailed description but appears to count the dog as one of the “people,” which is inaccurate if we strictly count only human individuals.

GPT-4o with Image Inputs

GPT-4o’s text and image capabilities were the first to be released, so it is no wonder that it handled the given task better than the previous models.

-> There are four people in the picture along with a dog.

Accuracy: This response clearly separates the count of people from the dog, indicating four people and one dog. This is a precise and clear response if we are counting only humans.

Which language model is better for picture analysis?

| | GPT-3.5 | GPT-4 | GPT-4o |
|---|---|---|---|
| Summary | Failed to complete the task. | Provided a detailed but slightly misleading answer by including the dog in the count of people. | Provided a clear and accurate response, correctly distinguishing between people and the dog. |
| Accuracy | – (Unable to provide an answer.) | 3/5 | 5/5 |

GPT-4o is the top performer in this comparison for accurately identifying the number of people and distinguishing them from the dog.

GPT Models for Content Creation

In this experiment, we evaluated how different versions of GPT models handled word manipulation tasks, such as drafting a professional email proposing a new marketing campaign for the next quarter. The task required covering the campaign goal, high-level concept, target audience, key strategies and tactics, timeline and budget estimate, expected measurable outcomes, value proposition, benefits, and a request for feedback. The email should be short, concise, professional, and persuasive, addressing relevant stakeholders.

Checking model behavior in email drafting

GPT-3.5 for Content Writing

Famous for its skills in text operations, the GPT-3.5 model was finally able to show its capabilities.

Subject: Proposal for Next Quarter Marketing Campaign

Dear [Stakeholder Name],

I hope this email finds you well. I am writing to propose an exciting new marketing campaign for next quarter aimed at achieving our overarching goal of increasing sales and enhancing brand recognition.

1. Campaign Goal: Our primary objective is to boost sales by 20% and elevate brand recognition within our target market.

2. High-level Concept and Target Audience: The campaign, titled "Unlock Your Potential," will focus on showcasing how our products/services empower customers to achieve their goals. Our target audience comprises tech-savvy millennials aged 25-35, primarily urban professionals seeking innovative solutions.

3. Key Strategies/Tactics: We will leverage a multi-channel approach, including social media advertising, influencer partnerships, engaging content creation (blogs, videos, infographics), and targeted email campaigns.

4. Timeline and Budget Estimate: The campaign will run for 12 weeks, commencing on July 1st, with an estimated budget of $150,000 allocated across channels.

5. Expected Measurable Outcomes: We anticipate a 20% increase in website traffic, a 15% rise in social media engagement, and a 10% growth in sales conversion rates. Additionally, we aim to acquire at least 5,000 new leads through the campaign.

6. Value Proposition and Benefits: "Unlock Your Potential" will position our brand as a catalyst for personal and professional growth, resonating with our audience's aspirations. By showcasing real-life success stories and actionable insights, we will reinforce our value proposition of delivering innovative solutions that drive tangible results.

7. Request for Feedback: Your input on the proposed campaign strategy, including any suggestions or concerns, would be invaluable as we finalize the plan. Please let us know your thoughts by [insert deadline].

Thank you for considering this proposal. I look forward to your feedback and the opportunity to collaborate on executing this impactful marketing initiative.

Best regards,

[Your Name]

[Your Position]

[Your Contact Information]

Clarity: The email is clear and detailed.
Professionalism: While professional, the tone could be more engaging to avoid sounding overly formal.
Persuasiveness: The language is persuasive but sometimes feels more like an ad than a genuine message.
Conciseness: The email is slightly long. While comprehensive, it could be more concise to hold the reader’s attention better.

GPT-4 for Content Creation

While GPT-3.5’s response wasn’t bad, the GPT-4 model seems to be a little better. Just like your mom’s friend’s son, who always scored that one extra point on the test.

Subject: Proposal for New Marketing Campaign Q3 - Your Input Needed

Dear [Stakeholder Name],

I hope this email finds you well. I am writing to propose a strategic marketing campaign for the upcoming quarter, aimed at enhancing our brand's visibility and driving significant growth. Below is an overview of the campaign:

Campaign Goal: Our primary objective is to increase sales by 15% while also acquiring new customers and improving brand recognition.

High-Level Concept & Target Audience: The campaign will focus on the theme "Empower Your Everyday" targeting young professionals aged 25-35 who value quality and innovation. This demographic frequently engages with digital content and values social validation.

Key Strategies/Tactics

1. Social Media Marketing: Utilizing platforms like Instagram and LinkedIn to reach our target audience with engaging content.

2. Influencer Partnerships: Collaborating with key influencers in the lifestyle and tech industries to enhance credibility and reach.

3. Content Marketing: Producing high-quality blog posts, videos, and interactive tools to educate and engage potential customers.

Timeline & Budget: The campaign will run for 3 months, starting from the beginning of Q3. We estimate a budget of $50,000 to cover all aspects of the campaign, including content creation, influencer fees, and ad spend.

Expected Measurable Outcomes: 

- A 15% increase in sales compared to the previous quarter.

- Acquisition of 10,000 new customers.

- A 20% boost in social media engagement and follower growth.

Value Proposition & Benefits: This campaign will position our brand as a leader in [Your Industry], directly contributing to increased market share and customer loyalty. By engaging with the target audience through preferred channels and influencers, we expect a high return on investment.

I would appreciate your feedback on this proposal. Are there any areas you would like us to enhance or additional insights you believe are crucial for the campaign's success?

Looking forward to your valuable input.

Best regards,

[Your Name]  

[Your Position]  

[Your Company]  

[Your Contact Information]

Clarity: Very clear and well-structured, covering all necessary points concisely.
Professionalism: Maintains a professional and respectful tone.
Persuasiveness: Strong use of persuasive language, particularly in the value proposition.
Conciseness: Well-balanced in terms of length and detail.

GPT-4o for Content Creation

Did “mom’s friend’s son” (a.k.a. GPT-4) finally find a worthy rival? Let’s see.

Subject: Proposal for New Marketing Campaign for Next Quarter

Dear Team,

I hope this email finds you well. I am excited to propose a new marketing campaign for the upcoming quarter aimed at boosting our sales and enhancing brand recognition. Below are the key details:

Campaign Goal:

Increase sales by 15% and improve overall brand recognition.

High-Level Concept and Target Audience:

We will focus on a "Summer Revival" theme, targeting millennials and Gen Z (ages 18-35) who are tech-savvy and active on social media.

Key Strategies/Tactics:

1. Social Media Campaigns: Leverage Instagram, TikTok, and Facebook with engaging posts, stories, and ads.

2. Influencer Partnerships: Collaborate with micro-influencers to authentically promote our products.

3. Content Marketing: Develop blog posts, videos, and infographics centered around summer trends and lifestyle.

4. Email Marketing: Send out personalized email campaigns to our existing customer base with special offers and discounts.

5. SEO and PPC: Enhance our website's SEO and run targeted PPC ads to capture search traffic.

Timeline and Budget Estimate:

- Planning and Preparation: June 1 - June 15

- Execution: June 16 - September 15

- Budget: $50,000 (detailed breakdown attached)

Expected Measurable Outcomes:

- Sales Increase: 15%

- Brand Recognition: 20% increase in social media followers and engagement

- Customer Acquisition: 10% growth in new customers

- Website Traffic: 25% increase in unique visitors

Value Proposition and Benefits:

This campaign will not only drive immediate sales but also strengthen our brand presence in the market. By tapping into the summer vibes and leveraging popular platforms, we can connect more deeply with our target audience.

Request for Feedback:

I would appreciate your feedback on this proposal by the end of the week. Your insights and suggestions will be invaluable to refining and finalizing our campaign plan.

Thank you for your time and consideration.

Best regards,

[Your Name]  

[Your Position]  

[Your Contact Information]

Clarity: Very clear, with distinct sections that enhance readability.
Professionalism: Professional tone, with a friendly but slightly overenthusiastic touch.
Persuasiveness: Persuasive and compelling, especially in the value proposition.
Conciseness: Concise and to the point, without missing any crucial details.

Do models from OpenAI have similar capabilities in writing emails?

Which language model is the best for email drafting?

| | GPT-3.5 | GPT-4 | GPT-4o |
|---|---|---|---|
| Summary | Good overall, but slightly lengthy. | Excellent performance, clear and concise, with a strong professional tone. | Great clarity and structure, maintaining a professional and engaging tone. |
| Clarity | 4/5 | 5/5 | 5/5 |
| Professionalism | 3/5 | 5/5 | 4/5 |
| Persuasiveness | 3/5 | 5/5 | 5/5 |
| Conciseness | 3/5 | 5/5 | 5/5 |

GPT-4 and GPT-4o both performed well, providing clear, concise, and persuasive emails that effectively addressed all required points. GPT-4 had a slight edge in professional tone, making it the top performer in this comparison.
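Summing the rubric scores from the comparison above makes the ranking explicit. The sketch below assumes all four criteria carry equal weight, which the evaluation does not state; with different weights the order between GPT-4 and GPT-4o could change.

```python
from typing import Dict, List, Tuple

# Scores taken from the email drafting comparison (each out of 5).
scores: Dict[str, Dict[str, int]] = {
    "GPT-3.5": {"Clarity": 4, "Professionalism": 3, "Persuasiveness": 3, "Conciseness": 3},
    "GPT-4": {"Clarity": 5, "Professionalism": 5, "Persuasiveness": 5, "Conciseness": 5},
    "GPT-4o": {"Clarity": 5, "Professionalism": 4, "Persuasiveness": 5, "Conciseness": 5},
}


def rank_models(rubric: Dict[str, Dict[str, int]]) -> List[Tuple[str, int]]:
    """Return (model, total score) pairs sorted from best to worst.

    Assumes equal weights for every criterion.
    """
    totals = {model: sum(criteria.values()) for model, criteria in rubric.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for model, total in rank_models(scores):
    print(f"{model}: {total}/20")
# GPT-4: 20/20
# GPT-4o: 19/20
# GPT-3.5: 13/20
```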

OpenAI Models for Math Problem Solving

In this experiment, we evaluated how different versions of GPT models handled an example math question involving an arithmetic sequence. The task was to find the equation of a sequence given the first, third, fifth, and seventh terms: 2, 8, 14, and 20, respectively, where n = 1 corresponds to the first term.

The correct answer is:

\[ f(n) = 3(n-1) + 2 \]

Correct Answer Explanation:

The key points in finding the correct equation:

1. The first term is 2.

2. The terms provided are not consecutive but are two steps apart.

3. The given terms differ by 6, but because they are two steps apart, the common difference is half of that (i.e., 6/2 = 3).

Using the general form of an arithmetic sequence \( a_n = a_1 + (n-1)d \) and the given terms, we derive:

\[ a_n = 2 + (n-1) \cdot 3 \]

\[ a_n = 2 + 3n - 3 \]

\[ a_n = 3n - 1 \]
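The same reasoning can be sketched in a few lines of Python. The function below is an illustrative helper (its name and interface are our own): it takes terms keyed by their 1-based position, divides the value gap by the index gap to get the common difference, and then checks every given term against the resulting formula.

```python
from fractions import Fraction
from typing import Dict, Tuple


def fit_arithmetic_sequence(terms: Dict[int, int]) -> Tuple[Fraction, Fraction]:
    """Return (first_term, common_difference) for terms keyed by 1-based index.

    Raises ValueError if the given terms do not lie on one arithmetic sequence.
    """
    indices = sorted(terms)
    if len(indices) < 2:
        raise ValueError("at least two terms are required")
    n0, n1 = indices[0], indices[1]
    # Divide by the index gap: terms two steps apart differ by 2 * d.
    d = Fraction(terms[n1] - terms[n0], n1 - n0)
    a1 = terms[n0] - (n0 - 1) * d
    for n in indices:
        if a1 + (n - 1) * d != terms[n]:
            raise ValueError("terms do not form an arithmetic sequence")
    return a1, d


first_term, difference = fit_arithmetic_sequence({1: 2, 3: 8, 5: 14, 7: 20})
print(first_term, difference)              # 2 3
print(first_term + (10 - 1) * difference)  # 29, i.e. 3 * 10 - 1
```

Using `Fraction` keeps the division exact, so a sequence with a non-integer common difference would also be handled without rounding.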

Let’s see how the models perform!

Which non-fine-tuned model is best for math problem solving?

GPT-3.5 for Math Problem Solving

In 2023, it was well known that large language models struggled with complex mathematical questions. To be fair, a fine-tuned model can solve math problems more accurately. In this experiment, however, we decided to stick to the base models. Here is how GPT-3.5 dealt with the task.

To find the equation of the arithmetic sequence, we first need to find the common difference (d) between consecutive terms. 

The common difference (d) can be calculated using the formula:

\[ d = a_{n+1} - a_n \]

Given the sequence: 2, 8, 14, 20

\[ a_1 = 2 \]

\[ a_2 = 8 \]

\[ a_3 = 14 \]

\[ a_4 = 20 \]

Using the formula, we can find the common difference:

\[ d = a_2 - a_1 = 8 - 2 = 6 \]

Now that we know the common difference (d), we can write the equation of the arithmetic sequence.

The general formula for the nth term \( a_n \) of an arithmetic sequence is:

\[ a_n = a_1 + (n-1)d \]

Substituting the values we have:

\[ a_n = 2 + (n-1) \cdot 6 \]

GPT-3.5 explained the process well but made a mistake in the calculation of the common difference:

\[ d = 6 \]

This led to an incorrect equation:

\[ a_n = 2 + 6(n-1) \]

Which simplifies to:

\[ a_n = 2 + 6n - 6 \]

\[ a_n = 6n - 4 \]

Correctness: The explanation was clear, but the calculation was incorrect due to a misunderstanding of the term spacing.
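A quick substitution shows where the formula breaks: it reproduces the first term but not the third, because the given terms were treated as consecutive.

```latex
\begin{align*}
a_n &= 6n - 4 \\
a_1 &= 6 \cdot 1 - 4 = 2 \quad \text{(matches the given first term)} \\
a_3 &= 6 \cdot 3 - 4 = 14 \neq 8 \quad \text{(the given third term is 8)}
\end{align*}
```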

GPT-4 for Math Problem Solving

Early experiments showed GPT-4 scoring highly on the SAT Math and AP Calculus tests and on undergraduate-level mathematics. The expectations are set high – let’s see if GPT-4 is able to fulfill them.

To find the equation of an arithmetic sequence where \( n = 1 \) corresponds to the first term, we start by noting the general form of an arithmetic sequence:

\[ a_n = a_1 + (n-1)d \]

where \( a_1 \) is the first term and \( d \) is the common difference.

Given that the first term \( a_1 = 2 \), we can use the information about the other terms to find \( d \).

- The first term \( a_1 = 2 \).

- The third term \( a_3 = 8 \).

- The fifth term \( a_5 = 14 \).

- The seventh term \( a_7 = 20 \).

The third term \( a_3 \) is given by:

\[ a_3 = a_1 + 2d = 8 \]

Substituting \( a_1 = 2 \), we get:

\[ 2 + 2d = 8 \]

Solving for \( d \):

\[ 2d = 8 - 2 = 6 \]

\[ d = \frac{6}{2} = 3 \]

Now, we have \( a_1 = 2 \) and \( d = 3 \). The general term of the sequence is:

\[ a_n = a_1 + (n-1)d \]

\[ a_n = 2 + (n-1) \cdot 3 \]

\[ a_n = 2 + 3n - 3 \]

\[ a_n = 3n - 1 \]

Thus, the equation of the arithmetic sequence is:

\[ a_n = 3n - 1 \]

GPT-4 provided a clear and correct explanation, accurately calculating the common difference and deriving the correct equation:

\[ d = 3 \]

Thus, the equation is:

\[ a_n = 2 + (n-1) \cdot 3 \]

\[ a_n = 2 + 3n - 3 \]

\[ a_n = 3n - 1 \]

Correctness: GPT-4 showed a comprehensive understanding and correctly derived the sequence equation.

GPT-4o for Math Problem Solving

Can GPT-4o get better results than GPT-4? Let’s check!

To find the equation of the arithmetic sequence where the first, third, fifth, and seventh terms are given, we start by recognizing the form of an arithmetic sequence:

\[ a_n = a + (n-1)d \]

Where \( a \) is the first term and \( d \) is the common difference.

Given:

- The first term \( a_1 = 2 \).

- The third term \( a_3 = 8 \).

- The fifth term \( a_5 = 14 \).

- The seventh term \( a_7 = 20 \).

First, we use the general formula of the arithmetic sequence to express these terms:

\[ a_1 = a = 2 \]

\[ a_3 = a + 2d = 8 \]

\[ 2 + 2d = 8 \]

\[ 2d = 6 \]

\[ d = 3 \]

Now, let's check the remaining terms to verify our common difference \( d \):

\[ a_5 = a + 4d = 2 + 4 \cdot 3 = 2 + 12 = 14 \]

\[ a_7 = a + 6d = 2 + 6 \cdot 3 = 2 + 18 = 20 \]

Thus, we have verified that the terms given follow the pattern with the first term \( a = 2 \) and common difference \( d = 3 \).

Therefore, the equation for the n-th term of the sequence is:

\[ a_n = 2 + (n-1) \cdot 3 \]

Simplifying this, we get:

\[ a_n = 2 + 3n - 3 \]

\[ a_n = 3n - 1 \]

Thus, the equation of the arithmetic sequence is:

\[ a_n = 3n - 1 \]

To summarize, the equation of the sequence where \( n = 1 \) corresponds to the first term is:

\[ a_n = 3n - 1 \]

Like GPT-4, GPT-4o explained its reasoning clearly. It also went a step further and verified the common difference by checking the additional terms. It correctly derived the sequence equation:

\[ d = 3 \]

Thus, the equation is:

\[ a_n = 2 + (n-1) \cdot 3 \]

\[ a_n = 2 + 3n - 3 \]

\[ a_n = 3n - 1 \]

Correctness: GPT-4o not only derived the correct equation but also verified the common difference, showing a robust approach and strong understanding capabilities.

Which model is better for math problem-solving?

| | GPT-3.5 | GPT-4 | GPT-4o |
|---|---|---|---|
| Summary | Good explanation but incorrect common difference calculation, leading to a wrong equation. | Correctly derived the equation with a clear explanation. | Correctly derived the equation and verified the common difference, showing thoroughness. |
| Explanation | 4/5 | 5/5 | 5/5 |
| Correctness | 2/5 | 5/5 | 5/5 |

GPT-3.5 explained the process but miscalculated the common difference, resulting in an incorrect equation. GPT-4 correctly identified the common difference and derived the correct equation with a clear explanation. GPT-4o provided the correct equation and verified the calculation through additional steps, demonstrating thoroughness. Overall, GPT-4 and GPT-4o excelled, with GPT-4o showcasing a more robust approach.

GPT Models for Programming

Do developers’ tasks require using a fine-tuned model? Not necessarily! In this experiment, we set out to see how well different versions of GPT could write a functioning Snake game. There were no specific requirements for resolution, color scheme, or collision mechanics. The main goal was to assess how each version of GPT handled this simple task with minimal intervention. Given the popularity of this particular programming problem, it’s likely that parts of the code might have been included in the training data for models, which might have introduced bias.

GPT-3.5 for Programming

The oldest architecture started well. It proposed a code structure, listed the requirements, and differentiated between the game’s key components, such as the snake and the food. However, the code didn’t run correctly from the start, which wasn’t entirely surprising. We refrained from fixing the code ourselves and instead copy-pasted the errors from the terminal into GPT-3.5. After a few iterations (around five), we encountered an issue with event handling that GPT-3.5 couldn’t resolve. With a minor tweak to the game loop, the game finally ran. Surprisingly, this version of Snake was quite enjoyable! The controls were smooth, the pace was just right, and it even allowed Pac-Man-style wall warping.

Snake game generated with GPT-3.5 base model

Code Generation: Required multiple iterations to resolve errors.
Gameplay: The final game was enjoyable.
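The "paste the error back into the model" workflow used in this experiment can be sketched as a small loop. This is an illustration under stated assumptions, not the procedure we ran: `ask_model` is a hypothetical callable that takes the current source code and the error text and returns a corrected version (in practice it would wrap a call to the model you are testing), and a script that is still running after the timeout is assumed to have started successfully, since a working game does not exit on its own.

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable


def run_script(path: str, timeout: float = 10.0) -> str:
    """Run a Python script and return its error output ('' if no crash was seen)."""
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Still running after the timeout: assume the program started correctly.
        return ""
    if result.returncode == 0:
        return ""
    return result.stderr or f"exit code {result.returncode}"


def repair_loop(path: str, ask_model: Callable[[str, str], str], max_rounds: int = 5) -> int:
    """Feed errors back to the model until the script runs or rounds run out.

    Returns the number of repair rounds used, or -1 if the script still fails.
    """
    for rounds_used in range(max_rounds + 1):
        errors = run_script(path)
        if not errors:
            return rounds_used
        if rounds_used < max_rounds:
            source = Path(path).read_text(encoding="utf-8")
            Path(path).write_text(ask_model(source, errors), encoding="utf-8")
    return -1
```

Counting the rounds gives a simple, repeatable measure of how much back-and-forth each model needs, although it says nothing about how the resulting game feels to play.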

GPT-4 for Programming

We repeated the process with GPT-4. To our surprise, it required only the first error from the terminal to fix all issues. The development was extremely fast, taking just two or three minutes. Additionally, GPT-4 included a scoring mechanism. However, this version didn’t feel as smooth as the one produced by GPT-3.5.

Snake game generated with GPT-4 large model

Code Generation: Fixed errors quickly and implemented a scoring mechanism.
Gameplay: Slightly less smooth gameplay.

GPT-4o for Programming

We had high expectations for the latest version. Unfortunately, the disappointment matched the anticipation. Initially, it created versions that completely ignored collisions between the snake and the food. Even when instructed to fix this issue, it couldn’t solve it. After two attempts, it produced a version that started with a game-over screen and couldn’t be played. Finally, on the fourth attempt, it created a game that ran without issues. However, the game feel was subjectively worse than in the versions produced by the two older models.

Snake game generated with GPT-4o base model

Code Generation: Struggled with collision handling and initial issues. Needed a few iterations.

Gameplay: The final game was less enjoyable.

Which language model is the best for programming?

| | GPT-3.5 | GPT-4 | GPT-4o |
|---|---|---|---|
| Summary | Required multiple iterations to resolve errors, but the final game was enjoyable. | Quickly resolved errors with slightly less smooth gameplay. | Struggled with initial issues and produced a less enjoyable game. |
| Code Generation | 3/5 | 4.5/5 | 2/5 |
| Gameplay | 5/5 | 4/5 | 3/5 |

In conclusion, the GPT-3.5 model, despite requiring more iterations, delivered the most enjoyable gameplay experience. GPT-4 provided a faster development process but with slightly less smooth gameplay. GPT-4o struggled the most with initial issues and produced a less enjoyable final product.
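For reference, the mechanics that tripped the models up (food collisions, wall warping, and scoring) fit into a single update step. The sketch below is our own minimal, rendering-free illustration on a grid; it is not code produced by any of the models, and all names in it are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def step(snake: List[Point], direction: Point, food: Point,
         width: int, height: int) -> Tuple[List[Point], bool, bool]:
    """Advance the snake by one cell; the head is snake[0].

    Returns (new_snake, ate_food, game_over). Walls wrap around
    (Pac-Man style); running into the snake's own body ends the game.
    """
    head_x, head_y = snake[0]
    dx, dy = direction
    new_head = ((head_x + dx) % width, (head_y + dy) % height)

    ate_food = new_head == food
    # When not eating, the tail cell is vacated this turn, so it is not an obstacle.
    body = snake if ate_food else snake[:-1]
    if new_head in body:
        return snake, False, True
    return [new_head] + body, ate_food, False


# The snake leaves the right edge, reappears on the left, and eats the food there.
snake = [(9, 5), (8, 5), (7, 5)]
score = 0
snake, ate, over = step(snake, (1, 0), food=(0, 5), width=10, height=10)
if ate:
    score += 1
print(snake, score, over)  # [(0, 5), (9, 5), (8, 5), (7, 5)] 1 False
```

Placing new food and drawing the board are left out; keeping the rules in a pure function like this makes them easy to test separately from the game loop and event handling.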

The comparison of the leading generative AI models

GPT-3.5 vs. GPT-4 vs. GPT-4o – Key Takeaways

In a series of experiments, we tested three versions of the models—GPT-3.5, GPT-4, and GPT-4o—on various tasks: text search, picture analysis, email drafting, math problem-solving, and programming. The results reveal some interesting differences in their capabilities.

| Model | Text Search | Picture Analysis | Email Drafting | Math Problem Solving | Programming |
|---|---|---|---|---|---|
| GPT-3.5 | Failed to complete the task. | Couldn’t analyze the picture. | Good, but a bit long. | Incorrect solution. | Many fixes needed; yet, fun game. |
| GPT-4 | Correct answer, slight page error. | Counted dog as a person. | Clear, concise, very professional. | Perfect solution. | Quick fix, but the gameplay less smooth. |
| GPT-4o | Detailed and spot-on. | Accurate, separated dog count. | Clear, engaging, well-structured. | Correct and thorough. | Struggled at first, less enjoyable game. |

GPT-4 and GPT-4o models both show significant improvements over GPT-3.5, but each has its strengths and weaknesses. It’s worth noting that this comparison is subjective, not a rigorous scientific study.

There are two key takeaways:

1. GPT-3.5 falls short of GPT-4 and GPT-4o on most of the presented tasks; programming, where it produced the most enjoyable game, is the exception.

2. The GPT-4o model, despite the buzz, still has some limitations.

When should I choose GPT-4o, GPT-4 and GPT-3.5?

Does this mean we should ditch the GPT-3.5 model entirely? Absolutely not! GPT-3.5 remains a robust model, especially for tasks like data extraction, summarization, and text parsing. Plus, it’s faster and cheaper than its newer counterparts.

While the GPT-4o model is still finding its footing, it shows great promise for tackling more challenging tasks and offers cost benefits. After all, GPT-4o introduces several enhancements, such as the ability to respond with audio in real time, better comprehension of various languages, and vision and audio input capabilities, which were not available in the previous versions of the model. Also, its capabilities are expected to improve in the coming weeks.

So, what’s the bottom line? In theory, it’s simple: use GPT-3.5 for the vast majority of straightforward tasks. For more complex challenges, go with the newer versions of the GPT model (GPT-4 or GPT-4o), depending on your style and budget. Practice shows, however, that the choice is not always that obvious, so before making the final decision, it’s worth building a few proofs of concept and testing different models on your specific task.
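A proof of concept does not have to be elaborate. The sketch below shows one possible shape for it, under clearly stated assumptions: each model is represented by a plain callable that maps a prompt to a reply (the two stubs here return canned answers and stand in for real API calls), and a reply counts as correct when it contains an expected substring, which is a crude check that suits short factual answers only.

```python
from typing import Callable, Dict, List, Tuple

ModelFn = Callable[[str], str]


def compare_models(models: Dict[str, ModelFn],
                   cases: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of test cases each model answers correctly.

    A case counts as correct when the expected text appears in the reply
    (case-insensitive substring match).
    """
    results: Dict[str, float] = {}
    for name, ask in models.items():
        hits = sum(
            1 for prompt, expected in cases if expected.lower() in ask(prompt).lower()
        )
        results[name] = hits / len(cases)
    return results


# Hypothetical stand-ins for real model calls.
models = {
    "stub_a": lambda prompt: "The equation is a_n = 3n - 1.",
    "stub_b": lambda prompt: "The equation is a_n = 6n - 4.",
}
cases = [("Find the equation of the sequence with terms 2, 8, 14, 20 at n = 1, 3, 5, 7.", "3n - 1")]
print(compare_models(models, cases))  # {'stub_a': 1.0, 'stub_b': 0.0}
```

Running the same set of cases against every candidate model keeps the comparison tied to your own task rather than to general benchmarks.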
