The release of ChatGPT has generated a lot of hype around the language models behind it: GPT-3 and its later versions, GPT-3.5, GPT-3.5-turbo, and GPT-4. No wonder, as they’re pretty capable! These models can automate many language-related tasks with high accuracy, such as text classification, question answering, machine translation, and text summarization; they can be used for generating content, analyzing customer data, developing advanced conversational AI systems, and more.

If you’re reading this article, you’ve probably already had a chance to play around with ChatGPT or seen it in action featured on YouTube, blogs, and social media posts, and now you’re thinking about taking things to the next level and harnessing the power of GPT models for your own projects.

But before you dive into all the exciting possibilities and plan your product’s roadmap, let’s address one important question: How much does it cost to use GPT-3 in a commercial project?

How much does it cost to use GPT-3 in a commercial project? OpenAI API pricing overview

OpenAI promises simple and flexible pricing of its generative AI models.

[Figure: GPT-3 language models]

We can choose from four language models: Ada, Babbage, Curie, and Davinci. Davinci is the most powerful one (used in ChatGPT), but the other three can still be used successfully for easier tasks, such as writing summaries or performing sentiment analysis. The price is quoted per 1K tokens. Using the Davinci model, you would pay $0.02 per 1K tokens, or $1 for every 50K tokens used. Is it a lot? As explained on the OpenAI pricing page:

You can think of tokens as pieces of words used for natural language processing. For English text, 1 token is approximately 4 characters or 0.75 words. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words). As a point of reference, the collected works of Shakespeare are about 900,000 words or 1.2M tokens.

So for only $100, you can process ~3,750,000 English words with Davinci, which is ~7500 single-spaced pages of text (at roughly 500 words per page). However, as we can read further:
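The budget arithmetic above can be sketched in a few lines. This is a rough illustration, not an official calculator: the 500-words-per-page figure is an assumption implied by the 3.75M words / 7500 pages ratio, and the function name is hypothetical.

```python
# Rule of thumb quoted above; real token counts vary with the text.
WORDS_PER_TOKEN = 0.75
# Assumption: ~500 words per single-spaced page (3.75M words / 7500 pages).
WORDS_PER_PAGE = 500
# Davinci price implied by "$1 for every 50K tokens".
DAVINCI_USD_PER_1K_TOKENS = 0.02


def budget_to_volume(budget_usd: float, usd_per_1k_tokens: float) -> dict:
    """Convert a dollar budget into approximate tokens, words, and pages."""
    tokens = budget_usd / usd_per_1k_tokens * 1000
    words = tokens * WORDS_PER_TOKEN
    pages = words / WORDS_PER_PAGE
    return {"tokens": round(tokens), "words": round(words), "pages": round(pages)}


print(budget_to_volume(100, DAVINCI_USD_PER_1K_TOKENS))
```

Keep in mind that this volume covers everything billed in a request, not only the text you want to process.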

Answer requests are billed based on the number of tokens in the inputs you provide and the answer that the model generates. Internally, this endpoint makes calls to the Search and Completions APIs, so its costs are a function of the costs of those endpoints.

So our 7500 pages of text include input, output, and the prompt with “instructions” for the model. This makes the whole estimation process a bit tricky as we don’t know what the output may be.

To find out, we decided to run an experiment. The goal was to check the actual token usage with three sample prompts, understand what factors impact the output, and learn how to better estimate the cost of generative AI projects based on GPT-3.

How to measure token usage in GPT-3?

The experiment involved combining prompts with text corpora, sending them to an API, and counting the returned tokens. The API request cost was monitored in the usage view, with a wait time of at least 5 minutes per request to match the billing window. Then, we manually calculated the cost and compared it with the recorded usage to check for discrepancies.

The plan was simple: collect several corpora (~10), prepare the prompts, estimate token usage, and call an API a few times to see the results. We aimed to find correlations between input (corpora + prompt) and output, discover factors impacting output length, and predict token usage based on the input and prompt.

Step 1: Estimating the price of GPT-3 inputs

First, we wanted to verify the accuracy of the OpenAI Pricing page. We used the Tokenizer — an official OpenAI tool — to calculate token counts for a piece of text so we could later compare these results with data from the usage view and actual billing.
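Before pasting text into the Tokenizer, you can get a quick ballpark from the rules of thumb quoted earlier (1 token is roughly 4 characters or 0.75 English words). The sketch below is only a heuristic with hypothetical function names; it does not replace the Tokenizer, which gives the actual count.

```python
def tokens_from_words(word_count: int) -> int:
    """Rough estimate: 1 token is about 0.75 English words."""
    return round(word_count / 0.75)


def tokens_from_chars(char_count: int) -> int:
    """Rough estimate: 1 token is about 4 characters of English text."""
    return round(char_count / 4)


def estimate_tokens(text: str) -> dict:
    """Return both heuristic estimates for a piece of text."""
    return {
        "by_words": tokens_from_words(len(text.split())),
        "by_chars": tokens_from_chars(len(text)),
    }
```

For comparison, the 1609-word TikTok sample used in this step was counted at 2182 tokens by the Tokenizer, while the word-based rule gives about 2145, so the heuristic is close but not exact.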

We used descriptions of the ten most downloaded apps (TikTok, Instagram, Facebook, WhatsApp, Telegram, Snapchat, Zoom, Messenger, CapCut, and Spotify) as our corpora. This allowed us to test different use cases like keyword searching, summarizing, and transforming text into project requirements. The descriptions ranged from 376 to 2060 words.

Let’s take a look at an example. Here is a fragment of the TikTok description:

[Figure: estimating GPT-3 token usage with the Tokenizer]

The text sample consisted of 1609 words and 2182 tokens, which — depending on the chosen GPT-3 model — should cost:

  • Ada: $0.0009
  • Babbage: $0.0011
  • Curie: $0.0044
  • Davinci: $0.0437
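The per-model figures above can be reproduced to within rounding with a small helper. The Davinci price follows from "$1 for every 50K tokens"; the Ada, Babbage, and Curie prices below are assumptions back-derived from the figures above, so check them against the current OpenAI pricing page before relying on them.

```python
# Assumed prices in USD per 1K tokens (see the note above).
USD_PER_1K_TOKENS = {
    "ada": 0.0004,
    "babbage": 0.0005,
    "curie": 0.002,
    "davinci": 0.02,
}


def cost_usd(tokens: int, model: str) -> float:
    """Cost of processing `tokens` tokens with the given model."""
    return tokens / 1000 * USD_PER_1K_TOKENS[model]


# The TikTok sample: 2182 tokens.
for model in USD_PER_1K_TOKENS:
    print(f"{model:8s} ${cost_usd(2182, model):.4f}")
```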

We did the same with each of the ten app descriptions in our corpora.

[Figure: GPT-3 token usage of a sample text]

This was our reference for the actual tests with the GPT-3 API.

Step 2: Preparing the prompts

As a next step, we prepared the prompts. For the purposes of this experiment, we wanted to use three prompts for three different use cases.

Prompt #1: Gathering project requirements with GPT-3

The first prompt was about gathering project requirements based on the given app description.

Describe in detail, using points and bullet points, requirements strictly related to the project of an application similar to the below description:

Our prompt was 22 words (148 characters) long, which equaled 26 tokens. We added these values to the corpora and calculated the estimated token usage again for each model.

[Figure: GPT-3 pricing experiment, project requirements token usage]

Prompt #2: Writing a TL;DR summary with GPT-3

The second prompt was about writing summaries of long fragments of text. The model’s “job” was to identify the most important parts of the text and write a concise recap.

Create a short summary consisting of one paragraph containing the main takeaways of the below text:

Our prompt was 16 words (99 characters) long, which equaled 18 tokens. Again, we added these values to the corpora.

[Figure: GPT-3 pricing experiment, summary token usage]

Prompt #3: Extracting keywords with GPT-3

The last prompt was supposed to find and categorize the keywords from the text and then present them in a certain form.

Parse the below content in search of keywords. Keywords should be short and concise. Assign each keyword a generic category, like a date, person, place, number, value, country,  city, day, year, etc. Present it as a list of categories: keyword pairs.

It was 41 words (250 characters) long, which equaled 61 tokens. Together with the corpora text, it gave us:

[Figure: GPT-3 pricing experiment, tagged keywords token usage]
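Adding the prompt tokens to the corpus tokens, as described for the three prompts above, can be sketched as follows. The dictionary keys and the function name are hypothetical labels; the token counts are the ones reported above. Simple addition is an approximation, since the tokenizer may split text slightly differently where the prompt and the corpus are joined.

```python
# Prompt lengths in tokens, as reported for the three prompts.
PROMPT_TOKENS = {
    "project_requirements": 26,
    "tldr_summary": 18,
    "keyword_extraction": 61,
}


def estimated_input_tokens(corpus_tokens: int, prompt: str) -> int:
    """Estimated input size: corpus tokens plus prompt tokens."""
    return corpus_tokens + PROMPT_TOKENS[prompt]


# Example with the 2182-token TikTok description.
for name in PROMPT_TOKENS:
    print(name, estimated_input_tokens(2182, name))
```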

The next step was supposed to finally give us some answers. We were going to send our prompts with corpora texts to the API, count how many tokens were returned in the output, and monitor our API requests in the usage view.

Step 3: GPT-3 API testing

At this stage, we decided to focus only on the most advanced of the GPT-3 models: Davinci.

Since token usage on the OpenAI platform is measured in 5-minute intervals, our script sent one API request every 5 minutes. Each request combined one piece of text (a corpus) with one prompt, allowing us to get precise token usage data for each combination and compare it with estimates. We tested 30 combinations: 3 prompts x 10 app descriptions. We kept model settings (like model temperature) fixed to avoid increasing the number of combinations and the experiment’s cost.
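A minimal sketch of such a loop is shown below. It is not the script used in the experiment: `send` is a hypothetical callable that wraps whatever API client you use and returns the response as a dictionary, and `sleep` is injectable so the loop can be tested without waiting.

```python
import itertools
import time
from typing import Callable, Dict, List


def run_experiment(
    prompts: Dict[str, str],
    corpora: Dict[str, str],
    send: Callable[[str], dict],
    interval_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Send every prompt x corpus combination, one request per interval."""
    combos = list(itertools.product(prompts.items(), corpora.items()))
    results = []
    for i, ((prompt_name, prompt), (app, text)) in enumerate(combos):
        # One request = one prompt followed by one app description.
        response = send(f"{prompt}\n\n{text}")
        results.append(
            {"prompt": prompt_name, "app": app, "usage": response.get("usage", {})}
        )
        # Wait between requests so each one lands in its own usage window.
        if i < len(combos) - 1:
            sleep(interval_s)
    return results
```

With 3 prompts and 10 descriptions, this makes 30 requests and, at the default interval, takes about two and a half hours.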

[Figure: GPT-3 billing information]

After sending these 30 requests, we compared the results shown in the Usage view with the ones taken directly from the metadata of our API calls.

The results were consistent with each other. Moreover, the token usage of the input (both the prompt and the corpora) was also consistent with the usage estimated earlier with the Tokenizer.
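The comparison between API metadata and the earlier estimate can be sketched like this. The sketch assumes the response has been parsed into a dictionary with a `usage` object containing `prompt_tokens` and `completion_tokens`, which is how the OpenAI API reports usage; the function name is hypothetical.

```python
def usage_discrepancy(response: dict, estimated_prompt_tokens: int) -> dict:
    """Compare the usage reported by the API with a Tokenizer-based estimate."""
    usage = response["usage"]
    prompt_tokens = usage["prompt_tokens"]
    completion_tokens = usage["completion_tokens"]
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": usage.get("total_tokens", prompt_tokens + completion_tokens),
        # Zero means the input estimate matched the billed input exactly.
        "prompt_delta": prompt_tokens - estimated_prompt_tokens,
    }
```

A `prompt_delta` close to zero across all requests is what confirms that the input side can be estimated up front.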

At this point, we knew that we could estimate the token usage of the input with high accuracy. The next step was to check if there was any correlation between the length of the input and the length of the output and find out if we could estimate the token usage of the output.

[Figure: GPT-3 input vs. output tokens]

The correlation between the number of input and output tokens was very weak*. Measuring the number of input tokens was not enough to estimate the total number of tokens used in a single request.

* The slope varied between 0.0029 in the TL;DR summary and 0.0246 in the project requirements request.

[Figure: graphs showing the correlation between input and output in GPT-3 requests]
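If you want to run the same check on your own requests, the slope and the correlation coefficient can be computed from two lists of token counts. The sketch below is a plain least-squares implementation with a hypothetical function name; it is one way to do it, not necessarily the method used for the graphs above.

```python
from math import sqrt
from typing import Sequence, Tuple


def slope_and_correlation(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares slope of y on x, and Pearson's correlation coefficient."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long series with at least two points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("both series must vary")
    return sxy / sxx, sxy / sqrt(sxx * syy)
```

Pass input token counts as `x` and output token counts as `y`. A slope of 0.0246 means roughly 25 additional output tokens for every 1000 additional input tokens.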

What factors impact the GPT-3 cost?

While there was no clear correlation between input tokens (prompt + corpora) and output tokens (response), we could see that the prompt itself impacted the number of output tokens. In all analyzed cases, generating project requirements took more tokens than extracting and grouping keywords, but the cost difference was minimal (~$0.04 per request). Costs would likely increase if the prompt required the GPT-3 model to create longer texts, like blog articles.

Apart from the specific use case (what we use the model for), there are also other factors that can impact the cost of integrating GPT models into your project. Among others, these would be:

Model’s temperature

The temperature parameter controls the randomness of the model’s outputs, and setting it to a higher value can result in more diverse and unpredictable outputs. Temperature does not change the price per token, but less predictable outputs can vary more in length, and you pay for every output token, so it can still affect the cost.

Quality of prompt

A good prompt will minimize the risk of receiving the wrong response and, with it, the number of requests you have to repeat and pay for.

Availability

The cost of using GPT-3 may also be impacted by the availability of the model. If demand for the model is high, the cost may increase due to limited availability.

Customization

The cost of using GPT-3 can also be influenced by the level of customization required. If you need specific functionality, additional development work may be required, which can increase the cost.

As a user, you are able to control the budget by setting soft and hard limits. With a soft limit, you will receive an email alert once you pass a certain usage threshold, and a hard limit will simply reject any subsequent API requests once it’s reached. It is also possible to set the max_tokens parameter in the request.

[Figure: the max_tokens parameter]

However, you need to keep in mind that the limits you set will affect how useful the model is. Once a hard limit is reached, API requests are rejected, so you and your users won’t get any response, and a max_tokens value that is too low will cut responses short.
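Because max_tokens caps the length of the output, it also gives you an upper bound on the cost of a single request. The sketch below assumes the Davinci price implied earlier ($1 per 50K tokens); the function name, the 150-token cap, and the placeholder prompt are illustrative assumptions.

```python
DAVINCI_USD_PER_1K_TOKENS = 0.02  # implied by "$1 for every 50K tokens"


def worst_case_cost_usd(
    input_tokens: int,
    max_tokens: int,
    usd_per_1k_tokens: float = DAVINCI_USD_PER_1K_TOKENS,
) -> float:
    """Upper bound on one request's cost: full input plus a maxed-out output."""
    return (input_tokens + max_tokens) / 1000 * usd_per_1k_tokens


# Sketch of a Completions request body with an output cap (illustrative values).
request_body = {
    "model": "text-davinci-003",
    "prompt": "<prompt followed by the text to process>",
    "max_tokens": 150,
}

print(worst_case_cost_usd(input_tokens=1800, max_tokens=request_body["max_tokens"]))
```

The bound only holds per request; the monthly bill still depends on how many requests you send.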

How to estimate the cost of using GPT-3? OpenAI pricing simulation

The experiment showed that estimating token usage based solely on corpora and prompts is very difficult. GPT-3 costs are influenced by many factors, including use case, prompt quality, customization, API call volume, and computational resources. From our experiment, we can roughly estimate the GPT-3 cost for specific use cases like keyword extraction, gathering project requirements, or writing summaries.

Cost of using GPT-3 – project simulation

Let’s take a look at the first case and assume that you have a customer service chatbot on your website, and you would like to know what the users usually ask about. To get such insights, you need to:

  • analyze all the messages they send,
  • extract the entities (e.g., product names, product categories),
  • and assign each an appropriate label.

You have ~15,000 visitors per month, and every visitor sends 3 requests twice a week. Assuming four weeks in a month, that gives 15,000 × 3 × 2 × 4 = 360K requests per month. If we take the average length of the input and output from the experiment (~1800 and 80 tokens) as representative values, we can easily calculate the price of one request:

[Figure: equation for the cost of using GPT-3 in the project simulation]

The cost of using GPT-3 (Davinci model) in the analyzed case would be ~$14.4K per month (~$0.04 per request × 360K requests).
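The simulation can be written out as a short calculation. The sketch assumes four weeks per month (which yields the 360K requests) and the Davinci price implied earlier ($1 per 50K tokens). The ~$14.4K figure appears to follow from rounding the per-request cost up to ~$0.04; without that rounding, the total comes out somewhat lower.

```python
VISITORS_PER_MONTH = 15_000
REQUESTS_PER_VISIT = 3
VISITS_PER_WEEK = 2
WEEKS_PER_MONTH = 4  # simplifying assumption behind the 360K figure
INPUT_TOKENS, OUTPUT_TOKENS = 1800, 80
DAVINCI_USD_PER_1K_TOKENS = 0.02

requests_per_month = (
    VISITORS_PER_MONTH * REQUESTS_PER_VISIT * VISITS_PER_WEEK * WEEKS_PER_MONTH
)
cost_per_request = (INPUT_TOKENS + OUTPUT_TOKENS) / 1000 * DAVINCI_USD_PER_1K_TOKENS

# Unrounded estimate, and the estimate with the per-request cost rounded to cents.
monthly_exact = requests_per_month * cost_per_request
monthly_rounded = requests_per_month * round(cost_per_request, 2)

print(f"requests per month: {requests_per_month}")
print(f"cost per request:   ${cost_per_request:.4f}")
print(f"monthly (exact):    ${monthly_exact:,.0f}")
print(f"monthly (rounded):  ${monthly_rounded:,.0f}")
```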

Note that this is a simplified simulation, so the results are not fully representative. The actual cost of building a GPT-3-powered product depends on many factors (project complexity, data quality, prompts, model settings, user count). The margin of error for such estimates can be 50-100%. To get more reliable estimates, it’s best to run a proof-of-concept project and test different scenarios on your specific data samples.

How much does it cost to use GPT language models? — summary

GPT-3 is a relatively new technology, and there are still many unknowns related to its commercial use. Measuring token usage and its price on the input side ($0.02 per 1K tokens for the Davinci model) is possible, but predicting output values is difficult. Many variables impact these values, and the correlation between input and output is low.

Because of that, any “raw” estimates are guesswork. To improve accuracy (and validate GPT-3’s feasibility for a specific use case), a proof of concept is necessary. In a PoC, we take sample corpora and test the model with different prompts and settings to find the best combination.

BONUS: How much does it cost to use GPT-3.5 turbo with OpenAI Foundry?

On February 21, news about OpenAI’s new offering, Foundry, went viral, reaching top tech media like TechCrunch and CMS Wire. According to the product brief, running a lightweight GPT-3.5 version costs $78,000 for three months or $264,000 for a year. The advanced Davinci model (with a token limit eight times higher than GPT-3) will cost $468,000 for a three-month commitment or $1,584,000 over a one-year commitment.

[Figure: OpenAI Foundry pricing]

But what is this all about? As we can read on TechCrunch:

If the screenshots are to be believed, Foundry — whenever it launches — will deliver a “static allocation” of compute capacity (…) dedicated to a single customer. (…)

Foundry will also offer service-level commitments for instance uptime and on-calendar engineering support. Rentals will be based on dedicated compute units with three-month or one-year commitments; running an individual model instance will require a specific number of compute units.

It seems, though, that the service-level commitment should not be treated as a fixed-price contract. For now, it’s safe to assume the price covers only access to a model with dedicated capacity and “full control over its configuration and performance”, according to the product brief screenshots.

The prices of tokens in the new models were not announced in the product brief. In the official OpenAI documentation update from March 1, though, we could read that GPT-3.5 Turbo comes at 1/10th of the cost of the GPT-3 Davinci model, which gives us $0.002 per 1K tokens in GPT-3.5 Turbo.

In the analyzed case of having a customer service chatbot on your website, with ~15,000 visitors per month, each sending 3 requests twice a week, the estimated price of using GPT-3.5-Turbo would be not ~$14.4K but ~$1.44K.

The next update from March 14 revealed the pricing for the two remaining models — now officially named GPT-4.

GPT-4 with an 8K context window (about 13 pages of text) will cost $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens. GPT-4-32k with a 32K context window (about 52 pages of text) will cost $0.06 per 1K prompt tokens and $0.12 per 1K completion tokens.

As you can see, there is a significant difference in the pricing compared to the older versions of the model. While GPT-3 and GPT-3.5 had a fixed price per 1K tokens, GPT-4 distinguishes the cost of prompt tokens and completion tokens. Applying this to our scenario (360K requests per month, each with 1800 prompt and 80 completion tokens), we would get the total cost of using GPT-4:

  • $21.2K with the 8K context window,
  • $42.3K with the 32K context window.
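The two totals above can be reproduced with a calculation that prices prompt and completion tokens separately. The prices are the ones quoted in this section; the dictionary keys and the function name are illustrative labels.

```python
REQUESTS_PER_MONTH = 360_000
PROMPT_TOKENS, COMPLETION_TOKENS = 1800, 80

# (prompt price, completion price) in USD per 1K tokens, as quoted in the text.
PRICES = {
    "gpt-3.5-turbo": (0.002, 0.002),
    "gpt-4 (8K context)": (0.03, 0.06),
    "gpt-4-32k": (0.06, 0.12),
}


def monthly_cost_usd(prompt_price: float, completion_price: float) -> float:
    """Monthly cost when prompt and completion tokens are priced separately."""
    per_request = (
        PROMPT_TOKENS / 1000 * prompt_price
        + COMPLETION_TOKENS / 1000 * completion_price
    )
    return REQUESTS_PER_MONTH * per_request


for model, (prompt_price, completion_price) in PRICES.items():
    print(f"{model:20s} ${monthly_cost_usd(prompt_price, completion_price):,.0f}")
```

Because the input is more than 20 times longer than the output in this scenario, the prompt price dominates the bill; a use case that generates long texts would shift the balance toward the completion price.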

While this cost is much higher than with the text-davinci-003 model (or gpt-3.5-turbo), remember that use cases for various GPT models differ. This reinforces that “raw” estimates are guesswork. To improve accuracy and validate a model’s feasibility for a specific use case, a proof of concept is necessary.