Azure OpenAI Service quotas and limits
This article contains a quick reference and a detailed description of the quotas and limits for Azure OpenAI in Azure AI services.
Quotas and limits reference
The following sections provide you with a quick guide to the default quotas and limits that apply to Azure OpenAI:
Limit Name | Limit Value |
---|---|
OpenAI resources per region per Azure subscription | 30 |
Default DALL-E 2 quota limits | 2 concurrent requests |
Default DALL-E 3 quota limits | 2 capacity units (6 requests per minute) |
Default Whisper quota limits | 3 requests per minute |
Maximum prompt tokens per request | Varies per model. For more information, see Azure OpenAI Service models |
Max fine-tuned model deployments | 5 |
Total number of training jobs per resource | 100 |
Max simultaneous running training jobs per resource | 1 |
Max training jobs queued | 20 |
Max Files per resource (fine-tuning) | 50 |
Total size of all files per resource (fine-tuning) | 1 GB |
Max training job time (job will fail if exceeded) | 720 hours |
Max training job size (tokens in training file) x (# of epochs) | 2 Billion |
Max size of all files per upload (Azure OpenAI on your data) | 16 MB |
Max number of inputs in array with /embeddings | 2048 |
Max number of /chat/completions messages | 2048 |
Max number of /chat/completions functions | 128 |
Max number of /chat/completions tools | 128 |
Maximum number of Provisioned throughput units per deployment | 100,000 |
Max files per Assistant/thread | 20 |
Max file size for Assistants & fine-tuning | 512 MB |
Assistants token limit | 2,000,000 tokens |
GPT-4o max images per request (# of images in the messages array/conversation history) | 10 |
GPT-4 vision-preview & GPT-4 turbo-2024-04-09 default max tokens | 16. Increase the max_tokens parameter value to avoid truncated responses. GPT-4o max tokens defaults to 4096. |
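The per-request array limit for /embeddings means that a larger input set has to be split across several requests. The following is a minimal sketch of that batching step only; `batch_inputs` and `MAX_EMBEDDING_INPUTS` are hypothetical names chosen for illustration, and the sketch does not call the service.

```python
from typing import Iterator, List, Sequence

# Per-request array limit for /embeddings, taken from the table above.
MAX_EMBEDDING_INPUTS = 2048


def batch_inputs(
    inputs: Sequence[str], batch_size: int = MAX_EMBEDDING_INPUTS
) -> Iterator[List[str]]:
    """Yield consecutive slices of `inputs`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(inputs), batch_size):
        yield list(inputs[start:start + batch_size])


if __name__ == "__main__":
    texts = [f"document {i}" for i in range(5000)]
    # Each batch would be sent as the input array of one /embeddings request.
    print([len(batch) for batch in batch_inputs(texts)])  # [2048, 2048, 904]
```

Each batch still counts against the tokens-per-minute quota of the deployment, so batching addresses the array-size limit only.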
Regional quota limits
Region | GPT-4 | GPT-4-32K | GPT-4-Turbo | GPT-4-Turbo-V | gpt-4o | gpt-4o - GlobalStandard | GPT-35-Turbo | GPT-35-Turbo-Instruct | Text-Embedding-Ada-002 | text-embedding-3-small | text-embedding-3-large | Babbage-002 | Babbage-002 - finetune | Davinci-002 | Davinci-002 - finetune | GPT-35-Turbo - finetune | GPT-35-Turbo-1106 - finetune | GPT-35-Turbo-0125 - finetune | GPT-4 - finetune |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
australiaeast | 40 K | 80 K | 80 K | 30 K | - | - | 300 K | - | 350 K | - | - | - | - | - | - | - | - | - | - |
brazilsouth | - | - | - | - | - | - | - | - | 350 K | - | - | - | - | - | - | - | - | - | - |
canadaeast | 40 K | 80 K | 80 K | - | - | - | 300 K | - | 350 K | 350 K | 350 K | - | - | - | - | - | - | - | - |
eastus | - | - | 80 K | - | 150 K / 1 M | 450 K / 10 M | 240 K | 240 K | 240 K | 350 K | 350 K | - | - | - | - | - | - | - | - |
eastus2 | - | - | 80 K | - | 150 K / 1 M | 450 K / 10 M | 300 K | - | 350 K | 350 K | 350 K | - | - | - | - | 250 K | 250 K | 250 K | - |
francecentral | 20 K | 60 K | 80 K | - | - | - | 240 K | - | 240 K | - | 350 K | - | - | - | - | - | - | - | - |
japaneast | - | - | - | 30 K | - | - | 300 K | - | 350 K | - | 350 K | - | - | - | - | - | - | - | - |
northcentralus | - | - | 80 K | - | 150 K / 1 M | 450 K / 10 M | 300 K | - | 350 K | - | - | 240 K | 250 K | 240 K | 250 K | 250 K | 250 K | 250 K | 100 K |
norwayeast | - | - | 150 K | - | - | - | - | - | 350 K | - | - | - | - | - | - | - | - | - | - |
southafricanorth | - | - | - | - | - | - | - | - | 350 K | - | - | - | - | - | - | - | - | - | - |
southcentralus | - | - | 80 K | - | 150 K / 1 M | 450 K / 10 M | 240 K | - | 240 K | - | - | - | - | - | - | - | - | - | - |
southindia | - | - | 150 K | - | - | - | 300 K | - | 350 K | - | 350 K | - | - | - | - | - | - | - | - |
swedencentral | 40 K | 80 K | 150 K | 30 K | 150 K / 1 M | - | 300 K | 240 K | 350 K | - | 350 K | 240 K | 250 K | 240 K | 250 K | 250 K | 250 K | 250 K | 100 K |
switzerlandnorth | 40 K | 80 K | - | 30 K | - | - | 300 K | - | 350 K | - | - | - | - | - | - | - | - | - | - |
switzerlandwest | - | - | - | - | - | - | - | - | - | - | - | - | 250 K | - | 250 K | 250 K | 250 K | 250 K | - |
uksouth | - | - | 80 K | - | - | - | 240 K | - | 350 K | - | 350 K | - | - | - | - | - | - | - | - |
westeurope | - | - | - | - | - | - | 240 K | - | 240 K | - | - | - | - | - | - | - | - | - | - |
westus | - | - | 80 K | 30 K | 150 K / 1 M | 450 K / 10 M | 300 K | - | 350 K | - | - | - | - | - | - | - | - | - | - |
westus3 | - | - | 80 K | - | 150 K / 1 M | 450 K / 10 M | - | - | 350 K | - | 350 K | - | - | - | - | - | - | - | - |
gpt-4o rate limits
gpt-4o introduces rate limit tiers with higher limits for certain customer types.
gpt-4o global standard
Note
The global standard model deployment type is currently in public preview.
Tier | Quota Limit in tokens per minute (TPM) | Requests per minute |
---|---|---|
Enterprise agreement | 10 M | 60 K |
Default | 450 K | 2.7 K |
M = million | K = thousand
gpt-4o standard
Tier | Quota Limit in tokens per minute (TPM) | Requests per minute |
---|---|---|
Enterprise agreement | 1 M | 6 K |
Default | 150 K | 900 |
M = million | K = thousand
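In both gpt-4o tables above, the requests-per-minute value works out to 6 requests for every 1,000 tokens per minute. The following sketch checks that arithmetic against the table values. `parse_quota` and `rpm_for_tpm` are hypothetical helper names, and the 6-per-1,000 ratio is inferred from the four rows shown here rather than stated as a guarantee for other models or tiers.

```python
# Multipliers for the table notation (M = million, K = thousand).
SUFFIXES = {"K": 1_000, "M": 1_000_000}

# Assumption: ratio inferred from the gpt-4o tables above.
RPM_PER_1000_TPM = 6


def parse_quota(text: str) -> int:
    """Convert table notation such as '450 K', '10 M', or '900' to an integer."""
    parts = text.split()
    if len(parts) == 2 and parts[1] in SUFFIXES:
        return round(float(parts[0]) * SUFFIXES[parts[1]])
    return int(text.replace(",", ""))


def rpm_for_tpm(tpm: int) -> int:
    """Requests per minute implied by a tokens-per-minute quota."""
    return tpm * RPM_PER_1000_TPM // 1000


if __name__ == "__main__":
    for tpm_text in ("10 M", "450 K", "1 M", "150 K"):
        print(tpm_text, "->", rpm_for_tpm(parse_quota(tpm_text)), "RPM")
```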
Usage tiers
Global Standard deployments use Azure's global infrastructure, dynamically routing customer traffic to the data center with the best availability for the customer’s inference requests. This enables more consistent latency for customers with low to medium levels of traffic. Customers with high sustained levels of usage may see more variability in response latency.
The Usage Limit determines the level of usage above which customers might see larger variability in response latency. A customer’s usage is defined per model and is the total tokens consumed across all deployments in all subscriptions in all regions for a given tenant.
GPT-4o global standard & standard
Model | Usage Tiers per month |
---|---|
GPT-4o | 1.5 Billion tokens |
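Because usage is counted per model across every deployment, subscription, and region in a tenant, comparing against the usage tier means summing all of those first. The following is a minimal sketch, assuming you already collect monthly token counts per deployment from your own metering. The record layout and the names `UsageRecord`, `monthly_usage_by_model`, and `models_over_tier` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Monthly usage tier per model, from the table above.
USAGE_TIER_TOKENS = {"gpt-4o": 1_500_000_000}

# Hypothetical record: (model, subscription, region, tokens consumed this month)
UsageRecord = Tuple[str, str, str, int]


def monthly_usage_by_model(records: Iterable[UsageRecord]) -> Dict[str, int]:
    """Sum tokens per model across all subscriptions and regions."""
    totals: Dict[str, int] = defaultdict(int)
    for model, _subscription, _region, tokens in records:
        totals[model] += tokens
    return dict(totals)


def models_over_tier(records: Iterable[UsageRecord]) -> List[str]:
    """Return models whose tenant-wide monthly usage exceeds the usage tier."""
    totals = monthly_usage_by_model(records)
    return sorted(
        model
        for model, tokens in totals.items()
        if model in USAGE_TIER_TOKENS and tokens > USAGE_TIER_TOKENS[model]
    )


if __name__ == "__main__":
    records = [
        ("gpt-4o", "sub-a", "eastus", 900_000_000),
        ("gpt-4o", "sub-b", "westus3", 700_000_000),
    ]
    # Neither deployment exceeds the tier alone, but the tenant total does.
    print(models_over_tier(records))  # ['gpt-4o']
```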
Other offer types
If your Azure subscription is linked to certain offer types, your maximum quota values are lower than the values indicated in the above tables.
Tier | Quota Limit in tokens per minute (TPM) |
---|---|
Azure for Students, Free Trials | 1 K (all models) |
MSDN subscriptions | GPT 3.5 Turbo Series: 30 K; GPT-4 series: 8 K |
Monthly credit card based subscriptions ¹ | GPT 3.5 Turbo Series: 30 K; GPT-4 series: 8 K |
¹ This currently applies to offer type 0003P.
In the Azure portal, you can view the offer type associated with your subscription by navigating to your subscription and checking the subscription's overview pane. The offer type corresponds to the plan field in the subscription overview.
General best practices to remain within rate limits
To minimize issues related to rate limits, it's a good idea to use the following techniques:
- Implement retry logic in your application.
- Avoid sharp changes in the workload. Increase the workload gradually.
- Test different load increase patterns.
- Increase the quota assigned to your deployment. Move quota from another deployment, if necessary.
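The first item in the list above, retry logic, can be sketched as exponential backoff with jitter. This is one common approach rather than a prescribed implementation. `RateLimitError` is a hypothetical stand-in for whatever exception your client library raises on a rate-limit response, and `call_with_retry` and its parameters are illustrative names.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises when rate limited."""


def call_with_retry(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on RateLimitError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Double the delay ceiling each attempt, capped at max_delay,
            # and pick a random wait below it so clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the wait can be replaced in tests. If the service response includes a suggested wait time, honoring that value instead of the computed delay is usually the better choice.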
How to request increases to the default quotas and limits
Quota increase requests can be submitted from the Quotas page of Azure OpenAI Studio. Because of high demand, requests are accepted and filled in the order they are received. Priority is given to customers who generate traffic that consumes their existing quota allocation, and your request might be denied if this condition isn't met.
For other rate limits, please submit a service request.
Next steps
Explore how to manage quota for your Azure OpenAI deployments. Learn more about the underlying models that power Azure OpenAI.