On April 14, 2025, OpenAI introduced a new family of models: GPT-4.1, GPT-4.1 Mini, and GPT-4.1 Nano. The models are built to excel at coding and at following complex instructions. They are available through OpenAI's API but not in ChatGPT. Each model has a 1 million token context window, meaning it can process roughly 750,000 words at once.
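For developers, that means access goes through the usual API surface rather than the ChatGPT app. Below is a minimal sketch using the official `openai` Python SDK; the model ID `gpt-4.1` (and the `-mini` / `-nano` variants) is an assumption based on the announced names, not something confirmed in this article.

```python
# Minimal sketch: calling a GPT-4.1 family model through OpenAI's API.
# Assumes the `openai` Python SDK is installed and OPENAI_API_KEY is set;
# the model ID "gpt-4.1" is assumed from the announced model names.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",  # or "gpt-4.1-mini" / "gpt-4.1-nano" (assumed IDs)
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
)

print(response.choices[0].message.content)
```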
OpenAI describes the GPT-4.1 models as a significant step toward its goal of building an "agentic software engineer," a vision CFO Sarah Friar outlined at a tech summit in London last month. The company wants future models to build entire apps from start to finish, including quality assurance, bug testing, and writing documentation.
The full GPT-4.1 model scored 55% on SWE-bench, a benchmark that measures coding ability, improving on OpenAI's earlier GPT-4o and GPT-4.5 models in some areas. On SWE-bench Verified, a human-validated subset of the benchmark, GPT-4.1 scored between 52% and 54.6%. Competitors still lead on that test, however: Google's Gemini 2.5 Pro scored 63.8% and Anthropic's Claude 3.7 Sonnet scored 62.3%.
On another test, Video-MME, which measures how well models understand video content, GPT-4.1 achieved 72% accuracy in the "long, no subtitles" category, which OpenAI says tops the charts.
GPT-4.1 Mini and GPT-4.1 Nano are designed to be faster and more efficient, at the cost of some accuracy, and GPT-4.1 Nano is OpenAI's fastest and cheapest model yet. Pricing is correspondingly modest: GPT-4.1 costs $2 per million input tokens and $8 per million output tokens; GPT-4.1 Mini costs $0.40 per million input tokens and $1.60 per million output tokens; and GPT-4.1 Nano costs $0.10 per million input tokens and $0.40 per million output tokens.
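To make those rates concrete, here is a small, self-contained sketch that estimates the dollar cost of a request from the per-million-token prices quoted above; the model keys are the same assumed IDs as before, and the token counts are illustrative placeholders.

```python
# Illustrative cost estimate from the per-million-token prices quoted above.
# Keys are assumed model IDs; values are (input $/1M tokens, output $/1M tokens).
PRICES = {
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of a single request."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate

# Example: a 100,000-token prompt with a 2,000-token reply on each model.
for name in PRICES:
    print(f"{name}: ${estimate_cost(name, 100_000, 2_000):.4f}")
```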
These models arrive as OpenAI faces stiff competition from Google, Anthropic, and Chinese startup DeepSeek.
A Stanford University report notes that Google and DeepSeek now match OpenAI's capabilities on some AI tasks. Google's Gemini 2.5 Pro also has a 1 million token context window and ranks high on coding benchmarks, as do Anthropic's Claude 3.7 Sonnet and DeepSeek's V3.
OpenAI says it tuned GPT-4.1 for real-world coding tasks. Based on developer feedback, it improved areas such as frontend coding, making fewer unnecessary edits, following formats reliably, and using tools consistently.
Kevin Weil, OpenAI's chief product officer, said during a livestream that the models are well suited to building AI agents and to coding. Developers have already shared positive feedback, with some noting that a test version of GPT-4.1, codenamed Quasar Alpha, fixed issues they had run into with other AI-generated code.
However, OpenAI acknowledges that GPT-4.1 has limits. It becomes less reliable as input length grows: on OpenAI's own MRCR test, accuracy drops from 84% with 8,000 input tokens to 50% with 1 million tokens.
It is also more "literal" than GPT-4o, meaning users may need to write more explicit, specific prompts. And studies show that even today's best AI models struggle with tasks that human experts find easy, such as fixing security flaws or avoiding bugs in code.