OpenAI Releases GPT-5.4 and it is State of the Art (SOTA) across Key Coding Benchmarks

On March 5th, 2026, OpenAI released GPT-5.4. GPT-5.4 brings together the best of their “recent advances in reasoning, coding, and agentic workflows into a single frontier model”. OpenAI said that GPT-5.4 “incorporates the industry-leading coding capabilities of GPT‑5.3‑Codex while improving how the model works across tools, software environments, and professional tasks involving spreadsheets, presentations, and documents.”  GPT-5.3 Codex is great for agentic coding, and combining it with GPT-5.4 helps you get real work done accurately, effectively, and efficiently.

OpenAI shared the following benchmarks comparing GPT-5.4 to GPT-5.3 Codex and GPT-5.2 in their announcement:

 

For Coding, they also shared their SWE-Bench Pro (public) Accuracy vs Latency Benchmarks:

 

As a demonstration of their model’s improved computer-use and coding capabilities working in tandem, they released an experimental Codex skill called “Playwright (Interactive)”. This new skill “allows Codex to visually debug web and Electron apps; it can even be used to test an app it’s building, as it’s building it.”

 

I can’t wait to test and see how it has improved AI assisted game development!  I’ll let you know how my testing goes via my AI Assisted Unreal Engine video game development project using my Betide NeoStack AI Plugin and my OpenRouter account.

Epic Games Acquires Meshcapade, a Startup Specializing in AI Technologies for Creating and Animating Hyper-Realistic Digital Humans from Video Recordings

Epic Games recently acquired Meshcapade, a startup specializing in AI technologies for creating and animating hyper-realistic digital humans. Meshcapade, a spin-off from the Max Planck Institute for Intelligent Systems based in Tübingen, Germany, develops AI tools that generate precise 3D body models and animations from video recordings.

Meshcapade’s AI tools, built on the SMPL (Skinned Multi-Person Linear) parametric body model, primarily target and automate the body modeling and full-body motion capture stages in Epic’s MetaHuman pipeline, which currently relies on limited presets for bodies and less advanced video-based tracking for animation.
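At its core, the SMPL model mentioned above represents a body as a template mesh plus a linear blend of learned shape (and pose) offsets. Below is a minimal toy sketch of that idea; the array sizes and random "blend shapes" are illustrative stand-ins, not real SMPL data (the actual model uses 6,890 vertices and learned shape/pose correctives):

```python
import numpy as np

# Toy stand-in for SMPL's template mesh and shape blend shapes.
N_VERTS, N_SHAPE = 4, 2
template = np.zeros((N_VERTS, 3))  # rest-pose mesh vertices
shape_dirs = np.random.default_rng(0).normal(size=(N_SHAPE, N_VERTS, 3))

def smpl_like_shape(betas):
    """Linear shape blend: vertices = template + sum_k beta_k * S_k."""
    offsets = np.tensordot(betas, shape_dirs, axes=1)  # (N_VERTS, 3)
    return template + offsets

body = smpl_like_shape(np.array([0.5, -1.0]))
print(body.shape)  # (4, 3)
```

The appeal for a MetaHuman-style pipeline is that an entire body shape is captured by a handful of beta coefficients, which an AI model can regress directly from an image or video frame.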

What Meshcapade Replaces in the current MetaHuman Pipeline:

  1. Preset/Manual Body Modeling: Replaces MetaHuman Creator’s ~50 preset bodies and external sculpting/rigging with AI-generated custom SMPL bodies from a single image, video clip, or scan. It extracts precise shape, pose, and clothing details automatically—no manual adjustments needed.
  2. Expensive Mocap Hardware: Replaces marker/suit-based systems (e.g., optical mocap studios) with markerless full-body capture from any camera (phone/webcam/pro). It tracks subtle motions like fingers/hands, camera movement, and multi-person scenes.
  3. Fragmented Body Animation: Complements/enhances MetaHuman Animator’s body solve with superior AI-driven full-body mocap, reducing retargeting hassles (the plugin existed before the acquisition; now it will become native).

With the acquisition, Epic says it’s “looking forward to working together to advance digital human technologies for use across gaming, film and entertainment.”  I’m personally looking forward to a much simpler MetaHuman creation and animation pipeline for video game development and virtual production.

 

Claude Sonnet 4.6 Released on February 17, 2026

Anthropic released Claude Sonnet 4.6 on February 17, 2026 – almost 2 weeks after releasing their flagship Claude Opus 4.6 model.   Below are the latest benchmarks that Anthropic published in their Claude Sonnet 4.6 announcement:

 

You can read more about Anthropic’s Claude Sonnet 4.6 here.

AI Co-Development Just Got a Lot Easier with the Release of Claude Opus 4.6 and OpenAI GPT-5.3 Codex on February 5, 2026

In this article, I will compare two specific domains: coding an agentic application (e.g., autonomous AI agents that use tools, plan multi-step workflows, interact with environments like terminals/browsers, debug in loops, and handle real-world software engineering) and game development (e.g., designing game logic, implementing mechanics in engines like Pygame/Unity-style code, handling physics/AI/pathfinding, procedural generation, balancing systems, and iterating on prototypes).

On February 5, 2026, Anthropic released Claude Opus 4.6, their current flagship, optimized for the most demanding, long-horizon, and complex tasks.  On the very same day, OpenAI released GPT-5.3 Codex (the dedicated agentic coding line), which is OpenAI’s current flagship for advanced coding and agentic workflows.

Key Specs Comparison:

Context Window
  • Claude Sonnet 4.5: 200K tokens
  • Claude Opus 4.6: 200K standard / 1M tokens (beta, with much higher reliability)
  • GPT-5.3-Codex: Not explicitly stated in releases (likely 200K–512K range, consistent with GPT-5 family; strong on long-horizon without 1M claims yet)

Pricing (API, per million tokens)
  • Sonnet 4.5: $3 input / $15 output
  • Opus 4.6: $5 input / $25 output (higher beyond 200K)
  • GPT-5.3-Codex: Similar to prior GPT-5.x tiers (~$5–15 input / $15–75 output range; exact Codex rates match high-tier GPT access via app/CLI/API)

Speed
  • Sonnet 4.5: Fastest of the three for iteration
  • Opus 4.6: Slower (deeper thinking/effort modes)
  • GPT-5.3-Codex: ~25% faster than GPT-5.2-Codex; feels responsive for agent steering

Reasoning/Agent Style
  • All three are hybrid/agentic-tuned. Opus 4.6 has adaptive effort + best long-horizon sustain. GPT-5.3-Codex emphasizes steerable, interactive agents (you guide mid-task without context loss). Sonnet 4.5 balances speed + quality.
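To make the per-million-token pricing above concrete, here is a small cost-estimate sketch. The rates are the approximate figures quoted in this comparison, not authoritative published pricing, so treat the output as ballpark only:

```python
# Approximate per-million-token (input, output) rates from the comparison
# above -- assumptions for illustration, not official price sheets.
RATES = {
    "sonnet-4.5": (3.00, 15.00),
    "opus-4.6": (5.00, 25.00),
}

def job_cost(model, input_tokens, output_tokens):
    """Estimate USD cost: (tokens / 1M) * per-million rate, input + output."""
    in_rate, out_rate = RATES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A single agentic run that reads 200K tokens and writes 20K tokens:
print(round(job_cost("opus-4.6", 200_000, 20_000), 2))  # 1.5
```

Runs like this add up quickly over a long agentic session, which is why Sonnet 4.5's lower tier matters for high-volume iteration.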

Performance on Agentic Coding / Building Agentic Applications:

This category is extremely competitive right now — the February 5 releases were direct head-to-heads.

  • Terminal-Bench 2.0 (terminal/tool-use agentic coding): GPT-5.3-Codex leads at ~77.3% (state-of-the-art claim); Opus 4.6 close behind (~65–72% range in prior reports, but Anthropic claims top on agent evals); Sonnet 4.5 lower (~51%).
  • SWE-Bench Pro/Verified (real GitHub issues, multi-lang): GPT-5.3-Codex hits new highs on Pro variant (multi-lang, harder); Opus 4.6 edges on Verified for complex fixes (~80–81%); Sonnet 4.5 strong but trails slightly.
  • OSWorld / Computer Use: GPT-5.3-Codex strong (~64–65%); Opus 4.6 leads in sustained GUI/terminal chains.
  • Long-horizon Agents: Opus 4.6 excels at multi-hour autonomous runs, massive context, self-correction. GPT-5.3-Codex shines on interactive steering (like a live colleague) + tool/research chaining. Sonnet 4.5 great for volume/prototyping but needs more retries on extreme complexity.
  • Real-world vibe: Developers report GPT-5.3-Codex feels “more Claude-like” than prior OpenAI models (better git, debugging, broad tasks). Opus 4.6 often wins on deep codebase navigation + fewer hallucinations. Many use both in parallel.

Below is the Terminal-Bench 2.0 score of 65.4% for Claude Opus 4.6, as published by Anthropic:

Below is the Terminal-Bench 2.0 score of 77.3% for GPT-5.3 Codex, as published by OpenAI:

For serious agentic apps (autonomous tools, computer-use agents, production reliability): It’s a toss-up between Opus 4.6 (best long sustain + 1M context) and GPT-5.3-Codex (interactive steering + benchmark edges). Sonnet 4.5 is excellent but not quite frontier here.

Performance on Game Development

Game development favors creativity + systems integration + rapid iteration (mechanics, AI, procedural gen, balancing, physics sims).  All three handle Pygame prototypes, Godot/Unity pseudocode, NPC AI, etc., very well.

  • Sonnet 4.5: Best for fast prototypes/iterations — clean loops, quick balancing tweaks, simple pathfinding/behavior trees. Preferred for indie-speed work.
  • Opus 4.6: Pulls ahead on complex, interconnected systems (e.g., economy + AI opponents + physics + UI + narrative). Better at coherent large-scale design → code chains, debugging edge cases in bigger codebases, creative-yet-coherent ideas.
  • GPT-5.3-Codex: Strong contender — OpenAI highlights building “highly functional complex games and apps from scratch over days.” Excels at multi-step execution + research/tool use (e.g., pulling refs for mechanics). Feels more “productive” for iterative game workflows.

For most game prototypes/indie development: Sonnet 4.5 (speed + cost) or GPT-5.3-Codex (if you want interactive guidance).

For ambitious/systems-heavy games (e.g., simulation layers, procedural worlds, long development cycles): Opus 4.6 edges out slightly on cohesion + autonomy, but GPT-5.3-Codex is very close and often faster to iterate.

My Current Recommendation for Cloud AI Agent Assisted Development:

  • Claude Sonnet 4.5 — Default choice for 80% of agentic coding + game development: fast, cheap, reliable near-frontier.
  • Claude Opus 4.6 — Pick for the hardest agentic/long-horizon work, massive context, or when you need max reliability/cohesion (especially complex games or production agents).
  • GPT-5.3-Codex — Pick (or combine) if you value interactive steering, OpenAI ecosystem (Codex app/CLI/Copilot integration), or hit edges where OpenAI’s benchmarks shine. It’s neck-and-neck with Opus 4.6 right now — many developers test both on the same task.

I am currently using the Claude models with my OpenRouter account as I test the NeoStack AI Plugin for Unreal Game Development.

Bottom line, the AI Assisted Video Game Development race is razor-close! Real preference often comes down to workflow (e.g., Claude Code vs. Codex app), ecosystem lock-in, and the development tools for your specific game project.  In a future post, I will go over using Claude with Unreal Engine and the Betide NeoStack AI Plugin.

Co-Developing Video Games Using the Latest Version of the NeoStack AI Unreal Game Engine Plugin with Multiple LLMs via OpenRouter

I am currently testing the NeoStack AI Plugin by BETIDE STUDIO.  The NeoStack AI plugin is currently in Beta and was originally released on the Epic FAB store on January 20, 2026.  I am testing the latest beta version (v0.3.1), which was released on February 9, 2026.

The developer is constantly releasing updates and fixes so I expect I will be constantly updating the plugin while it is in Beta.   I am looking forward to seeing how the Agent Integration plugin evolves over the next several months.

To give you a perspective of how rapidly the plugin is evolving, here is a list of updates that were released over the last couple of weeks:

The v0.3 beta update, which was released the other day, added the following features and fixes:

  • Composite Graphs, Comments, Macros, Local Variables, Components, Reparenting, and Blueprint Interfaces are now supported! (This means agents can now do 100% of what you can do with Blueprints)
  • Nearly 500 additional checks added across the plugin to reduce the number of crashes. Just want to say, if you crash, please let us know! We want to take the number of crashes to 0
  • Subconfigs added which means you can configure what tools a profile has
  • Session Resume has been fixed – No need to tell the agent what you already have
  • You can now see Usage Limits for OpenAi Codex/Claude Code in the plugin (See Image)
  • Codex Reasoning Parameter is now supported
  • Task Completion Notifications — Toasts, taskbar flash, and sound playback when the agent finishes (Configurable)
  • Montage Editing made better
  • Pin Splitting/Recombining & Dynamic Exec Pins is now supported
  • GitHub Copilot Native CLI support added
  • Fixed Packaging Issues if using Plugin
  • Class Defaults editing — ConfigureAssetTool can now set Actor CDO properties like bReplicates, NetUpdateFrequency, AutoPossessPlayer
  • Last-used agent persistence — Remembers which agent you used last instead of defaulting to Claude Code every time
  • Fixed issues when the port configured for MCP is already in use

The v.0.2 beta update from the week before added the following new features and fixes:

  • Full Animation Montage editing!
  • Enhanced Input support! Create & edit Enhanced Input assets
  • Agent can now see what’s in your viewport! Captures the active viewport as an image with camera info
  • Level Sequencer has been completely rewritten
  • Agent can now set any node property directly on Behaviour Trees
  • BT reads now always show full node tree + blackboard details. Blueprint vars show replication flags. New readers for Enhanced Input & Gameplay Effects. Level Sequence reads are way more detailed now
  • New “Think” toggle for Claude Code — control how much the agent reasons
  • Streaming no longer bounces — smooth throttled updates
  • Message layout redesigned — cleaner look
  • Session saving no longer freezes UI on tab switch
  • Server now blocks browser requests by default (CSRF protection)
  • 24 Crash Fixes (It is still in Beta!)

Prior to releasing the beta plugin, Betide Studio published a standalone Windows & Mac version of the tool called NeoAI.  NeoStack AI now works across virtually every major Unreal Engine system as it is a native Unreal Engine Plugin.  It currently supports the following Unreal Engine systems:

  • Blueprints
  • Materials
  • Animation Blueprints
  • Behaviour Trees
  • State Trees
  • Structs
  • Enums
  • DataTables
  • Niagara VFX
  • Level Sequences
  • IK Rigs
  • Animation Montages
  • Enhanced Input
  • Motion Matching
  • PCG
  • MetaSounds
  • and much more planned over the next several months

NeoStack AI is currently priced at $109.99 on the FAB Store.  You can read more about all the different features here.

It currently works with the Claude, Gemini and OpenAI CLIs.  Claude Code & Gemini CLI integration requires those tools to be installed separately.  I personally prefer using my OpenRouter API Key so I can pick which AI model to use depending on which development task I am currently doing.  Betide Studio currently recommends using Claude Code, as it works best for creating Blueprint logic while the other models produce less-than-desired results.  The best part of NeoStack AI is that you get to choose which model to use (400+) and how much you want to pay for tokens.  You can even use the free models available on OpenRouter (e.g. GLM, DeepSeek, Qwen Coder) if you don’t want to pay for tokens and just want to play around with it.
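For readers who haven't used OpenRouter: it exposes an OpenAI-compatible chat completions endpoint, and the model slug in the request body is what picks which of the 400+ models handles the call. Here is a minimal sketch using only the Python standard library; the model slug and key placeholder are illustrative:

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key, model, prompt):
    """Build an OpenAI-compatible chat completions request for OpenRouter."""
    payload = json.dumps({
        "model": model,  # e.g. "anthropic/claude-sonnet-4.5" -- any slug works
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        OPENROUTER_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

def ask_openrouter(api_key, model, prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(api_key, model, prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Swapping models per task is just a matter of changing the slug string, which is exactly the flexibility the plugin's OpenRouter dropdown gives you inside the editor.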

The Default Agent LLMs listed in the OpenRouter Dropdown menu are the following:

Once you install the plugin and properly configure it, you get access to this Agent Chat Window.  I configured the plugin to connect to OpenRouter with my API Key:

I’ll provide a follow up on my beta testing process after I have tested the plugin over the next several weeks.

Using the GLM 4.7V Flash Local LLM Model by Z.ai to Develop a Moon Landing Simulation Using C# on my Alienware Aurora R11 RTX-3080 10GB Video Card

The GLM-4.7 Flash model, an open-weight 30B-parameter Mixture-of-Experts (MoE) variant released by Beijing Zhipu Huazhang Technology Co., Ltd on January 19, 2026, is positioned as a lightweight, efficient option for local deployment and agentic tasks like coding.

The Lunar Lander coding simulation test ran without any problems at 6.12 tokens per second using LM Studio:
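For context, my Lunar Lander test asks the model to code a simple thrust-versus-gravity descent loop with limited fuel. The sketch below shows the core physics in Python (my actual test is in C#, and the constants and naive autopilot here are illustrative, not the model's output):

```python
GRAVITY = 1.62  # lunar surface gravity, m/s^2
DT = 0.1        # fixed simulation timestep, s

def step(altitude, velocity, fuel, thrust_accel):
    """Advance one timestep; thrust only fires while fuel remains."""
    accel = -GRAVITY + (thrust_accel if fuel > 0 else 0.0)
    velocity += accel * DT
    altitude = max(0.0, altitude + velocity * DT)
    if thrust_accel > 0 and fuel > 0:
        fuel = max(0.0, fuel - DT)  # burn fuel while thrusting
    return altitude, velocity, fuel

def land(altitude=100.0, velocity=0.0, fuel=50.0):
    """Naive autopilot: burn at 2x gravity whenever falling faster than 2 m/s."""
    while altitude > 0:
        thrust = 2 * GRAVITY if velocity < -2.0 else 0.0
        altitude, velocity, fuel = step(altitude, velocity, fuel, thrust)
    return velocity  # touchdown speed; gentle if only a few m/s
```

What I look for in the model's version is exactly this structure: a correct integration loop, fuel gating on thrust, and a win/lose check on touchdown velocity.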

Below you can see the Task Manager Performance Chart for my NVIDIA GeForce RTX-3080 with 10GB VRAM:

Based on available benchmark data, it achieved the following scores on the specified evaluations:

GPQA: 75.2%
AIME 25: 91.6%
SWE-bench Verified: 59.2%

These results position it as a strong performer in its size class, particularly for coding and reasoning, outperforming comparably sized models like Qwen3-30B on SWE-bench Verified while maintaining lower resource requirements. Note that benchmarks can vary slightly based on evaluation configurations, but these figures are consistent across official sources and reviews.

 

 

Using GLM 4.6V Flash Local LLM Model by Z.ai to Develop a Moon Landing Simulation Using C# on my Alienware Aurora R11 RTX-3080 10GB Video Card

Over the past 6 months, I have continued to test locally hosted open-source multimodal agentic models which could run comfortably on my NVIDIA RTX 3080 with 10GB.  Back in August, I added GLM 4.5 to my testbench as it was surpassing or matching DeepSeek V3, Qwen 2.5 Coder, and Llama 3.1 in benchmarks. At the time, I was busy testing OpenAI GPT-OSS, so I didn’t get a chance to write a blog post about my GLM 4.5 testing.
Z.ai (Zhipu AI) recently launched GLM 4.6V Flash on LM Studio, a 9B vision-language model optimized for local deployment and low-latency applications. It supports a context length of 128k tokens and achieves strong performance in visual understanding among models of similar scale.  The model introduces native multimodal function calling, enabling vision-driven tool use where images, screenshots, and document pages can be passed directly as tool inputs without text conversion.
It needs a “minimum” of 8GB VRAM to run, so I ran my standard Lunar Lander Simulation coding test using the default 4096-token context window. My initial coding test quickly hit the LM Studio default 4096-token context window and errored out with a “Failed to send message” after 1 minute and 44 seconds:

I then reloaded the model with a maximum token context window of 131072 with maximum layers offloaded to the GPU:

The Lunar Lander coding simulation coding test then ran without any problems at 11.69 tokens per second:

Below you can see the Task Manager Performance Chart for my NVIDIA GeForce RTX 3080:

GLM 4.6 scored 81% on GPQA, 93.9 on AIME 2025, and 69% on SWE-bench Verified.  The Open Source LLM Models keep getting better and better at coding with each new release and the releases keep coming out faster and faster.   Z.ai just released GLM 4.7 on December 22, 2025, and it quickly jumped to the top of the Open-Source Leaderboard.  The new GLM 4.7 scored 85.7% on GPQA, 96.7 on AIME 2025, and 73.8% on SWE-bench Verified. Once the new 4.7 Flash model gets uploaded to LM Studio, I’ll test the new 4.7 model as well.

Wishing you and your family a Merry Christmas and a Happy New Year!

 

OpenAI Releases GPT-5 and it is State of the Art (SOTA) across Key Coding Benchmarks

OpenAI released GPT‑5 today and it is now state-of-the-art (SOTA) across key coding benchmarks, scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot.

SWE-bench Verified: (Tests AI models on real-world GitHub issues, evaluating their ability to generate accurate code patches)

GPT-5 with Thinking (High) scored highest with 74.9%, followed closely by Claude Opus 4.1 at 74.5%.  For comparison, OpenAI o3 High scored 69.1%.

Aider Polyglot: (Evaluates code editing across multiple programming languages, e.g., Java, Rust, Python):

GPT-5 dominates with 88%, a substantial lead over competitors.  Grok 4 comes in next at 79.6%.  For comparison, OpenAI o3 came in at 81%.

This new era of AI Assisted coding LLMs just keeps getting better and better each day!

Using OpenAI GPT-OSS Open Weight Local LLM Model to Develop a Moon Landing Simulation Using C# on my Alienware Aurora R11 RTX-3080 10GB Video Card

OpenAI released their new open-weights local LLM model today under the Apache License; you can download it and run it locally on your own hardware without the need for a cloud subscription.  The larger 120 billion parameter model will require a system with an 80GB GPU (who has that!?) or a Mac M3/M4 system with at least 128GB of integrated/shared memory. The new AMD Ryzen AI Max+ 395 with 128GB can also run the 120B parameter model locally.  You can definitely run the smaller 4-bit quantized 20 billion parameter model locally using an NVIDIA GeForce 3090, 40×0 or 50×0 video card with at least 16GB of VRAM.  I personally downloaded the smaller 20B parameter model and got around 11 tokens per second using my RTX 3080 with 10GB using LM Studio.  Some of the GPT-OSS 20B model had to be loaded into my local system RAM on my Alienware R11, which made it run much slower.
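A useful back-of-envelope for "will this model fit on my GPU": the weights need roughly parameters × bits-per-weight ÷ 8 bytes, plus headroom for the KV cache, activations, and runtime buffers. The sketch below uses a loose 20% overhead factor, which is an assumption for illustration, not a measured number:

```python
def approx_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Rough VRAM estimate in GB: quantized weight size plus ~20% headroom
    for KV cache, activations, and runtime buffers (overhead is a guess)."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8-bit ~= 1 GB
    return weight_gb * overhead

# GPT-OSS-20B at 4-bit: ~10 GB of weights, ~12 GB with headroom --
# which is why it spills into system RAM on a 10 GB RTX 3080.
print(round(approx_vram_gb(20, 4), 1))  # 12.0
```

The same arithmetic shows why the 120B model is out of reach for consumer cards: even at 4-bit it wants on the order of 60+ GB before any context is allocated.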

On a system with 16GB VRAM or 128GB of integrated memory, the free GPT-OSS LLM performs pretty well on the Codeforces Competition code benchmark tests against OpenAI’s other subscription-based cloud LLMs:

Source: Codeforces Benchmark from OpenAI

Since I only have 10GB of VRAM on my RTX-3080, I tested GPT-OSS-20B with LM Studio and offloaded some of the model into my 64GB of system memory.  I first tested it with a simple Hello World C++ application, and I got an average of 11 tokens per second dividing the model between high-speed GeForce VRAM and slow system RAM.

I then tested it by building a much larger C# Moon Landing simulation, but it ran out of context memory and did not complete the coding task.

Update:  I finally figured out how to enlarge my context window in LM Studio to 131072 and was able to successfully build the Lunar Lander game:

I definitely need at least 16GB of VRAM, or better yet, 128GB of shared integrated memory in my development system to get faster local GPT-OSS LLM performance along with a bigger context window to complete this Moon Landing C# AI Assisted coding task.

Microsoft also announced today that you can try GPT-OSS out in Foundry Local or the AI Toolkit for VS Code (AITK) and start building your Microsoft applications with the local GPT-OSS LLM models today. Let the free local GPT-OSS open-weights AI Assisted App & Game Development begin!

Ludus AI Agent Assisted Game Development Tools for Unreal Game Engine

In preparation for my July 2025 CGDC presentation on AI Assisted Game Development Tools, I spent a lot of time beta testing the latest beta version of the Ludus AI Blueprints Toolkit for the Unreal Game Engine using Unreal 5.6.

I did my Blueprints testing using their 14 Day free Pro Subscription Trial with 20,00 credits.  They also provided additional testing credits for the closed beta period to help with the closed beta, which I really appreciated!

I was particularly focused on their new AI Agent Blueprint Generation Features.   They currently provide support for: Actor, Pawn, Game Mode, Level Blueprint, Editor Utility Blueprints & Function Library Blueprints.   They also offer partial support for Widget (UMG).  They are planning to provide support for Material, Niagara, Behaviour Tree, Control Rig, and MetaSound in the near future.  Their goal is to fully integrate Blueprint Generation into the main plugin interface. Once the Open Beta phase concludes, activation via a special beta code will no longer be necessary.

For my beta testing, I asked the new Ludus AI Blueprints agent to identify and fix all the load issues I was getting after upgrading an old Kidware Software Unreal Engine 4.26 RPG academic game project to the latest version of Unreal 5.6.

The Ludus AI Agent examined my updated Unreal Engine 5.6 project and determined which Blueprints were broken and it gave me a systematic plan to fix them. I then asked Ludus AI (in Agent mode) to apply all the recommended fixes and test them for me. You can watch the 5-minute video below showing the Ludus AI agent in action:

In Agent mode, Ludus AI made all the fixes (without my help) and my RPG game project worked flawlessly using Unreal Engine 5.6.  I was genuinely impressed!   Below is the video I took of the fixed gameplay after Ludus AI did its Unreal Engine 5.6 Blueprint repair magic:

Back in 2022, it took me many hours to identify and fix all those Unreal 5 Upgrade errors that I was getting back then doing it all manually myself.  The Ludus AI Blueprints beta agent did all the repair work for me in approximately 5 minutes.

I was so impressed by the beta Ludus AI Agent Blueprints results that I personally purchased a subscription to the Ludus AI plugin for myself.  They offer several tiers of pricing for Indies, Pros and Enterprise customers:

They also offer a Free 400 Credit per month subscription so you can take a look at their plugin.

They also sell additional credit packages just in case you burn through your credit allotment during a given month.

The Ludus AI Blueprints 0.6.0 beta agent support moved into “Open Beta” on July 29, 2025, so you can now try out their new AI Blueprint beta features yourself at https://ludusengine.com/ using a Ludus AI Pro 14 Day Free Evaluation Subscription.

I plan to give you another update on my Ludus AI Assisted Unreal Game Engine Blueprint testing in a future blog post.