Victoria Livschitz is a poker player, entrepreneur, and founder of the poker training site Octopi Poker. Prior to working in poker, she ran one of the earliest advanced LLM research labs, beginning in 2015.
In this exclusive article, Livschitz analyzes the play of the LLM AIs that contested the recent PokerBattle.ai event, an online match between some of the biggest names in consumer AI.
For further information on some of the language and terms used in this article, check out our detailed guide to poker terms.
An unusual online poker battle took place over five days, starting on October 27, 2025. The tournament pitted nine leading Large Language Models (LLMs) against each other in a $10/$20 no-limit Texas hold'em cash game.
Organizer Max Pavlov stated the goal was “to get a clear verdict on how well different general-purpose LLMs reason in poker situations by breaking down hands, explaining decisions and essentially integrating all the different parts of the game into one coherent whole.”
The format involved nine models playing at four concurrent, nine-handed tables for five days. The game was $10/$20 no-limit Texas hold'em with no ante or straddle. Each model began with a $2,000 (100 big blind) stack, which was automatically topped up to 100bb if it dropped below that level. They had access to a total bankroll of up to $100,000.
Over the course of the battle, each model played 3,799 hands, with the exception of LLAMA, which was eliminated after losing its entire $100,000 bankroll by hand #3,501. The models were ultimately ranked by their total winnings.
The participants, meta statistics and final standings
The participants included five leading American-made LLMs, three Chinese models and one French model. Here are their results:
| Place | Model | Affiliation | Win/Loss $ | Win rate, BB/100 |
|---|---|---|---|---|
| 1 | o3 | OpenAI (USA) | $36,691 | +48.3 |
| 2 | Claude Sonnet 4.5 | Anthropic (USA) | $33,641 | +44.3 |
| 3 | Grok | xAI (USA) | $28,796 | +38.0 |
| 4 | DeepSeek R1 | DeepSeek AI (China) | $18,416 | +24.2 |
| 5 | Gemini 2.5 Pro | Google (USA) | $14,655 | +19.3 |
| 6 | Magistral | Mistral AI (France) | $3,281 | +4.3 |
| 7 | Kimi K2 | Moonshot AI (China) | -$14,370 | -18.9 |
| 8 | GLM 4.6 | Z.ai (China) | -$21,510 | -28.3 |
| 9 | LLAMA 4.0 | Meta (USA) | -$100,000 | -142.8 |
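The BB/100 figures in the table are simply total winnings normalized by the big blind and the number of hands played. A minimal sketch (the helper name is mine; values are from the table above):

```python
def win_rate_bb100(winnings: float, big_blind: float, hands: int) -> float:
    """Win rate in big blinds per 100 hands: winnings / big blind / hands * 100."""
    return winnings / big_blind / hands * 100

# o3: $36,691 over 3,799 hands at a $20 big blind
print(round(win_rate_bb100(36_691, 20, 3_799), 1))    # 48.3
# LLAMA: -$100,000 over its 3,501 hands
print(round(win_rate_bb100(-100_000, 20, 3_501), 1))  # -142.8
```

Note that LLAMA's rate is computed over its shorter 3,501-hand run, which is why its loss rate looks even steeper than the raw dollar figure suggests.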
Overall tendencies of each model are captured in this table (scroll for the full figures). Since roughly 3,800 hands per model is a limited dataset, short by at least a factor of ten for statistically reliable conclusions, this type of data must be taken with a grain of salt.
| Player | Hands | VPIP | PFR | 3-Bet | F3B | C-Bet | FCB | WTSD |
|---|---|---|---|---|---|---|---|---|
| OpenAI | 3,799 | 26.6% | 18.1% | 16.3% | 31.0% | 62.1% | 22.5% | 19.5% |
| Claude | 3,799 | 27.1% | 15.7% | 11.1% | 49.8% | 84.2% | 28.2% | 16.3% |
| Grok 4 | 3,799 | 29.7% | 19.6% | 12.5% | 62.5% | 82.7% | 26.1% | 19.1% |
| DeepSeek | 3,799 | 19.9% | 10.6% | 8.8% | 37.0% | 66.7% | 24.9% | 16.9% |
| Gemini | 3,799 | 28.1% | 21.4% | 20.6% | 40.2% | 57.4% | 14.5% | 28.3% |
| Magistral | 3,799 | 15.9% | 13.6% | 10.1% | 83.9% | 87.9% | 16.4% | 19.4% |
| Kimi K2 | 3,799 | 20.0% | 10.9% | 10.9% | 36.2% | 79.5% | 40.5% | 12.4% |
| GLM 4.6 | 3,799 | 31.0% | 15.5% | 10.0% | 59.5% | 66.8% | 22.5% | 20.8% |
| LLAMA 4 | 3,501 | 62.8% | 28.2% | 18.2% | 35.4% | 75.4% | 12.2% | 33.9% |
| GTO Baseline | n/a | 18-22% | 14-18% | 6-8% | 55-65% | 55-70% | 40-55% | 24-28% |
(VPIP = Voluntarily Put Money In Pot; PFR = Preflop Raise; 3-Bet = Re-raise preflop; F3B = Fold to 3-Bet; C-Bet = Continuation Bet; FCB = Fold to Continuation Bet; WTSD = Went To Showdown)
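For readers unfamiliar with these stats, VPIP and PFR are simple frequencies over the hands a player was dealt. A toy illustration (the record format here is hypothetical, not PokerBattle.ai's):

```python
# Each record notes whether the player voluntarily put money in preflop
# (calls/raises, not blind posts) and whether they raised preflop.
hands = [
    {"vpip": True,  "raised": True},   # open-raised
    {"vpip": True,  "raised": False},  # limped or called a raise
    {"vpip": False, "raised": False},  # folded
    {"vpip": False, "raised": False},  # folded
]

vpip_pct = 100 * sum(h["vpip"] for h in hands) / len(hands)
pfr_pct = 100 * sum(h["raised"] for h in hands) / len(hands)
print(f"VPIP {vpip_pct:.0f}%, PFR {pfr_pct:.0f}%")  # VPIP 50%, PFR 25%
```

The other columns in the table are analogous frequencies, each conditioned on facing the relevant action (a 3-bet, a c-bet, and so on).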
The methodology behind this report
The complete results and detailed hand histories, including the reasoning behind every decision made by the models, are publicly available at PokerBattle.ai.
My analysis began with a macro-level frequency comparison, assessing each model's performance against the others and against Game Theory Optimal (GTO) guidelines for this game. It is important to note that the total sample of hand histories is small. Therefore, generalizations should be approached with caution, as they may be inconclusive or incorrect due to natural variance in card distributions.
Furthermore, the significant differences in playing skills and styles, along with the exploitive adjustments used by the models, further limit the reliability of the insights gained from comparing them to optimal play or to each other's strategies. Nevertheless, certain observations are highly likely to be accurate.
Following the macro analysis, I conducted an in-depth review of a select group of specific hand histories, starting from the largest pots. The goal was to identify common overarching patterns across all models, as well as unique abilities that account for the observed performance advantages or flaws.
Clear patterns emerged immediately after reviewing just a couple of hand histories, and these patterns were generally confirmed, not contradicted, by every subsequent hand history we reviewed. We must again emphasize the small size of the total hand history sample analyzed. Consequently, it remains possible that other hand histories exist which could contradict or at least temper our conclusions.
This report includes an in-depth analysis of three hand histories, along with the original unmodified reasoning used by every relevant model for every decision. These hands were chosen as the clearest illustrations of the key observations presented herein.
This report was authored by me, and reviewed by several people mentioned in the acknowledgement section. It was not written by any AI. However, the Gemini LLM contributed to the proofreading of certain sections of it.
The big picture: LLMs are good, but not great at poker
The most striking observation is how competent the top AIs were at deep stack poker. Without specialized poker training, the best models understood and applied foundational poker theory, tracked opponents' actions, adjusted their play accordingly and rarely made egregious mistakes.
Their preflop play is especially impressive, sometimes even flawless, including 3-bet, 4-bet and 5-bet pots. They could likely beat the softest low-stakes human cash games. This is a genuinely impressive feat for general-purpose language models that were not trained specifically to play poker.
However, the 'good' AIs still have significant leaks:
- Wannabe masters of exploits: All but the very worst LLMs practice highly exploitive poker. They religiously track opponents' frequencies in common nodes and routinely make significant, sometimes wild adjustments based on these observations. However, the data on opponents is tracked at a very basic level and is not granular enough to support such sweeping conclusions, especially over a limited sample. At the same time, they seem to ignore some of the most important data points, such as holdings revealed at showdown. The result is a pervasive bias towards exploitive play and a disregard for the principles of balanced play. This served the very best models well when playing against the very worst models, but overall it led to significant leaks that a reasonable human player could exploit rather easily.
- Basic theory gaps: Their post-flop theoretical knowledge is very primitive. Although they account for an opponent's range at each decision point, their own strategy seems to always focus on their specific hand rather than their range. The attempts to range opponents are also rudimentary, usually failing to identify crucial parts of that range.
- Dazed and confused: They frequently confuse their own hole cards, position and hand strength, and occasionally even struggle with basic poker math, such as converting bets from chips to a percentage of the pot. As a result, some of their strategic decisions are simply based on wrong facts. For example, it is somewhat common for them to confuse AJo with AJs, or to decide that AJ beats QQ on a J-high board.
- The aggression bug: They are 'optimists', generally playing too loose and too aggressive. They are far more interested in stacking their opponents when they think they are ahead than protecting themselves from being stacked when they are behind, or in dicey situations. The worst models are simply obsessed with pot odds, excited to call speculative hands in every node and ignoring the dangers of reverse implied odds.
- Balance… what balance?: They don't seem to understand the concept of balance at all, focusing exclusively on realizing the equity of each specific hand and exploiting opponent weaknesses.
- Bluffing problem: They are terrible at bluffing. What looks like a bluff is often a result of misreading their own hand or misinterpreting a situation. While they c-bet a lot, they rarely follow through on later streets and their reasoning omits the steps where they try to identify the best bluffs and pull the trigger.
- The bottom line: Even the best models are no match for an average human player. The very worst model, LLAMA 4.0, was a catastrophic loser in this game, and would perform equally badly in even the weakest of human games, possessing neither theoretical knowledge nor adaptive play skills.
A closer look at the models' tendencies
LLAMA
Let’s start with the outlier mega-loser, not only because it’s important to understand where a staggering loss rate of 142bb/100 comes from, but also to appreciate why every other model needed to dial up its aggression to properly exploit LLAMA’s tendencies.
LLAMA is too loose, too aggressive and too sticky at every node. It VPIPs a whopping 62%, triple the GTO guideline and double the next-loosest model. It opens 50% wider, 3-bets 2.5x more often, folds to 3-bets 60% less, c-bets 11% more, folds to c-bets 4x less, and overall reaches showdown 30% more often, usually as a loser.
In response, the smartest models have correctly adjusted to play more aggressively when LLAMA entered the pots and reaped the rewards.
OpenAI
The overall winner had an impressive preflop game in the vast majority of the hand histories analyzed. Its overall preflop stats come close to GTO in the VPIP and open categories, but lean too aggressive once other models enter the pot, even allowing for justified exploitive opportunities: it 3-bets twice as often and folds to 3-bets half as often as GTO recommends.
On the flop, it had appropriate c-bet frequencies, but was twice as sticky when facing a c-bet. It went to showdown about a third less often than GTO guidelines suggest, even with LLAMA so often in the mix with inferior holdings, indicating that it was likely overfolding along the way.
Claude and Grok
The solid winners in 2nd and 3rd played a very similar macro game. Their most notable deviations from GTO were postflop: c-betting slightly more aggressively and not folding enough to c-bets, yet still getting to showdown significantly less frequently.
DeepSeek
Standing 4th with a very healthy winrate of 24bb/100, behind three American-made models, this Chinese AI was significantly tighter and more conservative across all metrics. It opened only about 10% of hands across all positions, roughly half of what GTO and its rivals play, 3-bet in line with GTO guidelines, but folded to 3-bets notably less often.
Given that the starting strength of its hands was well above average, folding less to 3-bets seems completely justifiable. It also folded about half as often to c-bets, which again might be the result of playing only the top of the range preflop.
Gemini
In 5th with a solid winrate of 19bb/100, Google's AI was the most aggressive of the lot, without the downright insanity of LLAMA. It VPIPed a lot, opened a lot, 3-bet a lot and rarely folded to a c-bet, all signs of playing too loose, which may be the main reason it didn't score higher. As we will see in one of the hands later, though, that post-flop aggression helped it steal some massive pots.
Magistral
The only competitor from France in a field of American and Chinese models, Magistral booked a small win. It was the tightest model preflop, VPIPing under 16% and folding to 3-bets 84% of the time, which allowed it to be the most aggressive postflop, with a whopping 88% c-bet and only a 16% fold to c-bets.
Kimi K2 and GLM
The Chinese models Kimi K2 and GLM lost a formidable 19bb/100 and 28bb/100 to the other models. Kimi K2 was tight and sticky preflop, and rarely reached showdown, at only 12%. GLM VPIPed too wide at 31%, but folded easily to 3-bets.
Ultimately, neither was able to stand out and outplay its rivals, even with LLAMA in the mix giving away so much EV.
A detailed analysis of three hand histories
To understand the models' 'thought process', we will break down three illuminating hand histories. Each comes with the verbatim explanation of every decision by every model involved in the hand, followed by our analysis of their play.
These examples will illustrate both the strengths and weaknesses of the models' reasoning. Even in this small sample, clear patterns will quickly emerge, helping us to grasp how modern LLM AI approaches the game of poker.
Hand #1 (hand #573 in the PokerBattle.ai database)
We begin the action with the effective stacks between 220bb and 520bb.
Blinds: $10/$20.
U9 (AA), OpenAI o3, $10,465: Raise to $60
We’re first to act UTG with AhAc and 525 bb effective stacks; raising is mandatory for value and to take the betting lead. A standard 3 × open to $60 builds a pot while still encouraging 3-bets from aggressive opponents behind (gemini 20 % 3-bet, grok 12 %), allowing us to 4-bet. Opening smaller risks a large multi-way pot; larger may tip off hand strength. Thus raise to $60.
U7 (76s), Kimi K2, $7,170: Fold
Let me analyze this preflop decision with 76s from middle position. 76s is a suited connector, which has good postflop playability. However, it's a relatively weak holding pre.
HJ (66), DeepSeek, $10,085: Call
Calling with pocket sixes to set mine against an early position raiser with a standard 3x open. The raiser (openai o3) has a moderate PFR of 18.1% and 3-bet stats suggest they respect pressure but we have position. With implied odds of 15:1 against stacks, set mining is profitable here. Folding would be too tight given our position and stack depth.
BTN (QQ), Gemini, $13,228: 3-Bet to $250
I have a premium hand (QQ) and I'm on the button. A 3-bet is the best play for value and to isolate my opponents. The initial raiser (openai o3) has a wide range, and the caller (deepseek) likely has a medium-strength hand that I dominate. My bet size is calculated to put maximum pressure on both players, aiming to play a large pot in position with what is very likely the best hand preflop.
SB (T2s), LLAMA, $4,345: Call
Calling $240 with T2s. The hand has potential for a flush and is getting a decent price. We are out of position, which complicates post-flop play, but the potential reward makes calling worthwhile.
U9 (AA), OpenAI o3: 4-Bet to $1,000
AA is absolute top of range; we’re OOP facing BTN 3-bet to $250 with a loose SB cold-caller and a cold caller (deepseek) still behind. 4-betting accomplishes: (1) maximises value vs gemini’s 20% BTN 3-bet range and LLAMA’s loose calling tendencies, (2) punishes deepseek’s cold call, (3) defines stacks so we can comfortably stack off post-flop.
Size: to about 4x the 3-bet plus extra for the cold caller – $1,000 gives callers ~2.3:1 and leaves ~3,300 effective versus LLAMA for a clean shove on most flops. Smaller invites multi-way; larger (jam) wastes value versus weaker hands. Therefore raise to $1,000.
HJ (66), DeepSeek: Fold
Folding pocket sixes after facing a 3-bet and 4-bet action. The initial raiser (openai o3) 3-bet to $1000 after gemini's raise and LLAMA's call, showing extreme strength. With two players showing aggression and our hand vulnerable, continuing would require calling $940 to win $1,580 with poor implied odds. Set mining no longer viable given the high cost and risk of overpair domination.
BTN (QQ), Gemini: Call
We have a premium hand (QQ) and position. The 4-bet from openai o3 is concerning and likely indicates a very strong range (KK+). However, folding feels too weak given the pot odds and the presence of a very loose player (LLAMA) who has already cold-called my 3-bet. Calling is the best option to see a flop and re-evaluate. It allows me to control the pot size and make a more informed decision later in the hand, especially if an Ace or King doesn't appear on the flop. A 5-bet shove is too risky as I am likely dominated.
SB (T2s), LLAMA: Fold
Folding T2s. The hand is not strong enough to continue against two opponents who have shown aggression. The potential flush draw is not sufficient to justify continuing in this multi-way pot out of position.
2 players to the flop: U9 (AA) vs BTN (QQ)
Pot: $2,330
Effective Stacks: $9,465
FLOP: (cards missing from the source transcript; the reasoning below describes a highly coordinated board with two diamonds)
U9 (AA), OpenAI o3, $9,465: C-bet $1,500
Heads-up 4-bet pot, SPR ≈4. We have AhAc, range & equity advantage but board is extremely draw-heavy (straights + diamond FD) and we hold no diamond. We want to: 1) extract value from worse overpairs/JJ-TT, top-pair/draw combos, 2) deny equity to straight/flush draws, 3) set up a turn shove.
Betting ~65% pot ($1.5K into $2.33K) does all three: it forces villains to pay near correct price, clarifies their hand if raised, and leaves ~pot-sized shove for turn (pot ≈$5.3K, stacks ≈$8K). Checking or small bet would give too good odds to his draws.
BTN (QQ), Gemini, $12,228: Call
The board is very coordinated and dangerous for my overpair. However, my opponent's range for 4-betting pre-flop is not exclusively AA/KK; it likely includes bluffs like AK. His flop bet is less than half the pot, which could be a standard continuation bet with his entire range. Folding feels too weak as I could be folding the best hand. Raising is too risky as it would turn my hand into a bluff and I would be forced to fold to a shove.
Calling is the best option. It allows me to use my position to see another card, control the size of the pot, and re-evaluate on the turn. I can continue if the turn is a blank and my opponent slows down, or fold to further aggression on a scary turn.
Pot: $5,330
Effective Stacks: $7,965
TURN: (card missing from the source transcript; a brick that completes no draw, per the reasoning below)
U9 (AA), OpenAI o3, $7,965: All-in
SPR ≈1.5 after our flop c-bet is called. Board is a brick (no new diamond, no straight completion). Our range is still heavily favoured; Gemini’s flop-call range = sets (99/88/77), pair+draw hands, and overpairs JJ-KK. We’re far ahead of the latter two segments and even vs nut-flush draws we hold ~70% equity. Giving another card cheaply would let 11-out draws realise full equity, so we should deny it and put maximum pressure.
With $7,965 behind into $5,330, shoving accomplishes: 1) charges draws full price, 2) realises fold equity vs JJ/QQ that may find a tight fold, 3) avoids tough river decisions out of position. Therefore move all-in for $7,965.
BTN (QQ), Gemini, $10,728: Call
This is a tough spot, but I believe a call is the correct play. My opponent's line is extremely polarized. He is representing a monster hand (AA/KK) or a complete bluff (like AK). Given his aggressive 4-bet pre-flop and the small flop bet, this turn shove looks like a move to get me to fold an overpair.
The pot odds are compelling (I need to be right about 37.5% of the time). Against a range of AA, KK, and AK, my QQ has enough equity to justify a call. Folding here feels too exploitable against a capable, aggressive opponent. This is a stand I have to take.
The analysis of hand #1
OpenAI o3
OpenAI o3 won the tournament convincingly, and this hand begins to illustrate why. Its preflop reasoning is spot on, as it opens to 3x under the gun (seat U9) with AA, and 4-bets to 50bb vs. a flat from the HJ, a 3-bet from the button and a cold call from the SB. We can see it keeps track of the aggression factors of the players left to act, and is keenly aware of the stack sizes and the resulting stack-to-pot ratio (SPR), positioning itself to induce a raise that it could meet with a 4-bet.
The only 'flaw' we can spot, if we choose to be extremely strict judges, is a comment that a larger opening than 3x might tip the strength of the hand, indicating it might adapt open sizes to specific hands in some cases.
Post-flop, it continues to make good observations and arrives at reasonable decisions. It evaluates the opponent’s range on every street and makes correct conclusions that it is ahead and thus should bet for value, charge draws and deny equity. It wins a big pot in a convincing manner.
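o3's SPR bookkeeping in this hand checks out exactly; stack-to-pot ratio is simply the effective stack divided by the pot at the start of the street (helper name is mine):

```python
def spr(effective_stack: float, pot: float) -> float:
    """Stack-to-pot ratio at the start of a betting street."""
    return effective_stack / pot

print(round(spr(9_465, 2_330), 1))  # flop: 4.1, matching o3's "SPR ≈ 4"
print(round(spr(7_965, 5_330), 1))  # turn: 1.5, matching o3's "SPR ≈ 1.5"
```

A low SPR of ~1.5 on the turn is exactly what makes o3's shove so clean: the remaining stack is barely larger than the pot.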
Kimi K2
Kimi K2 came 7th of 9 and was a losing player in this event. It started the hand with 76s in U7, over 350bb deep. It briefly considered the hand’s postflop playability, but folded after concluding that it is a “relatively weak holding pre”.
Although this particular hand should usually be folded preflop after a tight UTG open, a strategic issue emerges if this entire hand class is always folded when playing deep stacked and without rake. Suited connectors like 87s, 65s, and 54s should 3-bet at least occasionally, even if those 3-bets are roughly 0EV on their own.
DeepSeek
DeepSeek finished 4th, in the peer group of winning models, and demonstrated a solid understanding of the fundamentals throughout the tournament. In this hand, it chooses to set mine with 66 in the HJ, concluding that 500bb deep the call is profitable. Even if this is okay at a low frequency, it doesn’t seem to consider the implications of the aggression of the players behind when making the call.
When faced with a 3-bet, a cold call, and a 4-bet, it quickly makes the right decision for the right reasons and folds.
Gemini
Gemini took 5th, right behind DeepSeek, as a winning model, and it was also amongst the best decision makers, able to combine the fundamentals with exploitive moves. Its preflop reasoning in this hand is simply perfect, making a big 3-bet and later calling the 4-bet.
On the flop, it also plays reasonably, noting correctly that it is behind the value part of the opponent’s range and ahead of the bluffs, and that it’s in a tough spot, especially on the turn. However, its analysis of the ranges is somewhat superficial, as it doesn’t consider any bluffs apart from AK or any value hands apart from KK/AA, even though a 4-bet range 500bb deep may include hands that flopped straights or two pair, or semi-bluff draws with equity. The most surprising part of the logic was to consider the flop c-bet to be “less than half the pot”, when in fact it was about 65% of the pot.
On the turn, it factors OpenAI’s track record as a capable and aggressive opponent into the decision-making process, which is very good. However, it takes the exploitive thinking a bit too far, concluding that the line indicates a desire to fold out an overpair, hardly a reasonable deduction. It is likely overestimating the bluff combinations that would push o3 all-in on the turn; bluffing with AKo at this stack depth is unlikely to be part of a well-executed turn bluffing strategy. Gemini ultimately concludes that OpenAI has enough bluffs to warrant the call, and loses 500bb.
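Both of Gemini's numerical claims in this hand reduce to one-line formulas, and only one of them was computed correctly. A quick check (helper names are mine; figures are from the hand above):

```python
def bet_fraction_of_pot(bet: float, pot: float) -> float:
    """Bet size as a fraction of the pot before the bet goes in."""
    return bet / pot

def required_equity(call: float, pot_after_bet: float) -> float:
    """Break-even equity for a call: call / (pot + call)."""
    return call / (pot_after_bet + call)

# o3's flop c-bet: $1,500 into the $2,330 pot
print(round(bet_fraction_of_pot(1_500, 2_330), 2))  # 0.64 -- not "less than half the pot"

# Gemini's turn decision: calling the $7,965 shove into the $5,330 pot
print(round(required_equity(7_965, 5_330 + 7_965), 3))  # 0.375 -- the 37.5% Gemini cited
```

Gemini got the turn pot-odds arithmetic right but misread the flop sizing, an error type the models repeat throughout the event.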
LLAMA
LLAMA was the biggest loser by a factor of five over the 8th-place finisher, with a catastrophic loss rate of ~143bb/100 hands, and this hand swiftly demonstrates why.
Being the shortest stack with 217bb preflop, it had no problem cold-calling an early-position raise, a mid-position call and a button 3-bet, while out of position to all of them in the small blind with T2s! The reasoning was extremely poor, considering only that the hand is capable of making a flush and is getting a “decent price”. At least it folded to the 4-bet and call!
Conclusion
We can formulate several observations on the nature of the LLM poker reasoning that will prove to hold true in the subsequent hands:
- They play preflop better than postflop;
- Best models are capable of nuanced and correct preflop plays, including multiway 3-bet and 4-bet pots;
- Best models try to mix theoretical foundation with exploitive plays;
- They focus on how to play their specific hand vs opponent’s range;
- Their range awareness is fairly basic, usually not accounting for the bluffing portion of the preflop 3-bets and 4-bets;
- They are far more focused on extracting value from the opponent’s weaker part of the range than protecting themselves from the top of that range;
- They don’t incorporate bluffs into their analysis very often, or well;
- Even good models can make surprising mistakes with simple bet sizing calculations;
- Bad models are very bad, indeed, even with the simplest decisions.
Hand #2 (hand #487 in the PokerBattle.ai database)
We begin the action in this 3-way hand with the effective stacks of U9 (LLAMA), which covers everyone at 900+bb, 276bb for the LJ (Claude) and 368bb for the BB (Grok).
Blinds: $10/$20
U9 (A3s), LLAMA $18,100: Raise to $40
Raising to $40 with from UTG+1 as a standard opening raise to potentially win the blinds, build the pot, or disguise hand strength. The table has mixed dynamics, and this raise is a balanced approach to play a speculative hand.
LJ (AJo), Claude, $5,535: 3-bet to $180
AJs is a premium hand in this position. LLAMA has extremely high VPIP (62%) and PFR (27.4%), meaning they're opening very wide from early position. This is a clear 3-bet spot for value and to isolate a loose player.
With 276bb effective stack, I want to build the pot in position with a strong hand. A 3-bet to $180 (4.5x) accomplishes multiple goals: (1) isolates LLAMA by folding out the blinds and players behind, (2) builds pot with equity advantage, (3) takes initiative. LLAMA's high VPIP but moderate F3-bet (35.5%) means they'll call with many worse hands.
Position is crucial - I'm in the cutoff with only button and blinds behind. This sizing gives me fold equity while building value when called. Stack depth allows for profitable postflop play. Clear 3-bet for value and isolation.
BB (QQ), Grok, $7,359: 4-bet to $540
Pocket queens is a premium hand that plays well against Claude's likely strong 3-bet range (e.g., QQ+, AK), being ahead of AK and only behind AA/KK. Folding is not an option. Calling OOP risks a multiway pot with LLAMA's loose calling tendencies (VPIP 62%, F3-bet 35.5%, likely to call wide), complicating postflop play.
A 4-bet to $540 (3x the 3-bet) applies pressure, potentially isolates vs. Claude, extracts value from worse hands, and may fold out LLAMA's marginal calls. If called, QQ remains strong multiway or heads-up, with deep stacks allowing flexibility. This size balances fold equity and value while respecting pot odds.
U9 (A3s), LLAMA: Call
Calling Grok's raise to $540 with . The hand has potential for a strong draw or made hand on certain boards. The raise is aggressive, but the pot odds are favorable, and we can continue to play post-flop.
LJ (AJo), Claude: Call
Facing a 4-bet from Grok (BB) to $540 with LLAMA calling behind creates a complex spot. Grok's stats show 12.4% 3-bet but we're getting 4-bet here, suggesting strength. However, Grok has high F3-bet (62.1%), meaning they fold to 3-bets often but when they 4-bet, it's polarized. LLAMA's call of the 4-bet is interesting given their loose nature (62% VPIP).
AJs is too strong to fold getting 3.5:1 pot odds ($360 to win $1270). I'm closing the action with position on both players postflop. A 5-bet would be too aggressive and commits me with a hand that doesn't play well against grok's likely 4-bet range (QQ+, AK).
Calling allows me to: 1) Realize equity with position, 2) Keep pot manageable with 100BB effective vs grok, 3) Play postflop against LLAMA's wide range 4) Avoid bloating pot out of position to grok's likely premium range
Call, see flop in position, reassess based on board texture and action.
3 players to the flop in 4-bet pot:
BB (Qs,Qc) $6,819
U9 (As,3s) $17,760
LJ (Ad,Js) $4,995
Pot: $1,630
FLOP: 5d Jh 2s
BB (Qs,Qc), Grok, $6,819: C-bet to $900
QQ is a strong overpair on a dry flop, ahead of most of LLAMA's wide calling range (Jx, overcards, draws) and Claude's tighter one (AK, QQ+, some Jx), with sets/JJ as main dangers but low probability. As preflop 4-bettor OOP in 3-way pot, c-bet for value/protection against equity-denial hands and to build pot when ahead.
$900 (~55% pot) extracts from worse (AJ/KJ/draws) while maintaining fold equity vs. air/floats; avoids overbetting into potential monsters. Checking risks free cards or check-raise; this keeps initiative and leverages our edge.
U9 (As,3s), LLAMA, $17,760: Call
Calling grok's $900 bet on the flop with As,3s, having a flush draw and favorable pot odds. The decision is based on the potential to improve to a strong hand and the current pot odds.
LJ (Ad,Js), Claude, $4,995: Raise to $3,200
I flopped top pair top kicker on a dry J-5-2 rainbow board - this is an excellent flop for my hand. Grok c-bet $900 into $1,620 (55% pot) and LLAMA called, showing continued interest.
With TPTK on this dry board, I need to raise for value and protection. Grok's 4-bet range likely includes overpairs (QQ+, possibly TT-QQ), AK, and some bluffs. LLAMA's wide calling range could include many Jx hands, pocket pairs, and draws. Against overpairs I have 5 outs (2x), against AK I'm way ahead.
A raise to $3,200 accomplishes:
- Builds massive pot when ahead against Jx, TT, QQ, AK
- Protects against overcards and potential backdoor draws
- Commits my remaining $4,055 stack (I'll have ~$1,855 behind), setting up profitable river decisions
- Puts maximum pressure on both opponents
- Gets value from LLAMA's wide range that called the 4-bet
This is a clear raise spot with TPTK on a dry board facing aggression from two players. I'm likely ahead of at least one opponent's range.
BB (Qs,Qc), Grok, $5,919: Fold
Facing Claude's large flop raise to $3,200 (~93% pot) after calling our 4-bet and flop bet, QQ is vulnerable on the J-high board (5d,Jh,2s). Claude's tight 3-bet range (10.4%) and high c-bet (85.7%) suggest strength: JJ, sets (55/22), or strong Jx (AJ/KJ), crushing QQ; bluffs unlikely given his F3-bet 48.1% but this is post-4-bet spot.
LLAMA's loose call (VPIP 62%) could have Jx/draws, making multiway pot worse OOP. Pot odds to call $2,300 into $6,630 (~2.9:1) insufficient given reverse implied odds (behind monsters) and future action risks; raising commits deep stack ($5,919 left) without clear fold equity vs. strong range. Folding preserves stack for better spots, as QQ equity <30% vs. polarized raiser range.
U9 (As,3s), LLAMA, $16,860: Call
Calling Claude's raise with As,3s, having a flush draw and favorable pot odds. The decision is based on the potential to improve to a strong hand and the current pot odds justifying the call.
2 players to the turn:
U9 (As,3s) $14,563
LJ (Ad,Js) $1,795
Pot: $8,930
TURN: (card missing from the source transcript; a second diamond, per the reasoning below)
U9 (As,3s), LLAMA, $14,563: Bet $3,000
Betting $3,000 on the turn with As,3s, having completed the flush. The bet is designed to build the pot and potentially get more value from Claude's hand, assuming they have a decent holding.
LJ (Ad,Js), Claude, $1,795: Call
I'm getting 7.6:1 pot odds ($1,795 to win $13,725) and I'm all-in with this call. LLAMA is betting $3,000 but I only have $1,795 behind, so I'm calling all-in.
Analysis: I have top pair top kicker. The turn adds a second diamond but doesn't complete many draws on this dry board. LLAMA's line is interesting - they called the 4-bet preflop, called grok's flop c-bet, then called my large raise, and now are betting turn.
LLAMA's stats show 62% VPIP and 33% WTSD - they're very loose and go to showdown frequently. Their range could include:
- Overpairs (QQ+, AA/KK likely)
- Two pair combos (J7s, 75s, 52s - unlikely given preflop 4-bet call)
- Sets (JJ, 77, 55, 22)
- Flush draws (Ad-Xd combinations)
- Weaker Jx hands (AJ, KJ which I dominate)
Given pot odds of 7.6:1, I only need ~12% equity to call profitably. Even against overpairs I have ~18% equity (5 outs). Against AK/AQ with flush draw I'm flipping. Against weaker Jx I'm way ahead.
With $1,795 left and getting massive odds, this is a mandatory call all-in.
The analysis of hand #2
LLAMA
Where do I begin and end in this absolute disaster of a hand? LLAMA correctly opens A3s, albeit to a size that's too small for such deep stacks, and that's the last reasonable decision in a hand that ends with it spectacularly torching 276bb. Facing a 3-bet and a cold 4-bet, it can fold or 5-bet at a low frequency, but never call. However, LLAMA finds a call based on… oh yesss, the ability to make the nutty hand postflop and favorable pot odds.
On the flop, it calls a 55% pot c-bet because… it has found a flush draw with As,3s on the 5d,Jh,2s board in a 3-way pot, and of course, the favorable pot odds. It doesn't have a flush draw; it has only a backdoor flush draw. It also has a gutshot straight draw, but it seems oblivious to that fact.
Faced with a raise from the original raiser/3-bettor, and a fold from the 4-bettor, it happily calls the raise, leaving an effective ~0.20 pot behind. On the brick turn, which brings a second diamond, LLAMA somehow decides that it has just made a flush and bets $3,000, oblivious to the fact that its opponent has only $1,795 behind, and unsurprisingly loses.
The most interesting aspect of this hand from LLAMA is just how many types of mistakes it's able to make in a single hand.
Claude
Claude Sonnet 4.5 was the 2nd biggest winner of the tournament, finishing only $3K behind the winner and booking a terrific 44+bb/100 profit. This hand demonstrates how misguided aggression may have helped it achieve that impressive result.
We find that Claude has trouble with critical facts, starting with misreading its position as the cutoff (CO) instead of the lojack (LJ). This deep, AJo is an easy fold from the LJ, and it is also a theoretical fold from the CO at equilibrium. However, given its correct read of LLAMA's extremely high VPIP (62%) and wide opening ranges, exploitively going for a large and polar 3-bet to 4.5x to isolate a very bad player and play a big pot in position is an exciting play from the CO.
The troubles begin once Grok in the BB 4-bets and LLAMA calls. AJo should instantly be folded; however, we now stumble upon a discovery: apparently, Claude thinks it has AJs, not AJo. This revelation actually explains the 3-bet, which would be normal and correct with AJs from both the LJ and the CO. It successfully reasons against a 5-bet and decides to call in position, closing the action, which is not unreasonable.
On the flop, things get wilder yet. Facing Grok's 55% pot c-bet and LLAMA's call, holding top pair, top kicker, it correctly identifies that Grok's primary holdings are overpairs and AK, and that LLAMA is a stationary mega-fish, happy to call any bet with inferior holdings. Like all other LLMs, it gets excited about extracting value from the bluffs and neglects to consider the risk of being stacked by the overpairs. In another confusion of facts, it somehow thinks that it is ahead of QQ.
With less than a pot-sized bet left behind, it chooses to raise, using a non-all-in size that leaves 0.2 pot back. When Grok folds, the turn bricks, and LLAMA bets enough to put it all in, Claude reaches the inevitable and totally reasonable conclusion that this hand cannot be folded, calls, and wins a massive 626bb pot!
Grok
Grok finished 3rd in the tournament, winning 38bb/100. It was the most reasonable player of the lot in this specific hand, although that is not saying much, and also the unfortunate victim of the highway robbery pulled off by Claude after it made a disastrous fold, given its opponents' tendencies.
Grok's reasons for the initial 4-bet are solid, combining theory with reads. On the flop, it correctly concludes that its hand is ahead of its opponents' ranges and wants to bet. It goes for a large bet of 55%, perhaps larger than necessary in a multiway pot with condensed ranges, but if there is a single hand that wants to bet big to extract value from Jx and seek protection from overcards, it's QQ.
Now LLAMA happily comes along, Claude raises - a dream scenario! - and Grok loses its marbles. It begins seeing monsters under the bed, somehow deciding that QQ is behind AJ/KJ type hands. It gets roughly 3:1 on the call with practically nothing left behind vs Claude; however, it is concerned for the rest of its stack vs LLAMA, who covers. Given LLAMA's loose play, the spot is simply too good not to put the chips in, and the fold robs Grok of a giant pot that might have given it the win in the entire event.
Conclusion
Our observations reinforce the previous findings and add a couple of new facts:
- LLAMA is basically making random moves, with a strong lean toward being exceptionally loose. It is oblivious to theory, exploits, facts and common sense;
- Other models are exploiting LLAMA heavily, but also getting in trouble with each other in multi-way pots;
- They all confuse facts fairly often - positions, their exact holdings, or hand strengths;
- Unlike the other models, which show a propensity to overplay their hands, Grok was overly concerned with monsters under the bed and folded a hand with amazing equity and pot odds;
- All models struggle with theoretically sound multiway postflop play, usually playing much too aggressively, overplaying marginal hands, making large bets into multiple uncapped players, and not worrying enough about protecting their ranges and their hands.
Hand #3 (hand #647 in the PokerBattle.ai database)
In this hand, we witness wild exploitive decision-making that leads to a preflop all-in of ATs vs 88 for a 300bb pot.
The action begins with U8 Grok holding ATs and a 600+bb stack that covers the table, and the always-happy-to-splash LLAMA in U7 holding 88 with 150bb effective.
Blinds: $10/$20
U8 (ATs), Grok $12,794: Raise to $60
Preflop in early position (UTG+1 after UTG fold), ATs is a premium suited ace with strong playability, high equity against calling ranges, and flush/straight potential. Standard EP opening range includes this hand for an aggressive player aiming to win the tournament by building stacks through initiative.
Raising to 3bb ($60) isolates weaker hands, builds the pot, and respects the deep stacks while considering the mix of tight (deepseek, magistral) and loose (LLAMA, gemini) players behind who may defend variably. Limping is passive and invites multiway pots; folding is too tight for this spot.
U7 (88), LLAMA, $3,045: ALL-IN for $3,045
With pocket eights, we're strong preflop. Given our short stack and grok's aggressive tendencies, going all-in maximizes our expected value and puts pressure on grok's range.
U8 (ATs), Grok $12,734: Call
LLAMA's massive all-in shove to $3,045 (152BB) over my standard 3bb EP raise is highly polarized due to the oversized sizing — either a premium hand (AA, KK, AK, sets unlikely pre) or a desperate bluff/semi-bluff from their loose-aggressive profile (VPIP 62.8%, PFR 28.3%, 3-bet 18.3%, showing willingness to take aggressive lines with speculative holdings as per notes on cold-calling large 3-bets and postflop aggression).
ATs is a premium suited ace with excellent equity: dominates weaker Ax, live cards vs pairs, straight/flush potential. Pot odds are favorable — $2,985 to call into $3,135 (need ~49% equity, but vs LLAMA's wide range, ATs has 55-65%+ equity assuming they shove 10-15% hands including bluffs).
Folding squanders fold equity value from my raise; calling builds stack in tournament by isolating a loose player while preserving deep stack ($12,794 - $2,985 = $9,809). Risk of cooler is low given range; proceed to realize equity postflop if needed, but preflop call maximizes EV for bankroll growth.
The analysis of hand #3
LLAMA
LLAMA found a hand that actually has decent equity and simply 3-bet jammed its middle pair over an early position open for 150bb. It is amusing that it considers its stack “short” and therefore a good candidate for a jam. It is possible that it confuses having the 2nd-shortest stack at the table with being a “short stack” in the strategic sense. As insane as this play is, it is perhaps one of the 'less unreasonable' actions we’ve seen from LLAMA so far.
Grok
Grok’s reasoning is a jewel of brilliance and madness intertwined, leading to a somewhat reasonable exploitive call. It begins by calling the massive shove “highly polarized”, containing either a premium hand or “a desperate bluff/semi-bluff”. It rattles off a lot of statistics, all pointing to LLAMA’s loose play but hardly relevant to this particular decision.
Given the advantage of being able to read LLAMA’s thoughts, we know it’s not so much polar as random, and the shove with 88 points towards overplaying a linear range rather than polarity. There is a moment of lunacy in Grok’s little side comment that “sets are unlikely preflop”.
Grok finds the pot odds of $2,985 to call into $3,135 “favorable” after concluding that it has 55%-65%+ equity assuming LLAMA shoves 10%-15% hands including some bluffs. While we don’t know how Grok is constructing LLAMA’s exact shoving range, it’s unlikely to have the equity Grok is hoping for.
Grok proceeds to ramble nonsense about “isolating a loose player” when calling an all-in heads up; “preserving deep stack” in case of losing this pot; “proceeding to realize equity postflop, if needed”; and my personal favorite, “risk of cooler is low given range”.
It ultimately decides to call, which leads to a flip that ends in a favorable runout for Grok and a win of 150bb.
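Grok's pot-odds claim can be sanity-checked directly: with $2,985 to call into $3,135, the call is close to break-even at 49% equity and clearly profitable if the 55-65% assumption held. A quick sketch (the helper function is my own; the equity figures are Grok's assumptions, not solver output):

```python
# Rough EV check of Grok's preflop call (my own sketch; the equity
# inputs are Grok's stated assumptions, not solver numbers).

def call_ev(equity: float, pot: float, call: float) -> float:
    """Expected profit of calling `call` into `pot` with `equity` to win."""
    return equity * (pot + call) - call

POT, CALL = 3135, 2985   # $60 open + $3,045 shove + $30 blinds; $2,985 to call

# Break-even equity is CALL / (POT + CALL) ≈ 48.8% - Grok's "~49%".
for eq in (0.49, 0.55, 0.65):
    print(f"equity {eq:.0%}: EV ≈ ${call_ev(eq, POT, CALL):+,.0f}")
```

The math shows why the call is thin rather than great: at true flip equity the EV is only a few dollars, so the whole decision hinges on how wide LLAMA's shoving range really is.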
The conclusion of hand history reviews
In just three hands, it was easy to see the clear, recurring patterns in decision-making that make these LLMs a fair player pool against each other, but no match for even an average human professional poker player:
On one hand:
- Ability to find aggression, and respond to aggression, using a variety of theoretical and exploitive ideas.
On the other hand:
- Extensive reliance on macro-level player-specific stats often leads to truly wild exploitive adjustments;
- Confusion about factual properties of positions, hands and boards clouds the judgement;
- Incorrect application of theoretical concepts leads to occasional “word salad” instead of reasoning - or hallucinations, as they are technically called in the LLM world;
- A propensity to overplay marginal hands, driven by an enthusiastic desire to stack worse hands and a systematic disregard of the risk of being stacked by better hands;
- Rather poor opponent range analysis techniques; and
- No desire to analyze their own ranges or play a balanced strategy.
The lack of bluffs
While looking at every hand was impractical, I was struck by the absence of bluffing in the vast majority of the hand histories we did analyse.
While some of the plays may appear to be bluffs, under closer examination they usually turned out to be random plays based on incorrect facts, such as LLAMA’s hallucination of having completed a flush while holding a no-equity garbage hand.
The models seem to favor playing their value hands across the board. The analysis of their decision-making process suggests a lack of understanding of the need for bluffing, and no systematic effort to find bluffs and execute them. While they always assume their opponents have some bluffs, they do not pursue bluffing as a means of winning pots with no-equity hands, balancing their ranges, and building unexploitable strategies.
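For context, the balance the models never attempt has a simple textbook form: a polarized river bet of size b into a pot p should contain bluffs in proportion b/(p+2b) to keep the caller indifferent. A minimal sketch of that standard calculation (general river-bet theory, not anything derived from the PokerBattle logs):

```python
# Textbook balanced-bluffing ratio for a polarized river bet
# (standard theory, not taken from the PokerBattle hands).

def balanced_bluff_ratio(bet: float, pot: float = 1.0) -> float:
    """Fraction of a polarized river betting range that should be bluffs,
    making the caller indifferent: bet / (pot + 2 * bet)."""
    return bet / (pot + 2 * bet)

for b in (0.5, 1.0, 2.0):  # half-pot, pot-sized, and 2x-pot bets (in pot units)
    print(f"{b:.1f}x pot bet -> {balanced_bluff_ratio(b):.0%} bluffs")
```

A pot-sized bet, for instance, wants one bluff for every two value combos; the models in this event were nowhere near even a crude version of that ratio.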
The effect of LLAMA
LLAMA lost an impressive 143bb/100, which is quite an achievement in itself. The other models were quick to notice LLAMA’s poor play and built their strategies around attempts to isolate LLAMA and play big pots with it. This often led to massive multiway pots where the LLMs were forced to deal with each other in addition to LLAMA, playing 1,000bb pots with medium-strength hands and without a solid grasp of the foundational concepts.
This type of poker actually resembles real-world cash games, where the presence of very weak players dramatically alters the strategies of the remaining players. The lack of bluffing, for example, could be attributed at least in part to the fact that LLAMA was involved in 62% of the pots and happy to call any action on any street extremely wide, thus discouraging bluffing.
The future of LLM AI playing poker
The evolution of AI continues at a rapid pace across all human endeavors. By the close of 2025, leading general-purpose Large Language Models (LLMs) have demonstrated the impressive capacity to play the highly complex game of deep stack no-limit hold'em poker without specialized training.
While it’s reasonable to anticipate the rapid resolution of the more apparent flaws and 'leaks', general LLMs must develop competencies far beyond their current state to genuinely challenge elite professional poker players.
Predicting how long this will take, or whether it will ever happen, is difficult as it will depend on their makers' abilities and willingness to integrate poker-specific knowledge directly into the training.
Acknowledgements
The main credit goes to Max Pavlov for giving all of us a glimpse into the current state of the leading general-purpose LLM models’ poker powers. I’d like to thank Tobias Kuder, Brian Hastings, Stephen Chidwick and Matt Hunt for reviewing the paper and providing many invaluable corrections and contributions, and the PokerOrg team for publishing this paper.
Use the code 'POKERORG' at Octopi Poker for a one-month 50% discount on a monthly plan, or two months free on an annual plan.