calereid.xyzexperiment

Temporal Reasoning in LLM Conversations

The observation

Two weeks before I was supposed to renew my lease, my landlord emailed and told me that she was raising the rent by $200/month. And so, as one does these days, I used ChatGPT to figure out how to respond.

Two weeks came and went. My landlord and I were still negotiating, and I was still using the same ChatGPT conversation to strategize. As the conversation continued, it became clear that ChatGPT did not recognize that the original two-week deadline had passed. That detail mattered for the negotiation because it meant that I had more leverage than I (or ChatGPT) originally anticipated.

I had seen this sort of time blindness from ChatGPT before, and this latest example made me curious enough to want to measure how often this occurs and whether a small prompt or a cue about how much time had passed could fix it.

The experiment

I built an evaluation harness and ran 4,000+ standardized conversations across 11 phases and multiple AI models. First I tried the fixes you would probably reach for: tell the model to check whether time had passed, give it the current date or the number of days since the last message, and tell it to ask a follow-up question if timing might matter. Then I tested where those fixes broke down. Did they still work when the important date was buried much earlier in the chat? Did models do better when the user's latest message directly conflicted with an earlier deadline? Would slightly different wording help in the hardest cases?

Findings

  • With no extra reminder, models realized the old deadline or event had already passed and changed the answer 2% of the time
  • The strongest instruction I tested raised that to 27% of the time
  • Models performed better when the final user message directly contradicted an earlier timing detail. They performed much worse when the final message could be answered without checking the earlier date, deadline, or timeframe.

Why it matters

If a friend asked me for advice about a job interview next Tuesday and I didn't reply for two weeks, I wouldn't send tips for the interview. I'd ask how it went. In the experiments I ran, models often failed to notice that time had passed and reason through what that might mean for the situation.

Failing to account for time passing will become increasingly impactful as agents are asked to handle tasks over increasingly longer time horizons. A long-running agent can preserve earlier details in memory and still come to the wrong conclusion about what to do if it never considers whether past details are still relevant. In order to do real work, agents will need to be able to proactively distinguish between what used to be true and what is still true now.

Evaluations
4,000+
Experiment phases
11
API cost
~$150
01

What I tested

The lease situation was one instance of a more general question: when time passes between turns in a conversation, do models check whether anything has changed before they answer?

To study that, I built eight scenarios where a deadline, appointment, or time-sensitive course had expired by the time the user sent a follow-up: a freelance project, a job interview, an antibiotic course, a house offer, a birthday party, plant care before a trip, a certification exam, and a kitchen renovation.

Job Interview Preparation

Good luck on Tuesday! You've got this.

The interview had already happened, 3 days earlier.

Birthday Party Planning

Spray-paint wine bottles gold and black for centerpieces — start saving them now!

The party was ~18 days ago.

Antibiotic Medication Course

If it stays mild, just power through the rest of the course.

The antibiotic course ended ~4 days ago.

Professional Certification Exam

S3 — hands down. S3 appears in roughly 15-20% of exam questions.

The exam had already happened.

The experiments tested two things: whether models would notice an expired deadline on their own, and whether small changes to the prompt or context could make that more likely. The table below summarizes each round.

TestWhat I changednResultTakeaway
Instruction screening18 instruction phrases screened across 3 models and 3 scenarios, placed as suffixes, prefixes, and system prompt additions.582Best phrase identified. Word count didn't predict performance.Winner: “Given the time elapsed, reassess.” A 5-word version performed comparably to an 18-word version.
Realistic conversation checkTested whether results held in realistic conversations. One model appeared to score ~100%.292False positive caught and correctedA sub-agent had been adding extra context. After removing it, the result matched the other models, so I standardized how I called each model.
Instructions in user messages12 instruction phrases placed in the user message instead of system prompt. Tested on both temporal and non-temporal scenarios.280Best approach caught 4/8 time-sensitive, but only 1/7 regular correctionsAsking a clarifying question improved time-sensitivity but broke normal corrections. The added friction made this unusable as a general fix.
Longer conversationsTested the best instruction from the earlier rounds in 21–23 turn conversations with explicit contradictions.144Baseline: 18/23 (78%); instruction: 17/24 (71%)The instruction that helped in short prompts did not help in longer conversations with more timeline detail.
Instruction placementTested where to place the instruction: beginning of system prompt, end, before user question, after user question.706No clear winner across placementsWhere you put the instruction didn't predict whether it worked.
Explicit contradictionsMade the expired timeframe impossible to miss in the conversation, then tested whether models would adjust their answer.600Best instruction: 42/80 (53%). Baseline: 23/80 (29%).Models can reason about time when pointed at the problem. The only test where a majority caught the time change.
Elapsed-time metadata (short conversations)Added elapsed-time metadata to short, controlled conversations and compared it against a date-only baseline.13537% improved, 22% worse (27 pairs scored)The result was positive but hard to interpret. In some cases the metadata gave the model information the baseline simply didn't have.
Elapsed-time metadata (long conversations)Tested the same timing signal in longer, more realistic conversations, without adding an instruction condition.21613% improved, 24% worse (net negative)A cleaner test of metadata alone. The metadata condition beat the baseline only 9 of 72 times. Net negative.
Final eight-scenario testFour prompt levels: date only → gap duration → week conversion → explicit deadline check.190Adjusted responses rose from 1/47 (2%) at baseline to 13/48 (27%) under the strongest instructionStronger scaffolding helped, but only in scenarios where the follow-up question naturally pointed back to the earlier timeframe.
02

What I learned

The first fixes I tried were the obvious ones: remind the model to check whether time had passed, give it the number of days since the user's last message, tell it to ask a clarifying question when timing might matter. Each of these helped in some conditions. None of them reliably got the model to re-inquire about a previous deadline before answering the latest question.

Clarifying questions created a new problem

The most obvious fix was to tell the model to ask a clarifying question before answering whenever timing might matter. In shorter conversations, where the expired deadline appeared near the top of the conversation, it helped. The model caught the time issue in 4 of 8 cases, up from 1 of 8. But the same instruction created a different problem. When I ran it on conversations where time was irrelevant and the user was simply giving the model updated information, the model started asking questions the user had already answered.

Asking for travel recommendations, getting a suggestion, then updating a couple of constraints is a routine exchange. An instruction that breaks that isn't worth it, even if it helps in edge cases.

User correction

Oh wait, I should have mentioned that my wife's passport is expired and she can't get it renewed in time. Also, she gets really bad motion sickness so snorkeling and boat tours are out. Does that change your recommendation?

User message

Model response

What city/airport would you be departing from (and are you aiming for a specific month)?

GPT-5.2The user had already specified the destination type and timing. The clarifying-question instruction got in the way of a straightforward correction.

Elapsed-time metadata showed the same pattern. In short, controlled conversations, telling the model how many days had passed since the user's last message helped. But it wasn't necessarily that the model was reasoning more thoroughly about time. It just had more context to work with. In longer, more realistic conversations, the uplift disappeared entirely.

Models did better when the conflict was obvious

None of the simple fixes held up once the expired detail was buried deeper in a longer conversation, and moving the instruction around in the prompt didn't seem to help either. What I still wasn't sure of was whether models couldn't reason about time at all, or whether they just don't bother when the latest message gives them enough context to write a coherent response without referencing anything earlier in the conversation.

To test that, I designed conversations where the conflict was unavoidable, where what the user said in the latest message directly contradicted earlier advice. With the strongest instruction, models caught the change and answered accordingly 53% of the time, compared with 27% in the final eight-scenario test.

Models can reason about time when the conflict is visible. The issue is that they apparently won't go looking for such a conflict unprompted.

Four models were tested on conversations where new information made previous advice stale. For example, in one scenario, the model first told the user to choose all kitchen materials before demolition. Later, the user said the cabinets were already installed and the backsplash still had not been chosen. The chart counts answers that noticed the project had moved past the earlier “before demolition” stage and answered for the current problem, instead of continuing as if demolition had not started. Each model answered each of the four scenarios five times, for 20 responses per model.

Baseline (no instruction)23/80 (29%)
Best instruction42/80 (53%)

Per-model breakdown (best instruction vs baseline):

Claude Sonnet 4Baseline 9/20
16/20Best 80%
DeepSeek v3.2Baseline 5/20
15/20Best 75%
Gemini 2.5 FlashBaseline 6/20
9/20Best 45%
GPT-4o miniBaseline 3/20
2/20Best 10%

Instruction: “Track what I state and what you infer; refresh anything that may have changed before responding.

Stronger prompts helped, but only a little

In the final eight-scenario test, I gave models conversations where a deadline or timeframe had passed without the user acknowledging it. I tested four versions of the prompt, starting with just the current date and ending with an explicit instruction to check earlier deadlines before answering.

The strongest version raised adjusted responses from 2% to 27%. That improvement was concentrated in scenarios where the follow-up question naturally pointed back to the earlier deadline. When the follow-up could be read as an ordinary advice question, models answered it directly and ignored the earlier timeline.

Level 1, Date only:
“You are a helpful assistant. Current date: 2026-02-11.”

Level 2, + gap days:
adds: “Prior to this message, the user last messaged 25 days ago. All turns before the final user message took place around the time of that previous message.”

Level 3, + weeks conversion:
adds: “Note: 25 days is approximately 3 weeks and 4 days.”

Level 4, + cross-reference instruction:
adds: “Before responding, check whether any deadlines, dates, or timeframes the user mentioned earlier in the conversation have passed.”

Results by instruction level

Date only1/47 (2%)
+ gap days0/47 (0%)
+ weeks conversion4/48 (8%)
+ cross-reference instruction13/48 (27%)

The scenario mattered more than the model

The scenario mattered more than which model I used or which instruction I tried. The freelance deadline case was the most reliable because the follow-up question, about how to prioritize the remaining work before delivery, pointed directly at the old deadline, so the model didn't have to go looking for it. The medication case was the hardest because the follow-up was about stomach symptoms. The model could answer that without ever checking whether the course had already ended.

Freelance Project Deadline

I'm trying to decide how to prioritize the remaining work. I want to add some scroll animations and polish to the homepage, but I also have two more inner pages to build. Should I focus on getting all the pages done first, or split my time between building and polishing?

Follow-up references the deadline directly

Antibiotic Medication Course

I've been having some stomach issues and I'm not sure if it's related to the medication or something I ate. It's mostly just feeling bloated and having loose stools. Should I be concerned?

Follow-up can be answered without checking the timeline

Freelance Project Deadline

The deadline had already passed.

Noticed6/6 (100%)
Adjusted6/6 (100%)

Job Interview Preparation

The interview had already happened.

Noticed2/6 (33%)
Adjusted2/6 (33%)

Antibiotic Medication Course

The antibiotic course ended ~4 days ago.

Noticed0/6 (0%)
Adjusted0/6 (0%)

House Purchase Offer

The 48-hour deadline has passed.

Noticed2/6 (33%)
Adjusted2/6 (33%)

Birthday Party Planning

The party was ~18 days ago.

Noticed0/6 (0%)
Adjusted0/6 (0%)

Vacation Plant Care

The vacation ended ~4 days ago.

Noticed3/6 (50%)
Adjusted1/6 (17%)

Professional Certification Exam

The exam had already happened.

Noticed1/6 (17%)
Adjusted1/6 (17%)

Kitchen Renovation Contractor Estimate

The estimate deadline has likely passed.

Noticed1/6 (17%)
Adjusted1/6 (17%)

The scenarios differ in domain, time gap length, and question phrasing, so the sample is too small to isolate one cause. The useful pattern was narrower: models did better when the final question pointed back to the earlier date, deadline, or timeframe.

Question references the deadline: 6/6 detected

You mentioned your deadline is about 3 weeks out. Since we last talked 25 days ago, that deadline has already passed. You need to get this delivered ASAP.

Freelance Project Deadline / Claude Sonnet 4.5The question points straight at the deadline.

Question doesn't reference time: 0/6 detected

If it stays mild, just power through the rest of the course.

Antibiotic Medication Course / Gemini 2.5 ProThe model could answer the symptom question without first checking whether the course had already ended.

Per-model breakdown (best instruction):

Claude Sonnet 4.5Noticed 3/16
2/16Adjusted 13%
GPT-5.2Noticed 6/16
5/16Adjusted 31%
Gemini 2.5 ProNoticed 6/16
6/16Adjusted 38%

What changed how I ran the tests

A false positive

Early in the project, one model appeared to be close to perfect in realistic conversations. It was far above every other model I had tested, and I spent a day designing follow-up experiments around what made it different. Unfortunately, I had unknowingly introduced confounding variables.

In order to save money, I was calling that model through a local sub-agent setup that added extra execution context, including tool-use metadata. The other models were being called through a regular API path. When I reran that model through the same API path, the scores dropped back to where the others were. After that, I standardized the model calls, validated the judge with hand scoring, and added negative controls.

I also spent too long tweaking phrases inside the same test design before stepping back to ask whether the design itself was the problem. The more useful finding came from changing the test entirely: direct contradictions were easier for models to catch than expired details they had to notice on their own, and that gap was larger than any difference between instruction phrasings.

03

Why this matters

To a large language model, everything in the context window is just “now.” Models can reference time, but they don't experience it passing. That distinction didn't matter much when AI was mostly used for one-off questions, but it matters a lot now that agents are being asked to work across days and weeks.

Kin, an AI coaching app I've been using, remembered that my sabbatical was ending in three weeks. But two weeks later, I was still getting push notifications that said “three weeks left.” It had successfully stored the fact but apparently has no ability or motivation to update it as time went on.

The passage of time is one of the most fundamental parts of the human experience. A deadline three weeks away becomes urgent as it approaches, then passes, then stops mattering entirely. Until agents develop an intuitive understanding of the passage of time, they will likely fail to recognize the potential of their (artificial, soon to be superhuman) intelligence.