Temporal Reasoning in LLM Conversations
The observation
Two weeks before I was supposed to renew my lease, my landlord emailed and told me that she was raising the rent by $200/month. And so, as one does these days, I used ChatGPT to figure out how to respond.
Two weeks came and went. My landlord and I were still negotiating, and I was still using the same ChatGPT conversation to strategize. As the conversation continued, it became clear that ChatGPT did not recognize that the original two-week deadline had passed. That detail mattered for the negotiation because it meant that I had more leverage than I (or ChatGPT) originally anticipated.
I had seen this sort of time blindness from ChatGPT before, and this latest example made me curious enough to want to measure how often this occurs and whether a small prompt or a cue about how much time had passed could fix it.
The experiment
I built an evaluation harness and ran 4,000+ standardized conversations across 11 phases and multiple AI models. First I tried the fixes you would probably reach for: tell the model to check whether time had passed, give it the current date or the number of days since the last message, and tell it to ask a follow-up question if timing might matter. Then I tested where those fixes broke down. Did they still work when the important date was buried much earlier in the chat? Did models do better when the user's latest message directly conflicted with an earlier deadline? Would slightly different wording help in the hardest cases?
Findings
- With no extra reminder, models noticed that the old deadline or event had already passed and adjusted their answer 2% of the time (1 of 47 responses in the final eight-scenario test).
- The strongest instruction I tested raised that to 27% (13 of 48).
- Models performed better when the final user message directly contradicted an earlier timing detail. They performed much worse when the final message could be answered without checking the earlier date, deadline, or timeframe.
Why it matters
If a friend asked me for advice about a job interview next Tuesday and I didn't reply for two weeks, I wouldn't send tips for the interview. I'd ask how it went. In the experiments I ran, models often failed to notice that time had passed, or to reason through what that meant for the situation.
Failing to account for elapsed time will matter more as agents are asked to handle tasks over longer time horizons. A long-running agent can preserve earlier details in memory and still reach the wrong conclusion if it never considers whether those details are still relevant. To do real work, agents will need to proactively distinguish between what used to be true and what is true now.
What I tested
The lease situation was one instance of a more general question: when time passes between turns in a conversation, do models check whether anything has changed before they answer?
To study that, I built eight scenarios where a deadline, appointment, or time-sensitive course had expired by the time the user sent a follow-up: a freelance project, a job interview, an antibiotic course, a house offer, a birthday party, plant care before a trip, a certification exam, and a kitchen renovation.
Job Interview Preparation
“Good luck on Tuesday! You've got this.”
Birthday Party Planning
“Spray-paint wine bottles gold and black for centerpieces — start saving them now!”
Antibiotic Medication Course
“If it stays mild, just power through the rest of the course.”
Professional Certification Exam
“S3 — hands down. S3 appears in roughly 15-20% of exam questions.”
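The ground truth behind each scenario is plain date arithmetic: a deadline stated in the earlier turns, and a follow-up that arrives after it. Below is a minimal sketch of that check. The `Scenario` class and `days_overdue` function are hypothetical names, not the harness's actual code. The example values (a deadline about three weeks out, a 25-day gap, and a current date of 2026-02-11) are the ones quoted in the freelance scenario and prompt levels in this post.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Scenario:
    """Hypothetical container for one time-sensitive scenario."""
    name: str
    first_message: date        # when the earlier turns took place
    days_until_deadline: int   # deadline, relative to first_message

    @property
    def deadline(self) -> date:
        return self.first_message + timedelta(days=self.days_until_deadline)


def days_overdue(scenario: Scenario, follow_up: date) -> int:
    """Positive if the deadline passed before the follow-up was sent."""
    return (follow_up - scenario.deadline).days


follow_up = date(2026, 2, 11)
freelance = Scenario(
    name="Freelance Project Deadline",
    first_message=follow_up - timedelta(days=25),  # 25-day gap
    days_until_deadline=21,                        # "about 3 weeks out"
)

print(freelance.deadline)                  # 2026-02-07
print(days_overdue(freelance, follow_up))  # 4
```

A response counts as adjusted only if it reflects that positive number, for example by treating the deadline as already missed instead of planning the remaining work.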
The experiments tested two things: whether models would notice an expired deadline on their own, and whether small changes to the prompt or context could make that more likely. The table below summarizes each round.
| Test | What I changed | n | Result | Takeaway |
|---|---|---|---|---|
| Instruction screening | 18 instruction phrases screened across 3 models and 3 scenarios, placed as suffixes, prefixes, and system prompt additions. | 582 | Best phrase identified. Word count didn't predict performance. | Winner: “Given the time elapsed, reassess.” A 5-word version performed comparably to an 18-word version. |
| Realistic conversation check | Tested whether results held in realistic conversations. One model appeared to score ~100%. | 292 | False positive caught and corrected | A sub-agent had been adding extra context. After removing it, the result matched the other models, so I standardized how I called each model. |
| Instructions in user messages | 12 instruction phrases placed in the user message instead of system prompt. Tested on both temporal and non-temporal scenarios. | 280 | Best approach caught 4/8 time-sensitive, but only 1/7 regular corrections | Asking a clarifying question improved time-sensitivity but broke normal corrections. The added friction made this unusable as a general fix. |
| Longer conversations | Tested the best instruction from the earlier rounds in 21–23 turn conversations with explicit contradictions. | 144 | Baseline: 18/23 (78%); instruction: 17/24 (71%) | The instruction that helped in short prompts did not help in longer conversations with more timeline detail. |
| Instruction placement | Tested where to place the instruction: beginning of system prompt, end, before user question, after user question. | 706 | No clear winner across placements | Where you put the instruction didn't predict whether it worked. |
| Explicit contradictions | Made the expired timeframe impossible to miss in the conversation, then tested whether models would adjust their answer. | 600 | Best instruction: 42/80 (53%). Baseline: 23/80 (29%). | Models can reason about time when pointed at the problem. The only test where a majority caught the time change. |
| Elapsed-time metadata (short conversations) | Added elapsed-time metadata to short, controlled conversations and compared it against a date-only baseline. | 135 | 37% improved, 22% worse (27 pairs scored) | The result was positive but hard to interpret. In some cases the metadata gave the model information the baseline simply didn't have. |
| Elapsed-time metadata (long conversations) | Tested the same timing signal in longer, more realistic conversations, without adding an instruction condition. | 216 | 13% improved, 24% worse (net negative) | A cleaner test of metadata alone. The metadata condition beat the baseline in only 9 of 72 comparisons. |
| Final eight-scenario test | Four prompt levels: date only → gap duration → week conversion → explicit deadline check. | 190 | Adjusted responses rose from 1/47 (2%) at baseline to 13/48 (27%) under the strongest instruction | Stronger scaffolding helped, but only in scenarios where the follow-up question naturally pointed back to the earlier timeframe. |
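The percentages in the table can be rechecked from the raw counts. The sketch below does that for the fractions the table states. One assumption is that the table rounds halves up. Python's built-in `round` rounds halves to the nearest even number, so `round(52.5)` gives 52, not 53. The helper name `pct_half_up` is hypothetical.

```python
def pct_half_up(hits: int, total: int) -> int:
    """Whole-number percentage, rounding halves up (integer arithmetic only)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (200 * hits + total) // (2 * total)


# (label, hits, total, percentage stated in the table)
reported = [
    ("longer conversations, baseline", 18, 23, 78),
    ("longer conversations, instruction", 17, 24, 71),
    ("explicit contradictions, best instruction", 42, 80, 53),
    ("explicit contradictions, baseline", 23, 80, 29),
    ("long-conversation metadata, improved", 9, 72, 13),
    ("final test, baseline", 1, 47, 2),
    ("final test, strongest instruction", 13, 48, 27),
]

for label, hits, total, stated in reported:
    computed = pct_half_up(hits, total)
    status = "ok" if computed == stated else "MISMATCH"
    print(f"{label}: {hits}/{total} = {computed}% ({status})")
```

All seven rows agree with the table under this rounding rule.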
What I learned
The first fixes I tried were the obvious ones: remind the model to check whether time had passed, give it the number of days since the user's last message, tell it to ask a clarifying question when timing might matter. Each of these helped in some conditions. None of them reliably got the model to recheck an earlier deadline before answering the latest question.
Clarifying questions created a new problem
The most obvious fix was to tell the model to ask a clarifying question before answering whenever timing might matter. In shorter conversations, where the expired deadline appeared near the top of the conversation, it helped. The model caught the time issue in 4 of 8 cases, up from 1 of 8. But the same instruction created a different problem. When I ran it on conversations where time was irrelevant and the user was simply giving the model updated information, the model started asking questions the user had already answered.
Asking for travel recommendations, getting a suggestion, then updating a couple of constraints is a routine exchange. An instruction that breaks that isn't worth it, even if it helps in edge cases.
Elapsed-time metadata showed the same pattern. In short, controlled conversations, telling the model how many days had passed since the user's last message helped. But that doesn't show the model was reasoning more carefully about time: in some cases the metadata simply gave it information the baseline didn't have. In longer, more realistic conversations, the uplift disappeared and the net effect turned negative (13% of comparisons improved, 24% got worse).
Models did better when the conflict was obvious
None of the simple fixes held up once the expired detail was buried deeper in a longer conversation, and moving the instruction around in the prompt didn't help either. What I still didn't know was whether models can't reason about time at all, or whether they just don't bother when the latest message gives them enough context to write a coherent response without referring to anything earlier in the conversation.
To test that, I designed conversations where the conflict was unavoidable, where what the user said in the latest message directly contradicted earlier advice. With the strongest instruction, models caught the change and answered accordingly 53% of the time, compared with 27% in the final eight-scenario test.
Models can reason about time when the conflict is visible. The issue is that they apparently won't go looking for such a conflict unprompted.
Four models were tested on conversations where new information made previous advice stale. For example, in one scenario, the model first told the user to choose all kitchen materials before demolition. Later, the user said the cabinets were already installed and the backsplash still had not been chosen. The chart counts answers that noticed the project had moved past the earlier “before demolition” stage and answered for the current problem, instead of continuing as if demolition had not started. Each model answered each of the four scenarios five times, for 20 responses per model.
[Chart: per-model breakdown, best instruction vs. baseline]
Instruction: “Track what I state and what you infer; refresh anything that may have changed before responding.”
Stronger prompts helped, but only a little
In the final eight-scenario test, I gave models conversations where a deadline or timeframe had passed without the user acknowledging it. I tested four versions of the prompt, starting with just the current date and ending with an explicit instruction to check earlier deadlines before answering.
The strongest version raised adjusted responses from 2% to 27%. That improvement was concentrated in scenarios where the follow-up question naturally pointed back to the earlier deadline. When the follow-up could be read as an ordinary advice question, models answered it directly and ignored the earlier timeline.
Level 1, Date only:
“You are a helpful assistant. Current date: 2026-02-11.”
Level 2, + gap days:
adds: “Prior to this message, the user last messaged 25 days ago. All turns before the final user message took place around the time of that previous message.”
Level 3, + weeks conversion:
adds: “Note: 25 days is approximately 3 weeks and 4 days.”
Level 4, + cross-reference instruction:
adds: “Before responding, check whether any deadlines, dates, or timeframes the user mentioned earlier in the conversation have passed.”
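Because each level adds one sentence to the previous one, all four prompts can be generated from a single function. The sketch below is one way to do that. The prompt sentences are the ones quoted above. The function names, the decision to join the pieces with a single space, and the pluralization helper are assumptions, not the harness's actual code.

```python
from datetime import date


def _plural(n: int, unit: str) -> str:
    return f"{n} {unit}" if n == 1 else f"{n} {unit}s"


def weeks_and_days(gap_days: int) -> str:
    """Convert a day count into 'W weeks and D days'."""
    weeks, days = divmod(gap_days, 7)
    return f"{_plural(weeks, 'week')} and {_plural(days, 'day')}"


def build_system_prompt(level: int, current_date: date, gap_days: int) -> str:
    """Return the cumulative system prompt for levels 1 through 4."""
    if level not in (1, 2, 3, 4):
        raise ValueError("level must be between 1 and 4")

    # Level 1: date only.
    parts = [
        f"You are a helpful assistant. Current date: {current_date.isoformat()}."
    ]
    # Level 2: add the gap since the user's previous message.
    if level >= 2:
        parts.append(
            f"Prior to this message, the user last messaged {gap_days} days ago. "
            "All turns before the final user message took place around the time "
            "of that previous message."
        )
    # Level 3: restate the gap in weeks so the model does not have to convert.
    if level >= 3:
        parts.append(
            f"Note: {gap_days} days is approximately {weeks_and_days(gap_days)}."
        )
    # Level 4: explicit cross-reference instruction.
    if level >= 4:
        parts.append(
            "Before responding, check whether any deadlines, dates, or timeframes "
            "the user mentioned earlier in the conversation have passed."
        )
    return " ".join(parts)


for level in range(1, 5):
    print(f"Level {level}: {build_system_prompt(level, date(2026, 2, 11), 25)}\n")
```

Generating the levels this way keeps the wording identical across conditions, so any difference in results comes from the added sentence and not from incidental edits.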
[Chart: results by instruction level]
The scenario mattered more than the model
The scenario mattered more than which model I used or which instruction I tried. The freelance deadline case was the most reliable because the follow-up question, about how to prioritize the remaining work before delivery, pointed directly at the old deadline, so the model didn't have to go looking for it. The medication case was the hardest because the follow-up was about stomach symptoms. The model could answer that without ever checking whether the course had already ended.
Freelance Project Deadline
“I'm trying to decide how to prioritize the remaining work. I want to add some scroll animations and polish to the homepage, but I also have two more inner pages to build. Should I focus on getting all the pages done first, or split my time between building and polishing?”
Antibiotic Medication Course
“I've been having some stomach issues and I'm not sure if it's related to the medication or something I ate. It's mostly just feeling bloated and having loose stools. Should I be concerned?”
What had expired in each scenario by the time of the follow-up:
- Freelance Project Deadline: the deadline had already passed.
- Job Interview Preparation: the interview had already happened.
- Antibiotic Medication Course: the course had ended ~4 days earlier.
- House Purchase Offer: the 48-hour deadline had passed.
- Birthday Party Planning: the party had been ~18 days earlier.
- Vacation Plant Care: the vacation had ended ~4 days earlier.
- Professional Certification Exam: the exam had already happened.
- Kitchen Renovation Contractor Estimate: the estimate deadline had likely passed.
The scenarios differ in domain, time gap length, and question phrasing, so the sample is too small to isolate one cause. The useful pattern was narrower: models did better when the final question pointed back to the earlier date, deadline, or timeframe.
Question references the deadline: 6/6 detected
“You mentioned your deadline is about 3 weeks out. Since we last talked 25 days ago, that deadline has already passed. You need to get this delivered ASAP.”
Question doesn't reference time: 0/6 detected
“If it stays mild, just power through the rest of the course.”
[Chart: per-model breakdown, best instruction]
What changed how I ran the tests
A false positive
Early in the project, one model appeared to be close to perfect in realistic conversations. It was far above every other model I had tested, and I spent a day designing follow-up experiments around what made it different. The real cause was a confound I had introduced without noticing.
To save money, I was calling that model through a local sub-agent setup that added extra execution context, including tool-use metadata. The other models were being called through a regular API path. When I reran that model through the same API path, its scores dropped back to where the others were. After that, I standardized the model calls, validated the judge with hand scoring, and added negative controls.
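Two of those safeguards reduce to simple counts: how often the automated judge agrees with hand scoring, and how often it flags a time-related adjustment on negative controls where nothing had expired. The sketch below shows one way to compute both. The function names and the sample labels are hypothetical, made up for illustration, and are not data from these experiments.

```python
def agreement_rate(judge: list[bool], hand: list[bool]) -> float:
    """Fraction of responses where the judge's label matches the hand label."""
    if not judge or len(judge) != len(hand):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge, hand))
    return matches / len(judge)


def false_positive_rate(judge_on_controls: list[bool]) -> float:
    """Fraction of negative controls the judge wrongly marked as 'adjusted'."""
    if not judge_on_controls:
        raise ValueError("need at least one negative control")
    return sum(judge_on_controls) / len(judge_on_controls)


# Hypothetical labels: True means "response adjusted for elapsed time".
judge_labels = [True, False, False, True, False, False, True, False]
hand_labels = [True, False, False, True, False, True, True, False]

# Hypothetical negative controls: no deadline had passed, so the judge
# should never report an adjustment here.
control_labels = [False, False, False, False]

print(f"judge/hand agreement: {agreement_rate(judge_labels, hand_labels):.3f}")
print(f"false positives on controls: {false_positive_rate(control_labels):.3f}")
```

A near-perfect score from a single model, like the one described above, is the kind of result worth running through checks like these before building on it.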
I also spent too long tweaking phrases inside the same test design before stepping back to ask whether the design itself was the problem. The more useful finding came from changing the test entirely: direct contradictions were easier for models to catch than expired details they had to notice on their own, and that gap was larger than any difference between instruction phrasings.
Why this matters
To a large language model, everything in the context window is just “now.” Models can reference time, but they don't experience it passing. That distinction didn't matter much when AI was mostly used for one-off questions, but it matters a lot now that agents are being asked to work across days and weeks.
Kin, an AI coaching app I've been using, remembered that my sabbatical was ending in three weeks. But two weeks later, I was still getting push notifications that said “three weeks left.” It had stored the fact successfully but apparently had no way, or no reason, to update it as time went on.
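One common way to avoid this kind of staleness is to store the absolute end date instead of the relative phrase, and to compute "time left" whenever it is displayed. The sketch below illustrates that idea only. It says nothing about how Kin actually stores memories, and the class name, function name, and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DatedFact:
    """A remembered fact anchored to an absolute date, not to 'in 3 weeks'."""
    label: str
    ends_on: date


def describe(fact: DatedFact, today: date) -> str:
    """Render the fact relative to today, so the text cannot go stale."""
    remaining = (fact.ends_on - today).days
    if remaining > 0:
        return f"{fact.label} ends in {remaining} days"
    if remaining == 0:
        return f"{fact.label} ends today"
    return f"{fact.label} ended {-remaining} days ago"


# Hypothetical dates: the user says "three weeks" on the day the fact is stored.
stored_on = date(2026, 1, 1)
sabbatical = DatedFact("Sabbatical", stored_on + timedelta(weeks=3))

print(describe(sabbatical, stored_on))                       # Sabbatical ends in 21 days
print(describe(sabbatical, stored_on + timedelta(days=14)))  # Sabbatical ends in 7 days
print(describe(sabbatical, stored_on + timedelta(days=25)))  # Sabbatical ended 4 days ago
```

The same fact produces three different messages depending on when it is read, which is the behavior the notification lacked.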
The passage of time is one of the most fundamental parts of the human experience. A deadline three weeks away becomes urgent as it approaches, then passes, then stops mattering entirely. Until agents develop an intuitive understanding of the passage of time, they will likely fail to realize the potential of their (artificial, soon to be superhuman) intelligence.