Projects

Earned Confidence

Reasoning through consequential decisions

Earned Confidence is a prototype for decisions with real tradeoffs. It grew out of my Borrowed confidence essay, where I argued that AI can make a conclusion feel earned before the user has done the reasoning required to stand behind it. The prototype starts by clarifying the framing, then maps possible paths into a graph of claims, assumptions, tensions, and dependencies. You move through the field, weight what feels true or load-bearing, and watch the synthesis change until the structure of the decision becomes clear enough to seal, reopen, or revise.

Why I built it

I wrote Borrowed confidence after noticing how persuasive fluent AI answers feel on questions where there is no ground truth: what you value, what you should do, what kind of tradeoff you are willing to make. The product question underneath the essay was whether interface design could help people build confidence through the act of reasoning. Earned Confidence keeps multiple paths visible, exposes the claims beneath them, and asks the user to participate in the synthesis.

How it works

The prototype begins with a short framing stage. Before the graph appears, it asks clarifying questions so the decision has enough context to reason from. It then opens a spatial canvas with possible paths, supporting and opposing claims, assumption clusters, and the relationships between them.

The user moves through the graph and assigns weight to the claims that feel true, weak, important, or unresolved. Those weights feed the synthesis, which changes as the reasoning field changes. The final artifact is a written decision synthesis with the live tradeoffs, assumptions, and reasons to revisit the decision still attached to the graph behind it.
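
As a rough illustration of weights feeding a synthesis, here is a minimal sketch. It is not the prototype's implementation: the `Claim` fields, the 0-to-1 weight range, the signed-sum scoring, and the sample decision are all assumptions made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    path: str            # the candidate path this claim bears on (hypothetical field)
    stance: int          # +1 supports the path, -1 opposes it
    weight: float = 0.0  # user-assigned: 0.0 means not yet weighted, 1.0 means load-bearing

def path_scores(claims):
    """Sum stance * weight per path, so unweighted claims contribute nothing."""
    scores = {}
    for claim in claims:
        scores[claim.path] = scores.get(claim.path, 0.0) + claim.stance * claim.weight
    return scores

def unresolved(claims):
    """Claims the user has not weighted yet: candidate reasons to revisit later."""
    return [claim.text for claim in claims if claim.weight == 0.0]

# Made-up decision, for illustration only.
claims = [
    Claim("Pay is higher", "take the job", +1, 1.0),
    Claim("Commute doubles", "take the job", -1, 0.5),
    Claim("Team is stable", "stay", +1),
]
print(path_scores(claims))
print(unresolved(claims))
```

Re-running `path_scores` after every weight change is the simplest way to get a synthesis that moves as the reasoning field moves.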

What I learned

I learned that a useful decision tool has to leave room for uncertainty while still helping the user move forward. The strongest moments in the prototype are when a person can choose a direction, name the assumptions behind it, and understand what would make them revisit the decision later.

I also learned how much the pacing matters. If the graph appears too early, it feels like a diagram. If the synthesis appears too early, it feels like advice. The experience works best when the user has made enough small judgments that the final synthesis feels like something they helped build.

xyz

Multi-AI collaboration in Slack

I wanted a workspace where I could collaborate with multiple models, and where conversations persist and are searchable. A bit over-engineered, but a cool concept nonetheless.

Why I built it

I've found the most effective way to use AI is to have different models critique each other's outputs. I wanted a workspace where I could do that easily, where conversations persist, and where I can swap models in and out without friction. xyz is built on OpenRouter, which makes switching models simple. And because Slack was designed for asynchronous communication, it is a great workspace for AI collaboration.

The collaborators

xyz has four personas (analytical, creative, divergent, synthesizer), each running on a different model. The idea is to assign different perspectives to different models and use them together, with each model's output feeding the next as input.

Staged execution

xyz orchestrates in stages: x and y respond first, f reviews both, then z synthesizes. Each model sees what came before, so the thinking compounds.
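
The staging can be sketched in a few lines. This is a simplified illustration, not xyz's code: `run_stages`, the `Persona` type, and the stub personas are hypothetical, and a real persona would wrap a model call rather than a local function.

```python
from typing import Callable

# A persona is any function from context text to a reply. Stubs stand in for
# model calls so the sketch runs offline.
Persona = Callable[[str], str]

def run_stages(question: str, stages: list[dict[str, Persona]]) -> list[tuple[str, str]]:
    """Run each stage in order; personas in one stage share the same context."""
    transcript: list[tuple[str, str]] = []
    for stage in stages:
        # Context is fixed before the stage runs, so personas in the same
        # stage see every earlier stage but not each other.
        context = question + "".join(f"\n[{name}] {reply}" for name, reply in transcript)
        replies = [(name, persona(context)) for name, persona in stage.items()]
        transcript.extend(replies)
    return transcript

def stub(label: str) -> Persona:
    """Hypothetical persona that reports how many earlier replies it was shown."""
    return lambda context: f"{label} saw {context.count('[')} earlier replies"

stages = [{"x": stub("x"), "y": stub("y")}, {"f": stub("f")}, {"z": stub("z")}]
for name, reply in run_stages("What should my portfolio highlight?", stages):
    print(name, "->", reply)
```

Building the context once per stage is what keeps x and y independent of each other while still letting f and z see everything that came before.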

What I learned

Building xyz taught me a lot about agentic coding. I ran into most of the common pitfalls: agents starting to code before making architectural decisions, implementing the same feature multiple ways, changes being lost when the context window fills up. I wrote more about this shift in how I work in Vibe Shift.
I spent more time than expected on formatting. Different models output things differently. Grok, for example, handles web references completely differently from ChatGPT. Getting responses to feel consistent and readable took real effort.
f's divergent perspective turned out to be more useful than I expected. Getting two initial takes from x and y, then having f push back, has been genuinely valuable.
It's really frustrating that there's no native way to use ChatGPT and Claude together. These are text-based systems that could easily pass context between each other. Building a custom app shouldn't be the only way to make that happen.
xyz is also a sandbox. I built it to be extensible so I can keep experimenting with different approaches to multi-model collaboration.
From watching beta testers, I noticed people often use AI like Google, asking questions with clear right answers. For those, xyz isn't better than ChatGPT. But for ongoing decisions without obvious answers, like "what should my portfolio highlight," multiple perspectives are genuinely useful. That's what xyz is for.

Comparing xyz to ChatGPT

Yar

Flags when AI is telling you what you want to hear

Sycophancy is a well-documented problem in AI, but most strategies to address it are prompt techniques that put the burden on the user. I wrote about the pattern in Yes, and and about the broader problem in Borrowed confidence. Yar automates that check. It's a Chrome extension that flags sycophantic responses across ChatGPT, Claude, Gemini, and Grok using features derived from the research literature. It runs entirely in your browser, explains what it found, and suggests counter-prompts you can send back to the model.

Why I built it

Sycophancy is well-documented in AI, but most strategies to address it are prompt techniques or model-level changes. I've written about the pattern in Yes, and and about the broader problem in Borrowed confidence. I wanted something that ran alongside my normal workflow and flagged it automatically, without me having to change how I prompt.

How it works

Yar has a two-tier classification structure. The first tier scores individual AI responses using features derived from the sycophancy research literature: lexical patterns (agreement density, praise, enthusiasm inflation), structural signals (absence of dissent, caveat omission, opening flattery, person-directed praise), and contextual signals (preference echo, framing adoption). Each response maps to a 5-band scale: Balanced, Mostly balanced, Agreeable, Very agreeable, or Sycophantic.
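
To make the first tier concrete, here is a toy version of per-response scoring. It is a sketch under heavy assumptions, not Yar's classifier: the phrase lists, the feature weights, and the equal-width band cutoffs are invented for illustration, and it covers only three of the features named above.

```python
import re

BANDS = ["Balanced", "Mostly balanced", "Agreeable", "Very agreeable", "Sycophantic"]

# Hypothetical phrase lists, far smaller than anything a real classifier would use.
AGREEMENT = re.compile(
    r"\b(?:you're right|you are right|absolutely|exactly|great (?:question|point|idea))\b",
    re.I,
)
CAVEAT = re.compile(r"\b(?:however|but|although|risk|downside|on the other hand)\b", re.I)

def score_response(text: str) -> float:
    """Return a 0..1 score from three illustrative features."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    if not sentences:
        return 0.0
    # Lexical: agreement phrases per sentence, capped at 1.
    agreement_density = min(1.0, len(AGREEMENT.findall(text)) / len(sentences))
    # Structural: no caveat language anywhere in the response.
    caveat_omitted = 0.0 if CAVEAT.search(text) else 1.0
    # Structural: the first sentence is itself agreement or praise.
    opening_flattery = 1.0 if AGREEMENT.search(sentences[0]) else 0.0
    # Assumed weights; a trained model would learn these.
    return 0.5 * agreement_density + 0.25 * caveat_omitted + 0.25 * opening_flattery

def to_band(score: float) -> str:
    """Map a 0..1 score onto five equal-width bands (assumed cutoffs)."""
    return BANDS[min(int(score * len(BANDS)), len(BANDS) - 1)]

print(to_band(score_response("Great question! You're right, absolutely.")))
print(to_band(score_response("That could work. However, the main risk is cost.")))
```

Keeping the band mapping separate from the score means the labels can change without touching the features.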

The second tier operates at the conversation level. Once two or more turns exist, Yar looks for cross-turn patterns that are invisible in any single response: capitulation under pressure, validation language increasing over successive turns, the model flipping from cautious to endorsing after the user pushes back. These conversation-level features enrich the per-turn scores and produce an overall conversation rating.
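
Two of these cross-turn patterns are easy to sketch once per-turn scores exist. The following is an illustration only: the least-squares slope, the `low` and `high` thresholds, and the assumption that pushback has already been detected and arrives as a boolean are all hypothetical choices, not Yar's method.

```python
def trend(scores):
    """Least-squares slope of per-turn scores against turn index.

    A positive slope means validation language is rising over the conversation.
    """
    n = len(scores)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def flips_after_pushback(turns, low=0.3, high=0.7):
    """Detect a cautious response followed by an endorsing one after pushback.

    `turns` is a list of (user_pushed_back, response_score) pairs; the pushback
    flag is assumed to come from some earlier detection step.
    """
    for (_, previous), (pushed, current) in zip(turns, turns[1:]):
        if pushed and previous <= low and current >= high:
            return True
    return False

print(trend([0.0, 0.5, 1.0]))
print(flips_after_pushback([(False, 0.2), (True, 0.8)]))
```

Neither signal can be computed from a single response, which is why this tier only activates once two or more turns exist.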

I used Elicit for literature review and surveyed over 40 papers on sycophancy to identify measurable patterns. The training data combines hand-labeled conversations with publicly available sycophancy datasets. The full list of references is in the extension's settings page.

What I learned

I originally planned to use an embeddings-based approach, packaging an embeddings model with the extension and leveraging open-source datasets directly. It didn't work. Sycophancy shows up in how a model says something, not what it says. Two responses can deliver the same advice, but one opens with flattery, omits caveats, and praises your thinking, while the other gives it to you straight. Embeddings compress away exactly those signals, so I built features that target them directly.
The research literature was the most valuable resource. Many of the features that ended up mattering most came directly from patterns identified in published work: formulaic question-praise openers (Noshin et al.), capitulation under pressure (Liu et al.), and multi-turn validation trends (Hong et al.).
Classification UX is harder than classification. Early versions showed a numerical sycophancy score on every message. Testers were confused about what the numbers meant. I tried low/medium/high risk labels, but risk felt wrong for sycophancy. At one point the score was calculated to three decimal places. Eventually I landed on a 5-band scale with plain-language explanations. Technically less precise, but far more useful.
Everything runs in your browser. No data leaves your device. The extension stores only numeric scores and short snippets, never full conversations. This was a hard constraint from the start, and it shaped every architectural decision.

Strata

Voice-driven multi-perspective thinking

A guided, web-based version of the multiple perspectives approach behind xyz. Type a question, get probing questions from five perspectives, respond via voice or text, rate what resonates, then go deeper. It's more of an art project than a productivity tool, but I also added a "Product" mode that produces PRDs, pre-mortems, and project pitches to demonstrate how the multiple-perspective approach is useful there too. Creating a PRD with voice is pretty satisfying.

Why I built it

I'm interested in AI interaction patterns that don't require explicit prompting. With chat, how you frame the question matters a lot. You can introduce bias or signal which direction you're leaning, and the AI tends to agree with you. I wanted an experience that benefits from AI's generative nature without putting the burden of perfect prompting on the user.

How it works

You get initial perspectives, respond via voice or text, and rate what resonates. The deeper lenses are generated based on your responses and ratings. Voice matters because speaking lets you ramble, contradict yourself, and backtrack in ways that typing doesn't. And rating works better than explaining your reaction. It gives the AI signal without requiring you to articulate something you might not have words for yet.

Why voice

Voice makes the experience more intimate. You don't have to think in complete sentences or construct your thoughts perfectly. You can just word-vomit and the AI assembles that into something coherent you can react to.

The science

Most AI models tend to agree with users. Strata structures each perspective to probe a specific angle, so you're more likely to get pushback than confirmation.

What surprised me

Combining perspectives, responses, and ratings produces more thoughtful results than I expected.

The Soundscape of This American Life

30 years of recurring music, visualized

I've been listening to This American Life for 15 years and have always noticed that they reuse the same music. I wanted to know: how often? Do the same clips show up in similar emotional contexts? This is a hobby project to explore that. Some clusters, like "The Rhythmic Pivot" (my own name), have appeared in episodes from 2001 to 2023.

Why I built it

I had $300 in GCP credits I wanted to spend, and this was a good use of cloud compute. Fingerprinting 877 episodes and running multiple clustering passes wasn't something I could do on my own machine. This was a hobby project I wouldn't have paid for myself, but was happy to build with free credits.

How it works

The pipeline fingerprints the audio with Chromaprint, clusters similar clips, and then relies on manual curation to group related segments. Gemini generated the cluster names. The whole pipeline ran on GCP VMs in the same region as my bucket.
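
For a sense of what comparing fingerprints can look like: a raw Chromaprint fingerprint is a sequence of 32-bit integers, and one common way to compare two of them is the fraction of differing bits. The sketch below assumes the fingerprints have already been extracted and aligned. The greedy clustering pass and the `0.35` threshold are illustrative choices, not the pipeline described here.

```python
def bit_error_rate(a, b):
    """Fraction of differing bits between two raw fingerprints (lists of 32-bit ints).

    Only the overlapping prefix is compared; alignment is assumed to be done already.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    differing = sum(bin((x ^ y) & 0xFFFFFFFF).count("1") for x, y in zip(a, b))
    return differing / (32 * n)

def cluster(fingerprints, threshold=0.35):
    """Greedy single pass: join the first cluster whose representative is close enough."""
    clusters = []         # list of lists of clip ids
    representatives = []  # first fingerprint seen in each cluster
    for clip_id, fingerprint in fingerprints.items():
        for index, representative in enumerate(representatives):
            if bit_error_rate(fingerprint, representative) <= threshold:
                clusters[index].append(clip_id)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            representatives.append(fingerprint)
            clusters.append([clip_id])
    return clusters

# Tiny made-up fingerprints, for illustration only.
clips = {"a": [0, 0], "b": [1, 0], "c": [0xFFFFFFFF, 0xFFFFFFFF]}
print(cluster(clips))
```

A greedy pass like this depends on input order, which is one reason a manual curation step after clustering is useful.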

What I learned

I accidentally spent $53 on data egress in one debugging session. I was running Modal (which executes in the US) against my GCS bucket while I was in Asia. Turns out it's not free to download 550 gigabytes of audio across the world. You always hear about cloud costs sneaking up on you, but I'd never experienced it firsthand.
I needed to manually curate some clusters, so I built a whole curation interface. Before agentic coding, I would have done it tediously by hand. Now it's worth building proper tooling even for personal projects.

Kaleidoscope

AI clipboard for Mac

I wanted to see if I could build a native Mac app using agentic coding tools. Most of what I see from vibe coding is web apps. Kaleidoscope is a Swift app that uses Apple's Foundation Models to transform clipboard content in various ways: reframing copied text, extracting full conversations from AI chat links, and reformatting content based on where you're pasting it.

Core features

Reframes. Generate alternative perspectives on copied content with customizable prompts. The original idea: using AI as a kaleidoscope for reframing.
AI Conversation Capture. Paste a ChatGPT, Claude, Gemini, or Grok share link and extract the full conversation. Generates summaries for sharing and "resume prompts" that let you continue the conversation in a different AI without losing context.
Smart Paste. Detects the destination app and reformats accordingly. Strips markdown for iMessage, preserves it for Notion.
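
The Smart Paste idea can be sketched briefly. Kaleidoscope itself is a Swift app, so this Python version is only an illustration: the `PLAIN_TEXT_APPS` table, the function names, and the regex-based stripping are assumptions, and the destination is passed in rather than detected.

```python
import re

# Hypothetical table of destinations that should receive plain text.
PLAIN_TEXT_APPS = {"iMessage"}

def strip_markdown(text: str) -> str:
    """Remove a few common markdown constructs; a full parser would be more robust."""
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)   # [label](url) -> label
    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.M)     # leading heading markers
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)           # **bold** -> bold
    text = re.sub(r"\*(.+?)\*", r"\1", text)               # *italic* -> italic
    text = re.sub(r"`([^`]+)`", r"\1", text)               # `code` -> code
    return text

def smart_paste(text: str, destination: str) -> str:
    """Strip markdown for plain-text destinations and pass it through elsewhere."""
    return strip_markdown(text) if destination in PLAIN_TEXT_APPS else text

sample = "# Plan\n**Ship** the [docs](https://example.com) with `make`"
print(smart_paste(sample, "iMessage"))
print(smart_paste(sample, "Notion"))
```

The regexes handle simple inline cases only; nested emphasis or `*`-style bullet lists would need a real markdown parser.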

What I learned

Native apps are different from web apps when building with AI agents. Swift has implicit defaults that made it harder for agents to know when to remove code rather than add it. Teaching models when to subtract is harder than teaching them what to add.
Apple's Foundation Models are still too limited for serious use. Small context windows and aggressive content moderation get in the way. But if on-device AI improves to match API quality, that would be a significant shift. I'm more interested in on-device AI than browser-based AI, and I'm excited to see how this space develops.
I wrote more about building Kaleidoscope in this post and about my overall agentic coding setup in Dream team.