hash0 7 hours ago [-]
You mention two very specific aspects of AI producing code:
- "It hyper-focuses on the current task and couldn't care less if its changes break other parts of the system."
- "Long context = instant brain damage"
This is why I quickly discovered that I had to turn AI into a knowledgeable, patient tutor rather than let it code for me. I am thoroughly at the helm for all decisions big and small - I don't let AI touch the code anymore.
coffeefirst 4 hours ago [-]
Anthropic did a study where people who worked like this understood their systems a lot better than AI maximalists and were basically as fast.
Plus it’s so much cheaper… that has to matter.
softwaredoug 3 hours ago [-]
The study looks at junior developers unfamiliar with what they're implementing. Post-hoc they broke down the AI group and grouped some into tutoring->hand coding. Another group was AI maximalist. We should keep in mind the comparison of these groups is very low n. Tutoring + hand coding seemed to have the best speed + understanding. This was all conducted in Jan 2026.
Where I'd push back on drawing too many conclusions from this study: arguably the most successful AI usage comes from senior developers who know the programming environment they're working in, know how far to trust the AI, and carefully review and understand its outputs.
Nevertheless, the study's still interesting, and I wish they'd replicate with a much higher n per group. Junior developers (undergrads?) are a more abundant group and not particularly specialized yet. They've also spent years hand-coding at University, but probably could adapt to AI tooling pretty easily.
Wait, hold on. So you're telling me that Anthropic is out there hyping their newest, most powerful LLM like crazy... but really it's just a token-selling scheme?
softwaredoug 5 hours ago [-]
You can also do this better using just ChatGPT because it forces you to ask better questions independent of your code
And it is a lot cheaper in the end :)
coffeefirst 5 hours ago [-]
So stop.
I’m serious. Treat it like any other tool. When it helps solve problems, use it. When it makes problems, don’t use it.
There are a lot of people and an enormous amount of money trying to make hands off agentic happen, but the happiest and most effective enthusiasts I know do not give up control: they go function by function and class by class, generating or writing as they see fit.
The goal is to make useful software. At least, I think that’s still true?
sollawen 4 hours ago [-]
Couldn't agree more. Honestly, I don't believe in 'loop engineering' after working with AI these past few months, I've seen a lot of its decisions are really unreliable — and if you let it loop-engineer all night long, it's definitely going to leave you a pile of shit.
mawadev 6 hours ago [-]
I have to say the LLMs probably can do a lot of these things properly if you had access to the entire infrastructure for a single project but as you said, it is simply not economical to use. I'd rather have people maintain systems before lock in happens and someone hikes the prices and rug pulls me. In almost all cases you need someone experienced anyway to sign off on the changes it makes or to keep guidelines and guardrails intact.
I don't buy the "you are holding it wrong", "works on my machine", "works on this model" or "do this spec structure" type of arguments as compensation for the fundamental issues. As it stands, the tech simply does not do what is advertised and claimed.
d-yoda 5 hours ago [-]
I agree with most of this. I think code quality has been improving though. Compared to a year ago, the difference is pretty noticeable.
I ran into many of the same issues, and they motivated me to experiment with a linter that flags duplication and architectural problems across a codebase. It’s still a work in progress though:
Yes I've experienced everything you stated. Here's what helped me:
Problem 1: "Obsessed with reinventing the wheel" " three duplicate functions":
Suggestion: plan then implement.
Tell the LLM to scan your project and create a markdown plan file for solving the task first. DO NOT try to solve tasks in a single shot without planning. Review the plan file, then, IN A NEW SESSION with clean context, tell the LLM to read the plan file and implement the plan according to it.
---
Problem 2: "hyper-focuses on the current task and couldn't care less if its changes break other parts of the system"
Suggestion: add instructions to AGENTS.md file teaching LLMs how to run unit tests and other kinds of tests so it can make sure nothing broke. And also add to AGENTS.md that LLMs MUST run tests before marking the task done.
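A minimal sketch of what such a "run tests before marking done" gate could look like, assuming the agent is told in AGENTS.md to call one script. The function name `run_checks` and the idea of a single wrapper script are illustrative assumptions, not something from the comment; the commands you pass in depend entirely on your project.

```python
import subprocess
import sys


def run_checks(commands):
    """Run each command in order and collect the ones that fail.

    `commands` is a list of argument lists, e.g. a test runner invocation.
    Returns a list of (command, exit_code) tuples for every non-zero exit.
    """
    failures = []
    for cmd in commands:
        # capture_output keeps the agent's context free of noisy logs;
        # print result.stdout/result.stderr if you want the details.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((cmd, result.returncode))
    return failures


# Hypothetical usage (pytest is only an example of a test command):
#   failed = run_checks([[sys.executable, "-m", "pytest", "-q"]])
#   sys.exit(1 if failed else 0)
```

The point of returning the failures instead of exiting immediately is that the agent sees every broken check at once, rather than fixing one and rediscovering the next.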
---
Problem 3: "you'll hit the 200k token limit in no time" "Long context = instant brain damage"
Suggestion: use 1 million context window LLMs. Also plan then implement will keep your context shorter.
If you can, use better LLM services which offer a 1 million token context window. If you can't afford Anthropic or OpenAI, use DeepSeek V4 Flash or MiMo 2.5 for example. A $10/mo OpenCode Go subscription plan offers $60 in LLM credits which is A LOT for these cheap LLMs.
Also, planning phase is when the LLM has to scan the entire project to understand what needs changing. This is where the context bloat comes from. If you split tasks into planning + implementation, the scanning phase is condensed into a single markdown file which keeps context lean.
Bonus tip: Tell LLMs to use subagents when doing exploration.
---
Problem 4: The longer the context, the more incoherent its responses.
Suggestion: yeah, LLMs get dumber as their working memory fills up (just like me). If your session reaches 200k+ tokens, it's usually a sign you could have planned the feature better or split it up. It might be worth restarting with more clarification.
SyneRyder 6 hours ago [-]
>Problem 3: "you'll hit the 200k token limit..." ... Suggestion: use 1 million context window LLMs.
Yes, if the model someone is using only has 200k token limit, that would immediately suggest to me that it really isn't a sophisticated enough model.
Most of my coding sessions end up being about 350k tokens long when I finish, it wouldn't even fit in a 200k context. And that isn't counting the cache-reads by subagents, etc.
It's worth spending some time with the best Opus / GPT model, to at least get a sense of what the frontier is like.
jeffyaw 6 hours ago [-]
minimax m3 has a 1M token context window so not sure how op is hitting this 200k. maybe the plan they're on? or some setting in some layer of whatever their dev tooling is using.
bel8 4 hours ago [-]
Yeah it's probably some free or entry level LLM service.
Even DeepSeek v4 Flash has a 1 million token context size.
FearNotDaniel 9 hours ago [-]
Sorry you're struggling. There are tons of resources out there from people who've been through the same pain and built up techniques to mitigate it. Matt Pocock's YouTube channel is one good starting place, there are many others. One key tip though is - you own the architecture of your application. If your files have become bloated over several rounds of LLM-generated code, it is primarily your responsibility to notice and push back on that, and to ensure your repo has a brief but firm AGENTS.md or CLAUDE.md describing architectural non-negotiables, plus some kind of /code-review skill that you give to a second agent to review the first agent's work against those standards. Worried about LLM changes introducing regressions? That's why you have a test suite, and you read what the agent changes to ensure the tests themselves are still checking for correct behaviour.
cedws 5 hours ago [-]
Your comment comes off as patronising, I don’t know if you intended that or not. These issues are not a matter of ‘holding it wrong’, they’re fundamental to agentic coding.
K0balt 3 hours ago [-]
The problem I’m having is not this. I’ve got it pretty finely honed at creating concise, correct code to specification, but using it is a nightmare cycle of make a decision, wait 3-5 minutes, make another decision, wait 3-5 minutes… and it’s not enough to really wear you out so you can work 16 hours in a day. It quickly becomes a hellscape of productivity.
PaiDxng 8 hours ago [-]
The most painful part is the “add instead of change/delete” habit. The real test for AI coding assistants isn’t whether they can generate code, but whether they can understand the existing system, reuse the right abstraction, remove bad code, and own the whole call chain after the change.
aristofun 4 hours ago [-]
LLMs are only as good as the average codebase they were trained on. What else did you expect?
jeffyaw 8 hours ago [-]
try using a harness. i provide one in Yaw Mode, and you can copy it and modify it if you want to learn from it and tweak your own.
i use the skills /yaw-review excessively sometimes multiple times in a row on the same pr or session. followed by most often /yaw-address-all and then /yaw-coverage to add tests and /yaw-ship-ready to make production ready.
after a few rounds of these they are not needed every time on the same codebase.
if you are desperately wishing programming to go back to the before times it will never. or it will always be there but expect to be incredibly less productive than your peers.
dunnock 2 hours ago [-]
You're not suggesting a harness will replace the need for a programmer's review, are you? The thing might go in a totally wrong direction and no harness will stop it, unless you write a harness for every specific task at hand. But then why is that any better than writing, or at least reviewing, the code itself?
jeffyaw 31 minutes ago [-]
it is not, it just helps to not go in the wrong direction as much. even in my comment i mentioned reviewing the code multiple times. so i do agree reviewing is essential.
taffydavid 8 hours ago [-]
Spec driven development can help some, even for brownfield projects. You can have an LLM swallow up your entire project and spit out a spec, and then review that yourself.
For any issue, start a brand new context, point it to the spec, explain the issue and explain if it's a regression.
Also, though it might seem like an obvious one: the more test coverage you have, the more your LLM can tell if something has broken or if there's been a regression without needing to eat up context.
All of these things can help but there's no perfect solution.
softwaredoug 4 hours ago [-]
My biggest challenge is what it does to my ability to pay attention
1. Set the Agent off on some task
2. Go scroll social media
<15 minutes later>
Get back to whatever the agent was doing.
AI coding feels very anti-flow.
minibucket 9 hours ago [-]
Yeah. That drives me crazy too!
function addData() {....
function addDataNew() {....
function addDataForAddData() {....
function addDataForNewAddData() {....
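One way to catch this pattern mechanically is to fingerprint function bodies and group the ones that match. The following is a rough sketch for Python source using the standard `ast` module; `find_duplicate_functions` is a made-up name, and it only catches functions whose parameters and bodies are structurally identical, so renamed variables or slightly edited copies will slip through.

```python
import ast
from collections import defaultdict


def find_duplicate_functions(source):
    """Group functions in `source` whose signature and body are identical.

    Returns a list of name groups, each sorted alphabetically, for every
    fingerprint shared by more than one function.
    """
    tree = ast.parse(source)
    groups = defaultdict(list)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump omits line/column info by default, so two copies
            # at different places in the file produce the same string.
            body = "".join(ast.dump(stmt) for stmt in node.body)
            key = ast.dump(node.args) + "|" + body
            groups[key].append(node.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Running something like this in CI, or asking the agent to run it before adding a new helper, gives it a concrete signal that an equivalent function already exists.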
jessedu29260 8 hours ago [-]
How did you find a way to fix it? I have the same issue. I've been trying to implement rules on top of it, repo-focused memory, etc., and I still keep hitting the same problem: as you say, it always prefers creating new code or files instead of modifying the existing ones, and then ends up creating problems that weren't there before…
cedws 5 hours ago [-]
Agree with all that. You only have to look at the Claude Code source leak as proof that AI cannot write a clean codebase.
nullc 11 hours ago [-]
Worst negative pattern I've seen is hyper-defensive programming. E.g.
But, of course, it depends a lot on which models you're using and how you instruct them.
YuriNiyazov 13 hours ago [-]
What model are you running? With what settings?
sollawen 9 hours ago [-]
mostly I am using GLM5, with miniMax-3 to do some easy work.
munksbeer 7 hours ago [-]
I haven't tried this model. We have a corporate plan for several models, and they are liberal with our spend, because we're always in an arms race with our competitors in our industry. So any advantage we can get, we need to take, or we lose edge/market share.
We have access to anthropic models, openai models and google models.
I run all my sessions on their best models with max thinking, because I don't care to optimise token usage at this stage. We are still learning every day about how to optimise our workflows, but I will say that I don't typically experience what you're describing.
I have very opinionated AGENTS.md files at the repo level, and at various other levels in the repo where more specialised rules are needed but I don't want those in my context unless that specific section of the codebase is going to be used or touched. I make a lot of use of skills. And my sessions are almost all "spec driven" in the sense that I type out an opinionated requirement to the LLM, tell it to challenge my thinking, to push back, to iterate on its own thinking, then to formulate a plan, then once done, go over it again to find any issues. I will then review the plan, or wing it, depending on the task. I then look at the overall code structure and design it has done. I have strong, opinionated coding rules in my AGENTS file. I have strong testing requirements (mostly end-to-end, not unit style).
I get really good results from this. But, I will say we're working in a highly opinionated codebase. We have the fundamentals in place already, where there are rules for how you do everything. The agent follows those rules pretty well. I'm not sure how well it would work on a codebase that is messy with a lot of conflicting design principles.
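To illustrate the layered AGENTS.md idea described above, here is a small sketch that, given a file about to be touched, lists the AGENTS.md files on the path from the repo root down to that file's directory. This is an assumption about how such scoping could work, not a description of how any particular agent tool resolves these files; `applicable_agent_files` is a hypothetical helper.

```python
from pathlib import Path


def applicable_agent_files(repo_root, target, filename="AGENTS.md"):
    """Return rule files that apply to `target`, outermost (repo root) first.

    Walks from the target's directory up to `repo_root` and keeps every
    directory that contains `filename`. Sibling directories are ignored,
    so their rules never enter the context.
    """
    root = Path(repo_root).resolve()
    target = Path(target).resolve()
    current = target if target.is_dir() else target.parent
    chain = []
    while True:
        chain.append(current)
        if current == root:
            break
        if current.parent == current:
            # Reached the filesystem root without meeting repo_root.
            raise ValueError("target is not inside repo_root")
        current = current.parent
    return [d / filename for d in reversed(chain) if (d / filename).is_file()]
```

Ordering the result root-first means more specific rules come later and can refine or override the general ones when they are concatenated into a prompt.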
jeffyaw 6 hours ago [-]
if you want to try minimax m3 with a lot of optimizations and a custom tui check out typed. and for that harness you can use it in yaw mode.
you can also use within claude code tui by running:
typed cli off
i should probably change that to typed tui off/on. anyway.
MarvinYork 9 hours ago [-]
I don't have those experiences with Codex (5.5 xhigh).
denn-gubsky 12 hours ago [-]
If your model starves because of a small context window (the symptoms you describe), then I would suggest:
a) Split the job between agents, each with its own context window;
b) Use an advanced model as orchestrator for multiple coding/testing/reviewing agents;
c) Use code indexers like https://github.com/ory/lumen and/or https://github.com/defendend/Claude-ast-index-search
d) Use planning and detailed specifications preparation before the coding phase.
coldtea 7 hours ago [-]
>Why are files so bloated in the first place? Because AI prefers adding new code over modifying existing code, and it rarely deletes anything.
Did you ask it to delete stuff or consolidate functionality? Did you ask it to reuse certain available implementations? Or do you use it as a black box, letting it do all the design and code, and not caring to steer it, except with some high level request ("build x")?
If it's the latter, if we treat a (non-actually-intelligent, generative) AI as a hands-off developer and remove ourselves from the loop, we get exactly what you mention in the rant.
But nothing forces us to use it like this (except the craze of "vibe coding"). Use it as a carefully monitored and steered coding assistant.
It's slower that way? That slowness is a requirement for getting the expertise/knowledge/taste of a human developer into the mix. It's what prevents an avalanche of slop from being committed and becoming part of the codebase unchecked.
That's why all the focus on maximizing speed, and removing the humans from the loop, is misguided.
When LLMs are ready to remove humans from the loop, we'll know: we'd be out of a job. As long as we have one, our role is to act as a quality bottleneck, not to open the floodgates.
adammarples 5 hours ago [-]
Can you get, for example in python, a tool that parses the ast and builds a very concise graph of function calls into a map, with file names, so that AI can do tool calls on this before updating any code? Instead of the grepping which is what we have.
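For what it's worth, the standard library's `ast` module is enough for a first approximation of this. The sketch below is an illustration under assumptions, not an existing tool: `build_call_map` and `callers_of` are made-up names, calls are matched by bare name only (so `obj.save()` and an unrelated `save()` look the same), and calls inside nested functions are also counted for the enclosing function.

```python
import ast


def build_call_map(sources):
    """Map "file:function" to the sorted names that function calls.

    `sources` is a dict of filename -> source text, so the caller decides
    how files are discovered and read.
    """
    call_map = {}
    for filename, text in sources.items():
        tree = ast.parse(text, filename=filename)
        for node in ast.walk(tree):
            if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            callees = set()
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    func = inner.func
                    if isinstance(func, ast.Name):         # foo(...)
                        callees.add(func.id)
                    elif isinstance(func, ast.Attribute):  # obj.foo(...)
                        callees.add(func.attr)
            call_map[f"{filename}:{node.name}"] = sorted(callees)
    return call_map


def callers_of(call_map, name):
    """Return every "file:function" entry that calls `name`."""
    return sorted(key for key, callees in call_map.items() if name in callees)
```

Exposed as a tool call, `callers_of` would let the model ask "who depends on this function?" before changing it, which is a much smaller payload than grep output over the whole repo.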
pbgcp2026 8 hours ago [-]
What a pitiful state of programming affairs ... Just a couple of years ago we were discussing new libraries and tech architectures and algos. Today we discuss how to "shoehorn" some 3rd party crap into making your life miserable. Best case scenario: it will get smart(er) and replace you. (Yeah, the cope of "owning your architecture") Worst case: you redo all that shite by hand and still get replaced.
And the tragedy?
We ceased being programmers. We don't even qualify as code monkeys anymore ...
quintes 11 hours ago [-]
My convention file says no mass rewrites. Plan. Assess. Spec. Code review. Depends on your model I guess
tamrix 12 hours ago [-]
Sounds like more of a rant than a question and for that you get a ranty reply. Ask the AI how to improve your results.
sph 9 hours ago [-]
The post is decidedly a question. It is a forum, people come to interact with humans and exchange opinions.
663344 7 hours ago [-]
this is what i do. i start with correct code and use ai for translation.
You are right. My solution is to implement a real-world software engineering process, something closer to a real software engineering team. I run multiple agents in one repo and give them different tasks: requirements check, product/spec, architecture, coding, review, tests/evals, and overall management.
Scope adjudication is extremely important in vibe coding, or an agent can easily break your whole system with features that don't apply.
The study: https://www.anthropic.com/research/AI-assistance-coding-skil...
https://github.com/ludo-technologies/pyscn
this is where i start (c2l code i translate at https://c2l.puter.site):
define..sq.x[* x x
(define (sq x) (* x x))
ai step:
translate this code to python