LLMs create mediocre architecture that compounds technical debt. By "mediocre architecture" I mean code that's functionally correct but hard to maintain and extend. In our example it's duplicated code without opinionated abstractions.
Where should we expect them to fail? How can tech leads benefit from increased velocity without eventually crumbling under tech debt? And how should teams change their working style as a result?
Worked Example: Implementing Email Notifications
I ran Claude Code in the FastAPI Example App with the following prompt:
Add a notification system to this FastAPI app that sends email alerts when users perform certain actions (user registration, password reset, important data changes). The system should support multiple notification channels in the future (SMS, push notifications) and handle failures gracefully with retry logic.
Claude Code's Implementation
Claude found an existing function for email notifications, `generate_new_account_email`, and followed its pattern in defining the new functions: `generate_admin_account_status_change_email`, `generate_admin_profile_update_email`, `generate_email_change_notification`, `generate_new_account_email`, and `generate_profile_update_email`.
Here's an example for one of them:
```python
def generate_admin_account_status_change_email(
    email_to: str,
    full_name: str | None,
    status_change: str,
    admin_email: str,
    timestamp: str,
    reason: str | None = None,
) -> EmailData:
    project_name = settings.PROJECT_NAME
    subject = f"{project_name} - Account Status Update"
    html_content = render_email_template(
        template_name="admin_account_status_change.html",
        context={
            "project_name": project_name,
            "full_name": full_name or "User",
            "status_change": status_change,
            "admin_email": admin_email,
            "timestamp": timestamp,
            "reason": reason,
            "login_link": settings.FRONTEND_HOST,
            "support_link": f"{settings.FRONTEND_HOST}/support",
        },
    )
    return EmailData(html_content=html_content, subject=subject)
```
And here's how it's used:
```python
def update_user(
    *,
    session: SessionDep,
    user_id: uuid.UUID,
    user_in: UserUpdate,
    current_user: CurrentUser,
) -> Any:
    # 20 lines of endpoint handling code
    # All code until `return` is new notification logic
    if settings.emails_enabled:
        user_data = user_in.model_dump(exclude_unset=True)
        changed_fields = get_changed_fields_display(old_data, user_data)
        timestamp = format_timestamp()
        # Check if account status was changed
        if "is_active" in user_data and user_data["is_active"] != old_data["is_active"]:
            try:
                status_change = "Activated" if user_data["is_active"] else "Deactivated"
                email_data = generate_admin_account_status_change_email(
                    email_to=db_user.email,
                    full_name=db_user.full_name,
                    status_change=status_change,
                    admin_email=current_user.email,
                    timestamp=timestamp,
                )
                send_email(
                    email_to=db_user.email,
                    subject=email_data.subject,
                    html_content=email_data.html_content,
                )
            except Exception as e:
                logger.info(f"Failed to send status change email to {db_user.email}: {e}")
        elif changed_fields:
            try:
                email_data = generate_admin_profile_update_email(
                    email_to=db_user.email,
                    full_name=db_user.full_name,
                    changed_fields=changed_fields,
                    admin_email=current_user.email,
                    timestamp=timestamp,
                )
                send_email(
                    email_to=db_user.email,
                    subject=email_data.subject,
                    html_content=email_data.html_content,
                )
            except Exception as e:
                logger.info(f"Failed to send admin update email to {db_user.email}: {e}")
    return db_user
```
Here Claude has correctly understood the existing code structure and created new functions that follow the established pattern. The email-generation functions and their call sites are nearly identical to the existing code.
But here's the thing: if this were a PR, I'd reject it.
Why This Doesn't Work
The existing approach to sending notification emails has reached critical mass. We now have six near-duplicate functions doing essentially the same thing, and about 60% of the endpoint code is just handling notification emails.
There are also two dangerous gotchas in the code:
- The code catches bare `Exception`s and logs them at `info` level, a surefire way to leave logical bugs undiscovered forever.
- The HTML templates used by the generator functions don't actually exist.
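For contrast, here's a safer pattern as a minimal sketch (the helper name `notify_safely` is mine, and I'm assuming `smtplib.SMTPException` stands in for whatever the real send path raises): catch only the errors the email layer can actually produce, and log them at `error` level so failures surface in monitoring.

```python
import logging
import smtplib

logger = logging.getLogger(__name__)


def notify_safely(send_fn, *args, **kwargs):
    """Run an email send, swallowing only email-layer errors."""
    try:
        send_fn(*args, **kwargs)
    except smtplib.SMTPException as e:
        # Narrow except: programming errors (TypeError, KeyError, ...)
        # still propagate loudly instead of vanishing into info logs.
        logger.error("Failed to send notification email: %s", e)
```

A delivery outage is handled gracefully, while a genuine bug in the notification code still crashes the test suite instead of hiding forever.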
Conflicting Optimization Objectives
A better design would be to remove the existing email function and replace it with an abstraction. But that means making opinionated decisions about the codebase.
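As a sketch of what that abstraction could look like (all names here are hypothetical, not taken from the PR): a single notification service renders the message once and dispatches to pluggable channels, so each endpoint makes one call instead of duplicating generate-and-send blocks.

```python
from typing import Protocol


class NotificationChannel(Protocol):
    """Anything that can deliver a rendered notification."""

    def send(self, recipient: str, subject: str, body: str) -> None: ...


class EmailChannel:
    """Email delivery. In the real app send() would wrap send_email();
    here it just records messages for illustration."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str, str]] = []

    def send(self, recipient: str, subject: str, body: str) -> None:
        self.sent.append((recipient, subject, body))


class NotificationService:
    """A single entry point for every notification-worthy event."""

    def __init__(self, channels: list[NotificationChannel]) -> None:
        self.channels = channels

    def notify(self, event: str, recipient: str, context: dict) -> None:
        # One place to render templates and build subjects, instead of
        # a separate generate_*_email function per event. Retry logic
        # and error handling would also live here, in one spot.
        subject = f"{context.get('project_name', 'App')} - {event}"
        body = f"{event}: {context}"
        for channel in self.channels:
            channel.send(recipient, subject, body)
```

Endpoint code then shrinks to a single `service.notify(...)` call, and supporting SMS later means adding an `SmsChannel` class rather than another batch of generate-and-send blocks.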
This is RLHF-induced sycophancy at work. Not only does the model respond with "you're perfectly right!", it also treats your existing code as perfectly right. Well, unless you're terribly off course. Claude saw one way of doing emails and assumed that was the way.
From a product perspective, this makes sense. I rarely want an LLM coding agent making radical changes to my codebase.
Instead of being a problem, this is an opportunity to delineate responsibilities.
This Is What Thinking Looks Like
In 1960, J. C. R. Licklider sketched his vision of humans and machines working together in Man-Computer Symbiosis. He outlines a future in which machines handle the heavy lifting that intellectual tasks require, freeing humans to do the actual thinking.
About 85 per cent of my 'thinking' time was spent getting into a position to think...
-- J. C. R. Licklider, Man-Computer Symbiosis (1960)
For me, this is what the promise of AI looks like.
When LLMs handle the implementation details, our role as reviewers fundamentally changes. We're no longer scanning lines to find typos or narrow oversights. Instead, we get to focus on high level decisions that push the boundaries of our systems. Are we solving the right problem? Is this the right abstraction? Will this scale?
The first line of defence is a good ol' whiteboard meeting. I've noticed myself pulling a colleague aside for a design discussion more frequently. Especially as I know that once I have a solid high-level design, the (LLM-assisted) implementation will be easy.
Sometimes, just implementing a feature is faster than having a long discussion around it. Normally, I'm hesitant to suggest drastic architectural changes in PRs. The tradeoff between clean software architecture and time taken to implement it is rarely worth it in a startup. Especially if there's an existing solution. However, as the code can be easily generated, that's no longer a valid concern. That's why, when reviewing PRs, I now pay more attention to high-level design decisions.
As a reviewer, it is difficult to think about both the big-picture design and the line-by-line implementation all at once. That's why I outsource most of the low-level thinking to automated tooling. Tests and classic static analyzers are helpful here. Obviously we use Recurse ML as our safety net.
However, one issue I find myself battling with is that LLMs tend to repeat the same mistakes over and over again. This is where custom rules are particularly helpful. Think of them as code review comments that automatically check every PR. When the tool spots code that violates these patterns, it leaves a comment explaining the issue.
I've learned the following rule of thumb from Nate @ Continue.dev:
Whenever I leave a comment in a PR, I ask myself, should this be an automatic rule? Most of the time the answer is "yes" and I create it.
For example, I created `bare_exceptions.md` to prevent bare `Exception` catches from making it into our production codebase.
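Classic linters can enforce the same rule. For example, Ruff ships checks for bare excepts (E722) and blind `except Exception` handlers (BLE001, from the flake8-blind-except plugin), which can be enabled in `pyproject.toml`:

```toml
# pyproject.toml
[tool.ruff.lint]
# E722: bare `except:`; BLE001: `except Exception:` without re-raising
extend-select = ["E722", "BLE001"]
```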
Conclusion
LLMs create mediocre architecture because they optimize for consistency with existing patterns, not optimal design.
This limitation is an opportunity. When LLMs handle implementation, we get to focus on the hard architectural decisions that actually matter.
The practical takeaways:
- Design first, implement second. Pull a colleague aside for that whiteboard session.
- Review PRs for architecture, not syntax. Let tools like Recurse.ml, tests, and linters scan the lines for you.
- Turn repeated review comments into automated rules. If you're saying it twice, automate it.
- Don't hesitate to suggest refactoring. When implementation is cheap, good design becomes affordable.
Embrace the promise of LLMs to free humans from rote work for creative, high-level tasks. Designing elegant software is hard. It requires understanding the existing code and building a mental model of the people who will maintain it and the environment it will run in. However, it's also fun. It's creative in the same way that writing or visual art is creative. The space of viable solutions is vast. Greatness is highly subjective and context-dependent. But creating something that other people appreciate and build upon is one of the most rewarding parts of the human experience.