Software Architecture in an AI World


Like almost any question about AI, “How does AI impact software architecture?” has two sides to it: how AI changes the practice of software architecture and how AI changes the things we architect.

These questions are coupled; one can’t really be discussed without the other. But to jump to the conclusion, we can say that AI hasn’t had a big effect on the practice of software architecture, and it may never. But we expect the software that architects design will be quite different. There are going to be new constraints, requirements, and capabilities that architects will need to take into account.


We see tools like Devin that promise end-to-end software development, delivering everything from the initial design to a finished project in one shot. We expect to see more tools like this. Many of them will prove to be helpful. But do they make any fundamental changes to the profession? To answer that, we must think about what that profession does. What does a software architect spend time doing? Slinging around UML diagrams instead of grinding out code? It’s not that simple.

The bigger change will be in the nature and structure of the software we build, which will be different from anything that has gone before. The customers will change, and so will what they want. They’ll want software that summarizes, plans, predicts, and generates ideas, with user interfaces ranging from the traditional keyboard to human speech, maybe even virtual reality. Architects will play a leading role in understanding those changes and designing that new generation of software. So, while the fundamentals of software architecture remain the same—understanding customer requirements and designing software that meets those requirements—the products will be new.

AI as an Architectural Tool

AI’s success as a programming tool can’t be overstated; we’d estimate that over 90% of professional programmers, along with many hobbyists, are using generative tools including GitHub Copilot, ChatGPT, and many others. It’s easy to write a prompt for ChatGPT, Gemini, or some other model, paste the output into a file, and run it. These models can also write tests (if you’re very careful about describing exactly what you want to test). Some can run the code in a sandbox, generating new versions of the program until it passes. Generative AI eliminates a lot of busywork: looking up functions and methods in documentation or wading through questions and answers on Stack Overflow to find something that might be appropriate, for example. There’s been a lot of discussion about whether this increases productivity significantly (it does, but not as much as you might think), improves the quality of the generated code (probably not by much, though humans also write a lot of horrid code), compromises security, and other issues.

But programming isn’t software architecture, a discipline that often doesn’t require writing a single line of code. Architecture deals with the human and organizational side of software development: talking to people about the problems they want solved and designing a solution to those problems. That doesn’t sound so hard, until you get into the details—which are often unspoken. Who uses the software and why? How does the proposed software integrate with the customer’s other applications? How does the software integrate with the organization’s business plans? How does it address the markets that the organization serves? Will it run on the customer’s infrastructure, or will it require new infrastructure? On-prem or in the cloud? How often will the new software need to be modified or extended? (This may have a bearing on whether you decide to implement microservices or a monolithic architecture.) The list of questions architects need to ask is endless.

These questions lead to complex decisions that require knowing a lot of context and don’t have clear, well-defined answers. “Context” isn’t just the number of bytes that you can shove into a prompt or a conversation; context is detailed knowledge of an organization, its capabilities, its needs, its structure, and its infrastructure. In some future, it might be possible to package all of this context into a set of documents that can be fed into a database for retrieval-augmented generation (RAG). But, although it’s very easy to underestimate the speed of technological change, that future isn’t upon us. And remember—the important task isn’t packaging the context but discovering it.

The answers to the questions architects need to ask aren’t well-defined. An AI can tell you how to use Kubernetes, but it can’t tell you whether you should. The answer to that question could be “yes” or “no,” but in either case, it’s not the kind of judgment call we’d expect an AI to make. Answers almost always involve trade-offs. We were all taught in engineering school that engineering is all about trade-offs. Software architects are constantly staring these trade-offs down. Is there some magical solution in which everything falls into place? Maybe on rare occasions. But as Neal Ford said, software architecture isn’t about finding the best solution—it’s about finding the “least worst solution.”

That doesn’t mean that we won’t see tools for software architecture that incorporate generative AI. Architects are already experimenting with models that can read and generate event diagrams, class diagrams, and many other kinds of diagrams in formats like C4 and UML. There will no doubt be tools that can take a verbal description and generate diagrams, and they’ll get better over time. But that fundamentally mistakes why we want these diagrams. Look at the home page for the C4 model. The diagrams are drawn on whiteboards—and that shows precisely what they are for. Programmers have been drawing diagrams since the dawn of computing, going all the way back to flow charts. (I still have a flow chart stencil lying around somewhere.) Standards like C4 and UML define a common language for these diagrams, a standard for unambiguous communications. While there have long been tools for generating boilerplate code from diagrams, that misses the point, which is facilitating communications between humans.

An AI that can generate C4 or UML diagrams based on a prompt would undoubtedly be useful. Remembering the details of proper UML can be dizzying, and eliminating that busywork would be just as important as saving programmers from looking up the names and signatures of library functions. An AI that could help developers understand large bodies of legacy code would help in maintaining legacy software—and maintaining legacy code is most of the work in software development. But it’s important to remember that our current diagramming tools are relatively low-level and narrow; they look at patterns of events, classes, and structures within classes. Helpful as that software would be, it’s not doing the work of an architect, who needs to understand the context, as well as the problem being solved, and connect that context to an implementation. Most of that context isn’t encoded within the legacy codebase. Helping developers understand the structure of legacy code will save a lot of time. But it’s not a game changer.

There will undoubtedly be other AI-driven tools for software architects and software developers. It’s time to start imagining and implementing them. Tools that promise end-to-end software development, such as Devin, are intriguing, though it’s not clear how well they’ll deal with the fact that every software project is unique, with its own context and set of requirements. Tools for reverse engineering an older codebase or loading a codebase into a knowledge repository that can be used throughout an organization—those are no doubt on the horizon. What most people who worry about the death of programming forget is that programmers have always built tools to help them, and what generative AI gives us is a new generation of tooling.

Every new generation of tooling lets us do more than we could before. If AI really delivers the ability to complete projects faster—and that’s still a big if—the one thing that doesn’t mean is that the amount of work will decrease. We’ll be able to take the time saved and do more with it: spend more time understanding the customers’ requirements, doing more simulations and experiments, and maybe even building more complex architectures. (Yes, complexity is a problem, but it won’t go away, and it’s likely to increase as we become even more dependent on machines.)

To someone used to programming in assembly language, the first compilers would have looked like AI. They certainly increased programmer productivity at least as much as AI-driven code generation tools like GitHub Copilot. These compilers (Autocode in 1952, Fortran in 1957, COBOL [1] in 1959) reshaped the still-nascent computing industry. While there were certainly assembly language programmers who thought that high-level languages represented the end of programming, they were clearly wrong. How much of the software we use today would exist if it had to be written in assembly? High-level languages created a new era of possibilities and made new kinds of applications conceivable. AI will do the same—for architects as well as programmers. It will give us help generating new code and understanding legacy code. It may indeed help us build more complex systems or give us a better understanding of the complex systems we already have. And there will be new kinds of software to design and develop, new kinds of applications that we’re only starting to imagine. But AI won’t change the fundamentally human side of software architecture, which is understanding a problem and the context into which the solution must fit.

The Challenge of Building with AI

Here’s the challenge in a nutshell: Learning to build software in smaller, clearer, more concise units. If you take a step back and look at the entire history of software engineering, this theme has been with us from the beginning. Software architecture is not about high performance, fancy algorithms, or even security. All of those have their place, but if the software you build isn’t understandable, everything else means little. If there’s a vulnerability, you’ll never find it if the code is incomprehensible. Code that has been tweaked to the point of incomprehension (and there were some very bizarre optimizations back in the early days) might be fine for version 1, but it’s going to be a maintenance nightmare for version 2. We’ve learned to do better, even if clear, understandable code is often still an aspiration rather than reality. Now we’re introducing AI. The code may be small and compact, but it isn’t comprehensible. AI systems are black boxes: we don’t really understand how they work. From this historical perspective, AI is a step in the wrong direction—and that has big implications for how we architect systems.

There’s a famous illustration in the paper “Hidden Technical Debt in Machine Learning Systems.” It’s a block diagram of a machine learning application, with a tiny box labeled ML in the center. This box is surrounded by several much bigger blocks: data pipelines, serving infrastructure, operations, and much more. The meaning is clear: in any real-world application, the code that surrounds the ML core dwarfs the core itself. That’s an important lesson to learn.

This paper is a bit old, and it’s about machine learning, not artificial intelligence. How does AI change the picture? Think about what building with AI means. For the first time (arguably with the exception of distributed systems), we’re dealing with software whose behavior is probabilistic, not deterministic. If you ask an AI to add 34,957 to 70,764, you might not get the same answer every time—you might get 105,621 [2], a feature of AI that Turing anticipated in his groundbreaking paper “Computing Machinery and Intelligence.” If you’re just calling a math library in your favorite programming language, of course you’ll get the same answer each time, unless there’s a bug in the hardware or the software. You can write tests to your heart’s content and be sure that they’ll all pass, unless someone updates the library and introduces a bug. AI doesn’t give you that assurance. That problem extends far beyond mathematics. If you ask ChatGPT to write my biography, how will you know which facts are correct and which aren’t? The errors won’t even be the same every time you ask.

But that’s not the whole problem. The deeper problem here is that we don’t know why. AI is a black box. We don’t understand why it does what it does. Yes, we can talk about Transformers and parameters and training, but when your model says that Mike Loukides founded a multibillion-dollar networking company in the 1990s (as ChatGPT 4.0 did—I wish), the one thing you cannot do is say, “Oh, fix these lines of code” or “Oh, change these parameters.” And even if you could, fixing that example would almost certainly introduce other errors, which would be equally random and hard to track down. We don’t know why AI does what it does; we can’t reason about it [3]. We can reason about the mathematics and statistics behind Transformers but not about any specific prompt and response. The issue isn’t just correctness; AI’s ability to go off the rails raises all kinds of problems of security and safety.

I’m not saying that AI is useless because it can give you wrong answers. There are many applications where 100% accuracy isn’t required—probably more than we realize. But now we have to start thinking about that tiny box in the “Technical Debt” paper. Has AI’s black box grown bigger or smaller? The amount of code it takes to build a language model is minuscule by modern standards—just a few hundred lines, even less than the code you’d use to implement many machine learning algorithms. But lines of code don’t address the real issue. Nor does the number of parameters, the size of the training set, or the number of GPUs it will take to run the model. Regardless of the size, some nonzero percentage of the time, any model will get basic arithmetic wrong or tell you that I’m a billionaire or that you should use glue to hold the cheese on your pizza. So, do we want the AI at the core of our diagram to be a tiny black box or a gigantic black box? If we’re measuring lines of code, it’s small. If we’re measuring uncertainties, it’s very large.

The blackness of that black box is the challenge of building and architecting with AI. We can’t just let it sit. To deal with AI’s essential randomness, we need to surround it with more software—and that’s perhaps the most important way in which AI changes software architecture. We need, minimally, two new components:

  • Guardrails that inspect the AI module’s output and ensure that it doesn’t get off track: that the output isn’t racist, sexist, or harmful in any of dozens of ways.
    Designing, implementing, and managing guardrails is an important challenge—especially since there are many people out there for whom forcing an AI to say something naughty is a pastime. It isn’t as simple as enumerating likely failure modes and testing for them, especially since inputs and outputs are often unstructured.
  • Evaluations, which are essentially test suites for the AI.
    Test design is an important part of software architecture. In his newsletter, Andrew Ng writes about two kinds of evaluations: relatively straightforward evaluations of knowable facts (Does this application for screening résumés pick out the applicant’s name and current job title correctly?), and much more problematic evals for output where there’s no single, correct response (almost any free-form text). How do we design these? A minimal sketch of both components appears after this list.
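
To make those two components a little more concrete, here’s a minimal sketch in Python. Everything in it is illustrative: call_model() is a hypothetical stand-in for whatever model API you use, the guardrail is a naive keyword filter, and the eval is a tiny table of knowable facts. Real guardrails and evals are far more involved, but the shape is the same—one function inspects output before it reaches the user, and another scores the system against cases you care about.

    # Illustrative only. call_model() is a hypothetical stand-in for a model API.
    BLOCKLIST = {"example-slur", "example-harmful-instruction"}  # placeholder terms

    def call_model(prompt: str) -> str:
        """Hypothetical model call; replace with your provider's client."""
        raise NotImplementedError

    def guardrail(output: str) -> bool:
        """A (very naive) content check. Real guardrails use classifiers,
        policies, and review, not keyword lists."""
        lowered = output.lower()
        return not any(term in lowered for term in BLOCKLIST)

    def answer(prompt: str) -> str:
        output = call_model(prompt)
        return output if guardrail(output) else "Sorry, I can't help with that."

    # A tiny eval of knowable facts: prompt -> substring the answer must contain.
    EVAL_CASES = {
        "What is the capital of France?": "Paris",
        "How many bits are in a byte?": "8",
    }

    def run_evals() -> float:
        """Return the fraction of eval cases the model gets right."""
        passed = sum(
            expected.lower() in call_model(prompt).lower()
            for prompt, expected in EVAL_CASES.items()
        )
        return passed / len(EVAL_CASES)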

Do these components go inside the box or outside, as their own separate boxes? How you draw the picture doesn’t really matter, but guardrails and evals have to be there. And remember: as we’ll see shortly, we’re increasingly talking about AI applications that have multiple language models, each of which will need its own guardrails and evals. Indeed, one strategy for building AI applications is to use one model (typically a smaller, less expensive one) to respond to the prompt and another (typically a larger, more comprehensive one) to check that response. That’s a useful and increasingly popular pattern, but who checks the checkers? If we go down that path, recursion will quickly blow out any conceivable stack.

On O’Reilly’s Generative AI in the Real World podcast, Andrew Ng points out an important issue with evaluations. When it’s possible to build the core of an AI application in a week or two (not counting data pipelines, monitoring, and everything else), it’s depressing to think about spending several months running evals to see whether you got it right. It’s even more depressing to think about experiments, such as evaluating with a different model—although trying another model might yield better results or lower operating costs. Again, nobody really understands why, but no one should be surprised that all models aren’t the same. Evaluation will help uncover the differences if you have the patience and the budget. Running evals isn’t fast, and it isn’t cheap, and it’s likely to become more expensive the closer you get to production.

Neal Ford has said that we may need a new layer of encapsulation or abstraction to accommodate AI more comfortably. We need to think about fitness and design architectural fitness functions to encapsulate descriptions of the properties we care about. Fitness functions would incorporate issues like performance, maintainability, security, and safety. What levels of performance are acceptable? What’s the probability of error, and what kinds of errors are tolerable for any given use case? An autonomous vehicle is much more safety-critical than a shopping app. Summarizing meetings can tolerate much more latency than customer service. Medical and financial data must be used in accordance with HIPAA and other regulations. Any kind of enterprise will probably need to deal with compliance, contractual issues, and other legal issues, many of which have yet to be worked out. Meeting fitness requirements with plain old deterministic software is difficult—we all know that. It will be much more difficult with software whose operation is probabilistic.
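
What might an architectural fitness function look like in practice? Here’s a minimal sketch in Python; the metrics and thresholds are invented for illustration, and a real fitness function would be fed by your evals and production monitoring rather than hand-entered numbers.

    from dataclasses import dataclass

    @dataclass
    class FitnessThresholds:
        # Invented thresholds; every use case will set these differently.
        max_p95_latency_s: float = 2.0     # a meeting summarizer could tolerate far more
        max_error_rate: float = 0.05       # fraction of eval cases answered incorrectly
        max_guardrail_violations: int = 0  # safety-critical systems may demand zero

    def fitness(p95_latency_s: float, error_rate: float, guardrail_violations: int,
                t: FitnessThresholds = FitnessThresholds()) -> bool:
        """Pass/fail fitness check, run against eval results and monitoring data."""
        return (p95_latency_s <= t.max_p95_latency_s
                and error_rate <= t.max_error_rate
                and guardrail_violations <= t.max_guardrail_violations)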

Is all of this software architecture? Yes. Guardrails, evaluations, and fitness functions are fundamental components of any system with AI in its value chain. And the questions they raise are far more difficult and fundamental than saying that “you need to write unit tests.” They get to the heart of software architecture, including its human side: What should the system do? What must it not do? How do we build a system that achieves those goals? And how do we monitor it to know whether we’ve succeeded? In “AI Safety Is Not a Model Property,” Arvind Narayanan and Sayash Kapoor argue that safety issues inherently involve context, and models are always insufficiently aware of context. As a result, “defenses against misuse must primarily be located outside of models.” That’s one reason guardrails aren’t part of the model itself but are still part of the application: the model is unaware of how or why the application is being used. It’s an architect’s responsibility to have a deep understanding of the contexts in which the application is used.

If we get fitness functions right, we may no longer need “programming as such,” as Matt Welsh has argued. We’ll be able to describe what we want and let an AI-based code generator iterate until it passes a fitness test. But even in that scenario, we’ll still have to know what the fitness functions need to test. Just as with guardrails, the most difficult problem will be encoding the contexts in which the application is used.

The process of encoding a system’s desired behavior raises the question of whether fitness tests are yet another formal language layered on top of human language. Will fitness tests be just another way of describing what humans want a computer to do? If so, do they represent the end of programming or the triumph of declarative programming? Or will fitness tests just become another problem that’s “solved” by AI—in which case, we’ll need fitness tests to assess the fitness of the fitness tests? In any case, while programming as such may disappear, understanding the problems that software needs to solve won’t. And that is software architecture.

New Ideas, New Patterns

AI presents new possibilities in software design. We’ll introduce some simple patterns to get a handle on the high-level structure of the systems that we’ll be building.

RAG

Retrieval-augmented generation, a.k.a. RAG, may be the oldest (though not the simplest) pattern for designing with AI. It’s very easy to describe a superficial version of RAG: you intercept users’ prompts, use the prompt to look up relevant items in a database, and pass those items along with the original prompt to the AI, possibly with some instructions to answer the question using material included in the prompt.

RAG is useful for many reasons:

  • It minimizes hallucinations and other errors, though it doesn’t entirely eliminate them.
  • It makes attribution possible; credit can be given to sources that were used to create the answer.
  • It enables users to extend the AI’s “knowledge”; adding new documents to the database is orders of magnitude simpler and faster than retraining the model.

It’s also not as simple as that definition implies. As anyone familiar with search knows, “look up relevant items” usually means getting a few thousand items back, some of which have minimal relevance and many others that aren’t relevant at all. In any case, stuffing all of them into a prompt would blow out all but the largest context windows. Even in these days of huge context windows (1M tokens for Gemini 1.5, 200K for Claude 3), too much context greatly increases the time and expense of querying the AI—and there are valid questions about whether providing too much context increases or decreases the probability of a correct answer.

A more realistic version of the RAG pattern looks like a pipeline: the user’s prompt passes through retrieval, relevance ranking, selection, trimming, and prompt construction before it reaches the model.

It’s common to use a vector database, though a plain old relational database can serve the purpose. I’ve seen arguments that graph databases may be a better choice. Relevance ranking means what it says: ranking the results returned by the database in order of their relevance to the prompt. It probably requires a second model. Selection means taking the most relevant responses and dropping the rest; reevaluating relevance at this stage rather than just taking the “top 10” is a good idea. Trimming means removing as much irrelevant information from the selected documents as possible. If one of the documents is an 80-page report, cut it down to the paragraphs or sections that are most relevant. Prompt construction means taking the user’s original prompt, packaging it with the relevant data and possibly a system prompt, and finally sending it to the model.
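
In code, the pipeline might look something like this sketch in Python. The search(), rank(), and call_model() callables are hypothetical placeholders for a vector (or relational) store, a ranking model, and whatever model API you’re using; real trimming would extract the most relevant passages rather than truncating.

    def rag_answer(user_prompt, search, rank, call_model, top_k=5, max_chars=2000):
        candidates = search(user_prompt)                 # retrieval: often thousands of hits
        ranked = rank(user_prompt, candidates)           # relevance ranking (likely a second model)
        selected = ranked[:top_k]                        # selection: keep only the most relevant
        trimmed = [doc[:max_chars] for doc in selected]  # trimming: a crude cut, for illustration
        prompt = (                                       # prompt construction
            "Answer the question using only the material below.\n\n"
            + "\n\n".join(trimmed)
            + "\n\nQuestion: " + user_prompt
        )
        return call_model(prompt)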

We started with one model, but now we have four or five. However, the added models can probably be smaller, relatively lightweight models like Llama 3. A big part of architecture for AI will be optimizing cost. If you can use smaller models that can run on commodity hardware rather than the giant models provided by companies like Google and OpenAI, you will almost certainly save a lot of money. And that is absolutely an architectural issue.

The Judge

The judge pattern [4], which appears under various names, is simpler than RAG. You send the user’s prompt to a model, collect the response, and send it to a different model (the “judge”). This second model evaluates whether the answer is correct. If the answer is incorrect, it sends it back to the first model. (And we hope it doesn’t loop indefinitely—solving that is a problem that’s left for the programmer.)

This pattern does more than simply filter out incorrect answers. The model that generates the answer can be relatively small and lightweight, as long as the judge is able to determine whether it is correct. The model that serves as the judge can be a heavyweight, such as GPT-4. Letting the lightweight model generate the answers and using the heavyweight model to test them tends to reduce costs significantly.
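
Here’s a minimal sketch of the pattern in Python, assuming two hypothetical model callables and a retry cap so the loop can’t run forever:

    def judged_answer(prompt, small_model, judge_model, max_retries=3):
        """Generate with a lightweight model; verify with a heavyweight judge.
        Both model arguments are hypothetical callables that take a prompt string."""
        for _ in range(max_retries):
            answer = small_model(prompt)
            verdict = judge_model(
                "Question: " + prompt + "\nAnswer: " + answer +
                "\nIs this answer correct? Reply YES or NO."
            )
            if verdict.strip().upper().startswith("YES"):
                return answer
        return "I couldn't produce an answer I'm confident in."  # don't loop indefinitely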

Choice of Experts

Choice of experts is a pattern in which one program (possibly but not necessarily a language model) analyzes the prompt and determines which service would be best able to process it correctly. It’s similar to mixture of experts (MOE), a strategy for building language models in which several models, each with different capabilities, are combined to form a single model. The highly successful Mixtral models implement MOE, as do GPT-4 and other very large models. Tomasz Tunguz calls choice of experts the router pattern, which may be a better name.

Whatever you call it, looking at a prompt and deciding which service would generate the best response doesn’t have to be internal to the model, as in MOE. For example, prompts about corporate financial data could be sent to an in-house financial model; prompts about sales situations could be sent to a model that specializes in sales; questions about legal issues could be sent to a model that specializes in law (and that is very careful not to hallucinate cases); and a large model, like GPT, can be used as a catch-all for questions that can’t be answered effectively by the specialized models.

It’s frequently assumed that the prompt will eventually be sent to an AI, but that isn’t necessarily the case. Problems that have deterministic answers—for example, arithmetic, which language models handle poorly at best—could be sent to an engine that only does arithmetic. (But then, a model that never makes arithmetic mistakes would fail the Turing test.) A more sophisticated version of this pattern could handle more complex prompts, where different parts of the prompt are sent to different services; then another model would be needed to combine the individual results.
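
A sketch of the router in Python might look like the following. The classify() callable could be a small model or plain heuristics; the experts dictionary maps a label to whatever service handles it, including a deterministic arithmetic engine that isn’t a model at all. All of the names are invented for illustration.

    def looks_like_arithmetic(prompt):
        stripped = prompt.strip()
        return (any(c.isdigit() for c in stripped)
                and all(c.isdigit() or c in " +-*/()." for c in stripped))

    def route(prompt, classify, experts, fallback):
        """experts maps labels like 'finance', 'sales', 'legal', 'arithmetic'
        to services; fallback is the heavyweight catch-all model."""
        if looks_like_arithmetic(prompt) and "arithmetic" in experts:
            return experts["arithmetic"](prompt)     # deterministic engine, not a model
        label = classify(prompt)                     # small model or heuristic classifier
        return experts.get(label, fallback)(prompt)  # catch-all handles everything else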

As with the other patterns, choice of experts can deliver significant cost savings. The specialized models that process different kinds of prompts can be smaller, each with its own strengths, and each giving better results in its area of expertise than a heavyweight model. The heavyweight model is still important as a catch-all, but it won’t be needed for most prompts.

Agents and Agent Workflows

Agents are AI applications that invoke a model more than once to produce a result. All of the patterns discussed so far could be considered simple examples of agents. With RAG, a chain of models determines what data to present to the final model; with the judge, one model evaluates the output of another, possibly sending it back; choice of experts chooses between several models.

Andrew Ng has written an excellent series about agentic workflows and patterns. He emphasizes the iterative nature of the process. A human would never sit down and write an essay start-to-finish without first planning, then drafting, revising, and rewriting. An AI shouldn’t be expected to do that either, whether those steps are included in a single complex prompt or (better) a series of prompts. We can imagine an essay-generator application that automates this workflow. It would ask for a topic, important points, and references to external data, perhaps making suggestions along the way. Then it would create a draft and iterate on it with human feedback at each step.

Ng talks about four patterns, four ways of building agents, each discussed in an article in his series: reflection, tool use, planning, and multiagent collaboration. Doubtless there are more—multiagent collaboration feels like a placeholder for a multitude of sophisticated patterns. But these are a good start. Reflection is similar to the judge pattern: an agent evaluates and improves its output. Tool use means that the agent can acquire data from external sources, which seems like a generalization of the RAG pattern. It also includes other kinds of tool use, such as GPT’s function calling. Planning gets more ambitious: given a problem to solve, a model generates the steps needed to solve the problem and then executes those steps. Multiagent collaboration suggests many different possibilities; for example, a purchasing agent might solicit bids for goods and services and might even be empowered to negotiate for the best price and bring back options to the user.
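
To see how these pieces can combine, here’s a compressed sketch in Python of a planning-style agent that also uses tools and ends with a reflection step. call_model() and the entries in tools are hypothetical callables; a real agent framework adds state, error handling, and guardrails at every step.

    def plan_and_execute(goal, call_model, tools, max_steps=5):
        """Plan, execute each step with a tool (or the model itself), then reflect."""
        plan = call_model(f"List up to {max_steps} short steps to accomplish: {goal}")
        results = []
        for step in plan.splitlines()[:max_steps]:
            choice = call_model(
                f"Which of these tools best fits the step, or NONE? {list(tools)}\nStep: {step}"
            )
            handler = tools.get(choice.strip(), call_model)  # tool use, else fall back to the model
            results.append(handler(step))
        return call_model(  # reflection: review the work before returning it
            f"Goal: {goal}\nStep results: {results}\n"
            "Write a final answer, noting anything that looks wrong."
        )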

All of these patterns have an architectural side. It’s important to understand what resources are required, what guardrails need to be in place, what kinds of evaluations will show us that the agent is working properly, how data safety and integrity are maintained, what kind of user interface is appropriate, and much more. Most of these patterns involve multiple requests made through multiple models, and each request can generate an error—and errors will compound as more models come into play. Getting error rates as low as possible and building appropriate guardrails to detect problems early will be critical.

This is where software development genuinely enters a new era. For years, we’ve been automating business systems, building tools for programmers and other computer users, discovering how to deploy ever more complex systems, and even making social networks. We’re now talking about applications that can make decisions and take action on behalf of the user—and that needs to be done safely and appropriately. We’re not concerned about Skynet. That worry is often just a feint to keep us from thinking about the real damage that systems can do now. And as Tim O’Reilly has pointed out, we’ve already had our Skynet moment. It didn’t require language models, and it could have been prevented by paying attention to more fundamental issues. Safety is an important part of architectural fitness.

Staying Safe

Safety has been a subtext throughout: in the end, guardrails and evals are all about safety. Unfortunately, safety is still very much a research topic.

The problem is that we know little about generative models and how they work. Prompt injection is a real threat that can be used in increasingly subtle ways—but as far as we know, it’s not a problem that can be solved. It’s possible to take simple (and ineffective) measures to detect and reject hostile prompts. Well-designed guardrails can prevent inappropriate responses (though they probably can’t eliminate them).

But users quickly tire of “As an AI, I’m not allowed to…,” especially if they’re making requests that seem reasonable. It’s easy to understand why an AI shouldn’t tell you how to murder someone, but shouldn’t you be able to ask for help writing a murder mystery? Unstructured human language is inherently ambiguous and includes phenomena like humor, sarcasm, and irony, which are fundamentally impossible in formal programming languages. It’s unclear whether AI can be trained to take irony and humor into account. If we want to talk about how AI threatens human values, I’d worry much more about training humans to eliminate irony from human language than about paperclips.

Protecting data is important on many levels. Of course, training data and RAG data must be protected, but that’s hardly a new problem. We know how to protect databases (even though we often fail). But what about prompts, responses, and other data that’s in-flight between the user and the model? Prompts might contain personally identifiable information (PII), proprietary information that shouldn’t be submitted to AI (companies, including O’Reilly, are creating policies governing how employees and contractors use AI), and other kinds of sensitive information. Depending on the application, responses from a language model may also contain PII, proprietary information, and so on. While there’s little danger of proprietary information leaking [5] from one user’s prompt to another user’s response, the terms of service for most large language models allow the model’s creator to use prompts to train future models. At that point, a previously entered prompt could be included in a response. Changes in copyright case law and regulation present another set of safety challenges: What information can or can’t be used legally?

These information flows require an architectural decision—perhaps not the most complex decision but a very important one. Will the application use an AI service in the cloud (such as GPT or Gemini), or will it use a local model? Local models are smaller, less expensive to run, and less capable, but they can be trained for the specific application and don’t require sending data offsite. Architects designing any application that deals with finance or medicine will have to think about these issues—and with applications that use multiple models, the best decision may be different for each component.

There are patterns that can help protect restricted data. Tomasz Tunguz has suggested a pattern for AI security that places a sanitizing proxy and a firewall between the user and the model.

The proxy intercepts queries from the user and “sanitizes” them, removing PII, proprietary information, and anything else inappropriate. The sanitized query is passed through the firewall to the model, which responds. The response passes back through the firewall and is cleaned to remove any inappropriate information.
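
Here’s a minimal sketch of that proxy in Python, with deliberately crude regular expressions standing in for real PII detection; production sanitizers use dedicated detectors and policy engines, and the “firewall” here is just a function boundary.

    import re

    # Deliberately crude patterns; real systems use dedicated PII detectors.
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def sanitize(text):
        """Remove obvious PII before text crosses the firewall in either direction."""
        text = EMAIL.sub("[EMAIL REDACTED]", text)
        return SSN.sub("[SSN REDACTED]", text)

    def proxied_query(user_prompt, call_model):
        """call_model is a hypothetical stand-in for the model beyond the firewall."""
        response = call_model(sanitize(user_prompt))  # the sanitized query goes out
        return sanitize(response)                     # the response is cleaned on the way back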

Designing systems that can keep data safe and secure is an architect’s responsibility, and AI adds to the challenges. Some of the challenges are relatively simple: reading through license agreements to determine how an AI provider will use data you submit to it. (AI can do a good job of summarizing license agreements, but it’s still best to consult with a lawyer.) Good practices for system security are nothing new, and have little to do with AI: good passwords, multifactor authentication, and zero trust networks need to be standard. Proper management (or elimination) of default passwords is mandatory. There’s nothing new here and nothing specific to AI—but security needs to be part of the design from the start, not something added in when the project is mostly done.

Interfaces and Experiences

How do you design a user’s experience? That’s an important question, and something that often escapes software architects. While we expect software architects to put in time as programmers and to have a good understanding of software security, user experience design is a different specialty. But user experience is clearly a part of the overall architecture of a software system. Architects may not be designers, but they must be aware of design and how it contributes to the software project as a whole—particularly when the project involves AI. We often speak of a “human in the loop,” but where in the loop does the human belong? And how does the human interact with the rest of the loop? Those are architectural questions.

Many of the generative AI applications we’ve seen haven’t taken user experience seriously. Star Trek’s fantasy of talking to a computer appeared to come to life with ChatGPT, so chat interfaces have become the de facto standard. But that shouldn’t be the end of the story. While chat certainly has a role, it isn’t the only option, and sometimes, it’s a poor one. One problem with chat is that it gives attackers who want to drive a model off its rails the most flexibility. Honeycomb, one of the first companies to integrate GPT into a software product, decided against a chat interface: it gave attackers too many opportunities and was too likely to expose users’ data. A simple Q&A interface might be better. A highly structured interface, like a form, would function similarly. A form would also provide structure to the query, which might increase the likelihood of a correct, nonhallucinated answer.

It’s also important to think about how applications will be used. Is a voice interface appropriate? Are you building an app that runs on a laptop or a phone but controls another device? While AI is very much in the news now, and very much in our collective faces, it won’t always be that way. Within a few years, AI will be embedded everywhere: we won’t see it and we won’t think about it any more than we see or think about the radio waves that connect our laptops and phones to the internet. What kinds of interfaces will be appropriate when AI becomes invisible? Architects aren’t just designing for the present; they’re designing applications that will continue to be used and updated many years into the future. And while it isn’t wise to incorporate features that you don’t need or that someone thinks you might need at some vague future date, it’s helpful to think about how the application might evolve as technology advances.

Projects by IF has an excellent catalog of interface patterns for handling data in ways that build trust. Use it.

Everything Changes (and Remains the Same)

Does generative AI usher in a new age of software architecture?

No. Software architecture isn’t about writing code. Nor is it about writing class diagrams. It’s about understanding problems and the context in which those problems arise in depth. It’s about understanding the constraints that the context places on the solution and making all the trade-offs between what’s desirable, what’s possible, and what’s economical. Generative AI isn’t good at doing any of that, and it isn’t likely to become good at it any time soon. Every solution is unique; even if the application looks the same, every organization building software operates under a different set of constraints and requirements. Problems and solutions change with the times, but the process of understanding remains.

Yes. What we’re designing will have to change to incorporate AI. We’re excited by the possibility of radically new applications, applications that we’ve only begun to imagine. But these applications will be built with software that’s not really comprehensible: we don’t know how it works. We will have to deal with software that isn’t 100% reliable: What does testing mean? If your software for teaching grade school arithmetic occasionally says that 2+2=5, is that a bug, or is that just what happens with a model that behaves probabilistically? What patterns address that kind of behavior? What does architectural fitness mean? Some of the problems that we’ll face will be the same old problems, but we’ll need to view them in a different light: How do we keep data safe? How do we keep data from flowing where it shouldn’t? How do we partition a solution to use the cloud where it’s appropriate and run on-premises where that’s appropriate? And how do we take it a step farther? In O’Reilly’s recent Generative AI Success Stories Superstream, Ethan Mollick explained that we have to “embrace the weirdness”: learn how to deal with systems that might want to argue rather than answer questions, that might be creative in ways that we don’t understand, and that might be able to synthesize new insights. Guardrails and fitness tests are necessary, but a more important part of the software architect’s function may be understanding just what these systems are and what they can do for us. How do software architects “embrace the weirdness”? What new kinds of applications are waiting for us?

With generative AI, everything changes—and everything stays the same.


Acknowledgments

Thanks to Kevlin Henney, Neal Ford, Birgitta Boeckeler, Danilo Sato, Nicole Butterfield, Tim O’Reilly, Andrew Odewahn, and others for their ideas, comments, and reviews.


Footnotes

  1. COBOL was intended, at least in part, to allow regular business people to replace programmers by writing their own software. Does that sound similar to the talk about AI replacing programmers? COBOL actually increased the need for programmers. Business people wanted to do business, not write software, and better languages made it possible for software to solve more problems.
  2. Turing’s example. Do the arithmetic if you haven’t already (and don’t ask ChatGPT). I’d guess that AI is particularly likely to get this sum wrong. Turing’s paper is no doubt in the training data, and that’s clearly a high-quality source, right?
  3. OpenAI and Anthropic recently released research in which they claim to have extracted “concepts” (features) from their models. This could be an important first step toward interpretability.
  4. If you want more info, search for “LLM as a judge” (at least on Google); this search gives relatively clean results. Other likely searches will find many documents about legal applications.
  5. Reports that information can “leak” sideways from a prompt to another user appear to be urban legends. Many versions of that legend start with Samsung, which warned engineers not to use external AI systems after discovering that they had sent proprietary information to ChatGPT. Despite rumors, there isn’t any evidence that this information ended up in the hands of other users. However, it could have been used to train a future version of ChatGPT.