AI is everywhere, but is it really changing how we build software?
Sure, it’s making developers more productive, powering chatbots that can answer questions like a real person, and helping users find information faster than ever. But here’s the question I keep asking myself:
What new things can we actually build with general, consumer-grade AI tools?
Let me clarify what I mean by “consumer-grade AI.” I’m talking about AI tools and services that are pre-built and ready to use—no training your own models, no diving into deep learning theory, no building the next AlphaFold. I’m talking about tools that are already available to software developers like you and me, people who want to build better apps faster.
Think: Azure Document Intelligence, GPT-based agents, or the Model Context Protocol (MCP) tooling in Microsoft’s stack. These tools are powerful, but too often they’re just bolted on as shallow features.
Right now, the typical AI integration I see everywhere looks like this: add a chatbot, maybe throw in an MCP server just to have one, add a call to an LLM to summarize some text, and slap that coveted “Powered by AI” badge on the homepage so the marketing team goes wild.
If AI is just an afterthought… if it’s not changing the shape of your system or unlocking something new… it’s a missed opportunity.
You don’t need to invent new AI to build something truly AI-first. You need to think differently. The approach I’ve seen work best is when developers combine tools and design around AI from day one, rather than sprinkling it in later.
Think about replacing some of your business logic with reasoning, summarization, or data extraction. Think about personalizing the app based on the user instead of rigid UI flows. Think about systems that listen, translate, predict, and act—without needing dozens of forms and wizards powered by complex backend business logic.
Stop asking:
“How can I add AI to my app?”
Start asking:
“What hard problem in my app could AI actually solve?”
Instead of thinking in terms of chatbots or tools, think in actions:
Generate. Analyze. Validate. Detect. Rewrite. Update. Route.
These are things AI is good at. More importantly, these are the kinds of actions where friction is highest in traditional systems, and exactly the friction AI can remove.
Recently, I joined a team tasked with building a product that puts AI at the core, not by building new AI systems, but by orchestrating existing ones in smarter ways.
Our goal?
Create a case management system that helps case workers spend less time using software and more time helping people.
Sounds odd, right? The less people use the app, the more it does its job.
Our vision is that the app should disappear into the background: a quiet, reliable assistant that collects and organizes data without requiring hours of manual entry or constant interaction.
So how does this look in practice?
First, our data model is built for flexibility. While we still use a strong SQL schema, many fields are dynamic, stored as JSON with defined schemas. These are populated by AI, which extracts insights from documents, conversations, and voice input.
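To make that concrete, here’s a minimal sketch of the idea, assuming SQLite and the jsonschema package; the table, fields, and schema are hypothetical, but the shape is the same: a stable relational core plus a JSON column whose contents are validated against a declared schema before anything AI-produced gets stored.

```python
# Minimal sketch (hypothetical table and schema): a stable SQL core plus a
# flexible, schema-validated JSON column for AI-populated fields.
import json
import sqlite3

from jsonschema import ValidationError, validate  # pip install jsonschema

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cases (
        id         INTEGER PRIMARY KEY,
        status     TEXT NOT NULL,
        ai_fields  TEXT              -- dynamic, AI-populated data stored as JSON
    )""")

# The "defined schema" that AI-extracted payloads must satisfy before we persist them.
AI_FIELDS_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "household_size": {"type": "integer", "minimum": 1},
        "languages": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["summary"],
    "additionalProperties": False,
}

def store_ai_fields(case_id: int, ai_payload: dict) -> None:
    """Validate AI-extracted fields against the schema, then persist them as JSON."""
    try:
        validate(instance=ai_payload, schema=AI_FIELDS_SCHEMA)
    except ValidationError as err:
        # Reject malformed AI output instead of silently storing it.
        raise ValueError(f"AI payload rejected: {err.message}") from err
    conn.execute(
        "UPDATE cases SET ai_fields = ? WHERE id = ?",
        (json.dumps(ai_payload), case_id),
    )
    conn.commit()
```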
We use tools like:
- Azure Document Intelligence to extract structured data from files
- Semantic Kernel to analyze and contextualize input
- Azure Cognitive Services to transcribe and analyze multilingual conversations
- And yes, we use GPT agents to summarize, interpret, and validate input across the board
This collection of APIs replaces what used to be hundreds of lines of business logic, forms, validations, and user configuration. Now, a user can speak, upload, or type in natural language, and the AI organizes the data behind the scenes.
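In our product this flows through Semantic Kernel and the Azure services above, but the shape of the idea fits in a short sketch. Everything below is illustrative: the OpenAI Python SDK stands in for whichever client you use, and the model name, prompt, and field names are assumptions rather than our production setup.

```python
# Illustrative sketch: free-form text in, schema-shaped JSON out.
# Model name, prompt, and fields are stand-ins, not our production configuration.
import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACTION_PROMPT = """You are a case-management assistant.
Extract these fields from the user's message and answer ONLY with JSON:
summary (string), household_size (integer or null), languages (array of strings)."""

def extract_case_fields(raw_input: str) -> dict:
    """Turn a free-form note from a case worker into structured fields."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any capable chat model works
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": raw_input},
        ],
    )
    return json.loads(response.choices[0].message.content)

# A transcribed spoken note, no forms or wizards involved.
fields = extract_case_fields(
    "Visited the family today, five people in the household, they speak Dari "
    "and some English. Main concern is getting the housing paperwork sorted."
)
print(fields)  # e.g. {"summary": "...", "household_size": 5, "languages": [...]}
```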
Just a few years ago, building something like this would have required a small army of developers and years of meetings about business rules. Now, it’s a handful of well-written prompts, smart API orchestration, and a team of developers who reached a viable product in a fraction of that time.
Of course, it’s not all sunshine and auto-summarized rainbows.
One of the hardest parts of building AI-first systems is deciding how much trust to give to AI. It’s non-deterministic by nature. Ask the same question twice, and you might get two different answers. Worse, sometimes it confidently gets things wrong.
We’ve mitigated a lot of this through:
- Strong prompting best practices
- Validation layers
- Clear user visibility into AI-generated data
But the truth is, AI will occasionally mess up. So, ask yourself… what happens if this fails?
In our case, AI helps case workers collect and interpret data, but users still have the opportunity to review everything. For higher-risk use cases, like financial systems or medical records, you may need more guardrails, or you may need to reconsider whether AI should be involved at all.
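Here’s a rough sketch of what that review-friendly shape can look like in code. The names are made up; the point is simply that AI-produced values never land in the record as bare facts, they carry provenance and a review flag the UI can surface.

```python
# Illustrative sketch: wrap AI output with provenance so the UI can show
# "this came from AI, please confirm" instead of presenting it as ground truth.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class AiValue:
    value: Any
    source: str = "ai"       # vs. "user" for manually entered data
    model: str = "unknown"
    extracted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    reviewed: bool = False   # flipped to True once a case worker confirms the value

def wrap_ai_fields(ai_payload: dict, model: str) -> dict[str, AiValue]:
    """Attach provenance to every AI-extracted field before it reaches the record."""
    return {key: AiValue(value=val, model=model) for key, val in ai_payload.items()}

def confirm_field(record: dict[str, AiValue], key: str) -> None:
    """Called when a case worker explicitly approves an AI-suggested value."""
    record[key].reviewed = True
```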
Another challenge you’ll run into quickly: testing and evaluating AI is fundamentally different from testing traditional code.
- You can’t write unit tests that assert `getAnswer() == "42"`, because the AI’s output changes based on wording, context, temperature, and even time.
- Sometimes changing a single word in your prompt results in significantly better (or worse) performance.
- Comparing models? That’s a full-on UX and data analysis job, not just benchmarking for speed or cost.
You can’t just A/B test AI prompts like you would traditional business logic. You need real user interactions, good logging, and qualitative feedback loops. And you need to embrace iterative tuning, because “prompt engineering” is more like UX design than software engineering. You write something, you evaluate, then you iterate over and over, and you will probably never reach ‘perfect’, at least not with today’s AI.
We haven’t built a full test harness for AI (yet), but we’ve started thinking about what it might look like, and the answers are… unconventional.
One idea?
Use AI to test the AI (please don’t stop reading just yet!)
Think about a summarization task. How do you know if the summary is good? You could try to match it against keywords with string comparisons, but that’s fragile and often misses the point.
Instead, we’ve considered feeding the original text and the AI-generated summary back into another AI model and asking:
- “Does this summary accurately reflect the input?”
- “On a scale from 1–10, how complete is this summary? How confident are you?”
- “What details, if any, are missing or inaccurate?”
In other words, let the AI self-grade its work and explain why. It’s not perfect, but it’s a good way to spot red flags or regression when iterating on prompts.
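A minimal sketch of that self-grading loop, again with the OpenAI Python SDK as a stand-in; the judge prompt, rubric, and threshold are assumptions we are still tuning, not a finished eval harness.

```python
# Sketch of "AI grading AI": a second model call scores a summary against its source.
# Judge prompt, rubric, and threshold are assumptions, not a finished harness.
import json

from openai import OpenAI  # pip install openai

client = OpenAI()

JUDGE_PROMPT = """You will receive an ORIGINAL text and a SUMMARY of it.
Answer ONLY with JSON containing:
  accurate (boolean): does the summary accurately reflect the original?
  completeness (integer 1-10), confidence (integer 1-10),
  missing_or_inaccurate (array of strings)."""

def grade_summary(original: str, summary: str) -> dict:
    """Ask a (preferably different) model to grade a summary and explain why."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # ideally not the same model that wrote the summary
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"ORIGINAL:\n{original}\n\nSUMMARY:\n{summary}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

original_text = (
    "Client reported losing her job in March, has two children, and is behind "
    "on rent. She asked about food assistance and childcare options."
)
candidate_summary = "Client lost her job, has two children, and is behind on rent."

report = grade_summary(original_text, candidate_summary)
if not report["accurate"] or report["completeness"] < 7:
    # A red flag to investigate while iterating on prompts, not a final verdict.
    print("Possible regression:", report["missing_or_inaccurate"])
```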
We’re also experimenting with:
- Prompt comparisons: Sending the same input through multiple prompt styles to see which produces better output (a rough sketch follows this list)
- Synthetic test sets: Crafting tricky examples and edge cases, then using them to stress-test prompts: does it fail gracefully on weird inputs? Does it hallucinate more on edge cases?
- Logging everything: Capturing inputs and outputs so we can manually review and improve them over time
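Here’s a rough sketch of how that comparison-plus-logging loop can look; the prompt variants, model, and log location are illustrative, and the valuable part is simply the habit of capturing every input/output pair for later human review.

```python
# Illustrative sketch: run the same input through several prompt variants and
# log every input/output pair to a JSONL file for later manual review.
import json
from datetime import datetime, timezone
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()
LOG_FILE = Path("prompt_runs.jsonl")  # hypothetical location

PROMPT_VARIANTS = {
    "terse": "Summarize the case note in two sentences.",
    "structured": "Summarize the case note as bullet points: situation, needs, next steps.",
}

def run_and_log(case_note: str) -> dict[str, str]:
    """Send the same input through every prompt variant and log each result."""
    results: dict[str, str] = {}
    for name, prompt in PROMPT_VARIANTS.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical choice
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": case_note},
            ],
        )
        output = response.choices[0].message.content
        results[name] = output
        with LOG_FILE.open("a", encoding="utf-8") as log:
            log.write(json.dumps({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "variant": name,
                "input": case_note,
                "output": output,
            }) + "\n")
    return results
```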
And that’s really the point… we didn’t invent new AI or create a breakthrough algorithm. We simply asked better questions about how we want to use AI, used what was already there, and built something around it that feels new, helpful, and genuinely impactful.
So next time you’re tempted to add a basic chatbot or voice assistant to your app, take a step back and ask:
Is this really adding value? What’s a problem I couldn’t solve before, but can now with the help of AI?
That’s where the real opportunity lies.