This is the second of a three-part series by Markus Eisele. Part 1 can be found here. Stay tuned for part 3.

Many AI projects fail. The reason is often simple. Teams try to rebuild last decade’s applications but add AI on top: A CRM system with AI. A chatbot with AI. A search engine with AI. The pattern is the same: “X, but now with AI.” These projects usually look fine in a demo, but they rarely work in production. The problem is that AI doesn’t just extend old systems. It changes what applications are and how they behave. If we treat AI as a bolt-on, we miss the point.

What AI Changes in Application Design

Traditional enterprise applications are built around deterministic workflows. A service receives input, applies business logic, stores or retrieves data, and responds. If the input is the same, the output is the same. Reliability comes from predictability.

AI changes this model. Outputs are probabilistic. The same question asked twice may return two different answers. Results depend heavily on context and prompt structure. Applications now need to manage data retrieval, context building, and memory across interactions. They also need mechanisms to validate and control what comes back from a model. In other words, the application is no longer just code plus a database. It’s code plus a reasoning component with uncertain behavior. That shift makes “AI add-ons” fragile and points to a need for entirely new designs.

Defining AI-Infused Applications

AI-infused applications aren’t just old applications with smarter text boxes. They have new structural elements:

Context pipelines: Systems need to assemble inputs before passing them to a model. This often includes retrieval-augmented generation (RAG), where enterprise data is searched and embedded into the prompt, but it can also include hierarchical, per-user memory.

Memory: Applications need to persist context across interactions. Without memory, conversations reset on every request. This memory may need to be stored in different ways: in-process, mid-term, and even long-term. Nobody wants to start every support conversation by repeating their name and the products they purchased.

Guardrails: Outputs must be checked, validated, and filtered. Otherwise, hallucinations or malicious responses leak into business workflows.

Agents: Complex tasks often require coordination. An agent can break a request down, call multiple tools, APIs, or even other agents, and assemble complex results, executed in parallel or sequentially. Instead of being workflow driven, agents are goal driven: they try to produce a result that satisfies a request. Even Business Process Model and Notation (BPMN) is moving toward goal- and context-oriented agent design.

These are not theoretical. They’re the building blocks we already see in modern AI systems. What’s important for Java developers is that they can be expressed as familiar architectural patterns: pipelines, services, and validation layers. That makes them approachable even though the underlying behavior is new.
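To make that mapping concrete, here is a deliberately framework-free sketch. The interface names are illustrative only, not taken from any library; they simply show how the building blocks above line up with the service and validation patterns Java teams already write:

```java
import java.util.List;
import java.util.Optional;

// Hypothetical interfaces for illustration only; not part of any specific framework.
public interface ContextPipeline {
    // Assemble retrieved documents and per-user memory into the prompt sent to the model.
    String buildPrompt(String userRequest, String userId);
}

interface ConversationMemory {
    void append(String userId, String message);
    List<String> recall(String userId, int lastN);
}

interface Guardrail {
    // Returns the response if it passes validation, or empty if it must be blocked.
    Optional<String> validate(String modelResponse);
}
```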

Models as Services, Not Applications

One foundational thought: AI models should not be part of the application binary. They are services. Whether they’re served through a container locally, served via vLLM, hosted by a model cloud provider, or deployed on private infrastructure, the model is consumed through a service boundary. For enterprise Java developers, this is familiar territory. We have decades of experience consuming external services through fast protocols, handling retries, applying backpressure, and building resilience into service calls. We know how to build clients that survive transient errors, timeouts, and version mismatches. This experience is directly relevant when the “service” happens to be a model endpoint rather than a database or messaging broker.
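Because the model sits behind a service boundary, those resilience patterns apply directly. Here is a minimal sketch, assuming MicroProfile Fault Tolerance (available in Quarkus and other Jakarta EE runtimes) and a hypothetical ChatModelClient that wraps the actual endpoint call:

```java
import java.time.temporal.ChronoUnit;

import org.eclipse.microprofile.faulttolerance.Fallback;
import org.eclipse.microprofile.faulttolerance.Retry;
import org.eclipse.microprofile.faulttolerance.Timeout;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class SummaryService {

    @Inject
    ChatModelClient modelClient; // hypothetical wrapper around the model endpoint

    @Timeout(value = 10, unit = ChronoUnit.SECONDS)   // model calls are slow; bound them
    @Retry(maxRetries = 2, delay = 500)               // survive transient errors
    @Fallback(fallbackMethod = "summaryUnavailable")
    public String summarize(String document) {
        return modelClient.complete("Summarize the following text:\n" + document);
    }

    String summaryUnavailable(String document) {
        return "Summary temporarily unavailable.";
    }
}
```

Timeouts, retries, and fallbacks are exactly the mechanisms we already apply to databases and message brokers; only the payload is different.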


By treating the model as a service, we avoid a major source of fragility. Applications can evolve independently of the model. If you need to swap a local Ollama model for a cloud-hosted GPT or an internal Jlama deployment, you change configuration, not business logic. This separation is one of the reasons enterprise Java is well positioned to build AI-infused systems.
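As a sketch of what that swap can look like with LangChain4j (exact class names can vary between LangChain4j versions; the model identifiers below are placeholders), business code depends only on the ChatLanguageModel interface, so switching providers is a factory or configuration decision:

```java
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.openai.OpenAiChatModel;

public class ModelFactory {

    // Local model served by Ollama, e.g. during development or on premises.
    public static ChatLanguageModel localModel() {
        return OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama3.1")
                .build();
    }

    // Cloud-hosted model; only the configuration changes, not the callers.
    public static ChatLanguageModel cloudModel(String apiKey) {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o-mini")
                .build();
    }
}
```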

Java Examples in Practice

The Java ecosystem is beginning to support these ideas with concrete tools that address enterprise-scale requirements rather than toy examples.

Retrieval-augmented generation (RAG): Context-driven retrieval is the most common pattern for grounding model answers in enterprise data. At scale this means structured ingestion of documents, PDFs, spreadsheets, and more into vector stores. Projects like Docling handle parsing and transformation, and LangChain4j provides the abstractions for embedding, retrieval, and ranking. Frameworks such as Quarkus then extend those concepts into production-ready services with dependency injection, configuration, and observability. The combination moves RAG from a demo pattern into a reliable enterprise feature.
LangChain4j as a standard abstraction: LangChain4j is emerging as a common layer across frameworks. It offers CDI integration for Jakarta EE and extensions for Quarkus but also supports Spring, Micronaut, and Helidon. Instead of writing fragile, low-level OpenAPI glue code for each provider, developers define AI services as interfaces and let the framework handle the wiring. This standardization is also beginning to cover agentic modules, so orchestration across multiple tools or APIs can be expressed in a framework-neutral way.

Cloud to on-prem portability: In enterprises, portability and control matter. Abstractions make it easier to switch between cloud-hosted providers and on-premises deployments. With LangChain4j, you can change configuration to point from a cloud LLM to a local Jlama model or Ollama instance without rewriting business logic. These abstractions also make it easier to use more and smaller domain-specific models and maintain consistent behavior across environments. For enterprises, this is critical to balancing innovation with control.
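As an illustration of how these pieces fit together, here is a compact LangChain4j sketch that ingests documents into an in-memory vector store, wires retrieval and short-term memory into an AI service interface, and leaves the model choice to the caller. Class and builder names follow current LangChain4j APIs but may differ slightly between versions; the embedding model, documents, and prompt are placeholders:

```java
import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

public class SupportAssistantFactory {

    // The AI service is just an interface; the framework wires model, memory, and retrieval.
    public interface SupportAssistant {
        @SystemMessage("You answer questions using only the provided documents.")
        String chat(String question);
    }

    public static SupportAssistant create(ChatLanguageModel model,
                                          EmbeddingModel embeddingModel,
                                          List<Document> documents) {

        InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();

        // Ingest enterprise documents into the vector store (the RAG context pipeline).
        EmbeddingStoreIngestor.builder()
                .documentSplitter(DocumentSplitters.recursive(300, 30))
                .embeddingModel(embeddingModel)
                .embeddingStore(store)
                .build()
                .ingest(documents);

        return AiServices.builder(SupportAssistant.class)
                .chatLanguageModel(model)
                .chatMemory(MessageWindowChatMemory.withMaxMessages(10)) // short-term memory
                .contentRetriever(EmbeddingStoreContentRetriever.builder()
                        .embeddingStore(store)
                        .embeddingModel(embeddingModel)
                        .maxResults(5)
                        .build())
                .build();
    }
}
```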

These examples show how Java frameworks are taking AI integration from low-level glue code toward reusable abstractions. The result is not only faster development but also better portability, testability, and long-term maintainability.

Testing AI-Infused Applications

Testing is where AI-infused applications diverge most sharply from traditional systems. In deterministic software, we write unit tests that confirm exact results. With AI, outputs vary, so testing has to adapt. The answer is not to stop testing but to broaden how we define it.

Unit tests: Deterministic parts of the system—context builders, validators, database queries—are still tested the same way. Guardrail logic, which enforces schema correctness or policy compliance, is also a strong candidate for unit tests.

Integration tests: AI models should be tested as opaque systems. You feed in a set of prompts and check that outputs meet defined boundaries: JSON is valid, responses contain required fields, values are within expected ranges.


Prompt testing: Enterprises need to track how prompts perform over time. Variation testing with slightly different inputs helps expose weaknesses. This should be automated and included in the CI pipeline, not left to ad hoc manual testing.

Because outputs are probabilistic, tests often look like assertions on structure, ranges, or presence of warning signs rather than exact matches. Hamel Husain stresses that specification-based testing with curated prompt sets is essential, and that evaluations should be problem-specific rather than generic. This aligns well with Java practices: We design integration tests around known inputs and expected boundaries, not exact strings. Over time, this produces confidence that the AI behaves within defined boundaries, even if specific sentences differ.
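A sketch of what such a test can look like with JUnit 5 and Jackson. The OrderAssistant interface and the prompts are hypothetical; the point is that the assertions target structure and ranges rather than exact strings, and that prompt variations run as part of the same automated suite:

```java
import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
import static org.junit.jupiter.api.Assertions.assertTrue;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class OrderAssistantBoundaryTest {

    private final ObjectMapper mapper = new ObjectMapper();

    // Hypothetical AI service under test; in a real project it would be injected.
    private final OrderAssistant assistant = TestAssistants.orderAssistant();

    @ParameterizedTest
    @ValueSource(strings = {
            "What is the status of order 4711?",
            "Tell me about order number 4711, please.",   // prompt variation
            "order 4711 status"
    })
    void responseStaysWithinDefinedBoundaries(String prompt) {
        String response = assistant.chat(prompt);

        // Assert structure and ranges, not exact wording.
        JsonNode json = assertDoesNotThrow(() -> mapper.readTree(response));
        assertTrue(json.hasNonNull("orderId"), "response must reference the order");
        assertTrue(json.hasNonNull("status"), "response must contain a status field");
        assertTrue(json.path("confidence").asDouble() >= 0.0
                && json.path("confidence").asDouble() <= 1.0);
    }
}
```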

Collaboration with Data Science

Another dimension of testing is collaboration with data scientists. Models aren’t static. They can drift as training data changes or as providers update versions. Java teams cannot ignore this. We need methodologies to surface warning signs: sudden drops in accuracy on known inputs or unexpected changes in response style. These signals need to be fed back into monitoring systems that span both the data science and the application sides.

This requires closer collaboration between application developers and data scientists than most enterprises are used to. Developers must expose signals from production (logs, metrics, traces) to help data scientists diagnose drift. Data scientists must provide datasets and evaluation criteria that can be turned into automated tests. Without this feedback loop, drift goes unnoticed until it becomes a business incident.

Domain experts play a central role here. Looking back at Husain, he points out that automated metrics often fail to capture user-perceived quality. Java developers shouldn’t leave evaluation criteria to data scientists alone. Business experts need to help define what “good enough” means in their context. A clinical assistant has very different correctness criteria than a customer service bot. Without domain experts, AI-infused applications risk delivering the wrong things.

Guardrails and Sensitive Data

Guardrails belong under testing as well. For example, an enterprise system should never return personally identifiable information (PII) unless explicitly authorized. Tests must simulate cases where PII could be exposed and confirm that guardrails block those outputs. This is not optional. Even where PII handling is a best practice on the model training side, RAG and memory in particular carry a real risk of personally identifiable information crossing boundaries. Regulatory frameworks like GDPR and HIPAA already enforce strict requirements. Enterprises must prove that AI components respect these boundaries, and testing is the way to demonstrate it.

By treating guardrails as testable components, not ad hoc filters, we raise their reliability. Schema checks, policy enforcement, and PII filters should all have automated tests just like database queries or API endpoints. This reinforces the idea that AI is part of the application, not a mysterious bolt-on.
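A minimal sketch of a guardrail written and tested as an ordinary component. The regex-based email check is deliberately simplistic and purely illustrative; real deployments would use proper PII detection, but the testing approach stays the same:

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.Optional;
import java.util.regex.Pattern;

import org.junit.jupiter.api.Test;

class PiiGuardrailTest {

    // Illustrative guardrail: blocks responses that look like they contain an email address.
    static class PiiGuardrail {
        private static final Pattern EMAIL =
                Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");

        Optional<String> validate(String modelResponse) {
            if (EMAIL.matcher(modelResponse).find()) {
                return Optional.empty(); // blocked: PII detected
            }
            return Optional.of(modelResponse);
        }
    }

    private final PiiGuardrail guardrail = new PiiGuardrail();

    @Test
    void blocksResponsesContainingEmailAddresses() {
        assertTrue(guardrail.validate("Contact jane.doe@example.com for details").isEmpty());
    }

    @Test
    void passesCleanResponsesThrough() {
        assertFalse(guardrail.validate("Your ticket has been escalated.").isEmpty());
    }
}
```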

Edge-Based Scenarios: Inference on the JVM

Not all AI workloads belong in the cloud. Latency, cost, and data sovereignty often demand local inference. This is especially true at the edge: in retail stores, factories, vehicles, or other environments where sending every request to a cloud service is impractical.


Java is starting to catch up here. Projects like Jlama allow language models to run directly inside the JVM. This makes it possible to deploy inference alongside existing Java applications without adding a separate Python or C++ runtime. The advantages are clear: lower latency, no external data transfer, and simpler integration with the rest of the enterprise stack. For developers, it also means you can test and debug everything inside one environment rather than juggling multiple languages and toolchains.
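A sketch of what in-JVM inference can look like through the langchain4j-jlama integration. The module, class, and model names follow the LangChain4j Jlama examples and may differ between versions, so treat them as assumptions; Jlama also needs the incubating Vector API module enabled on the JVM:

```java
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.jlama.JlamaChatModel;
import dev.langchain4j.service.AiServices;

public class EdgeInference {

    interface LocalAssistant {
        String chat(String message);
    }

    public static void main(String[] args) {
        // Weights are downloaded once and cached; inference then runs fully inside the JVM.
        ChatLanguageModel model = JlamaChatModel.builder()
                .modelName("tjake/Llama-3.2-1B-Instruct-JQ4") // quantized model hosted on Hugging Face
                .temperature(0.2f)
                .build();

        LocalAssistant assistant = AiServices.create(LocalAssistant.class, model);
        System.out.println(assistant.chat("Summarize today's inventory alerts in one sentence."));
    }
}
```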

Edge-based inference is still new, but it points to a future where AI isn’t just a remote service you call. It becomes a local capability embedded into the same platform you already trust.

Performance and Numerics in Java

One reason Python became dominant in AI is its excellent math libraries like NumPy and SciPy. These libraries are backed by native C and C++ code, which delivers strong performance. Java has historically lacked first-rate numerics libraries of the same quality and ecosystem adoption. Libraries like ND4J (part of Deeplearning4j) exist, but they never reached the same critical mass.

That picture is starting to change. Project Panama is an important step. It gives Java developers efficient access to native libraries, GPUs, and accelerators without complex JNI code. Combined with ongoing work on vector APIs and Panama-based bindings, Java is becoming much more capable of running performance-sensitive tasks. This evolution matters because inference and machine learning won’t always be external services. In many cases, they’ll be libraries or models you want to embed directly in your JVM-based systems.
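As a small illustration, the incubating Vector API (jdk.incubator.vector, enabled with --add-modules jdk.incubator.vector) lets plain Java express the SIMD-style math that embedding and similarity computations depend on. A minimal dot-product sketch:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class VectorMath {

    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Dot product of two embedding vectors, processed in SIMD-sized chunks.
    public static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upperBound = SPECIES.loopBound(a.length);
        for (; i < upperBound; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        for (; i < a.length; i++) { // scalar tail for lengths not divisible by the lane count
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```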

Why This Matters for Enterprises

Enterprises cannot afford to live in prototype mode. They need systems that run for years, can be supported by large teams, and fit into existing operational practices. AI-infused applications built in Java are well positioned for this. They are:

Closer to business logic: Running in the same environment as existing services

More auditable: Observable with the same tools already used for logs, metrics, and traces

Deployable across cloud and edge: Capable of running in centralized data centers or at the periphery, where latency and privacy matter

This is a different vision from “add AI to last decade’s application.” It’s about creating applications that only make sense because AI is at their core.

In Applied AI for Enterprise Java Development, we go deeper into these patterns. The book provides an overview of architectural concepts, shows how to implement them with real code, and explains how emerging standards like the Agent2Agent Protocol and Model Context Protocol fit in. The goal is to give Java developers a road map to move beyond demos and build applications that are robust, explainable, and ready for production.

The transformation isn’t about replacing everything we know. It’s about extending our toolbox. Java has adapted before, from servlets to EJBs to microservices. The arrival of AI is the next shift. The sooner we understand what these new types of applications look like, the sooner we can build systems that matter.
