AI Application Architecture: LLM + Memory + Tools
The Problem
You’ve built a chatbot. It works great for simple questions. But when users ask about your product’s specific features, it hallucinates. When they want to perform actions (check order status, update settings), it can’t help. When they return the next day, it has no memory of previous conversations.
You’ve hit the wall that every AI developer encounters:
An LLM alone is not an application. It’s a component.
This is where most AI projects fail. Teams treat the LLM as the solution, not as one piece of a larger system. They build prompts, not architectures.
In this article, we’ll explore the anatomy of production AI applications. You’ll learn the core components, how they interact, and the architectural patterns that separate working systems from demos.
The Core Architecture: Beyond the LLM
Here’s the fundamental insight:
The LLM is the brain. But a brain without senses, memory, and hands can’t accomplish much.
A complete AI application has five core components:
```
┌─────────────────────────────────────────────────────────┐
│                     AI Application                      │
│                                                         │
│  ┌─────────────┐                                        │
│  │     LLM     │ ← The reasoning engine                 │
│  └──────┬──────┘                                        │
│         │                                               │
│  ┌──────┴──────┐                                        │
│  │   Memory    │ ← Conversation history, user state     │
│  └──────┬──────┘                                        │
│         │                                               │
│  ┌──────┴──────┐                                        │
│  │    Tools    │ ← APIs, databases, external services   │
│  └──────┬──────┘                                        │
│         │                                               │
│  ┌──────┴──────┐                                        │
│  │  Knowledge  │ ← RAG, documents, domain data          │
│  └──────┬──────┘                                        │
│         │                                               │
│  ┌──────┴──────┐                                        │
│  │   Context   │ ← Instructions, constraints, format    │
│  └─────────────┘                                        │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
Let’s examine each component.
Component 1: The LLM (Reasoning Engine)
What It Does
The LLM is the reasoning core. It:
- Interprets user input
- Decides what actions to take
- Generates responses
- Orchestrates other components
What It Doesn’t Do
The LLM does NOT:
- Store information between calls
- Access real-time data
- Execute actions
- Guarantee accuracy
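Statelessness is worth internalizing before adding components: the model sees only what you send on each call. A minimal sketch (placeholder message dicts, no real API call) of what "no memory between calls" means in practice:

```python
# LLM APIs are stateless: each call sees only the messages you send it.
# To simulate continuity, the application must resend history every time.
history = []

def build_request(history, user_input):
    """Assemble the full message list for a single stateless API call."""
    return history + [{"role": "user", "content": user_input}]

# Turn 1: the model sees exactly one message.
request1 = build_request(history, "My name is Ada.")

# Unless we persist the exchange ourselves, turn 2 knows nothing of turn 1.
history.append({"role": "user", "content": "My name is Ada."})
history.append({"role": "assistant", "content": "Nice to meet you, Ada!"})

# Turn 2: continuity exists only because we resent the stored history.
request2 = build_request(history, "What is my name?")
assert len(request2) == 3  # two stored messages + the new question
```

Every "memory" feature in the rest of this article is ultimately a strategy for deciding which of these stored messages to resend.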
Implementation Pattern
```python
import time

class LLMComponent:
    def __init__(self, model_name, temperature=0.7):
        self.model_name = model_name
        self.temperature = temperature

    def generate(self, messages, tools=None):
        """Generate a response, optionally with tool calling."""
        response = call_llm_api(  # your provider's client call
            model=self.model_name,
            messages=messages,
            temperature=self.temperature,
            tools=tools  # Optional: available tools
        )
        return response

    def generate_with_retry(self, messages, max_retries=3):
        """Generate with retry logic for reliability."""
        for attempt in range(max_retries):
            try:
                return self.generate(messages)
            except RateLimitError:
                # Exponential backoff: 1s, 2s, 4s, ...
                time.sleep(2 ** attempt)
        raise MaxRetriesExceeded()
```
Key Considerations
| Consideration | Guidance |
|---|---|
| Model selection | Balance cost vs. capability |
| Temperature | Lower for deterministic tasks, higher for creative |
| Token limits | Design prompts with context window in mind |
| Cost monitoring | Track tokens per request, implement budgets |
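The cost-monitoring row can be made concrete with a small per-session budget guard. This is a sketch under stated assumptions: a flat illustrative price per 1K tokens and a naive whitespace token estimate (a real tokenizer such as tiktoken counts differently):

```python
class TokenBudget:
    """Naive per-session token/cost tracker; prices and counts are illustrative."""
    def __init__(self, max_tokens, price_per_1k=0.01):
        self.max_tokens = max_tokens
        self.price_per_1k = price_per_1k
        self.used = 0

    def estimate_tokens(self, text):
        # Rough stand-in for a real tokenizer (e.g. tiktoken).
        return len(text.split())

    def charge(self, text):
        """Record token usage; refuse requests that would exceed the budget."""
        tokens = self.estimate_tokens(text)
        if self.used + tokens > self.max_tokens:
            raise RuntimeError("Token budget exceeded")
        self.used += tokens
        return tokens

    @property
    def cost(self):
        return self.used / 1000 * self.price_per_1k

budget = TokenBudget(max_tokens=100)
budget.charge("hello world from the assistant")  # 5 estimated tokens
```

In production you would charge both prompt and completion tokens using the counts the API returns, but the shape of the guard stays the same.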
Component 2: Memory (State Management)
What It Does
Memory provides continuity. It:
- Stores conversation history
- Maintains user preferences
- Tracks session state
- Enables personalization over time
Types of Memory
```
┌─────────────────────────────────────────┐
│            Memory Hierarchy             │
├─────────────────────────────────────────┤
│ Short-term (Context Window)             │
│  - Current conversation                 │
│  - Immediate context                    │
│  - Limited by token window              │
├─────────────────────────────────────────┤
│ Long-term (Persistent Storage)          │
│  - User preferences                     │
│  - Historical interactions              │
│  - Learned information                  │
│  - Unlimited (database-backed)          │
├─────────────────────────────────────────┤
│ Episodic (Session-based)                │
│  - Specific conversations               │
│  - Task progress                        │
│  - Temporary state                      │
└─────────────────────────────────────────┘
```
Implementation Pattern
```python
from datetime import datetime

class MemoryManager:
    def __init__(self, user_id, db_connection):
        self.user_id = user_id
        self.db = db_connection
        self.context_window = []

    def add_message(self, role, content):
        """Add a message to short-term memory."""
        self.context_window.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now()
        })
        # Persist to long-term storage
        self.db.save_message(self.user_id, role, content)

    def get_context(self, max_tokens=4000):
        """Get conversation context, respecting the token budget."""
        # Simple approach: take the last N messages that fit
        # Advanced: summarize older messages
        messages = []
        tokens = 0
        for msg in reversed(self.context_window):
            msg_tokens = count_tokens(msg["content"])
            if tokens + msg_tokens > max_tokens:
                break
            messages.insert(0, msg)
            tokens += msg_tokens
        return messages

    def get_user_preferences(self):
        """Retrieve long-term user preferences."""
        return self.db.get_preferences(self.user_id)

    def summarize_session(self):
        """Create a summary of the current session."""
        # Use the LLM to summarize, store for future context
        summary = llm.generate(f"""
        Summarize this conversation:
        {self.context_window}
        Key points, decisions, and action items.
        """)
        self.db.save_session_summary(self.user_id, summary)
        return summary
```
Memory Management Strategies
Strategy 1: Sliding Window
```python
def get_sliding_window(messages, window_size=10):
    """Keep only the last N messages."""
    return messages[-window_size:]
```
Strategy 2: Summarization
```python
def get_summarized_context(messages, max_tokens=4000):
    """Summarize older messages to save tokens."""
    if count_tokens(messages) <= max_tokens:
        return messages
    # Keep recent messages, summarize older ones
    recent = messages[-5:]
    older = messages[:-5]
    summary = llm.generate(f"Summarize: {older}")
    return [{"role": "system",
             "content": f"Previous conversation summary: {summary}"}] + recent
```
Strategy 3: Selective Retrieval
```python
def get_relevant_context(messages, query, db):
    """Retrieve only relevant past messages."""
    # Embed the current query
    query_embedding = embed(query)
    # Find similar past messages
    relevant = db.similarity_search(query_embedding, top_k=5)
    return relevant + messages[-3:]  # Add recent context
```
Component 3: Tools (Action Execution)
What It Does
Tools provide capabilities. They:
- Execute actions (API calls, database operations)
- Access real-time data
- Perform computations
- Interact with external systems
The Tool Pattern
```python
class Tool:
    def __init__(self, name, description, function, parameters=None):
        self.name = name
        self.description = description
        self.function = function
        # JSON Schema describing the function's arguments
        self.parameters = parameters or {"type": "object", "properties": {}}

    def to_llm_schema(self):
        """Convert to the LLM's expected tool schema."""
        return {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": self.parameters
            }
        }

    def execute(self, **kwargs):
        """Execute the tool with given parameters."""
        return self.function(**kwargs)

# Example tools
tools = [
    Tool(
        name="get_weather",
        description="Get current weather for a location",
        function=get_weather_api,
    ),
    Tool(
        name="search_database",
        description="Search the product database",
        function=search_products,
    ),
    Tool(
        name="create_ticket",
        description="Create a support ticket",
        function=create_support_ticket,
    ),
    Tool(
        name="calculate",
        description="Perform mathematical calculations",
        function=execute_python_code,
    ),
]
```
Tool Calling Flow
```
┌─────────────────────────────────────────────────────────┐
│                    Tool Calling Loop                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  1. User Input                                          │
│        ↓                                                │
│  2. LLM receives input + tool definitions               │
│        ↓                                                │
│  3. LLM decides: respond OR call tool                   │
│        ↓                                                │
│  4. If tool call:                                       │
│     - LLM outputs tool name + arguments                 │
│     - System executes tool                              │
│     - Tool result returned to LLM                       │
│     - LLM generates final response                      │
│        ↓                                                │
│  5. Return response to user                             │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
Implementation Pattern
```python
class ToolExecutor:
    def __init__(self, tools):
        self.tools = {tool.name: tool for tool in tools}

    def process_with_tools(self, user_input, memory):
        """Process input with tool calling support."""
        messages = memory.get_context()
        messages.append({"role": "user", "content": user_input})
        # Get LLM response with tool calling enabled
        response = llm.generate(
            messages=messages,
            tools=[t.to_llm_schema() for t in self.tools.values()]
        )
        # Check if the LLM wants to call a tool
        if response.tool_calls:
            tool_call = response.tool_calls[0]
            tool = self.tools[tool_call.name]
            # Execute the tool
            result = tool.execute(**tool_call.arguments)
            # Add the tool result to context
            messages.append(response)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result)
            })
            # Get the final response from the LLM
            final_response = llm.generate(messages=messages)
            return final_response.content
        return response.content
```
Component 4: Knowledge (RAG)
What It Does
Knowledge provides grounded information. It:
- Supplies domain-specific data
- Enables accurate responses about your product/content
- Reduces hallucinations
- Extends beyond training cutoff
RAG Flow
```
┌─────────────────────────────────────────────────────────┐
│           RAG (Retrieval-Augmented Generation)          │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Indexing Phase:                                        │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│  │ Documents│ →  │  Chunk   │ →  │  Embed   │           │
│  │          │    │  & Clean │    │  & Store │           │
│  └──────────┘    └──────────┘    └────┬─────┘           │
│                                       ↓                 │
│  ┌──────────────────────────────────────────┐           │
│  │             Vector Database              │           │
│  └──────────────────────────────────────────┘           │
│                                                         │
│  Query Phase:                                           │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│  │   User   │ →  │  Embed   │ →  │Similarity│           │
│  │ Question │    │ Question │    │  Search  │           │
│  └──────────┘    └──────────┘    └────┬─────┘           │
│                                       ↓                 │
│  ┌──────────────────────────────────────────┐           │
│  │    Retrieved Context + Question → LLM    │           │
│  └──────────────────────────────────────────┘           │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
Implementation Pattern
```python
class RAGSystem:
    def __init__(self, vector_db, embedding_model):
        self.vector_db = vector_db
        self.embedding_model = embedding_model

    def index_documents(self, documents):
        """Index documents for retrieval."""
        for doc in documents:
            # Chunk the document
            chunks = self.chunk_document(doc)
            for chunk in chunks:
                # Create embedding
                embedding = self.embedding_model.encode(chunk.text)
                # Store in the vector DB
                self.vector_db.upsert(
                    id=chunk.id,
                    embedding=embedding,
                    metadata={
                        "text": chunk.text,
                        "source": doc.source,
                        "page": chunk.page
                    }
                )

    def query(self, question, top_k=5):
        """Retrieve relevant context for a question."""
        # Embed the question
        query_embedding = self.embedding_model.encode(question)
        # Search for similar documents
        results = self.vector_db.similarity_search(
            query_embedding,
            top_k=top_k
        )
        # Format context for the LLM
        context = "\n\n".join([r.metadata["text"] for r in results])
        return context

    def chunk_document(self, document, chunk_size=500, overlap=50):
        """Split a document into overlapping chunks."""
        chunks = []
        start = 0
        while start < len(document.text):
            end = start + chunk_size
            chunk_text = document.text[start:end]
            chunks.append(Chunk(
                id=f"{document.id}_{start}",
                text=chunk_text,
                page=document.page
            ))
            start += chunk_size - overlap
        return chunks
```
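Because the start position advances by `chunk_size - overlap`, consecutive chunks share `overlap` characters. A standalone check of the character spans the chunker above would produce (pure arithmetic, no document objects needed):

```python
def chunk_spans(text_len, chunk_size=500, overlap=50):
    """Compute the (start, end) character spans of overlapping chunks."""
    spans = []
    start = 0
    while start < text_len:
        spans.append((start, min(start + chunk_size, text_len)))
        start += chunk_size - overlap
    return spans

spans = chunk_spans(1000)
# Chunks start every chunk_size - overlap = 450 characters.
assert spans == [(0, 500), (450, 950), (900, 1000)]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk, at the cost of indexing some text twice.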
RAG Query Pattern
```python
def answer_with_rag(question, rag_system):
    """Answer a question using RAG."""
    # Retrieve relevant context
    context = rag_system.query(question)
    # Generate an answer grounded in the context
    prompt = f"""
    Based on the following context, answer the question.
    If the answer is not in the context, say "I don't have enough information."

    Context:
    {context}

    Question: {question}

    Answer:
    """
    return llm.generate(prompt)
```
Component 5: Context (Instructions & Constraints)
What It Does
Context provides guidance. It:
- Sets behavior and tone
- Defines constraints and rules
- Specifies output format
- Provides examples (few-shot)
Context Layers
```
┌─────────────────────────────────────────┐
│            Context Hierarchy            │
├─────────────────────────────────────────┤
│ System Instructions (Always active)     │
│  - Role definition                      │
│  - Behavioral constraints               │
│  - Safety guidelines                    │
├─────────────────────────────────────────┤
│ Task Instructions (Per-request)         │
│  - Specific task description            │
│  - Output format requirements           │
│  - Examples (few-shot)                  │
├─────────────────────────────────────────┤
│ Conversation Context (Session)          │
│  - Previous messages                    │
│  - User preferences                     │
│  - Current task state                   │
├─────────────────────────────────────────┤
│ Retrieved Context (RAG)                 │
│  - Relevant documents                   │
│  - Knowledge base entries               │
│  - External data                        │
└─────────────────────────────────────────┘
```
Implementation Pattern
```python
class ContextBuilder:
    def __init__(self):
        self.system_instructions = []
        self.task_instructions = []
        self.examples = []

    def add_system_instruction(self, instruction):
        """Add a system-level instruction."""
        self.system_instructions.append(instruction)
        return self

    def add_task_instruction(self, instruction):
        """Add a task-specific instruction."""
        self.task_instructions.append(instruction)
        return self

    def add_example(self, input_text, output_text):
        """Add a few-shot example."""
        self.examples.append({
            "input": input_text,
            "output": output_text
        })
        return self

    def build(self, user_input):
        """Build the complete context."""
        messages = []
        # System instructions
        if self.system_instructions:
            system_prompt = "\n".join(self.system_instructions)
            messages.append({"role": "system", "content": system_prompt})
        # Task instructions
        if self.task_instructions:
            task_prompt = "\n".join(self.task_instructions)
            messages.append({"role": "system", "content": task_prompt})
        # Few-shot examples
        for example in self.examples:
            messages.append({"role": "user", "content": example["input"]})
            messages.append({"role": "assistant", "content": example["output"]})
        # User input
        messages.append({"role": "user", "content": user_input})
        return messages

# Usage
context = (ContextBuilder()
    .add_system_instruction("You are a helpful customer support assistant.")
    .add_system_instruction("Always be polite and professional.")
    .add_task_instruction("Respond in 2-3 sentences maximum.")
    .add_task_instruction("If you don't know, offer to connect with a human.")
    .add_example(
        "My order is late",
        "I apologize for the delay. Let me look up your order status. "
        "Could you provide your order number?"
    )
    .build("Where is my package?")
)
```
Putting It All Together: Complete Architecture
The AI Application Class
```python
class AIApplication:
    def __init__(self, user_id, config):
        self.user_id = user_id
        self.llm = LLMComponent(config["model"])
        self.memory = MemoryManager(user_id, config["database"])
        self.tools = ToolExecutor(config["tools"])
        self.rag = RAGSystem(config["vector_db"], config["embedding_model"])

    def _build_default_context(self):
        """Build a fresh context builder (one per request, so retrieved
        context doesn't accumulate across turns)."""
        return (ContextBuilder()
            .add_system_instruction("You are a helpful assistant.")
            .add_system_instruction("Use tools when needed for accurate information.")
            .add_system_instruction("If unsure, acknowledge uncertainty."))

    def process(self, user_input):
        """Process user input through the complete system."""
        # 1. Store input in memory
        self.memory.add_message("user", user_input)
        # 2. Check if RAG is needed
        context_builder = self._build_default_context()
        if self._needs_knowledge_retrieval(user_input):
            context = self.rag.query(user_input)
            context_builder.add_task_instruction(
                f"Use this context when relevant: {context}"
            )
        # 3. Build messages
        messages = context_builder.build(user_input)
        # 4. Process with tools if needed
        if self._needs_tool_execution(user_input):
            response = self.tools.process_with_tools(user_input, self.memory)
        else:
            response = self.llm.generate(messages).content
        # 5. Store response in memory
        self.memory.add_message("assistant", response)
        return response

    def _needs_knowledge_retrieval(self, input_text):
        """Determine if RAG retrieval is needed."""
        # Simple heuristic: check for domain-specific terms
        # Advanced: use a classifier or the LLM to decide
        domain_keywords = ["product", "feature", "pricing", "policy"]
        return any(keyword in input_text.lower() for keyword in domain_keywords)

    def _needs_tool_execution(self, input_text):
        """Determine if tool execution is needed."""
        # Simple heuristic: check for action verbs
        # Advanced: use the LLM to classify intent
        action_verbs = ["check", "update", "create", "delete", "search"]
        return any(input_text.lower().startswith(verb) for verb in action_verbs)
```
Architectural Patterns
Pattern 1: Simple Chatbot
```
User → LLM → Response
        ↑
     Context
```
Use when: General conversation, no special knowledge needed.
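This pattern collapses to a single call with a fixed system prompt. A sketch with the LLM stubbed out (`fake_llm` is a placeholder; swap in your provider's client):

```python
def simple_chatbot(user_input, llm_call,
                   system_prompt="You are a helpful assistant."):
    """Pattern 1: context + user input -> LLM -> response.
    No memory, tools, or retrieval involved."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]
    return llm_call(messages)

# Stub standing in for a real API call.
def fake_llm(messages):
    return f"(echo) {messages[-1]['content']}"

print(simple_chatbot("Hello!", fake_llm))  # → (echo) Hello!
```

Everything in the later patterns is this core plus extra steps before and after the call.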
Pattern 2: RAG Chatbot
```
User → Embed → Vector DB → Context → LLM → Response
```
Use when: Domain-specific Q&A, product support.
Pattern 3: Agent with Tools
```
User → LLM → Tool Decision → Execute → LLM → Response
```
Use when: Actions needed (API calls, database operations).
Pattern 4: Full System
```
        ┌──────────┐
        │   User   │
        └────┬─────┘
             ↓
        ┌──────────┐
        │  Memory  │ ← Conversation history
        └────┬─────┘
             ↓
     ┌───────┴────────┐
     ↓                ↓
┌───────────┐    ┌──────────┐
│    RAG    │    │  Tools   │
│(Knowledge)│    │ (Actions)│
└─────┬─────┘    └────┬─────┘
      ↓               ↓
      └───────┬───────┘
              ↓
        ┌───────────┐
        │    LLM    │
        │(Reasoning)│
        └─────┬─────┘
              ↓
        ┌──────────┐
        │ Response │
        └──────────┘
```
Use when: Production applications requiring all capabilities.
Key Takeaways
- LLM is one component, not the entire application.
- Memory provides continuity across conversations.
- Tools enable actions and real-time data access.
- Knowledge (RAG) grounds responses in your domain.
- Context guides behavior and output format.
- Choose the pattern that matches your use case complexity.
Next Article
In Article 5: Why AI Applications Are Ecosystems, we’ll zoom out from architecture to explore how AI systems interact with users, evolve over time, and create feedback loops. We’ll examine why AI applications are more like living ecosystems than traditional software.
*This is the fourth article in the “Software Engineering in the LLM Era” series. Read Article 1 | Read Article 2 | Read Article 3.*
💬 What’s your experience building AI applications? Which component has been most challenging in your projects? Share in the comments! 🚀