LLM Strengths and Limitations: A Practical Framework

The Problem

After reading the first two articles, you understand what LLMs are and how they generalize. But you still face the daily question:

“Should I use an LLM for this task?”

The industry hype says “AI everything.” The skeptics say “LLMs are unreliable.” Both are wrong.

The truth is nuanced: LLMs excel at some tasks and fail catastrophically at others. The key is knowing which is which.

In this article, we’ll build a practical framework for evaluating when to use LLMs in your systems. This isn’t theoretical—it’s a decision matrix you can apply today.


The Core Insight: LLMs as Reasoning Engines, Not Calculators

Let’s start with a fundamental principle:

LLMs are excellent for probabilistic reasoning. They are terrible at deterministic computation.

This single insight explains most of the LLM behavior you'll see in practice.

Example: The Same Task, Different Approaches

Task: Add up a list of numbers

❌ Bad: "What is 237 + 892 + 156 + 445 + 723?"
→ LLM will guess based on number patterns
→ May get it wrong

✅ Good: "Write Python code to sum [237, 892, 156, 445, 723]"
→ LLM generates deterministic code
→ Code executes correctly

The LLM shouldn’t compute. It should generate the computation.

This pattern—LLM for reasoning, tools for execution—is the foundation of reliable AI systems.


The Strength Matrix: Where LLMs Excel

Based on their architecture (Article 1) and generalization capabilities (Article 2), LLMs are strong in these areas:

1. Language Understanding and Generation

Strengths:

  • Summarization
  • Translation
  • Tone adjustment
  • Content generation
  • Question answering (from provided context)

Why it works: This is the model’s native domain. It was trained on text, so text manipulation is direct pattern matching.

Example:

# Reliable use case
def summarize_support_ticket(ticket_text):
    prompt = f"""
    Summarize this support ticket in 2-3 sentences.
    Identify: issue type, urgency, customer sentiment.
    
    Ticket: {ticket_text}
    """
    return llm.generate(prompt)

2. Pattern Recognition and Classification

Strengths:

  • Sentiment analysis
  • Intent classification
  • Entity extraction
  • Anomaly detection (in text)
  • Categorization

Why it works: LLMs recognize patterns from training and can apply them to novel inputs.

Example:

# Reliable use case
def classify_feature_request(request_text):
    prompt = f"""
    Classify this feature request:
    - Category: UI, Backend, API, Performance, Security
    - Priority: High, Medium, Low
    - Effort: Small, Medium, Large
    
    Request: {request_text}
    
    Output as JSON.
    """
    return llm.generate(prompt)

3. Cross-Domain Synthesis

Strengths:

  • Combining concepts from different fields
  • Generating analogies
  • Brainstorming solutions
  • Exploring design alternatives
  • Translating between domains (e.g., business requirements → technical specs)

Why it works: LLMs have seen patterns across many domains and can combine them in novel ways.

Example:

# Reliable use case
def generate_architecture_options(requirements):
    prompt = f"""
    Given these requirements, propose 3 architecture options.
    For each, list:
    - Components
    - Trade-offs
    - When to choose this approach
    
    Requirements: {requirements}
    """
    return llm.generate(prompt)
# Output: Ideas for human evaluation

4. Code Generation (with Constraints)

Strengths:

  • Boilerplate generation
  • Common patterns (CRUD, filters, transformations)
  • Refactoring suggestions
  • Documentation
  • Test case generation

Why it works: Code has regular patterns. The model has seen millions of examples.

Example:

# Reliable use case
def generate_crud_endpoints(model_name, fields):
    prompt = f"""
    Generate REST API endpoints for {model_name} with fields: {fields}
    Include: GET list, GET single, POST, PUT, DELETE
    Use Python FastAPI.
    """
    return llm.generate(prompt)
# Output: Review and test before deploying

5. Explanation and Teaching

Strengths:

  • Explaining concepts at different levels
  • Generating examples
  • Creating learning materials
  • Debugging explanations
  • Documentation generation

Why it works: The model has seen countless explanations and can adapt patterns to your context.

Example:

# Reliable use case
def explain_error(error_message, code_context):
    prompt = f"""
    Explain this error in simple terms:
    Error: {error_message}
    Code: {code_context}
    
    Include:
    1. What caused this
    2. How to fix it
    3. How to prevent it
    """
    return llm.generate(prompt)

The Limitation Matrix: Where LLMs Fail

Now for the critical part—knowing when NOT to use LLMs.

1. Precise Computation

Weaknesses:

  • Arithmetic beyond simple cases
  • Complex mathematical reasoning
  • Cryptographic operations
  • Financial calculations requiring precision

Why it fails: LLMs predict number tokens; they don’t compute.

Example:

# ❌ Unreliable
def calculate_compound_interest(principal, rate, years):
    prompt = f"Calculate compound interest for {principal} at {rate}% for {years} years"
    return llm.generate(prompt)  # May be wrong

# ✅ Reliable
def calculate_compound_interest(principal, rate, years):
    code = llm.generate(
        f"Write Python code that computes compound interest for {principal} "
        f"at {rate}% over {years} years and stores the answer in a variable named `result`"
    )
    # Execute the generated code, not the LLM's answer.
    # Note: exec() returns None, so read the value from the namespace.
    # In production, sandbox generated code before executing it.
    namespace = {}
    exec(code, namespace)
    return namespace["result"]

2. Long Multi-Step Reasoning

Weaknesses:

  • Complex logic chains
  • Problems requiring working memory
  • Tasks needing backtracking
  • Multi-constraint optimization

Why it fails: LLMs have no working memory beyond the context window. Tokens are generated one at a time from the visible context, with no mechanism for backtracking or revising earlier steps.

Example:

# ❌ Unreliable for complex cases
def solve_logic_puzzle(puzzle_description):
    return llm.generate(f"Solve: {puzzle_description}")
# Works for simple puzzles, fails as complexity increases

# ✅ Better approach
def solve_logic_puzzle(puzzle_description):
    # Break into steps, validate each
    step1 = llm.generate(f"Identify constraints: {puzzle_description}")
    validated1 = verify_constraints(step1)
    step2 = llm.generate(f"Given {validated1}, what's the next deduction?")
    # ... chain with validation
    return combine_results(validated1, step2, ...)

3. Factual Retrieval (Without RAG)

Weaknesses:

  • Recent events (post-training)
  • Specific facts not in training
  • Private or proprietary information
  • Precise citations and references

Why it fails: LLMs generate plausible text; they don’t retrieve facts.

Example:

# ❌ Unreliable
def get_company_revenue(company_name):
    return llm.generate(f"What is {company_name}'s revenue in 2025?")
# May hallucinate

# ✅ Reliable (RAG pattern)
def get_company_revenue(company_name):
    documents = search_knowledge_base(company_name)
    return llm.generate(f"Based on these documents: {documents}, what is {company_name}'s revenue?")
# Grounded in retrieved context

4. Deterministic Behavior

Weaknesses:

  • Tasks requiring identical outputs
  • Systems needing reproducibility
  • Validation and testing logic
  • Security-critical decisions

Why it fails: LLMs are probabilistic by nature. Same input ≠ same output.

Example:

# ❌ Unreliable
def validate_user_input(user_input):
    result = llm.generate(f"Is this input valid? {user_input}")
    return result == "valid"  # May vary between calls

# ✅ Reliable
def validate_user_input(user_input):
    # Use the LLM once, offline, to draft validation rules;
    # review them, then apply the fixed rules deterministically on every call
    rules = load_reviewed_validation_rules()  # LLM-drafted, human-approved
    return apply_rules_deterministically(rules, user_input)

5. Long-Term Memory and State

Weaknesses:

  • Remembering across sessions
  • Maintaining conversation state (beyond context window)
  • Learning from user interactions
  • Accumulating knowledge over time

Why it fails: LLMs are stateless. Each call is independent.

Example:

# ❌ Won't work
def chat_with_user(user_id, message):
    # LLM won't remember previous conversations
    return llm.generate(f"User says: {message}")

# ✅ Works (explicit memory)
def chat_with_user(user_id, message):
    history = get_conversation_history(user_id)
    response = llm.generate(f"History: {history}. User says: {message}")
    save_conversation_history(user_id, message, response)
    return response

The Decision Framework: Should You Use an LLM?

Use this checklist when evaluating a use case:

Questions to Ask

1. Is the task probabilistic or deterministic?
   → Probabilistic: LLM suitable
   → Deterministic: Use traditional code

2. Does it require precise computation?
   → Yes: LLM generates code, code executes
   → No: LLM can handle directly

3. Is factual accuracy critical?
   → Yes: Use RAG or avoid LLM
   → No: Direct LLM generation OK

4. How long is the reasoning chain?
   → Short (1-3 steps): LLM suitable
   → Long: Break into steps with validation

5. Does it require memory/state?
   → Yes: Build external memory system
   → No: Stateless LLM works

6. What's the cost of errors?
   → High: Add validation layers
   → Low: Direct LLM use acceptable

7. Is the output verifiable?
   → Yes: LLM + validation pattern
   → No: Consider if LLM is appropriate
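The checklist above can be sketched as a small helper. This is an illustrative sketch only: `TaskProfile` and `recommend_approach` are hypothetical names, not a real library, and the thresholds (e.g. "more than 3 reasoning steps") are assumptions based on the questions above.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """Answers to the checklist questions (hypothetical structure)."""
    deterministic: bool
    needs_precise_math: bool
    facts_critical: bool
    reasoning_steps: int
    needs_state: bool
    error_cost_high: bool

def recommend_approach(task: TaskProfile) -> list[str]:
    """Map checklist answers to architectural recommendations."""
    if task.deterministic:
        return ["traditional code"]          # Question 1: skip the LLM entirely
    plan = ["LLM"]
    if task.needs_precise_math:
        plan.append("generate code, execute with a tool")   # Question 2
    if task.facts_critical:
        plan.append("ground with RAG")                      # Question 3
    if task.reasoning_steps > 3:
        plan.append("break into validated steps")           # Question 4
    if task.needs_state:
        plan.append("add external memory")                  # Question 5
    if task.error_cost_high:
        plan.append("add validation layer")                 # Question 6
    return plan
```

Encoding the checklist this way makes the decision auditable: the output is a list of patterns to combine, not a yes/no verdict.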

Decision Tree

                ┌─────────────────┐
                │    New Task     │
                └────────┬────────┘
                         ↓
                ┌─────────────────┐  Yes  ┌───────────────────┐
                │ Deterministic?  ├──────→│ Don't use LLM;    │
                └────────┬────────┘       │ use code          │
                      No │                └───────────────────┘
                         ↓
                ┌─────────────────┐  Yes  ┌───────────────────┐
                │ Needs facts?    ├──────→│ Use RAG + LLM     │
                └────────┬────────┘       └───────────────────┘
                      No │
                         ↓
                ┌─────────────────┐  Yes  ┌───────────────────┐
                │ Precise math?   ├──────→│ LLM generates     │
                └────────┬────────┘       │ code to execute   │
                      No │                └───────────────────┘
                         ↓
                ┌─────────────────┐
                │ Direct LLM use  │
                │ (with validation│
                │  if high stakes)│
                └─────────────────┘

Architectural Patterns: Compensating for Limitations

Pattern 1: LLM + Validator

def process_with_validation(input_data):
    # LLM generates
    result = llm.generate(f"Process: {input_data}")
    
    # Traditional code validates
    if is_valid(result):
        return result
    else:
        return handle_error(input_data)

Use when: Output format is predictable and verifiable.


Pattern 2: LLM + Tool

def calculate_with_tool(question):
    # LLM figures out what to compute
    computation = llm.generate(f"What code computes: {question}?")
    
    # Tool executes
    result = execute_code(computation)
    
    return result

Use when: Computation is required.


Pattern 3: LLM + Memory (RAG)

def answer_with_context(question):
    # Retrieve relevant facts
    context = retrieve_from_knowledge_base(question)
    
    # LLM answers based on context
    answer = llm.generate(f"Context: {context}. Question: {question}")
    
    return answer

Use when: Factual accuracy matters.


Pattern 4: LLM Chain with Validation

def solve_complex_task(task):
    # Break into steps
    step1 = llm.generate(f"Step 1 for: {task}")
    validated1 = validate_step1(step1)
    
    step2 = llm.generate(f"Step 2 given: {validated1}")
    validated2 = validate_step2(step2)
    
    # Combine
    return combine(validated1, validated2)

Use when: Reasoning chain is long.


The Meta-Lesson: LLMs Are Components, Not Solutions

The biggest mistake engineers make is treating LLMs as complete solutions:

❌ Wrong mental model: "I'll use an LLM to solve X"
✅ Right mental model: "I'll build a system where the LLM handles the parts it's good at"

LLMs are powerful components. But like any component, they need:

  • Interfaces (prompts, APIs)
  • Validation (testing, monitoring)
  • Integration (tools, memory, other services)
  • Fallbacks (error handling, alternatives)
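Those four needs can be wrapped into one small component boundary. A minimal sketch, assuming a stub client in place of the article's `llm` object; `StubLLM`, `llm_component`, and the retry count are illustrative choices, not a prescribed API.

```python
class StubLLM:
    """Stand-in for a real LLM client (plays the role of `llm` in the
    earlier examples); returns canned replies in order."""
    def __init__(self, replies):
        self.replies = list(replies)

    def generate(self, prompt):
        return self.replies.pop(0)

def llm_component(llm, prompt, validate, fallback, retries=2):
    """Treat the LLM as a component: prompt is the interface,
    `validate` is the validation layer, `fallback` handles errors."""
    for _ in range(retries + 1):
        result = llm.generate(prompt)   # interface
        if validate(result):            # validation
            return result               # integration point for callers
    return fallback(prompt)             # fallback after repeated failures
```

Usage: `llm_component(StubLLM(["bad", "ok"]), "classify this", validate=lambda r: r == "ok", fallback=lambda p: "needs human review")` retries past the invalid reply and returns the validated one.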

Key Takeaways

  • LLMs excel at: language tasks, pattern recognition, cross-domain synthesis, code generation, explanation.
  • LLMs fail at: precise computation, long reasoning chains, factual retrieval (without RAG), deterministic behavior, stateful operations.
  • Use the decision framework: Ask the 7 questions before using an LLM.
  • Apply architectural patterns: LLM + Validator, LLM + Tool, LLM + Memory, LLM Chain.
  • LLMs are components, not solutions: Design systems, not just prompts.

Next Article

In Article 4: AI Application Architecture—LLM + Memory + Tools, we’ll dive deep into building complete systems. We’ll explore how to combine LLMs with memory systems, external tools, and knowledge bases to create reliable, production-ready AI applications.


*This is the third article in the “Software Engineering in the LLM Era” series. Read Article 1 | Read Article 2.*

💬 What’s your experience with LLM strengths and failures? Share a use case that surprised you in the comments! 🚀

This post is licensed under CC BY 4.0 by the author.