Chapter 11: News Aggregator Case Study

Applying Advanced JSON Processing to Real Multi-API Integration

1. Introduction

You want to stay current with Python programming news. NewsAPI aggregates articles from TechCrunch, Medium, and major tech outlets. The Guardian provides quality journalism and technology coverage. HackerNews surfaces what developers are actually discussing. Each source offers valuable perspectives, but querying three different APIs means writing three separate integrations, handling three different data structures, and managing three different error patterns.

This isn't unique to news aggregation. Your e-commerce platform needs both Stripe and PayPal. Your analytics dashboard combines Google Analytics with Mixpanel. Your authentication system supports GitHub, Google, and Microsoft OAuth. Production systems constantly face this challenge: multiple external services, each with its own response format, failure modes, and quirks. Yet your application needs unified, consistent data.

This chapter teaches you the systematic approach professionals use for multi-API integration. You'll build a news aggregator that queries three different sources simultaneously, normalizes incompatible response formats into a single internal representation, handles failures gracefully, removes duplicates intelligently, and presents unified results. By the end, you'll have a working command-line tool searching real APIs plus the architectural knowledge to integrate any collection of external services.

What Success Looks Like

Before diving into implementation details, see the completed aggregator in action. This preview shows what you'll build:

$ python news_aggregator.py
Terminal Output
======================================================================
📰 NEWS AGGREGATOR
======================================================================

Search news from NewsAPI, The Guardian, and HackerNews

Commands:
  search <query>  - Search for news articles
  more            - Show more results from last search
  group           - Show last results grouped by source
  help            - Show this help message
  quit            - Exit application
======================================================================

aggregator> search python programming ↩

Searching for 'python programming'...

✓ NewsAPI: 5 articles
✓ Guardian: 5 articles  
✓ HackerNews: 5 articles

======================================================================
📰 NEWS AGGREGATOR RESULTS
======================================================================

Query: 'python programming'
Sources: NewsAPI, Guardian, HackerNews
Articles: 12 (3 duplicates removed)

1. Python 3.13 Performance Improvements
   📰 TechCrunch • January 15, 2025 at 02:30 PM
   ✍️  Sarah Chen
   Python's latest release brings significant speed improvements...
   🔗 https://techcrunch.com/2025/01/15/python-3-13-performance
   (techcrunch.com)

... and 9 more articles

======================================================================

One search command queries three APIs simultaneously. All sources respond successfully (notice the checkmarks). Fifteen articles became twelve after automatic deduplication. Results display in consistent format whether they came from NewsAPI, Guardian, or HackerNews. The complexity of integrating three different APIs is completely hidden from users.

The Multi-API Challenge

Here's what makes this difficult: these three services represent the same concept (a news article) in completely different ways. NewsAPI wraps articles in an articles array with nested source objects. Guardian buries them in response.results using field names like webTitle and webPublicationDate. HackerNews uses hits with Unix timestamps and minimal metadata. Same business concept. Three incompatible structures.

Without a systematic approach, your codebase becomes a mess of conditionals. Display logic needs branching: "if NewsAPI show publishedAt, if Guardian show webPublicationDate, if HackerNews convert Unix timestamp..." Every operation—storage, sorting, deduplication, filtering—requires source-specific handling. This approach doesn't scale. Adding a fourth source means updating dozens of locations throughout your code.
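
To make the problem concrete, here is a hedged sketch of what source-aware display logic ends up looking like (the function and source labels are hypothetical; the field names are the real ones each API uses):

```python
from datetime import datetime, timezone

def display_title_and_date(raw: dict, source: str) -> str:
    """Anti-pattern: every function that touches article data needs
    a branch per source - and a fourth API means another elif everywhere."""
    if source == "newsapi":
        return f"{raw['title']} ({raw['publishedAt']})"
    elif source == "guardian":
        return f"{raw['webTitle']} ({raw['webPublicationDate']})"
    elif source == "hackernews":
        # HackerNews timestamps arrive as Unix integers
        ts = datetime.fromtimestamp(raw["created_at_i"], tz=timezone.utc)
        return f"{raw['title']} ({ts.isoformat()})"
    raise ValueError(f"Unknown source: {source}")

print(display_title_and_date(
    {"webTitle": "Python news", "webPublicationDate": "2025-01-15T12:45:00Z"},
    "guardian"))
```

The canonical-model approach built in this chapter eliminates these branches entirely.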

Production systems need architecture that isolates API differences. You normalize external formats into one internal representation at the boundary, then everything downstream works with consistent structure. Display code never knows or cares which API provided the data. This separation makes systems maintainable as APIs evolve and requirements change.

Learning Objectives

What You'll Master in This Chapter

By the end of this chapter, you'll be able to:

  • Systematically explore and document multiple API structures before writing extraction code
  • Design canonical models that unify disparate API response formats into consistent internal representations
  • Build source-specific normalizers that apply defensive programming patterns while handling each API's unique quirks
  • Implement aggregation pipelines with graceful degradation that continue operating when individual sources fail
  • Recognize the boundary between crash prevention (defensive programming) and quality enforcement (validation)
  • Apply multi-API integration patterns that scale from two sources to twenty

The Professional Approach to Multi-API Integration

Chapter 10 taught you advanced JSON processing patterns in isolation: safe navigation through nested data, flexible extraction with extract_items_and_meta(), defensive handling of optional fields. This chapter shows you how to apply those patterns to solve a real architectural challenge.

You'll learn the five-step workflow professionals use for multi-API integration:

1. Explore Before Coding

Use diagnostic tools to discover actual API structures. Documentation is often incomplete or outdated. Exploration reveals the real response patterns, nesting depths, field names, and optional data your code must handle.

2. Design Canonical Models

Define one internal format that accommodates all sources. This becomes your application's "internal language." Choose field names once, decide on required versus optional fields once, standardize formats (timestamps, URLs) once.

3. Build Normalizers

Create functions that transform each API's quirks into your canonical format. Apply defensive patterns from Chapter 10: safe navigation, type checking, default values. Each normalizer handles one API's specifics while following a universal pattern.

4. Handle Partial Failures

Build systems that provide value even when components fail. When one API is down, return results from working sources. Track which sources succeeded and failed. Degrade gracefully rather than crashing completely.
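
As a sketch of this pattern (the names here are illustrative, not the chapter's final implementation), the orchestrator isolates each source's failure so one broken API never blocks the others:

```python
def aggregate_with_fallback(fetchers: dict, query: str) -> dict:
    """Query every source; record which succeeded and which failed."""
    articles, succeeded, failed = [], [], []
    for name, fetch in fetchers.items():
        try:
            articles.extend(fetch(query))
            succeeded.append(name)
        except Exception as exc:  # network error, bad JSON, rate limit, etc.
            failed.append(f"{name}: {exc}")
    return {"articles": articles, "succeeded": succeeded, "failed": failed}

def broken_fetch(query):
    raise TimeoutError("Guardian API down")

# One source down, two up - the user still gets results:
fetchers = {
    "newsapi": lambda q: [{"title": f"{q} story"}],
    "guardian": broken_fetch,
    "hackernews": lambda q: [{"title": f"{q} thread"}],
}
result = aggregate_with_fallback(fetchers, "python")
print(result["succeeded"], result["failed"])
```

Returning the succeeded/failed lists alongside the articles is what makes the checkmarks in the chapter-opening demo possible.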

5. Recognize Limitations

Understand where defensive programming stops being sufficient. Your system won't crash, but quality issues can slip through. Empty fields, malformed data, business rule violations—these need validation beyond defensive programming.

This systematic approach works whether you're integrating two APIs or twenty. The architecture applies to payment processors (Stripe + PayPal), analytics platforms (Google Analytics + Mixpanel), authentication providers (GitHub + Google + Microsoft), or any system requiring multiple external data sources. Master it once, apply it everywhere.

Why This Chapter Matters

Production applications rarely depend on a single API. You'll integrate payment processors, authentication providers, analytics services, notification systems, and data sources. Each has its own response format, error patterns, and reliability characteristics. This chapter teaches you the systematic approach that scales from two APIs to twenty.

You'll also discover where defensive programming stops being enough. The aggregator will work—articles display, searches complete—but you'll encounter quality issues that defensive programming can't prevent. This chapter sets up Chapter 12's validation techniques by showing exactly what problems remain when you rely solely on defensive patterns.

What You'll Build

The news aggregator consists of five components, each demonstrating architectural patterns you'll use in production systems:

1. Exploration Tools

Diagnostic utilities that reveal how each API structures responses. Produce comparison tables showing where the same data lives in different APIs. Guide normalizer design with facts, not assumptions.

2. Canonical Article Model

A single Article dataclass representing news articles consistently, regardless of source. Defines required fields (title, URL, timestamp, source) and optional fields (description, author, image). This becomes your application's internal representation—its "common language."

3. Source-Specific Normalizers

Three functions transforming each API's unique response format into the canonical model. Each applies Chapter 10's defensive patterns while handling API-specific quirks: field name mappings, nested structure navigation, timestamp conversions, optional field extraction.

4. Aggregation Pipeline

The orchestrator fetching from all sources independently, handling failures gracefully, deduplicating results, and sorting by recency. Demonstrates production patterns: graceful degradation (system works with 2 out of 3 sources), partial results (some data beats no data), and operational monitoring.

5. Interactive CLI

Command-line interface for searching, viewing results, and exploring articles. Demonstrates how canonical models simplify everything downstream: write formatting code once, works for all sources automatically. No conditional logic based on source.

The complete implementation (approximately 400 lines) appears in Appendix A. The chapter walks through architecture and design decisions, implementing key components while explaining the approach you'd apply to any multi-API integration. You'll understand not just what to build, but why each architectural choice matters.

Chapter Roadmap

This chapter builds a production-ready news aggregator by applying Chapter 10's patterns to real multi-API integration. Here's the journey:

1. Discovery & Design (Sections 2-3 • Foundation)

Explore how three real APIs structure responses using diagnostic tools, then design the canonical Article model that unifies all sources into one clean internal format.

API Exploration • Canonical Model • Comparison Tables
2. Core Implementation (Sections 4-5 • Building The System)

Build source-specific normalizers transforming each API into canonical format, then implement the aggregation pipeline that handles failures gracefully and deduplicates intelligently.

Normalizers • Aggregation Pipeline • Graceful Degradation
3. Polish & Reflection (Sections 6-7 • Completion)

Create the CLI interface that makes canonical models shine, then examine where defensive programming reaches its limits—setting up Chapter 12's validation approach.

CLI Development • Quality Boundaries

Appendix A contains the complete 400-line implementation for reference. The chapter focuses on architecture and design decisions you'll apply to any multi-API integration.

Prerequisites

What You Need

This chapter assumes you've completed Chapter 10 and are comfortable with: exploration tools for discovering API structures, flexible access patterns using extract_items_and_meta() and safe_get(), safe navigation through nested data, and defensive handling of optional fields. You'll apply all these patterns extensively throughout this chapter.
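
As a refresher, a minimal safe_get might look like the following (a simplified sketch; Chapter 10's version may differ in details):

```python
def safe_get(data, *path, default=None):
    """Walk nested dicts/lists, returning default instead of raising
    KeyError, IndexError, or TypeError on any missing step."""
    current = data
    for key in path:
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif (isinstance(current, list) and isinstance(key, int)
              and -len(current) <= key < len(current)):
            current = current[key]
        else:
            return default
    return current

response = {"response": {"results": [{"webTitle": "Python news"}]}}
print(safe_get(response, "response", "results", 0, "webTitle"))   # Python news
print(safe_get(response, "response", "results", 5, "webTitle"))   # None
```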

API Access: You'll need free API keys for NewsAPI (500 requests/day) and The Guardian (5,000 requests/day). HackerNews requires no authentication. Section 2 provides registration URLs and setup instructions.

2. Exploring Multiple API Structures

The Multi-API Challenge

You're building a news aggregator that searches three sources simultaneously: NewsAPI (commercial news outlets), The Guardian (UK newspaper), and HackerNews (tech community). Each returns articles, but the structural differences are substantial. Before writing any extraction code, you need to understand what you're working with.

This section demonstrates the professional workflow: systematic exploration first, code second. You'll use Chapter 10's exploration tools to reveal each API's structure, then document the variations in a comparison table. This documentation becomes your reference when building normalizers in Section 4.

Getting API Access

NewsAPI: Register at newsapi.org for a free API key (500 requests/day)

The Guardian: Register at open-platform.theguardian.com for a free key (5,000 requests/day)

HackerNews: No registration required, use hn.algolia.com/api directly

Store keys in environment variables: NEWSAPI_KEY and GUARDIAN_KEY

The Exploration Tool

In Chapter 10, we built an exploration tool for systematically analyzing API structures. We'll use that tool extensively in this section to discover how NewsAPI, Guardian, and HackerNews format their responses. For convenience, here's the complete implementation so you don't need to flip between chapters:

Python - API Structure Explorer
Python
import requests
import json

def explore_api_structure(url, headers=None, max_depth=2):
    """
    Systematically explore an API response structure.
    Returns the response data for further inspection.
    """
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        data = response.json()
        
        print(f"API Response Analysis")
        print("=" * 60)
        print(f"URL: {url}")
        print(f"Response type: {type(data).__name__}")
        
        if isinstance(data, dict):
            print(f"Top-level keys: {list(data.keys())}")
            print(f"Total keys: {len(data)}")
            
            # Look for common data container patterns
            common_containers = ['results', 'data', 'items', 'articles', 
                               'hits', 'content', 'payload', 'response']
            found_containers = [key for key in common_containers if key in data]
            if found_containers:
                print(f"Possible data containers found: {found_containers}")
            
            # Show structure of first few fields
            print("\nFirst few fields (with truncated values):")
            for i, (key, value) in enumerate(list(data.items())[:5]):
                value_type = type(value).__name__
                if isinstance(value, (dict, list)):
                    size_info = f" (length: {len(value)})" if hasattr(value, '__len__') else ""
                    print(f"  {key}: {value_type}{size_info}")
                else:
                    str_value = str(value)
                    display_value = str_value[:50] + "..." if len(str_value) > 50 else str_value
                    print(f"  {key}: {display_value}")
        
        elif isinstance(data, list):
            print(f"Array with {len(data)} items")
            if data:
                first_item = data[0]
                print(f"First item type: {type(first_item).__name__}")
                if isinstance(first_item, dict):
                    print(f"First item keys: {list(first_item.keys())}")
        
        print("\nSample structure (truncated for readability):")
        print(json.dumps(truncate_for_display(data, max_depth), indent=2))
        print("\n")
        
        return data
        
    except requests.RequestException as e:
        print(f"Error fetching data: {e}")
        return None
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return None

def truncate_for_display(obj, max_depth=2, current_depth=0):
    """Helper function to truncate nested data for readable display."""
    if current_depth >= max_depth:
        return "..."
    
    if isinstance(obj, dict):
        truncated = {}
        for i, (key, value) in enumerate(obj.items()):
            if i >= 5:
                truncated["..."] = f"({len(obj) - 5} more keys)"
                break
            truncated[key] = truncate_for_display(value, max_depth, current_depth + 1)
        return truncated
    
    elif isinstance(obj, list):
        truncated = []
        for item in obj[:3]:
            truncated.append(truncate_for_display(item, max_depth, current_depth + 1))
        if len(obj) > 3:
            truncated.append(f"... ({len(obj) - 3} more items)")
        return truncated
    
    else:
        if isinstance(obj, str) and len(obj) > 50:
            return obj[:50] + "..."
        return obj

This tool performs the tedious work of traversing structures, counting elements, identifying patterns, and presenting manageable summaries. Let's apply it to each news API.

Exploring Each API

Run the exploration tool against all three APIs to discover their structures. Pay attention to where article data lives, what each API calls its fields, and how deeply nested the information is.

Systematic API Exploration
Python
import os

# Your API keys from environment variables
NEWSAPI_KEY = os.environ.get("NEWSAPI_KEY")
GUARDIAN_KEY = os.environ.get("GUARDIAN_KEY")

print("=== EXPLORING THREE NEWS APIS ===\n")

# 1. NewsAPI
print("1. NEWSAPI")
newsapi_url = f"https://newsapi.org/v2/everything?q=python&pageSize=3&apiKey={NEWSAPI_KEY}"
newsapi_data = explore_api_structure(newsapi_url)

# 2. Guardian API  
print("2. GUARDIAN API")
guardian_url = f"https://content.guardianapis.com/search?q=python&page-size=3&show-fields=all&api-key={GUARDIAN_KEY}"
guardian_data = explore_api_structure(guardian_url)

# 3. HackerNews API
print("3. HACKERNEWS API")
hn_url = "https://hn.algolia.com/api/v1/search?query=python&hitsPerPage=3"
hn_data = explore_api_structure(hn_url)
Example Output (Condensed)
=== EXPLORING THREE NEWS APIS ===

1. NEWSAPI
API Response Analysis
============================================================
Response type: dict
Top-level keys: ['status', 'totalResults', 'articles']
Possible data containers found: ['articles']

First few fields:
  status: ok
  totalResults: 12847
  articles: list (length: 3)

Sample structure:
{
  "status": "ok",
  "totalResults": 12847,
  "articles": [
    {
      "source": {"id": "...", "name": "..."},
      "author": "...",
      "title": "...",
      "url": "...",
      "publishedAt": "2025-01-15T10:30:00Z",
      "...": "(5 more keys)"
    },
    "... (2 more items)"
  ]
}

2. GUARDIAN API
API Response Analysis
============================================================
Response type: dict
Top-level keys: ['response']
Possible data containers found: ['response']

First few fields:
  response: dict (length: 7)

Sample structure:
{
  "response": {
    "status": "ok",
    "total": 8453,
    "results": [
      {
        "webTitle": "...",
        "webUrl": "...",
        "webPublicationDate": "2025-01-15T12:45:00Z",
        "fields": {
          "byline": "...",
          "trailText": "...",
          "...": "(4 more keys)"
        },
        "...": "(6 more keys)"
      },
      "... (2 more items)"
    ],
    "...": "(6 more keys)"
  }
}

3. HACKERNEWS API
API Response Analysis
============================================================
Response type: dict
Top-level keys: ['hits', 'nbHits', 'page', 'nbPages', ...]
Possible data containers found: ['hits']

First few fields:
  hits: list (length: 3)
  nbHits: 583201
  page: 0

Sample structure:
{
  "hits": [
    {
      "title": "...",
      "url": "...",
      "author": "username123",
      "created_at": "2025-01-15T14:22:31.000Z",
      "created_at_i": 1736950951,
      "points": 142,
      "...": "(8 more keys)"
    },
    "... (2 more items)"
  ],
  "nbHits": 583201,
  "...": "(10 more keys)"
}

What the Tool Reveals

The exploration immediately shows fundamental differences. NewsAPI uses articles at the root level with nested source objects. Guardian wraps everything in a response object, puts articles in results, and uses unique field names like webTitle. HackerNews uses hits with both ISO strings and Unix timestamps.

Without the exploration tool, you'd be guessing from documentation. With it, you see exactly what each API returns.

The Comparison Table

Now document where the same conceptual data lives in each API. This table becomes your normalizer design reference - it shows exactly what field mappings you'll need.

Structural Comparison: Three APIs, Same Concept

| Concept        | NewsAPI                      | Guardian                            | HackerNews                               |
| Container path | articles                     | response.results                    | hits                                     |
| Title          | title                        | webTitle                            | title                                    |
| URL            | url                          | webUrl                              | url                                      |
| Timestamp      | publishedAt (ISO string)     | webPublicationDate (ISO string)     | created_at (ISO) or created_at_i (Unix)  |
| Source name    | source.name (nested object)  | sectionName (string)                | "HackerNews" (implied)                   |
| Author         | author (string, optional)    | fields.byline (nested, optional)    | author (username string)                 |
| Description    | description (optional)       | fields.trailText (nested, optional) | story_text (often null)                  |
| Image          | urlToImage (optional)        | fields.thumbnail (nested, optional) | None available                           |
| Total results  | totalResults                 | response.total                      | nbHits                                   |

Understanding the Variations

Same conceptual information, completely different access paths. Some differences are cosmetic (field naming), others are structural (nesting depth, object vs string). The comparison table makes these differences explicit so you can design normalizers that handle each API's quirks systematically.

Notice patterns: all three have titles and URLs, but Guardian adds "web" prefixes everywhere. HackerNews provides dual timestamp formats. Guardian buries useful content in an optional fields object. These aren't bugs - they're design decisions reflecting each API's evolution and use cases.

Key Structural Patterns

From the exploration and comparison, three structural patterns emerge that your normalizers must handle:

1. Container Variation

Articles live at different paths: articles (root level) vs response.results (nested) vs hits (root level). Your access layer must locate the container regardless of nesting.

2. Field Naming Drift

Same data, different names: title vs webTitle, publishedAt vs webPublicationDate vs created_at. Your normalizers map these variations to canonical names.

3. Optional Field Inconsistency

What's present in one API is missing in another. NewsAPI provides description consistently; Guardian requires checking for a fields object; HackerNews rarely has descriptions. Your model must accommodate all patterns.
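
Section 4 builds the real normalizers; as a preview, here is a hedged sketch of how a Guardian normalizer might address all three patterns at once (field paths come from the comparison table; the function name and output dict are illustrative):

```python
def normalize_guardian_sketch(response: dict) -> list[dict]:
    # Pattern 1: the container lives at a nested path.
    items = response.get("response", {}).get("results", [])
    normalized = []
    for item in items:
        if not isinstance(item, dict):
            continue
        # Pattern 3: the optional fields object may be absent entirely.
        fields = item.get("fields") or {}
        normalized.append({
            # Pattern 2: map Guardian's names onto canonical ones.
            "title": item.get("webTitle", ""),
            "url": item.get("webUrl", ""),
            "published_at": item.get("webPublicationDate", ""),
            "author": fields.get("byline"),          # None when absent
            "description": fields.get("trailText"),  # None when absent
        })
    return normalized
```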

Section 3 designs the canonical model that unifies these three formats. The comparison table you just built guides every design decision.

Takeaways & Next Step

Systematic API Exploration:

  • Explore before coding: Use diagnostic tools to reveal structure rather than guessing from documentation
  • Document variations: Comparison tables guide normalizer implementation by making differences explicit
  • Identify patterns: Container location, field naming, and optional field inconsistency are the three challenges to solve
  • Map concepts to paths: The same business concept (article) requires different extraction strategies per API

With structures documented, Section 3 designs the canonical article model that will unify these three different formats into a single internal representation your application can trust.

3. Designing the Canonical Model

From Three Formats to One Standard

Section 2 revealed three completely different response structures. Your aggregator needs to work with all three, but you don't want the rest of your application - search logic, display formatting, storage - dealing with this variation. The solution is a canonical model: a single internal representation that all three APIs get normalized into.

Think of it like translation. "Hello," "bonjour," and "hola" all mean the same thing - your canonical model is the "internal language" your application speaks. Once articles are normalized, every part of your codebase sees consistent structure regardless of which API provided the data.

Why Canonical Models Matter
  • Single interface: Display code works with one structure, not three
  • API changes contained: When Guardian changes field names, only the normalizer updates
  • Testing simplified: Test against one format instead of multiple variations
  • Extensions easier: Adding a fourth API means writing one normalizer, not updating all consumers
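
For instance, once normalization is in place, one formatter serves every source (sketched here with a minimal stand-in for the Article model this section defines):

```python
from dataclasses import dataclass

@dataclass
class ArticleStub:
    """Minimal stand-in for the canonical Article model."""
    title: str
    source_name: str
    url: str

def format_line(article) -> str:
    # No branching on origin: every normalized article looks the same.
    return f"{article.title} [{article.source_name}] {article.url}"

# Identical handling whether the data came from NewsAPI, Guardian, or HN:
for a in (ArticleStub("Python 3.13 released", "TechCrunch", "https://example.com/a"),
          ArticleStub("Python at scale", "HackerNews", "https://example.com/b")):
    print(format_line(a))
```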

Identifying Required Fields

Start by identifying what all three APIs provide. These become your required fields - the core data that every article must have to be useful. Looking at Section 2's comparison table:

Universal Article Properties

| Field     | Why Required                                           | Present In                        |
| Title     | The headline - without this, there's nothing to show   | All three APIs (never empty)      |
| URL       | Link to original content - the aggregator's core value | All three APIs (always present)   |
| Timestamp | Publication time - essential for sorting and recency   | All three APIs (formats differ)   |
| Source    | Where it came from - users need to assess credibility  | All three (representation varies) |

These four fields are universal - every normalized article will have them, guaranteed. If any are missing or empty, the article gets rejected during normalization.

One critical decision about the URL field: every Article must have a valid URL. Beyond being useful for users to access original content, URLs serve as our deduplication key. When multiple APIs return the same story (like Python 3.13 release coverage appearing in NewsAPI, Guardian, and HackerNews), the URL uniquely identifies it regardless of varying headlines. Section 5 will show the URL-based deduplication implementation, but this requirement shapes our required fields now.
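
Section 5 implements deduplication in full; the core idea is simple enough to sketch now (the light URL normalization shown here is an assumption, not the chapter's exact rule):

```python
def deduplicate_by_url(articles: list[dict]) -> list[dict]:
    """Keep the first article seen for each URL; drop later duplicates."""
    seen = set()
    unique = []
    for article in articles:
        # Normalize lightly so trailing slashes and case don't defeat matching.
        key = article["url"].strip().rstrip("/").lower()
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique

articles = [
    {"title": "Python 3.13 out", "url": "https://example.com/py313"},
    {"title": "Python 3.13 released!", "url": "https://example.com/py313/"},
]
print(len(deduplicate_by_url(articles)))   # 1
```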

Handling Optional Fields

Next, identify fields that enhance the experience but aren't universal. These become optional properties with defensive defaults. The key is balancing usefulness against availability:

Optional Article Properties

| Field           | Value When Present               | Availability                                                                        |
| Description     | Context without clicking through | NewsAPI: usually present; Guardian: in fields.trailText; HackerNews: rarely present |
| Author          | Credibility and attribution      | NewsAPI: sometimes; Guardian: in fields.byline; HackerNews: username                |
| Image           | Visual presentation              | NewsAPI: urlToImage; Guardian: fields.thumbnail; HackerNews: none                   |
| Content preview | Helps users decide to click      | NewsAPI: sometimes; Guardian: in fields.bodyText; HackerNews: usually absent        |

Use None as the default for absent optional fields. This makes the distinction clear: None means "field doesn't exist," while an empty string means "field exists but is empty." Display code can check if article.author without exception handling.

Format Standardization Decisions

Some fields appear in all APIs but with different formats. Your canonical model needs to pick one standard format and let normalizers handle the conversion:

1. Timestamps → ISO 8601

HackerNews provides Unix integers (1736950951), others use ISO strings (2025-01-15T14:30:00Z). Standard: ISO 8601 strings for consistency and human readability. Normalizers convert Unix timestamps during extraction using Python's datetime.fromtimestamp().isoformat(). You'll see this conversion implemented in Section 4's HackerNews normalizer.

2. Source Names → Simple Strings

NewsAPI nests source in an object (source.name), Guardian uses sectionName, HackerNews is implicit. Standard: single string field. Normalizers extract from wherever it lives in each API.

3. URLs → Full URLs Only

All three provide full URLs, but HackerNews sometimes has items without URLs (Ask HN posts). Standard: full URLs required. Normalizers construct HN links from object IDs when needed.

These standardization decisions cost conversion work in normalizers but gain predictable access patterns everywhere else in your application.
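
The conversion work itself is small. A sketch of the two trickiest pieces (the HackerNews discussion-URL pattern is an assumption based on the site's public item pages):

```python
from datetime import datetime, timezone

def unix_to_iso(ts: int) -> str:
    """Convert HackerNews' created_at_i into a canonical ISO 8601 string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

def hn_fallback_url(object_id: str) -> str:
    """Ask HN posts lack an external URL; link to the HN discussion instead.
    (URL pattern assumed from HackerNews' public item pages.)"""
    return f"https://news.ycombinator.com/item?id={object_id}"

print(unix_to_iso(1736950951))      # 2025-01-15T14:22:31+00:00
print(hn_fallback_url("42690000"))  # https://news.ycombinator.com/item?id=42690000
```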

The Article Dataclass

Now formalize the canonical model using Python's dataclass. This gives you type hints, default values, validation, and a clean structure that documents your design:

Canonical Article Model
Python
from dataclasses import dataclass
from typing import Optional
from datetime import datetime

@dataclass
class Article:
    """
    Canonical article format for all news sources.
    
    Required fields are guaranteed present in all normalized articles.
    Optional fields use None as default when source doesn't provide them.
    """
    # Required fields - every article must have these
    title: str
    url: str
    published_at: str  # ISO 8601 format: "2025-01-15T14:30:00Z"
    source_name: str
    
    # Optional fields - enhance experience but not universal
    description: Optional[str] = None
    author: Optional[str] = None
    image_url: Optional[str] = None
    content_preview: Optional[str] = None
    
    def __post_init__(self):
        """Validate required fields after initialization."""
        if not self.title or not self.title.strip():
            raise ValueError("Article title cannot be empty")
        if not self.url or not self.url.strip():
            raise ValueError("Article URL cannot be empty")
        if not self.source_name or not self.source_name.strip():
            raise ValueError("Article source_name cannot be empty")
    
    def format_timestamp(self) -> str:
        """Convert ISO timestamp to human-readable format."""
        try:
            dt = datetime.fromisoformat(self.published_at.replace('Z', '+00:00'))
            return dt.strftime("%B %d, %Y at %I:%M %p")
        except (ValueError, AttributeError):
            return self.published_at
    
    def get_domain(self) -> str:
        """Extract domain from URL for display."""
        try:
            from urllib.parse import urlparse
            parsed = urlparse(self.url)
            return parsed.netloc or "unknown"
        except Exception:
            return "unknown"
Design Features
  • Validation in __post_init__: Catches empty required fields immediately, preventing invalid articles from entering the system
  • Helper methods: format_timestamp() and get_domain() provide display utilities without cluttering consumer code
  • Type hints: Optional[str] makes optionality explicit for tools and developers
  • Standardized timestamp: Always ISO 8601 string, even though HackerNews provides Unix integers

Testing the Model

Verify the model handles edge cases before building normalizers:

Model Validation Tests
Python
# Test 1: Complete article with all fields
article1 = Article(
    title="Python 3.13 Released",
    url="https://www.python.org/downloads/release/python-3130/",
    published_at="2025-01-15T14:30:00Z",
    source_name="Python.org",
    description="Latest Python release includes performance improvements",
    author="Python Core Team",
    image_url="https://www.python.org/static/img/python-logo.png",
    content_preview="Python 3.13.0 is now available..."
)
print(f"✓ Complete article: {article1.title}")
print(f"  Formatted: {article1.format_timestamp()}")
print(f"  Domain: {article1.get_domain()}")

# Test 2: Minimal article (only required fields)
article2 = Article(
    title="Breaking: Tech News",
    url="https://example.com/article",
    published_at="2025-01-15T10:00:00Z",
    source_name="Example News"
)
print(f"\n✓ Minimal article: {article2.title}")
print(f"  Description: {article2.description}")  # None
print(f"  Author: {article2.author}")  # None

# Test 3: Invalid article (empty title)
try:
    article3 = Article(
        title="",
        url="https://example.com/article",
        published_at="2025-01-15T10:00:00Z",
        source_name="Example"
    )
    print("\n✗ Should have raised ValueError")
except ValueError as e:
    print(f"\n✓ Validation caught error: {e}")
Example Output
✓ Complete article: Python 3.13 Released
  Formatted: January 15, 2025 at 02:30 PM
  Domain: www.python.org

✓ Minimal article: Breaking: Tech News
  Description: None
  Author: None

✓ Validation caught error: Article title cannot be empty

The model works correctly: accepts valid articles with any combination of optional fields, validates required fields, and provides helper methods for display. This is the foundation for Section 4's normalizers.

When NOT to Normalize

Canonical models and normalizers add complexity - abstraction layers, conversion logic, maintenance overhead. Before building them, consider whether simpler approaches suffice for your use case.

Situations Where Normalization Adds Little Value
1. Single API Integration

If you're only using one API and don't anticipate adding others, work directly with its structure. The Guardian API's webTitle and webPublicationDate fields are fine if that's your only source. Normalization's value comes from unifying multiple formats - without that need, it's just extra code.

2. API Structure Matches Your Needs

Sometimes an API's response format is exactly what you need for display or storage. If NewsAPI's article structure works perfectly for your application, don't normalize it into something else just for the sake of abstraction. Use it directly and move on to solving actual problems.

3. Performance-Critical Applications

Normalization adds processing time - extracting fields, converting formats, constructing objects. For high-throughput systems processing thousands of articles per second, this overhead matters. Profile first, but if normalization becomes a bottleneck, consider working with raw responses and accepting some code duplication.

4. Prototype/Proof-of-Concept Stage

Early in development, you're still discovering requirements. Building a canonical model before you understand what fields matter or how APIs will be used is premature optimization. Start with direct API access, identify patterns as they emerge, then normalize when the abstraction provides clear value.

5. API-Specific Features Required

If your application needs API-specific features that don't map to other sources - HackerNews vote counts, Guardian section hierarchies, NewsAPI sentiment scores - normalizing loses information. Either work directly with each API's full structure or design a canonical model that accommodates all unique fields (which often becomes unwieldy).
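A middle ground worth knowing about (a sketch, not this chapter's approach - the ArticleWithExtras name and extras field are illustrative assumptions): keep the canonical fields, but preserve source-specific data in a catch-all dict so normalization doesn't silently discard vote counts or section hierarchies.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ArticleWithExtras:
    """Canonical fields plus a catch-all for source-specific data."""
    title: str
    url: str
    published_at: str
    source_name: str
    description: Optional[str] = None
    # Anything that doesn't map to a canonical field lands here
    extras: dict = field(default_factory=dict)

# HackerNews vote counts survive normalization instead of being dropped
hn_article = ArticleWithExtras(
    title="Python 3.13 is out",
    url="https://news.ycombinator.com/item?id=123",
    published_at="2025-01-15T08:45:00Z",
    source_name="HackerNews",
    extras={"points": 512, "num_comments": 304},
)
print(hn_article.extras.get("points"))
```

The trade-off: consumers can no longer rely on a fixed schema for anything inside extras, which is exactly the unwieldiness the paragraph above warns about.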

The Decision Framework

Normalize when:

  • Integrating 2+ APIs with similar data models
  • Display/storage logic should be API-agnostic
  • APIs might be swapped or added over time
  • Shared business logic applies to all sources

Skip normalization when:

  • Single API with no plans to add others
  • API structure already fits your needs
  • Performance overhead is unacceptable
  • Requirements are still being discovered
  • API-specific features are central to functionality

This chapter demonstrates normalization because it integrates three different news APIs - exactly the scenario where canonical models provide value. The rest of the chapter assumes you've determined that normalization fits your needs and focuses on doing it well.

Takeaways & Next Step

Canonical Model Design:

  • Required fields: Title, URL, timestamp, and source appear in all APIs and are essential to the concept
  • Optional fields explicit: Use Optional[Type] with None defaults for fields not all sources provide
  • Format standardization: Pick one format (ISO timestamps, simple strings) and let normalizers handle conversion
  • Validate early: __post_init__ catches invalid data at construction rather than during use
  • Display utilities: Helper methods prevent duplicate logic in consumers
  • Know when to skip: Normalization adds value for multi-API integration; simpler approaches work for single APIs or prototypes

With the canonical model defined, tested, and its appropriate use cases understood, Section 4 builds the normalizers that transform each API's response into this consistent format. You'll see one complete implementation that demonstrates the pattern, then understand how it extends to the other APIs.

4. Building Source-Specific Normalizers

The Normalizer Pattern

You have a canonical model that defines what an article should look like. You have three APIs that each format articles differently. Normalizers bridge this gap: they're functions that take a raw API response and return a list of Article objects in your canonical format.

Each normalizer handles one API's quirks - field name mappings, nested structures, timestamp conversions, optional field extraction - using Chapter 10's defensive patterns. The normalizers share the same structure and safety approach but customize the extraction logic for each API's specific layout.

Every Normalizer Does Five Things
  1. Extract items: Find the article array using extract_items_and_meta()
  2. Loop defensively: Process each item with try/except, skip bad items rather than crashing
  3. Map fields: Translate source field names to canonical names
  4. Convert formats: Standardize timestamps, clean strings, normalize URLs
  5. Construct Article: Create canonical objects that pass validation
How to Read This Section

This section shows one complete normalizer implementation (NewsAPI) with detailed explanations, then presents the key differences for Guardian and HackerNews. The defensive pattern is identical across all three - only the field mappings change.

If you want to understand the normalizer pattern deeply, read the NewsAPI implementation carefully. If you're primarily interested in seeing how to handle different API structures, skim the Guardian and HackerNews subsections for their unique challenges, then refer to Appendix A for complete implementations.

There's no need to study all three in detail unless you're building something similar - the learning value is in understanding the pattern once, then recognizing how it adapts to structural variations.

NewsAPI Normalizer (Complete Implementation)

NewsAPI has the most straightforward structure, making it the perfect example to demonstrate the complete normalizer pattern. Study this implementation - it's the template you'll apply to Guardian and HackerNews.

Complete NewsAPI Normalizer
Python
from typing import List

def normalize_newsapi(response: dict) -> List[Article]:
    """
    Transform NewsAPI response into canonical Article format.
    
    NewsAPI structure:
    - Articles in 'articles' array at root
    - Source nested as object with 'name' field
    - Timestamp in 'publishedAt' (ISO 8601)
    - Optional: author, description, urlToImage, content
    """
    # Step 1: Extract items using Chapter 10 utility
    items, meta = extract_items_and_meta(response)
    
    articles = []
    
    # Step 2: Loop through items with defensive error handling
    for item in items:
        try:
            # Step 3: Extract and validate required fields
            # NewsAPI sometimes returns explicit nulls - `or ""` guards .strip()
            title = (item.get("title") or "").strip()
            url = (item.get("url") or "").strip()
            published_at = item.get("publishedAt") or ""
            
            # Skip items missing required fields
            if not title or not url or not published_at:
                continue
            
            # Step 4: Extract source name from nested object
            source_name = safe_get(item, "source.name", "Unknown Source")
            
            # Step 5: Extract optional fields with safe defaults
            description = item.get("description")
            if description:
                description = description.strip()
            
            author = item.get("author")
            if author:
                author = author.strip()
            
            image_url = item.get("urlToImage")
            
            # Content preview (truncate if too long)
            content = item.get("content")
            content_preview = None
            if content:
                content_preview = content[:200] + "..." if len(content) > 200 else content
            
            # Step 6: Create canonical article
            article = Article(
                title=title,
                url=url,
                published_at=published_at,
                source_name=source_name,
                description=description,
                author=author,
                image_url=image_url,
                content_preview=content_preview
            )
            
            articles.append(article)
            
        except (ValueError, KeyError, TypeError) as e:
            # Log and skip malformed items - don't crash entire batch
            print(f"Warning: Skipping malformed NewsAPI item: {e}")
            continue
    
    return articles
Key Defensive Patterns
  • Early validation: Check required fields before processing optionals - fail fast per item
  • Continue on failure: Skip individual bad items rather than failing entire batch
  • String cleaning: .strip() removes whitespace that APIs sometimes include
  • Safe nested access: safe_get(item, "source.name") handles missing source object
  • Content truncation: Preview field prevents extremely long text from cluttering display
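The safe nested access bullet relies on safe_get from Chapter 10. In case you're reading this section standalone, here is a minimal sketch assuming dotted-path semantics (the Chapter 10 version may differ in details):

```python
def safe_get(data, path, default=None):
    """Walk a dotted path ("source.name") through nested dicts,
    returning default if any step is missing or not a dict."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

item = {"title": "Example", "source": {"name": "TechCrunch"}}
print(safe_get(item, "source.name", "Unknown Source"))  # TechCrunch
print(safe_get({}, "source.name", "Unknown Source"))    # Unknown Source
```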

This normalizer demonstrates the complete pattern. Now let's test it with real data:

Python
import os
import requests

# Fetch real NewsAPI data (fail loudly on HTTP errors before parsing JSON)
NEWSAPI_KEY = os.environ.get("NEWSAPI_KEY")
resp = requests.get(
    "https://newsapi.org/v2/everything",
    params={"q": "python", "pageSize": 5, "apiKey": NEWSAPI_KEY},
    timeout=10,
)
resp.raise_for_status()
response = resp.json()

# Normalize
articles = normalize_newsapi(response)

print(f"NewsAPI: Normalized {len(articles)} articles\n")
for i, article in enumerate(articles[:3], 1):
    print(f"{i}. {article.title}")
    print(f"   Source: {article.source_name}")
    print(f"   Published: {article.format_timestamp()}")
    if article.author:
        print(f"   By: {article.author}")
    print()
Example Output
NewsAPI: Normalized 5 articles

1. Python 3.13 Performance Improvements
   Source: TechCrunch
   Published: January 15, 2025 at 02:30 PM
   By: Sarah Chen

2. Machine Learning with Python: 2025 Guide
   Source: Medium
   Published: January 14, 2025 at 11:45 AM

3. Django 5.0 Released
   Source: Django Project
   Published: January 13, 2025 at 09:00 AM
   By: Django Team

Guardian Normalizer (Key Differences)

The Guardian normalizer's complete implementation is in Appendix A. Rather than repeating near-identical code with only field names changed, this section covers the specific adaptations Guardian requires. The five-step pattern from NewsAPI applies exactly - if you understood that implementation, you already know the structure, and you'll recognize it when you read the appendix.

This mirrors how you'd work in production: understand the universal pattern once, deeply, then document only the API-specific quirks for each new integration. When you add a fourth news source, you won't rewrite everything - you'll identify its quirks and map them onto the pattern. Guardian's quirks are extra nesting and different field names. Here's how each phase adapts:

Guardian-Specific Handling: Four Phases
1. Field Extraction: Double Nesting & Name Mapping

Guardian wraps everything in a response object and uses "web" prefixes for standard fields. Extract through the extra layer and map to canonical names.

Python
# Guardian wraps everything in 'response'
if not isinstance(response, dict) or "response" not in response:
    return []

response_obj = response["response"]
items, meta = extract_items_and_meta(response_obj, container_hints=["results"])

# Inside the loop - extract Guardian's field names
title = (item.get("webTitle") or "").strip()  # Note: webTitle, not title
url = (item.get("webUrl") or "").strip()      # Note: webUrl, not url
published_at = item.get("webPublicationDate") or ""  # Different name

# Source: use section name as identifier
source_name = item.get("sectionName", "The Guardian")

Key difference: Guardian uses webTitle, webUrl, and webPublicationDate instead of NewsAPI's simple names. Same concept, different labels.

2. Validation Check: Same Logic, Different Fields

The validation structure is identical to NewsAPI - check required fields and skip if missing. Only the variable names change.

Python
# Skip items missing required fields - identical pattern to NewsAPI
if not title or not url or not published_at:
    continue

Key difference: None. The validation logic is identical - this is the invariant part of the pattern.

3. Transformation Logic: Optional Fields Object

Guardian buries optional metadata in a nested fields object. Extract and validate this object before accessing its contents.

Python
# Guardian buries useful content in optional 'fields' object
fields = item.get("fields", {})
if not isinstance(fields, dict):
    fields = {}

# Extract optional fields from nested object
description = fields.get("trailText")
if description:
    description = description.strip()

author = fields.get("byline")
if author:
    author = author.strip()

image_url = fields.get("thumbnail")

# Content preview from body text
body = fields.get("bodyText")
content_preview = None
if body:
    content_preview = body[:200] + "..." if len(body) > 200 else body

Key difference: The extra fields object check adds one defensive layer. The rest (strip whitespace, truncate content) matches NewsAPI exactly.

4. Article Construction: Identical Dataclass Call

Once fields are extracted and transformed, construction is identical to NewsAPI. Same canonical model, same field names.

Python
# Create canonical article - identical to NewsAPI
article = Article(
    title=title,
    url=url,
    published_at=published_at,
    source_name=source_name,
    description=description,
    author=author,
    image_url=image_url,
    content_preview=content_preview
)

articles.append(article)

Key difference: None. Once normalized to canonical field names, construction is identical across all sources.

Pattern Recognition

Compare Guardian's phases to NewsAPI's: (1) extraction logic changed (different field names, extra nesting), (2) validation logic stayed identical, (3) transformation logic changed (different optional field locations), (4) construction logic stayed identical. This is exactly how professional normalizers work - customize what varies, reuse what doesn't.

The structure remains identical to NewsAPI (extract items → loop → validate → map → construct), but the field access logic adapts to Guardian's schema. See the appendix for the complete implementation.

HackerNews Normalizer (Key Differences)

HackerNews presents different challenges from both NewsAPI and Guardian: minimal metadata, Unix timestamps requiring conversion, and missing URLs for some post types. The five-step pattern remains unchanged; only the field access logic differs. Here's how each phase adapts:

HackerNews-Specific Handling: Four Phases
1. Field Extraction: URL Fallback Logic

HackerNews items sometimes lack URLs (Ask HN, Show HN posts). When missing, construct the URL from the object ID. This is the only API requiring URL construction.

Python
# Extract items - standard pattern
items, meta = extract_items_and_meta(response, container_hints=["hits"])

# Inside the loop - HackerNews uses simple field names
title = (item.get("title") or "").strip()
url = (item.get("url") or "").strip()  # 'url' key may be present but null

# HackerNews items without URLs are Ask HN or Show HN posts
# Construct URL from objectID when missing
if not url:
    object_id = item.get("objectID", "")
    if object_id:
        url = f"https://news.ycombinator.com/item?id={object_id}"
    # If still no URL after construction, we'll catch it in validation

# Source is always constant for HackerNews
source_name = "HackerNews"

Key difference: URL construction fallback is unique to HackerNews. NewsAPI and Guardian always provide URLs directly.

2. Validation Check: Same Logic After Extraction

After URL construction, validation uses the same pattern. The check runs once all required fields are populated - after the URL fallback above and the timestamp conversion shown in phase 3.

Python
# Skip items missing required fields - identical pattern
if not title or not url or not published_at:
    continue

Key difference: None. The validation logic remains identical across all three normalizers.

3. Transformation Logic: Unix Timestamp Conversion

HackerNews provides Unix timestamps that must be converted to ISO 8601. This is the only API requiring timestamp transformation.

Python
# Convert Unix integer to ISO 8601 string
# (requires: from datetime import datetime, timezone)
created_at_unix = item.get("created_at_i")
if created_at_unix:
    try:
        dt = datetime.fromtimestamp(created_at_unix, tz=timezone.utc)
        published_at = dt.isoformat().replace("+00:00", "Z")
    except (ValueError, OSError):
        # Fallback to string timestamp if available
        published_at = item.get("created_at", "")
else:
    published_at = item.get("created_at", "")

# HackerNews has minimal metadata - most fields are sparse
description = item.get("story_text")
if description:
    description = description.strip()

author = item.get("author")  # Username string
if author:
    author = author.strip()

image_url = None  # No images in HackerNews
content_preview = None

Key difference: Unix timestamp conversion is unique to HackerNews. Also note the minimal metadata - most optional fields are None.
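The conversion above is easy to exercise in isolation. This snippet round-trips an illustrative Unix timestamp through the same fromtimestamp call used in the normalizer:

```python
from datetime import datetime, timezone

def unix_to_iso(unix_seconds):
    """Convert a Unix timestamp to an ISO 8601 UTC string with 'Z' suffix."""
    dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return dt.isoformat().replace("+00:00", "Z")

# 2025-01-15 14:50:00 UTC expressed as a Unix timestamp
ts = int(datetime(2025, 1, 15, 14, 50, tzinfo=timezone.utc).timestamp())
print(ts)              # 1736952600
print(unix_to_iso(ts)) # 2025-01-15T14:50:00Z
```

Passing tz=timezone.utc matters: without it, fromtimestamp uses the machine's local timezone and the same integer produces different strings on different servers.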

4. Article Construction: Identical Dataclass Call

Despite the Unix timestamps and URL construction, the final construction call is identical once fields are normalized.

Python
# Create canonical article - identical to NewsAPI and Guardian
article = Article(
    title=title,
    url=url,
    published_at=published_at,  # Now in ISO 8601 format
    source_name=source_name,
    description=description,
    author=author,
    image_url=image_url,
    content_preview=content_preview
)

articles.append(article)

Key difference: None. The construction call is byte-for-byte identical to NewsAPI and Guardian.

Pattern Recognition Across All Three

Compare all three normalizers side-by-side: Phase 2 (validation) and Phase 4 (construction) are identical. Only Phase 1 (extraction) and Phase 3 (transformation) change per API. This is the power of canonical models - most of your code stays constant, only the API-specific mapping logic varies.

NewsAPI → simple fields, nested source object
Guardian → "web" prefixes, extra fields object, double nesting
HackerNews → URL construction, Unix timestamps, sparse metadata

Same pattern, different quirks. When you add a fourth API, you'll follow this exact same four-phase approach.
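One way to see "customize what varies, reuse what doesn't" is to make the varying part data. This is a sketch, not the chapter's implementation - the FIELD_MAPS structure and normalize_generic helper are hypothetical - but it shows how a fourth API could become just another table entry rather than a fourth function:

```python
# Per-source field maps: only the varying part of each normalizer.
# Keys are canonical names, values are that source's field names.
FIELD_MAPS = {
    "NewsAPI":  {"title": "title",    "url": "url",    "published_at": "publishedAt"},
    "Guardian": {"title": "webTitle", "url": "webUrl", "published_at": "webPublicationDate"},
}

def normalize_generic(items, source_name):
    """Shared skeleton: map names, validate required fields, skip bad items."""
    field_map = FIELD_MAPS[source_name]
    normalized = []
    for item in items:
        record = {canon: (item.get(src) or "").strip()
                  for canon, src in field_map.items()}
        # Identical validation phase across all sources
        if not all(record.values()):
            continue
        record["source_name"] = source_name
        normalized.append(record)
    return normalized

guardian_items = [
    {"webTitle": "Python in schools", "webUrl": "https://example.com/a",
     "webPublicationDate": "2025-01-14T15:15:00Z"},
    {"webTitle": ""},  # missing required fields -> skipped
]
print(normalize_generic(guardian_items, "Guardian"))
```

This approach handles name mapping but not format conversion (Unix timestamps) or structural quirks (Guardian's fields object), which is why the chapter keeps one function per source.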

The complete HackerNews implementation - the same defensive structure (extract → loop → validate → map → construct), with the HackerNews-specific mapping logic shown above - is in the appendix.

Testing All Three Normalizers

Verify that all three normalizers produce articles in the same canonical format:

Unified Output Validation
Python
def test_all_normalizers():
    """Test all three normalizers produce canonical output."""
    print("=== TESTING ALL NORMALIZERS ===\n")
    
    # NewsAPI
    newsapi_response = requests.get(
        f"https://newsapi.org/v2/everything?q=python&pageSize=2&apiKey={NEWSAPI_KEY}",
        timeout=10
    ).json()
    newsapi_articles = normalize_newsapi(newsapi_response)
    print(f"✓ NewsAPI: {len(newsapi_articles)} articles normalized")
    
    # Guardian
    guardian_response = requests.get(
        f"https://content.guardianapis.com/search?q=python&page-size=2&show-fields=all&api-key={GUARDIAN_KEY}",
        timeout=10
    ).json()
    guardian_articles = normalize_guardian(guardian_response)
    print(f"✓ Guardian: {len(guardian_articles)} articles normalized")
    
    # HackerNews
    hn_response = requests.get(
        "https://hn.algolia.com/api/v1/search?query=python&hitsPerPage=2",
        timeout=10
    ).json()
    hn_articles = normalize_hackernews(hn_response)
    print(f"✓ HackerNews: {len(hn_articles)} articles normalized")
    
    # Verify all produce Article instances with required fields
    all_articles = newsapi_articles + guardian_articles + hn_articles
    print(f"\nTotal: {len(all_articles)} articles")
    print(f"All are Article instances: {all(isinstance(a, Article) for a in all_articles)}")
    print(f"All have required fields: {all(a.title and a.url and a.published_at and a.source_name for a in all_articles)}")

test_all_normalizers()
Example Output
=== TESTING ALL NORMALIZERS ===

✓ NewsAPI: 2 articles normalized
✓ Guardian: 2 articles normalized
✓ HackerNews: 2 articles normalized

Total: 6 articles
All are Article instances: True
All have required fields: True

Perfect. All three normalizers successfully transform their respective API formats into identical canonical structures. Every article has the required fields, uses the same Article class, and is ready for the aggregation pipeline in Section 5.

The Normalizer Pattern Summary

All three normalizers share the same defensive approach while customizing field access:

| Aspect           | NewsAPI              | Guardian                 | HackerNews                        |
|------------------|----------------------|--------------------------|-----------------------------------|
| Container        | articles             | response.results         | hits                              |
| Title field      | title                | webTitle                 | title                             |
| Timestamp        | publishedAt (ISO)    | webPublicationDate (ISO) | created_at_i → ISO conversion     |
| Source           | source.name (nested) | sectionName              | Constant: "HackerNews"            |
| Special handling | Nested source object | Optional fields object   | URL construction, Unix timestamps |

Error handling is identical for all three: try/except per item, continue on failure.
Pattern Consistency

Despite different field mappings, all three normalizers use identical defensive techniques: early validation of required fields, continue on item failures, string cleaning, safe extraction of optional fields, and consistent None defaults. This isn't accidental - it's the result of applying Chapter 10's patterns systematically.

Takeaways & Next Step

Source-Specific Normalization:

  • Consistent pattern: All normalizers extract, validate, map, and construct using the same defensive approach
  • Customized mappings: Each normalizer knows its API's field names and nesting structure
  • Format conversions: Timestamp standardization, string cleaning, URL construction handled internally
  • Defensive throughout: Chapter 10 patterns prevent crashes from missing fields or type errors
  • Validated output: Article dataclass ensures only valid canonical objects leave normalizers

With normalizers built and tested, Section 5 describes the aggregation pipeline that coordinates multiple sources, handles partial failures, and produces unified search results. Section 6 builds the display layer, and Section 7 examines limitations that motivate Chapter 12's validation.

5. The Aggregation Pipeline

Orchestrating Multiple Sources

You have three normalizers that transform API responses into canonical articles. Now build the aggregator that orchestrates everything: fetching from multiple sources, handling failures gracefully when one API is down, combining results, and presenting unified output.

The aggregator demonstrates a critical production pattern: graceful degradation. When one news source fails - maybe NewsAPI is over its rate limit, or Guardian's API is temporarily unavailable - the aggregator continues with the sources that work. Users get partial results rather than complete failure. This resilience separates hobby projects from professional applications.

The Pipeline Architecture

The aggregation pipeline coordinates multiple sources with independent failure handling. When one source fails, the others continue:

    [Search Query: "python programming"]
                    |
       ┌────────────┼────────────┐
       ↓            ↓            ↓
  [NewsAPI]     [Guardian]  [HackerNews]
       |            |            |
       ↓            ↓            ↓
 [Normalize]   [Normalize]  [Normalize]
       |            |            |
       ↓            ↓            ↓
  5 articles    ✗ Failed    3 articles
       |                         |
       └────────────┬────────────┘
                    ↓
          [Combine: 8 articles]
                    ↓
        [Deduplicate: 6 articles]
                    ↓
            [Sort by recency]
                    ↓
            [Display Results]

This graceful degradation is what separates production systems from prototypes. Users get results even when individual components fail. The system tracks what succeeded and what failed for diagnostics.
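The diagram's deduplicate step can be sketched as keeping the first article seen per normalized URL. The dedupe_by_url helper below is hypothetical (the chapter's actual strategy may also compare titles), shown on plain dicts to stay self-contained:

```python
from urllib.parse import urlsplit

def dedupe_by_url(articles):
    """Keep the first article per normalized URL: lowercase host,
    scheme and query ignored, trailing slash stripped."""
    seen = set()
    unique = []
    for article in articles:
        parts = urlsplit(article["url"])
        key = parts.netloc.lower() + parts.path.rstrip("/")
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique

articles = [
    {"title": "Python 3.13 is out",
     "url": "https://www.python.org/downloads/release/python-3130/"},
    {"title": "Python 3.13 Released",  # same page, http + no slash
     "url": "http://www.python.org/downloads/release/python-3130"},
    {"title": "Django 5.0 Released",
     "url": "https://www.djangoproject.com/weblog/2025/"},
]
print(len(dedupe_by_url(articles)))  # 2
```

Normalizing before comparing matters: the same python.org release page arrives from NewsAPI and HackerNews with cosmetic URL differences, so exact string matching would miss the duplicate.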

The NewsAggregator Class

The aggregator encapsulates all coordination logic in a single class. Each source gets a fetch method, a normalizer reference, and independent error handling:

Core Aggregator Structure
Python
import os
import requests
from typing import List, Tuple, Dict

class NewsAggregator:
    """
    Aggregates news from multiple sources with graceful failure handling.
    Each source can fail independently without affecting others.
    """
    
    def __init__(self):
        """Initialize with API credentials from environment."""
        self.newsapi_key = os.environ.get("NEWSAPI_KEY")
        self.guardian_key = os.environ.get("GUARDIAN_KEY")
    
    def fetch_newsapi(self, query: str, max_results: int = 10) -> dict:
        """Fetch articles from NewsAPI."""
        if not self.newsapi_key:
            raise ValueError("NewsAPI key not configured")
        
        url = "https://newsapi.org/v2/everything"
        params = {
            "q": query,
            "pageSize": max_results,
            "apiKey": self.newsapi_key,
            "language": "en",
            "sortBy": "publishedAt"
        }
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        return response.json()
    
    def fetch_guardian(self, query: str, max_results: int = 10) -> dict:
        """Fetch articles from Guardian API."""
        if not self.guardian_key:
            raise ValueError("Guardian API key not configured")
        
        url = "https://content.guardianapis.com/search"
        params = {
            "q": query,
            "page-size": max_results,
            "show-fields": "all",
            "api-key": self.guardian_key,
            "order-by": "newest"
        }
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        return response.json()
    
    def fetch_hackernews(self, query: str, max_results: int = 10) -> dict:
        """Fetch articles from HackerNews via Algolia API."""
        url = "https://hn.algolia.com/api/v1/search"
        params = {
            "query": query,
            "hitsPerPage": max_results,
            "tags": "story"
        }
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        return response.json()
    
    def search(self, query: str, max_per_source: int = 10) -> Tuple[List[Article], Dict]:
        """
        Search all available sources and return combined results.
        
        Returns:
            (articles, stats) where stats contains success/failure info
        """
        all_articles = []
        stats = {
            "query": query,
            "sources_attempted": [],
            "sources_succeeded": [],
            "sources_failed": [],
            "total_articles": 0
        }
        
        # Define sources with their fetch and normalize functions
        sources = [
            ("NewsAPI", self.fetch_newsapi, normalize_newsapi),
            ("Guardian", self.fetch_guardian, normalize_guardian),
            ("HackerNews", self.fetch_hackernews, normalize_hackernews)
        ]
        
        # Try each source independently
        for source_name, fetch_fn, normalize_fn in sources:
            stats["sources_attempted"].append(source_name)
            
            try:
                # Fetch and normalize
                response = fetch_fn(query, max_per_source)
                articles = normalize_fn(response)
                
                # Record success
                all_articles.extend(articles)
                stats["sources_succeeded"].append(source_name)
                print(f"✓ {source_name}: {len(articles)} articles")
                
            except ValueError as e:
                # Configuration error (missing API key)
                stats["sources_failed"].append(f"{source_name} (not configured)")
                print(f"⊘ {source_name}: {e}")
                
            except requests.RequestException as e:
                # Network or API error
                stats["sources_failed"].append(f"{source_name} (network error)")
                print(f"✗ {source_name}: {e}")
                
            except Exception as e:
                # Unexpected error
                stats["sources_failed"].append(f"{source_name} (error: {type(e).__name__})")
                print(f"✗ {source_name}: Unexpected error - {e}")
        
        stats["total_articles"] = len(all_articles)
        return all_articles, stats
Graceful Degradation Pattern
  • Independent failures: Each source wrapped in try/except; one failure doesn't stop others
  • Error categorization: Configuration errors vs network errors vs unexpected errors
  • Success tracking: Stats dict records what worked and what didn't for diagnostics
  • Partial results: Returns whatever articles were successfully fetched

Seeing It Work

Test the aggregator with a real search query to see all three sources working together:

Complete Aggregation Example
Python
# Create aggregator and search
aggregator = NewsAggregator()

print("=== NEWS AGGREGATOR TEST ===\n")
print("Searching for 'python programming'...\n")

articles, stats = aggregator.search("python programming", max_per_source=5)

# Display statistics
print(f"\n--- Search Statistics ---")
print(f"Sources attempted: {len(stats['sources_attempted'])}")
print(f"Sources succeeded: {len(stats['sources_succeeded'])}")
print(f"Sources failed: {len(stats['sources_failed'])}")
if stats['sources_failed']:
    print(f"Failed sources: {', '.join(stats['sources_failed'])}")
print(f"Total articles: {stats['total_articles']}")

# Display sample results
print(f"\n--- Sample Results ---")
for i, article in enumerate(articles[:3], 1):
    print(f"\n{i}. {article.title}")
    print(f"   📰 {article.source_name} • {article.format_timestamp()}")
    if article.author:
        print(f"   ✍️  {article.author}")
    print(f"   🔗 {article.url}")
Example Output
=== NEWS AGGREGATOR TEST ===

Searching for 'python programming'...

✓ NewsAPI: 5 articles
✓ Guardian: 5 articles
✓ HackerNews: 5 articles

--- Search Statistics ---
Sources attempted: 3
Sources succeeded: 3
Sources failed: 0
Total articles: 15

--- Sample Results ---

1. Python 3.13 Performance Improvements
   📰 TechCrunch • January 15, 2025 at 02:30 PM
   ✍️  Sarah Chen
   🔗 https://techcrunch.com/2025/01/15/python-3-13-performance

2. Python programming language gains popularity in schools
   📰 Technology • January 14, 2025 at 03:15 PM
   ✍️  Alex Hern
   🔗 https://www.theguardian.com/technology/2025/jan/14/python-schools

3. Python 3.13 is out
   📰 HackerNews • January 15, 2025 at 08:45 AM
   ✍️  throwaway2025
   🔗 https://www.python.org/downloads/release/python-3130/

The aggregator successfully coordinated all three sources. Each source was fetched, normalized, and merged into a unified list. Articles from different sources appear together, all using the same canonical format.

Testing Partial Failures

Graceful degradation is difficult to test in production. You can't intentionally break live APIs. But you can simulate failures in development to verify your error handling works correctly.

The following test demonstrates the aggregator continuing to operate when one source fails. Note: This test assumes you've implemented the complete aggregation pipeline shown in Appendix A (or your own version following this section's patterns). If you're reading linearly and haven't built the aggregator yet, bookmark this test and return after implementing the pipeline.

Resilience Under Partial Failure
Python
# Simulate Guardian API failure by using invalid key
original_key = aggregator.guardian_key
aggregator.guardian_key = "invalid_key_for_testing"

print("=== PARTIAL FAILURE TEST ===\n")
print("Simulating Guardian API failure...\n")

articles, stats = aggregator.search("python", max_per_source=3)

print(f"\n--- Results with One Source Down ---")
print(f"Sources succeeded: {stats['sources_succeeded']}")
print(f"Sources failed: {stats['sources_failed']}")
print(f"Total articles still retrieved: {stats['total_articles']}")

# Restore original key
aggregator.guardian_key = original_key
Example Output
=== PARTIAL FAILURE TEST ===

Simulating Guardian API failure...

✓ NewsAPI: 3 articles
✗ Guardian: 401 Client Error: Unauthorized
✓ HackerNews: 3 articles

--- Results with One Source Down ---
Sources succeeded: ['NewsAPI', 'HackerNews']
Sources failed: ['Guardian (HTTP error)']
Total articles still retrieved: 6

Even with Guardian failing, the aggregator returned 6 articles from the two working sources. Users get results instead of an error message. The stats clearly show which source failed so developers can investigate. This is graceful degradation in action.

Operational Monitoring: Using Aggregation Statistics

Your aggregator tracks success and failure statistics for each source, but these metrics are only useful if you know how to interpret them and take action. In production systems, monitoring aggregation health helps you detect API issues, rate limiting, authentication problems, and service degradation before users complain.

This section shows you how to use the statistics you're already collecting to maintain a reliable multi-API system.

Statistics Structure & Collection

First, ensure your aggregation pipeline collects comprehensive statistics per source:

Python
from typing import List

import requests

def search_all_sources(query: str) -> tuple[List[Article], dict]:
    """
    Search all sources and collect detailed statistics for monitoring.
    """
    all_articles = []
    stats = {
        "query": query,
        "total_articles": 0,
        "unique_articles": 0,
        "duplicates_removed": 0,
        "sources": []
    }
    
    # Fetch from each source with individual error handling
    sources = [
        ("NewsAPI", fetch_newsapi, query),
        ("Guardian", fetch_guardian, query),
        ("HackerNews", fetch_hackernews, query)
    ]
    
    for source_name, fetch_func, search_query in sources:
        try:
            articles = fetch_func(search_query)
            all_articles.extend(articles)
            
            # Record success
            stats["sources"].append({
                "name": source_name,
                "status": "success",
                "articles_found": len(articles),
                "error": None
            })
            
        except requests.Timeout:
            # Network timeout - API is slow or unreachable
            stats["sources"].append({
                "name": source_name,
                "status": "timeout",
                "articles_found": 0,
                "error": "Request timed out"
            })
            
        except requests.HTTPError as e:
            # HTTP error - could be rate limiting, auth failure, or API error
            # Compare against None explicitly: a Response with a 4xx/5xx status
            # is falsy, so "if e.response" would wrongly report "unknown"
            status_code = e.response.status_code if e.response is not None else "unknown"
            stats["sources"].append({
                "name": source_name,
                "status": "http_error",
                "articles_found": 0,
                "error": f"HTTP {status_code}: {str(e)}"
            })
            
        except Exception as e:
            # Unexpected error - normalization failure, parsing error, etc.
            stats["sources"].append({
                "name": source_name,
                "status": "error",
                "articles_found": 0,
                "error": str(e)
            })
    
    # Deduplication
    stats["total_articles"] = len(all_articles)
    unique_articles = deduplicate_articles(all_articles)
    stats["unique_articles"] = len(unique_articles)
    stats["duplicates_removed"] = stats["total_articles"] - stats["unique_articles"]
    
    # Sort by recency
    unique_articles.sort(key=lambda a: a.published_at, reverse=True)
    
    return unique_articles, stats

This structure gives you everything needed for operational monitoring: which sources succeeded or failed, how many articles each returned, and what specific errors occurred.

Interpreting Aggregation Statistics

Now that you're collecting statistics, you need to recognize patterns that indicate problems. Here are the key patterns to watch for:

1. Pattern: Consistent Failures from One Source

If Guardian consistently fails while NewsAPI and HackerNews succeed, you have a Guardian-specific problem.

Example Statistics
{
  "sources": [
    {"name": "NewsAPI", "status": "success", "articles_found": 5},
    {"name": "Guardian", "status": "http_error", "error": "HTTP 401: Unauthorized"},
    {"name": "HackerNews", "status": "success", "articles_found": 5}
  ]
}

Diagnosis: HTTP 401 means authentication failure. Your Guardian API key is invalid, expired, or not being sent correctly.

Action:

  • Verify your GUARDIAN_KEY environment variable is set
  • Check that the key hasn't expired (Guardian keys are valid indefinitely but can be revoked)
  • Test the key directly: curl "https://content.guardianapis.com/search?api-key=YOUR_KEY"
  • If expired, register for a new key at open-platform.theguardian.com
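The first two checks can be automated. This sketch (the function name is illustrative) verifies that a key environment variable is set before the aggregator ever makes a request:

```python
import os

def check_api_key(var_name: str) -> bool:
    """Report whether an API key environment variable is set and non-empty."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        print(f"✗ {var_name} is not set - the client will fail authentication")
        return False
    print(f"✓ {var_name} loaded ({len(key)} characters)")
    return True
```

Calling check_api_key("GUARDIAN_KEY") at startup turns a confusing mid-search 401 into an immediate, actionable message.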
2. Pattern: Rate Limiting (HTTP 429)

If you see HTTP 429 errors, you've exceeded the API's request limits.

Example Statistics
{
  "sources": [
    {"name": "NewsAPI", "status": "http_error", "error": "HTTP 429: Too Many Requests"},
    {"name": "Guardian", "status": "success", "articles_found": 5},
    {"name": "HackerNews", "status": "success", "articles_found": 5}
  ]
}

Diagnosis: You've exceeded NewsAPI's rate limit (500 requests/day on the free tier).

Action:

  • Implement request caching to avoid re-fetching the same queries
  • Add rate limiting to your client (e.g., max 1 request per minute during testing)
  • Consider upgrading to a paid tier if you need higher limits
  • For development, cache responses to disk and work with cached data
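The caching suggestion above can be sketched in a few lines. The directory name, TTL, and function names here are illustrative choices, and `fetch` stands for any callable that retrieves and returns parsed JSON:

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path("api_cache")   # assumed local cache directory
CACHE_TTL = 3600                # seconds before a cached entry is considered stale

def cached_fetch(url: str, fetch) -> dict:
    """Return cached JSON for this URL if fresh; otherwise call fetch(url) and cache it."""
    CACHE_DIR.mkdir(exist_ok=True)
    # Hash the URL so the filename is filesystem-safe regardless of query strings
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists() and time.time() - path.stat().st_mtime < CACHE_TTL:
        return json.loads(path.read_text())
    data = fetch(url)
    path.write_text(json.dumps(data))
    return data
```

During development, repeated searches for the same query then hit disk instead of spending rate-limited requests.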
3. Pattern: Timeouts from One Source

Consistent timeouts suggest network issues or an overloaded API server.

Example Statistics
{
  "sources": [
    {"name": "NewsAPI", "status": "success", "articles_found": 5},
    {"name": "Guardian", "status": "timeout", "error": "Request timed out"},
    {"name": "HackerNews", "status": "success", "articles_found": 5}
  ]
}

Diagnosis: Guardian's API is slow or unreachable. Could be temporary server issues, network problems, or your timeout is too aggressive.

Action:

  • Check Guardian's status page for reported outages
  • Increase timeout from 10 seconds to 30 seconds and retry
  • If persistent, implement retry logic with exponential backoff
  • Consider the aggregator still provides value with 2 out of 3 sources
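Retry with exponential backoff can be sketched generically - `fetch` here is any zero-argument callable, and the attempt count and delays are illustrative defaults:

```python
import random
import time

def fetch_with_retry(fetch, attempts: int = 3, base_delay: float = 1.0):
    """Call fetch(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries - let the caller's error handling take over
            # Delays grow 1s, 2s, 4s... with jitter so clients don't retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

For example, fetch_with_retry(lambda: fetch_guardian(query)) wraps one source's fetch without changing its signature.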
4. Pattern: Success But Zero Articles

API responds successfully but returns no results. Could indicate a query formatting issue or an empty result set.

Example Statistics
{
  "sources": [
    {"name": "NewsAPI", "status": "success", "articles_found": 5},
    {"name": "Guardian", "status": "success", "articles_found": 0},
    {"name": "HackerNews", "status": "success", "articles_found": 5}
  ]
}

Diagnosis: Guardian returned successfully but found no articles. This might be legitimate (no matches) or indicate a normalization problem where all articles are being filtered out.

Action:

  • Try a more common query (e.g., "politics") to test if it's a query issue
  • Check normalizer logs for skipped articles due to missing required fields
  • Inspect raw API response to verify articles are present before normalization
  • Verify your normalizer's field mapping matches Guardian's current API structure
Automated Health Monitoring

Rather than manually inspecting statistics after each search, implement automated health checks that flag problems:

Python
def check_aggregator_health(stats: dict) -> List[str]:
    """
    Analyze aggregation statistics and return list of issues found.
    
    Returns:
        List of warning messages. Empty list means all sources healthy.
    """
    warnings = []
    
    # Check each source
    for source in stats["sources"]:
        source_name = source["name"]
        
        # Critical: Authentication failures
        if source["status"] == "http_error" and "401" in source.get("error", ""):
            warnings.append(
                f"⚠️  {source_name}: Authentication failed. Check API key."
            )
        
        # Critical: Rate limiting
        elif source["status"] == "http_error" and "429" in source.get("error", ""):
            warnings.append(
                f"⚠️  {source_name}: Rate limited. Reduce request frequency."
            )
        
        # Warning: Timeouts
        elif source["status"] == "timeout":
            warnings.append(
                f"⚠️  {source_name}: Request timed out. API may be slow or down."
            )
        
        # Warning: Unexpected errors
        elif source["status"] == "error":
            warnings.append(
                f"⚠️  {source_name}: Unexpected error: {source['error']}"
            )
        
        # Info: Success but no results (might be legitimate)
        elif source["status"] == "success" and source["articles_found"] == 0:
            warnings.append(
                f"ℹ️  {source_name}: No articles found. Query may be too specific."
            )
    
    # Check overall health
    successful_sources = [s for s in stats["sources"] if s["status"] == "success"]
    
    if len(successful_sources) == 0:
        warnings.append("🚨 CRITICAL: All sources failed. Check network connectivity.")
    elif len(successful_sources) < len(stats["sources"]):
        working = [s["name"] for s in successful_sources]
        warnings.append(f"⚠️  Partial failure. Working sources: {', '.join(working)}")
    
    return warnings

Integrate health checks into your CLI to display warnings automatically:

Python
def search_command(query: str):
    """Execute search and display results with health monitoring."""
    print(f"\nSearching for '{query}'...\n")
    
    # Perform aggregation
    articles, stats = search_all_sources(query)
    
    # Check health and display warnings
    warnings = check_aggregator_health(stats)
    if warnings:
        print("HEALTH WARNINGS:")
        for warning in warnings:
            print(f"  {warning}")
        print()
    
    # Display results
    if articles:
        display_article_list(articles, stats)
    else:
        print("No articles found. Try a different query or check source health.")
        print(f"\nSource Status:")
        for source in stats["sources"]:
            status_symbol = "✓" if source["status"] == "success" else "✗"
            print(f"  {status_symbol} {source['name']}: {source['status']}")
Example Output With Issues
Searching for 'python programming'...

HEALTH WARNINGS:
  ⚠️  Guardian: Authentication failed. Check API key.
  ⚠️  Partial failure. Working sources: NewsAPI, HackerNews

======================================================================
📰 NEWS AGGREGATOR RESULTS
======================================================================

Query: 'python programming'
Sources: NewsAPI, HackerNews
Articles: 8 (2 duplicates removed)

1. Python 3.13 Performance Improvements
   📰 TechCrunch • January 15, 2025 at 02:30 PM
   ...
Production Monitoring Practices

In production systems, you'd extend this monitoring to:

  • Log statistics: Write stats to a file or logging service for historical analysis
  • Track trends: Monitor failure rates over time (e.g., "Guardian has failed 15 times in the last hour")
  • Alert on thresholds: Send notifications when failure rate exceeds acceptable levels
  • Dashboard visualization: Display source health, success rates, and response times in real-time

For this learning project, the CLI warnings provide immediate feedback. The pattern scales to sophisticated monitoring infrastructure as your applications grow.

Decision Framework: When to Act

Not every failure requires immediate action. Use this framework to decide when to investigate versus when to accept graceful degradation:

| Situation | Urgency | Action |
| --- | --- | --- |
| 1 of 3 sources failing intermittently | Low | Monitor for pattern. Investigate if persists >1 hour. |
| 1 of 3 sources consistently failing | Medium | Check API key, rate limits, and API status page. Fix within 24 hours. |
| 2 of 3 sources failing | High | Investigate immediately. Aggregator value is severely degraded. |
| All sources failing | Critical | Check network connectivity, DNS resolution, and firewall rules immediately. |
| Rate limiting (HTTP 429) | Medium | Implement caching and rate limiting. Upgrade API tier if needed. |
| Authentication failures (HTTP 401) | High | Verify environment variables and API key validity. Fix immediately. |

The beauty of graceful degradation is that 1 or even 2 failing sources still provides value to users. Your monitoring helps you prioritize which issues need immediate attention versus which can wait for the next maintenance window.

Additional Pipeline Features

The complete aggregator (see Appendix A) includes two additional features that enhance the user experience:

1. Deduplication

Multiple sources often return the same article, especially for major stories. The deduplicator uses URL comparison to identify and remove duplicate articles, ensuring each story appears only once in the results.

2. Sorting Options

Users can view articles by recency (breaking news), by source (grouped by provider), or alphabetically. Different use cases need different orderings - the aggregator supports multiple sort strategies that work uniformly across the normalized data.

These features build on the foundation you've created. The appendix includes full implementations with the same defensive patterns you've learned throughout this chapter.

Deduplication: Handling Cross-Source Overlap

When you search three news sources simultaneously, you'll often get the same article from multiple APIs. A story about Python 3.13's release might appear in NewsAPI (from TechCrunch), Guardian (from their technology section), and HackerNews (as a submitted link). Without deduplication, users see the same article three times in results.

The aggregator needs to detect duplicates and keep only one instance per unique article. This is a fundamental multi-source integration challenge: how do you know when two items from different APIs represent the same real-world entity?

Why Not Just Use Title Matching?

Your first instinct might be comparing titles: if two articles have the same title, they're duplicates. But titles vary - "Python 3.13 Released" vs. "Python 3.13 Now Available" vs. "New Python 3.13 Release" are all the same story with different headlines. Title matching produces false negatives (misses real duplicates) because of phrasing variations.

URLs are more reliable: if two articles link to https://blog.python.org/2025/01/python-3-13-released.html, they're discussing the same content regardless of how each source titled it.

URL-Based Deduplication Implementation

The deduplication strategy uses URLs as unique identifiers. Since your canonical model requires all articles to have URLs, you can treat the URL field as the deduplication key:

Python
def deduplicate_articles(articles: List[Article]) -> List[Article]:
    """
    Remove duplicate articles using URL as unique identifier.
    
    When multiple sources report the same story, they often link to
    the same original URL. Keep the first instance encountered.
    """
    seen_urls = set()
    unique_articles = []
    
    for article in articles:
        if article.url not in seen_urls:
            seen_urls.add(article.url)
            unique_articles.append(article)
    
    return unique_articles

This approach is simple and effective: maintain a set of seen URLs, and only append articles whose URLs haven't been encountered yet. The set lookup is O(1), making deduplication efficient even with hundreds of articles.

Testing Deduplication

Verify that deduplication works by simulating overlapping results from multiple sources:

Python
# Simulate articles from different sources with some duplicates
articles = [
    Article(
        title="Python 3.13 Released",
        url="https://blog.python.org/2025/01/python-3-13.html",
        published_at="2025-01-15T10:00:00Z",
        source_name="NewsAPI"
    ),
    Article(
        title="New Python Version Available",  # Different title, same URL
        url="https://blog.python.org/2025/01/python-3-13.html",
        published_at="2025-01-15T10:00:00Z",
        source_name="Guardian"
    ),
    Article(
        title="Python 3.13 Performance Improvements",
        url="https://realpython.com/python313-new-features/",
        published_at="2025-01-15T11:00:00Z",
        source_name="HackerNews"
    ),
    Article(
        title="Python 3.13 Now Available",  # Duplicate URL again
        url="https://blog.python.org/2025/01/python-3-13.html",
        published_at="2025-01-15T10:00:00Z",
        source_name="HackerNews"
    ),
]

print(f"Before deduplication: {len(articles)} articles")
unique = deduplicate_articles(articles)
print(f"After deduplication: {len(unique)} articles")
print(f"Removed {len(articles) - len(unique)} duplicates")

print("\nUnique articles:")
for article in unique:
    print(f"  - {article.title} ({article.source_name})")
Output
Before deduplication: 4 articles
After deduplication: 2 articles
Removed 2 duplicates

Unique articles:
  - Python 3.13 Released (NewsAPI)
  - Python 3.13 Performance Improvements (HackerNews)

The algorithm correctly identifies three articles with the same URL and keeps only the first instance (from NewsAPI). The Real Python article has a different URL, so it's retained as unique.

Edge Cases & Limitations

URL-based deduplication works well for most cases but has edge cases you should understand:

1. Tracking Parameters Create False Negatives

The same article with different tracking parameters appears as different URLs:

Example
https://techcrunch.com/article?utm_source=newsapi&utm_medium=feed
https://techcrunch.com/article?utm_source=guardian&utm_medium=api

These URLs point to identical content but differ in query parameters. Basic URL comparison treats them as different articles. To handle this, you'd need to normalize URLs by stripping tracking parameters before comparison.

2. Protocol and Subdomain Variations

The same resource might appear with minor URL variations:

Example
http://example.com/article
https://example.com/article
https://www.example.com/article

URL normalization could unify these (lowercase domains, remove www, enforce https) but adds complexity. For this aggregator, we accept that these variations might appear as separate articles.
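Both variation classes above can be collapsed with a small normalizer. This is a sketch, not a complete canonicalizer - the tracking-parameter list is an assumption, and real-world URLs have more corner cases:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = ("utm_", "fbclid", "gclid")  # assumed tracking-parameter prefixes

def canonical_url(url: str) -> str:
    """Normalize a URL for deduplication: https, lowercase host, no www, no tracking params."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[len("www."):]
    # Treat http and https as the same resource
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    # Drop query parameters that only carry tracking information
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                       if not k.startswith(TRACKING_PARAMS)])
    return urlunsplit((scheme, netloc, parts.path, query, ""))
```

Deduplicating on canonical_url(article.url) instead of the raw URL then treats the tracking-parameter and www/https variations as one article.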

3. Different URLs for Same Story

News outlets often write their own coverage of the same event, producing genuinely different URLs:

Example
https://techcrunch.com/python-3-13-released
https://theverge.com/tech/python-new-version
https://arstechnica.com/programming-language-update

These are three different articles about the same event. URL deduplication correctly treats them as distinct - each outlet wrote their own coverage. This is expected behavior, not a flaw.

4. First-Seen Bias

The algorithm keeps whichever instance appears first in the combined list. If NewsAPI returns before Guardian, NewsAPI's version (with its metadata) is retained. This means source ordering affects which metadata survives deduplication.

Alternative: implement "best source" selection - if you prefer Guardian's descriptions over NewsAPI's, keep Guardian's instance when URLs match. This requires explicit preference rules.
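That preference rule can be sketched as follows - the Article dataclass here is a minimal stand-in for the chapter's canonical model, and the source ranking is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Article:  # minimal stand-in for the chapter's canonical model
    title: str
    url: str
    source_name: str

SOURCE_RANK = {"Guardian": 0, "NewsAPI": 1, "HackerNews": 2}  # hypothetical preference order

def deduplicate_preferring(articles):
    """Keep one article per URL, preferring the instance from the highest-ranked source."""
    best = {}
    for article in articles:
        rank = SOURCE_RANK.get(article.source_name, len(SOURCE_RANK))
        # Replace the stored instance only if this source outranks it
        if article.url not in best or rank < best[article.url][0]:
            best[article.url] = (rank, article)
    return [article for _, article in best.values()]
```

Unlike first-seen deduplication, the result no longer depends on which API happened to respond first.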

When to Enhance Deduplication

For this news aggregator, simple URL comparison is sufficient. The edge cases above rarely cause problems in practice - most duplicate articles have identical URLs, and tracking parameter variations are uncommon in news APIs.

Consider URL normalization if you notice frequent false negatives in your specific domain. For e-commerce product aggregators or academic paper indexers, more sophisticated deduplication (fuzzy matching, content hashing, title similarity) might be necessary.

The principle remains: start with the simplest approach that solves 90% of cases, then add complexity only when real-world data demonstrates the need.

Integration with Aggregation Pipeline

Deduplication fits into the aggregation pipeline after fetching all sources but before sorting:

Python
def search_all_sources(query: str) -> tuple[List[Article], dict]:
    """
    Search all sources, deduplicate, and return unified results.
    """
    all_articles = []
    stats = {"sources": [], "total": 0, "duplicates_removed": 0}
    
    # 1. Fetch from all sources independently
    newsapi_articles = fetch_newsapi(query)
    guardian_articles = fetch_guardian(query)
    hn_articles = fetch_hackernews(query)
    
    # 2. Combine all results
    all_articles.extend(newsapi_articles)
    all_articles.extend(guardian_articles)
    all_articles.extend(hn_articles)
    
    stats["total"] = len(all_articles)
    
    # 3. Deduplicate
    unique_articles = deduplicate_articles(all_articles)
    stats["duplicates_removed"] = len(all_articles) - len(unique_articles)
    
    # 4. Sort by recency
    unique_articles.sort(key=lambda a: a.published_at, reverse=True)
    
    return unique_articles, stats

Track how many duplicates were removed in the stats dictionary. This gives users transparency about what the aggregator is doing and helps you monitor deduplication effectiveness over time.

Takeaways & Next Step

Aggregation Pipeline:

  • Graceful degradation: Independent error handling per source allows partial results when some APIs fail
  • Stats tracking: Record what succeeded and failed for diagnostics and user transparency
  • Composable design: Fetch, normalize, deduplicate, and sort are independent steps
  • Production resilience: System continues working when individual components fail

Section 6 briefly shows the display layer that presents these unified results to users. Section 7 then examines what can still go wrong despite all the defensive programming, motivating Chapter 12's validation approach.

6. Display & User Experience

The Payoff: Unified Results

The aggregation pipeline produces unified article lists from multiple sources. Now see what users actually experience - clean, scannable results that hide all the complexity you've handled:

What Users See
Terminal Output
======================================================================
📰 NEWS AGGREGATOR RESULTS
======================================================================

Query: 'python programming'
Sources: NewsAPI, Guardian, HackerNews
Articles: 12 (3 duplicates removed)

1. Python 3.13 Performance Improvements
   📰 TechCrunch • January 15, 2025 at 02:30 PM
   ✍️  Sarah Chen
   Python's latest release brings significant speed improvements...
   🔗 https://techcrunch.com/2025/01/15/python-3-13-performance
   (techcrunch.com)

2. Python 3.13 is out
   📰 HackerNews • January 15, 2025 at 08:45 AM
   ✍️  throwaway2025
   🔗 https://www.python.org/downloads/release/python-3130/
   (www.python.org)

3. Python programming language gains popularity in schools
   📰 Technology • January 14, 2025 at 03:15 PM
   ✍️  Alex Hern
   UK schools are increasingly teaching Python as first language...
   🔗 https://www.theguardian.com/technology/2025/jan/14/python-schools
   (www.theguardian.com)

... and 9 more articles

======================================================================

Every article displays consistently - title, source, timestamp, author, description, URL - regardless of which API provided it. This uniform presentation is possible because of the canonical model. You write the display code once and it works for all sources.

Why Canonical Models Simplify Display

Without normalization, display code would need branching logic: "if NewsAPI then show publishedAt, if Guardian then show webPublicationDate, if HackerNews then convert created_at_i..." With normalization, every article has published_at in ISO format. Display code never knows or cares which API was the source.

The Complete Application

The appendix includes a complete command-line interface that wraps the aggregator with interactive search, result viewing, and exploration commands. Users can search, refine queries, and access articles without writing any code. The interface demonstrates how canonical models enable rich user experiences - you build features on top of predictable data structures rather than fighting with API variations.

Key features of the complete application:

  • Interactive search: Natural language commands (search, more, group, quit)
  • Multiple views: Chronological listing, grouped by source, or filtered results
  • Source transparency: Clear indication of which APIs contributed and which failed
  • Stateful session: Remember last search for follow-up commands

See Appendix A for the complete implementation with all display functions and CLI logic.

Takeaways & Next Step

Display Layer Benefits:

  • Uniform presentation: One display format works for all sources thanks to canonical models
  • User transparency: Show which sources contributed, what failed, how many duplicates removed
  • Clean separation: Display logic never touches API-specific details

Section 7 examines what can still go wrong despite all the defensive programming you've implemented, setting up the motivation for Chapter 12's systematic validation approach.

7. When Defensive Programming Isn't Enough

The Aggregator Works - But Problems Remain

Your news aggregator is genuinely impressive. It handles three completely different API structures, degrades gracefully when sources fail, removes duplicates intelligently, and presents unified results. The canonical model design is solid. The normalizers are robust. The aggregation pipeline is resilient. Thousands of applications in production use exactly this architecture successfully.

But here's what experienced developers know: defensive programming solves one problem brilliantly (preventing crashes) while leaving another problem unsolved (enforcing quality). Your aggregator won't crash on malformed data - that's the victory. But it might show articles with empty titles, invalid URLs, or corrupted timestamps. These quality issues slip through because defensive programming checks "will this crash?" not "is this acceptable?"

This section examines what defensive programming cannot prevent, and why that matters. You'll see specific examples of quality degradation that pass all defensive checks. This isn't a flaw in your implementation - it's an inherent limitation of the defensive approach. Understanding this boundary motivates Chapter 12's validation techniques, which enforce quality standards beyond crash prevention.

Problem 1: Empty Required Fields

Your normalizers check if required fields exist, but not if they contain meaningful data. An article with title="" passes this check because the field is present and the string type is correct.

The Empty Field Problem
Python
# Simulate API response with empty required field
bad_response = {
    "articles": [
        {
            "title": "",  # Empty but present
            "url": "https://example.com/article",
            "publishedAt": "2025-01-15T10:00:00Z",
            "source": {"name": "Example Source"}
        }
    ]
}

articles = normalize_newsapi(bad_response)
print(f"Articles accepted: {len(articles)}")

Your normalizer catches this because of the .strip() call and if not title check. But this is fragile - what if you forget .strip() in one normalizer? What if the API sends a title made entirely of whitespace through a code path that skips the check?

Defensive vs Validation

Defensive programming relies on remembering to check every edge case in every function. One forgotten .strip() and invalid data enters your system.

Validation makes these rules explicit and automatic: "titles must be non-empty strings of 5-200 characters" enforced in one place, applied everywhere.
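Enforced in one place, that rule might look like this sketch (the function name and length bounds match the rule stated above; Chapter 12 makes such rules systematic):

```python
def validate_title(title) -> str:
    """Enforce the title rule in one place: non-empty, 5-200 characters after stripping."""
    cleaned = title.strip() if isinstance(title, str) else ""
    if not 5 <= len(cleaned) <= 200:
        raise ValueError(f"Invalid title: {title!r}")
    return cleaned
```

Every normalizer calls the same validator, so there is no per-function .strip() to forget.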

Problem 2: Malformed Timestamps

Your normalizers accept any string in the timestamp field. Some APIs occasionally return malformed dates that look valid at first glance but cause sorting and display failures later.

Silent Timestamp Failures
Python
# Simulate various timestamp problems
timestamp_issues = {
    "articles": [
        {
            "title": "Article 1",
            "url": "https://example.com/1",
            "publishedAt": "2025-13-45T99:99:99Z",  # Invalid date/time
            "source": {"name": "Bad API"}
        },
        {
            "title": "Article 2", 
            "url": "https://example.com/2",
            "publishedAt": "yesterday",  # Relative time string
            "source": {"name": "Bad API"}
        }
    ]
}

articles = normalize_newsapi(timestamp_issues)
print(f"Articles accepted: {len(articles)}")

# Try to sort by timestamp
sorted_articles = sorted(articles, key=lambda a: a.published_at, reverse=True)
for article in sorted_articles:
    print(f"  {article.title}: {article.format_timestamp()}")
Output
Articles accepted: 2
  Article 1: 2025-13-45T99:99:99Z
  Article 2: yesterday

The articles pass through with malformed timestamps. Sorting "succeeds" because Python compares strings lexicographically, but the order is meaningless. The format_timestamp() method fails silently and returns the raw string. Users see garbage in the timestamp field, but the application doesn't crash.

Silent Failures Are Worse Than Crashes

The aggregator appears to work and displays results, but the data is nonsensical. Users can't trust publication dates, sorting by recency is meaningless, and identifying fresh content becomes impossible. Defensive programming prevented the crash but failed to prevent the quality issue.
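An explicit timestamp check would reject both bad values before they ever reach sorting or display - a minimal sketch of the kind of rule the next chapter formalizes:

```python
from datetime import datetime

def is_valid_iso_timestamp(value) -> bool:
    """Return True only if value parses as an ISO 8601 timestamp."""
    try:
        # fromisoformat rejects impossible dates like month 13 or hour 99
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except (ValueError, AttributeError):
        return False
```

Run at normalization time, this turns a silent garbage timestamp into a skipped (and logged) article.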

Problem 3: Invalid URLs

Your normalizers check that URLs exist but not that they're actually valid URLs. APIs sometimes return placeholder values, malformed URLs, or internal identifiers that look URL-like but aren't.

URL Validation Gap
Python
# Simulate URL problems
url_issues = {
    "articles": [
        {
            "title": "Article with placeholder",
            "url": "http://",  # Incomplete
            "publishedAt": "2025-01-15T10:00:00Z",
            "source": {"name": "Bad Source"}
        },
        {
            "title": "Article with ID",
            "url": "article_12345",  # Not a URL at all
            "publishedAt": "2025-01-15T11:00:00Z", 
            "source": {"name": "Bad Source"}
        }
    ]
}

articles = normalize_newsapi(url_issues)
for article in articles:
    print(f"{article.title}")
    print(f"  URL: {article.url}")
    print(f"  Domain: {article.get_domain()}")
Output
Article with placeholder
  URL: http://
  Domain: unknown

Article with ID
  URL: article_12345
  Domain: unknown

Both articles pass validation. The get_domain() method returns "unknown" for invalid URLs, but the articles still appear in results. Users click these URLs and either get errors or nothing happens. The aggregator's value proposition - quick access to original content - breaks down.
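A structural URL check closes this particular gap - it verifies that a scheme and host are present, without guaranteeing the page actually exists (a sketch, not a full validator):

```python
from urllib.parse import urlsplit

def looks_like_url(value: str) -> bool:
    """Check that a string has an http(s) scheme and a non-empty host."""
    try:
        parts = urlsplit(value)
    except ValueError:  # urlsplit rejects some malformed inputs, e.g. bad IPv6 literals
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Both failing examples above ("http://" has no host, "article_12345" has no scheme) are caught before display.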

Problem 4: Quality Degradation from Defaults

Defensive defaults prevent crashes but accumulate into quality problems. When every article with a missing author shows "Unknown", or every missing source shows "Unknown Source", the results lose credibility markers users need.

Default Accumulation
Python
# Simulate API response with many missing fields
sparse_response = {
    "articles": [
        {
            "title": "Article 1",
            "url": "https://example.com/1",
            "publishedAt": "2025-01-15T10:00:00Z",
            "source": {}  # Empty source object
        },
        {
            "title": "Article 2",
            "url": "https://example.com/2", 
            "publishedAt": "2025-01-15T11:00:00Z",
            "source": {"name": None}  # Null source name
        },
        {
            "title": "Article 3",
            "url": "https://example.com/3",
            "publishedAt": "2025-01-15T12:00:00Z"
            # Missing source entirely
        }
    ]
}

articles = normalize_newsapi(sparse_response)
for article in articles:
    print(f"{article.title}")
    print(f"  Source: {article.source_name}\n")
Output
Article 1
  Source: Unknown Source

Article 2
  Source: Unknown Source

Article 3
  Source: Unknown Source

Every article shows "Unknown Source". Users can't distinguish reputable sources from unreliable ones. The aggregator becomes less useful than searching each source individually, where at least the source is clear. The defensive approach prioritized "don't crash" over "provide value".
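One way to restore visibility is to count how often each default fires. The DefaultTracker class below is an illustrative sketch, not part of the chapter's implementation; it wraps the fallback so trends surface instead of vanishing silently:

```python
from collections import Counter

class DefaultTracker:
    """Record every time a defensive default is substituted for real data."""

    def __init__(self):
        self.counts = Counter()

    def with_default(self, value, default, field):
        # Substitute the default, but remember that we had to.
        if value is None or (isinstance(value, str) and not value.strip()):
            self.counts[field] += 1
            return default
        return value

tracker = DefaultTracker()
raw_sources = [{}, {"name": None}, {"name": "TechCrunch"}]
for raw in raw_sources:
    name = tracker.with_default(raw.get("name"), "Unknown Source", "source_name")

print(dict(tracker.counts))  # {'source_name': 2}
```

A per-source instance of a tracker like this would show at a glance that, say, one API needed the source_name default for most of its articles, exactly the signal that silent defaults erase.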

Problem 5: Business Rule Violations

Some quality issues involve relationships between fields rather than individual field validation. Your current approach can't catch these because it validates fields independently.

Cross-Field Validation Gaps
Python
# Simulate business rule violations
rule_violations = {
    "articles": [
        {
            "title": "Future Article",
            "url": "https://example.com/future",
            "publishedAt": "2026-12-31T23:59:59Z",  # Future date
            "source": {"name": "Time Traveler News"}
        },
        {
            "title": "Ancient Article",
            "url": "https://example.com/ancient",
            "publishedAt": "1990-01-01T00:00:00Z",  # 35 years old
            "source": {"name": "Archive"}
        }
    ]
}

articles = normalize_newsapi(rule_violations)

from datetime import datetime, timezone
now = datetime.now(timezone.utc)

for article in articles:
    pub_date = datetime.fromisoformat(article.published_at.replace('Z', '+00:00'))
    age_days = (now - pub_date).days
    
    print(f"{article.title}")
    print(f"  Age: {age_days} days")
    if age_days < 0:
        print(f"  ⚠️  Future date!")
    elif age_days > 365:
        print(f"  ⚠️  Over a year old!")
Output
Future Article
  Age: -696 days
  ⚠️  Future date!

Ancient Article
  Age: 12797 days
  ⚠️  Over a year old!

Both articles pass the defensive checks because the timestamps are technically valid ISO 8601 strings. But a news aggregator showing articles from 1990, or dated in the future, makes no sense. These are valid data types with invalid business semantics.

What Chapter 12 Adds

Each problem above demonstrates the gap between defensive programming and quality enforcement. Chapter 12 closes this gap with systematic validation:

Problem: Defensive Programming vs. Validation (Chapter 12)

  • Empty fields. Defensive: must remember checks in every normalizer. Validation: schema defines min/max length, enforced automatically.
  • Malformed timestamps. Defensive: silently returns the raw string when parsing fails. Validation: rejects at the boundary with an error ("invalid ISO 8601").
  • Invalid URLs. Defensive: accepts any string, fails at display time. Validation: validates URL format, rejects before storage.
  • Quality degradation. Defensive: defaults accumulate with no visibility. Validation: logs rejections, tracks quality metrics per source.
  • Business rules. Defensive: scattered checks, easy to forget. Validation: centralized validators, applied consistently.

Complementary Approaches

Defensive programming asks "will this crash?" Validation asks "is this acceptable?" The first prevents exceptions, the second enforces standards. Your aggregator needs both: defensive programming to handle structural variation, validation to maintain quality standards.
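The two questions can sit side by side in code. In this sketch (the acceptable function is illustrative, not from the aggregator), the defensive expressions supply safe defaults so nothing crashes, while the final return enforces the quality bar:

```python
def acceptable(article) -> bool:
    """Validation: is this article good enough to show users?"""
    # Defensive layer: `or ""` prevents crashes when fields are None or missing.
    title = (article.get("title") or "").strip()
    source = (article.get("source") or "").strip()
    # Validation layer: empty or placeholder values are rejected, not defaulted.
    return bool(title) and source not in ("", "Unknown Source")

print(acceptable({"title": "", "source": "TechCrunch"}))       # False: empty title
print(acceptable({"title": "Real story", "source": None}))     # False: no usable source
print(acceptable({"title": "Real story", "source": "Wired"}))  # True
```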

Chapter 12 builds on this foundation by adding systematic validation to your aggregator: structural schemas, content validators, business rule enforcers, and quality metrics. You'll enhance the news aggregator with validation layers that maintain data quality while keeping the defensive patterns that handle legitimate structural variation.

Takeaways & Looking Forward

When Defensive Programming Isn't Enough:

  • Prevents crashes, not quality issues: Defensive code stops exceptions but accepts degraded data
  • Silent failures accumulate: Invalid timestamps, malformed URLs, and empty fields pass through with defaults
  • No visibility into patterns: Individual bad items are handled, but trends go unnoticed
  • Business rules scattered: Cross-field validation logic lives in multiple places
  • Validation provides standards: Explicit rules, automatic enforcement, clear rejection reasons

You've built a working multi-API news aggregator that demonstrates every advanced JSON processing pattern from Chapter 10. The aggregator searches real sources, handles failures gracefully, and presents unified results. You understand both its strengths (structural resilience) and its limitations (quality enforcement). Chapter 12 will complete the picture by adding the validation layer that transforms "works in production" into "maintains production quality standards."

8. Chapter Summary

You've built something impressive in this chapter: a complete multi-API news aggregator that searches three different sources simultaneously, normalizes their radically different response structures into a unified format, handles failures gracefully, removes duplicates, and presents clean results. This isn't a toy example. The code fetches live data, processes real API variations, and demonstrates production patterns you'll use throughout your career.

More importantly, you've internalized the systematic approach professionals use when integrating external services. You explored before coding, designed canonical models thoughtfully, built normalizers defensively, created resilient pipelines, and critically evaluated limitations. This workflow applies whether you're integrating two APIs or twenty, and whether you're building news aggregators, payment processors, or analytics dashboards.

Key Skills Mastered

1. Systematic API Exploration

You learned to explore API structures before writing extraction code. Using diagnostic tools to reveal response patterns, you created comparison tables documenting where the same data lives in each API. This exploration-first approach prevents wasted effort from incorrect assumptions and produces documentation that guides implementation.

2. Canonical Model Design

You designed the Article dataclass that unifies three incompatible API formats. You learned to balance universality (what all sources provide) with practicality (what's actually useful), using Optional types for fields not all sources include and __post_init__ validation for required fields.

3. Source-Specific Normalization

You built normalizers that transform each API's quirks into your canonical format. Each normalizer follows the same five-step pattern (extract items, loop through, validate required fields, map to canonical names, construct objects) while customizing field access for its API's structure. The pattern is consistent; only the mappings change.

4. Aggregation Pipeline Architecture

You implemented graceful degradation where each source operates independently, failures are contained, and the system continues with partial results. You learned to track success and failure statistics for diagnostics, deduplicate across sources, and sort unified results. This architecture separates hobby projects from production systems.

5. Graceful Degradation Patterns

You learned that production systems must handle partial failures without crashing. When NewsAPI is down but Guardian and HackerNews work, users get results from two sources instead of an error message. Independent try/except blocks, error categorization, and success tracking create systems that degrade gracefully under pressure.

6. Recognizing Defensive Programming Limits

You discovered where defensive programming stops being enough. Empty fields, malformed timestamps, invalid URLs, and quality degradation from defaults all pass through defensive checks. This gap between crash prevention and quality enforcement motivates Chapter 12's validation approach.

Chapter Review Quiz

Test your understanding of the multi-API integration patterns from this chapter.

Why should you explore API structures before writing extraction code?

API documentation is often incomplete, outdated, or misleading. Exploration with diagnostic tools reveals the actual response structure (nesting patterns, field names, data types) that your code must handle. This prevents wasted effort from incorrect assumptions and produces comparison tables that guide normalizer design. You discover variations (like Guardian's optional fields object) that documentation might not mention.

What is a canonical model and why is it essential for multi-API integration?

A canonical model is a single internal representation that all external formats are normalized into. For the news aggregator, the Article dataclass defines how articles are represented internally, regardless of whether they came from NewsAPI, Guardian, or HackerNews. This is essential because it allows all downstream code (display, storage, deduplication, sorting) to work with one predictable structure instead of branching on source. Without canonicalization, every feature would need conditional logic for each API.

What are the five things every normalizer does, regardless of which API it handles?

Every normalizer: (1) extracts the items array from the response using extract_items_and_meta(), (2) loops through each item with error containment via try/except, (3) validates required fields and skips items missing them, (4) maps API-specific field names to canonical names, and (5) constructs Article objects. The structure is identical across normalizers; only the field access patterns change.

What is graceful degradation and why does it matter for production systems?

Graceful degradation means the system continues operating with reduced functionality when components fail. When one API is down, the aggregator returns results from the working sources instead of crashing. This matters because production systems face unpredictable failures (network issues, rate limits, API outages). Users prefer partial results over error messages. The pattern involves independent failure handling per source, error categorization, and success/failure tracking.

What's the difference between crash prevention (defensive programming) and quality enforcement (validation)?

Defensive programming prevents crashes by handling missing fields, type errors, and structural variations. It asks "will this crash?" and uses defaults, try/except blocks, and safe navigation. Validation enforces quality standards by rejecting data that doesn't meet requirements. It asks "is this acceptable?" and applies rules like "titles must be non-empty" or "timestamps must be valid ISO 8601." Defensive code accepts an article with an empty title (no crash); validation rejects it (violates quality standard).

Why do normalizers use continue instead of raising exceptions when an item is invalid?

Using continue allows batch processing to proceed despite individual item failures. If one article in a 50-article response has a missing title, you want the other 49 valid articles, not an exception that aborts the entire batch. This is error containment: failures are isolated to the individual item. The normalizer logs the issue (for debugging) but continues processing. Exceptions would be appropriate only if the entire response is fundamentally broken.
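Error containment in miniature (a generic sketch, not the aggregator's own code; int() stands in for "normalize one item"):

```python
def parse_batch(raw_items):
    """Keep every parseable item; contain failures to the item that caused them."""
    parsed, skipped = [], 0
    for raw in raw_items:
        try:
            parsed.append(int(raw))   # stand-in for normalizing a single item
        except (TypeError, ValueError):
            skipped += 1              # log-and-continue instead of raising
            continue
    return parsed, skipped

print(parse_batch(["1", "2", "oops", None, "5"]))  # ([1, 2, 5], 2)
```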

How does the comparison table from Section 2 guide normalizer implementation?

The comparison table documents where equivalent data lives in each API: title is title in NewsAPI but webTitle in Guardian. It reveals structural patterns (Guardian's double nesting, HackerNews's Unix timestamps, NewsAPI's nested source object) that normalizers must handle. When implementing a normalizer, you consult the table to know exactly which fields to access and what transformations to apply. It's your reference specification.

What quality issues can defensive programming NOT prevent, and why?

Defensive programming cannot prevent: (1) empty required fields (an empty string is present, just useless), (2) malformed timestamps (looks like a string, but fails to parse), (3) invalid URLs (non-empty but not actually a URL), (4) quality degradation from defaults ("Unknown Source" everywhere), (5) business rule violations (future-dated articles, year-old content). These issues occur because defensive code only checks type and presence, not semantic validity. The data is technically correct but practically useless.

Looking Forward

Section 7 exposed the gap between crash prevention and quality enforcement. Your aggregator handles structural variation beautifully but still accepts malformed timestamps, invalid URLs, empty fields, and business rule violations. Chapter 12 closes this gap with systematic validation.

You'll enhance the news aggregator with validation layers that reject bad data at the boundary, log specific failures, and maintain quality standards. Chapter 12 introduces structural validation with JSON Schema for declarative format enforcement, content validation for field-level quality rules, business rule validation for cross-field logic and domain constraints, and quality metrics for tracking validation failures to identify problematic sources.

The defensive patterns you learned in this chapter remain essential. They handle legitimate structural variation (optional fields, different nesting depths, format conversions). Validation adds a complementary layer that enforces quality standards. Together, they create systems that are both resilient (don't crash on unexpected structures) and reliable (maintain data quality).

Appendix A: Complete Implementation

This appendix contains the complete, production-ready implementation of the News Aggregator. All code has been tested with live APIs and includes comprehensive error handling, deduplication, sorting, and an interactive CLI. Use this as a reference implementation or starting point for your own multi-API projects.

The implementation is organized into logical sections:

  • Utility Functions: Chapter 10 exploration and access tools
  • Canonical Model: Article dataclass with validation
  • Normalizers: Complete implementations for all three APIs
  • Aggregator: Pipeline orchestration with deduplication and sorting
  • CLI: Interactive command-line interface
Complete News Aggregator (400+ lines)
Python - news_aggregator.py
"""
News Aggregator - Multi-API Integration Case Study
Demonstrates advanced JSON processing patterns from Chapter 10.

Requirements:
- Python 3.10+
- requests library: pip install requests
- Environment variables: NEWSAPI_KEY, GUARDIAN_KEY

Usage:
    python news_aggregator.py
"""

import os
import requests
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple, Optional
from datetime import datetime, timezone
from itertools import groupby

# ============================================================================
# SECTION 1: UTILITY FUNCTIONS (Chapter 10)
# ============================================================================

def explore_api_structure(url, headers=None, max_depth=2):
    """Systematically explore an API response structure."""
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        data = response.json()
        
        print(f"API Response Analysis")
        print("=" * 60)
        print(f"URL: {url}")
        print(f"Response type: {type(data).__name__}")
        
        if isinstance(data, dict):
            print(f"Top-level keys: {list(data.keys())}")
            print(f"Total keys: {len(data)}")
            
            common_containers = ['results', 'data', 'items', 'articles', 
                               'hits', 'content', 'payload', 'response']
            found_containers = [key for key in common_containers if key in data]
            if found_containers:
                print(f"Possible data containers found: {found_containers}")
            
            print("\nFirst few fields (with truncated values):")
            for i, (key, value) in enumerate(list(data.items())[:5]):
                value_type = type(value).__name__
                if isinstance(value, (dict, list)):
                    size_info = f" (length: {len(value)})" if hasattr(value, '__len__') else ""
                    print(f"  {key}: {value_type}{size_info}")
                else:
                    str_value = str(value)
                    display_value = str_value[:50] + "..." if len(str_value) > 50 else str_value
                    print(f"  {key}: {display_value}")
        
        elif isinstance(data, list):
            print(f"Array with {len(data)} items")
            if data:
                first_item = data[0]
                print(f"First item type: {type(first_item).__name__}")
                if isinstance(first_item, dict):
                    print(f"First item keys: {list(first_item.keys())}")
        
        print("\nSample structure (truncated for readability):")
        print(json.dumps(truncate_for_display(data, max_depth), indent=2))
        print("\n")
        
        return data
        
    except requests.RequestException as e:
        print(f"Error fetching data: {e}")
        return None
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return None


def truncate_for_display(obj, max_depth=2, current_depth=0):
    """Helper function to truncate nested data for readable display."""
    if current_depth >= max_depth:
        return "..."
    
    if isinstance(obj, dict):
        truncated = {}
        for i, (key, value) in enumerate(obj.items()):
            if i >= 5:
                truncated["..."] = f"({len(obj) - 5} more keys)"
                break
            truncated[key] = truncate_for_display(value, max_depth, current_depth + 1)
        return truncated
    
    elif isinstance(obj, list):
        truncated = []
        for i, item in enumerate(obj[:3]):
            truncated.append(truncate_for_display(item, max_depth, current_depth + 1))
        if len(obj) > 3:
            truncated.append(f"... ({len(obj) - 3} more items)")
        return truncated
    
    else:
        if isinstance(obj, str) and len(obj) > 50:
            return obj[:50] + "..."
        return obj


COMMON_COLLECTION_KEYS = ["items", "results", "data", "articles", 
                          "hits", "content", "entries", "records"]


def extract_items_and_meta(
    api_response: Any,
    container_hints: Optional[List[str]] = None
) -> Tuple[List[Any], Dict[str, Any]]:
    """Return (items, metadata) and normalize pagination signals."""
    meta: Dict[str, Any] = {}

    if isinstance(api_response, list):
        return api_response, meta

    if not isinstance(api_response, dict):
        return [], meta

    keys = (container_hints or []) + COMMON_COLLECTION_KEYS
    container_key = None
    for key in keys:
        if key in api_response and isinstance(api_response[key], list):
            container_key = key
            break

    if container_key:
        items = api_response[container_key]
        meta = {k: v for k, v in api_response.items() if k != container_key}
    else:
        items = [api_response]
        meta = {}

    meta_obj = meta.get("meta") if isinstance(meta.get("meta"), dict) else {}
    
    next_token = (
        meta_obj.get("cursor") or
        meta.get("cursor") or
        meta.get("next") or
        None
    )
    
    total = (
        meta.get("total") or 
        meta.get("totalResults") or 
        meta.get("total_count") or 
        meta_obj.get("total") or 
        None
    )
    
    meta_norm = {"next_token": next_token, "total": total}
    return items, {**meta, **meta_norm}


def safe_get(obj: Any, path: str, default=None):
    """Dot-path lookup with safe defaults."""
    cur = obj
    for part in path.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return default
        cur = cur[part]
    return cur


# ============================================================================
# SECTION 2: CANONICAL MODEL
# ============================================================================

@dataclass
class Article:
    """Canonical article format for all news sources."""
    title: str
    url: str
    published_at: str
    source_name: str
    description: Optional[str] = None
    author: Optional[str] = None
    image_url: Optional[str] = None
    content_preview: Optional[str] = None
    
    def __post_init__(self):
        """Validate required fields."""
        if not self.title or not self.title.strip():
            raise ValueError("Article title cannot be empty")
        if not self.url or not self.url.strip():
            raise ValueError("Article URL cannot be empty")
        if not self.source_name or not self.source_name.strip():
            raise ValueError("Article source_name cannot be empty")
    
    def format_timestamp(self) -> str:
        """Convert ISO timestamp to human-readable format."""
        try:
            dt = datetime.fromisoformat(self.published_at.replace('Z', '+00:00'))
            return dt.strftime("%B %d, %Y at %I:%M %p")
        except (ValueError, AttributeError):
            return self.published_at
    
    def get_domain(self) -> str:
        """Extract domain from URL."""
        try:
            from urllib.parse import urlparse
            parsed = urlparse(self.url)
            return parsed.netloc or "unknown"
        except Exception:
            return "unknown"


# ============================================================================
# SECTION 3: NORMALIZERS
# ============================================================================

def normalize_newsapi(response: dict) -> List[Article]:
    """Transform NewsAPI response into canonical format."""
    items, meta = extract_items_and_meta(response)
    articles = []
    
    for item in items:
        try:
            # `or ""` guards against explicit null values, which .get() defaults miss
            title = (item.get("title") or "").strip()
            url = (item.get("url") or "").strip()
            published_at = item.get("publishedAt", "")
            
            if not title or not url or not published_at:
                continue
            
            # safe_get returns None when source.name is present but null
            source_name = safe_get(item, "source.name") or "Unknown Source"
            
            description = item.get("description")
            if description:
                description = description.strip()
            
            author = item.get("author")
            if author:
                author = author.strip()
            
            image_url = item.get("urlToImage")
            
            content = item.get("content")
            content_preview = None
            if content:
                content_preview = content[:200] + "..." if len(content) > 200 else content
            
            article = Article(
                title=title,
                url=url,
                published_at=published_at,
                source_name=source_name,
                description=description,
                author=author,
                image_url=image_url,
                content_preview=content_preview
            )
            
            articles.append(article)
            
        except (ValueError, KeyError, TypeError, AttributeError) as e:
            print(f"Warning: Skipping malformed NewsAPI item: {e}")
            continue
    
    return articles


def normalize_guardian(response: dict) -> List[Article]:
    """Transform Guardian API response into canonical format."""
    if not isinstance(response, dict) or "response" not in response:
        return []
    
    response_obj = response["response"]
    items, meta = extract_items_and_meta(response_obj, container_hints=["results"])
    articles = []
    
    for item in items:
        try:
            title = (item.get("webTitle") or "").strip()
            url = (item.get("webUrl") or "").strip()
            published_at = item.get("webPublicationDate", "")
            
            if not title or not url or not published_at:
                continue
            
            source_name = item.get("sectionName", "The Guardian")
            
            fields = item.get("fields", {})
            if not isinstance(fields, dict):
                fields = {}
            
            description = fields.get("trailText")
            if description:
                description = description.strip()
            
            author = fields.get("byline")
            if author:
                author = author.strip()
            
            image_url = fields.get("thumbnail")
            
            body = fields.get("bodyText")
            content_preview = None
            if body:
                content_preview = body[:200] + "..." if len(body) > 200 else body
            
            article = Article(
                title=title,
                url=url,
                published_at=published_at,
                source_name=source_name,
                description=description,
                author=author,
                image_url=image_url,
                content_preview=content_preview
            )
            
            articles.append(article)
            
        except (ValueError, KeyError, TypeError, AttributeError) as e:
            print(f"Warning: Skipping malformed Guardian item: {e}")
            continue
    
    return articles


def normalize_hackernews(response: dict) -> List[Article]:
    """Transform HackerNews API response into canonical format."""
    items, meta = extract_items_and_meta(response, container_hints=["hits"])
    articles = []
    
    for item in items:
        try:
            # HackerNews returns null URLs for Ask HN / text-only posts
            title = (item.get("title") or "").strip()
            url = (item.get("url") or "").strip()
            
            if not title:
                continue
            
            if not url:
                object_id = item.get("objectID", "")
                if object_id:
                    url = f"https://news.ycombinator.com/item?id={object_id}"
                else:
                    continue
            
            created_at_unix = item.get("created_at_i")
            if created_at_unix:
                try:
                    dt = datetime.fromtimestamp(created_at_unix, tz=timezone.utc)
                    published_at = dt.isoformat().replace("+00:00", "Z")
                except (ValueError, OSError):
                    published_at = item.get("created_at", "")
            else:
                published_at = item.get("created_at", "")
            
            if not published_at:
                continue
            
            source_name = "HackerNews"
            
            description = item.get("story_text")
            if description:
                description = description.strip()
            
            author = item.get("author")
            if author:
                author = author.strip()
            
            image_url = None
            
            content_preview = None
            if description:
                content_preview = description[:200] + "..." if len(description) > 200 else description
            
            article = Article(
                title=title,
                url=url,
                published_at=published_at,
                source_name=source_name,
                description=description,
                author=author,
                image_url=image_url,
                content_preview=content_preview
            )
            
            articles.append(article)
            
        except (ValueError, KeyError, TypeError, AttributeError) as e:
            print(f"Warning: Skipping malformed HackerNews item: {e}")
            continue
    
    return articles


# ============================================================================
# SECTION 4: AGGREGATOR WITH DEDUPLICATION
# ============================================================================

def deduplicate_articles(articles: List[Article]) -> List[Article]:
    """
    Remove duplicate articles using URL as unique identifier.
    
    When multiple sources report the same story, they often link to
    the same original URL. Keep the first instance encountered.
    
    For production systems, consider enhancements:
    - URL normalization (strip query parameters and fragments)
    - Fuzzy title matching for different URLs covering same story
    - Domain canonicalization (www vs non-www, http vs https)
    """
    if not articles:
        return []
    
    seen_urls = set()
    unique_articles = []
    
    for article in articles:
        if article.url not in seen_urls:
            seen_urls.add(article.url)
            unique_articles.append(article)
    
    return unique_articles


def sort_articles(
    articles: List[Article],
    sort_by: str = "recency"
) -> List[Article]:
    """Sort articles by various criteria."""
    if not articles:
        return []
    
    if sort_by == "recency":
        return sorted(articles, key=lambda a: a.published_at, reverse=True)
    elif sort_by == "source":
        return sorted(articles, key=lambda a: a.source_name)
    elif sort_by == "title":
        return sorted(articles, key=lambda a: a.title.lower())
    else:
        return articles


class NewsAggregator:
    """Aggregates news from multiple sources with graceful failure handling."""
    
    def __init__(self):
        self.newsapi_key = os.environ.get("NEWSAPI_KEY")
        self.guardian_key = os.environ.get("GUARDIAN_KEY")
    
    def fetch_newsapi(self, query: str, max_results: int = 10) -> dict:
        if not self.newsapi_key:
            raise ValueError("NewsAPI key not configured")
        
        url = "https://newsapi.org/v2/everything"
        params = {
            "q": query,
            "pageSize": max_results,
            "apiKey": self.newsapi_key,
            "language": "en",
            "sortBy": "publishedAt"
        }
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        return response.json()
    
    def fetch_guardian(self, query: str, max_results: int = 10) -> dict:
        if not self.guardian_key:
            raise ValueError("Guardian API key not configured")
        
        url = "https://content.guardianapis.com/search"
        params = {
            "q": query,
            "page-size": max_results,
            "show-fields": "all",
            "api-key": self.guardian_key,
            "order-by": "newest"
        }
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        return response.json()
    
    def fetch_hackernews(self, query: str, max_results: int = 10) -> dict:
        url = "https://hn.algolia.com/api/v1/search"
        params = {
            "query": query,
            "hitsPerPage": max_results,
            "tags": "story"
        }
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        return response.json()
    
    def search(
        self,
        query: str,
        max_per_source: int = 10,
        deduplicate: bool = True,
        sort_by: str = "recency"
    ) -> Tuple[List[Article], Dict]:
        """Search all sources and return combined results."""
        all_articles = []
        stats = {
            "query": query,
            "sources_attempted": [],
            "sources_succeeded": [],
            "sources_failed": [],
            "total_articles_raw": 0,
            "total_articles_deduplicated": 0,
            "duplicates_removed": 0
        }
        
        sources = [
            ("NewsAPI", self.fetch_newsapi, normalize_newsapi),
            ("Guardian", self.fetch_guardian, normalize_guardian),
            ("HackerNews", self.fetch_hackernews, normalize_hackernews)
        ]
        
        for source_name, fetch_fn, normalize_fn in sources:
            stats["sources_attempted"].append(source_name)
            
            try:
                response = fetch_fn(query, max_per_source)
                articles = normalize_fn(response)
                all_articles.extend(articles)
                stats["sources_succeeded"].append(source_name)
                print(f"✓ {source_name}: {len(articles)} articles")
                
            except ValueError as e:
                stats["sources_failed"].append(f"{source_name} (not configured)")
                print(f"⊘ {source_name}: {e}")
                
            except requests.RequestException as e:
                stats["sources_failed"].append(f"{source_name} (network error)")
                print(f"✗ {source_name}: {e}")
                
            except Exception as e:
                stats["sources_failed"].append(f"{source_name} (error: {type(e).__name__})")
                print(f"✗ {source_name}: Unexpected error - {e}")
        
        stats["total_articles_raw"] = len(all_articles)
        
        if deduplicate:
            all_articles = deduplicate_articles(all_articles)
            stats["total_articles_deduplicated"] = len(all_articles)
            stats["duplicates_removed"] = stats["total_articles_raw"] - stats["total_articles_deduplicated"]
        
        all_articles = sort_articles(all_articles, sort_by)
        
        return all_articles, stats
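
The per-source isolation in `search()` is the heart of the design: a failing source adds an entry to `stats` but never aborts the loop. A minimal standalone sketch (stub fetchers with hypothetical names, no network) demonstrates the same pattern:

```python
# Sketch: per-source failure isolation, using hypothetical stub fetchers.
def fetch_ok(query: str, n: int) -> list:
    return [{"title": f"{query} story {i}"} for i in range(n)]

def fetch_down(query: str, n: int) -> list:
    raise ConnectionError("simulated outage")

def aggregate(query: str, sources: list, n: int = 3):
    articles, failed = [], []
    for name, fetch in sources:
        try:
            articles.extend(fetch(query, n))
        except Exception:  # one bad source must not take down the rest
            failed.append(name)
    return articles, failed

articles, failed = aggregate("python", [("A", fetch_ok), ("B", fetch_down)])
# articles holds A's 3 results; failed == ["B"]
```

Source "B" fails, yet the caller still receives everything "A" produced, plus a record of what went wrong.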


# ============================================================================
# SECTION 5: DISPLAY FUNCTIONS
# ============================================================================

def display_article(article: Article, index: int = None, show_description: bool = True):
    """Display a single article with consistent formatting."""
    if index is not None:
        print(f"\n{index}. {article.title}")
    else:
        print(f"\n{article.title}")
    
    print(f"   📰 {article.source_name} • {article.format_timestamp()}")
    
    if article.author:
        print(f"   ✍️  {article.author}")
    
    if show_description:
        if article.description:
            desc = article.description
            if len(desc) > 150:
                desc = desc[:150] + "..."
            print(f"   {desc}")
        elif article.content_preview:
            preview = article.content_preview
            if len(preview) > 150:
                preview = preview[:150] + "..."
            print(f"   {preview}")
    
    domain = article.get_domain()
    print(f"   🔗 {article.url}")
    print(f"   ({domain})")


def display_article_list(
    articles: List[Article],
    stats: dict = None,
    max_display: int = 10,
    group_by_source: bool = False,
    show_descriptions: bool = True
):
    """Display a list of articles with formatting and statistics."""
    if not articles:
        print("\nNo articles found.")
        return
    
    print("\n" + "=" * 70)
    print("📰 NEWS AGGREGATOR RESULTS")
    print("=" * 70)
    
    if stats:
        print(f"\nQuery: '{stats.get('query', 'N/A')}'")
        print(f"Sources: {', '.join(stats.get('sources_succeeded', []))}")
        
        if stats.get('sources_failed'):
            print(f"Failed: {', '.join(stats.get('sources_failed', []))}")
        
        print(f"Articles: {len(articles)}", end="")
        
        if stats.get('duplicates_removed', 0) > 0:
            print(f" ({stats['duplicates_removed']} duplicates removed)", end="")
        
        print()
    
    if group_by_source:
        articles_sorted = sorted(articles, key=lambda a: a.source_name)
        
        for source_name, group in groupby(articles_sorted, key=lambda a: a.source_name):
            group_list = list(group)
            print(f"\n--- {source_name} ({len(group_list)} articles) ---")
            
            for i, article in enumerate(group_list[:max_display], 1):
                display_article(article, index=i, show_description=show_descriptions)
    else:
        display_count = min(len(articles), max_display)
        
        for i, article in enumerate(articles[:display_count], 1):
            display_article(article, index=i, show_description=show_descriptions)
        
        if len(articles) > max_display:
            remaining = len(articles) - max_display
            print(f"\n... and {remaining} more articles")
    
    print("\n" + "=" * 70)


# ============================================================================
# SECTION 6: INTERACTIVE CLI
# ============================================================================

class NewsAggregatorCLI:
    """Interactive command-line interface for news aggregator."""
    
    def __init__(self):
        self.aggregator = NewsAggregator()
        self.last_results = []
        self.last_stats = {}
    
    def display_banner(self):
        """Display application banner."""
        print("\n" + "=" * 70)
        print("📰 NEWS AGGREGATOR")
        print("=" * 70)
        print("\nSearch news from NewsAPI, The Guardian, and HackerNews")
        print("\nCommands:")
        print("  search   - Search for news articles")
        print("  more            - Show more results from last search")
        print("  group           - Show last results grouped by source")
        print("  help            - Show this help message")
        print("  quit            - Exit application")
        print("=" * 70 + "\n")
    
    def search(self, query: str, max_per_source: int = 10):
        """Execute search and display results."""
        print(f"\nSearching for '{query}'...\n")
        
        articles, stats = self.aggregator.search(
            query,
            max_per_source=max_per_source,
            deduplicate=True,
            sort_by="recency"
        )
        
        self.last_results = articles
        self.last_stats = stats
        
        display_article_list(
            articles,
            stats,
            max_display=10,
            show_descriptions=True
        )
    
    def show_more(self):
        """Show more results from last search."""
        if not self.last_results:
            print("\nNo previous search results. Use 'search ' first.")
            return
        
        display_article_list(
            self.last_results,
            self.last_stats,
            max_display=len(self.last_results),
            show_descriptions=False
        )
    
    def show_grouped(self):
        """Show last results grouped by source."""
        if not self.last_results:
            print("\nNo previous search results. Use 'search ' first.")
            return
        
        display_article_list(
            self.last_results,
            self.last_stats,
            max_display=50,
            group_by_source=True,
            show_descriptions=False
        )
    
    def run(self):
        """Run interactive session."""
        self.display_banner()
        
        while True:
            try:
                user_input = input("aggregator> ").strip()
                
                if not user_input:
                    continue
                
                parts = user_input.split(maxsplit=1)
                command = parts[0].lower()
                args = parts[1] if len(parts) > 1 else ""
                
                if command in ['quit', 'exit', 'q']:
                    print("\nGoodbye!")
                    break
                
                elif command == 'help':
                    self.display_banner()
                
                elif command == 'search':
                    if not args:
                        print("Usage: search ")
                    else:
                        self.search(args)
                
                elif command == 'more':
                    self.show_more()
                
                elif command == 'group':
                    self.show_grouped()
                
                else:
                    print(f"Unknown command: {command}")
                    print("Type 'help' for available commands")
            
            except KeyboardInterrupt:
                print("\n\nGoodbye!")
                break
            
            except EOFError:
                # Ctrl-D: exit cleanly instead of looping on the generic handler
                print("\nGoodbye!")
                break
            
            except Exception as e:
                print(f"\nError: {e}")
                print("Type 'help' for available commands")


# ============================================================================
# MAIN ENTRY POINT
# ============================================================================

def main():
    """Entry point for news aggregator CLI."""
    cli = NewsAggregatorCLI()
    cli.run()


if __name__ == "__main__":
    main()
Using This Code

To run the aggregator:

  1. Set environment variables: export NEWSAPI_KEY=your_key and export GUARDIAN_KEY=your_key
  2. Install dependencies: pip install requests
  3. Run: python news_aggregator.py

The code is organized into six sections with comments explaining each component. Refer back to Sections 2-6 for detailed explanations of the design decisions behind each section.