Automate Document Comparison in Python: A Complete Guide

Document comparison is a routine task in many business workflows, from legal contract review to version control in content management. Comparing documents by hand is slow and error-prone. Python's standard library and a few third-party packages can automate the process and produce consistent, repeatable results.

This guide walks through automating document comparison in Python, with practical code examples you can adapt to your own files.

Why Automate Document Comparison?

Before diving into the code, let's understand why automation matters. Document comparison automation helps organizations track changes between file versions, identify discrepancies in contracts, maintain compliance documentation, and streamline collaboration workflows. Python has tools for text files, Word documents, and PDFs; the examples in this guide cover plain text and Word documents.

Getting Started with Python Document Comparison

The first step in automating document comparison is choosing the right libraries. Python's ecosystem offers several excellent options depending on your document format and comparison requirements.

Comparing Plain Text Files

For basic text file comparison, Python's built-in difflib module provides a robust solution without requiring external dependencies. Here's a practical example:

python

import difflib

def compare_text_files(file1_path, file2_path):
    # Read the contents of both files
    with open(file1_path, 'r', encoding='utf-8') as f1:
        file1_lines = f1.readlines()
    
    with open(file2_path, 'r', encoding='utf-8') as f2:
        file2_lines = f2.readlines()
    
    # Create a Differ object
    differ = difflib.Differ()
    
    # Compare the files
    diff = list(differ.compare(file1_lines, file2_lines))
    
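    # Display the differences
    for line in diff:
        # Drop the two-character Differ prefix and the trailing newline
        text = line[2:].rstrip('\n')
        if line.startswith('+ '):
            print(f"Added: {text}")
        elif line.startswith('- '):
            print(f"Removed: {text}")
        elif line.startswith('? '):
            # '? ' lines are Differ's guide lines: they mark which characters
            # changed in the line above and are not document content
            print(f"Hint: {text}")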
    
    return diff

# Usage
compare_text_files('document_v1.txt', 'document_v2.txt')

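This function reads two text files and prints a line-by-line diff, marking added and removed lines. Lines that Differ prefixes with '? ' are not document content; they are guide lines that point at the characters that changed within a modified line. The difflib module can also generate unified diffs, similar to those produced by version control systems, through difflib.unified_diff.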
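
Since the paragraph above mentions unified diffs, here is a minimal sketch of producing one with difflib.unified_diff. The function name unified_diff_text and the sample contract lines are illustrative assumptions, not part of the original example; the inputs are lists of lines without trailing newlines, which is why lineterm is set to an empty string.

```python
import difflib

def unified_diff_text(old_lines, new_lines,
                      old_name="document_v1.txt", new_name="document_v2.txt"):
    """Return a unified diff of two lists of lines as a single string."""
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=old_name,
        tofile=new_name,
        lineterm="",  # the sample lines below carry no trailing newlines
    )
    return "\n".join(diff)

# Illustrative sample data (hypothetical contract text)
old = ["Payment is due in 30 days.", "Late fees apply."]
new = ["Payment is due in 45 days.", "Late fees apply."]

result = unified_diff_text(old, new)
print(result)
```

Removed lines start with "-", added lines start with "+", and unchanged context lines start with a space, which is the same layout that tools such as git diff use.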

Advanced Text Comparison with HTML Output

For more sophisticated comparisons with visual output, you can generate HTML diff reports:

python

import difflib

def generate_html_diff(file1_path, file2_path, output_path='diff_report.html'):
    # Read file contents
    with open(file1_path, 'r', encoding='utf-8') as f1:
        file1_content = f1.readlines()
    
    with open(file2_path, 'r', encoding='utf-8') as f2:
        file2_content = f2.readlines()
    
    # Generate HTML diff
    html_diff = difflib.HtmlDiff()
    html_output = html_diff.make_file(
        file1_content, 
        file2_content,
        fromdesc='Original Document',
        todesc='Modified Document'
    )
    
    # Save to HTML file
    with open(output_path, 'w', encoding='utf-8') as output:
        output.write(html_output)
    
    print(f"HTML diff report generated: {output_path}")

# Usage
generate_html_diff('contract_v1.txt', 'contract_v2.txt')

This creates a color-coded HTML report that makes it easy to visualize changes, perfect for sharing comparison results with team members or clients.
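
For long documents, a full side-by-side report can be hard to scan. As a sketch, HtmlDiff.make_file accepts context=True and numlines to show only the changed lines plus a few surrounding lines. The function name html_diff_with_context and the generated "Clause" lines are hypothetical sample data.

```python
import difflib

def html_diff_with_context(old_lines, new_lines, numlines=2):
    """Build an HTML diff that shows only changes and nearby context lines."""
    html_diff = difflib.HtmlDiff(wrapcolumn=80)  # wrap long lines at 80 columns
    return html_diff.make_file(
        old_lines,
        new_lines,
        fromdesc="Original Document",
        todesc="Modified Document",
        context=True,       # hide unchanged regions far from any change
        numlines=numlines,  # number of context lines around each change
    )

# Hypothetical sample: 50 clauses with one amended in the middle
old = [f"Clause {i}\n" for i in range(1, 51)]
new = list(old)
new[24] = "Clause 25 (amended)\n"

context_html = html_diff_with_context(old, new)
full_html = difflib.HtmlDiff().make_file(old, new)
print(len(context_html), len(full_html))
```

The context report is much shorter than the full one because the untouched clauses are omitted.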

Comparing Word Documents

For Word document comparison, the python-docx library enables you to extract and compare content from DOCX files:

python

import difflib

from docx import Document

def compare_word_documents(doc1_path, doc2_path):
    # Load both documents
    doc1 = Document(doc1_path)
    doc2 = Document(doc2_path)
    
    # Extract text from body paragraphs
    # (text inside tables, headers, and footers is not included)
    doc1_text = [para.text for para in doc1.paragraphs]
    doc2_text = [para.text for para in doc2.paragraphs]
    
    # Use difflib for comparison
    differ = difflib.Differ()
    diff = list(differ.compare(doc1_text, doc2_text))
    
    # Count changes
    additions = sum(1 for line in diff if line.startswith('+ '))
    deletions = sum(1 for line in diff if line.startswith('- '))
    
    print(f"Total additions: {additions}")
    print(f"Total deletions: {deletions}")
    
    # Display differences
    for line in diff:
        if line.startswith('+ ') or line.startswith('- '):
            print(line)
    
    return diff

# Install with: pip install python-docx
# Usage
compare_word_documents('report_v1.docx', 'report_v2.docx')
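
The counts above treat an edited paragraph as one deletion plus one addition. As a sketch of an alternative, SequenceMatcher.get_opcodes can group changes into replace, insert, and delete operations. The example works on plain lists of paragraph strings, like those extracted from doc.paragraphs above, so it runs without python-docx; the function name and the sample paragraphs are illustrative assumptions.

```python
import difflib

def summarize_paragraph_changes(old_paragraphs, new_paragraphs):
    """Group paragraph-level changes into replace/insert/delete operations."""
    matcher = difflib.SequenceMatcher(
        None, old_paragraphs, new_paragraphs, autojunk=False
    )
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # skip unchanged runs of paragraphs
        changes.append({
            "type": tag,
            "old": old_paragraphs[i1:i2],
            "new": new_paragraphs[j1:j2],
        })
    return changes

# Hypothetical paragraph text standing in for two versions of a report
old = ["Introduction", "Revenue grew 5%.", "Outlook is stable."]
new = ["Introduction", "Revenue grew 7%.", "Outlook is stable.", "Appendix A"]

changes = summarize_paragraph_changes(old, new)
for change in changes:
    print(change["type"], change["old"], "->", change["new"])
```

Here the edited revenue paragraph is reported as a single replace and the new appendix as an insert.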

Calculating Similarity Ratios

Sometimes you need to know how similar two documents are rather than seeing every change. The SequenceMatcher class provides similarity metrics:

python

import difflib

def calculate_document_similarity(file1_path, file2_path):
    # Read file contents
    with open(file1_path, 'r', encoding='utf-8') as f1:
        content1 = f1.read()
    
    with open(file2_path, 'r', encoding='utf-8') as f2:
        content2 = f2.read()
    
    # Calculate similarity ratio
    sequence_matcher = difflib.SequenceMatcher(None, content1, content2)
    similarity_ratio = sequence_matcher.ratio()
    
    print(f"Similarity: {similarity_ratio * 100:.2f}%")
    print(f"Difference: {(1 - similarity_ratio) * 100:.2f}%")
    
    return similarity_ratio

# Usage
calculate_document_similarity('policy_v1.txt', 'policy_v2.txt')

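This function returns a value between 0 and 1, where 1 means the documents are identical. This is useful for quick assessments or filtering documents that require detailed review. Keep in mind that SequenceMatcher has quadratic worst-case running time, so comparing very large files character by character can be slow; comparing lists of lines instead is usually faster.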
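
To make the ratio less of a black box: difflib defines it as 2 * M / T, where M is the number of matching elements and T is the total number of elements in both sequences. The sketch below recomputes it by hand from the matching blocks; ratio_by_hand is a hypothetical helper name, and the two short strings are only there to keep the arithmetic easy to follow.

```python
import difflib

def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2 * M / T from matching blocks."""
    matcher = difflib.SequenceMatcher(None, a, b)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2 * matched / total if total else 1.0

a, b = "abcd", "bcde"
manual = ratio_by_hand(a, b)
builtin = difflib.SequenceMatcher(None, a, b).ratio()

# "bcd" matches (M = 3) and the strings have 8 characters in total (T = 8)
print(manual, builtin)
```

Both values are 0.75, because 2 * 3 / 8 = 0.75.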

Automating Batch Comparisons

For processing multiple document pairs, create a batch comparison function:

python

import os
import difflib

def batch_compare_documents(directory, file_pairs):
    results = []
    
    for file1, file2 in file_pairs:
        file1_path = os.path.join(directory, file1)
        file2_path = os.path.join(directory, file2)
        
        with open(file1_path, 'r', encoding='utf-8') as f1:
            content1 = f1.read()
        with open(file2_path, 'r', encoding='utf-8') as f2:
            content2 = f2.read()
        
        similarity = difflib.SequenceMatcher(None, content1, content2).ratio()
        results.append({
            'pair': (file1, file2),
            'similarity': similarity
        })
    
    return results

# Usage
pairs = [('doc1_v1.txt', 'doc1_v2.txt'), ('doc2_v1.txt', 'doc2_v2.txt')]
comparison_results = batch_compare_documents('/path/to/docs', pairs)
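
One way to use the batch results is to flag only the pairs that changed enough to need a human review. This is a minimal sketch: flag_for_review, the 0.9 threshold, and the similarity values in sample_results are all illustrative assumptions, with the list shaped like the output of batch_compare_documents.

```python
def flag_for_review(results, threshold=0.9):
    """Return pairs whose similarity is below the threshold, least similar first."""
    flagged = [r for r in results if r["similarity"] < threshold]
    return sorted(flagged, key=lambda r: r["similarity"])

# Hypothetical results in the same shape batch_compare_documents returns
sample_results = [
    {"pair": ("doc1_v1.txt", "doc1_v2.txt"), "similarity": 0.98},
    {"pair": ("doc2_v1.txt", "doc2_v2.txt"), "similarity": 0.62},
    {"pair": ("doc3_v1.txt", "doc3_v2.txt"), "similarity": 0.85},
]

flagged = flag_for_review(sample_results)
for item in flagged:
    print(f"{item['pair'][0]} vs {item['pair'][1]}: {item['similarity']:.0%}")
```

The right threshold depends on the documents; a value that suits short policies may be too strict or too loose for long contracts.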

Conclusion

Automating document comparison in Python is straightforward and highly customizable. Whether you're comparing simple text files, Word documents, or generating visual diff reports, Python's libraries provide the tools you need. Start with the built-in difflib module for basic comparisons, then expand to specialized libraries like python-docx for specific document formats. By implementing these automation solutions, you'll save time, reduce errors, and improve your document management workflow significantly.

Automate Scenario Analysis in Python: A Comprehensive Guide with Code Examples

Scenario analysis is a powerful technique used by financial analysts, data scientists, and business strategists to evaluate different possible outcomes based on varying assumptions. Whether you're forecasting revenue, assessing investment risks, or planning for multiple business scenarios, automating this process in Python can save time, improve accuracy, and enable more sophisticated analysis.

In this comprehensive guide, we'll explore how to automate scenario analysis in Python with practical code examples you can implement immediately in your workflow.

Understanding Scenario Analysis

Scenario analysis involves creating multiple versions of a model based on different input parameters or assumptions. Instead of relying on a single forecast, you examine best-case, worst-case, and most-likely scenarios to understand the range of possible outcomes. This approach is invaluable for risk management, strategic planning, and decision-making under uncertainty.

Setting Up Your Python Environment

Before diving into automation, ensure you have the necessary libraries installed. The examples below use NumPy for numerical computation, pandas for data manipulation, and Matplotlib for visualization, plus itertools from the standard library.

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import product

# Install with: pip install numpy pandas matplotlib

Basic Scenario Analysis Framework

Let's start with a simple revenue forecasting model that evaluates different scenarios based on varying price points and sales volumes:

python

def revenue_scenario_analysis(base_price, base_volume, scenarios):
    """
    Perform scenario analysis for revenue forecasting
    
    Parameters:
    base_price: baseline price per unit
    base_volume: baseline sales volume
    scenarios: dict with scenario names and multipliers
    """
    results = {}
    
    for scenario_name, multipliers in scenarios.items():
        price_mult = multipliers.get('price', 1.0)
        volume_mult = multipliers.get('volume', 1.0)
        
        scenario_price = base_price * price_mult
        scenario_volume = base_volume * volume_mult
        scenario_revenue = scenario_price * scenario_volume
        
        results[scenario_name] = {
            'price': scenario_price,
            'volume': scenario_volume,
            'revenue': scenario_revenue
        }
    
    return pd.DataFrame(results).T

# Define scenarios
scenarios = {
    'Best Case': {'price': 1.15, 'volume': 1.20},
    'Base Case': {'price': 1.0, 'volume': 1.0},
    'Worst Case': {'price': 0.90, 'volume': 0.80}
}

# Run analysis
results = revenue_scenario_analysis(base_price=100, base_volume=10000, scenarios=scenarios)
print(results)

This function creates a structured framework for evaluating different scenarios, returning a clean DataFrame with all results for easy comparison and analysis.
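
To check the arithmetic behind the table, the sketch below recomputes each scenario's revenue with plain Python and expresses it as a percentage change against the base case. The helper name revenue_vs_base is an assumption for illustration; the inputs are the same multipliers used above.

```python
def revenue_vs_base(base_price, base_volume, scenarios, base_name="Base Case"):
    """Compute revenue per scenario and its percentage change versus the base case."""
    revenues = {
        name: (base_price * m.get("price", 1.0)) * (base_volume * m.get("volume", 1.0))
        for name, m in scenarios.items()
    }
    base_revenue = revenues[base_name]
    return {
        name: {
            "revenue": round(revenue, 2),
            "change_vs_base_%": round((revenue / base_revenue - 1) * 100, 2),
        }
        for name, revenue in revenues.items()
    }

scenarios = {
    "Best Case": {"price": 1.15, "volume": 1.20},
    "Base Case": {"price": 1.0, "volume": 1.0},
    "Worst Case": {"price": 0.90, "volume": 0.80},
}

summary = revenue_vs_base(100, 10000, scenarios)
for name, row in summary.items():
    print(name, row)
```

Because the multipliers compound, the best case is 1.15 * 1.20 = 1.38 times the base revenue (+38%) and the worst case is 0.90 * 0.80 = 0.72 times (-28%).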

Monte Carlo Simulation for Probabilistic Scenarios

For more sophisticated scenario analysis, Monte Carlo simulation generates thousands of potential outcomes based on probability distributions. This approach is particularly useful when dealing with uncertainty:

python

def monte_carlo_scenario_analysis(iterations=10000):
    """
    Perform Monte Carlo simulation for investment returns
    """
    # Define parameters with uncertainty
    np.random.seed(42)
    
    # Simulate random variables
    market_growth = np.random.normal(0.07, 0.15, iterations)
    inflation_rate = np.random.normal(0.03, 0.02, iterations)
    initial_investment = 100000
    years = 5
    
    # Calculate outcomes
    final_values = initial_investment * (1 + market_growth - inflation_rate) ** years
    
    # Analyze results
    results = {
        'Mean': np.mean(final_values),
        'Median': np.median(final_values),
        'Std Dev': np.std(final_values),
        '5th Percentile': np.percentile(final_values, 5),
        '95th Percentile': np.percentile(final_values, 95),
        'Min': np.min(final_values),
        'Max': np.max(final_values)
    }
    
    return results, final_values

# Run simulation
stats, outcomes = monte_carlo_scenario_analysis(iterations=10000)
print("Investment Scenario Analysis:")
for key, value in stats.items():
    print(f"{key}: ${value:,.2f}")

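This Monte Carlo approach provides a probabilistic view of outcomes, helping decision-makers understand the range and likelihood of different results. Note that each simulated path draws one growth rate and one inflation rate and applies them to all five years, so the model does not capture year-to-year variation, and it approximates the real return as growth minus inflation.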
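
A common follow-up question is how likely the investment is to end below its starting value. The sketch below estimates that from simulated outcomes. It is an illustration under stated assumptions: probability_of_loss is a hypothetical helper, and the simulation uses np.random.default_rng instead of np.random.seed, so its draws will not match the function above even with the same seed value.

```python
import numpy as np

def probability_of_loss(final_values, initial_investment):
    """Estimate the share of simulated outcomes that end below the initial investment."""
    final_values = np.asarray(final_values, dtype=float)
    return float(np.mean(final_values < initial_investment))

# Same distribution parameters as the simulation above
rng = np.random.default_rng(42)
iterations = 10_000
initial_investment = 100_000
market_growth = rng.normal(0.07, 0.15, iterations)
inflation_rate = rng.normal(0.03, 0.02, iterations)
final_values = initial_investment * (1 + market_growth - inflation_rate) ** 5

loss_probability = probability_of_loss(final_values, initial_investment)
print(f"Estimated probability of loss: {loss_probability:.1%}")
```

With one rate applied to all five years, a path ends in a loss exactly when its drawn real rate is negative, so this estimate is also the share of draws where inflation exceeds market growth.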

Multi-Variable Scenario Analysis

Real-world scenarios often involve multiple interacting variables. Here's how to automate analysis across multiple dimensions:

python

def multi_variable_scenario_analysis(base_params, variable_ranges):
    """
    Analyze scenarios across multiple variables simultaneously
    
    Parameters:
    base_params: dict of baseline parameters
    variable_ranges: dict with variable names and possible values
    """
    # Generate all combinations
    var_names = list(variable_ranges.keys())
    var_values = list(variable_ranges.values())
    combinations = list(product(*var_values))
    
    results = []
    
    for combo in combinations:
        scenario = base_params.copy()
        scenario_name_parts = []
        
        for var_name, value in zip(var_names, combo):
            scenario[var_name] = value
            scenario_name_parts.append(f"{var_name}={value}")
        
        # Calculate business metrics
        revenue = scenario['price'] * scenario['volume']
        costs = scenario['fixed_costs'] + (scenario['variable_cost'] * scenario['volume'])
        profit = revenue - costs
        margin = (profit / revenue) * 100 if revenue > 0 else 0
        
        results.append({
            'Scenario': ' | '.join(scenario_name_parts),
            'Revenue': revenue,
            'Costs': costs,
            'Profit': profit,
            'Margin %': margin
        })
    
    return pd.DataFrame(results)

# Define baseline and variables
base_params = {
    'fixed_costs': 50000,
    'variable_cost': 30
}

variable_ranges = {
    'price': [80, 100, 120],
    'volume': [800, 1000, 1200]
}

# Run analysis
scenario_results = multi_variable_scenario_analysis(base_params, variable_ranges)
print(scenario_results.sort_values('Profit', ascending=False))

This function automatically generates all possible scenario combinations, making it easy to identify optimal parameter configurations.
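
The core of the approach is itertools.product, which yields every combination of the candidate values. The sketch below shows the same enumeration with plain dictionaries and picks the most profitable combination; best_combination is a hypothetical helper, and the inputs are the same figures used above.

```python
from itertools import product

def best_combination(base_params, variable_ranges):
    """Enumerate all combinations and return (count, best_settings, best_profit)."""
    names = list(variable_ranges)
    best_settings, best_profit, count = None, None, 0
    for combo in product(*variable_ranges.values()):
        settings = dict(zip(names, combo))
        scenario = {**base_params, **settings}
        revenue = scenario["price"] * scenario["volume"]
        costs = scenario["fixed_costs"] + scenario["variable_cost"] * scenario["volume"]
        profit = revenue - costs
        count += 1
        if best_profit is None or profit > best_profit:
            best_settings, best_profit = settings, profit
    return count, best_settings, best_profit

base_params = {"fixed_costs": 50000, "variable_cost": 30}
variable_ranges = {"price": [80, 100, 120], "volume": [800, 1000, 1200]}

count, settings, profit = best_combination(base_params, variable_ranges)
print(count, settings, profit)
```

Three prices and three volumes give 3 * 3 = 9 scenarios. The count grows multiplicatively with each added variable, so large grids can become expensive to evaluate.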

Sensitivity Analysis Automation

Sensitivity analysis examines how changes in individual variables impact outcomes. Here's an automated approach:

python

def sensitivity_analysis(base_model, variables, change_range=(-0.3, 0.3), steps=10):
    """
    Perform sensitivity analysis on model variables
    
    Parameters:
    base_model: function that returns outcome given parameters
    variables: dict of variable names and base values
    change_range: tuple of (min_change, max_change) as fractions, e.g. -0.3 means -30%
    steps: number of steps in the analysis
    """
    results = {}
    
    for var_name, base_value in variables.items():
        sensitivity_data = []
        
        # Create range of values
        multipliers = np.linspace(1 + change_range[0], 1 + change_range[1], steps)
        
        for mult in multipliers:
            test_params = variables.copy()
            test_params[var_name] = base_value * mult
            outcome = base_model(**test_params)
            
            sensitivity_data.append({
                'change_%': (mult - 1) * 100,
                'value': test_params[var_name],
                'outcome': outcome
            })
        
        results[var_name] = pd.DataFrame(sensitivity_data)
    
    return results

# Example model function
def profit_model(price, volume, cost_per_unit, fixed_costs):
    revenue = price * volume
    variable_costs = cost_per_unit * volume
    return revenue - variable_costs - fixed_costs

# Define variables
variables = {
    'price': 100,
    'volume': 1000,
    'cost_per_unit': 60,
    'fixed_costs': 15000
}

# Run sensitivity analysis
sensitivity_results = sensitivity_analysis(profit_model, variables, change_range=(-0.2, 0.2), steps=15)

# Display results for price sensitivity
print("Price Sensitivity Analysis:")
print(sensitivity_results['price'])
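
A compact way to compare sensitivities is to rank variables by how far the outcome swings when each one moves from -20% to +20% while the others stay at their base values, which is the idea behind a tornado chart. The sketch below is self-contained and uses the same profit model and base values; rank_by_swing is a hypothetical helper name.

```python
def profit_model(price, volume, cost_per_unit, fixed_costs):
    """Profit = revenue - variable costs - fixed costs."""
    return price * volume - cost_per_unit * volume - fixed_costs

def rank_by_swing(model, variables, change=0.2):
    """Rank variables by the outcome swing between -change and +change."""
    swings = {}
    for name, base_value in variables.items():
        low = model(**{**variables, name: base_value * (1 - change)})
        high = model(**{**variables, name: base_value * (1 + change)})
        swings[name] = abs(high - low)
    # Largest swing first: these variables matter most to the outcome
    return sorted(swings.items(), key=lambda item: item[1], reverse=True)

variables = {"price": 100, "volume": 1000, "cost_per_unit": 60, "fixed_costs": 15000}

ranking = rank_by_swing(profit_model, variables)
for name, swing in ranking:
    print(f"{name}: ${swing:,.0f}")
```

For these inputs price has the largest effect on profit, followed by cost per unit, volume, and fixed costs.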

Visualizing Scenario Analysis Results

Visualization makes scenario analysis insights actionable. Here's how to create compelling charts:

python

def visualize_scenarios(scenario_data, metric='profit'):
    """
    Create visualization for scenario analysis results
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Bar chart comparison
    scenarios = scenario_data['Scenario']
    values = scenario_data[metric.capitalize()]
    
    colors = ['green' if v > 0 else 'red' for v in values]
    ax1.bar(range(len(scenarios)), values, color=colors, alpha=0.7)
    ax1.set_xticks(range(len(scenarios)))
    ax1.set_xticklabels(scenarios, rotation=45, ha='right')
    ax1.set_ylabel(f'{metric.capitalize()} ($)')
    ax1.set_title(f'Scenario Comparison - {metric.capitalize()}')
    ax1.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
    ax1.grid(axis='y', alpha=0.3)
    
    # Cumulative line chart for the top 5 scenarios (not a true waterfall chart)
    sorted_data = scenario_data.nlargest(5, metric.capitalize())
    cumsum = sorted_data[metric.capitalize()].cumsum()
    
    ax2.plot(range(len(sorted_data)), cumsum, marker='o', linewidth=2)
    ax2.fill_between(range(len(sorted_data)), cumsum, alpha=0.3)
    ax2.set_xticks(range(len(sorted_data)))
    ax2.set_xticklabels(sorted_data['Scenario'], rotation=45, ha='right')
    ax2.set_ylabel(f'Cumulative {metric.capitalize()} ($)')
    ax2.set_title('Top 5 Scenarios - Cumulative Impact')
    ax2.grid(alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('scenario_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()

# Visualize results
visualize_scenarios(scenario_results)

Automating Scenario Reports

Finally, create automated reports that summarize your scenario analysis:

python

def generate_scenario_report(results_df, output_file='scenario_report.txt'):
    """
    Generate automated text report from scenario analysis
    """
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write("=" * 60 + "\n")
        f.write("SCENARIO ANALYSIS REPORT\n")
        f.write("=" * 60 + "\n\n")
        
        f.write(f"Total Scenarios Analyzed: {len(results_df)}\n\n")
        
        f.write("BEST CASE SCENARIO:\n")
        best = results_df.nlargest(1, 'Profit').iloc[0]
        f.write(f"  {best['Scenario']}\n")
        f.write(f"  Profit: ${best['Profit']:,.2f}\n")
        f.write(f"  Margin: {best['Margin %']:.2f}%\n\n")
        
        f.write("WORST CASE SCENARIO:\n")
        worst = results_df.nsmallest(1, 'Profit').iloc[0]
        f.write(f"  {worst['Scenario']}\n")
        f.write(f"  Profit: ${worst['Profit']:,.2f}\n")
        f.write(f"  Margin: {worst['Margin %']:.2f}%\n\n")
        
        f.write("SUMMARY STATISTICS:\n")
        f.write(f"  Average Profit: ${results_df['Profit'].mean():,.2f}\n")
        f.write(f"  Median Profit: ${results_df['Profit'].median():,.2f}\n")
        f.write(f"  Profit Range: ${results_df['Profit'].max() - results_df['Profit'].min():,.2f}\n")
    
    print(f"Report generated: {output_file}")

# Generate report
generate_scenario_report(scenario_results)

Conclusion

Automating scenario analysis in Python transforms how organizations approach strategic planning and risk assessment. By leveraging Python's powerful libraries and the code examples provided, you can build sophisticated scenario analysis systems that generate insights quickly and accurately. Whether you're using simple scenario comparisons, Monte Carlo simulations, or multi-variable sensitivity analysis, Python provides the flexibility and power to handle complex analytical challenges. Start with these examples and customize them to fit your specific business needs for better decision-making under uncertainty.
