Automate Document Comparison in Python: A Complete Guide
Document comparison is a critical task in many business workflows, from legal contract reviews to version control in content management. Manually comparing documents is time-consuming and error-prone. Fortunately, Python offers libraries that can automate this process, cutting hours of manual review and reducing the risk of missed changes.
In this comprehensive guide, we'll explore how to automate document comparison in Python using practical code examples that you can implement immediately.
Why Automate Document Comparison?
Before diving into the code, let's understand why automation matters. Document comparison automation helps organizations track changes between file versions, identify discrepancies in contracts, maintain compliance documentation, and streamline collaboration workflows. Whether you're working with text files, Word documents, or PDFs, Python provides the tools to handle these tasks efficiently.
Getting Started with Python Document Comparison
The first step in automating document comparison is choosing the right libraries. Python's ecosystem offers several excellent options depending on your document format and comparison requirements.
Comparing Plain Text Files
For basic text file comparison, Python's built-in difflib module provides a robust solution without requiring external dependencies. Here's a practical example:
python
import difflib

def compare_text_files(file1_path, file2_path):
    # Read the contents of both files
    with open(file1_path, 'r', encoding='utf-8') as f1:
        file1_lines = f1.readlines()
    with open(file2_path, 'r', encoding='utf-8') as f2:
        file2_lines = f2.readlines()

    # Create a Differ object
    differ = difflib.Differ()

    # Compare the files
    diff = list(differ.compare(file1_lines, file2_lines))

    # Display the differences
    for line in diff:
        if line.startswith('+ '):
            print(f"Added: {line[2:].rstrip()}")
        elif line.startswith('- '):
            print(f"Removed: {line[2:].rstrip()}")
        elif line.startswith('? '):
            # '? ' lines are Differ's guide lines: they point at the
            # characters that changed in the line printed just above
            print(f"Markers: {line[2:].rstrip()}")

    return diff

# Usage
compare_text_files('document_v1.txt', 'document_v2.txt')
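This function reads two text files and prints their line-by-line differences, marking additions and deletions. A modified line appears as a removal followed by an addition, and the lines Differ prefixes with '? ' are guide lines that point at the characters that changed. The difflib module can also generate unified diffs similar to those of version control systems, through its unified_diff function.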
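To make the unified diff mention concrete, here is a minimal sketch that uses difflib.unified_diff on two in-memory strings. The helper name unified_diff_text and the sample contract sentences are made up for illustration; only unified_diff itself comes from the standard library.

```python
import difflib

def unified_diff_text(old_text, new_text, old_name='old', new_name='new'):
    """Return a unified diff of two strings as a single string."""
    # keepends=True keeps the newline on each line so the diff joins cleanly
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=old_name, tofile=new_name
    )
    return ''.join(diff)

# Hypothetical sample content, not taken from a real contract
old = "Payment is due in 30 days.\nLate fees apply.\n"
new = "Payment is due in 45 days.\nLate fees apply.\n"
print(unified_diff_text(old, new, 'contract_v1.txt', 'contract_v2.txt'))
```

The output uses the familiar "---", "+++" and "@@" headers, with removed lines prefixed by "-" and added lines by "+". When the two inputs are identical, the function returns an empty string, which makes it easy to skip unchanged files.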
Advanced Text Comparison with HTML Output
For more sophisticated comparisons with visual output, you can generate HTML diff reports:
python
import difflib

def generate_html_diff(file1_path, file2_path, output_path='diff_report.html'):
    # Read file contents
    with open(file1_path, 'r', encoding='utf-8') as f1:
        file1_content = f1.readlines()
    with open(file2_path, 'r', encoding='utf-8') as f2:
        file2_content = f2.readlines()

    # Generate HTML diff
    html_diff = difflib.HtmlDiff()
    html_output = html_diff.make_file(
        file1_content,
        file2_content,
        fromdesc='Original Document',
        todesc='Modified Document'
    )

    # Save to HTML file
    with open(output_path, 'w', encoding='utf-8') as output:
        output.write(html_output)

    print(f"HTML diff report generated: {output_path}")

# Usage
generate_html_diff('contract_v1.txt', 'contract_v2.txt')
This creates a color-coded HTML report that makes it easy to visualize changes, perfect for sharing comparison results with team members or clients.
Comparing Word Documents
For Word document comparison, the python-docx library enables you to extract and compare content from DOCX files:
python
# Install with: pip install python-docx
import difflib
from docx import Document

def compare_word_documents(doc1_path, doc2_path):
    # Load both documents
    doc1 = Document(doc1_path)
    doc2 = Document(doc2_path)

    # Extract text from body paragraphs (text inside tables is not included)
    doc1_text = [para.text for para in doc1.paragraphs]
    doc2_text = [para.text for para in doc2.paragraphs]

    # Use difflib for comparison
    differ = difflib.Differ()
    diff = list(differ.compare(doc1_text, doc2_text))

    # Count changes
    additions = sum(1 for line in diff if line.startswith('+ '))
    deletions = sum(1 for line in diff if line.startswith('- '))
    print(f"Total additions: {additions}")
    print(f"Total deletions: {deletions}")

    # Display differences
    for line in diff:
        if line.startswith('+ ') or line.startswith('- '):
            print(line)

    return diff

# Usage
compare_word_documents('report_v1.docx', 'report_v2.docx')
Calculating Similarity Ratios
Sometimes you need to know how similar two documents are rather than seeing every change. The SequenceMatcher class provides similarity metrics:
python
import difflib

def calculate_document_similarity(file1_path, file2_path):
    # Read file contents
    with open(file1_path, 'r', encoding='utf-8') as f1:
        content1 = f1.read()
    with open(file2_path, 'r', encoding='utf-8') as f2:
        content2 = f2.read()

    # Calculate similarity ratio
    sequence_matcher = difflib.SequenceMatcher(None, content1, content2)
    similarity_ratio = sequence_matcher.ratio()

    print(f"Similarity: {similarity_ratio * 100:.2f}%")
    print(f"Difference: {(1 - similarity_ratio) * 100:.2f}%")

    return similarity_ratio

# Usage
calculate_document_similarity('policy_v1.txt', 'policy_v2.txt')
This function returns a value between 0 and 1, where 1 means the documents are identical. This is useful for quick assessments or filtering documents that require detailed review.
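It can help to see where that number comes from. SequenceMatcher computes the ratio as 2*M/T, where M is the number of matched elements and T is the total number of elements in both sequences. The sketch below applies it to characters and to whole lines. The helper name similarity_ratio and the sample text are illustrative, and passing autojunk=False (which switches off difflib's heuristic for very frequent elements in long sequences) is an assumption about what you want, not a requirement.

```python
import difflib

def similarity_ratio(a, b):
    """Ratio 2*M/T: M = matched elements, T = total elements in both sequences."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

# Character level: "bcd" is shared, so M = 3 and T = 4 + 4 = 8
char_ratio = similarity_ratio("abcd", "bcde")

# Line level: pass lists of lines instead of raw strings
old_lines = "Scope\nPayment: 30 days\nTermination\n".splitlines()
new_lines = "Scope\nPayment: 45 days\nTermination\n".splitlines()
line_ratio = similarity_ratio(old_lines, new_lines)  # 2 of 3 lines match on each side

print(f"Character-level ratio: {char_ratio:.2f}")
print(f"Line-level ratio: {line_ratio:.2f}")
```

Line-level comparison treats a line with a one-character edit as entirely different, so it tends to give lower scores than character-level comparison but runs much faster on long documents.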
Automating Batch Comparisons
For processing multiple document pairs, create a batch comparison function:
python
import os
import difflib

def batch_compare_documents(directory, file_pairs):
    results = []

    for file1, file2 in file_pairs:
        file1_path = os.path.join(directory, file1)
        file2_path = os.path.join(directory, file2)

        with open(file1_path, 'r', encoding='utf-8') as f1:
            content1 = f1.read()
        with open(file2_path, 'r', encoding='utf-8') as f2:
            content2 = f2.read()

        similarity = difflib.SequenceMatcher(None, content1, content2).ratio()

        results.append({
            'pair': (file1, file2),
            'similarity': similarity
        })

    return results

# Usage
pairs = [('doc1_v1.txt', 'doc1_v2.txt'), ('doc2_v1.txt', 'doc2_v2.txt')]
comparison_results = batch_compare_documents('/path/to/docs', pairs)
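Once a batch has run, a common next step is to pick out the pairs that changed enough to deserve a closer look. The following sketch assumes the list-of-dicts shape returned by batch_compare_documents. The function name flag_for_review, the 0.9 threshold, and the sample scores are all hypothetical values chosen for illustration.

```python
def flag_for_review(results, threshold=0.9):
    """Return pairs whose similarity is below the threshold, least similar first."""
    flagged = [r for r in results if r['similarity'] < threshold]
    return sorted(flagged, key=lambda r: r['similarity'])

# Made-up scores in the same shape batch_compare_documents returns
sample_results = [
    {'pair': ('doc1_v1.txt', 'doc1_v2.txt'), 'similarity': 0.98},
    {'pair': ('doc2_v1.txt', 'doc2_v2.txt'), 'similarity': 0.61},
    {'pair': ('doc3_v1.txt', 'doc3_v2.txt'), 'similarity': 0.85},
]

for item in flag_for_review(sample_results):
    old_name, new_name = item['pair']
    print(f"{old_name} -> {new_name}: {item['similarity']:.0%} similar")
```

The right threshold depends on the documents: a contract may warrant review after any change, while marketing copy may only need attention once large sections differ.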
Conclusion
Automating document comparison in Python is straightforward and highly customizable. Whether you're comparing simple text files, Word documents, or generating visual diff reports, Python's libraries provide the tools you need. Start with the built-in difflib module for basic comparisons, then expand to specialized libraries like python-docx for specific document formats. By implementing these automation solutions, you'll save time, reduce errors, and improve your document management workflow significantly.
write a 1000 word SEO blog about how you can automate scenario analysis in python -- give code examples
Automate Scenario Analysis in Python: A Comprehensive Guide with Code Examples
Scenario analysis is a powerful technique used by financial analysts, data scientists, and business strategists to evaluate different possible outcomes based on varying assumptions. Whether you're forecasting revenue, assessing investment risks, or planning for multiple business scenarios, automating this process in Python can save time, improve accuracy, and enable more sophisticated analysis.
In this comprehensive guide, we'll explore how to automate scenario analysis in Python with practical code examples you can implement immediately in your workflow.
Understanding Scenario Analysis
Scenario analysis involves creating multiple versions of a model based on different input parameters or assumptions. Instead of relying on a single forecast, you examine best-case, worst-case, and most-likely scenarios to understand the range of possible outcomes. This approach is invaluable for risk management, strategic planning, and decision-making under uncertainty.
Setting Up Your Python Environment
Before diving into automation, ensure you have the necessary libraries installed. The core tools for scenario analysis include NumPy for numerical computations, Pandas for data manipulation, and Matplotlib for visualization.
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import product
# Install with: pip install numpy pandas matplotlib
Basic Scenario Analysis Framework
Let's start with a simple revenue forecasting model that evaluates different scenarios based on varying price points and sales volumes:
python
def revenue_scenario_analysis(base_price, base_volume, scenarios):
    """
    Perform scenario analysis for revenue forecasting

    Parameters:
    base_price: baseline price per unit
    base_volume: baseline sales volume
    scenarios: dict with scenario names and multipliers
    """
    results = {}

    for scenario_name, multipliers in scenarios.items():
        price_mult = multipliers.get('price', 1.0)
        volume_mult = multipliers.get('volume', 1.0)

        scenario_price = base_price * price_mult
        scenario_volume = base_volume * volume_mult
        scenario_revenue = scenario_price * scenario_volume

        results[scenario_name] = {
            'price': scenario_price,
            'volume': scenario_volume,
            'revenue': scenario_revenue
        }

    return pd.DataFrame(results).T

# Define scenarios
scenarios = {
    'Best Case': {'price': 1.15, 'volume': 1.20},
    'Base Case': {'price': 1.0, 'volume': 1.0},
    'Worst Case': {'price': 0.90, 'volume': 0.80}
}

# Run analysis
results = revenue_scenario_analysis(base_price=100, base_volume=10000, scenarios=scenarios)
print(results)
This function creates a structured framework for evaluating different scenarios, returning a clean DataFrame with all results for easy comparison and analysis.
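Absolute revenue figures are often easier to read as a change against the base case. The sketch below repeats the same multiplier arithmetic in plain Python, without pandas, and reports each scenario as a percentage difference from the baseline. The function name revenue_vs_base and its base_name parameter are illustrative additions, not part of the framework above.

```python
def revenue_vs_base(base_price, base_volume, scenarios, base_name='Base Case'):
    """Return each scenario's revenue change versus the base case, in percent."""
    revenues = {
        name: (base_price * m.get('price', 1.0)) * (base_volume * m.get('volume', 1.0))
        for name, m in scenarios.items()
    }
    base_revenue = revenues[base_name]
    return {name: (rev / base_revenue - 1) * 100 for name, rev in revenues.items()}

scenarios = {
    'Best Case': {'price': 1.15, 'volume': 1.20},
    'Base Case': {'price': 1.0, 'volume': 1.0},
    'Worst Case': {'price': 0.90, 'volume': 0.80},
}

changes = revenue_vs_base(100, 10000, scenarios)
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}% vs base")
```

Because the multipliers combine by multiplication, the best case works out to 1.15 × 1.20 = 1.38 times base revenue (+38%) and the worst case to 0.90 × 0.80 = 0.72 times (−28%), which is a quick sanity check on the DataFrame output.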
Monte Carlo Simulation for Probabilistic Scenarios
For more sophisticated scenario analysis, Monte Carlo simulation generates thousands of potential outcomes based on probability distributions. This approach is particularly useful when dealing with uncertainty:
python
def monte_carlo_scenario_analysis(iterations=10000):
    """
    Perform Monte Carlo simulation for investment returns
    """
    # Fix the random seed so results are reproducible
    np.random.seed(42)

    # Simulate random variables (one annual rate per iteration)
    market_growth = np.random.normal(0.07, 0.15, iterations)
    inflation_rate = np.random.normal(0.03, 0.02, iterations)
    initial_investment = 100000
    years = 5

    # Calculate outcomes: each iteration compounds its own rate over all years
    final_values = initial_investment * (1 + market_growth - inflation_rate) ** years

    # Analyze results
    results = {
        'Mean': np.mean(final_values),
        'Median': np.median(final_values),
        'Std Dev': np.std(final_values),
        '5th Percentile': np.percentile(final_values, 5),
        '95th Percentile': np.percentile(final_values, 95),
        'Min': np.min(final_values),
        'Max': np.max(final_values)
    }

    return results, final_values

# Run simulation
stats, outcomes = monte_carlo_scenario_analysis(iterations=10000)
print("Investment Scenario Analysis:")
for key, value in stats.items():
    print(f"{key}: ${value:,.2f}")
This Monte Carlo approach provides a probabilistic view of outcomes, helping decision-makers understand the range and likelihood of different results.
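One variation worth considering: the function above draws a single growth rate per iteration and applies it to every year. The sketch below instead draws a fresh rate for each year, which is a different modeling assumption rather than a correction, and then estimates the probability of ending below the starting amount. It uses only the standard library. The function name simulate_yearly_paths and the smaller default iteration count are illustrative choices, while the distribution parameters mirror the example above.

```python
import random
import statistics

def simulate_yearly_paths(iterations=2000, years=5, initial=100_000,
                          growth=(0.07, 0.15), inflation=(0.03, 0.02), seed=42):
    """Simulate final values, drawing a new real return for every year."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    finals = []
    for _ in range(iterations):
        value = initial
        for _ in range(years):
            # growth and inflation are (mean, standard deviation) pairs
            real_return = rng.gauss(*growth) - rng.gauss(*inflation)
            value *= 1 + real_return
        finals.append(value)
    return finals

finals = simulate_yearly_paths()
prob_loss = sum(v < 100_000 for v in finals) / len(finals)

print(f"Median final value: ${statistics.median(finals):,.2f}")
print(f"Probability of ending below the initial investment: {prob_loss:.1%}")
```

Drawing per-year returns generally narrows the spread of outcomes, because good and bad years partly offset each other within a single path. Which assumption fits better depends on whether the uncertainty is about the long-run rate or about year-to-year fluctuation.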
Multi-Variable Scenario Analysis
Real-world scenarios often involve multiple interacting variables. Here's how to automate analysis across multiple dimensions:
python
def multi_variable_scenario_analysis(base_params, variable_ranges):
    """
    Analyze scenarios across multiple variables simultaneously

    Parameters:
    base_params: dict of baseline parameters
    variable_ranges: dict with variable names and possible values
    """
    # Generate all combinations
    var_names = list(variable_ranges.keys())
    var_values = list(variable_ranges.values())
    combinations = list(product(*var_values))

    results = []

    for combo in combinations:
        scenario = base_params.copy()
        scenario_name_parts = []

        for var_name, value in zip(var_names, combo):
            scenario[var_name] = value
            scenario_name_parts.append(f"{var_name}={value}")

        # Calculate business metrics
        revenue = scenario['price'] * scenario['volume']
        costs = scenario['fixed_costs'] + (scenario['variable_cost'] * scenario['volume'])
        profit = revenue - costs
        margin = (profit / revenue) * 100 if revenue > 0 else 0

        results.append({
            'Scenario': ' | '.join(scenario_name_parts),
            'Revenue': revenue,
            'Costs': costs,
            'Profit': profit,
            'Margin %': margin
        })

    return pd.DataFrame(results)

# Define baseline and variables
base_params = {
    'fixed_costs': 50000,
    'variable_cost': 30
}

variable_ranges = {
    'price': [80, 100, 120],
    'volume': [800, 1000, 1200]
}

# Run analysis
scenario_results = multi_variable_scenario_analysis(base_params, variable_ranges)
print(scenario_results.sort_values('Profit', ascending=False))
This function automatically generates all possible scenario combinations, making it easy to identify optimal parameter configurations.
Sensitivity Analysis Automation
Sensitivity analysis examines how changes in individual variables impact outcomes. Here's an automated approach:
python
def sensitivity_analysis(base_model, variables, change_range=(-0.3, 0.3), steps=10):
    """
    Perform sensitivity analysis on model variables

    Parameters:
    base_model: function that returns outcome given parameters
    variables: dict of variable names and base values
    change_range: tuple of (min_change, max_change) as percentages
    steps: number of steps in the analysis
    """
    results = {}

    for var_name, base_value in variables.items():
        sensitivity_data = []

        # Create range of values
        multipliers = np.linspace(1 + change_range[0], 1 + change_range[1], steps)

        for mult in multipliers:
            test_params = variables.copy()
            test_params[var_name] = base_value * mult
            outcome = base_model(**test_params)

            sensitivity_data.append({
                'change_%': (mult - 1) * 100,
                'value': test_params[var_name],
                'outcome': outcome
            })

        results[var_name] = pd.DataFrame(sensitivity_data)

    return results

# Example model function
def profit_model(price, volume, cost_per_unit, fixed_costs):
    revenue = price * volume
    variable_costs = cost_per_unit * volume
    return revenue - variable_costs - fixed_costs

# Define variables
variables = {
    'price': 100,
    'volume': 1000,
    'cost_per_unit': 60,
    'fixed_costs': 15000
}

# Run sensitivity analysis
sensitivity_results = sensitivity_analysis(profit_model, variables, change_range=(-0.2, 0.2), steps=15)

# Display results for price sensitivity
print("Price Sensitivity Analysis:")
print(sensitivity_results['price'])
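A natural follow-up is to rank the variables by how much the outcome moves when each one is changed on its own, which is the idea behind a tornado chart. The sketch below is a simplified, standard-library take on that: it evaluates the model only at the two ends of the range rather than at every step. The function name rank_by_swing and the ±20% default are illustrative, and the profit model is repeated so the block runs by itself.

```python
def profit_model(price, volume, cost_per_unit, fixed_costs):
    return price * volume - cost_per_unit * volume - fixed_costs

def rank_by_swing(model, variables, change=0.2):
    """Rank variables by the outcome swing between -change and +change."""
    swings = {}
    for name, base_value in variables.items():
        # Vary one variable at a time, holding the others at their base values
        low = model(**{**variables, name: base_value * (1 - change)})
        high = model(**{**variables, name: base_value * (1 + change)})
        swings[name] = abs(high - low)
    return sorted(swings.items(), key=lambda item: item[1], reverse=True)

variables = {'price': 100, 'volume': 1000, 'cost_per_unit': 60, 'fixed_costs': 15000}

ranking = rank_by_swing(profit_model, variables)
for name, swing in ranking:
    print(f"{name}: profit swing of ${swing:,.0f}")
```

With these base values, price produces the largest swing, followed by cost per unit, volume, and fixed costs. A ±20% price change moves profit between $5,000 and $45,000, so pricing assumptions deserve the most scrutiny in this model.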
Visualizing Scenario Analysis Results
Visualization makes scenario analysis insights actionable. Here's how to create compelling charts:
python
def visualize_scenarios(scenario_data, metric='profit'):
    """
    Create visualization for scenario analysis results
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    # Bar chart comparison
    scenarios = scenario_data['Scenario']
    values = scenario_data[metric.capitalize()]
    colors = ['green' if v > 0 else 'red' for v in values]

    ax1.bar(range(len(scenarios)), values, color=colors, alpha=0.7)
    ax1.set_xticks(range(len(scenarios)))
    ax1.set_xticklabels(scenarios, rotation=45, ha='right')
    ax1.set_ylabel(f'{metric.capitalize()} ($)')
    ax1.set_title(f'Scenario Comparison - {metric.capitalize()}')
    ax1.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
    ax1.grid(axis='y', alpha=0.3)

    # Cumulative line chart for the top 5 scenarios
    sorted_data = scenario_data.nlargest(5, metric.capitalize())
    cumsum = sorted_data[metric.capitalize()].cumsum()

    ax2.plot(range(len(sorted_data)), cumsum, marker='o', linewidth=2)
    ax2.fill_between(range(len(sorted_data)), cumsum, alpha=0.3)
    ax2.set_xticks(range(len(sorted_data)))
    ax2.set_xticklabels(sorted_data['Scenario'], rotation=45, ha='right')
    ax2.set_ylabel(f'Cumulative {metric.capitalize()} ($)')
    ax2.set_title('Top 5 Scenarios - Cumulative Impact')
    ax2.grid(alpha=0.3)

    plt.tight_layout()
    plt.savefig('scenario_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()

# Visualize results
visualize_scenarios(scenario_results)
Automating Scenario Reports
Finally, create automated reports that summarize your scenario analysis:
python
def generate_scenario_report(results_df, output_file='scenario_report.txt'):
    """
    Generate automated text report from scenario analysis
    """
    with open(output_file, 'w') as f:
        f.write("=" * 60 + "\n")
        f.write("SCENARIO ANALYSIS REPORT\n")
        f.write("=" * 60 + "\n\n")

        f.write(f"Total Scenarios Analyzed: {len(results_df)}\n\n")

        f.write("BEST CASE SCENARIO:\n")
        best = results_df.nlargest(1, 'Profit').iloc[0]
        f.write(f"  {best['Scenario']}\n")
        f.write(f"  Profit: ${best['Profit']:,.2f}\n")
        f.write(f"  Margin: {best['Margin %']:.2f}%\n\n")

        f.write("WORST CASE SCENARIO:\n")
        worst = results_df.nsmallest(1, 'Profit').iloc[0]
        f.write(f"  {worst['Scenario']}\n")
        f.write(f"  Profit: ${worst['Profit']:,.2f}\n")
        f.write(f"  Margin: {worst['Margin %']:.2f}%\n\n")

        f.write("SUMMARY STATISTICS:\n")
        f.write(f"  Average Profit: ${results_df['Profit'].mean():,.2f}\n")
        f.write(f"  Median Profit: ${results_df['Profit'].median():,.2f}\n")
        f.write(f"  Profit Range: ${results_df['Profit'].max() - results_df['Profit'].min():,.2f}\n")

    print(f"Report generated: {output_file}")

# Generate report
generate_scenario_report(scenario_results)
Conclusion
Automating scenario analysis in Python transforms how organizations approach strategic planning and risk assessment. By leveraging Python's powerful libraries and the code examples provided, you can build sophisticated scenario analysis systems that generate insights quickly and accurately. Whether you're using simple scenario comparisons, Monte Carlo simulations, or multi-variable sensitivity analysis, Python provides the flexibility and power to handle complex analytical challenges. Start with these examples and customize them to fit your specific business needs for better decision-making under uncertainty.