If you’re working with legal documents, contracts, or any collaborative review process, you know how painful it is to manually track all the changes, comments, and revisions in Word documents. This guide shows you how to extract all that markup programmatically using Marker’s track_changes feature.

Overview

The track_changes feature in Marker extracts:
  • Tracked changes: insertions and deletions with author names and timestamps
  • Comments: all margin comments with author details
This gives you a full revision history from your Word documents in clean HTML and Markdown. The track_changes feature is perfect for legal workflows where you need to:
  • Generate redline summaries for clients
  • Identify all changes made by specific parties
  • Extract action items from comments
  • Analyze negotiation patterns across contract versions
  • Create audit trails of document revisions
To enable this feature, set the extras parameter to “track_changes” when calling the Marker API. The output will be provided in Markdown and HTML format by default, with all tracked changes and comments preserved in the markup.

Making the API Request

Here’s how to submit a Word document and extract its tracked changes:
import requests
import time
import os

API_URL = "https://www.datalab.to/api/v1/marker"
API_KEY = os.getenv("DATALAB_API_KEY")

def extract_tracked_changes(docx_path, output_format='html,markdown'):
    """
    Extract tracked changes and comments from a Word document.
    
    Args:
        docx_path: Path to the .docx file
        output_format: 'html', 'markdown', or 'html,markdown'
    
    Returns:
        Dictionary with the converted content including tracked changes
    """
    with open(docx_path, 'rb') as f:
        form_data = {
            'file': (docx_path, f, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'),
            'extras': (None, 'track_changes'),  # Enable track_changes feature
            'output_format': (None, output_format),
            'paginate': (None, 'false')  # Set to 'true' if you want page breaks; multipart form values must be strings
        }
        
        headers = {"X-Api-Key": API_KEY}
        response = requests.post(API_URL, files=form_data, headers=headers)
        data = response.json()
    
    # Poll for completion
    check_url = data["request_check_url"]
    max_polls = 300  # Increase for very large documents
    
    for i in range(max_polls):
        time.sleep(2)
        response = requests.get(check_url, headers=headers)
        result = response.json()
        
        if result["status"] == "complete":
            return result
        elif result["status"] == "failed":
            raise Exception(f"Conversion failed: {result.get('error')}")
    
    raise TimeoutError("Conversion did not complete in time")
The response will contain your document with all tracked changes preserved. Here’s what the markup looks like:
  • Insertions: <ins data-revision-author="Sandy Kwon" data-revision-datetime="2025-11-11T11:24:00">new text</ins>
  • Deletions: <del data-revision-author="Vikram Oberoi" data-revision-datetime="2025-11-11T10:34:00">old text</del>
  • Comments: <comment data-comment-author="Vikram Oberoi" data-comment-datetime="2025-11-11T10:32:00" data-comment-initial="VO" text="comment text">marked text</comment>
This markup will appear in both HTML and Markdown output.
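Because the markup uses plain HTML-style tags, you can also inspect it programmatically before (or instead of) sending it to an LLM. Here’s a minimal sketch using only Python’s standard library that collects every ins, del, and comment element along with its attributes; the html string below is a made-up fragment for illustration, not real API output:

```python
from html.parser import HTMLParser

class RevisionCollector(HTMLParser):
    """Collect <ins>, <del>, and <comment> elements with their attributes and text."""
    def __init__(self):
        super().__init__()
        self.revisions = []
        self._stack = []  # Handles nested revision tags

    def handle_starttag(self, tag, attrs):
        if tag in ("ins", "del", "comment"):
            self._stack.append({"type": tag, "attrs": dict(attrs), "text": ""})

    def handle_data(self, data):
        if self._stack:
            self._stack[-1]["text"] += data

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1]["type"] == tag:
            self.revisions.append(self._stack.pop())

# Hypothetical fragment in the markup format described above
html = ('<p>Fee is <del data-revision-author="Vikram Oberoi" '
        'data-revision-datetime="2025-11-11T10:34:00">$500</del>'
        '<ins data-revision-author="Sandy Kwon" '
        'data-revision-datetime="2025-11-11T11:24:00">$750</ins>.</p>')

collector = RevisionCollector()
collector.feed(html)
for rev in collector.revisions:
    author = rev["attrs"].get("data-revision-author") or rev["attrs"].get("data-comment-author")
    print(rev["type"], author, repr(rev["text"]))
```

From here you can count changes per author, build a diff table, or filter for deletions of specific clauses without an LLM call.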

Analyzing Changes with LLMs

Once you have the extracted markup, you can use an LLM to analyze the changes. Here’s an example using OpenRouter to generate a redline summary:
import requests
import os

OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
OPENROUTER_MODEL = os.getenv("OPENROUTER_MODEL")


def analyze_changes_with_llm(marked_up_content, analysis_type='summary'):
    """
    Use an LLM via OpenRouter to analyze tracked changes.
    
    Args:
        marked_up_content: The HTML or Markdown with tracked changes
        analysis_type: Type of analysis ('summary', 'risks', 'by_author', etc.)
    
    Returns:
        LLM analysis of the changes
    """
    
    prompts = {
        'summary': """Analyze this contract with tracked changes and provide:
1. A concise summary of all changes made
2. Key changes that materially affect the agreement
3. Any changes that shift risk or obligations between parties
4. Recommended action items for legal review

Document with tracked changes:
{content}""",
        
        'by_author': """Review this document with tracked changes and create a report organized by author:
- List each author's changes
- Categorize changes as substantive vs. stylistic
- Highlight any conflicting changes between authors

Document:
{content}""",
        
        'risks': """Analyze this contract's tracked changes for potential legal risks:
- Identify changes that increase liability or obligations
- Flag any deletions of protective language
- Note additions that could be problematic
- Assess the overall risk profile of the revisions

Document:
{content}"""
    }
    
    prompt = prompts.get(analysis_type, prompts['summary']).format(content=marked_up_content)
    
    response = requests.post(
        url="https://openrouter.ai/api/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENROUTER_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": OPENROUTER_MODEL,
            "messages": [
                {
                    "role": "user",
                    "content": prompt
                }
            ]
        }
    )
    
    return response.json()['choices'][0]['message']['content']

# Example usage
result = extract_tracked_changes('nda_draft_v3.docx', output_format='html')
marked_up_doc = result['html']

# Generate different types of analysis
summary = analyze_changes_with_llm(marked_up_doc, 'summary')
risk_analysis = analyze_changes_with_llm(marked_up_doc, 'risks')
by_author = analyze_changes_with_llm(marked_up_doc, 'by_author')

print("Change Summary:")
print(summary)
print("\n" + "="*80 + "\n")
print("Risk Analysis:")
print(risk_analysis)

Pagination

For longer documents, you may want to preserve page breaks in the output so you can split it into pages later. Set paginate to True in your request:
form_data = {
    'file': (docx_path, f, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'),
    'extras': (None, 'track_changes'),
    'output_format': (None, 'html'),
    'paginate': (None, 'true')  # Enable pagination; multipart form values must be strings
}
For Markdown output, each page will be preceded by a horizontal rule containing the page number:
{0}------------------------------------------------

<Page 0 Content>

{1}------------------------------------------------

<Page 1 Content>
For HTML output, each page will be wrapped in a div with the page number:
<div class="page" data-page-id="0">
  <!-- Page 1 content -->
</div>
<div class="page" data-page-id="1">
  <!-- Page 2 content -->
</div>
This makes it easy to process documents page-by-page or display them with proper pagination in your UI.

Full Code Sample

Here’s a complete example that extracts tracked changes and generates a legal review summary:
import os
import requests
import time
import json

API_URL = "https://www.datalab.to/api/v1/marker"
DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")

def extract_tracked_changes(docx_path, output_format='html', paginate=False):
    """Extract tracked changes from a Word document."""
    with open(docx_path, 'rb') as f:
        form_data = {
            'file': (os.path.basename(docx_path), f, 
                    'application/vnd.openxmlformats-officedocument.wordprocessingml.document'),
            'extras': (None, 'track_changes'),
            'output_format': (None, output_format),
            'paginate': (None, str(paginate).lower())  # Convert bool to the string form the API expects
        }
        
        headers = {"X-Api-Key": DATALAB_API_KEY}
        response = requests.post(API_URL, files=form_data, headers=headers)
        data = response.json()
    
    # Poll for completion
    check_url = data["request_check_url"]
    max_polls = 300
    
    for i in range(max_polls):
        time.sleep(2)
        response = requests.get(check_url, headers=headers)
        result = response.json()
        
        if result["status"] == "complete":
            return result
        elif result["status"] == "failed":
            raise Exception(f"Conversion failed: {result.get('error')}")
    
    raise TimeoutError("Conversion did not complete in time")


def analyze_with_llm(content, prompt_template):
    """Send content to LLM for analysis via OpenRouter."""
    response = requests.post(
        url="https://openrouter.ai/api/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENROUTER_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": "anthropic/claude-3.5-sonnet",
            "messages": [
                {
                    "role": "user",
                    "content": prompt_template.format(content=content)
                }
            ]
        }
    )
    
    return response.json()['choices'][0]['message']['content']


def generate_legal_review(docx_path):
    """
    Complete workflow: extract tracked changes and generate legal review.
    """
    print(f"Processing {docx_path}...")
    
    # Extract tracked changes
    result = extract_tracked_changes(docx_path, output_format='html', paginate=True)
    marked_up_doc = result['html']
    
    print("Document converted with tracked changes preserved.")
    
    # Generate comprehensive legal review
    review_prompt = """You are a legal reviewer analyzing a contract with tracked changes.

Please provide:

1. **Executive Summary**: Brief overview of the document and key changes
2. **Material Changes**: List substantive changes that affect rights, obligations, or liabilities
3. **Risk Assessment**: Identify any changes that increase risk exposure
4. **Comments Analysis**: Summarize unresolved comments and action items
5. **Recommendations**: Specific next steps for legal review

Document with tracked changes:
{content}"""

    print("\nGenerating legal review with LLM...")
    review = analyze_with_llm(marked_up_doc, review_prompt)
    
    # Also generate author-specific analysis
    author_prompt = """Analyze this document's tracked changes by author.

For each author who made changes:
- Total number of insertions and deletions
- Types of changes (substantive vs. editorial)
- Key themes in their revisions
- Any patterns in their negotiation strategy

Document:
{content}"""
    
    print("Generating per-author analysis...")
    author_analysis = analyze_with_llm(marked_up_doc, author_prompt)
    
    return {
        'marked_up_document': marked_up_doc,
        'legal_review': review,
        'author_analysis': author_analysis
    }


if __name__ == "__main__":
    # Process a contract with tracked changes
    results = generate_legal_review('contract_redline_v3.docx')
    
    # Save results
    with open('legal_review.txt', 'w') as f:
        f.write("LEGAL REVIEW\n")
        f.write("="*80 + "\n\n")
        f.write(results['legal_review'])
        f.write("\n\n" + "="*80 + "\n\n")
        f.write("AUTHOR ANALYSIS\n")
        f.write("="*80 + "\n\n")
        f.write(results['author_analysis'])
    
    with open('marked_up_document.html', 'w') as f:
        f.write(results['marked_up_document'])
    
    print("\nReview complete! Results saved to:")
    print("  - legal_review.txt")
    print("  - marked_up_document.html")

Try it out

Sign up for Datalab and try out Marker - it’s free, and we’ll include credits. As always, write to us at support@datalab.to if you want more credits or have any specific questions or requests!