StarRocks/doc-translator

Markdown Translator

A powerful command-line tool that uses Claude AI to translate markdown and MDX files from English to any specified language while preserving formatting and structure.

Usage at StarRocks

This code and most of the README are from the team at PlayCanvas. The only changes are:

  • StarRocks specific prompt
  • StarRocks specific dictionary
  • StarRocks specific words that should always be in English
  • The -s, --source option, which sets the source language, because we translate from both English and Chinese

Options

  -i, --input <pattern>   Input file path or glob pattern (e.g., "*.md",
                          "docs/**/*.md")
  -l, --language <lang>   Target language (e.g., Spanish, French, German)
  -s, --source <lang>     Source language (default: English)
  -o, --output <file>     Output file path (for single file translation)
  -d, --output-dir <dir>  Output directory (for batch translation or single
                          file)
  -k, --key <apikey>      Anthropic API key (or set ANTHROPIC_API_KEY env var)
  --flat                  Use flat structure in output directory (default:
                          preserve structure)
  --suffix <suffix>       Custom suffix for output files (default: language
                          name)
  --log-chunk-metadata    Log API metadata for each chunk
  --trace                 Log per-ID source text sent and translated text
                          received (full content, no truncation)
  -h, --help              display help for command

The translator now uses the AST pipeline by default.

When --trace is enabled, the tool logs one JSON trace record per ID and includes the full sourceText and translatedText values. The only masking applied is replacing occurrences of the actual API key value with ***.
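
The masking rule above can be illustrated with a minimal sketch. The helper names (maskApiKey, buildTraceRecord) and the id field layout are assumptions made for illustration; only the sourceText and translatedText fields and the *** replacement come from this README.

```javascript
// Hypothetical helpers, not the tool's actual implementation.
function maskApiKey(text, apiKey) {
  // Nothing to mask when no key is known.
  if (!apiKey) return text;
  // split/join replaces every literal occurrence without regex escaping.
  return text.split(apiKey).join('***');
}

function buildTraceRecord(id, sourceText, translatedText, apiKey) {
  // One JSON record per ID, with full (untruncated) text values.
  const record = {
    id,
    sourceText: maskApiKey(sourceText, apiKey),
    translatedText: maskApiKey(translatedText, apiKey),
  };
  return JSON.stringify(record);
}

// Example: the key value is masked wherever it appears in either field.
console.log(buildTraceRecord('n1', 'Key: sk-ant-demo', 'Clave: sk-ant-demo', 'sk-ant-demo'));
```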

Interpreting AST parse failures

In AST mode, each chunk asks the model to return a strict JSON array of { id, text } items.

  • Parse errors such as Expected ',' or '}' or Expected ':' after property name usually mean the model returned malformed JSON for that chunk.
  • These are response-format failures, not semantic translation failures.
  • finishReason: STOP with parse errors means the output completed, but the JSON structure was invalid.
  • When you see json repair retry, the tool requested a strict JSON retry and recovered automatically.
  • When you see split fallback recovered X/Y missing ids, the tool retried unresolved IDs in smaller sub-batches and merged recovered results back into the chunk.
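
To make the distinction between format failures and missing items concrete, here is a rough sketch of how one chunk response could be validated. The function name parseChunkResponse, the result shape, and the assumption that IDs are strings are all hypothetical; the tool's real logic may differ.

```javascript
// Hypothetical validator for one chunk response; names are illustrative.
function parseChunkResponse(rawText, expectedIds) {
  const failure = (reason) => ({
    ok: false,
    reason,
    translated: new Map(),
    missing: [...expectedIds],
  });

  let items;
  try {
    items = JSON.parse(rawText);
  } catch (err) {
    // Malformed JSON is a response-format failure, not a translation failure.
    return failure(`parse error: ${err.message}`);
  }
  if (!Array.isArray(items)) return failure('response is not a JSON array');

  // Keep only well-formed { id, text } items for IDs that were requested.
  const translated = new Map();
  for (const item of items) {
    const wellFormed =
      item && typeof item.id === 'string' && typeof item.text === 'string';
    if (wellFormed && expectedIds.includes(item.id)) {
      translated.set(item.id, item.text);
    }
  }

  const missing = expectedIds.filter((id) => !translated.has(id));
  return {
    ok: missing.length === 0,
    reason: missing.length === 0 ? null : 'missing ids',
    translated,
    missing,
  };
}
```

In this sketch, a parse error marks every ID in the chunk as missing, which is the situation a JSON repair retry would then try to resolve.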

How to read the outcome:

  • AST completeness check: Translated IDs N/N - ✅ PASS means the chunk is fully recovered, even if repair notes are present.
  • Missing IDs after all retries are the only case that indicates unresolved chunk-level translation for those specific items.
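
The split fallback and the completeness check can be sketched as follows. This is an illustration under assumptions: the sub-batch size, the retryBatch callback (synchronous here for brevity; a real API call would be asynchronous), and the FAIL wording are hypothetical, and the real log line also carries a ✅ marker.

```javascript
// Split unresolved IDs into smaller sub-batches (the size is an assumed parameter).
function splitIntoSubBatches(ids, size) {
  const batches = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// Retry each sub-batch and merge recovered items back into the chunk's results.
// retryBatch is a hypothetical callback returning a Map of id -> translated text.
function recoverMissing(missingIds, translated, retryBatch, size = 2) {
  let recovered = 0;
  for (const batch of splitIntoSubBatches(missingIds, size)) {
    for (const [id, text] of retryBatch(batch)) {
      if (!translated.has(id)) {
        translated.set(id, text);
        recovered += 1;
      }
    }
  }
  return recovered;
}

// A chunk passes only when every expected ID has a translation.
function completenessReport(expectedIds, translated) {
  const done = expectedIds.filter((id) => translated.has(id)).length;
  const status = done === expectedIds.length ? 'PASS' : 'FAIL';
  return `AST completeness check: Translated IDs ${done}/${expectedIds.length} - ${status}`;
}
```

Under this model, repair notes in the log do not affect the final status; only IDs still absent from the merged map do.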

Quick Start

  1. cd into the root of this repo
  2. Get an Anthropic API Key
  3. Export your Anthropic API Key like so:
    export ANTHROPIC_API_KEY="<your key here>"
  4. Install the prerequisites:
    npm install
  5. Translate an example file:
    npm run demo
  6. Check the source and destination example files (their names appear in the output of npm run demo). Look for the key phrases from our dictionaries and for the terms that should always be left in English.
  7. List the options:
    node bin/cli.js translate -h

Example use on your workstation with the StarRocks repo

# Export your Anthropic API key
export ANTHROPIC_API_KEY="sk-ant-xxxxxxxxxxxxxxxx"

# in the doc-translator repo directory install the translator globally on your system:
npm install
npm link

# now in the starrocks/starrocks repo dir
# view the options:
doc-translate translate -h

# Example, translate the English architecture doc to Japanese:
doc-translate translate -s en -i docs/en/introduction/Architecture.md -l ja -o docs/ja/introduction/Architecture.md

Example use in GitHub PRs

  1. Check the boxes.
  2. Add a /translate comment (docs-maintainers only at the moment).

Features

  • 🌍 Multi-language support - Translate to 40+ languages
  • 📝 Markdown-aware - Preserves all markdown formatting (headers, links, code blocks, tables, etc.)
  • 🔄 Smart chunking - Handles large files by splitting content intelligently
  • 🎯 Selective translation - Only translates text content, keeps code and URLs intact
  • 📂 Batch processing - Translate multiple files using glob patterns (e.g., docs/**/*.md)
  • 🏗️ Structure preservation - Maintain directory structure or flatten output as needed
  • 📊 Progress tracking - Real-time progress indication with spinners for single files and batches
  • 🎨 Beautiful CLI - Colorful, user-friendly command-line interface
  • ⚡ Fast processing - Optimized for speed with a high-performance Claude model

Installation

Prerequisites

Note: This tool uses ES modules (ESM) and requires Node.js 16+ for full compatibility.

Install dependencies

npm install

Make CLI globally available (optional)

npm link

Or run directly with Node:

node bin/cli.js

Setup

1. Get Anthropic API Key

  1. Visit Anthropic Console
  2. Create a new API key
  3. Copy the generated key

2. Set API Key

Option A: Environment Variable (Recommended)

export ANTHROPIC_API_KEY="your-api-key-here"

Option B: Command Line Argument

doc-translate translate -i file.md -l Spanish --key your-api-key-here
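
A minimal sketch of how the two options could be combined is shown below. The function name resolveApiKey and the precedence (the --key argument winning over the environment variable) are assumptions; this README does not state which one takes priority.

```javascript
// Hypothetical resolver; the precedence shown here is an assumption.
function resolveApiKey(options, env) {
  const key = options.key || env.ANTHROPIC_API_KEY;
  if (!key) {
    throw new Error('Missing API key: pass --key or set ANTHROPIC_API_KEY');
  }
  return key;
}

// Typical call site: resolveApiKey(parsedCliOptions, process.env)
```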

Usage

Basic Translation

# Translate README.md to Spanish
doc-translate translate -i README.md -l Spanish

# Translate with custom output file
doc-translate translate -i docs/guide.md -l French -o docs/guide_fr.md

# Translate using API key argument
doc-translate translate -i file.md -l German --key your-api-key

# Translate with AST mode (default)
doc-translate translate -i examples/External_table.md -l Japanese

Batch Processing

The tool supports batch processing of multiple markdown files using glob patterns:

# Translate all .md files in current directory
doc-translate translate -i "*.md" -l Spanish -d ./spanish/

# Translate all markdown files in docs folder and subfolders
doc-translate translate -i "docs/**/*.md" -l French -d ./translations/

# Batch translate with flat structure (no subdirectories)
doc-translate translate -i "content/**/*.md" -l German -d ./output/ --flat

# Batch translate with custom suffix
doc-translate translate -i "*.md" -l ja -d ./translated/ --suffix "ja"

Available Commands

translate - Translate a markdown or MDX file

doc-translate translate [options]

Options:
  -i, --input <pattern>    Input file path or glob pattern (required)
                           Examples: "file.md", "*.md", "docs/**/*.md"
  -l, --language <lang>    Target language (required)
  -s, --source <lang>      Source language (default: English)
  -o, --output <file>      Output file path (for single file translation)
  -d, --output-dir <dir>   Output directory (for batch translation or single file)
  -k, --key <apikey>       Anthropic API key (optional)
  --flat                   Use flat structure in output directory (default: preserve structure)
  --suffix <suffix>        Custom suffix for output files (default: language name)
  --log-chunk-metadata     Log API metadata for each chunk
  --trace                  Log per-ID source text sent and translated text received

languages - List supported languages

doc-translate languages

setup - Show setup guide

doc-translate setup

--help - Show help

doc-translate --help

Supported Languages

The tool supports 40+ languages including:

  • European: Spanish, French, German, Italian, Portuguese, Dutch, Russian, Polish, Swedish, Norwegian, Danish, Finnish, Greek, Ukrainian, Czech, Hungarian, Romanian, Bulgarian, Croatian, Serbian, Slovak, Slovenian, Estonian, Latvian, Lithuanian, Catalan, Basque, Welsh, Irish
  • Asian: Chinese, Japanese, Korean, Hindi, Thai, Vietnamese, Indonesian, Malay
  • Middle Eastern: Arabic, Hebrew, Turkish

Tip

You can use the two-letter short code for a language instead of its name. For example, zh instead of "Simplified Chinese".

Examples

Single File Translation

Example 1: Basic Translation

doc-translate translate -i README.md -l es

Output: Creates README_spanish.md with Spanish translation

Example 2: Custom Output Path

doc-translate translate -i docs/api.md -l fr -o docs/fr/api.md

Output: Creates docs/fr/api.md with French translation

Example 3: Using API Key Argument

doc-translate translate -i guide.md -l German --key sk-ant-...

Example 4: Large File Translation

The tool automatically handles large files by splitting them into chunks:

doc-translate translate -i large-document.md -l ja
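
As an illustration of the general idea, the sketch below splits markdown at blank lines while keeping fenced code blocks whole, then packs the blocks into chunks under a size limit. It is not the tool's implementation (the default AST pipeline may chunk differently); chunkMarkdown and maxChars are hypothetical names, fence detection is simplified, and runs of blank lines between blocks collapse to one.

```javascript
// Simplified chunker: splits on blank lines, never inside a fenced code block.
function chunkMarkdown(markdown, maxChars) {
  const blocks = [];
  let current = [];
  let inFence = false;

  for (const line of markdown.split('\n')) {
    // Toggle on backtick or tilde fences (simplified: fence types are not matched).
    if (/^\s*(```|~~~)/.test(line)) inFence = !inFence;
    if (line.trim() === '' && !inFence) {
      if (current.length > 0) {
        blocks.push(current.join('\n'));
        current = [];
      }
    } else {
      current.push(line);
    }
  }
  if (current.length > 0) blocks.push(current.join('\n'));

  // Greedily pack blocks into chunks; an oversized block becomes its own chunk.
  const chunks = [];
  let chunk = '';
  for (const block of blocks) {
    const candidate = chunk ? `${chunk}\n\n${block}` : block;
    if (chunk && candidate.length > maxChars) {
      chunks.push(chunk);
      chunk = block;
    } else {
      chunk = candidate;
    }
  }
  if (chunk) chunks.push(chunk);
  return chunks;
}
```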

Batch Translation

Example 5: Translate All Markdown Files

doc-translate translate -i "*.md" -l Spanish -d ./spanish/

Output: Translates all .md files in current directory to ./spanish/ folder

Example 6: Recursive Translation with Structure Preservation

doc-translate translate -i "docs/**/*.md" -l French -d ./translations/

Output: Translates all markdown files in docs/ and preserves directory structure in ./translations/

docs/
├── guide.md
├── api/
│   └── reference.md
└── tutorials/
    └── getting-started.md

# Becomes:
translations/
├── guide_french.md
├── api/
│   └── reference_french.md
└── tutorials/
    └── getting-started_french.md

Example 7: Flat Structure Batch Translation

doc-translate translate -i "content/**/*.md" -l German -d ./output/ --flat

Output: Translates all files but places them in a flat structure (no subdirectories)

content/
├── intro.md
├── chapters/
│   ├── chapter1.md
│   └── chapter2.md
└── appendix/
    └── notes.md

# Becomes:
output/
├── intro_german.md
├── chapter1_german.md
├── chapter2_german.md
└── notes_german.md

Example 8: Custom Suffix

doc-translate translate -i "*.md" -l ja -d ./translated/ --suffix "ja"

Output: Uses "ja" instead of "japanese" as the file suffix
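
The naming rules shown in Examples 6 to 8 can be sketched as one small function. This is a simplified illustration that assumes POSIX-style paths and a known base directory for the glob (for example docs for docs/**/*.md); outputPathFor and its parameters are hypothetical names, not the tool's API.

```javascript
// Hypothetical mapping from an input path to its output path.
function outputPathFor(inputPath, baseDir, outputDir, suffix, flat) {
  // Path of the input relative to the glob's base directory.
  const relative = inputPath.startsWith(`${baseDir}/`)
    ? inputPath.slice(baseDir.length + 1)
    : inputPath;

  const parts = relative.split('/');
  const fileName = parts.pop();

  // Insert "_<suffix>" between the file stem and its extension.
  const dot = fileName.lastIndexOf('.');
  const stem = dot > 0 ? fileName.slice(0, dot) : fileName;
  const ext = dot > 0 ? fileName.slice(dot) : '';
  const renamed = `${stem}_${suffix}${ext}`;

  // --flat drops the subdirectories; otherwise they are preserved.
  const dirs = flat ? [] : parts;
  return [outputDir.replace(/\/+$/, ''), ...dirs, renamed].join('/');
}

console.log(outputPathFor('docs/api/reference.md', 'docs', './translations/', 'french', false));
console.log(outputPathFor('content/chapters/chapter1.md', 'content', './output/', 'german', true));
```

In this sketch, two inputs that share a file name would map to the same flat output path, so it is worth checking for name collisions before using --flat on a deep tree.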

What Gets Translated

Translated:

  • Heading text
  • Paragraph text
  • List items
  • Table content
  • Link text
  • Image alt text
  • Quote text

Preserved:

  • Code blocks and inline code
  • URLs and file paths
  • Markdown syntax characters
  • HTML tags
  • Mathematical expressions
  • Technical terms and proper nouns (when appropriate)
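
One common way to keep spans such as inline code and URLs intact is to swap them for placeholder tokens before translation and restore them afterwards. The sketch below illustrates that technique only; the __MTX_ prefix matches the placeholder name mentioned in the Testing section, but the exact token format, the regular expression, and the function names are assumptions, and the default AST pipeline may work differently.

```javascript
// Replace inline code and bare URLs with numbered placeholder tokens.
function protectSpans(text) {
  const saved = [];
  const pattern = /`[^`\n]+`|https?:\/\/[^\s)>\]]+/g;
  const masked = text.replace(pattern, (match) => {
    saved.push(match);
    return `__MTX_${saved.length - 1}__`;
  });
  return { masked, saved };
}

// Put the original spans back; unknown tokens are left untouched.
function restoreSpans(text, saved) {
  return text.replace(/__MTX_(\d+)__/g, (token, index) => saved[Number(index)] ?? token);
}

const { masked, saved } = protectSpans('Run `SHOW TABLES` (see https://example.com/docs).');
console.log(masked);
console.log(restoreSpans(masked, saved));
```

Any token that survives into the final output indicates a restore step that did not run or a token the model altered, which is what a placeholder-leak check looks for.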

Output

The tool provides detailed progress feedback for both single file and batch processing:

Single File Translation Output

╔═══════════════════════════════════════╗
║        Markdown Translator            ║
║       Powered by Claude AI            ║
╚═══════════════════════════════════════╝

📋 Translation Details:
   Input:    /path/to/README.md
   Output:   /path/to/README_spanish.md
   Language: Spanish

⠋ Translating chunk 2/3...
✅ Translation completed successfully!

📊 Summary:
   Original length:  2,845 characters
   Translated length: 3,120 characters
   Language:         Spanish
   Output file:      /path/to/README_spanish.md

Batch Translation Output

╔═══════════════════════════════════════╗
║        Markdown Translator            ║
║       Powered by Claude AI            ║
╚═══════════════════════════════════════╝

📋 Batch Translation Details:
   Pattern:  docs/**/*.md
   Output:   /path/to/translations/
   Language: Spanish
   Structure: Preserved

⠋ [2/5] reference.md - chunk 1/2...
✅ All translations completed successfully!

📊 Summary:
   Files processed: 5
   Successful: 5
   Failed: 0
   Output directory: /path/to/translations/

Error Handling

The tool provides clear error messages for common issues:

  • Missing or invalid API key
  • File not found
  • Invalid file format
  • Network connectivity issues
  • API rate limiting

Testing

The examples/ directory contains a test corpus and an automated checker.

Test corpus: examples/StarRocksTest.md

A curated set of patterns drawn from real StarRocks documentation that have caused translation problems in the past:

| Pattern | Why it matters |
| --- | --- |
| YAML frontmatter | Must be preserved exactly |
| HTML in Markdown table cells (<ul><li>, <br />, <code class="...">) | Tags must not be translated or restructured |
| Tilde fence code blocks (~~~SQL) | Must be converted to backtick fences cleanly |
| MDX import statements and <Tabs>/<TabItem> JSX | Must be preserved unchanged |
| Template variables in code ({{ data_interval_start }}) | Airflow/dbt syntax must not be touched |
| HTML comparison tables with colspan | Full HTML blocks must pass through untranslated |
| Admonitions indented inside numbered lists | Indentation must survive translation |
| <details> collapsible blocks | Content indentation must be preserved |
| Cross-references with relative paths and anchors | Only the display text is translated; the URL is not |

Automated checker: examples/check_translation.js

After translation, the checker runs 13 static checks against the source/output pair and reports PASS/FAIL for each. The checks cover:

  • No __MTX_ placeholder leaks
  • Heading count
  • Code block count and non-comment content
  • Link URL preservation
  • HTML tags in table cells
  • Frontmatter preserved exactly
  • Import statements preserved
  • Admonition marker count
  • Admonition indentation (catches the "indented :::note gets unindented" bug)
  • Never-translate term spot-check
  • Unordered list item count
  • Table column counts
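
To show what a static check of this kind might look like, here is a small sketch covering three of the listed checks. It is not the code in examples/check_translation.js; the function names and regular expressions are assumptions, and the heading counter is simplified (it would also count # comment lines inside code blocks).

```javascript
// Count ATX headings (lines starting with 1 to 6 "#" characters and a space).
function countHeadings(markdown) {
  return markdown.split('\n').filter((line) => /^#{1,6}\s/.test(line)).length;
}

// Collect link and image targets in document order.
function extractLinkUrls(markdown) {
  return [...markdown.matchAll(/\[[^\]]*\]\(([^)\s]+)[^)]*\)/g)].map((m) => m[1]);
}

// Compare a source document with its translation and report PASS/FAIL per check.
function runChecks(source, output) {
  const sameUrls =
    JSON.stringify(extractLinkUrls(source)) === JSON.stringify(extractLinkUrls(output));
  return [
    { name: 'No __MTX_ placeholder leaks', pass: !output.includes('__MTX_') },
    { name: 'Heading count', pass: countHeadings(source) === countHeadings(output) },
    { name: 'Link URL preservation', pass: sameUrls },
  ];
}

for (const { name, pass } of runChecks('# Title\n\nSee [docs](../a.md#x).', '# 标题\n\n参见[文档](../a.md#x)。')) {
  console.log(`${pass ? 'PASS' : 'FAIL'} ${name}`);
}
```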

npm scripts

npm test          # Translate StarRocksTest.md → zh, then run all checks
npm run test:ja   # Translate StarRocksTest.md → ja, then run all checks
npm run check:zh  # Re-run checks on an already-translated StarRocksTest_zh.md
npm run check:ja  # Re-run checks on an already-translated StarRocksTest_ja.md

check:zh and check:ja are useful for iterating on the system prompt or dictionaries without calling the API again.

Development

Project Structure

doc-translator/
├── bin/
│   └── cli.js                    # CLI entry point
├── src/
│   ├── translator.js             # Base class and shared utilities
│   ├── translator_ast_mvp.js     # AST-based translator (default)
│   └── configs/
│       ├── system_prompt.txt     # Translation instructions for the model
│       ├── never_translate.yaml  # Terms that must never be translated
│       └── language_dicts/       # Per-language translation dictionaries
├── examples/
│   ├── StarRocksTest.md          # Test corpus
│   └── check_translation.js     # Automated output checker
├── package.json
└── README.md

Architecture

This project uses ES modules (ESM) for modern JavaScript development:

  • All files use import/export syntax instead of require/module.exports
  • package.json includes "type": "module" for ESM support
  • Compatible with the latest versions of dependencies (chalk 5.x, ora 8.x)
  • Requires Node.js 16+ for full ESM compatibility

Key Dependencies

  • @anthropic-ai/sdk - Anthropic Claude AI SDK
  • commander - Command-line interface framework
  • chalk - Terminal styling
  • ora - Progress spinners
  • fs-extra - Enhanced file system operations

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Troubleshooting

API Key Issues

  • Ensure your API key is valid and active (you can verify this in the Anthropic Console)
  • Check that you have sufficient quota in your Anthropic account

Large File Processing

  • The tool automatically chunks large files
  • Each chunk is processed with a small delay to avoid rate limiting
  • Very large files may take several minutes to process

Batch Processing

  • Use quotes around glob patterns to prevent shell expansion: "*.md" not *.md
  • The --output-dir option is required for batch translation
  • Large batches may take considerable time; watch the progress indicator to monitor them
  • Failed files in a batch are reported individually without stopping the process

Network Issues

  • Ensure you have a stable internet connection
  • The tool will retry failed requests automatically
  • Check firewall settings if you encounter connection issues
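
This README does not describe the retry policy, so the following is only a generic sketch of automatic retries with exponential backoff. The function names, attempt count, and delay values are all assumptions.

```javascript
// Delay before the next attempt: base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run an async task, retrying on failure up to maxAttempts times.
async function withRetry(task, maxAttempts = 3, baseMs = 500) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Wait before retrying, but not after the final attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs));
      }
    }
  }
  throw lastError;
}
```

A caller would wrap each request in it, for example withRetry(() => translateChunk(chunk)), where translateChunk stands in for whatever function performs the API call.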

Support

If you encounter any issues or have questions:

  1. Check the troubleshooting section above
  2. Run doc-translate setup for configuration help
  3. Create an issue on the project repository

Happy translating! 🌍✨
