Changes from 7 commits
78 changes: 78 additions & 0 deletions README.md
@@ -107,6 +107,7 @@ At the moment, the following optional dependencies are available:
* `[pdf]` Installs dependencies for PDF files
* `[outlook]` Installs dependencies for Outlook messages
* `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
* `[az-content-understanding]` Installs dependencies for Azure Content Understanding
* `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
* `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription

@@ -168,6 +169,83 @@ markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoin

More information about how to set up an Azure Document Intelligence Resource can be found [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource?view=doc-intel-4.0.0)

### Azure Content Understanding

[Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/) (CU) is a cloud service that offers higher-quality conversion than the built-in converters, with structured field extraction (emitted as YAML front matter), multi-modal support (documents, images, audio, video), and configurable analyzers.

Install: `pip install 'markitdown[az-content-understanding]'`

#### When to use Content Understanding

Content Understanding is ideal when you need capabilities beyond what built-in or Document Intelligence converters provide:

- **Audio and video files**: Content Understanding is the only option for video and the higher-quality cloud option for audio. The built-in converters have no video support and only basic audio transcription.
- **Structured field extraction**: Custom analyzers extract domain-specific fields (invoice amounts, receipt dates, contract clauses) and serialize them as YAML front matter. Neither the built-in converters nor the Document Intelligence integration exposes fields.
- **Higher-quality document extraction**: Cloud-based layout analysis and OCR for scanned PDFs, complex tables, and multi-page documents.
- **Single API for all modalities**: One `cu_endpoint` handles documents, images, audio, and video with automatic analyzer routing.

| Capability | Built-in converters | Azure Document Intelligence | Azure Content Understanding |
|------------|---------------------|-----------------------------|-----------------------------|
| Document conversion | Offline, format-specific extraction | Cloud layout extraction | Cloud multimodal extraction |
| Structured fields | Not available | Not exposed by this integration | YAML front matter from analyzer fields |
| Custom analyzers | Not available | Not configurable in this integration | Supported with `cu_analyzer_id` |
| Audio and video | Basic audio, no video | Not supported | Audio and video analyzers |
| Cost | Local compute only | Billable Azure API calls | Billable Azure API calls |

**CLI:**

```bash
markitdown path-to-file.pdf --use-cu --cu-endpoint "<content_understanding_endpoint>"
```

**Python API:**

```python
from markitdown import MarkItDown

# Zero-config — auto-selects analyzer per file type
md = MarkItDown(cu_endpoint="<content_understanding_endpoint>")
result = md.convert("report.pdf") # documents → prebuilt-documentSearch
result = md.convert("meeting.mp4") # video → prebuilt-videoSearch
result = md.convert("call.wav") # audio → prebuilt-audioSearch
print(result.markdown)
```

**With a custom analyzer** (for domain-specific field extraction):

```python
from markitdown import MarkItDown

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_analyzer_id="my-invoice-analyzer",
)
result = md.convert("invoice.pdf")
print(result.markdown)
# Output includes YAML front matter with extracted fields:
# ---
# contentType: document
# fields:
#   VendorName: CONTOSO LTD.
#   InvoiceDate: '2019-11-15'
# ---
# <!-- page 1 -->
# ...
```

When `cu_analyzer_id` is set, the converter automatically scopes it to compatible file types based on the analyzer's modality. Incompatible types (e.g., audio files with a document analyzer) auto-route to default prebuilt analyzers.
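
The scoping rule above can be pictured with a small sketch. This is not the converter's actual code: `Modality`, `EXTENSION_MODALITY`, and `pick_analyzer` are hypothetical names, the extension mapping is an assumption made for illustration, and only the three prebuilt analyzer IDs are taken from the example earlier in this section.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class Modality(Enum):
    """Hypothetical modality labels, for illustration only."""

    DOCUMENT = "document"
    VIDEO = "video"
    AUDIO = "audio"


# Prebuilt analyzer IDs as named in the Python API example above.
DEFAULT_ANALYZERS = {
    Modality.DOCUMENT: "prebuilt-documentSearch",
    Modality.VIDEO: "prebuilt-videoSearch",
    Modality.AUDIO: "prebuilt-audioSearch",
}

# Assumed extension-to-modality mapping; the real converter may differ.
EXTENSION_MODALITY = {
    ".pdf": Modality.DOCUMENT,
    ".mp4": Modality.VIDEO,
    ".wav": Modality.AUDIO,
}


def pick_analyzer(
    filename: str,
    custom_id: Optional[str] = None,
    custom_modality: Optional[Modality] = None,
) -> str:
    """Use the custom analyzer only when its modality matches the file."""
    modality = EXTENSION_MODALITY[Path(filename).suffix.lower()]
    if custom_id is not None and custom_modality is modality:
        return custom_id
    # Incompatible or unset: fall back to the default prebuilt analyzer.
    return DEFAULT_ANALYZERS[modality]
```

Under these assumptions, a document analyzer applies to `invoice.pdf`, while `call.wav` falls back to the default audio analyzer.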

**Cost note:** Each `convert()` call for a CU-routed format is a billable Azure API call. Use `cu_file_types` to restrict which formats route to CU:

```python
from markitdown import MarkItDown
from markitdown.converters import ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # only PDFs use CU
)
```

More information about Azure Content Understanding can be found [here](https://learn.microsoft.com/azure/ai-services/content-understanding/).

### Python API

Basic usage in Python:
3 changes: 3 additions & 0 deletions packages/markitdown/pyproject.toml
@@ -47,6 +47,7 @@ all = [
  "SpeechRecognition",
  "youtube-transcript-api~=1.0.0",
  "azure-ai-documentintelligence",
  "azure-ai-contentunderstanding>=1.2.0b1",
  "azure-identity",
]
pptx = ["python-pptx"]
@@ -58,6 +59,8 @@ outlook = ["olefile"]
audio-transcription = ["pydub", "SpeechRecognition"]
youtube-transcription = ["youtube-transcript-api"]
az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
# >=1.2.0b1 required for to_llm_input() helper used by ContentUnderstandingConverter
az-content-understanding = ["azure-ai-contentunderstanding>=1.2.0b1", "azure-identity"]

[project.urls]
Documentation = "https://github.com/microsoft/markitdown#readme"
60 changes: 59 additions & 1 deletion packages/markitdown/src/markitdown/__main__.py
@@ -4,6 +4,7 @@
import argparse
import sys
import codecs
from typing import Any, Dict
from textwrap import dedent
from importlib.metadata import entry_points
from .__about__ import __version__
@@ -77,20 +78,47 @@ def main():
        help="Provide a hint about the file's charset (e.g, UTF-8).",
    )

    cloud_group = parser.add_mutually_exclusive_group()
    cloud_group.add_argument(
        "-d",
        "--use-docintel",
        action="store_true",
        help="Use Document Intelligence to extract text instead of offline conversion. Requires a valid Document Intelligence Endpoint.",
    )

    cloud_group.add_argument(
        "--use-cu",
        "--use-content-understanding",
        action="store_true",
        dest="use_cu",
        help="Use Azure Content Understanding to extract text. Requires --cu-endpoint.",
    )

    parser.add_argument(
        "-e",
        "--endpoint",
        type=str,
        help="Document Intelligence Endpoint. Required if using Document Intelligence.",
    )

    parser.add_argument(
        "--cu-endpoint",
        type=str,
        help="Content Understanding Endpoint. Required if using --use-cu.",
    )

    parser.add_argument(
        "--cu-analyzer",
        type=str,
        help="Content Understanding analyzer ID. If not specified, auto-selects by file type.",
    )

    parser.add_argument(
        "--cu-file-types",
        type=str,
        help="Comma-separated list of file types to route to Content Understanding (e.g., pdf,jpeg,mp4). If omitted, all supported types are routed.",
    )

    parser.add_argument(
        "-p",
        "--use-plugins",
@@ -183,6 +211,36 @@ def main():
        markitdown = MarkItDown(
            enable_plugins=args.use_plugins, docintel_endpoint=args.endpoint
        )
    elif args.use_cu:
        if args.cu_endpoint is None:
            _exit_with_error(
                "Content Understanding Endpoint (--cu-endpoint) is required when using --use-cu."
            )
        elif args.filename is None:
            _exit_with_error("Filename is required when using Content Understanding.")

        cu_kwargs: Dict[str, Any] = {
            "cu_endpoint": args.cu_endpoint,
        }
        if args.cu_analyzer is not None:
            cu_kwargs["cu_analyzer_id"] = args.cu_analyzer
        if args.cu_file_types is not None:
            # Parse comma-separated file types into ContentUnderstandingFileType list
            from .converters import ContentUnderstandingFileType

            type_names = [
                t.strip().lower() for t in args.cu_file_types.split(",") if t.strip()
            ]
            cu_types = []
            for name in type_names:
                # Try matching by value (e.g., "pdf", "jpeg", "mp4")
                try:
                    cu_types.append(ContentUnderstandingFileType(name))
                except ValueError:
                    _exit_with_error(f"Unknown file type: {name}")
            cu_kwargs["cu_file_types"] = cu_types

        markitdown = MarkItDown(enable_plugins=args.use_plugins, **cu_kwargs)
    else:
        markitdown = MarkItDown(enable_plugins=args.use_plugins)
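
The `--cu-file-types` parsing in this hunk can be tried in isolation with the sketch below. It is a possible variant, not the code in this PR: `FileType` is a hypothetical stand-in for `ContentUnderstandingFileType` (only the three values mentioned in the help text are assumed), and `parse_file_types` is an illustrative helper that reports every unknown name at once instead of exiting on the first one.

```python
from enum import Enum
from typing import List


class FileType(str, Enum):
    """Hypothetical stand-in for ContentUnderstandingFileType."""

    PDF = "pdf"
    JPEG = "jpeg"
    MP4 = "mp4"


def parse_file_types(raw: str) -> List[FileType]:
    """Turn a comma-separated string such as "pdf, MP4" into enum members."""
    # Same normalisation as the hunk: trim, lowercase, skip empty entries.
    names = [t.strip().lower() for t in raw.split(",") if t.strip()]
    parsed: List[FileType] = []
    unknown: List[str] = []
    for name in names:
        try:
            parsed.append(FileType(name))  # lookup by enum value
        except ValueError:
            unknown.append(name)
    if unknown:
        raise ValueError("Unknown file type(s): " + ", ".join(unknown))
    return parsed
```

Collecting the unknown names before raising is a design choice of the sketch: a caller who mistypes two entries sees both in a single error message.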

23 changes: 23 additions & 0 deletions packages/markitdown/src/markitdown/_markitdown.py
@@ -38,6 +38,7 @@
    ZipConverter,
    EpubConverter,
    DocumentIntelligenceConverter,
    ContentUnderstandingConverter,
    CsvConverter,
)

@@ -225,6 +226,28 @@ def enable_builtins(self, **kwargs) -> None:
                    DocumentIntelligenceConverter(**docintel_args),
                )

            # Register Content Understanding converter at the top of the stack if endpoint is provided
            cu_endpoint = kwargs.get("cu_endpoint")
            if cu_endpoint is not None:
                cu_args: Dict[str, Any] = {}
                cu_args["endpoint"] = cu_endpoint

                cu_credential = kwargs.get("cu_credential")
                if cu_credential is not None:
                    cu_args["credential"] = cu_credential

                cu_analyzer_id = kwargs.get("cu_analyzer_id")
                if cu_analyzer_id is not None:
                    cu_args["analyzer_id"] = cu_analyzer_id

                cu_file_types = kwargs.get("cu_file_types")
                if cu_file_types is not None:
                    cu_args["file_types"] = cu_file_types

                self.register_converter(
                    ContentUnderstandingConverter(**cu_args),
                )

            self._builtins_enabled = True
        else:
            warn("Built-in converters are already enabled.", RuntimeWarning)
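
The keyword forwarding in this hunk (copy each `cu_*` option to the converter's argument name, skipping anything left as `None`) could also be written as a table-driven helper. The sketch below is one possible shape, not the PR's code: `_CU_KWARG_MAP` and `build_cu_args` are hypothetical names, and the caller would still check that `cu_endpoint` is set before registering the converter.

```python
from typing import Any, Dict

# Maps MarkItDown keyword names to the converter argument names used in the hunk.
_CU_KWARG_MAP = {
    "cu_endpoint": "endpoint",
    "cu_credential": "credential",
    "cu_analyzer_id": "analyzer_id",
    "cu_file_types": "file_types",
}


def build_cu_args(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the Content Understanding options that were actually provided."""
    return {
        target: kwargs[source]
        for source, target in _CU_KWARG_MAP.items()
        if kwargs.get(source) is not None  # omit unset options entirely
    }
```

Leaving unset options out of the result, rather than passing `None`, lets the converter's own defaults apply, which matches the behaviour of the explicit `if ... is not None` checks.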
6 changes: 6 additions & 0 deletions packages/markitdown/src/markitdown/converters/__init__.py
@@ -21,6 +21,10 @@
    DocumentIntelligenceConverter,
    DocumentIntelligenceFileType,
)
from ._cu_converter import (
    ContentUnderstandingConverter,
    ContentUnderstandingFileType,
)
from ._epub_converter import EpubConverter
from ._csv_converter import CsvConverter

@@ -43,6 +47,8 @@
    "ZipConverter",
    "DocumentIntelligenceConverter",
    "DocumentIntelligenceFileType",
    "ContentUnderstandingConverter",
    "ContentUnderstandingFileType",
    "EpubConverter",
    "CsvConverter",
]