Metadata-Version: 2.4
Name: html-to-markdown
Version: 2.4.1
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Rust
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Markup
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Typing :: Typed
License-File: LICENSE
Summary: High-performance HTML to Markdown converter powered by Rust with a clean Python API
Keywords: cli-tool,converter,html,html2markdown,html5,markdown,markup,parser,rust,text-processing
Home-Page: https://github.com/Goldziher/html-to-markdown
Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git

# html-to-markdown

High-performance HTML to Markdown converter with a clean Python API (powered by a Rust core). Wheels are published for Linux, macOS, and Windows.

[![PyPI version](https://badge.fury.io/py/html-to-markdown.svg)](https://github.com/Goldziher/html-to-markdown)
[![Rust crate](https://img.shields.io/crates/v/html-to-markdown-rs.svg)](https://github.com/Goldziher/html-to-markdown)
[![Python Versions](https://img.shields.io/pypi/pyversions/html-to-markdown.svg)](https://github.com/Goldziher/html-to-markdown)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)

## Installation

```bash
pip install html-to-markdown
```

## Performance Snapshot

Apple M4 • Real Wikipedia documents • `convert()` (Python)

| Document            | Size  | Latency | Throughput | Docs/sec |
| ------------------- | ----- | ------- | ---------- | -------- |
| Lists (Timeline)    | 129KB | 0.62ms  | 208 MB/s   | 1,613    |
| Tables (Countries)  | 360KB | 2.02ms  | 178 MB/s   | 495      |
| Mixed (Python wiki) | 656KB | 4.56ms  | 144 MB/s   | 219      |

> V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2's Rust engine delivers 60–80× higher throughput.

## Quick Start

```python
from html_to_markdown import convert

html = """
<h1>Welcome</h1>
<p>This is <strong>fast</strong> Rust-powered conversion!</p>
<ul>
    <li>Blazing fast</li>
    <li>Type safe</li>
    <li>Easy to use</li>
</ul>
"""

markdown = convert(html)
print(markdown)
```

## Configuration (v2 API)

```python
from html_to_markdown import ConversionOptions, convert

options = ConversionOptions(
    heading_style="atx",
    list_indent_width=2,
    bullets="*+-",
)
options.escape_asterisks = True
options.code_language = "python"
options.extract_metadata = True

markdown = convert(html, options)
```

### HTML Preprocessing

```python
from html_to_markdown import ConversionOptions, PreprocessingOptions, convert

options = ConversionOptions(
    preprocessing=PreprocessingOptions(enabled=True, preset="aggressive"),
)

markdown = convert(scraped_html, options)
```

### Inline Image Extraction

```python
from html_to_markdown import InlineImageConfig, convert_with_inline_images

markdown, inline_images, warnings = convert_with_inline_images(
    '<p><img src="data:image/png;base64,...==" alt="Pixel" width="1" height="1"></p>',
    image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
)

if inline_images:
    first = inline_images[0]
    print(first["format"], first["dimensions"], first["attributes"])  # e.g. "png", (1, 1), {"width": "1"}
```

Each inline image is returned as a typed dictionary (`bytes` payload, metadata, and relevant HTML attributes). Warnings are human-readable skip reasons.

### hOCR (HTML OCR) Support

```python
from html_to_markdown import ConversionOptions, convert

# Default: emit structured Markdown directly
markdown = convert(hocr_html)

# hOCR documents are detected automatically; tables are reconstructed without extra configuration.
markdown = convert(hocr_html)
```

## CLI (same engine)

```bash
pipx install html-to-markdown  # or: pip install html-to-markdown

html-to-markdown page.html > page.md
cat page.html | html-to-markdown --heading-style atx > page.md
```

## API Surface

### `ConversionOptions`

Key fields (see docstring for full matrix):

- `heading_style`: `"underlined" | "atx" | "atx_closed"`
- `list_indent_width`: spaces per indent level (default 2)
- `bullets`: cycle of bullet characters (`"*+-"`)
- `strong_em_symbol`: `"*"` or `"_"`
- `code_language`: default fenced code block language
- `wrap`, `wrap_width`: wrap Markdown output
- `strip_tags`: remove specific HTML tags
- `preprocessing`: `PreprocessingOptions`
- `encoding`: input character encoding (informational)

### `PreprocessingOptions`

- `enabled`: enable HTML sanitisation
- `preset`: `"minimal" | "standard" | "aggressive"`
- `remove_navigation`, `remove_forms`

### `InlineImageConfig`

- `max_decoded_size_bytes`: reject larger payloads
- `filename_prefix`: generated name prefix (`embedded_image` default)
- `capture_svg`: collect inline `<svg>` (default `True`)
- `infer_dimensions`: decode raster images to obtain dimensions (default `False`)

## Performance: V2 vs V1 Compatibility Layer

### ⚠️ Important: Always Use V2 API

The v2 API (`convert()`) is **strongly recommended** for all code. The v1 compatibility layer adds significant overhead and should only be used for gradual migration:

```python
# ✅ RECOMMENDED - V2 Direct API (Fast)
from html_to_markdown import convert, ConversionOptions

markdown = convert(html)  # Simple conversion - FAST
markdown = convert(html, ConversionOptions(heading_style="atx"))  # With options - FAST

# ❌ AVOID - V1 Compatibility Layer (Slow)
from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(html, heading_style="atx")  # Adds 77% overhead
```

### Performance Comparison

Benchmarked on Apple M4 with 25-paragraph HTML document:

| API                      | ops/sec          | Relative Performance | Recommendation      |
| ------------------------ | ---------------- | -------------------- | ------------------- |
| **V2 API** (`convert()`) | **129,822**      | baseline             | ✅ **Use this**     |
| **V1 Compat Layer**      | **67,673**       | **77% slower**       | ⚠️ Migration only   |
| **CLI**                  | **150-210 MB/s** | Fastest              | ✅ Batch processing |

The v1 compatibility layer creates extra Python objects and performs additional conversions, significantly impacting performance.

### When to Use Each

- **V2 API (`convert()`)**: All new code, production systems, performance-critical applications ← **Use this**
- **V1 Compat (`convert_to_markdown()`)**: Only for gradual migration from legacy codebases
- **CLI (`html-to-markdown`)**: Batch processing, shell scripts, maximum throughput

## v1 Compatibility

A compatibility layer is provided to ease migration from v1.x:

- **Compat shim**: `html_to_markdown.v1_compat` exposes `convert_to_markdown`, `convert_to_markdown_stream`, and `markdownify`. Keyword mappings are listed in the [changelog](CHANGELOG.md#v200).
- **⚠️ Performance warning**: These compatibility functions add 77% overhead. Migrate to v2 API as soon as possible.
- **CLI**: The Rust CLI replaces the old Python script. New flags are documented via `html-to-markdown --help`.
- **Removed options**: `code_language_callback`, `strip`, and streaming APIs were removed; use `ConversionOptions`, `PreprocessingOptions`, and the inline-image helpers instead.

## Links

- GitHub: [https://github.com/Goldziher/html-to-markdown](https://github.com/Goldziher/html-to-markdown)
- Discord: [https://discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
- Kreuzberg ecosystem: [https://kreuzberg.dev](https://kreuzberg.dev)

## License

MIT License – see [LICENSE](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE).

## Support

If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/Goldziher).

