Metadata-Version: 2.4
Name: html-to-markdown
Version: 1.9.0
Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
License: MIT
Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git
Keywords: beautifulsoup,cli-tool,converter,html,html2markdown,markdown,markup,text-extraction,text-processing
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Markup
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: beautifulsoup4>=4.13.4
Requires-Dist: nh3>=0.3
Provides-Extra: lxml
Requires-Dist: lxml>=6; extra == "lxml"
Dynamic: license-file

# html-to-markdown

A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork
of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for
Python 3.9+.

## Features

- **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
- **Enhanced Table Support**: Advanced handling of merged cells with rowspan/colspan support for better table representation
- **Type Safety**: Strict MyPy adherence with comprehensive type hints
- **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
- **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
- **Highlight Support**: Multiple styles for highlighted text (`<mark>` elements)
- **Task List Support**: Converts HTML checkboxes to GitHub-compatible task list syntax
- **Flexible Configuration**: 20+ configuration options for customizing conversion behavior
- **CLI Tool**: Full-featured command-line interface with all API options exposed
- **Custom Converters**: Extensible converter system for custom HTML tag handling
- **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
- **Comprehensive Test Coverage**: 91%+ test coverage with 623+ comprehensive tests

## Installation

```shell
pip install html-to-markdown
```

### Optional lxml Parser

For improved performance, you can install with the optional lxml parser:

```shell
pip install html-to-markdown[lxml]
```

The lxml parser offers:

- **~30% faster HTML parsing** compared to the default html.parser
- Better handling of malformed HTML
- More robust parsing for complex documents

Once installed, lxml is automatically used by default for better performance. You can explicitly specify a parser if needed:

```python
result = convert_to_markdown(html)  # Auto-detects: uses lxml if available, otherwise html.parser
result = convert_to_markdown(html, parser="lxml")  # Force lxml (requires installation)
result = convert_to_markdown(html, parser="html.parser")  # Force built-in parser
```

## Quick Start

Convert HTML to Markdown with a single function call:

```python
from html_to_markdown import convert_to_markdown

html = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample Document</title>
    <meta name="description" content="A sample HTML document">
</head>
<body>
    <article>
        <h1>Welcome</h1>
        <p>This is a <strong>sample</strong> with a <a href="https://example.com">link</a>.</p>
        <p>Here's some <mark>highlighted text</mark> and a task list:</p>
        <ul>
            <li><input type="checkbox" checked> Completed task</li>
            <li><input type="checkbox"> Pending task</li>
        </ul>
    </article>
</body>
</html>
"""

markdown = convert_to_markdown(html)
print(markdown)
```

Output:

```markdown
<!--
title: Sample Document
meta-description: A sample HTML document
-->

# Welcome

This is a **sample** with a [link](https://example.com).

Here's some ==highlighted text== and a task list:

* [x] Completed task
* [ ] Pending task
```

### Working with BeautifulSoup

If you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:

```python
from bs4 import BeautifulSoup
from html_to_markdown import convert_to_markdown

# Configure BeautifulSoup with your preferred parser
soup = BeautifulSoup(html, "lxml")  # Note: lxml requires additional installation
markdown = convert_to_markdown(soup)
```

## Advanced Usage

### Customizing Conversion Options

The library offers extensive customization through various options:

```python
from html_to_markdown import convert_to_markdown

html = "<div>Your content here...</div>"
markdown = convert_to_markdown(
    html,
    # Document processing
    extract_metadata=True,  # Extract metadata as comment header
    convert_as_inline=False,  # Treat as block-level content
    strip_newlines=False,  # Preserve original newlines
    # Formatting options
    heading_style="atx",  # Use # style headers
    strong_em_symbol="*",  # Use * for bold/italic
    bullets="*+-",  # Define bullet point characters
    highlight_style="double-equal",  # Use == for highlighted text
    # Text processing
    wrap=True,  # Enable text wrapping
    wrap_width=100,  # Set wrap width
    escape_asterisks=True,  # Escape * characters
    escape_underscores=True,  # Escape _ characters
    escape_misc=True,  # Escape other special characters
    # Code blocks
    code_language="python",  # Default code block language
    # Streaming for large documents
    stream_processing=False,  # Enable for memory efficiency
    chunk_size=1024,  # Chunk size for streaming
)
```

### Custom Converters

You can provide your own conversion functions for specific HTML tags:

```python
from bs4.element import Tag
from html_to_markdown import convert_to_markdown

# Define a custom converter for the <b> tag
def custom_bold_converter(*, tag: Tag, text: str, **kwargs) -> str:
    return f"IMPORTANT: {text}"

html = "<p>This is a <b>bold statement</b>.</p>"
markdown = convert_to_markdown(html, custom_converters={"b": custom_bold_converter})
print(markdown)
# Output: This is a IMPORTANT: bold statement.
```

Custom converters take precedence over the built-in converters and can be used alongside other configuration options.

### Enhanced Table Support

The library now provides better handling of complex tables with merged cells:

```python
from html_to_markdown import convert_to_markdown

# HTML table with merged cells
html = """
<table>
    <tr>
        <th rowspan="2">Category</th>
        <th colspan="2">Sales Data</th>
    </tr>
    <tr>
        <th>Q1</th>
        <th>Q2</th>
    </tr>
    <tr>
        <td>Product A</td>
        <td>$100K</td>
        <td>$150K</td>
    </tr>
</table>
"""

markdown = convert_to_markdown(html)
print(markdown)
```

Output:

```markdown
| Category | Sales Data |  |
| --- | --- | --- |
| | Q1 | Q2 |
| Product A | $100K | $150K |
```

The library handles:

- **Rowspan**: Inserts empty cells in subsequent rows
- **Colspan**: Properly manages column spanning
- **Clean output**: Removes `<colgroup>` and `<col>` elements that have no Markdown equivalent

### Key Configuration Options

| Option              | Type | Default          | Description                                                     |
| ------------------- | ---- | ---------------- | --------------------------------------------------------------- |
| `extract_metadata`  | bool | `True`           | Extract document metadata as comment header                     |
| `convert_as_inline` | bool | `False`          | Treat content as inline elements only                           |
| `heading_style`     | str  | `'underlined'`   | Header style (`'underlined'`, `'atx'`, `'atx_closed'`)          |
| `highlight_style`   | str  | `'double-equal'` | Highlight style (`'double-equal'`, `'html'`, `'bold'`)          |
| `stream_processing` | bool | `False`          | Enable streaming for large documents                            |
| `parser`            | str  | auto-detect      | BeautifulSoup parser (auto-detects `'lxml'` or `'html.parser'`) |
| `autolinks`         | bool | `True`           | Auto-convert URLs to Markdown links                             |
| `bullets`           | str  | `'*+-'`          | Characters to use for bullet points                             |
| `escape_asterisks`  | bool | `True`           | Escape * characters                                             |
| `wrap`              | bool | `False`          | Enable text wrapping                                            |
| `wrap_width`        | int  | `80`             | Text wrap width                                                 |

For a complete list of all 20+ options, see the [Configuration Reference](#configuration-reference) section below.

## CLI Usage

Convert HTML files directly from the command line with full access to all API options:

```shell
# Convert a file
html_to_markdown input.html > output.md

# Process stdin
cat input.html | html_to_markdown > output.md

# Use custom options
html_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md

# Advanced options
html_to_markdown \
  --no-extract-metadata \
  --convert-as-inline \
  --highlight-style html \
  --stream-processing \
  --show-progress \
  input.html > output.md
```

### Key CLI Options

```shell
# Content processing
--convert-as-inline          # Treat content as inline elements
--no-extract-metadata        # Disable metadata extraction
--strip-newlines             # Remove newlines from input

# Formatting
--heading-style {atx,atx_closed,underlined}
--highlight-style {double-equal,html,bold}
--strong-em-symbol {*,_}
--bullets CHARS              # e.g., "*+-"

# Text escaping
--no-escape-asterisks        # Disable * escaping
--no-escape-underscores      # Disable _ escaping
--no-escape-misc             # Disable misc character escaping

# Large document processing
--stream-processing          # Enable streaming mode
--chunk-size SIZE            # Set chunk size (default: 1024)
--show-progress              # Show progress for large files

# Text wrapping
--wrap                       # Enable text wrapping
--wrap-width WIDTH           # Set wrap width (default: 80)
```

View all available options:

```shell
html_to_markdown --help
```

## Migration from Markdownify

For existing projects using Markdownify, a compatibility layer is provided:

```python
# Old code
from markdownify import markdownify as md

# New code - works the same way
from html_to_markdown import markdownify as md
```

The `markdownify` function is an alias for `convert_to_markdown` and provides identical functionality.

**Note**: While the compatibility layer ensures existing code continues to work, new projects should use `convert_to_markdown` directly as it provides better type hints and clearer naming.

## Configuration Reference

Complete list of all configuration options:

### Document Processing

- `extract_metadata` (bool, default: `True`): Extract document metadata (title, meta tags) as comment header
- `convert_as_inline` (bool, default: `False`): Treat content as inline elements only (no block elements)
- `strip_newlines` (bool, default: `False`): Remove newlines from HTML input before processing
- `convert` (list, default: `None`): List of HTML tags to convert (None = all supported tags)
- `strip` (list, default: `None`): List of HTML tags to remove from output
- `custom_converters` (dict, default: `None`): Mapping of HTML tag names to custom converter functions

### Streaming Support

- `stream_processing` (bool, default: `False`): Enable streaming processing for large documents
- `chunk_size` (int, default: `1024`): Size of chunks when using streaming processing
- `chunk_callback` (callable, default: `None`): Callback function called with each processed chunk
- `progress_callback` (callable, default: `None`): Callback function called with (processed_bytes, total_bytes)

### Text Formatting

- `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)
- `highlight_style` (str, default: `'double-equal'`): Style for highlighted text (`'double-equal'`, `'html'`, `'bold'`)
- `strong_em_symbol` (str, default: `'*'`): Symbol for strong/emphasized text (`'*'` or `'_'`)
- `bullets` (str, default: `'*+-'`): Characters to use for bullet points in lists
- `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
- `sub_symbol` (str, default: `''`): Custom symbol for subscript text
- `sup_symbol` (str, default: `''`): Custom symbol for superscript text

### Text Escaping

- `escape_asterisks` (bool, default: `True`): Escape `*` characters to prevent unintended formatting
- `escape_underscores` (bool, default: `True`): Escape `_` characters to prevent unintended formatting
- `escape_misc` (bool, default: `True`): Escape miscellaneous characters to prevent Markdown conflicts

### Links and Media

- `autolinks` (bool, default: `True`): Automatically convert valid URLs to Markdown links
- `default_title` (bool, default: `False`): Use default titles for elements like links
- `keep_inline_images_in` (list, default: `None`): Tags where inline images should be preserved

### Code Blocks

- `code_language` (str, default: `''`): Default language identifier for fenced code blocks
- `code_language_callback` (callable, default: `None`): Function to dynamically determine code block language

### Text Wrapping

- `wrap` (bool, default: `False`): Enable text wrapping
- `wrap_width` (int, default: `80`): Width for text wrapping

## Contribution

This library is open to contribution. Feel free to open issues or submit PRs. Its better to discuss issues before
submitting PRs to avoid disappointment.

### Local Development

1. Clone the repo

1. Install system dependencies (requires Python 3.9+)

1. Install the project dependencies:

    ```shell
    uv sync --all-extras --dev
    ```

1. Install pre-commit hooks:

    ```shell
    uv run pre-commit install
    ```

1. Run tests to ensure everything works:

    ```shell
    uv run pytest
    ```

1. Run code quality checks:

    ```shell
    uv run pre-commit run --all-files
    ```

1. Make your changes and submit a PR

### Development Commands

```shell
# Run tests with coverage
uv run pytest --cov=html_to_markdown --cov-report=term-missing

# Lint and format code
uv run ruff check --fix .
uv run ruff format .

# Type checking
uv run mypy

# Test CLI during development
uv run python -m html_to_markdown input.html

# Build package
uv build
```

## Performance

The library is optimized for performance with several key features:

- **Efficient ancestor caching**: Reduces repeated DOM traversals using context-aware caching
- **Streaming support**: Process large documents in chunks to minimize memory usage
- **Optional lxml parser**: ~30% faster parsing for complex HTML documents
- **Optimized string operations**: Minimizes string concatenations in hot paths

Typical throughput: ~2 MB/s for regular processing on modern hardware.

## License

This library uses the MIT license.

## HTML5 Element Support

This library provides comprehensive support for all modern HTML5 elements:

### Semantic Elements

- `<article>`, `<aside>`, `<figcaption>`, `<figure>`, `<footer>`, `<header>`, `<hgroup>`, `<main>`, `<nav>`, `<section>`
- `<abbr>`, `<bdi>`, `<bdo>`, `<cite>`, `<data>`, `<dfn>`, `<kbd>`, `<mark>`, `<samp>`, `<small>`, `<time>`, `<var>`
- `<del>`, `<ins>` (strikethrough and insertion tracking)

### Form Elements

- `<form>`, `<fieldset>`, `<legend>`, `<label>`, `<input>`, `<textarea>`, `<select>`, `<option>`, `<optgroup>`
- `<button>`, `<datalist>`, `<output>`, `<progress>`, `<meter>`
- Task list support: `<input type="checkbox">` converts to `- [x]` / `- [ ]`

### Table Elements

- `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`
- **Merged cell support**: Handles `rowspan` and `colspan` attributes for complex table layouts
- **Smart cleanup**: Automatically handles table styling elements for clean Markdown output

### Interactive Elements

- `<details>`, `<summary>`, `<dialog>`, `<menu>`

### Ruby Annotations

- `<ruby>`, `<rb>`, `<rt>`, `<rtc>`, `<rp>` (for East Asian typography)

### Media Elements

- `<img>`, `<picture>`, `<audio>`, `<video>`, `<iframe>`
- SVG support with data URI conversion

### Math Elements

- `<math>` (MathML support)

## Advanced Table Support

The library provides sophisticated handling of complex HTML tables, including merged cells and proper structure conversion:

```python
from html_to_markdown import convert_to_markdown

# Complex table with merged cells
html = """
<table>
    <caption>Sales Report</caption>
    <tr>
        <th rowspan="2">Product</th>
        <th colspan="2">Quarterly Sales</th>
    </tr>
    <tr>
        <th>Q1</th>
        <th>Q2</th>
    </tr>
    <tr>
        <td>Widget A</td>
        <td>$50K</td>
        <td>$75K</td>
    </tr>
</table>
"""

result = convert_to_markdown(html)
```

**Features:**

- **Merged cell support**: Handles `rowspan` and `colspan` attributes intelligently
- **Clean output**: Automatically removes table styling elements that don't translate to Markdown
- **Structure preservation**: Maintains table hierarchy and relationships

## Acknowledgments

Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.
