Metadata-Version: 2.3
Name: py-pinyin-split
Version: 4.0.0
Summary: Library for splitting Hanyu Pinyin phrases into all valid syllable combinations
Project-URL: Documentation, https://github.com/lstrobel/py-pinyin-split#readme
Project-URL: Issues, https://github.com/lstrobel/py-pinyin-split/issues
Project-URL: Source, https://github.com/lstrobel/py-pinyin-split
Author-email: lstrobel <mail@lstrobel.com>, Thomas Lee <thomaslee@throput.com>
Maintainer-email: lstrobel <mail@lstrobel.com>
License: MIT
Keywords: chinese,pinyin
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Utilities
Requires-Python: <3.13,>=3.8
Requires-Dist: marisa-trie>=1.2.1
Requires-Dist: nltk==3.9.1
Description-Content-Type: text/markdown

# py-pinyin-split

A Python library for splitting Hanyu Pinyin words into syllables. Built on [NLTK's](https://github.com/nltk/nltk) [tokenizer interface](https://www.nltk.org/api/nltk.tokenize.html), it handles standard syllables defined in the [Pinyin Table](https://en.wikipedia.org/wiki/Pinyin_table) and supports tone marks.


Based originally on [pinyinsplit](https://github.com/throput/pinyinsplit) by [@tomlee](https://github.com/tomlee).

PyPI: https://pypi.org/project/py-pinyin-split/

## Installation

```bash
pip install py-pinyin-split
```

## Usage

Instantiate a tokenizer and split away.

The tokenizer can handle standard Hanyu Pinyin with whitespaces and punctuation. However, invalid pinyin syllables will raise a `ValueError`

The tokenizer uses syllable frequency data to resolve ambiguous splits.

```python
from py_pinyin_split import PinyinTokenizer

tokenizer = PinyinTokenizer()

# Basic splitting
tokenizer.tokenize("nǐhǎo")  # ['nǐ', 'hǎo']
tokenizer.tokenize("Běijīng")  # ['Běi', 'jīng']

# Handles whitespace and punctuation
tokenizer.tokenize("Nǐ hǎo ma?")  # ['Nǐ', 'hǎo', 'ma', '?']
tokenizer.tokenize("Wǒ hěn hǎo!")  # ['Wǒ', 'hěn', 'hǎo', '!']

# Handles ambiguous splits using frequency data
tokenizer.tokenize("xian")  # ['xian'] not ['xi', 'an']
tokenizer.tokenize("wanan")  # ['wan', 'an'] not ['wa', 'nan']

# Tone marks or punctuation help resolve ambiguity
tokenizer.tokenize("xīān")  # ['xī', 'ān']
tokenizer.tokenize("xián")  # ['xián']
tokenizer.tokenize("Xī'ān") # ["Xī", "'", "ān"]

# Raises ValueError for invalid pinyin
tokenizer.tokenize("hello")  # ValueError

# Optional support for non-standard syllables
tokenizer = PinyinTokenizer(include_nonstandard=True)
tokenizer.tokenize("duang")  # ['duang']
```

## Related Projects
- https://pypi.org/project/pinyintokenizer/
- https://pypi.org/project/pypinyin/
- https://github.com/throput/pinyinsplit
