Metadata-Version: 2.2
Name: llm-obfuscator
Version: 0.1.0
Summary: A tool for obfuscating text by manipulating token IDs while preserving token count and structure
Home-page: https://github.com/yourusername/llm-obfusicator
Author: Alex
Author-email: your.email@example.com
Project-URL: Bug Reports, https://github.com/yourusername/llm-obfusicator/issues
Project-URL: Source, https://github.com/yourusername/llm-obfusicator
Keywords: llm,tokenization,obfuscation,nlp,ai
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tiktoken>=0.5.0
Requires-Dist: transformers>=4.30.0
Requires-Dist: numpy>=1.20.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: project-url
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# LLM Token Obfuscator

A tool for obfuscating text by manipulating token IDs while preserving token count and structure. Originally developed for benchmarking LLM inference performance and prefix caching behavior by generating test data that maintains token patterns but with obscured text.

## Overview

This project provides a system for obfuscating text by applying a shift to token IDs. The obfuscation is reversible and preserves the token count, making it useful for:

- Testing LLM systems with obfuscated content
- Benchmarking tokenization performance
- Creating privacy-preserving datasets
- Generating synthetic text with realistic token distributions

Example of obfuscated text patterns:

```
Original: The quick brown fox jumps
Obfuscated: eng($_ ét rl manga

Original: The quick brown fox runs
Obfuscated: eng($_ ét rl Android
```

Note how the common prefix "The quick brown" is obfuscated to "eng($\_ ét" in both cases, preserving the pattern.

## Installation

### From PyPI (Recommended)

```bash
pip install llm-obfuscator
```

### From Source

1. Clone the repository:

```bash
git clone https://github.com/yourusername/llm-obfusicator.git
cd llm-obfusicator
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Install the package in development mode:

```bash
pip install -e .
```

## Usage

### Command Line Interface

The package provides a command-line interface for easy use:

```bash
# Tokenize text
llm-obfuscator tokenize gpt-4 "Hello, world!"

# Obfuscate text
llm-obfuscator obfuscate gpt-4 "Hello, world!"

# Obfuscate text with a fixed shift
llm-obfuscator obfuscate gpt-4 "Hello, world!" --shift 42
```

### Python API

```python
from llm_obfuscator import obfuscate_text, tokenize_text

# Obfuscate text using a specific model's tokenizer
obfuscated = obfuscate_text("gpt-4", "Hello, world!")
print(obfuscated)

# Use a fixed shift value for deterministic results
obfuscated = obfuscate_text("gpt-4", "Hello, world!", shift=42)
print(obfuscated)

# Tokenize text
tokens = tokenize_text("gpt-4", "Hello, world!")
print(tokens)
```

### Supported Models

The system supports both OpenAI and HuggingFace tokenizers:

- OpenAI models: `gpt-4`, `gpt-3.5-turbo`, `cl100k_base`, etc.
- HuggingFace models: `gpt2`, `bert-base-uncased`, etc.

## Testing

The project includes several test suites to validate the obfuscation system:

### Running All Tests

The easiest way to run all tests is to use the provided shell script:

```bash
# Make the script executable (if needed)
chmod +x run_all_tests.sh

# Run all tests
./run_all_tests.sh
```

This will run all test files in sequence, including unit tests and specialized test scripts.

### Running Basic Tests

```bash
# Run all tests
python -m pytest tests/

# Run specific test file
python -m pytest tests/test.py
```

### Specialized Test Scripts

The project includes specialized test scripts for different aspects of the obfuscation system:

```bash
# Test with real-world examples
python tests/test_real_world.py

# Test obfuscation demonstration
python tests/test_obfuscation.py

# Test mathematical properties
python tests/test_mapping_properties.py
```

## Validation

The obfuscation system has been validated to ensure:

1. **Token Count Preservation**: The number of tokens remains the same after obfuscation
2. **One-to-One Mapping**: The obfuscation is a bijective (one-to-one) mapping
3. **Frequency Preservation**: Token frequency distributions are preserved
4. **Reversibility**: The original text can be recovered by applying the reverse shift

## How It Works

The obfuscation process works as follows:

1. Text is tokenized using the specified model's tokenizer
2. Each token ID is shifted by a fixed amount (either specified or randomly generated)
3. The shifted tokens are detokenized back to text

The shift operation is performed modulo the vocabulary size (typically 50,000) to ensure all tokens remain within the valid vocabulary range.

> **Note**: The token count preservation has a known margin of error of up to 8% for obscured texts. We are working on improving this.

## License

[MIT License](LICENSE)
