Metadata-Version: 2.1
Name: xpath_filter
Version: 1.0.0
Summary: XPath filter of HTML files
Home-page: https://github.com/CarlosAdp/xpath-filter
Author: Carlos Pinto
Author-email: carlos.adpinto@gmail.com
License: Apache License 2.0
Keywords: xpath html scraping webscraping scraper webscraper
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Text Processing :: Markup :: HTML
Description-Content-Type: text/markdown
Provides-Extra: test
Provides-Extra: dev
License-File: LICENSE

# xpath-filter

[![version](https://img.shields.io/badge/version-1.0.0-blue)](https://pypi.org/project/xpath-filter)
[![tests](https://img.shields.io/badge/tests-passed-green)](tests/test_xpath_filter.py)

Filter HTML files using xpath mappings.

## Installation

Install `xpath-filter` using pip:

```shell
pip install xpath-filter
```

## Usage

Import the `xpath_filter` function from the `xpath_filter` module. Find below
some use cases.

### Filtering HTML file

```python
>>> xpaths = {
...     'article': {
...         'xpath': '//div[@class="article"]',
...         'matches': 'all',
...         'elements': {
...             'author': './@data-author',
...             'content': './p/text()'
...         }
...     }
... }
>>> xpath_filter('index.html', xpaths)
```

Result

```python
{'article': [{'author': 'Ana', 'Content': 'Awesome'}, {'author': 'Bob', 'Content': 'Bad'}]}
```

### Filtering HTML file from a YAML xpaths definition.

File at "xpaths.yml":

```yml
article:
    xpath: //div[@class="article"]
    matches: first
    elements:
        author: './@data-author'
        content: ./p/text()
```

Code:

```python
>>> xpath_filter('index.html', 'xpaths.yml')
```

Result

```python
{'article': [{'author': 'Ana', 'Content': 'Awesome'}, {'author': 'Bob', 'Content': 'Bad'}]}
```

### Simplified filtering

By definining only the xpath of an HTML element, only its first match is returned and no inner element is searched.

```python
>>> xpath_filter('index.html', {'article': '//div[@class="article"]'})
>>> xpath_filter('index.html', {'article': '//div[@class="article"]/p/text()'})
```

Result

```python
{'article': <Element div at 0x1f08369ea80>}
{'article': 'Awesome'}
```
