Metadata-Version: 2.1
Name: small-web-dataset
Version: 0.0.2
Summary: Process all the RSS and Atom feeds from the Small Web feeds list, validate them, generate statistics and eventually more.
Home-page: https://github.com/fgiasson/small-web-dataset
Author: Frederick Giasson
Author-email: Frederick Giasson <fred@fgiasson.com>
License: GNU GPLv3
Project-URL: Homepage, https://github.com/fgiasson/small-web-dataset
Project-URL: Bug Tracker, https://github.com/fgiasson/small-web-dataset/issues
Keywords: nbdev jupyter notebook python
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer[all]
Requires-Dist: python-dotenv
Requires-Dist: requests
Requires-Dist: feedparser==6.0.10
Requires-Dist: transformers
Requires-Dist: torch
Requires-Dist: langdetect
Provides-Extra: dev

# small-web-dataset

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

The Small Web Dataset is a command line tool used to generate a dataset
by aggregating all the data from the [Kagi Small Web
index](https://github.com/kagisearch/smallweb/blob/main/smallweb.txt).

What is the Small Web? The Small Web is the web of independent websites
that are not part of the big tech platforms. Here are some more
references about the concept
\[[1](https://neustadt.fr/essays/the-small-web/)\]\[[2](https://benhoyt.com/writings/the-small-web-is-beautiful/)\]\[[3](https://smallweb.page/why)\]\[[4](https://ar.al/2020/08/07/what-is-the-small-web/)\]\[[5](https://news.ycombinator.com/item?id=29768197)\].

There are different purposes for this tool and the dataset it creates:

1.  help analyze the Kagi Small Web index, to detect and eventually
    remove the sites that don’t comply with the policy of the index
2.  create a dataset of all the sites that compose the index. This
    dataset is a very specialized subset of websites that are created
    and maintained by independent people, mostly old school bloggers.
    This dataset can be used for different specialized ML training, for
    example to train a classifier to distinguish Small Web sites from
    Big Web sites, etc.

## Install

To install the command line tool, you simply have to:

``` sh
git clone https://github.com/fgiasson/small-web-dataset.git
cd small-web-dataset

make build
make install-local-build
```

This will clone the repository, build the command line tool and install
it in your local Python environment.

## Configure

You have to make the following environment variables available in your
environment:

| Variable     | Description                                                                  |
|--------------|------------------------------------------------------------------------------|
| `FEEDS_PATH` | The path where you want to save all the feeds on your local file system      |
| `DB_PATH`    | The path where you want to save the SQLite dataset on your local file system |
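
As a quick sanity check before running the tool, the following sketch
reports which of the two variables are missing from the environment. It
is an illustration only: the `missing_variables` helper is hypothetical
and is not part of `small-web-dataset`.

```python
import os

# The two variables named in the table above.
REQUIRED_VARIABLES = ("FEEDS_PATH", "DB_PATH")


def missing_variables(environ=None):
    """Return the required variables that are unset or empty.

    Hypothetical helper for illustration; pass a dict to check something
    other than the real process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("FEEDS_PATH and DB_PATH are both set.")
```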

## How to use

You can make sure that the command line tool is installed, and check
which version is available, by running:

``` sh
small-web-dataset version
```

You can get the help documentation by running:

``` sh
small-web-dataset --help
```

You can check the current configuration options for the tool in the
current environment by running:

``` sh
small-web-dataset config
```

To create the dataset, you simply have to run the following command:

``` sh
small-web-dataset sync-feeds
```

This command will do three things:

1.  it will download all the RSS and Atom feeds from the Kagi Small Web
    index into the `FEEDS_PATH` folder
2.  it will read all the local feeds files and import them into a local
    SQLite database in the `DB_PATH` folder
3.  it will infer the core language of a feed from the language used to
    write the articles in the feed, and it will add this information in
    the database
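
The third step can be pictured as a majority vote over the articles of a
feed. The sketch below is one plausible way to do it, not necessarily
how the tool implements it: `infer_feed_language` and `toy_detect` are
hypothetical names, and the detector is passed in as a parameter. Since
`langdetect` is listed among the dependencies, its `detect` function
could be supplied in place of the toy detector.

```python
from collections import Counter


def infer_feed_language(article_texts, detect):
    """Return the most common language code among a feed's articles.

    `detect` is any callable mapping a text to a language code.
    Returns None when no article yields a language.
    """
    votes = Counter()
    for text in article_texts:
        if not text.strip():
            continue  # nothing to detect in an empty article
        try:
            votes[detect(text)] += 1
        except Exception:
            # Assumption: a detector may raise on text it cannot classify;
            # such articles simply do not vote.
            continue
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def toy_detect(text):
    """Stand-in detector for the demo: 'fr' if the word 'le' appears, else 'en'."""
    return "fr" if " le " in f" {text.lower()} " else "en"


if __name__ == "__main__":
    articles = ["The cat sleeps", "A dog barks", "Le chat dort"]
    print(infer_feed_language(articles, toy_detect))
```

Voting per article rather than concatenating all the text keeps one
long article in another language from deciding the result for the whole
feed.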

Optionally, if you already have a local cache of the feeds and you only
want to update/recreate the database, you simply have to specify the
`DDMMYYYY` folder of the feeds you want to process:

``` sh
small-web-dataset sync-feeds 18092023
```
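
The folder name is a date written as day, month, year, so `18092023`
stands for 18 September 2023. The sketch below converts between such
names and dates; `folder_name_for` and `parse_folder_name` are
hypothetical helpers for illustration, not functions exposed by the
tool.

```python
from datetime import date, datetime

FOLDER_FORMAT = "%d%m%Y"  # DDMMYYYY, e.g. 18092023


def folder_name_for(day):
    """Format a date as a DDMMYYYY folder name."""
    return day.strftime(FOLDER_FORMAT)


def parse_folder_name(name):
    """Parse a DDMMYYYY folder name into a date, rejecting malformed names."""
    if len(name) != 8 or not name.isdigit():
        raise ValueError(f"expected 8 digits in DDMMYYYY form, got {name!r}")
    return datetime.strptime(name, FOLDER_FORMAT).date()


if __name__ == "__main__":
    print(parse_folder_name("18092023"))
    print(folder_name_for(date.today()))
```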
