Metadata-Version: 2.1
Name: ERStruct
Version: 0.1.12
Summary: Determine number of principal components based on sequencing data
Home-page: https://github.com/ecielyang/ERStruct
Author: Jinghan Yang
Author-email: <eciel@connect.hku.hk>
License: MIT
Keywords: Population structure,Principal component,Random matrix theory,Sequencing data,Spectral analysis
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown


# ERStruct - Official Python Implementation

A Python package for inferring the number of top informative PCs that capture population structure based on genotype information.

## Requirements for Data File
Data files must be in `.npy` format. Each data matrix may contain only the entries 0, 1, 2, and/or NaN (for missing values); rows represent individuals and columns represent markers. If there is more than one data file, all data matrices must have the same number of rows.
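As an illustration of the requirements above, the following minimal sketch builds a tiny genotype matrix, saves it as a `.npy` file, and checks it against the stated rules. The helper `check_genotype_matrix`, the toy data, and the file name `toy_chr.npy` are hypothetical and are not part of ERStruct; only `numpy` is assumed.

```python
import os
import tempfile

import numpy as np


def check_genotype_matrix(matrix, n):
    """Return True if `matrix` has n rows and only 0, 1, 2 or NaN entries.

    Hypothetical helper for illustration; not part of the ERStruct API.
    """
    matrix = np.asarray(matrix, dtype=float)
    if matrix.ndim != 2 or matrix.shape[0] != n:
        return False
    # Ignore missing values, then require every remaining entry to be 0, 1 or 2.
    observed = matrix[~np.isnan(matrix)]
    return bool(np.isin(observed, (0.0, 1.0, 2.0)).all())


# Toy data: 4 individuals (rows) by 3 markers (columns), one missing value.
toy = np.array(
    [
        [0, 1, 2],
        [2, np.nan, 0],
        [1, 1, 1],
        [0, 0, 2],
    ],
    dtype=float,
)

# Save to a temporary directory and load it back, as a data file would be.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "toy_chr.npy")
np.save(out_path, toy)

reloaded = np.load(out_path)
is_valid = check_genotype_matrix(reloaded, n=4)
print(reloaded.shape, is_valid)
```

With several data files, the same check can be run on each file with the same `n` to confirm that the row counts agree.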

## Dependencies
ERStruct depends on `numpy`, `torch` and `joblib`.

## Installation
Users can install `ERStruct` by running the following command on the command line:
```commandline
pip install ERStruct
```

Import the module
```
from ERStruct import erstruct
```
## Parameters
```
erstruct(n, path, filename, rep, alpha, cpu_num=1, device_idx="cpu", varm=1, Kc=-1)
```

**n** *(int)* - total number of individuals in the study

**path** *(str)* - the path of data file(s)

**filename** *(list)* - the name of the data file(s)

**rep** *(int)* - number of simulation times for the null distribution

**alpha** *(float)* - significance level, can be either a scalar or a vector

**Kc** *(int)* - a coarse estimate of the number of top PCs (set to `-1` by default)

**cpu_num** *(int)* - optional, number of CPU cores to be used for parallel computing. (set to `1` by default)

**device_idx** *(str)* - device you are using, "cpu" or "gpu". (set to `"cpu"` by default)

**varm** *(int)* - memory (in bytes) allocated to the GPU for computing. When `device_idx` is set to "gpu", `varm` can be specified to increase the computational speed by allocating the required amount of memory to the GPU. (set to 2e+8 by default)
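One way to choose the `device_idx` value programmatically is sketched below. It assumes that the "gpu" setting corresponds to a CUDA device visible to PyTorch, which this document does not state explicitly; the helper `pick_device_idx` is hypothetical and not part of ERStruct.

```python
def pick_device_idx():
    """Return "gpu" if PyTorch reports a usable CUDA device, else "cpu".

    Hypothetical helper; assumes "gpu" means a CUDA device seen by torch.
    """
    try:
        import torch
    except ImportError:
        # torch is not installed, so fall back to the CPU setting.
        return "cpu"
    return "gpu" if torch.cuda.is_available() else "cpu"


device_idx = pick_device_idx()
print(device_idx)
```

The resulting string can then be passed as the `device_idx` argument.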

## Examples
Run the code on CPUs:
```python
test = erstruct(2504, './', ['test_chr21', 'test_chr22'], 5000, 1e-4, cpu_num=1, device_idx="cpu")
K = test.run()
```
Run the code on GPUs:
```python
test = erstruct(2504, './', ['test_chr21', 'test_chr22'], 5000, 1e-4, device_idx="gpu", varm=12000000000)
K = test.run()
```
Example data files `test_chr21.npy` and `test_chr22.npy` can be found under "sample_data" in the [ERStruct GitHub repository](https://github.com/ecielyang/ERStruct).

## Other Details
Please refer to our paper:
> [ERStruct: A Python Package for Inferring the Number of Top Principal Components from Whole Genome Sequencing Data](https://www.biorxiv.org/content/10.1101/2022.08.15.503962v2)

For details of the ERStruct algorithm:
> [ERStruct: An Eigenvalue Ratio Approach to Inferring Population Structure from Sequencing Data](https://www.researchgate.net/publication/350647012_ERStruct_An_Eigenvalue_Ratio_Approach_to_Inferring_Population_Structure_from_Sequencing_Data)

If you have any questions, please contact eciel@connect.hku.hk.
