pandas_genomics.io.from_plink

pandas_genomics.io.from_plink(input: Union[str, pathlib.Path], swap_alleles: bool = False, max_variants: Optional[int] = None, categorical_phenotype: bool = True)

Load genetic data from PLINK v1 files (.bed, .bim, and .fam) into a DataFrame.
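A minimal usage sketch, assuming the three files share a path prefix. The helper names `missing_plink_files` and `load_plink` are hypothetical and not part of pandas_genomics; only the `from_plink` call and its keyword arguments come from the signature above.

```python
from pathlib import Path


def missing_plink_files(prefix):
    """Return the .bed/.bim/.fam paths for `prefix` that do not exist."""
    prefix = str(prefix)
    # Append the extensions to the prefix instead of using Path.with_suffix,
    # which would mangle a prefix that already contains a dot.
    candidates = [Path(prefix + ext) for ext in (".bed", ".bim", ".fam")]
    return [p for p in candidates if not p.is_file()]


def load_plink(prefix, max_variants=None):
    """Hypothetical wrapper: check the fileset, then call from_plink."""
    missing = missing_plink_files(prefix)
    if missing:
        names = ", ".join(str(p) for p in missing)
        raise FileNotFoundError(f"Missing PLINK files: {names}")

    # Imported lazily so the file check works without the library installed.
    import pandas_genomics.io

    return pandas_genomics.io.from_plink(
        input=prefix,
        swap_alleles=False,
        max_variants=max_variants,
        categorical_phenotype=True,
    )
```

Checking for the files first is a design choice of this sketch: it reports all missing files at once instead of relying on whatever error the loader raises for the first one.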

Parameters

input: str or Path: Path prefix of the PLINK fileset (no extension). The .bed, .bim, and .fam files with this name must exist in the same location.
swap_alleles: bool: False by default, in which case “allele2” (usually major) in the bim file is considered the “reference” allele. If True, “allele1” (usually minor) is considered the “reference” allele.
max_variants: Optional[int]: If provided, only load this number of variants.
categorical_phenotype: bool, True by default: If True, the phenotype is encoded as a categorical when loaded (1 = “Control”, 2 = “Case”, otherwise missing). If False, values are loaded directly.
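The phenotype mapping described for categorical_phenotype can be sketched in plain Python. This is an illustration of the documented rule only, not the library's implementation: `encode_phenotype` is a hypothetical name, and `None` stands in for a missing value where the library would presumably use a pandas categorical.

```python
def encode_phenotype(values, categorical=True):
    """Map raw .fam phenotype codes following the documented rule."""
    if not categorical:
        # categorical_phenotype=False: values are passed through unchanged.
        return list(values)
    labels = {1: "Control", 2: "Case"}
    # Any code other than 1 or 2 (for example 0 or -9) becomes missing.
    return [labels.get(v) for v in values]


print(encode_phenotype([1, 2, -9, 0]))
print(encode_phenotype([1, 2, -9, 0], categorical=False))
```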

Returns

DataFrame: Columns correspond to variants (named as {variant_number}_{variant ID}). Rows correspond to samples and index columns include sample information.
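Because column names follow the {variant_number}_{variant ID} pattern, they can be split back into their two parts. The sketch below assumes the variant number is a plain integer and splits on the first underscore only, since variant IDs may themselves contain underscores. `split_variant_column` is a hypothetical helper, not a library function.

```python
def split_variant_column(name):
    """Split '{variant_number}_{variant ID}' into (int, str)."""
    # maxsplit=1 keeps any underscores inside the variant ID intact.
    number, variant_id = name.split("_", 1)
    return int(number), variant_id


print(split_variant_column("12_rs123"))
print(split_variant_column("3_chr1_1000_A_G"))
```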

Notes

PLINK v1 files encode all variants as diploid (2n) and use “missing” alleles if the variant is actually haploid.
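To illustrate why every genotype is diploid, the sketch below decodes one data byte of a PLINK 1 .bed file. Each byte packs four genotypes as 2-bit codes, lowest-order bits first, and every code stands for a pair of alleles, so the format has no dedicated haploid code. This follows the PLINK 1 binary format convention and is not necessarily how pandas_genomics reads the file. The names `BED_CODES` and `decode_bed_byte` are hypothetical.

```python
# 2-bit genotype codes of the PLINK 1 .bed format.
# "A1" and "A2" refer to the allele1 and allele2 columns of the .bim file.
BED_CODES = {
    0b00: ("A1", "A1"),   # homozygous for allele1
    0b01: (None, None),   # missing genotype
    0b10: ("A1", "A2"),   # heterozygous
    0b11: ("A2", "A2"),   # homozygous for allele2
}


def decode_bed_byte(byte, n_samples=4):
    """Decode up to four genotypes from one .bed data byte."""
    # Sample i occupies bits 2*i and 2*i + 1, starting from the low end.
    return [BED_CODES[(byte >> (2 * i)) & 0b11] for i in range(n_samples)]


print(decode_bed_byte(0b11100100))
```

The `n_samples` argument covers the last byte of a variant, where fewer than four samples may remain and the leftover bits are padding.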