Library-Design

Peptide design and oligonucleotide encoding

One Hundred Reps

Overview

The One Hundred Reps scripts reduce the size of your target protein sequence file by removing redundant protein sequences. A sequence is treated as redundant if it is a perfect match to another target or is fully contained within another target. For each set of redundant sequences, only the single longest sequence is kept in the output as the representative.
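The collapsing rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used by either script: the function name `collapse_to_representatives` and the example sequences are hypothetical, and the naive all-against-all substring search shown here would be too slow for large datasets.

```python
def collapse_to_representatives(seqs):
    """Map each kept representative name to the names of the sequences it absorbs.

    seqs: dict of sequence name -> protein sequence string.
    Hypothetical helper for illustration only.
    """
    # Visit the longest sequences first, so any possible representative is
    # seen before the shorter sequences it may contain. Ties are broken by
    # name to keep the result deterministic.
    ordered = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    reps = {}
    for name in ordered:
        seq = seqs[name]
        for rep_name in reps:
            # Identical sequences and contained sequences both pass this test.
            if seq in seqs[rep_name]:
                reps[rep_name].append(name)
                break
        else:
            # Not found inside any existing representative: keep it.
            reps[name] = []
    return reps


# Made-up sequences: B is identical to A, C is contained in A, D is unrelated.
example = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQR",
    "C": "TAYIAK",
    "D": "MSLLQQ",
}
reps = collapse_to_representatives(example)
print(reps)
```

In this toy input, `A` is kept as the representative for `B` and `C`, and `D` is kept on its own.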

Two versions of this script are available: a Python version and a C version. The C version is recommended for large datasets.

Inputs

The only required input is a FASTA-formatted file containing a set of target protein sequences.

Outputs

There is one required output: a FASTA-formatted file containing the unique, representative sequences from the input.

There is one optional output: a tab-delimited map file, which relates each removed sequence to the representative that replaces it (one line per unique representative).
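As a rough illustration of what such a map file could look like, the sketch below writes one tab-delimited line per representative. The column layout is an assumption made for this example (representative name, then a comma-separated list of removed names), as is the choice to skip representatives that replaced nothing. Check a real map file for the actual layout. The helper `write_map` is hypothetical.

```python
import csv
import io


def write_map(reps, handle):
    """Write one tab-delimited line per representative that replaced something.

    reps: dict of representative name -> list of removed sequence names.
    The two-column layout is an assumption, not the documented format.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for rep_name, removed in reps.items():
        if removed:
            writer.writerow([rep_name, ",".join(removed)])


# Write to an in-memory buffer here; a real run would pass an open file.
buffer = io.StringIO()
write_map({"A": ["B", "C"], "D": []}, buffer)
map_text = buffer.getvalue()
print(map_text, end="")
```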

Installation

Use

In this example, our input file will contain a set of 17,786 protein sequences downloaded from UniProt.

The output should contain 14,505 unique representatives from this input file.

Command (Python version):

one_hundred_reps.py \
  -f poxviridae_unaligned_min30AA.fasta \
  -r poxviridae_unaligned_min30AA_100rep.fasta \
  -m 100_rep_map.txt
- This real-world example took <2 min to complete on a MacBook Pro laptop (Apple M1, macOS v11.6.4).

Command (C version, using 2 threads):

one_hundred_reps \
  -f poxviridae_unaligned_min30AA.fasta \
  -n 2 \
  -m 100_rep_map.txt
- This real-world example took ~# min to complete on a MacBook Pro laptop (Apple M1, macOS v11.6.4).
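
One simple way to check a run is to count the records in the input and output FASTA files and compare them with the numbers quoted above. The sketch below is a minimal, assumed approach that counts header lines. The helper `count_fasta_records` is hypothetical and is not part of the One Hundred Reps scripts.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    return sum(1 for line in handle if line.startswith(">"))


# Small in-memory demonstration with two made-up records.
demo = io.StringIO(">seq1\nMKTAYIAKQR\n>seq2\nMSLLQQ\n")
n_records = count_fasta_records(demo)
print(n_records)

# For the files in this example, open them instead of using a buffer:
# with open("poxviridae_unaligned_min30AA_100rep.fasta") as fh:
#     print(count_fasta_records(fh))

# Input count minus the expected number of representatives gives the
# number of sequences that should have been removed.
removed = 17786 - 14505
print(removed)
```

With the counts given in this example, 3,281 sequences are expected to be removed.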