PepSeqPred

Residue-level epitope prediction with reproducible evidence.

PepSeqPred predicts epitope masks for protein sequences and supports full developer workflows for preprocessing, training, evaluation, and HPC execution.

Python 3.12+ | ESM2 embeddings | DDP-ready training | Seeded evaluation snapshots

Install

pip install pepseqpred

Quickstart APIs

Pretrained API

from pepseqpred import load_pretrained_predictor

predictor = load_pretrained_predictor(
    model_id="default",
    device="auto"
)
result = predictor.predict_sequence(
    "ACDEFGHIKLMNPQRSTVWY",
    header="example_protein"
)
print(result.binary_mask)

Artifact-path API

from pepseqpred import load_predictor

predictor = load_predictor(
    model_artifact="path/to/ensemble_manifest.json",
    device="auto"
)
result = predictor.predict_sequence(
    "ACDEFGHIKLMNPQRSTVWY"
)
print(result.binary_mask)

Why PepSeqPred

PepSeqPred is built for reproducible residue-level prediction under strong class imbalance. Training and validation are documented protocol-first, and numeric scorecards are drawn from seeded evaluation snapshots.

Training

  • Ensemble-kfold and seeded runs with deterministic split/train seeds.
  • ID-family-aware splitting to reduce leakage risk across related proteins.
  • DistributedDataParallel support for multi-GPU HPC workflows.
  • Run artifacts include checkpoints, manifests, and run-level CSV/JSON outputs.
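
The idea behind ID-family-aware splitting with a deterministic seed can be sketched in a few lines. This is an illustration only, not PepSeqPred's actual splitter: the function name family_kfold, the protein-to-family mapping, and the greedy size balancing are all assumptions made for this example.

```python
import random
from collections import defaultdict


def family_kfold(protein_to_family, k=5, seed=0):
    """Assign proteins to k folds so that no family spans two folds.

    Hypothetical helper for illustration; not part of the pepseqpred API.
    """
    families = defaultdict(list)
    for protein_id, family_id in protein_to_family.items():
        families[family_id].append(protein_id)

    # Sort first so the result depends only on the seed, not on dict order.
    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)
    # Stable sort: largest families first, shuffled order breaks ties.
    family_ids.sort(key=lambda fam: len(families[fam]), reverse=True)

    folds = [[] for _ in range(k)]
    for family_id in family_ids:
        # Greedy balancing: put the whole family into the smallest fold.
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(sorted(families[family_id]))
    return folds


# Hypothetical IDs, used only to show the call shape.
example = {"P1": "famA", "P2": "famA", "P3": "famB", "P4": "famC"}
folds = family_kfold(example, k=2, seed=42)
```

Because whole families are assigned at once, related proteins cannot land on both sides of a train/validation boundary, and re-running with the same seed reproduces the same folds.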

Validation

  • Checkpoint selection records threshold, PR-AUC, F1, MCC, AUC, and AUC10.
  • Threshold policy maximizes recall subject to minimum precision constraints.
  • Validation metrics are captured per run with explicit seed provenance.
  • PhaseB validation scorecards are rendered directly from manually curated run artifacts.
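
A threshold policy of the kind described above (maximize recall subject to a minimum precision) might look like the following minimal sketch. The function name select_threshold, the default min_precision value, and the tie-breaking order are assumptions for illustration; the project's actual selection code may differ.

```python
def select_threshold(scores, labels, min_precision=0.5):
    """Pick the threshold with the highest recall whose precision is at
    least min_precision. Returns (threshold, precision, recall) or None.

    Hypothetical helper for illustration; not part of the pepseqpred API.
    """
    best = None
    for threshold in sorted(set(scores)):
        # A residue is predicted positive when its score >= threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp + fp == 0:
            continue  # nothing predicted positive, precision undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision < min_precision:
            continue  # violates the precision constraint
        # Assumed tie-break: higher recall, then precision, then threshold.
        candidate = (recall, precision, threshold)
        if best is None or candidate > best:
            best = candidate
    if best is None:
        return None
    recall, precision, threshold = best
    return threshold, precision, recall
```

Returning None when no threshold satisfies the constraint makes the failure explicit instead of silently falling back to a default cutoff.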

Evaluation

  • Seeded external Cocci evaluation compares flagship models across sets 1-10.
  • Class prevalence is very low, so PR metrics are emphasized over accuracy.
  • Paired set statistics include bootstrap confidence intervals and sign tests.
  • Frozen benchmark snapshot remains available as a curated web evidence source.
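
For readers unfamiliar with the paired statistics mentioned above, here is a minimal sketch of a percentile bootstrap confidence interval and a two-sided sign test over per-set metric differences between two models. The function names, the resample count, and the example values are hypothetical; they are not taken from the PepSeqPred evaluation code or its results.

```python
import random
from math import comb


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def sign_test_p(diffs):
    """Two-sided exact sign test p-value; zero differences are dropped."""
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Binomial(n, 0.5) probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-set differences (model A minus model B) on ten sets.
hypothetical_diffs = [0.03, 0.01, 0.04, -0.01, 0.02, 0.05, 0.02, 0.00, 0.03, 0.01]
ci_low, ci_high = bootstrap_ci(hypothetical_diffs, seed=1)
p_value = sign_test_p(hypothetical_diffs)
```

The sign test only uses the direction of each paired difference, so it stays meaningful with as few as ten sets, while the bootstrap interval conveys the size of the effect.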