Venom Annotation Pipeline
From manual proteomics annotation to a reproducible classification system.
Context
This project started long before I wrote any code.
During my proteomics specialization, I worked on the characterization of snake venom proteins identified through bottom-up proteomics. Functional annotation was done manually: each UniProt accession had to be reviewed individually, the biological information interpreted, and the classifications recorded in spreadsheets.
Years later, when the same dataset was revisited for manuscript preparation, the problem resurfaced. Re-running the analysis meant depending again on whoever still remembered the original reasoning. That was when I decided to redesign the workflow: not to update the spreadsheet, but to replace the process entirely.
The Problem
The challenge was not simply querying the UniProt API.
The real challenge was translating biological reasoning into a reproducible system. Many classification decisions relied on domain knowledge, exceptions, and contextual interpretation. If those rules remained implicit in the analyst's head, the process would keep depending on that person. The goal was to make those decisions explicit and defensible.
Beyond that, biological databases are rich but inconsistent. The same protein may have a well-defined domain annotation in InterPro and a vague family comment in UniProt. Another may have solid family information but no catalogued domains. Some have almost nothing.
Snake venoms are also biologically messy in an interesting way. Alongside the toxins, plasma proteins and blood cell proteins from the snake itself end up co-purifying during extraction. Proteins such as actin, tubulin, and histones appear in the data and must be discarded before classification; otherwise the final result is distorted.
And there was another, subtler problem: some toxins are inhibitory by nature. Cobra venom factor, for example, effectively shuts down the complement system (by depleting it), but it is a legitimate toxin. Discarding any protein with the word "inhibitor" in its name would be a serious mistake.
Design Decisions
- Input: proteomics spreadsheet (.xlsx) · InterPro entry list (.csv)
- Accession extraction + InterPro dictionary load
- UniProt REST API retrieval
- Annotation enrichment
- Venom class assignment
  - Weighted scoring
  - High / Medium / Low confidence
- Safety filters
  - Contaminant removal
  - Non-toxin inhibitor validation
- Secondary class detection
  - Flagged when secondary score ≥ 75% of top
- SVMP subclassification
  - P-I · P-II · P-III by domain composition
- Evidence trail generation
  - field:term pairs per accession
- Annotated Excel output
An evidence hierarchy with binary weights
The first decision was not to treat all information sources as equivalent. A domain identified by InterPro is much more specific evidence than a similarity comment in UniProt. An InterPro family is more reliable than a free-text description field in the input spreadsheet.
To reflect this, I implemented a field-weighted scoring system with weights as powers of 2:
| Field | Weight | Notes |
|---|---|---|
| InterPro Domain | 32 | Most specific evidence |
| InterPro Family | 16 | — |
| Active Sites | 8 | — |
| Spreadsheet description | 4 | Free-text input |
| UniProt family comment | 2 | Fallback when no InterPro data |
| InterPro Superfamily | 1 | Most generic evidence |
Using powers of 2 was a deliberate choice: as long as each field contributes its weight at most once, no combination of less reliable sources can outweigh a more reliable one. A domain match (32) is worth more than all other evidence combined (16 + 8 + 4 + 2 + 1 = 31). The scoring mirrors how unevenly reliable the fields in biological databases are.
The confidence tier of the final classification is derived from the highest weight that contributed to the result: if it came from a domain or family, it is "High"; if only from a free-text description, it is "Medium"; if only from a superfamily, it is "Low".
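The scoring and tiering described above could be sketched roughly as follows. The weights come from the table; the function names are hypothetical, and the text does not say which tier Active Sites (8) or the UniProt family comment (2) map to, so placing them in "Medium" is an assumption of this sketch rather than the pipeline's actual rule.

```python
# Field weights from the table; higher means more specific evidence.
FIELD_WEIGHTS = {
    "InterPro Domain": 32,
    "InterPro Family": 16,
    "Active Sites": 8,
    "Spreadsheet description": 4,
    "UniProt family comment": 2,
    "InterPro Superfamily": 1,
}


def score(matched_fields):
    """Sum the weights of the distinct fields that matched a class."""
    # set() ensures each field contributes its weight at most once.
    return sum(FIELD_WEIGHTS[field] for field in set(matched_fields))


def confidence(matched_fields):
    """Derive the tier from the heaviest field that contributed."""
    if not matched_fields:
        return None
    top = max(FIELD_WEIGHTS[field] for field in matched_fields)
    if top >= 16:  # InterPro Domain or InterPro Family
        return "High"
    if top >= 2:   # ASSUMPTION: the text only names the description (4) here
        return "Medium"
    return "Low"   # InterPro Superfamily only


everything_but_domain = [f for f in FIELD_WEIGHTS if f != "InterPro Domain"]
print(score(everything_but_domain), "<", score(["InterPro Domain"]))
```

Because the weights are distinct powers of 2, the sum of all lower fields is always one less than the next field up, which is what makes a single domain hit decisive.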
Safety filters with two-step logic
For contaminants, the solution was straightforward: a list of terms that immediately discards the protein before any scoring.
For inhibitors, I needed a two-step logic because the problem was ambiguous. First I check whether the protein is a legitimate venom inhibitor (Kunitz-type, three-finger toxins, cobra venom factor). If so, it goes straight to classification. Only if that check fails do I test whether it matches non-toxic inhibitor terms (serpin, alpha-2-macroglobulin). The order matters: without it, legitimate toxins would be discarded along with the noise.
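A minimal sketch of that ordering, assuming the filters run on a single lower-cased description string. The term lists contain only the examples named in this write-up, and the function name and return labels are hypothetical; the real lists are presumably longer.

```python
import re

# Example terms taken from the text; illustrative, not the full lists.
CONTAMINANT_TERMS = ("actin", "tubulin", "histone")
VENOM_INHIBITOR_TERMS = ("kunitz", "three-finger", "cobra venom factor")
NON_TOXIN_INHIBITOR_TERMS = ("serpin", "alpha-2-macroglobulin")


def _has_any(text, terms):
    # Match at a word start so "actin" does not fire inside another word.
    return any(re.search(r"\b" + re.escape(term), text) for term in terms)


def safety_filter(description):
    """Return 'classify' or a discard reason for one protein description."""
    text = description.lower()
    if _has_any(text, CONTAMINANT_TERMS):
        return "discard: contaminant"
    # Step 1: legitimate venom inhibitors are whitelisted first.
    if _has_any(text, VENOM_INHIBITOR_TERMS):
        return "classify"
    # Step 2: only now are non-toxin inhibitors rejected.
    if _has_any(text, NON_TOXIN_INHIBITOR_TERMS):
        return "discard: non-toxin inhibitor"
    return "classify"
```

Swapping steps 1 and 2 would not matter for these few example terms, but it would as soon as a broader term such as "inhibitor" joined the non-toxin list.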
Ambiguity detection
Multi-domain proteins exist and are a real problem. A protein may carry evidence consistent with two classes at once. Rather than forcing a single answer, the pipeline records secondary classes whenever another class scores at least 75% of the top score. Those cases are flagged for manual review: the system does not try to be smarter than it should be.
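The 75% rule might look like this in code. This is a sketch under the assumption that each class already has a numeric score; the function name and the class labels other than SVMP are made up for illustration.

```python
def top_and_secondary(scores, ratio=0.75):
    """Return the best class and any others scoring >= ratio * top score."""
    if not scores:
        return None, []
    top_class = max(scores, key=scores.get)
    top_score = scores[top_class]
    if top_score <= 0:
        return None, []
    secondary = [
        name
        for name, value in scores.items()
        if name != top_class and value >= ratio * top_score
    ]
    return top_class, secondary


# Hypothetical scores: 36 is exactly 75% of 48, so it is flagged.
print(top_and_secondary({"SVMP": 48, "Disintegrin": 36, "CTL": 4}))
```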
SVMP sub-classification
Snake venom metalloproteinases (SVMPs) have three structural sub-classes based on domain composition: P-I has only the catalytic domain; P-II adds a disintegrin domain; P-III further adds a cysteine-rich domain. This matters biologically because it determines the toxic activity profile.
Detecting P-III requires two specific terms to co-occur in the same text ("cysteine-rich" + "disintegrin"). Neither alone is sufficient. This kind of co-occurrence logic required careful ordering of checks: first the direct and explicit terms, then the required combinations, and only then inference from isolated terms.
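The ordering of checks can be illustrated with a short sketch. The three stages follow the description above, but the concrete patterns, the function name, and the inference "disintegrin alone implies P-II" are assumptions made for this example; the fallback label is the one the pipeline reports when evidence is insufficient.

```python
import re

# Stage 1 patterns: explicit sub-class labels. Word boundaries keep
# "p-i" from matching inside "p-ii" or "p-iii".
EXPLICIT_LABELS = (
    ("P-III", r"\bp-iii\b"),
    ("P-II", r"\bp-ii\b"),
    ("P-I", r"\bp-i\b"),
)


def svmp_subclass(annotation_text):
    """Assign an SVMP sub-class to a protein already confirmed as SVMP."""
    text = annotation_text.lower()
    # 1. Direct, explicit terms.
    for label, pattern in EXPLICIT_LABELS:
        if re.search(pattern, text):
            return label
    # 2. Required combination: both terms must co-occur for P-III.
    if "cysteine-rich" in text and "disintegrin" in text:
        return "P-III"
    # 3. Inference from an isolated term (ASSUMPTION of this sketch).
    if "disintegrin" in text:
        return "P-II"
    return "P-? (SVMP confirmed)"
```

If stage 3 ran before stage 2, every P-III protein mentioning a disintegrin domain would be mislabelled as P-II.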
Full traceability
Every classification produces an evidence column with the exact trail of field:term pairs that drove the decision. For example:
Domain:metalloproteinase | Domain:disintegrin | Family:m12b
This is essential, not optional. An automated pipeline that does not explain its decisions is unusable in a scientific context. Any suspicious result must be auditable without having to read through the code.
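Producing a trail in that format takes very little code. The sketch below assumes matches are collected as (field, term) tuples in the order they were found; the function name is hypothetical.

```python
def evidence_trail(hits):
    """Render (field, term) matches as 'Field:term | Field:term | ...'."""
    return " | ".join(f"{field}:{term}" for field, term in hits)


hits = [
    ("Domain", "metalloproteinase"),
    ("Domain", "disintegrin"),
    ("Family", "m12b"),
]
print(evidence_trail(hits))
# Domain:metalloproteinase | Domain:disintegrin | Family:m12b
```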
Technical Challenges
Term overlap in the classification map. "Phospholipase A2" and "phospholipase" are different terms that map to different classes. Without controlled matching, "phospholipase" could be found inside "phospholipase a2" and misclassify the protein. The solution was to sort the term map by descending length before any lookup; more specific terms are always checked first.
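The length-sorted lookup can be sketched as below. The two terms are the ones from the example above; the class labels and the function name are placeholders, not the pipeline's actual term map.

```python
# Hypothetical slice of the term map: term -> venom class.
TERM_MAP = {
    "phospholipase": "Phospholipase (other)",
    "phospholipase a2": "PLA2",
}

# Longest terms first, so the most specific match always wins.
ORDERED_TERMS = sorted(TERM_MAP, key=len, reverse=True)


def lookup_class(text):
    """Return the class of the first (longest) term found in the text."""
    lowered = text.lower()
    for term in ORDERED_TERMS:
        if term in lowered:
            return TERM_MAP[term]
    return None


print(lookup_class("Basic phospholipase A2 homolog"))  # PLA2
print(lookup_class("Phospholipase B"))                 # Phospholipase (other)
```

Sorting once at load time keeps the lookup itself a plain loop, with no special-casing of overlapping terms.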
Public API with implicit rate limits. UniProt does not clearly document its limits, so I added a 200 ms delay between requests, both as a courtesy and to avoid intermittent blocks. Not glamorous, but it works and it is considerate.
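A throttled retrieval loop might look like the following. This is a sketch, not the pipeline's code: it assumes the per-entry JSON endpoint of the UniProt REST API at `rest.uniprot.org`, uses only the standard library, and the function names are hypothetical. The fetch function is injectable so the loop can be exercised without network access.

```python
import json
import time
import urllib.request

# Assumed endpoint pattern for a single UniProtKB entry in JSON.
UNIPROT_URL = "https://rest.uniprot.org/uniprotkb/{accession}.json"


def fetch_entry(accession):
    """Download one UniProt entry and parse it as JSON."""
    url = UNIPROT_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_all(accessions, fetch=fetch_entry, delay=0.2):
    """Fetch entries one by one, pausing `delay` seconds between requests."""
    results = {}
    for index, accession in enumerate(accessions):
        if index:  # no pause before the first request
            time.sleep(delay)
        results[accession] = fetch(accession)
    return results
```

For example, `fetch_all(accessions)` with the default arguments would issue at most about five requests per second.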
Proteins with no annotation. Not every protein has information in UniProt: some return completely empty fields. The pipeline handles this gracefully without breaking, recording the result as "Others/Non Toxins" and moving on.
GO terms and row explosion. Gene Ontology terms arrive as lists per protein. To facilitate downstream analysis (filtering by molecular function, for example), the final table is "exploded": each GO term becomes a separate row. This multiplies the output rows and needs to be well documented to avoid confusing whoever uses the data.
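To make the row multiplication concrete, here is a pure-Python sketch of the explosion step. The column names, accessions, and function name are placeholders; on a pandas DataFrame with a list column, `DataFrame.explode` behaves similarly.

```python
def explode_go(rows, column="go_terms"):
    """Turn one row per protein into one row per (protein, GO term)."""
    exploded = []
    for row in rows:
        # Proteins without GO terms are kept as a single row with None.
        for term in row[column] or [None]:
            exploded.append({**row, column: term})
    return exploded


rows = [
    {"accession": "X1", "go_terms": ["F:toxin activity", "P:proteolysis"]},
    {"accession": "X2", "go_terms": []},
]
for row in explode_go(rows):
    print(row)
```

Two input proteins become three output rows here, so any per-protein count downstream has to deduplicate on the accession first.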
Limitations
The pipeline is rule-based, not machine learning. The term map is handcrafted: it works well for the best-known and best-documented toxin classes, but fails silently for atypical or poorly annotated proteins, which fall into Others/Non Toxins with no signal that they might have deserved a match.
A natural evolution would be to replace exact term matching with semantic similarity. Text embeddings would capture functionally equivalent descriptions even when terminology varies across databases, but they would not solve the root problem: if the annotation in UniProt or InterPro is sparse, the embedding will also be poor. To overcome this limitation, sequence-based models like ESM-2 or ProtTrans would be the most robust candidates, as they operate directly on the protein sequence, independent of the quality of textual annotation. Adopting them, however, would require fetching FASTA sequences by accession and rewriting the classification core. It would not be an incremental improvement.
The entire pipeline depends on external APIs. If UniProt changes its response format or discontinues an endpoint, the code breaks. This is a real risk that was not mitigated with any caching layer or local fallback.
The SVMP sub-classification returns "P-? (SVMP confirmed)" when there is insufficient evidence in the queried fields. That is honest, but it is a gap that only manual review can fill.
The GO term system filters only molecular function (F:) and biological process (P:) terms, discarding cellular component (C:). That was a pragmatic decision, but it may not be what every user wants.
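The aspect filter amounts to a prefix check. In the sketch below the prefixes are those named above, while the function name and the example terms are illustrative; exposing the kept prefixes as a parameter is one way to let users opt back into cellular component terms.

```python
# Aspects kept by default: molecular function and biological process.
DEFAULT_ASPECTS = ("F:", "P:")


def filter_go(terms, keep=DEFAULT_ASPECTS):
    """Keep only GO terms whose aspect prefix is in `keep`."""
    return [term for term in terms if term.startswith(tuple(keep))]


terms = [
    "F:metalloendopeptidase activity",
    "C:extracellular region",
    "P:proteolysis",
]
print(filter_go(terms))
print(filter_go(terms, keep=("C:",)))
```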
What I Learned
More than the code itself, this project taught me that the hardest part is often not implementing a solution. It is understanding the problem well enough to model it correctly without losing the details that actually matter.
The technical side of the project (querying the API, parsing JSON, merging DataFrames) was relatively straightforward. What took the most time was thinking through the biological edge cases: serpins that look like toxins but are not, toxins that look like inhibitors but are not, protein fragments that map to two domains at once.
I learned that rule-based classification systems succeed or fail on the quality of their exception rules. The main rule is easy to write. What separates a usable pipeline from an unusable one is how it handles what does not fit neatly anywhere. I also learned to value traceability from the very beginning. The evidence column was not an afterthought, but it could have been if I had not anticipated that someone (including myself) would look at a result and ask "why did it classify this way?" In scientific tools, the answer to that question needs to be in the output, not just in the code.