ViMRT: a text-mining tool and search engine for automated virus mutation recognition

Abstract Motivation Virus mutation is one of the most important research issues which plays a critical role in disease progression and has prompted substantial scientific publications. Mutation extraction from published literature has become an increasingly important task, benefiting many downstream applications such as vaccine design and drug usage. However, most existing approaches have low performances in extracting virus mutation due to both lack of precise virus mutation information and their development based on human gene mutations. Results We developed ViMRT, a text-mining tool and search engine for automated virus mutation recognition using natural language processing. ViMRT mainly developed 8 optimized rules and 12 regular expressions based on a development dataset comprising 830 papers of 5 human severe disease-related viruses. It achieved higher performance than other tools in a test dataset (1662 papers, 99.17% in F1-score) and has been applied well to two other viruses, influenza virus and severe acute respiratory syndrome coronavirus-2 (212 papers, 96.99% in F1-score). These results indicate that ViMRT is a high-performance method for the extraction of virus mutation from the biomedical literature. Besides, we present a search engine for researchers to quickly find and accurately search virus mutation-related information including virus genes and related diseases. Availability and implementation ViMRT software is freely available at http://bmtongji.cn:1225/mutation/index.


Introduction
Viruses cause many human diseases including cancers by persistent chronic infection and induce public health insecurity (Escandon et al., 2021;Zapatka et al., 2020). Viral mutation is the ultimate source of viral genetic diversity and evolution which plays a key role in the relationship between the virus and disease progression (Sanjuan and Domingo-Calap, 2016). For example, the point mutations of hepatitis B virus (HBV) surface are closely related to immune and vaccine escape, resulting in active viral replication and the development of hepatocellular carcinoma (Zhao et al., 2021). Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) variants can cause a very high risk of infection (Araf et al., 2022). Substantial virus research has been prompted by the importance of virus mutation across many diverse clinical and basic research fields and reported in PubMed with constant updates, which provides an invaluable resource for obtaining key information about virus genomic variants. Therefore, it is of great significance for clinical diagnosis and treatment to perform extraction and relationship analysis of virus-related mutations, genes and disease entities by the efficient use of the rich literature.
However, those unstructured texts are limited by the insufficient feasibility of traditional manual curation in utilization, as it cannot be carried out on all of the available literature. Furthermore, due to the inconsistency of virus mutations and the lack of standard mutation nomenclature, the conventional 'Keyword Search' may be difficult to quickly and non-redundantly get desired mutation information from literature in PubMed. As such, extracting, mining and integrating virus mutation-related information from those extremely large amounts of literature has become an increasingly important task in many downstream applications, such as virus mutation database development (Davey et al., 2014;Wang et al., 2022), immune escape mechanism exploration (Huang et al., 2020), vaccine design and updating (Xie et al., 2021), drug resistance mutation analysis and assistance in clinical personalized medicine Yeo et al., 2020).
With the wide application of text mining and natural language processing in the biomedical field, many named-entity recognition and named-entity normalization tools for mutation have been successively developed to overcome the problem of mutation recognition from the biomedical literature (Lee et al., 2021). For example, MutationFinder uses about 700 regular expressions to identify individual nucleotide or amino acid point mutations or the mutations in simple natural language forms (Caporaso et al., 2007). Extractor of mutation can identify insertions or deletion mutations in addition to point mutations based on regular expression (Doughty et al., 2011). tmVar comprehensively uses conditional random fields (CRF) with regular expression for mutation recognitions (Wei et al., 2013). AVADA is another machine learning and regular expression-based mutation extraction tool which improved gene-mutation mapping in full-text papers (Birgmeier et al., 2020). nala mines natural language processing mutations from the literature (Cejuela et al., 2017). However, those tools aimed at identifying human gene mutation. The accuracy and recall largely reduce when they are applied to identify virus mutation due to the specific written forms of virus mutation significantly differing from human mutation. For example, most written forms often contain slash symbols resulting from occurring multiple mutations at a single site, e.g. E138A/G/K/Q/R/S in human immunodeficiency virus (HIV) reverse transcriptase region (Martin-Alonso et al., 2020). The other popular form of virus variant nomenclature includes the lowercase virus gene at the beginning, e.g. sG145R in the HBV S gene . Consequently, developing a tool for virus mutation from free text is promising but remains a challenge especially for a variety of writing styles and the shortage of a unified nomenclature as shown by the human genome variation society (HGVS) (den Dunnen et al., 2016). Moreover, extracting the relationship of virus mutation-related entities is also of great biological and clinical application in virus research, especially in genes and diseases (Araf et al., 2022;Soulie et al., 2020). There are several semantic search engines for linking genomic variant data, such as LitVar (Allot et al., 2018), PubTator (Wei et al., 2019), but their information focuses on human mutations or keyword-based search, which hinders quality improvement (sensitivity and specificity) of search results.
To address the above difficulties and challenges, this study developed a novel localization recognition tool of virus mutation to extract and standardize mutation data from available literature, and further integrated the co-occurrence information of virus mutation between viral gene and human disease, with an attempt to design a user-friendly search engine to serve the requirement of researchers to quickly search and download mutation related information, thereby aiding experimental design and clinical guidelines.

System and methods
An overview of ViMRT is summarized in Figure 1, including virus dataset preparation, mutation recognition tool development, performance evaluation and web construction.

Virus dataset preparation
The gold standard file of virus mutation was collected from our previously developed ViMIC database (Wang et al., 2022), consisting of mutation-related literature in five viruses, HBV, HIV, human papillomavirus (HPV), human T-cell lymphotropic virus type-1 (HTLV1) and Epstein-Barr virus. In this study, the positive samples consisted of 1246 abstracts containing virus mutation information. To balance the positive and negative samples, 1246 abstracts with no virus mutation information were selected as the negative samples. All collected virus literature was divided into a development dataset and a test dataset on a scale of 1:2. In total, 415 positive abstracts and 415 negative abstracts were used as the development dataset, and 831 positive abstracts and 831 negative abstracts were used as the independent test dataset. All virus datasets are available at: http://bmtongji.cn:1225/mutation/dataset.

Mutation recognition tool development
ViMRT has two independent modules to identify mutations as follows: 2.2.1 Module 1: Optimizing the recognition result of tmVar by rule patterns In this module, we first used the tmVar (Wei et al., 2013) to obtain mutation information and then we built eight rule patterns to correct and standardize mutation identification of tmVar based on the site of mutation in the original sentences and mutation displacement (Table 1). This mainly included two aspects: negative results (identified entities are not actual virus mutations) and positive results (identified entities are actual virus mutations). For negative results, our rules mainly dealt with three recognition errors: multiple mutations linked with a slash ('A1762T/G1764A' was identified as 'A1762T/G' by tmVar), non-sense mutation ('Q118R/Stop' was identified as 'Q118R/S' by tmVar) and the three-letter abbreviation of amino acid ('Pro130Ile/Thr/Ser' was identified as 'Pro130Ile/T' by tmVar). The mentioned errors resulted from incomplete extractions of words. Thus, the recognition result can be optimized by completing words in sentences, and the incorrect recognition part was excluded after optimization. For example, 'A1762T/G' by tmVar will be extended into 'A1762T/G1764A' and normalized as 'A1762T' and 'G1764A', without the negative result 'A1762G'. For positive results, our rules mainly standardized five written forms: mutation with lowercase viral gene (rtM204V), simple natural language written form with a linking preposition (a mutation from T to G at nucleotide 178), mutation written form with '>' (8403A>G), a word beginning with mutation level (p.Lys316Glu) and a special written form of HPV mutant (E-C109G).

Module 2: Developing regular expression patterns to recognize virus mutation
Based on the development dataset and false positive results of tmVar, we classified the written form of virus mutation into three signatures for nucleotide and amino acid mutation: standard mutation (SM: written form of mutation with no punctuation), semi-standard mutation (SSM: written form of mutation containing punctuation) and natural language mutation (NLM: written form of mutation with the natural language) and then constructed corresponding regular expression patterns (Table 2). Finally, we combined the united results of the above two modules as the final recognition result of ViMRT with a standardized written format which consisted of the wild type with uppercase letter, the location number of the mutation and the mutated type with uppercase letter.

Performance evaluation
To evaluate the performance of ViMRT, we compared ViMRT with three other classic tools (tmVar, MutationFinder and nala). Moreover, we used the additional dataset including the abstracts of influenza virus (IV) and SARS-CoV-2 to further evaluate the recognition effect in other viruses. We calculated three different parameters: Precision, Recall and F1-score. The three parameters were defined as follows: where TP, FP and FN represent the numbers of true positive (identified entities are actual virus mutation), false positive (identified entities are not actual virus mutation) and false negative (identified non-virus mutation entities are actual virus mutation), respectively.

Virus genes extraction
We first built the virus gene corpus from PubMed and the NCBI gene database (https://www.ncbi.nlm.nih.gov/gene/) due to the particularity of different viral genes and then developed Python scripts to identify the virus genes from the sentences including virus mutation.

Virus mutation related diseases annotation
We used the Stanza natural language processing (NLP) library (https://github.com/stanfordnlp/stanza) for many human languages to identify the diseases from the sentences including virus mutation and further developed Python scripts to extract related-disease information based on disease corpus collected from CTD (http://ctdbase. org).

Web search engine development
To quickly search and find the interactions of virus mutation and related information for researchers, we used ViMRT to identify   The normalized mutation written forms.
virus mutations, genes and diseases from the 9629 PubMed abstracts and 3160 PMC full-text articles and calculated the co-occurrence correlation of the variant and disease in sentences by the Fisher's test. We finally constructed the user-friendly web pages for visual search and display using Django 3.2.6.

Virus mutation dataset
Viral mutation gold data contained 1246 abstracts on five viruses (2349 mutations) from the ViMIC database. The development dataset and test dataset were allocated in a ratio of 1:2. In total, SM accounted for 62.60% of total annotations. SSM was 36.45% and the fraction of NLM was only 0.95% (Fig. 2A). The fraction of mutation signatures in the corpus of different viruses is shown in Figure 2B.

Evaluation and comparison with existing tools on different datasets
To evaluate the performance of ViMRT, we compared recognition results with three other conventional tools, tmVar, nala and MutationFinder in the test dataset. Notably, Table 3 shows that ViMRT outperformed three other methods with the best two evaluation metrics (Recall: 98.95%, F1-score: 99.17%) and the precision also reached 99.39%. ViMRT displays the state-of-art performance to recognize the mutation of five viruses from the literature used in this study. To further estimate the application of our model in other viral contexts, we manually annotated 212 mutation-related abstracts of SARS-CoV-2 and IV. Figure 2C shows the fraction of mutation signatures. The recognition results are presented in Table 4. The performance of our method was higher than that of the three other methods, which also shows the feasibility and effectiveness of our method.

Recognition comparison with existing tools on SM, SSM and NLM
We further compared the recognition results of four methods in three mutation signatures (Table 5). For SM, four methods could perform well in mutations that resembled the HGVS nomenclature. However, ViMRT and tmVar could identify the gene mixed-written form and nonsense mutations that were not identified by nala or MutationFinder. For SSM, ViMRT, tmVar and nala could identify a large proportion of written forms, but only ViMRT could correctly detect SSM which contains more than two slashes. For NLM, ViMRT, tmVar and nala could discern the mutations in simple written form of natural language. Figure 2D shows the percentage of different mutation signatures with true positive results and false negative results by ViMRT, MutationFinder, nala and tmVar.
ViMRT performed more accurately than the three other tools, which had the highest percentage of positive results and the lowest percentage of negative results.

The ViMRT web search engine
We developed the ViMRT web search engine by identifying mutations in 9629 abstracts and 3160 full texts and covered the annotations of viral genes and related diseases. It can be accessed through an easy-to-use graphical web interface, as shown in Figure 3. The web provides two ways to get variant information. One is a simple search. After users enter a query entity in the select bar on the home page, including virus mutation, gene or disease (Fig. 3a), the web will return the best match entity results in its database (Fig. 3b), which includes a list of publications containing highlighted entities, sorted by publication date. Another is co-occurrence correlation rank to search if the aim of users is not very clear. Users can firstly click 'To search the relationship' button on the home page, which will return co-occurrence correlation of mutation and disease including seven viruses by the histogram chart and table, and click 'View' to get detailed information (Fig. 3c). Those processes have three main features: First, for each result, it returns the searched entity with a red background as well as other entities with different colors including diseases, genes and other mutations (Fig. 3b), which can very efficiently catch the entity information and research work. Second, users can filter the other co-occurrence entities by the table and the filter box at the left ( Fig. 3c and d) which are useful to find potential relations (e.g. C1653T mutation of HBV HBx/core promoter region is strongly associated with an increased risk of hepatocellular carcinoma and reported in many literatures). Third, users can also filter literature information through 'Year', 'Position' and 'PMID' of publication (Fig. 3e). The 'Dataset' page and 'Docs' page of the web provide the download to easily get complete mutation information of seven viruses and the local usage of ViMRT with a detailed script, respectively (Fig. 3f).

The localization usage of ViMRT
The usage of ViMRT can realize three recognitions: virus mutation recognition, virus gene recognition and disease recognition. Firstly, virus mutation recognition includes downloading literature, optimizing the recognition results of tmVar by rule patterns and recognizing virus mutation by regular expression patterns. Secondly, the virus gene is identified by virus gene vocabulary and a gene match program. Thirdly, disease recognition is based on Stanza 1.4.0 and a normalizing script written by Python (v3.8.5). Full usage of ViMRT is minutely displayed on the 'Docs' page: http://bmtongji.cn:1225/ mutation/Docs.

Discussion
This study is the first to gather abundant virus mutation information and develop ViMRT to identify virus mutation by precise rules and regular expressions. It has achieved a higher performance for virus mutation recognition than other conventional tools. At the same time, we are the first to develop a search engine web about a variety of viruses for researchers to quickly search mutation-specific information that includes virus gene and related diseases as reported in the biomedical literatures.
ViMRT displays some notable advantages in identifying virus mutation recognition. Firstly, ViMRT curated and created virus mutation information from five viruses closely related to human disease from the literature to comprehensively cover the majority of written forms. Next, based on that information, we designed rule patterns to optimize recognition results of tmVar, and further summarized three mutation signatures termed SM, SSM, NLM which characterized the written form preference of virus mutation in different literatures. Then we developed regular expressions to recognize mutations for each mutation signatures ( Table 2). As we have observed, ViMRT showed state-of-art performance compared to three other conventional tools (Table 3), especially for the written form of virus mutation containing punctuation or gene. We also The bold values indicate the best value per metric.   developed a specific search engine for seven viruses to realize mutation information visualization and mined the buried co-occurrence relationships with related diseases in the original paper, which can contribute to more quickly and accurately retrieving evidence aiding both basic research and clinical application compared to LitVar and Pubtator.
In addition, we observed that ViMRT showed a slightly decreased but still good performance (all indicators > 0.9) to recognize the mutation of SARS-CoV-2 and IV compared to five viruses (Table 4), while the performance of three other tools on the corpus of SARS-CoV-2 and IV were better than those in the test dataset. The possible interpretation is that a lower proportion of SSM in SARS-CoV-2 and IV corpus caused the rise of recall of three other tools (Fig. 2C). Also, mutation of different viruses in the datasets existed in some particular written forms, such as nsp6-L3606fs and spike-glycoprotein-V6fs variants of SARS-CoV-2 (Zekri et al., 2021). Furthermore, we analyzed false positive results of ViMRT, which was mainly caused by terms that are highly similar to the written form of virus mutation, such as human hepatoma cells (C3A) (Nishizawa et al., 2016), where C3A was misidentified as a mutation.
Several future directions can extend this work. Firstly, future study is warranted to expand the rules and regular expressions for more viruses with the emergence of special written forms. With the rapid growth of literature data, we will continue to expand our corpus of viruses, add more virus types and regularly update the literature of our search engine. Secondly, due to the particularity of writing styles of virus mutations, the current version of ViMRT mainly used the rule-based NLP method. In a preliminary study, we have explored whether it is possible to perform a binary classification task using machine learning models for discriminating sentences containing the mutation information to reduce the false positive.
Both F1-scores of machine learning methods based on TF-IDF word embedding and TextCNN model could reach 0.8. In the future, with continuous updates and expansion of the virus data set constructed in this study, the state-of-the-art pre-trained language models, such as BERT (Devlin et al., 2019), BioBERT (Lee et al., 2020), PubMedBERT (Gu et al., 2020), will be worth attempting. Thirdly, we will continue to extract and update more buried biological relationships associated with a viral function such as receptor binding and drug resistance.   3. The ViMRT user interface and the usage guidance locally. Multiple clearly delimited zones allow users to perform search and visualize results, which include: (a) A query entity can be inputted on the home page, such as a virus mutation, virus gene or related disease. (b) The match query includes the detailed information of virus mutations, virus genes, and related diseases, sorted by the publication year in a descending order. (c) The co-occurrence correlation of viral mutation and related disease. The table shows co-occurrence score of mutation which have co-occurred in the same sentence with one disease by Fisher's test in each virus (***P < 0.001; **P < 0.01; *P < 0.05; ns, not significant). The histogram chart shows the co-occurrence correlation score of mutation and disease ranking top 10. (d) The user can select the entities menu to get sentence co-occurrence of different entities. (e) Users can also filter publications based on 'Year', 'Position' and 'PMID' of publication matching the query. (f) Users can download mutation data on the 'Dataset' page and view the usage of ViMRT in the 'Docs' page with a detailed script