This collection of three works contains the data sets supporting the publication "Convergence in Viral Epidemic Research: Using Natural Language Processing to Define Network Bridges in the Bench-Bedside-Population Paradigm" submitted to the Harvard Data Science Review in November 2020.
The authors were Margaret Powers, Erin McCabe, Sally Luken, Danny Wu, Philip Hagedorn, Ezra Edgerton, Amy Koshoffer, Dorcas Washington, Suraj Kannayyagari, Jennifer Latessa, and James Lee.
The data sets were derived from coronavirus-related scientific literature in the CORD-19 dataset, released by the Allen Institute for Artificial Intelligence, as it stood on July 14, 2020, and were generated with the Elasticsearch engine hosted by the Digital Scholarship Center (DSC). By indexing the full text and metadata of the article corpus, the research team generated a full-corpus model and seven models corresponding to key viral outbreaks of the past several decades: three coronaviruses (SARS-CoV, MERS-CoV, and SARS-CoV-2) and four non-coronaviruses (HIV, Zika, H1N1, and Ebola). Each targeted subset comprises the articles containing two or more occurrences of virus-specific keywords drawn from conventions established by the World Health Organization.
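
The two-or-more-occurrences rule for building a targeted subset can be illustrated with a minimal Python sketch. It is not the authors' Elasticsearch query: the keyword lists, the function names, and the choice to count total whole-word matches across a virus's keyword list are all assumptions made for illustration.

```python
import re
from typing import Iterable

# Hypothetical keyword lists for illustration only; the actual
# WHO-derived terms used for the data sets are not listed here.
VIRUS_KEYWORDS = {
    "Ebola": ["ebola", "ebolavirus", "EVD"],
    "Zika": ["zika", "ZIKV"],
}


def count_keyword_hits(text: str, keywords: Iterable[str]) -> int:
    """Count whole-word, case-insensitive matches of any keyword in text."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def in_subset(text: str, keywords: Iterable[str], min_hits: int = 2) -> bool:
    """Return True when the text reaches the assumed occurrence threshold."""
    return count_keyword_hits(text, keywords) >= min_hits
```

Under these assumptions, an article whose text mentions "Zika" once and "ZIKV" once would fall into the Zika subset, while an article with a single passing mention of Ebola would be left out of the Ebola subset.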