Classifier algorithms use the features of each item in a dataset (collectively, the item's feature vector) to predict the class to which that item belongs.
In this classifier approach, each item represents one document containing a single applicant's application essay combined with unstructured language describing that applicant's relevant activities. For privacy, the full text of this document is not provided. Instead, each document is represented only by its features. The feature vector for this classifier is based on the term frequency for each of the identified terms. For example, Doc_A contains 0 occurrences of any terms identified as family medicine vocabulary, and 10 occurrences of terms from the non-family-medicine vocabulary.
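A minimal sketch of how such a term-frequency feature vector might be computed is shown here. The two vocabularies, the tokenizer, and the sample sentence are hypothetical stand-ins, since the actual term lists and documents are not part of this description, and the sketch aggregates counts per vocabulary as in the Doc_A example.

```python
from collections import Counter
import re

# Hypothetical vocabularies; the study's actual term lists are not given here.
FAMILY_MEDICINE_TERMS = {"community", "continuity", "underserved"}
NON_FAMILY_MEDICINE_TERMS = {"surgery", "research", "procedure"}


def feature_vector(text: str) -> dict[str, int]:
    """Count how often terms from each vocabulary occur in one document."""
    # Simple lowercase word tokenizer (an assumption, not the study's tokenizer).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        "family_medicine": sum(counts[t] for t in FAMILY_MEDICINE_TERMS),
        "non_family_medicine": sum(counts[t] for t in NON_FAMILY_MEDICINE_TERMS),
    }


# Invented sample text standing in for a withheld document.
doc_a = "Research in surgery; more research on each procedure."
doc_a_features = feature_vector(doc_a)
print(doc_a_features)
```

Only the resulting counts, not the text itself, would then be passed to a classifier.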
W2V takes terms from a large corpus of text and maps them onto a vector space based on word associations in the dataset. These word associations take into account each word's immediate context (its ten neighboring words).
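The notion of a word's immediate context can be illustrated with a short sketch. It assumes that "ten neighboring words" means up to five words on each side of the target, which is an interpretation rather than something stated above. The function name and sample sentence are hypothetical, and the sketch only builds the context windows; it does not train a model.

```python
def context_windows(tokens, window=5):
    """Yield (target, neighbors) pairs with up to `window` words on each side.

    window=5 gives at most ten neighbors per target word (an assumed reading
    of "ten neighboring words").
    """
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


# Invented sample sentence; not taken from any application text.
tokens = "i hope to serve rural families as a primary care physician someday".split()
pairs = list(context_windows(tokens))

target, neighbors = pairs[6]
print(target, neighbors)
```

Words near the start or end of a text simply have fewer neighbors available.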
After modeling the data (large-scale unstructured text), the platform generates a visualization of this vector space, which lets us perform analysis, e.g. detect synonymous or near-synonymous words and highlight related words. At the heart of this project is W2V's ability to identify key words that were more frequent in, and more distinctive of, each group, using results from two different W2V models, one for each group's application texts.
We coded these key terms into categories, then analyzed those categories for overarching themes.
Each row in this dataset represents a single non-profit organization (NPO), identified by its Employer Identification Number (EIN).
Each row contains the National Taxonomy of Exempt Entities (NTEE) code assigned to the NPO by the IRS (if any) and the official Essential/Non-Essential status connected to that NTEE code.
Each row of this dataset represents a single Ohio-based non-profit organization (NPO), identified by its Employer Identification Number, and a hand-coded determination of its 'essential' status.
This determination of essential status is guided by the official IRS definition and based strictly on the NPO's own mission statement and activities language supplied in its 2019 tax form.
This CSV file contains the topic distribution of each EIN as uncovered using six parallel Latent Dirichlet Allocation (LDA) Topic Models.
Each row depicts a topic and topic-score associated with an Ohio NPO (identified by Employer Identification Number) generated from one model run.
Because six models are run, the topic scores across all rows associated with a single EIN will therefore sum to no more than 6.0 (6 models x 100%).
Topic scores below .01 (1%) are not included.
Each topic from the models is further identified as Essential/Non-Essential by subject matter expert Dr. Michael Jones, guided by the official IRS definition.
The topic models are generated from unstructured text: the mission statement and activities language taken from the 2019 tax forms of Ohio non-profit organizations.
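The relationship between the 1% cutoff and the 6.0 ceiling can be checked with a small sketch. The rows, EINs, column order, and topic identifiers here are invented for illustration and do not reflect the layout of the actual CSV file.

```python
from collections import defaultdict

# Hypothetical rows: (EIN, model run, topic id, topic score). Values are invented.
rows = [
    ("00-0000001", 1, "t3", 0.72),
    ("00-0000001", 1, "t7", 0.005),  # below the 1% cutoff, so it would be omitted
    ("00-0000001", 2, "t1", 0.64),
    ("00-0000002", 1, "t3", 0.98),
]

N_MODELS = 6      # six parallel LDA topic models
MIN_SCORE = 0.01  # topic scores below 1% are not included

# Keep only the rows that meet the cutoff.
kept = [row for row in rows if row[3] >= MIN_SCORE]

# Sum the retained topic scores per EIN across all model runs.
totals = defaultdict(float)
for ein, _model, _topic, score in kept:
    totals[ein] += score

# Each model contributes at most 1.0 (100%) per EIN, so no total exceeds 6.0.
for ein, total in totals.items():
    assert total <= N_MODELS * 1.0
    print(ein, round(total, 3))
```

Because low-scoring topics are dropped, the per-EIN total will usually fall somewhat below 6.0 even when all six models assign topics to that organization.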
All models and corresponding network visualizations are generated from documents in the CORD-19 dataset as of July 14, 2020. All annotations in red were added by the research team.
Note: These topic models are included here as additional reference and to append links to interactive versions on the Digital Scholarship Center’s machine learning platform for further exploration.
All models and corresponding network visualizations are generated from virus-related documents in the CORD-19 dataset as of July 14, 2020. All annotations in red were added by the research team.
Note: Certain Non-Coronaviridae topic models are included in the text of this article and are included here only as additional reference and to append links to interactive versions on the Digital Scholarship Center’s machine learning platform for further exploration.
All models and corresponding network visualizations are generated from virus-related documents in the CORD-19 dataset as of July 2020. All annotations in red were added by the research team.
Note: Coronavirus topic models are included in the text of this article and are included here only as additional reference and to append links to interactive versions on the Digital Scholarship Center’s machine learning platform for further exploration.
These centrality measurements were generated with NetworkX, a Python package for network analysis. The specific algorithms used for this paper are betweenness centrality (where degree centrality considers individual topics).
Complete Centrality Data for this research can be found at https://scholar.uc.edu/show/6t053h21x
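For readers unfamiliar with NetworkX, the two centrality measures named above can be computed as in this minimal sketch. The three-topic graph is a hypothetical toy example, not the topic network used in this research, and the sketch assumes an undirected, unweighted graph.

```python
import networkx as nx

# Hypothetical topic network: nodes are topics, edges link related topics.
G = nx.Graph()
G.add_edges_from([("topic_a", "topic_b"), ("topic_b", "topic_c")])

# Betweenness centrality: share of shortest paths that pass through a node.
betweenness = nx.betweenness_centrality(G)

# Degree centrality: fraction of the other nodes a node is directly linked to.
degree = nx.degree_centrality(G)

for topic in G.nodes:
    print(topic, betweenness[topic], degree[topic])
```

In this toy graph, topic_b lies on the only path between the other two topics, so it receives the highest betweenness score.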