Betweenness centrality is a measure of centrality in a network based on shortest paths.
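To make the definition concrete, here is a minimal sketch that counts, for each node, the share of shortest paths between other node pairs that pass through it. It assumes an unweighted, undirected graph and returns unnormalized scores. The function name and the small example graph are hypothetical, and this is not necessarily how the networked models in this collection were scored.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness for an unweighted, undirected graph.

    graph: dict mapping each node to a list of its neighbours.
    """
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search from source, counting shortest paths.
        order = []
        predecessors = {v: [] for v in graph}
        num_paths = {v: 0 for v in graph}
        distance = {v: -1 for v in graph}
        num_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    num_paths[w] += num_paths[v]
                    predecessors[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        dependency = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in predecessors[w]:
                share = num_paths[v] / num_paths[w]
                dependency[v] += share * (1.0 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2.0 for v, c in centrality.items()}


# Hypothetical path graph: a - b - c - d
example = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = betweenness_centrality(example)
```

In the example path graph, only the two interior nodes lie on shortest paths between other pairs, so they receive the non-zero scores.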
The data files in this collection cover datasets with the following properties:
Document Count: 5,000 documents
Corpus: (one of) Caselaw (cas) / Pubmed Abstracts (pma) / Pubmed Central (pmc)
Search Term: (one of) Climate / Earth / Environmental / Pollution
Networked Models at Topic Counts: 15, 20
CSV files containing the coherence scores for datasets with:
DocumentCount = 5,000
Corpus = (one from) Federal Caselaw [cas] / Pubmed-Abstracts [pma] / Pubmed-Central [pmc] / Chicago Novel Corpus [nvl] / Newspaper Corpus [nws]
SearchTerm[s] = (one from) Earth / Environmental / Climate / Pollution / Random 5k documents of a specific corpus
Coherence was scored across every combination of:
TopicCount: 10-40
Hyperparameter-Alpha: [0.01, 0.31, 0.61, 0.91, symmetric, asymmetric]
Hyperparameter-Beta: [0.01, 0.31, 0.61, 0.91, automatic, symmetric]
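The parameter grid above can be enumerated with a short sketch. It assumes that "10-40" means every integer topic count from 10 through 40 inclusive, which is consistent with the 1,116 scores per median reported for the heat map (31 x 6 x 6). The variable names are illustrative and not taken from the original code.

```python
from itertools import product

# Assumption: topic counts run over every integer from 10 to 40 inclusive.
topic_counts = range(10, 41)
alphas = [0.01, 0.31, 0.61, 0.91, "symmetric", "asymmetric"]
betas = [0.01, 0.31, 0.61, 0.91, "automatic", "symmetric"]

# One (topics, alpha, beta) tuple per model to be scored.
grid = list(product(topic_counts, alphas, betas))
print(len(grid))  # 31 * 6 * 6 = 1116
```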
The columns in this file include:
Validation_Set: Which search term this scoring pertains to
Topics: Number of topics in the model
Alpha: Hyperparameter alpha selection from the 6 options above
Beta: Hyperparameter beta selection from the 6 options above
Coherence: The topic coherence score for the given model-row
Perplexity: The perplexity score for the given model-row
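As a rough sketch of how these columns might be read, the snippet below parses a CSV with the header listed above and picks the row with the highest coherence. The two data rows are made-up placeholders, not values from the dataset, and `load_scores` is a hypothetical helper. Alpha and Beta are kept as text because they mix numeric and named settings.

```python
import csv
import io

# Made-up rows for illustration only; the real files hold the actual scores.
sample = io.StringIO(
    "Validation_Set,Topics,Alpha,Beta,Coherence,Perplexity\n"
    "climate,15,0.31,automatic,0.50,-7.5\n"
    "climate,20,symmetric,0.01,0.40,-7.25\n"
)


def load_scores(handle):
    """Read model rows, converting the numeric columns."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "Validation_Set": row["Validation_Set"],
            "Topics": int(row["Topics"]),
            "Alpha": row["Alpha"],  # text: a number or a named setting
            "Beta": row["Beta"],    # text: a number or a named setting
            "Coherence": float(row["Coherence"]),
            "Perplexity": float(row["Perplexity"]),
        })
    return rows


rows = load_scores(sample)
best = max(rows, key=lambda r: r["Coherence"])
```

To read a file from the collection instead, pass an open file handle to `load_scores` in place of the in-memory sample.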
Box-and-Whisker visualization of coherence scores for three corpora types: Caselaw (cas), Pubmed Abstracts (pma), Pubmed Central (pmc).
This figure is for models matching the search term "climate". Visualizations for other search terms and additional interactive elements are available at the related URL below.
Coherence was scored across every combination of:
- TopicCount: 10-40
- Hyperparameter-Alpha: [0.01, 0.31, 0.61, 0.91, symmetric, asymmetric]
- Hyperparameter-Beta: [0.01, 0.31, 0.61, 0.91, automatic, symmetric]
Box-and-Whisker visualization of topic coherence scores for three corpora types: Caselaw (cas), Pubmed Abstracts (pma), Pubmed Central (pmc). This figure is for models matching the search term "climate". Visualizations for other search terms and additional interactive elements are available at the related URL below.
Coherence was scored across every combination of:
- TopicCount: 10-40
- Hyperparameter-Alpha: [0.01, 0.31, 0.61, 0.91, symmetric, asymmetric]
- Hyperparameter-Beta: [0.01, 0.31, 0.61, 0.91, automatic, symmetric]
Heat map visualization of median coherence scores for three corpora: Caselaw (cas), Pubmed Abstracts (pma), Pubmed Central (pmc).
Median coherence scores are taken across all search-term-based models ("climate", "earth", "environmental", "pollution").
The median is computed from 1,116 total coherence scores. Coherence was scored across every combination of:
- TopicCount: 10-40
- Hyperparameter-Alpha: [0.01, 0.31, 0.61, 0.91, symmetric, asymmetric]
- Hyperparameter-Beta: [0.01, 0.31, 0.61, 0.91, automatic, symmetric]