Box-and-whisker visualization of topic coherence scores for three corpus types: Caselaw (cas), Pubmed Abstracts (pma), Pubmed Central (pmc).
This figure covers models matching the search term "climate". Visualizations for other search terms and additional interactive elements are available at the related URL below.
Coherence was scored across every combination of:
- TopicCount: 10-40
- Hyperparameter-Alpha: [0.01, 0.31, 0.61, 0.91, symmetric, asymmetric]
- Hyperparameter-Beta: [0.01, 0.31, 0.61, 0.91, automatic, symmetric]
Heat map visualization of median coherence scores for three corpora: Caselaw (cas), Pubmed Abstracts (pma), Pubmed Central (pmc).
Median coherence scores are shown across all search-term-based models ("climate", "earth", "environmental", "pollution").
The median is computed from 1,116 total coherence scores. Coherence was scored across every combination of:
- TopicCount: 10-40
- Hyperparameter-Alpha: [0.01, 0.31, 0.61, 0.91, symmetric, asymmetric]
- Hyperparameter-Beta: [0.01, 0.31, 0.61, 0.91, automatic, symmetric]
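The figure of 1,116 scores is consistent with the size of the grid listed above. The following is a minimal sketch that enumerates the combinations, assuming the topic count range 10-40 is inclusive with a step of 1; the variable names are illustrative and not taken from the original analysis code.

```python
from itertools import product

# Grid values as listed above. The 10-40 range is ASSUMED to be
# inclusive with a step of 1, which gives 31 topic counts.
topic_counts = range(10, 41)
alphas = [0.01, 0.31, 0.61, 0.91, "symmetric", "asymmetric"]
betas = [0.01, 0.31, 0.61, 0.91, "automatic", "symmetric"]

# Every (topic count, alpha, beta) combination that would be scored.
grid = list(product(topic_counts, alphas, betas))

print(len(grid))  # 31 * 6 * 6 = 1116
```

Under that assumption, each combination corresponds to one trained model and one coherence score, so the median summarizes 1,116 values.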
The data files in this collection cover datasets with the following properties:
Document Count: 5,000 documents
Corpus: (one of) Caselaw (cas) / Pubmed Abstracts (pma) / Pubmed Central (pmc)
Search Term: (one of) Climate / Earth / Environmental / Pollution
Networked Models at Topic Counts: 15, 20
The filename convention is:
central_datasetAcronym_searchTerm,topicCount.csv
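As a rough illustration of the filename convention, the sketch below splits a filename into its parts. It takes the pattern literally, including the comma before the topic count. The example filename, the `CentralityFile` type, and the `parse_filename` helper are hypothetical; the actual capitalization used in the collection's filenames is not stated here, so the parser lowercases both fields.

```python
import re
from typing import NamedTuple


class CentralityFile(NamedTuple):
    corpus: str       # dataset acronym, e.g. cas / pma / pmc
    search_term: str  # e.g. climate / earth / environmental / pollution
    topic_count: int  # e.g. 15 or 20


# Literal reading of: central_datasetAcronym_searchTerm,topicCount.csv
PATTERN = re.compile(
    r"^central_(?P<corpus>[A-Za-z]+)_(?P<term>[A-Za-z]+),(?P<topics>\d+)\.csv$"
)


def parse_filename(name: str) -> CentralityFile:
    """Split a filename into corpus, search term, and topic count."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"filename does not follow the convention: {name!r}")
    return CentralityFile(
        corpus=match["corpus"].lower(),
        search_term=match["term"].lower(),
        topic_count=int(match["topics"]),
    )


# Hypothetical filename, built from the convention for illustration only.
info = parse_filename("central_cas_climate,15.csv")
print(info.corpus, info.search_term, info.topic_count)
```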
Each row represents one topic-node of the networked model
The columns in each CSV include:
Edges: Connection threshold [e.g. 25% means the row reports the node's centrality when only the top quartile of edges by strength is kept in the graph]
ID: Unique ID for the topic-node in the networked model
Score: The betweenness centrality score of that topic-node
Terms: The top terms of the topic-node
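To make the Edges and Score columns concrete, here is a minimal sketch of how a betweenness centrality score at a given connection threshold could be computed. It is not the original pipeline: the toy graph, the edge strengths, and the helper names `top_edges` and `betweenness` are all hypothetical, and the sketch assumes an undirected, unweighted graph with unnormalized scores, which may differ from how the published scores were produced.

```python
from collections import deque


def top_edges(edges, fraction):
    """Keep the strongest `fraction` of edges (0.25 = top quartile)."""
    ranked = sorted(edges, key=lambda e: e[2], reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


def betweenness(nodes, edges):
    """Unnormalized betweenness centrality (Brandes' algorithm) on an
    undirected graph, ignoring edge weights once edges are selected."""
    adj = {n: [] for n in nodes}
    for u, v, _ in edges:
        adj[u].append(v)
        adj[v].append(u)
    score = {n: 0.0 for n in nodes}
    for s in nodes:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        order = []
        preds = {n: [] for n in nodes}
        sigma = {n: 0 for n in nodes}
        dist = {n: -1 for n in nodes}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back toward s.
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {n: c / 2 for n, c in score.items()}


# Hypothetical topic-nodes and edge strengths, for illustration only.
nodes = ["t0", "t1", "t2", "t3", "t4"]
edges = [
    ("t0", "t1", 0.90), ("t1", "t2", 0.80), ("t2", "t3", 0.50),
    ("t3", "t4", 0.40), ("t0", "t4", 0.30), ("t0", "t2", 0.20),
    ("t1", "t3", 0.15), ("t2", "t4", 0.10),
]

# Edges = 25%: only the two strongest of the eight edges remain.
scores = betweenness(nodes, top_edges(edges, 0.25))
print(scores["t1"])  # 1.0
```

In this toy graph, the 25% threshold leaves only the path t0, t1, t2, so t1 is the sole node lying between another pair and all other nodes score 0.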