Biomedical Data


  • The Biomedical Data guide connects you to biomedical research resources available through UM Libraries.
  • This guide provides information about statistical tests, data sources, data visualization resources, and more.
  • You can access learning resources from my workshops, which cover SPSS, R, Tableau, and meta-analysis.
  • This guide is constantly evolving. Please contact me if you find a tool or learning resource that could be included here.
Subject Specialist

Thilani Samarakoon

Cameron Riopelle

How do we select the right statistical test?
This page helps you choose a statistical test based on your study design, purpose, and type of data.
 
| Test | Variable type | Assumptions | Use/example | Parametric/non-parametric* |
| --- | --- | --- | --- | --- |
| Chi-square test of independence (also called Pearson's chi-square test or chi-square test of association); SPSS video demo available | Categorical | Independence of observations; expected cell count assumption (expected cell counts are all greater than 5) | Compare sample proportions to check whether there is an association between the categories. A residual analysis identifies the specific cells that contribute most to a significant chi-square test. | Non-parametric |
| Fisher's exact test | Categorical | Independence of observations. Use Fisher's exact test if the expected cell count is less than 5 for any cell. | Compare sample proportions to check whether there is an association between the categories. Used for small samples. | Non-parametric |
| McNemar's test | Categorical | The categories of the dichotomous dependent variable are mutually exclusive; the two groups are related (paired). | Check for differences in a dichotomous dependent variable between two related groups. | Non-parametric |
| Independent samples t-test; SPSS video demo available | Continuous | Normal distribution (Shapiro-Wilk test), homogeneity of variances (Levene's test), independence of observations, no outliers | Compare the means of two independent samples. | Parametric |
| Mann-Whitney U test | Continuous or ordinal | Independence of observations; the independent variable consists of two categorical groups; the distributions of the dependent variable in the two groups have the same shape. | Compare two independent samples when the normality assumption is violated. The test compares mean ranks rather than means. | Non-parametric (alternative to the independent samples t-test) |
| Dependent samples t-test | Continuous | The differences between the paired values are normally distributed (Shapiro-Wilk test) with no outliers; the pairs are independent of one another. | Compare the means of two dependent samples. | Parametric |
| Wilcoxon signed-rank test | Continuous or ordinal | The independent variable consists of two related groups (matched pairs); the dependent variable is measured at the ordinal or continuous level; the distribution of the paired differences is symmetric. | Compare two dependent samples when the normality assumption is violated. | Non-parametric (alternative to the dependent samples t-test) |
| ANOVA | Continuous | Normal distribution (Shapiro-Wilk test), homogeneity of variances (Levene's test), independence of observations, no outliers | Compare independent sample means for more than 2 groups. Post hoc: if equal variances are assumed, select the Tukey test; otherwise, select Games-Howell. | Parametric |
| Kruskal-Wallis test | Continuous or ordinal | Independence of observations; the independent variable consists of two or more categorical groups. | Compare more than 2 independent samples when the normality assumption is violated. The test compares mean ranks rather than means. | Non-parametric (alternative to ANOVA; an extension of the Mann-Whitney U test) |
| ANOVA with repeated measures | Continuous | Normal distribution (Shapiro-Wilk test), no outliers, sphericity (Mauchly's test), independence between participants | Compare the means of 3 or more related groups (the same participants measured 3 or more times). | Parametric |
| Friedman test | Continuous or ordinal | One group is measured 3 or more times; random sampling | Compare 3 or more related groups (the same participants) when the assumptions of ANOVA with repeated measures are violated or the dependent variable is ordinal. | Non-parametric (alternative to ANOVA with repeated measures) |
| Pearson correlation | Continuous | The association between the two variables is linear; normal distribution (Shapiro-Wilk test); homoscedasticity | Examine the strength and direction of the linear association between two continuous variables. The coefficient takes a value between -1 and +1. | Parametric |
| Spearman correlation | Continuous or ordinal | There is a monotonic relationship between the two variables. | Examine the strength and direction of the monotonic association between continuous or ordinal variables when the assumptions of a Pearson correlation are not met. The coefficient takes a value between -1 and +1. | Non-parametric (alternative to Pearson correlation) |
| Kendall's tau-b | Continuous or ordinal | There is a monotonic relationship between the two variables. | Examine the association and direction between continuous or ordinal variables. Especially suited to small sample sizes and to variables with many tied ranks. The coefficient takes a value between -1 and +1. | Non-parametric (alternative to Pearson correlation and Spearman correlation) |
| Somers' delta | Ordinal | Both variables are measured on an ordinal scale. There is a monotonic relationship between the two variables. | Examine the association and direction between two ordinal variables. The coefficient takes a value between -1 and +1. | Non-parametric |
| Goodman and Kruskal's gamma | Ordinal | Both variables are measured on an ordinal scale. There is a monotonic relationship between the two variables. | Examine the association and direction between two ordinal variables. The coefficient takes a value between -1 and +1. It does not differentiate between the independent variable and the dependent variable. Commonly used for variables with many tied ranks. | Non-parametric |
| Linear regression | Continuous | Linear relationship between the outcome and each independent variable, no significant outliers, independence of observations, normally distributed residuals | Examine the linear association between a continuous outcome variable and one or more independent variables. | Parametric |
| Logistic regression | Categorical (dichotomous) | The dependent variable is dichotomous; independence of observations; linear relationship between each continuous independent variable and the logit transformation of the dependent variable | Estimate the probability that an observation falls into one of two categories of a dichotomous dependent variable, based on one or more independent variables that can be either continuous or categorical. | Parametric |

* Parametric tests assume that the population data are normally distributed. Non-parametric tests don't rely on the statistical distribution of the population data.
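
The expected cell count rule that separates the chi-square test from Fisher's exact test in the table can be checked by hand. The sketch below is a minimal illustration in plain Python, not a replacement for SPSS or R output: the function names (`expected_counts`, `chi_square_statistic`, `pearson_residuals`, `suggest_test`) and the example counts are made up for this guide, and treating an expected count of exactly 5 as acceptable is an assumption. It computes the expected counts, the Pearson chi-square statistic, and simple Pearson residuals for a residual analysis. It does not compute a p-value.

```python
from math import sqrt
from typing import List, Sequence

Table = Sequence[Sequence[float]]


def expected_counts(observed: Table) -> List[List[float]]:
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    if n == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(observed: Table) -> float:
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def pearson_residuals(observed: Table) -> List[List[float]]:
    """(observed - expected) / sqrt(expected) for each cell.

    Cells with large absolute residuals contribute most to the statistic.
    """
    expected = expected_counts(observed)
    return [
        [(o - e) / sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]


def suggest_test(observed: Table, min_expected: float = 5.0) -> str:
    """Apply the expected cell count rule of thumb from the table.

    Assumption: an expected count below min_expected in any cell
    points to Fisher's exact test.
    """
    smallest = min(min(row) for row in expected_counts(observed))
    if smallest < min_expected:
        return "Fisher's exact test"
    return "Chi-square test of independence"


if __name__ == "__main__":
    # Made-up 2x2 table: rows are two groups, columns are two outcomes.
    observed = [[30, 10], [20, 40]]
    print(suggest_test(observed))
    print(chi_square_statistic(observed))
    print(pearson_residuals(observed))
```

In practice, a statistics package also reports the degrees of freedom and the p-value. The sketch only shows where the expected counts and residuals come from.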


Other commonly used tests:
  1. Cronbach's alpha: This method is used to study the reliability (internal consistency) of a scale used in a survey, commonly one built from Likert scale questions. Alpha usually ranges from 0 to 1. If alpha is 0.7 or above, the measure is commonly considered reliable. For example, suppose a survey uses 10 Likert scale questions to measure whether students are happy at school. The researcher can survey a sample and compute Cronbach's alpha to estimate the reliability of the scale.
  2. Cohen's kappa: This test is used to study inter-rater reliability when two raters evaluate a variable on a categorical scale. Kappa ranges from -1 to 1, where 0 indicates agreement no better than chance and 1 indicates perfect agreement. There is no clear cut-off for good agreement, but values of 0.6-0.8 are normally considered good. A few assumptions must be met before conducting the test: the categories are mutually exclusive, the raters are independent, and the same two raters evaluate all the items.
  3. Fleiss' kappa: This test is used to study inter-rater reliability when two or more raters evaluate a variable on a categorical scale. It can also be used when the raters are not fixed, meaning that the same raters do not evaluate every item.
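
To make the first two measures concrete, here is a minimal sketch using the standard textbook formulas in plain Python. The function names and the example data are hypothetical, and the sketch assumes complete data (no missing answers or ratings). Cronbach's alpha is computed as k / (k - 1) * (1 - sum of item variances / variance of the total scores), and Cohen's kappa as (observed agreement - chance agreement) / (1 - chance agreement).

```python
from collections import Counter
from statistics import variance
from typing import Hashable, Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items matrix.

    Each inner sequence holds one respondent's answers to all k items.
    Uses the sample variance (n - 1 denominator).
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    item_variances = [variance(item) for item in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who rated the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical ratings.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


if __name__ == "__main__":
    # Made-up Likert answers: 5 respondents, 3 items.
    likert = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [3, 3, 2], [1, 2, 1]]
    print(cronbach_alpha(likert))

    # Made-up ratings from two raters on 10 items.
    a = ["yes"] * 5 + ["no"] * 5
    b = ["yes"] * 4 + ["no"] * 5 + ["yes"]
    print(cohen_kappa(a, b))
```

Statistical packages such as SPSS and R report these values along with additional diagnostics. The sketch is meant only to show what the coefficients measure.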

Resources:



 
  1. Data.gov
https://www.data.gov/
Data.gov, launched in 2009, offers public access to data sets generated by the U.S. Federal Government. You will find federal, state, and local data, tools, and other resources to conduct research. The site contains more than 180,000 data sets and offers several filters to narrow your search.
 
  2. Healthdata.gov
https://healthdata.gov/
The site contains health-related data sets on various topics, including community health, medical devices, substance abuse, and hospitals. Data sets are organized by topic. Some data sets are readily available; in other cases, you will be directed to another site.
 
  3. ClinicalTrials.gov
https://clinicaltrials.gov/
The site is home to more than 350,000 research studies conducted in more than 200 countries. The U.S. National Library of Medicine runs this resource.
The site offers a comprehensive guide on how to customize a search.
https://clinicaltrials.gov/ct2/help/how-find/index
 
  4. National Center for Health Statistics (NCHS)
https://www.cdc.gov/nchs/index.htm
NCHS is part of the Centers for Disease Control & Prevention (CDC) and provides data and health statistics on crucial public health topics.
      • Health, United States is an annual report on trends in health statistics. It includes trend tables and figures on selected topics such as mortality, life expectancy, causes of death, and health care expenses. https://www.cdc.gov/nchs/data/hus/hus18.pdf
      • FastStats is a site within NCHS that offers quick access to data, statistics, and reports on specific health topics. The topics are organized in alphabetical order and direct you to more resources on each topic.
      • The National Vital Statistics System provides access to most of the data on births and deaths in the United States. The webpage links to statistical reports, data sources, and data visualizations.
 
  5. Surveillance, Epidemiology, and End Results (SEER)
https://seer.cancer.gov/
SEER collects cancer incidence data and includes statistics on mortality, survival, stage, and trends in rates. A request must be submitted to access the data.
https://seer.cancer.gov/data/access.html
Cancer Query Systems - provides access to cancer data stored in an online database. The web-based interface is easy to use and retrieves statistics through a query-based system.
SEER*Stat - downloaded data can be analyzed using the SEER*Prep and SEER*Stat statistical software. https://seer.cancer.gov/data-software/software.html
Please refer to the webpage for installation instructions and tutorials.
 
  6. National Institute of Mental Health Data Archive (NDA)
https://nda.nih.gov/
NDA collects and stores data from and about human subjects across many research fields. It holds data collected by autism researchers and by researchers funded by the National Institute on Alcohol Abuse and Alcoholism (NIAAA). It also stores data from the Adolescent Brain Cognitive Development Study, the Human Connectome projects, and the Osteoarthritis Initiative.
 
  7. Data Resource Center for Child & Adolescent Health
https://www.childhealthdata.org/home
The website offers access to national, state, and regional data collected via the National Survey of Children’s Health (NSCH). The survey collects data on mental health, dental health, development, and many other child health indicators to improve child, youth, family, and community health and well-being. You can submit a data request to access the raw data or use the interactive query-based system to search the data.


 
  8. Substance Abuse and Mental Health Data Archive (SAMHDA)
https://www.datafiles.samhsa.gov/
The site offers public access to data on substance use and mental health. In addition, it provides data analysis tools: Public-use Data Analysis System (PDAS) and restricted-use Data Analysis System (RDAS).
 
  9. Health Information National Trends Survey (HINTS)
https://hints.cancer.gov/
The survey collects data about the American public’s knowledge of, attitudes toward, and use of cancer- and health-related information. The data sets help researchers identify how adults use different communication channels to understand and obtain health-related information. HINTS data are free to download and analyze.
 
  10. National Trauma Data Bank Research Dataset (NTDBRDS)
https://www.facs.org/Quality-Programs/Trauma/TQP/center-programs/NTDB/about
NTDBRDS is maintained by the American College of Surgeons and is the most extensive collection of U.S. trauma registry data. Researchers must submit an online request to obtain the data sets, which are available at a cost.

  11. Healthcare Cost and Utilization Project (HCUP)
HCUP's databases contain information on inpatient stays, emergency department visits, and ambulatory care.
 
Nationwide HCUP Databases
 
  • National (Nationwide) Inpatient Sample (NIS)
NIS Database Documentation
  • Kids’ Inpatient Database (KID)
KID Database Documentation
  • Nationwide Ambulatory Surgery Sample (NASS)
NASS Database Documentation
  • Nationwide Emergency Department Sample (NEDS)
NEDS Database Documentation
  • Nationwide Readmissions Database (NRD)
NRD Database Documentation
 
State-specific HCUP Databases
  • State Inpatient Databases (SID)
SID Database Documentation
  • State Ambulatory Surgery and Services Databases (SASD)
SASD Database Documentation
  • State Emergency Department Databases (SEDD)
SEDD Database Documentation
 
Some of these data sets do not offer free access, and purchase details can be found at Purchasing FAQs.
 
HCUPnet
https://hcupnet.ahrq.gov/
HCUPnet is part of the HCUP project and offers a free online query system to access data on inpatient stays (NIS, KID, SID, and NRD), emergency department visits (NEDS, SEDD, SID), and community (SID). To set up an analysis, click “Create new analysis.” Once the analysis is completed, it can be customized using the “My analysis” menu.
What is data visualization?

Data visualization is the visual presentation of data and an essential step in data analysis.

 

As the image below illustrates, a good visualization should be the intersection of data, function, and design.



Visual elements like graphs, maps, tables, and plots help the audience identify trends, patterns, and distributions in the data. Data visualization is useful for converting complex data sets into meaningful information that the audience can understand quickly.
Researchers in biomedical fields collect large amounts of data that are becoming increasingly complex.

Creating a data visualization
Researchers in biomedical fields collect large amounts of increasingly complex data. The field of data visualization continues to develop tools and new kinds of visualizations to address this complexity.

How do we create a meaningful visualization?
  • Know your audience
It is essential to understand your audience's skill level, interest, and focus before creating a visualization.
 
  • Choose the right graph/table
We create visualizations to show composition, comparison, distribution, and relationships among data. The graph you choose should deliver a meaningful, accurate, and clear message.
How to choose the right graph for your data.
The data visualization catalogue.
 
  • Declutter
Gridlines, unnecessary use of colors, non-standard fonts, and 3D objects are some elements that add clutter to a graph. The audience will be distracted and find it difficult to read a cluttered graph.
Declutter - Storytelling with Data.
Charts Do's and Don'ts.
 
  • Focus attention
We can use visual encoding to add more dimensions to the data and focus the audience's attention. Color, position, length, size, orientation, and shape are known as retinal variables used for visual encoding.
Practical Rules for Using Color in Charts.
Choose Appropriate Visual Encodings.

 
Tools
Tableau is capable of creating high-quality visualizations and dashboards. It is a great tool for exploratory data analysis with built-in recommendations.
Tableau offers free one-year licenses to students. Click to request your license.
Tableau online resources are a great way to start learning the software.
R is a free software environment for creating visualizations. It requires some programming knowledge.
R offers additional packages that support many visualization tasks:
  • ggplot2: Creates custom plots. Refer to Cookbook for R to learn ggplot2.
  • GGally: Creates correlation matrices and survival plots.
  • gplots: Creates heat maps and Venn diagrams.
  • leaflet: Builds interactive maps.
Click to learn more about R packages.
  There are tools like Biowheel for exploring high-dimensional biomedical data.
  Infographics create visual communications and provide an overview of the topic.
AHRQ Research Data Infographics
Piktochart
Canva





 
I conduct a workshop series in the fall and spring that covers topics in R, SPSS, Tableau, and meta-analysis.
Click here to register.

You can access the material covered in the workshops here. After going through the content, please reach out to me at thilani.samarakoon@miami.edu for assistance.
 