Investigators Identify Characteristics to Better Define Long COVID


Using machine learning, they find patterns in electronic health record data to help ascertain which individuals likely have the condition.

Investigators have identified characteristics of individuals with long COVID and those who are likely to have it by using machine learning techniques.

The investigators, who were supported by the National Institutes of Health (NIH), analyzed a collection of electronic health records (EHR) available for COVID-19 research to help better identify who has long COVID.

Investigators used EHR data from the National COVID Cohort Collaborative (N3C), a centralized national public database led by the NIH’s National Center for Advancing Translational Sciences, to identify more than 100,000 likely cases of long COVID as of October 2021 and 200,000 cases as of May 2022.

“It made sense to take advantage of modern data analysis tools and a unique big data resource like N3C, where many features of long COVID can be represented,” Emily Pfaff, PhD, a clinical informaticist at the University of North Carolina at Chapel Hill, said in a statement.

The N3C data represent more than 13 million individuals nationwide and nearly 5 million positive COVID-19 cases. The database supports rapid research on emerging questions about COVID-19 health outcomes, risk factors, therapies, and vaccines.

In the study published in The Lancet Digital Health, the investigators examined data on patient demographics, diagnosis, health care use, and medication in the health records of 97,995 individuals who had COVID-19 and were in the N3C database.

They combined this information with data on nearly 600 individuals with long COVID from 3 long COVID clinics to create 3 machine learning models to identify individuals with the condition.

Investigators “trained” the machine learning models to sift through large amounts of data and reveal new insights about long COVID. The models identified patterns that could help investigators understand patient characteristics and identify individuals with long COVID.

The models focused on identifying individuals who potentially had long COVID in 3 groups within N3C: all individuals with COVID-19, patients hospitalized with COVID-19, and those who had COVID-19 but were not hospitalized.
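
As a rough illustration of the 3 groups described above, the following Python sketch splits a set of patient records into the same 3 cohorts. The record fields, the `split_cohorts` function name, and the sample values are assumptions made for illustration; they do not reflect the study's code or the N3C data format.

```python
def split_cohorts(patients):
    """Group hypothetical COVID-19 patient records into the 3 cohorts named in the article."""
    hospitalized = [p for p in patients if p["hospitalized"]]
    non_hospitalized = [p for p in patients if not p["hospitalized"]]
    return {
        "all": list(patients),                 # everyone with COVID-19
        "hospitalized": hospitalized,          # hospitalized with COVID-19
        "non_hospitalized": non_hospitalized,  # had COVID-19, not hospitalized
    }

# Made-up records for illustration only (not real data).
patients = [
    {"patient_id": "A", "hospitalized": True},
    {"patient_id": "B", "hospitalized": False},
    {"patient_id": "C", "hospitalized": False},
]

cohorts = split_cohorts(patients)
cohort_sizes = {name: len(group) for name, group in cohorts.items()}
print(cohort_sizes)
```

In this sketch, each cohort could then be passed to its own model, which mirrors the article's description of 3 models for 3 groups.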

The models accurately identified individuals who were at risk for long COVID by comparing them with patients seen at the long COVID clinics.

The machine learning systems classified approximately 100,000 individuals in the N3C database who were close matches to those with long COVID, investigators said.

The models searched for common features, including doctors’ visits, new medications, and new symptoms, in individuals with a positive COVID-19 diagnosis who were at least 90 days out from their acute infection.
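
The 90-day window can be sketched in a few lines of Python. This is a minimal illustration only: the record fields, the `eligible_patients` function name, and the sample dates are hypothetical and are not taken from the study or from N3C.

```python
from datetime import date, timedelta

# Threshold from the article: at least 90 days out from acute infection.
MIN_DAYS_SINCE_INFECTION = 90

def eligible_patients(records, as_of):
    """Return IDs of patients whose positive COVID-19 date is 90 or more days before `as_of`."""
    cutoff = as_of - timedelta(days=MIN_DAYS_SINCE_INFECTION)
    return [r["patient_id"] for r in records if r["covid_positive_date"] <= cutoff]

# Made-up records for illustration only (not real data).
records = [
    {"patient_id": "A", "covid_positive_date": date(2021, 1, 10)},
    {"patient_id": "B", "covid_positive_date": date(2021, 9, 1)},
    {"patient_id": "C", "covid_positive_date": date(2021, 7, 3)},
]

eligible = eligible_patients(records, as_of=date(2021, 10, 1))
print(eligible)
```

Only records that pass a filter of this kind would then be examined for features such as visits, medications, and symptoms.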

Additionally, the models identified individuals as having long COVID if they had gone to a long COVID clinic, or if they demonstrated long COVID symptoms and likely had the condition but had not been diagnosed.

The research is part of a larger initiative, Researching COVID to Enhance Recovery (RECOVER), which aims to improve the understanding of the long-term effects of COVID-19, known as post-acute sequelae of SARS-CoV-2 infection.

The program aims to accurately identify individuals with long COVID, develop approaches for prevention and treatment, and answer questions about the effects through clinical trials, observational studies, and more.


Scientists identify characteristics to better define long COVID. EurekAlert. News release. May 16, 2022. Accessed May 18, 2022.
