By tracking the number of times Wikipedia articles on influenza and other health-related topics were read, researchers were able to produce accurate estimates of influenza activity in near real-time.
New research suggests that data on visits to Wikipedia, the user-generated online encyclopedia, may help to accurately estimate the prevalence of flu in the United States in near real time.
The Centers for Disease Control and Prevention continuously monitors levels of influenza-like illness
(ILI) throughout the country and releases weekly reports on its spread, but this data is typically 1 to 2 weeks old when it is published. To help track levels of ILI more promptly, Google launched Google Flu Trends in 2008. The system, which uses Google search terms related to flu activity to estimate current levels of ILI, initially produced impressively accurate—and timely—reports on disease levels. During the 2009 H1N1 pandemic and the more severe 2012-2013 flu season
, however, the accuracy of Googles Flu Trends estimates suffered, apparently due to the effect of increased media attention on search patterns.
The current study
, published online on April 17, 2014, in PLOS Computational Biology
, sought to develop more accurate methods of estimating real-time flu prevalence in the United States using data available from Wikipedia. The researchers crafted a model that estimates the level of ILI based on the number of times Wikipedia articles on flu or other health topics are accessed each day. To test the accuracy of the model, retrospective flu activity estimates from December 2007 through August 2013 were compared with official counts from the CDC and with estimates generated by Google Flu Trends.
The results indicated that the Wikipedia model produced accurate estimates of ILI levels up to 2 weeks before data covering the same time period was available from the CDC. A full model that included 32 health- or flu-related Wikipedia articles and another model that included 24 of the articles were both shown to be accurate. The absolute average difference between the full model and limited model estimates and CDC data were 0.27% and 0.29%, respectively, while Google Flu Trends estimates and CDC data differed by an average of 0.42%.
In addition, estimates from both Wikipedia models remained accurate even when media coverage was unusually high during the 2009 epidemic and the 2012-2013 influenza season. The Wikipedia models also predicted the peak week of ILI 17% more often than did Google Flu Trends.
“With further study, this method could potentially be implemented for continuous monitoring of ILI activity in the US and to provide support for traditional influenza surveillance tools,” the authors write. “Although it has not been investigated here, there is potential for this method to be altered for the monitoring of other health-related issues such as heart disease, diabetes, sexually transmitted infections, and others.”