What is big data? No matter what definition is used, the term “big data” is most simply and best described by the sheer volume of data involved, which can offer the opportunity for increased insight and better decision-making that could not be accomplished by analyzing smaller data sets. Nonetheless, to uncover underlying trends, correlations, and relationships that would typically be absent from one-dimensional data alone, the analysis and management of huge data sets demand consideration of several other important aspects of big data.
IBM describes big data by four key aspects: 1) the volume of data, 2) the speed at which data is generated, 3) the aggregation of distinctly different data types, and 4) the validity and security of data. These aspects are known as the four Vs: volume, velocity, variety, and veracity (IBM, 2014).
Big data has required the development of new non-relational database structures capable of handling unstructured data, computational algorithms that can effectively use all dimensions of big data, and parallel and cloud computing infrastructure to enable fast processing and sharing of data. According to the McKinsey Global Institute, the use of big data could revolutionize and create value across all sectors of the global economy, save the United States $300 billion annually in healthcare costs, and increase operating margins for retailers by 60% (Manyika et al., 2011). Furthermore, free or low-cost sources of unstructured data, such as word searches on Internet engines and online discussion sites, may provide near real-time information on disease outbreaks.
Despite its potential, big data remains vulnerable to traditional data-analysis challenges such as sampling error and bias, the failure to correct significance levels for multiple comparisons, and the correlation-causation inference that is characteristic of working with retrospective data. Nevertheless, development and implementation of tools that use big data in food safety have considerable potential to improve microbial food safety and quality.
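The multiple-comparisons problem mentioned above can be made concrete with a short sketch. The Python below applies two standard corrections, Bonferroni and Benjamini-Hochberg, to a set of p-values. The p-values and the 0.05 significance level are hypothetical and chosen only for illustration; they do not come from any study cited in this article.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag tests that remain significant after a Bonferroni correction.

    Each p-value is compared with alpha divided by the number of tests.
    """
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag tests that remain significant under the Benjamini-Hochberg
    false-discovery-rate procedure.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank

    # Every test ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Hypothetical p-values from five association tests run on the same data set.
p_values = [0.001, 0.008, 0.020, 0.041, 0.300]

uncorrected = [p <= 0.05 for p in p_values]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)

print("uncorrected:       ", uncorrected)
print("Bonferroni:        ", bonf)
print("Benjamini-Hochberg:", bh)
```

In this made-up example, four of the five tests look significant without correction, three survive Benjamini-Hochberg, and two survive Bonferroni. The gap widens quickly as the number of comparisons grows, which is the typical situation in big data analyses.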
As in other areas, the amount of food safety-related data being generated by the government, industry, and academia is increasing rapidly. While specific information on the amount of data being generated is often not easily accessible, the use of big data is very apparent in routine subtyping of foodborne pathogens. Techniques that interrogated only a small proportion of bacterial genomes (e.g., pulsed-field gel electrophoresis) are being replaced by whole genome sequencing (WGS), which provides information on each of the approximately three to six million nucleotides that make up typical bacterial foodborne pathogen genomes. For example, the private-public partnership 100K Foodborne Pathogen Genome Project aims to sequence 100,000 foodborne pathogen genomes. Similarly, the U.S. Centers for Disease Control and Prevention, the U.S. Food and Drug Administration (FDA), and the U.S. Dept. of Agriculture’s Food Safety and Inspection Service as well as public health agencies in other countries have begun routinely sequencing foodborne pathogen isolates. For example, since fall 2013 all human clinical Listeria monocytogenes isolates obtained in the United States have been subjected to WGS by either state or federal public health agencies. In addition to the large data sets that are being generated specifically for food safety applications, food safety professionals increasingly recognize the value of using larger data sets that were not generated specifically for food safety applications. For example, the use of geographic information systems (GIS) technology and geo-referenced data for predicting or identifying pre-harvest food safety hazards (particularly in the produce area) shows considerable potential to yield new science-based approaches to food safety hazards. The food industry also collects large data sets, often through real-time monitoring, that could be used in more in-depth analyses along with other data sets to improve food safety and optimize food safety investments.
This article highlights a few examples of how big data can be used to develop and implement improved food safety practices and how big data could help food safety professionals make better decisions.
GIS is a computer-based tool for mapping and analyzing features and events on Earth. The technology integrates common database operations, such as query and statistical analysis, with the visualization and geographic analysis offered by maps. With regard to food safety, GIS combines information on geographical features and attribute data (i.e., characteristics/information related to a specific location) to identify associations between the environment and a pathogen. An early and often-cited application of geographic analysis dates to 1854, when Dr. John Snow, a London physician recognized as one of the pioneers of modern GIS and epidemiology, mapped the location of cholera deaths and water wells. He used maps along with personal interview data to identify the source of the disease: the Broad Street water pump.
Today, GIS is applied to predict the spatial and temporal occurrence of foodborne pathogen contamination in produce production environments. Furthermore, GIS has helped growers understand the transmission dynamics of foodborne pathogens in the environment as well as various spatial-temporal factors (e.g., climate trends, proximity to landscape features, soil properties) that influence the potential for produce contamination events. The ultimate goal is to prioritize risks on farms and to develop a preventive approach to pre-harvest food safety. The application of GIS in produce food safety has shown considerable promise, such as helping growers make more informed decisions about field practices and develop targeted pathogen-surveillance programs. For example, the FDA and National Aeronautics and Space Administration have collaborated to develop GIS-Risk, a program that links GIS data with predictive risk-assessment models to forecast when, where, and under what conditions microbial contamination of crops is likely to lead to human illness (Oryang et al., 2014). Furthermore, Strawn and others (2013) used a GIS framework to predict spatial locations of L. monocytogenes reservoirs based on proximity to various landscape features and level of soil moisture in the produce production environments of the State of New York. They showed that field locations near the impervious land cover class had a predicted L. monocytogenes prevalence of 20%, while field locations away from it had a predicted prevalence of only 5%. Growers can therefore identify locations on farms that are at high risk for contamination and implement intervention measures to minimize the risk of transfer to produce (see Figure 2). Additionally, researchers observed that the incidence of Escherichia coli O157:H7 increased significantly after heavy rains in a California produce growing region (Cooley et al., 2007).
This finding suggested that during intense weather and subsequent flooding events, pathogen levels in the environment may be elevated. Therefore, monitoring data on rainfall totals or river flow rates may aid growers in forecasting risk of potential contamination events.
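To show how a proximity-based rule of this kind might be applied to geo-referenced field locations, the following Python sketch computes the distance from each field location to the nearest impervious surface and assigns one of the two predicted prevalence values reported by Strawn and others (2013). It is a deliberately simplified illustration, not the published model: the 100 m distance threshold, the coordinates, and the function names are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def predicted_prevalence(field, impervious_points, near_threshold_m=100.0):
    """Return 0.20 for a field location near impervious cover, else 0.05.

    The two prevalence values are those quoted in the article; the distance
    threshold is an assumption made only for this sketch.
    """
    nearest = min(
        haversine_m(field[0], field[1], p[0], p[1]) for p in impervious_points
    )
    return 0.20 if nearest <= near_threshold_m else 0.05


# Hypothetical coordinates (latitude, longitude) in decimal degrees.
impervious_points = [(42.4500, -76.4800)]  # e.g., a paved road segment
fields = {
    "field_A": (42.4503, -76.4800),  # roughly 33 m from the road
    "field_B": (42.4600, -76.4800),  # roughly 1.1 km from the road
}

risk = {name: predicted_prevalence(loc, impervious_points)
        for name, loc in fields.items()}
print(risk)
```

In practice, a GIS package would compute distances against full land cover layers rather than single points, and other variables such as soil moisture or recent rainfall would enter the prediction as well.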
Overall, the application of GIS to produce safety research has generated massive amounts of new data on the ecology of different organisms in the environment and data on various spatial-temporal-based scenarios that influence the likelihood of contamination events. In this big-data-driven era, GIS is one tool that helps researchers capture, store, process, analyze, and visualize large data sets. While the promise of applying GIS to complex food safety issues is being demonstrated, further integration of multiple large data sets (e.g., WGS data, real-time data acquired via drones) will be critical to further improve food safety throughout the farm-to-fork continuum. Application of GIS tools to address pre-harvest food safety of plant-based foods will specifically be facilitated by the rapid growth of precision agriculture, which focuses on improving yield and optimizing various production inputs.
The FDA has created and validated an open-source, WGS-based network of state, federal, and industry partners and has applied it in real-time regulatory use. The network is known as GenomeTrakr and represents the first distributed genomic food shield for detecting and tracing foodborne pathogen outbreaks back to their sources. WGS information guides investigators to specific food products, plants, and farm sources for pathogen outbreaks, providing valuable insight into the origin of contaminated food. This capability is particularly important because the FDA has a limited number of food inspectors and the U.S. food supply is becoming more global. Sample collection and sequence cataloging from food production sites can help monitor compliance with the FDA’s rules on safe food-handling practices, enhancing preventive controls for food safety. A recent example involved the 2014 suspension of a U.S. producer of a Mexican-style cheese linked to numerous illnesses caused by L. monocytogenes. WGS was employed to confirm the link between the food and facility isolates and those derived from clinical cases. The usefulness of this technology for source tracking had previously been demonstrated when it resolved enough micro-evolutionary single nucleotide polymorphism (SNP) changes to pinpoint the sources and ingredients of a Salmonella outbreak in spiced meats in 2009, and when it was used to confirm L. monocytogenes persistence for 12 years in a food processing facility.
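The idea behind SNP-based source tracking can be sketched in a few lines of Python: isolates whose aligned genomes differ at very few positions are more likely to share a recent common source. The sequences below are short, made-up strings standing in for aligned genomes of several million nucleotides, and the isolate names are hypothetical. Real pipelines involve read mapping, variant calling, and phylogenetic analysis, and nothing here should be read as the method used by GenomeTrakr.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences carry different bases.

    Positions where either sequence has an ambiguous call (anything other
    than A, C, G, or T) are skipped rather than counted as differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1 for a, b in zip(seq_a, seq_b)
        if a in valid and b in valid and a != b
    )


def pairwise_distances(isolates):
    """Return {(name_1, name_2): SNP distance} for every pair of isolates."""
    return {
        (a, b): snp_distance(isolates[a], isolates[b])
        for a, b in combinations(sorted(isolates), 2)
    }


# Hypothetical aligned sequences for four isolates.
isolates = {
    "clinical_1": "ACGTACGTAC",
    "food_1": "ACGTACGTAC",
    "facility_1": "ACGTACGTAT",
    "unrelated_1": "ATGTTCGAAT",
}

distances = pairwise_distances(isolates)
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, d)
```

Here the clinical, food, and facility isolates differ from one another by at most one position, while the unrelated isolate differs from each of them by three or four. That pattern is the kind of signal that supports linking clinical cases to a food and a facility, although interpreting real SNP distances also requires epidemiological evidence.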
Big data has the ability to change the conventional strategy for prevention: Historically, food safety professionals have relied on food safety audits or inspections to determine if a food establishment was in compliance with food safety standards and regulations. However, at best, food safety audits are a snapshot of an establishment’s condition at a single point in time. For example, retail food-inspection results were not a good predictor of whether or not a food establishment would be linked to or cause an outbreak because of the low frequency of visits, which ranged from once a year to just a few times per year.
One nationwide retailer, Wal-Mart Stores Inc., is leveraging big data for food safety purposes. Wal-Mart uses handheld information technology, Bluetooth communication, and state-of-the-art temperature measuring devices to check the internal temperatures of every batch of rotisserie chickens cooked, ensuring a safe internal temperature. In a single period, health inspectors across the country checked rotisserie chicken cooking temperatures in Wal-Mart stores approximately ten times. During the same time frame, a third-party inspection firm checked rotisserie chicken cooking temperatures approximately 100 times, ten times as many checks as were made during regulatory inspections. However, by leveraging data obtained over this same period of time through an internal handheld self-check system, Wal-Mart recorded 1.4 million internal cooking temperatures of rotisserie chickens. This approach provided much greater insight than what could have been obtained through inspections or audits alone. Leveraging big data and the information it provides appears to be an innovative and effective way to enhance regulatory compliance and track compliance with desired standards.
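The difference in scale among the three monitoring approaches, and the kind of summary a self-check log makes possible, can be illustrated with a brief Python sketch. The check counts are the approximate figures given above. The temperature readings are invented, and the 165 °F threshold is an assumption for this sketch (it is the minimum internal temperature commonly recommended for poultry in the United States); the article does not state which threshold the retailer uses.

```python
from statistics import mean

# Approximate numbers of temperature checks over the same period (from the text).
checks = {"regulatory": 10, "third_party": 100, "self_check": 1_400_000}

# How many self-checks were recorded for every check by each outside party.
ratio_vs_regulatory = checks["self_check"] // checks["regulatory"]
ratio_vs_third_party = checks["self_check"] // checks["third_party"]

SAFE_TEMP_F = 165.0  # assumed threshold, not taken from the article


def summarize(readings, threshold=SAFE_TEMP_F):
    """Summarize a list of internal cooking temperatures in degrees F."""
    n = len(readings)
    n_below = sum(1 for r in readings if r < threshold)
    return {
        "n": n,
        "mean": mean(readings),
        "min": min(readings),
        "n_below": n_below,
        "compliance_rate": (n - n_below) / n,
    }


# Hypothetical readings from one store's handheld self-check log.
readings = [181.5, 176.0, 168.2, 164.1, 190.3]
summary = summarize(readings)

print(ratio_vs_regulatory, ratio_vs_third_party)
print(summary)
```

With roughly 140,000 self-checks for every regulatory check, summaries like this can be computed per store, per shift, or per piece of equipment, which is not feasible with a handful of inspection results.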
Big data tools such as metagenomics also increasingly offer new approaches to control and reduce microbial food spoilage. Food spoilage results from complex combinations of microbiological factors and physiochemical factors of the matrix, and the relationships between causative agents and physiochemical changes associated with spoilage are poorly defined. In some foods, such as fresh pork sausage, microbial growth (as measured by traditional methods) and spoilage are even temporally unlinked by as much as 30 days, leading to the suspicion that microbial growth plays only a small role in spoilage. Using large-scale parallel 16S rRNA-based pyrosequencing, researchers described in detail the dramatic changes in abundances of microbial species that occur over the shelf life of a refrigerated model sausage product, effectively resulting in multiple ecological successions of taxa with one wave of microorganisms rising to high abundances and displacing the previous wave (Benson et al., 2014). These successions occurred despite little change in the absolute abundance of the populations detected by traditional plating, illustrating the powerful resolution afforded by metagenomic analysis. The addition of antimicrobials changed the picture dramatically, yielding an essentially static community for the first 30 days of refrigeration, followed by an abrupt decline in relative abundances of nearly the entire population except for a single microorganism. Combining changes in microbiota composition with chemical signatures of the matrix over time further established high degrees of correlation between abundances of specific taxa and significant changes in the chemical composition of the sausage, providing a list of possible taxa as major causes of the onset of spoilage. 
Detailed trace-back analyses comparing the distributions of specific taxa from ingredients and final product also identified the ingredients, specifically the spice blend, as a major source of the most abundant taxon in the spoiled product. Importantly, it was the combination of high-resolution microbiota data and traditional plating data that enabled a full understanding of the ecosystem behavior to reduce the likelihood of spoilage, thus enhancing the quality of the product.
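The distinction between stable total counts and shifting community composition can be sketched with a small Python example. It converts per-taxon sequence read counts into relative abundances and reports the dominant taxon at each sampling day. The taxon names, sampling days, and counts are hypothetical and are not data from Benson et al. (2014); the totals are held constant on purpose to mimic a community whose overall size looks unchanged while its membership turns over.

```python
def relative_abundance(counts):
    """Convert {taxon: read count} into {taxon: fraction of all reads}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {taxon: n / total for taxon, n in counts.items()}


def dominant_taxon(counts):
    """Return the taxon with the highest relative abundance in a sample."""
    rel = relative_abundance(counts)
    return max(rel, key=rel.get)


# Hypothetical read counts per taxon, keyed by day of refrigerated storage.
samples = {
    0: {"taxon_A": 800, "taxon_B": 150, "taxon_C": 50},
    30: {"taxon_A": 300, "taxon_B": 600, "taxon_C": 100},
    60: {"taxon_A": 50, "taxon_B": 200, "taxon_C": 750},
}

totals = {day: sum(counts.values()) for day, counts in samples.items()}
succession = [(day, dominant_taxon(counts))
              for day, counts in sorted(samples.items())]

print(totals)
print(succession)
```

In this toy data set the total stays at 1,000 reads on every sampling day, yet the dominant taxon changes from one sampling day to the next. A total count alone would miss that succession, which is why composition data add resolution beyond what plating provides.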
Opportunities for big data applications in food safety and microbial spoilage beyond the ones detailed in this article appear to be abundant, yet food scientists and food microbiologists have used only a small amount of the relevant data generated and available. Hence, there is a considerable need for a comprehensive multidisciplinary approach across industry, government, and academia to develop the people, tools, and infrastructure to facilitate application of big data in food science. The challenges on this path are multifaceted and range from the rather mundane, such as switching from paper-based to electronic record-keeping schemes, to the complex, such as implementation of computational tools that can integrate and analyze structured and unstructured data (e.g., video, satellite images, audio) to reveal food-safety-relevant associations. An important next step will be to generate data showing that analyses of big data can also successfully predict future microbial food safety and quality outcomes. In addition, there is an urgent need to train future food safety professionals and food scientists to use and analyze big data sets and interact successfully with data scientists. The ultimate creation of a big data culture in the food industry can facilitate considerable advancements in food safety, food quality, and sustainability.
Laura K. Strawn, Eric W. Brown, Jairus R. D. David, Henk C. den Bakker, Pajau Vangay, Frank Yiannas, and Martin Wiedmann