Big data: Volume, Variety, Velocity, Veracity
October 30, 2017
Last week, a student asked me whether our new MSc module “Big Data Epidemiology” would be covering “machine learning” techniques and enthusiastically told me all about how they intend to apply such techniques to their own research. The short answer to the student’s question was “Yes, but only briefly”. The long answer requires some exploration into what we mean by “big data epidemiology” and consideration of what machine learning can (and perhaps more importantly, cannot) do for researchers.
Data is often considered “big data” if it can be described in terms of the “four V’s”: volume (there’s a lot of it), variety (the data takes lots of different forms), velocity (the data changes or is updated frequently) and veracity (the data may be of poor or unknown quality) [1]. On our module we use electronic healthcare records databases that can contain information about millions of patients, collected over several years (volume). These databases contain several different types of clinical information such as test results and continuous measurements, coded diagnoses and prescription data (variety). New data is being generated constantly in practice (velocity) but, because it is collected for clinical rather than research purposes, may be less accurate than data actively collected in traditional, bespoke studies (veracity). Primarily, we consider how this data can be used to conduct observational research studies, in a similar way to traditional cross-sectional, cohort and case-control studies.
With such huge volumes of data, it can be difficult for mere mortals to make sense of it. This is where machine learning comes in. Put simply, a computer (machine) is programmed to complete a specific task and should perform better at the task (learn) with more experience [2]. In the context of healthcare research, more experience is often gained when the program has access to new information from more patients. For example, a study published earlier this year by the University of Nottingham used machine learning techniques to predict risk of cardiovascular disease (CVD), given certain patient characteristics and risk factors [3]. This is called “supervised” learning since the researchers set out to predict a specific known outcome or “target” (CVD) [4]. Many traditional statistical modelling techniques, such as regression, are forms of supervised machine learning. In contrast, unsupervised machine learning seeks to identify patterns or associations within datasets without any input from humans about the specific outcome. Unsupervised learning may be more useful in areas where we have little prior knowledge about how patient factors are related or when new factors (e.g. biomarkers) emerge.
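To make the supervised/unsupervised distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the eight “patients”, their ages, blood pressures and outcomes are invented, and the two methods (a nearest-centroid classifier and plain k-means clustering) were chosen for brevity, not because they are what the Nottingham study used. The supervised model is given the CVD outcome to learn from; the unsupervised one only ever sees the measurements.

```python
from statistics import mean

# Entirely made-up toy data: (age in years, systolic blood pressure in mmHg).
patients = [(42, 118), (45, 122), (50, 125), (48, 120),
            (68, 150), (72, 158), (75, 162), (70, 155)]
# Made-up outcomes (1 = developed CVD). Only the supervised model sees these.
had_cvd = [0, 0, 0, 0, 1, 1, 1, 1]


def centroid(points):
    """Average position of a group of patients, one mean per measurement."""
    return tuple(mean(axis) for axis in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two patients."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_supervised(points, labels):
    """Nearest-centroid classifier: one centroid per known outcome."""
    return {
        label: centroid([p for p, y in zip(points, labels) if y == label])
        for label in set(labels)
    }


def predict(model, point):
    """Predict the outcome whose centroid lies closest to the new patient."""
    return min(model, key=lambda label: sq_dist(model[label], point))


def cluster_unsupervised(points, k=2, iterations=10):
    """Plain k-means: groups patients without ever seeing the outcome."""
    centres = list(points[:k])  # deterministic start: the first k patients
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each patient to the nearest current centre.
        assignment = [
            min(range(k), key=lambda j: sq_dist(centres[j], p)) for p in points
        ]
        # Move each centre to the mean of the patients assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centres[j] = centroid(members)
    return assignment


model = fit_supervised(patients, had_cvd)
groups = cluster_unsupervised(patients)
```

Calling `predict(model, (70, 152))` returns a predicted outcome for a new, hypothetical patient, because the model was told what to predict. `groups` contains only arbitrary cluster numbers: on this toy data the clusters happen to separate the younger and older patients, but it is left to the researcher to decide whether such a grouping means anything clinically.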
So why are we not covering these methods in detail on our course? Firstly, we have yet to see convincing examples of novel machine learning techniques performing better than traditional statistical methods. In the Nottingham study, prediction using logistic regression performed similarly to other methods. Secondly, without careful supervision, machine learning may produce results that, whilst interesting, are not clinically useful. One method used by the authors of the Nottingham study showed that patients who did not have a recorded body mass index were less likely to have cardiovascular disease. This is intuitively understandable given that most GPs will only record weight if they think it is a significant health issue, but isn’t much use for preventing CVD. Thirdly, the definition of machine learning implies, to some extent, that giving a computer ever more experience will solve all manner of problems. However, throwing more data into the mix is not a panacea and the old adage of “rubbish in, rubbish out” still applies.
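The body mass index finding can be illustrated with a toy calculation. The sketch below is not taken from the Nottingham study: the counts are invented, and it simply assumes that GPs are more likely to record BMI for patients they are already concerned about. Under that assumption, “BMI not recorded” looks protective even though nothing about the missing value itself prevents disease.

```python
# Made-up records: (bmi, had_cvd). bmi is None when it was never recorded.
records = (
    [(31.0, 1)] * 30 + [(31.0, 0)] * 170    # BMI recorded: 30 of 200 with CVD
    + [(None, 1)] * 10 + [(None, 0)] * 190  # BMI missing: 10 of 200 with CVD
)


def risk(rows):
    """Proportion of patients in rows who had CVD."""
    return sum(cvd for _, cvd in rows) / len(rows)


recorded = [row for row in records if row[0] is not None]
missing = [row for row in records if row[0] is None]

# A risk ratio below 1 means missing BMI is associated with lower CVD risk.
risk_ratio = risk(missing) / risk(recorded)
print(f"Risk with BMI recorded: {risk(recorded):.2f}")
print(f"Risk with BMI missing:  {risk(missing):.2f}")
print(f"Risk ratio (missing vs recorded): {risk_ratio:.2f}")
```

A model fed a “BMI missing” indicator would happily use this association to improve its predictions. It reflects recording behaviour rather than biology, which is why results like this need clinical interpretation before anyone acts on them.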
A lot of what is covered on the “Big Data Epidemiology” module aims to minimise the “rubbish in” in the first instance. This may be done through careful or specialist study design that limits confounding, understanding how to identify and extract data with minimal risk of error and bias, and recognition of when linked data sources (e.g. hospital records) may be required in addition to primary care records. We’re ready to expand the machine learning component of the course as the methods develop, but for now we’ll be focussing on the initial steps of study design, data extraction and traditional methodologies.
1. IBM Big Data Hub. The Four V’s of Big Data. http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg. Accessed October 24, 2017.
2. Meyfroidt G, Guiza F, Ramon J, Bruynooghe M. Machine learning techniques to examine large patient databases. Best Pract Res Clin Anaesthesiol. 2009;23(1):127-143. doi:10.1016/j.bpa.2008.09.003.
3. Weng SF, Reps J, Kai J, Garibaldi JM, Qureshi N. Can Machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS One. 2017;12(4):1-14. doi:10.1371/journal.pone.0174944.
4. Deo RC. Machine Learning in Medicine. Circulation. 2015;132(20). http://circ.ahajournals.org/content/132/20/1920. Accessed January 10, 2017.
Sarah Stevens, Deputy Module Coordinator, Big Data Epidemiology
Statistician/Epidemiologist and member of the Oxford CPRD group
Find out more about Big Data Epidemiology and the new MSc in EBHC Medical Statistics from CEBM and the Department of Continuing Education, University of Oxford