510(k)
All device manufacturers must register with the Food and Drug Administration (FDA). Per Section 510(k) of the Food, Drug and Cosmetic Act, device manufacturers must notify the FDA of their intent to market a medical device at least 90 days in advance. This is known as Premarket Notification, also called PMN or 510(k). Compare with premarket approval (PMA). (http://www.fda.gov/MedicalDevices/ProductsandMedicalProcedures/DeviceApprovalsandClearances/510kClearances/)

Annotation
A label or piece of information associated with a sample. An annotation may refer to a subset of the sample (e.g., a region of interest in an image) or the entirety of the sample.

Backpropagation
An efficient algorithm for computing the gradient (derivative) of a neural network’s cost function with respect to model weights (i.e., “how much is each weight in the model responsible for the error in the output?”). These gradients are then used to update the model and reduce its error. Currently, backpropagation is the de-facto standard by which neural networks are trained.
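
As a minimal sketch of the chain-rule computation, the Python below uses a hypothetical two-weight network (one tanh hidden unit and a squared-error cost). The network shape, function names, and learning rate are illustrative assumptions, not a description of any particular framework.

```python
import math

def forward(w1, w2, x):
    """Tiny illustrative network: x -> tanh(w1 * x) -> w2 * h."""
    h = math.tanh(w1 * x)  # hidden activation
    y = w2 * h             # network output
    return h, y

def loss(w1, w2, x, target):
    """Squared-error cost for a single sample."""
    _, y = forward(w1, w2, x)
    return (y - target) ** 2

def backprop(w1, w2, x, target):
    """Gradients of the cost with respect to w1 and w2 via the chain rule."""
    h, y = forward(w1, w2, x)
    dloss_dy = 2.0 * (y - target)           # derivative of (y - target)^2
    grad_w2 = dloss_dy * h                  # because y = w2 * h
    dloss_dh = dloss_dy * w2                # propagate the error back to h
    grad_w1 = dloss_dh * (1.0 - h * h) * x  # tanh'(z) = 1 - tanh(z)^2
    return grad_w1, grad_w2

def gradient_step(w1, w2, x, target, learning_rate=0.1):
    """One gradient-descent update using the backpropagated gradients."""
    grad_w1, grad_w2 = backprop(w1, w2, x, target)
    return w1 - learning_rate * grad_w1, w2 - learning_rate * grad_w2
```

Comparing the returned gradients against a finite-difference estimate of the cost is a common sanity check for code like this.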

Bagging
See Bootstrapped Aggregation.

Bayesian Inference
A method of estimating the probability of a hypothesis by incorporating observed evidence and a prior estimate of the hypothesis’s likelihood.
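
The update from prior to posterior can be sketched with Bayes' rule. The function name and the numbers below (1% prior, 90% chance of the evidence when the hypothesis is true, 5% when it is false) are hypothetical values chosen only for illustration.

```python
def posterior_probability(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    # Total probability of observing the evidence under both possibilities.
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1.0 - prior))
    return p_evidence_given_h * prior / p_evidence

# Hypothetical example: a finding that is rare a priori, observed with a
# fairly reliable but imperfect test.
posterior = posterior_probability(
    prior=0.01, p_evidence_given_h=0.90, p_evidence_given_not_h=0.05
)
```

With these assumed inputs the posterior is much larger than the prior but still well below 50%, which shows how strongly a low prior tempers positive evidence.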

Bootstrapping
The process of repeated resampling, with replacement, from a set of data to generate a sample distribution. The sample distribution can be used to estimate a property of interest as well as the estimate’s uncertainty.
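
A minimal sketch of bootstrapping the mean of a small dataset follows. The function name, the number of resamples, and the example data are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_mean(data, n_resamples=2000, seed=0):
    """Estimate the mean and its standard error by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    resampled_means = []
    for _ in range(n_resamples):
        # Draw len(data) values from the data, with replacement.
        resample = rng.choices(data, k=len(data))
        resampled_means.append(statistics.fmean(resample))
    estimate = statistics.fmean(resampled_means)
    std_error = statistics.stdev(resampled_means)  # spread of the sample distribution
    return estimate, std_error

estimate, std_error = bootstrap_mean([2, 4, 4, 5, 7, 9])
```

The spread of the resampled means serves as the uncertainty estimate mentioned above.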

Bootstrapped Aggregation
Bootstrap aggregating, also called bagging, is an ensembling technique in which multiple copies of a model are trained on bootstrapped samples of the data. The algorithm’s prediction is selected to be the consensus among the various models (average for regression, voting for classification). Bootstrap aggregating is commonly employed to reduce overfitting.

Boosting
Boosting is a machine learning technique that iteratively trains an ensemble of weak learners to form a strong one. When training a new model to add to the ensemble, the training samples are weighted according to the accuracy of the ensemble’s current predictions, with inaccurately predicted samples weighted more heavily. This encourages the new model to improve upon the ensemble’s weaknesses. (https://en.wikipedia.org/wiki/Boosting_(machine_learning))
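
The reweighting step can be sketched with the AdaBoost update, which is one specific boosting variant chosen here as an assumption; other boosting methods weight samples differently. Labels and predictions are assumed to be -1 or +1, and the function name is hypothetical.

```python
import math

def adaboost_reweight(weights, labels, predictions):
    """One AdaBoost round: return the learner's vote weight and new sample weights.

    Assumes the weights sum to 1 and the weighted error is strictly
    between 0 and 1.
    """
    # Weighted error of the current weak learner.
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples (y * p == -1) are scaled up, correct ones down.
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Four equally weighted samples; only the last one is misclassified.
alpha, new_weights = adaboost_reweight(
    weights=[0.25, 0.25, 0.25, 0.25],
    labels=[1, 1, -1, -1],
    predictions=[1, 1, -1, 1],
)
```

After the update, the misclassified sample carries more weight than any correctly classified sample, so the next weak learner is pushed to focus on it.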

CADe
In radiology, computer-aided detection (CADe) systems assist in the interpretation of medical images, typically by marking conspicuous structures and areas in an image. The computer output is used as a second opinion alongside the radiologist’s interpretation. (https://en.wikipedia.org/wiki/Computer-aided_diagnosis)

CADx
In radiology, computer-aided diagnosis (CADx) systems assist in the interpretation of medical images by evaluating conspicuous structures and providing a diagnosis.
(https://en.wikipedia.org/wiki/Computer-aided_diagnosis)

CART
An algorithm for constructing decision trees from data.

Class II Device
Class II devices are higher risk devices than Class I and require greater regulatory controls to provide reasonable assurance of the device’s safety and effectiveness. As an example, if machine learning algorithms were used as clinical decision support for the radiologist as opposed to providing a primary interpretation, the algorithms would require Class II clearance. See 510(k) and PMA.
(http://www.fda.gov/AboutFDA/Transparency/Basics/ucm194438.htm)

Class III Device
Class III devices are generally the highest risk devices and are therefore subject to the highest level of regulatory control. Class III devices must typically be approved by the FDA before they are marketed. As an example, if machine learning algorithms were used for primary imaging diagnosis without human intervention, the algorithms would require Class III approval. See 510(k) and PMA.
(http://www.fda.gov/AboutFDA/Transparency/Basics/ucm194438.htm)

Classification
A particular machine learning task in which the algorithm is asked to categorize the input using one of a discrete set of classes.

Cloud Computing
The use of a remote computing infrastructure, commonly offered as a paid service by a third-party vendor, to achieve a business goal. This lies in contrast to on-premises solutions where software is run on local machines.

Clustering
A family of unsupervised learning algorithms in which samples with similar features are grouped together. This contrasts with classification, where the groups are defined by user-supplied labels.

Convolution
A mathematical operation characterized by the repeated application of a function to all subsets or windows of a sample of interest. In imaging, convolutions are commonly used to apply a small patch (“kernel”) to all locations of a larger image, allowing for the development of location-invariant feature detectors such as edge detectors.
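
A minimal sketch of sliding a kernel over an image is shown below. It assumes images and kernels are nested Python lists, computes only positions where the kernel fits entirely inside the image, and does not flip the kernel (the convention typically used in neural networks, strictly a cross-correlation). All names are illustrative.

```python
def convolve2d_valid(image, kernel):
    """Apply the kernel at every position where it fits inside the image."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the image window under the kernel.
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# A dark-to-bright vertical edge and a simple horizontal-difference kernel.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [[-1, 1]]
response = convolve2d_valid(image, edge_kernel)
```

The same kernel responds wherever the edge occurs in the image, which is the location invariance described above.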

Convolutional Neural Network
A type of neural network which leverages convolutions to learn location-independent feature descriptors.

Data Augmentation
Data augmentation is a technique for reducing overfitting by applying functionally unimportant transformations to the training data to encourage the learning algorithm to form decision boundaries using functionally relevant features.
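
As a small sketch, the functions below generate flipped copies of an image stored as a nested list. Treating flips as functionally unimportant is an assumption that depends on the task (for example, it may not hold where left/right laterality matters), and the function names are hypothetical.

```python
def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Reverse the order of the rows."""
    return [row[:] for row in image[::-1]]

def augment(image):
    """Return the original image plus its flipped variants."""
    return [image, horizontal_flip(image), vertical_flip(image)]

augmented = augment([[1, 2], [3, 4]])
```

Each augmented copy keeps the label of the original sample, so the training set grows without new annotation work.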

Data Science
The science of developing empirical models, from structured and unstructured data, to make meaningful predictions that provide insights into processes of interest.

Decision Tree
A model characterized by a series of nested decision boundaries. By applying a series of criteria, the model predicts an appropriate output value.

Deep Neural Network
A deep neural network is a network with a large number of layers. The study of these models is known as deep learning.

Feature
An individual measurable element or characteristic of a phenomenon being observed, for use in a machine learning algorithm for pattern recognition, classification, or regression. Example medical imaging features used in machine learning algorithms include lesion size, shape, gray value, density, boundary description, texture, and location.

Graphics Processing Unit (GPU)
A specialized computer chip originally designed to render graphics to a computer monitor. Because rendering operations are highly parallelizable and individually simple, a GPU contains a large array of simple cores, whereas a CPU contains fewer but more complex cores.

Hyperparameter
A parameter which is not learned by the model but instead specified by the developer prior to training an algorithm. Often, a range of hyperparameter values must be assessed to determine which provide the best level of performance. This is accomplished by repeatedly training the algorithm with different hyperparameter configurations and comparing their performance on an independent set of data (the validation set).

Imputation
The process of replacing missing data with substituted or estimated values. This may help reduce the introduction of bias caused by missing data.
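
One simple strategy is mean imputation, sketched below. Representing missing values as None and choosing the mean as the substitute are assumptions for illustration; other substitutes (median, model-based estimates) are equally valid forms of imputation.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill_value = sum(observed) / len(observed)
    return [fill_value if v is None else v for v in values]

completed = impute_mean([1.0, None, 3.0])
```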

k-Means Clustering
An unsupervised learning algorithm in which samples are classified into k groups based on the similarity of their features (rather than any user-specified label). Class assignment is based on proximity to the mean of the group’s features. The number of classes is a hyperparameter that must be set by the developer.
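
The alternating assign-and-update procedure can be sketched as follows. Passing the initial centroids explicitly and running a fixed number of iterations are simplifying assumptions; practical implementations usually pick starting centroids randomly and stop on convergence.

```python
import math

def k_means(points, initial_centroids, n_iterations=10):
    """Cluster points into len(initial_centroids) groups."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(n_iterations):
        # Assignment step: each point joins the group with the nearest mean.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty group's centroid
        centroids = new_centroids
    return labels, centroids

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids = k_means(points, initial_centroids=[(0, 0), (10, 10)])
```

Here k is fixed at 2 by the number of starting centroids supplied, which is how the hyperparameter enters this sketch.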

k-Nearest Neighbor
An algorithm which bases its prediction on the k nearest training samples in the feature space. When used for classification, the algorithm outputs the most common label amongst the k nearest training samples. In regression problems, the average of the nearest training samples is output.
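
A minimal sketch of both uses follows, assuming Euclidean distance as the measure of nearness (other distance measures are possible). The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def nearest_neighbors(train_points, train_targets, query, k):
    """Return the targets of the k training samples closest to the query."""
    ranked = sorted(
        zip(train_points, train_targets),
        key=lambda pair: math.dist(pair[0], query),
    )
    return [target for _, target in ranked[:k]]

def knn_classify(train_points, train_labels, query, k=3):
    """Most common label among the k nearest training samples."""
    votes = Counter(nearest_neighbors(train_points, train_labels, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_points, train_values, query, k=3):
    """Average value of the k nearest training samples."""
    neighbors = nearest_neighbors(train_points, train_values, query, k)
    return sum(neighbors) / len(neighbors)

train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
predicted = knn_classify(train_points, train_labels, query=(4.5, 5.0), k=3)
```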

Linear Discriminant Analysis
A classification algorithm which linearly separates two classes of data by maximizing inter-class variability while minimizing intra-class variability.

Linear Regression
A method which predicts a continuous, unbounded value of interest by computing a linear combination of one or more explanatory variables.
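
For a single explanatory variable, the least-squares fit has a closed form, sketched below. Fitting by ordinary least squares is an assumption (it is the most common choice), and the function name is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
prediction = slope * 4 + intercept
```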

Logistic Regression
A method which predicts a continuous value of interest bounded between 0 and 1 by passing a linear combination of one or more explanatory variables through the logistic (sigmoid) function. Logistic regression is often used in classification, where the output value is treated as a probability.
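
A sketch of the prediction step is given below: the linear combination is squashed into the range 0 to 1 by the logistic (sigmoid) function. The weights, bias, and 0.5 decision threshold are hypothetical values; in practice the weights are learned from training data.

```python
import math

def predict_probability(weights, bias, features):
    """Logistic function applied to a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify(weights, bias, features, threshold=0.5):
    """Treat the output as a probability and threshold it into a class."""
    return 1 if predict_probability(weights, bias, features) >= threshold else 0

probability = predict_probability(weights=[1.0, 0.5], bias=0.0, features=[1.0, 2.0])
```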

Naïve Bayes
A relatively simple classification technique which, by assuming statistical independence between features, uses the laws of conditional probability to provide an efficient means of classification. This algorithm was used in some of the earliest spam filters.

Natural Language Processing (NLP)
An area of machine learning focused on the parsing and understanding of written and spoken language.

Neural Networks
A class of machine learning algorithms that consist of a series of nested, non-linear functions. Unlike traditional regression and classification algorithms, neural networks continue to demonstrate increasing performance with vast quantities of data. As a result, given a sufficiently large training set, a well-trained neural network can contain upwards of tens of millions of parameters.

Overfitting
Error introduced into a model by fitting to a dataset of insufficient size to meaningfully distinguish signal from noise. Overfitting results in the model learning noise in the training set rather than the underlying signal of interest.

Precision
The percentage of positively labeled samples which are true positives.

PMA
Premarket approval (PMA) refers to the FDA approval that a device manufacturer applies for and receives prior to marketing a medical device. Compare with premarket notification (PMN) or 510(k).

Principal Component Analysis (PCA)
Principal component analysis is a dimensionality reduction technique that creates a new, smaller set of features which maximally preserve the variation between samples.

Random Forests
Random forests are a method for learning an ensemble of decision trees. Each tree is trained on a bootstrapped sample of the training data. Additionally, each split of the decision tree is learned using a random sample of the input features.

Recall
See sensitivity.

Receiver Operating Characteristic Curve
A receiver operating characteristic curve (ROC curve) shows how the performance of a classification algorithm changes as the threshold for labeling a sample positive is varied. It is a plot of the false positive rate (1 – specificity) on the x-axis vs. sensitivity on the y-axis.
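
The points of such a curve can be computed by sweeping the threshold over the observed scores, as in the sketch below. It assumes binary labels (1 for positive, 0 for negative), at least one sample of each class, and that a sample is labeled positive when its score is at or above the threshold; the function name is hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) pairs for each threshold."""
    n_positive = sum(labels)
    n_negative = len(labels) - n_positive
    points = [(0.0, 0.0)]  # threshold above every score: nothing labeled positive
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((false_pos / n_negative, true_pos / n_positive))
    return points

curve = roc_points(scores=[0.9, 0.8, 0.3, 0.1], labels=[1, 1, 0, 0])
```

In this toy example the scores separate the classes perfectly, so the curve reaches full sensitivity before any false positives appear.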

Segmentation
A class of machine learning algorithms which attempt to individually classify every pixel or voxel in an image rather than the entire image itself. For instance, while a classification algorithm would label an image as containing a cancerous tumor, a segmentation algorithm would highlight which pixels/voxels form the cancerous tumor.

Semi-Supervised Learning
Semi-supervised learning represents a class of machine learning algorithms where a combination of labeled and unlabeled data is used for algorithm training.

Sensitivity
The percentage of positive samples which are correctly labeled.

Specificity
The percentage of negative samples which are correctly labeled.
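
Sensitivity and specificity, along with precision, can all be computed from the four confusion-matrix counts, as in this sketch. It assumes binary labels (1 for positive, 0 for negative) and returns fractions rather than percentages; the function names are illustrative.

```python
def confusion_counts(labels, predictions):
    """Count true positives, false positives, true negatives, false negatives."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Fraction of positive samples that are correctly labeled (recall)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of negative samples that are correctly labeled."""
    return tn / (tn + fp)

def precision(tp, fp):
    """Fraction of positively labeled samples that are true positives."""
    return tp / (tp + fp)

tp, fp, tn, fn = confusion_counts(
    labels=[1, 1, 1, 0, 0, 0, 0, 0],
    predictions=[1, 1, 0, 1, 0, 0, 0, 0],
)
```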

Statistical Inference
The process of inferring properties of an underlying distribution by analysis of data.

Supervised Learning
Supervised learning represents a class of machine learning algorithms where only labeled data is used for algorithm training.

Support Vector Machine
A Support Vector Machine (SVM) is a supervised learning algorithm which separates data into two or more classes using the decision boundary that leaves the largest margin between the classes.

Test Set
A test set is a set of data, previously unseen by a trained machine learning algorithm, that is used to assess model performance (accuracy, sensitivity, specificity, etc.).

Training Set
A training set is a set of data used to train individual parameters in a given model.

Transfer Learning
Transfer learning is the process of using a neural network trained for one task on a second, different task. Typically, the original network is used as a “starting point” for training on the new dataset. This technique allows one to train more complex algorithms than would typically be possible on a given dataset.

Unsupervised Learning
Unsupervised learning represents a class of machine learning algorithms that draw inferences from unlabeled data. A common example is clustering, where samples are bunched into similar groups which may or may not share a semantic meaning.

Validation Set
A validation set is a set of data, separate from the training set, used to select the best-trained candidate algorithm. This selection must be made using the validation set, rather than the training set or the test set, to reduce the risk of overfitting.
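
A sketch of dividing one dataset into training, validation, and test sets is shown below. The 70/15/15 proportions, the fixed random seed, and the function name are assumptions for illustration; appropriate proportions depend on the amount of data available.

```python
import random

def split_dataset(samples, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle the samples and split them into three disjoint sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever remains
    return training_set, validation_set, test_set

training_set, validation_set, test_set = split_dataset(range(100))
```

Keeping the three sets disjoint is the point of the exercise: a sample that appears in more than one set makes the later performance estimates optimistic.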


General Overview References
Erickson BJ, Korfiatis P, Akkus Z, and Kline TL. Machine Learning for Medical Imaging. Radiographics 37(2), 2017.

Gillies RJ, Kinahan PE and Hricak H. Radiomics: Images Are More than Pictures, They Are Data. Radiology 278(2):563-577. February 2016.

Greenspan H, Van Ginneken B, Summers RM. Guest Editorial Deep Learning in Medical Imaging: Overview and Future Promise of an Exciting New Technique. IEEE Trans on Medical Imaging 35(5):1153-1159, May 2016.

Kohli M, Prevedello LM, Filice RW, and Geis JR. Implementing Machine Learning in Radiology Practice and Research. AJR 208, 2017.

Kourou K, Exarchos TP, Exarchos KP, et al. Machine Learning Applications in Cancer Prognosis and Prediction. Computational and Structural Biotechnology Journal. 13:8-17, 2015.

General On-Line Videos
Intro to Machine Learning: https://www.youtube.com/watch?v=wjTJVhmu1JM

Open Source Tools
Caffe2
https://caffe2.ai

Spark MLlib
https://spark.apache.org/mllib

PyTorch
https://pytorch.org

R and RStudio
https://cran.r-project.org
https://www.rstudio.com

Scikit-Learn
https://scikit-learn.org

TensorFlow
https://www.tensorflow.org

Torch
http://torch.ch