3.3. Data acquisition and pre-processing
Last updated
Last updated
Data acquisition was performed via R and bash scripts, either manually (for ArrayExpress data) or through Bioconductor project packages (e.g., GEOquery, (DAVIS; MELTZER, 2007)) for data from GEO.
Once the raw data and their respective metadata were obtained, with clinical and experimental characteristics, control samples were normalized using a "barcode" approach, called Universal exPression Code (PICCOLO et al., 2013)UPC, (PICCOLO et al., 2013 ). The UPC is a method applied individually to each sample and estimates the probability of the probes being expressed (active), based on the hypothesis that expression values originate from two populations: inactive genes, where the expression measure represents the variation background, and active genes, consisting of background variation plus a signal. As can be seen in the graph below (Figure 3.3), after a sample is normalized by UPC, the values of its probes can vary between 0 and 1, where only a small portion of the probes is active (expression value > 0.5).
The presence of batch effect and identification of possible outlier samples were evaluated by principal component analysis (PCA), which allows evaluating the relative distances of the samples in the plane of their principal components. With the PCA it is possible to detect outliers, that is, samples that have a high distance from the others, and also clusters of samples that possibly reflect technical variables, such as the date of the experiment or the scanner used. Samples with a large amount of variation concerning the other samples, and with variation not explained by phenotypic variables, were removed.