IRON normalization normalizes all samples against a common reference chip. This chip can either be an existing chip, or an artificial chip generated computationally. The "findmedian" and "pairgen" commands can be used to identify or create a reference chip. "iron" and "iron_generic" are used to normalize CEL files or text spreadsheets of intensities. Run each program with the --help option for a full list of available options. I recommend using findmedian to identify the least-distant "median" sample, rather than computationally creating an artificial one. *DO NOT* use log-transformed intensities as input to any of these programs unless you really really mean to do so, since all data is log-transformed internally within the software. findmedian can now input pre-logged data and treat it appropriately with the --nolog2 option, but iron_generic and pairgen still expect the input to be unlogged. *** Example workflows: *** # normalize CEL files; you can find the median sample in findmedian.txt # findmedian -c cdf_file_dir/ *.CEL > findmedian.txt iron -c cdf_file_dir/ --norm-iron="median_sample.CEL" *.CEL -o output.txt # normalize Agilent microarray US*.txt files # # agilent_to_spreadsheet.pl will perform local backgrouns subtraction, # zero out detected bad spots, and shift intensities so that the minimum # output intensity is 1.0 (this shift is to make sure that there are no # negative or zero intensities due to local background subtraction). # # iron_generic will then use missing-data-aware RMA background subtraction, # treating zeroed out fields (bad spots) as missing, to perform # non-specific background correction. # # Between RMA background correction and IRON normalization, the intensity # shifts introduced by agilent_to_spreadsheet.pl will be corrected for. # agilent_to_spreadsheet.pl US*.txt > input_data.txt findmedian --spreadsheet input_data.txt > findmedian.txt iron_generic --norm-iron="median_sample" input_data.txt -o output.txt # normalize RNA-seq data, if we aren't observing any non-linear curvature # in sample vs. sample log/log plots # findmedian --spreadsheet input_data.txt > findmedian.txt iron_generic --rnaseq --norm-iron="median_sample" input_data.txt -o output.txt # normalize RNA-seq data that exhibits similar non-linear intensity- # dependent behavior as microarray data # # note that \ is standard UNIX shell syntax for wrapping a line # findmedian --spreadsheet input_data.txt > findmedian.txt iron_generic --rnaseq --iron-non-linear --iron-weight-exponent=4 \ --norm-iron="median_sample" input_data.txt -o output.txt # normalize proteomics data # findmedian --spreadsheet --pearson input_data.txt > findmedian.txt iron_generic --proteomics --norm-iron="median_sample" input_data.txt -o output.txt # metabolomics: see discussion towards the end of this file # findmedian --spreadsheet --pearson --probeset-exclusions=unidentified_rowids.txt \ --probeset-spikeins=spikein_rowids.txt \ pos_ion_mode_input_data.txt > findmedian_pos.txt iron_generic --proteomics --iron-exclusions=unidentified_rowids.txt \ --iron-spikeins=spikein_rowids.txt \ --norm-iron="median_sample" pos_ion_mode_input_data.txt -o output.txt *** iron_generic meta-flags: *** The following meta-flags set several options at once: --microarray (default, if no meta-flags are specified) --bg-rma --log2 --iron-non-linear --iron-weight-exponent=4 --iron-fit-only-y --iron-no-condense-training --floor-none --iron-check-saturation --iron-ignore-low --rnaseq --bg-none --unlog --iron-untilt --iron-weight-exponent=0 --iron-fit-only-y --iron-condense-training --floor-none --iron-no-check-saturation --iron-no-ignore-low --proteomics --bg-none --unlog --iron-global-scaling --iron-weight-exponent=0 --iron-fit-both-x-y --iron-condense-training --floor-none --iron-no-check-saturation --iron-ignore-low *** Other common example use cases: *** Find median CEL file at the probe level, when memory usage is not a concern: findmedian -c cdf_file_dir/ *.CEL > findmedian.txt Find median CEL file at the probeset level, when datasets are VERY large: findmedian -c cdf_file_dir/ --probesets *.CEL > findmedian.txt Find median sample in a spreadsheet of unlogged intensity data: findmedian --spreadsheet input_data.txt > findmedian.txt Find median sample in a spreadsheet of unlogged proteomics data, where we do not observe any negative intensity-dependent effects (aside from increased missing data, which is already dealt with appropriately). Pearson correlation will give better results than RMSD in these cases: findmedian --spreadsheet --pearson input_data.txt > findmedian.txt Extract the median sample from the findmedian.txt output under UNIX/Linux: cat findmedian.txt | grep "^Median CEL:" | cut -f 4 As of libaffy v2.2, you can also simply use: tail -1 findmedian.txt Create a CEL file containing the median intensity for each probeset: pairgen -c cdf_file_dir/ --median *.CEL -o median.CEL Create a CEL file containing the average intensity for each probeset: pairgen -c cdf_file_dir/ --average *.CEL -o average.CEL NOTE -- pairgen will generate a CEL file that works with libaffy, but may not be valid enough to work with other software packages, such as Affymetrix Power Tools or Bioconductor. NOTE -- "median_sample.CEL" is enclosed in double-quotes in the examples below in case it contains special characters, such as spaces, that have special meaning on the UNIX command line. Normalize CEL files against the reference chip: iron -c cdf_file_dir/ --norm-iron="median_sample.CEL" *.CEL -o output.txt Normalize unlogged sample intensities in a spreadsheet: iron_generic --norm-iron="median_sample" input_data.txt *.CEL -o output.txt NOTE -- mixing and matching CEL files and spreadsheet data is currently unsupported. Thus, for iron_generic, pairgen cannot be used, and findmedian should be used to identify the median sample within the spreadsheet. Proteomics data exhibits a very different density distribution than microarray data, and generally only exhibits global scaling shifts in intensities between samples. Non-specific RNA binding background correction should not be applied to proteomics data or RNA-seq data, since this cannot occur in these assays. Intensities should be output unlogged so that missing values (zeroes) will be preserved as zeroes, rather than output as -nan. '\' indicates wrapping the command line to the next line: iron_generic --proteomics --norm-iron="median_sample" input_data.txt -o \ output.txt As of libaffy v2.2, use of the --log2 flag will output original zeroes as blanks (originally negative values are still left as NaNs, so it is more obvious that unexpected behavior occurred). Thus, it should be generally safe to use the --log2 flag with proteomics and RNA-seq data now (we don't expect to encounter any negative values). *** Special considerations for metabolomics data, *** *** rows/probesets to exclude from various analyses: *** In general, settings that are appropriate for proteomics are also appropriate for metabolomics. However, there are some additional complexities that should be taken into account for metabolomics data, such as spike-ins and unidentified peaks. Spike-ins are unrelated to the biological content of the sample, and thus should be excluded from all calculations (including findmedian and normalization). We have observed that unidentified peaks, at least in metabolomics data, tend to be enriched for background contaminants, and thus would skew the normalization towards normalizing against the background rather than the biological signal. Thus, unidentified peaks should generally be ignored during normalization training, but still normalized along with all of the other non-spikein peaks. Once spikein and unidentified metabolites have been identified (externally to iron), IRON can read in lists of row identifiers (one identifier per row) to handle them appropriately. NOTE -- positive (+) and negative (-) ion mode data should be processed separately, since they are separate injections!! findmedian --spreadsheet --pearson --probeset-exclusions="unidentified_rowids.txt" \ --probeset-spikeins="spikein_rowids.txt" \ pos_ion_mode_input_data.txt > findmedian_pos.txt iron_generic --proteomics --iron-exclusions="unidentified_rowids.txt" \ --iron-spikeins="spikein_rowids.txt" \ --norm-iron="median_sample" pos_ion_mode_input_data.txt -o output.txt Although originally implemented to address issues with metabolomics data, --iron-exclusions and --iron-spikeins have been extended to be specified at the probeset level with the CEL file based "iron" program as well. The copy of each probe associated with the excluded/spikein probesets will be dealt with appropriately at the probe level, as well as at the probeset level. See additional notes on probes that belong to multiple probesets, as well as probes found in both excluded and spikeins probesets, as discussed in libaffy/mas5/iron_norm.c: affy_pairwise_normalization().