libaffy

Affymetrix has developed a series of gene chip microarrays that have become increasingly popular in gene expression studies. In order to facilitate the use of these microarrays at the Moffitt Cancer Center, we have developed a C software library for accessing the various file types generated by their software and image capture devices.

libaffy consists of a set of routines for accessing the various file types (DAT, CEL, CDF) and post-processing them using a variety of algorithms, including MAS5.0, RMA, and IRON. The purpose of this library is to provide a codebase upon which many different experimental or practical projects can be based.

Releases
Binaries
Documentation
Background
Contacts
Publications

Releases

*** WARNING *** versions prior to 2.3.0 generated sub-optimally normalized output for iron_generic --rnaseq on "per million" data, such as CPM/FPKM/TPM.

*** WARNING *** versions prior to 2.1.5 generated incorrectly normalized output for exon arrays in which a probe can be a member of multiple probesets

libaffy-2.3.0.tar.gz [Latest release 2023-09-25]
Bug fixes:
- CR/LF/CRLF end-of-line differences between Mac/Unix/PC should no longer cause any file input problems
  (open all files as binary, replace all remaining calls to fgets() and similar functions with EOL-safe functions).
  This was previously resulting in program aborts with some binary CEL files when compiled and run on ARM Macs.
- iron_generic: --rnaseq, --proteomics, --microarray (default) set new flags, below, as appropriate.
- iron/iron_generic: fix crash on completely or mostly empty/dark samples.
  IRON was crashing when no good data points were found in common between a sample and the reference sample.
- agilent_to_spreadsheet.pl: change probe ControlType check to treat "0" as non-control.
  This should fix some ArrayExpress files that were returning no valid data before.
- Various minor edits so that it builds on Ubuntu.
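The EOL fix above (open files as binary, then read lines with EOL-safe functions) can be sketched as follows. This is a minimal illustration, not libaffy's actual code: the function name read_line_any_eol and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of the libaffy API.
 * Reads one line from fp into buf (at most size-1 characters), treating
 * "\n", "\r", and "\r\n" all as end-of-line.  The terminator is not stored.
 * Returns the number of characters stored, or -1 at EOF with nothing read. */
long read_line_any_eol(FILE *fp, char *buf, size_t size)
{
    size_t n = 0;
    int c;

    assert(fp != NULL && buf != NULL && size > 0);

    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            break;
        if (c == '\r') {
            /* Swallow the '\n' of a CRLF pair; push back anything else. */
            int next = getc(fp);
            if (next != '\n' && next != EOF)
                ungetc(next, fp);
            buf[n] = '\0';
            return (long) n;
        }
        if (n + 1 < size)
            buf[n++] = (char) c;   /* over-long lines are truncated */
    }

    buf[n] = '\0';
    if (c == EOF && n == 0)
        return -1;
    return (long) n;
}
```

The file should be opened with fopen(path, "rb") so that the C runtime does not translate line endings before this function sees them.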
New options:
- iron/iron_generic: new --iron-no-ignore-low flag; exclude values <= 0.00001 from IRON training, instead of <= 1.0.
  Needed for proper normalization of "per-million" data, such as RNA-Seq CPM/FPKM/TPM,
  or any data where we expect good data to lie between near-zero and 1.
  RNA-Seq counts data, where all "good" values are assumed to be >= 1,
  should now use: --rnaseq --iron-ignore-low to re-enable the previous behavior.
- iron/iron_generic: new --iron-no-check-saturated flag; disables automatic 16-bit microarray scanner saturation detection.
  16-bit saturation detection (ignore likely saturated values during IRON training) was unlikely to have triggered
  on any real-world RNA-Seq or mass spectrometry datasets, since the conditions are highly unlikely to occur,
  but it is best to disable it for --rnaseq and --proteomics out of an abundance of caution.
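To illustrate why the low-value cutoff matters for "per-million" data, here is a rough sketch of training-point selection under the two cutoffs named above (<= 1.0 versus <= 0.00001). The function and macro names are hypothetical, and requiring both the sample and the reference value to exceed the cutoff is an assumption of this example, not a statement about the IRON source.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative cutoffs taken from the release note above. */
#define CUTOFF_IGNORE_LOW     1.0      /* --iron-ignore-low: counts-style data   */
#define CUTOFF_NO_IGNORE_LOW  0.00001  /* --iron-no-ignore-low: per-million data */

/* Hypothetical helper: counts the points that would remain available for
 * training when values <= cutoff are excluded.  Assumes a point is kept
 * only if both the sample and the reference value exceed the cutoff. */
size_t count_training_points(const double *sample, const double *reference,
                             size_t n, double cutoff)
{
    size_t i, kept = 0;

    assert((sample != NULL && reference != NULL) || n == 0);

    for (i = 0; i < n; i++) {
        if (sample[i] > cutoff && reference[i] > cutoff)
            kept++;
    }
    return kept;
}
```

With TPM-like values that mostly lie between 0 and 1, the 1.0 cutoff discards nearly every point, while the 0.00001 cutoff keeps all values that are effectively non-zero.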
Documentation changes:
- findmedian: corrected --ignore-weak description (ignore <= 0, not <= 1).
  This is only a documentation fix, it has always been implemented as <= 1.
- Renamed README_IRON to README, so git will display it.
- Renamed old README to README_INSTALL so git will display newly renamed README_IRON.
- Various minor documentation typo corrections.
libaffy-2.2.0.tar.gz [Released 2020-08-21]
New options and functionality changes:
- programs will no longer scan the working directory for CEL files if no input files are specified
- iron_generic: --rnaseq now enables --iron-condense-training
  (improves fits by minimizing contribution from quantized low read counts)
- iron_generic: --floor-to-min option is recognized as a valid flag now
- iron_generic: added --microarray meta-flag
  (resets flags set by --proteomics and --rnaseq to microarray defaults)
- iron_generic: added --floor-non-zero-to-one flag
- iron: added --probeset-norm and --no-probeset-norm flags to enable/disable probeset-level normalization
  (iron --norm-quantile --median-polish --no-probeset-norm yields identical output to the "rma" program)
- findmedian: added -o option to specify output file name instead of STDOUT
Minor output format changes:
- iron_generic: output missing (0) data as blanks if log2 output is selected
- iron_generic: print GlobalFitLine header with --iron-untilt, do not print GlobalScale header unless --iron-global-scaling is specified
- findmedian: move Average RMSD line up, print additional median sample line as last line for easier parsing
- output spreadsheets now use ProbeID as top-left field name instead of the output file name
- print path to CDF file to STDOUT after the rest of the settings summary
- changed --bioconductor-compatability description to better reflect that identical output hasn't been guaranteed for some time now
Minor bug fixes and other edits:
- expanded README_IRON with more examples and documentation
  documented --microarray --proteomics --rnaseq meta-flags
- various minor changes to SCons* files for python 2/3 compatibility
- iron_norm.c: fixed potential incorrect use of CEL-embedded masks when used in pm-only mode
  This could have only occurred if the "rma" binary was run with --norm-iron *and* the CEL file contained non-zero masks.
  CEL-embedded masks are uncommon, and --norm-iron was removed as an rma option many years ago, so this edge case should not have been triggered for many years now.
- fixed typo in WIN32 list_files(), "struct new" should be a pointer
- renamed min() and max() macros to avoid conflicts with other libraries on some Windows systems
- iron_generic: changed probeset summarization function from median polish to Tukey's Bi-weight
  (change has no effect on the output relative to previous versions, since each probeset consists of only one probe)
- various minor edits to squash -Wall warnings, should have no impact on existing code flow
- various minor edits to allow for emscripten (emcc) compilation
- edited SConstruct files to throw most warnings with cflags=optimize;
  (this should help to avoid future OSX compilation issues, since clang aborts on warnings GCC ignores by default)
- deleted long unused (and bugged) chip_distance.c
libaffy-2.1.9.1.tar.gz [Released 2019-08-13]
- Changed a few unsigned char strings to char so it builds under OSX again
- Print the value of average RMSD, rather than the address of its pointer
  (this went unnoticed for years since, due to typical printf weirdness, a valid looking value was printed)
- Release OSX binaries
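The average RMSD printing bug above is a common C pitfall: passing a pointer where printf() expects a double is undefined behavior, and it can still print something that looks plausible. A minimal sketch of the corrected pattern follows; the function name and the output format are made up for this example and are not taken from findmedian.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical reporting helper; name, signature, and format are illustrative.
 * Formats the value that avg_rmsd points to and returns snprintf()'s result. */
int format_average_rmsd(char *out, size_t size, const double *avg_rmsd)
{
    assert(out != NULL && avg_rmsd != NULL);

    /* Wrong:  snprintf(out, size, "Average RMSD:\t%f", avg_rmsd);
     *         (passes the pointer itself to %f, which is undefined behavior) */
    return snprintf(out, size, "Average RMSD:\t%f", *avg_rmsd);
}
```

Compiling with -Wall (or -Wformat) lets GCC and Clang flag this kind of format/argument mismatch.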
libaffy-2.1.9.tar.gz [Released 2019-08-13]
- agilent_to_spreadsheet.pl: handle more ArrayExpress-corrupted Agilent "raw" files
- findmedian: fix potential --pearson correlation overflow/underflow issues with large datasets
- Work around more broken control probeset definitions in "official" HuEx-1_0-st-v2.text.cdf
- Added --ignore-chip-mismatch flag to not abort when multiple chip types are detected
  (use when they are known to truly be the same, even if the CEL file claims otherwise)
- Commented datExtractor out of make_clean.sh to finish deprecating it
libaffy-2.1.8.tar.gz [Released 2019-04-25]
- Fix various crashes when loading newer binary CDF files and misformatted BrainArray CDF files
- --iron-condense-training only applies to probesets now
  (use this flag for Affymetrix miRNA-4.0 arrays and other exceedingly dark chips)
- finished implementing exclusions/spikeins for iron program
  (the flags only did anything in findmedian and iron_generic before)
- Improved/added ignoring AFFX/control probes/probesets during findmedian, IRON, and quantile normalization
  Quantile normalization includes them, findmedian ignores them, and iron ignores them during normalization training (they are still normalized)
  NOTE -- this will likely cause minor differences in output data !!
- Fixed --norm-mean normalization on newer 1:many probe:probeset CEL files (previous behavior was incorrect)
- Fixed segfault in findmedian --probesets caused by new exclusion/spikein options
- datExtractor program is now officially unsupported; contact eric.welsh@moffitt.org if you still use it
libaffy-2.1.7.1.tar.gz [Released 2018-08-27]
- Fixed AFFY_ERROR return values upon binary CDF file load errors
- Included working datExtractor binary in binary releases
  (I had forgotten to temporarily recompile 2.1.7 with -DSTORE_CEL_QC just for datExtractor)
- Compiles on OSX (will upload OSX binaries "soon")
libaffy-2.1.7.tar.gz [Released 2018-08-23]
- Fixed long-standing bugs in binary CDF file support.
  If you find one that still crashes, please email me.
- findmedian: fixed --meancenter and --pearson bugs/issues
  --pearson now recommended for mass spec proteomics/metabolomics data
- iron, iron_generic:
  Preserve input zeroes as zeros after IRON normalization
  (they weren't always output as zero after normalization before).
  Added --iron-condense-training and --iron-no-condense-training (default, original behavior).
        When training the normalization, all data points with identical values are condensed to a single data point.
        This is especially important for mass spec proteomics/metabolomics data.
        Condensing is disabled by default (original behavior); --proteomics also enables it, --rnaseq also disables it.
  Added --iron-exclusions=filename flag to specify a list of row identifiers to exclude from training.
        Useful for samples with a secondary distribution (sometimes contaminants),
        where the secondary distribution is dense and IRON fits through the "wrong" dense region.
  Fixed some command line options to not print bogus single letter shortcuts.
- agilent_to_spreadsheet.pl:
  Support ArrayExpress mangled files
  Support mixed 2-ch and 1-ch files in the same analysis
- Fixed %lf/%f, %ld/%d mismatch issues affecting GlobalScale stderr statistics printing
- Inserted Log2Scale column into GlobalScale stderr messages
- Changed wording of various "chip" messages to "sample" messages
- More function renaming to remove new Xcode introduced header conflicts on OSX
- See changelog.txt accompanying the source code for more detailed changelog
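The idea behind --iron-condense-training (data points with identical values are condensed to a single data point before training) can be sketched roughly as below. The struct and function names are hypothetical, and treating "identical" as an exact match of both the sample and the reference value is an assumption of this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative type; libaffy's real training structures may differ. */
struct train_point {
    double x;   /* intensity in the sample being normalized */
    double y;   /* intensity in the reference sample */
};

/* Orders points by x, then by y, so that duplicates become adjacent. */
static int cmp_train_point(const void *a, const void *b)
{
    const struct train_point *p = a, *q = b;

    if (p->x != q->x) return (p->x < q->x) ? -1 : 1;
    if (p->y != q->y) return (p->y < q->y) ? -1 : 1;
    return 0;
}

/* Sorts pts in place and collapses exact duplicates to one point each.
 * Returns the number of points that remain at the front of the array. */
size_t condense_training(struct train_point *pts, size_t n)
{
    size_t i, kept;

    assert(pts != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(pts, n, sizeof *pts, cmp_train_point);

    kept = 1;
    for (i = 1; i < n; i++) {
        if (pts[i].x != pts[kept - 1].x || pts[i].y != pts[kept - 1].y)
            pts[kept++] = pts[i];
    }
    return kept;
}
```

Without condensing, a large pile of identical values can dominate a density-weighted fit, which is why the flag matters for mass spec data.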
libaffy-2.1.6.tar.gz [Released 2015-08-12]
- will now compile on OSX 10.10 (Clang LLVM)
- will now compile on latest Cygwin + MinGW32 combination: i686-pc-mingw32-gcc v4.7.3
  (older gcc v3.x -mno-cygwin compilers no longer supported)
- added scons cflags_bundle=datextractor for easier building of datExtractor
  (defines STORE_CEL_QC, which causes additional memory overhead in the other programs)
- added additional initializations to affy_rma_set_defaults()
  which resulted in corrected printing of settings used in the "rma" program
  (program exhibited correct default behavior, such as enabling quantile normalization, but default settings were not printed correctly)
- findmedian.c: fixed --nolog2 flag bug
  (it was not previously applying to --spreadsheet mode)
- agilent_to_spreadsheet.pl:
  corrections to comment text at the top of the file,
  more intelligent control probe skipping,
  spots found to be bad in one channel are now flagged as bad in the other channel as well
  (since the physical cause is likely to be affecting the other channel, too)
libaffy-2.1.5.tar.gz [Released 2014-06-04]
- various major updates to add support for HTA-2 chips and correct bugs with libaffy's handling of exon chips in general
  (exon chip support still requires unofficial CLF/PGF -> CDF converted CDF files)
- exon chips: asymmetric chips (#rows != #cols) no longer crash
- all: -DSTORE_XY_REF and -DSTORE_CEL_QC flags added to load mostly unused information
  (these default to undefined to save memory)
- --iron-global-scaling now prints scale for reference chip (1.0)
- -DSTORE_CEL_QC defaults to undefined to save memory,
  however this causes the datExtractor program to no longer work
  unless libaffy is recompiled with -DSTORE_CEL_QC
- bg-rma: better support for missing values / zeroes
- iron: modified to better deal with weak / missing signals
- CDF/CEL files: had swapped rows/cols when allocating memory, leading to segfaults on asymmetric arrays
- rma/mas5/iron: abort on corrupt CEL files, unless --salvage is used
- mas5/iron: most functions that use MM probes will no longer crash or yield strange results when the chip is missing MM probes
  (they will simply do nothing, since MM probes are missing)
- mas5/iron: changed mean normalization scaling to use only intensities > 0
  (this was required so that data with many zeros or missing values would normalize properly)
- findmedian: floor CEL file data at 1 to prevent NaNs (spreadsheets were already floored)
- findmedian: --ignore-weak is now default, since it is needed to work well on data with missing values,
  and has no effect if there are no missing values
- findmedian: new option --include-weak disables the --ignore-weak option
- findmedian: divide distances by normalized number of points in comparison
  (improves median detection when --ignore-weak is used on data with weak/missing signals)
- affydump: abort cleanly when netcdf is requested but is unavailable
- iron_generic: added --proteomics option to set appropriate proteomics defaults:
  --bg-none --unlog --iron-global-scaling --iron-weight-exponent=0 --iron-fit-both-x-y --floor-none
- iron_generic: added --rnaseq option to set appropriate rnaseq defaults:
  --bg-none --unlog --iron-untilt --iron-weight-exponent=0 --iron-fit-only-y --floor-none
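The mas5/iron mean normalization change above (use only intensities > 0) can be illustrated with a small sketch. The function name, the target_mean parameter, and the fallback of 1.0 when no value is positive are assumptions of this example, not libaffy's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the factor that scales the mean of the
 * strictly positive values in v to target_mean.  Zeros (missing values)
 * are left out of the mean so they cannot drag it down.  Returns 1.0
 * (no scaling) if no value is positive. */
double positive_mean_scale(const double *v, size_t n, double target_mean)
{
    double sum = 0.0;
    size_t count = 0, i;

    assert(v != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (v[i] > 0.0) {
            sum += v[i];
            count++;
        }
    }
    if (count == 0)
        return 1.0;

    return target_mean / (sum / (double) count);
}
```

For example, with values {0, 0, 100, 300, 0} and a target mean of 500, the mean over positive values is 200 and the scale is 2.5; including the zeros would give a mean of 80 and a much larger scale of 6.25.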
libaffy-2.1.4.tar.gz [Released 2013-11-11]
- New program: agilent_to_spreadsheet.pl; converts Agilent .txt scan files into spreadsheets suitable for input to iron_generic
- RMA background subtraction: added support for missing values, treat zeroes as missing values
- IRON: added experimental --iron-fit-both-x-y flag. Fit normalization curve against both X and Y (rather than the default of only Y).
  Results in "better looking" normalization, but may alter rank orders.
- iron_generic: added support for > 65536 probes
- findmedian: fixed crash in --probeset mode introduced in 2.1.2
- findmedian: added --log2 (default, same as original behavior) and --nolog2 (no transformation) options
  to transform data prior to distance calculations
libaffy-2.1.3.tar.gz [Released 2013-06-13]
- findmedian: fixed crash in --spreadsheet mode introduced in 2.1.2 due to checking for corrupt binary CEL files
  (there can be no corrupt CEL files if the input is a spreadsheet)
- Modified RMA background subtraction to not crash when the highest density region is at or near the minimum value in the dataset
  (I have only seen this occur in datasets with many zero intensities)
- Modified RMA background subtraction to not adjust intensities that are originally zero, leaving them as zero
libaffy-2.1.2.tar.gz [Released 2013-05-13]
- The findmedian program now correctly handles input consisting of only a single chip
- Increased sensitivity of corrupt binary CEL file detection (mask and outlier coordinates out-of-bounds)
- Fixed errors with importing masks and outliers in Calvin (generic) CEL files, introduced in libaffy v2.0
- Added non-default --salvage option to attempt to salvage corrupt binary CEL files,
  for which the intensities appear to be valid (no support for corrupt text CEL files yet).
  Corrupt masks, outliers, stdev, etc. may indicate corrupt intensities,
  even if the intensities appear valid, so *USE AT YOUR OWN RISK*.
libaffy-2.1.1.tar.gz [Released 2013-04-29]
- Added README_IRON text file with brief description and example usage for IRON normalization
- Only remove file extensions in output sample names if ending in .CEL or .TXT/.TEXT (case insensitive)
- iron_generic: No longer checks for CEL files in current directory if no inputs given
- iron_generic: Renamed --no-normalize option to --norm-none for consistency with the other programs
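The sample-name rule above (strip the extension only if it is .CEL, .TXT, or .TEXT, ignoring case) might look something like the following. This is a sketch with made-up function names, not the code used in libaffy.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Case-insensitive string equality using only standard C. */
static int equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Hypothetical helper: truncates name in place at a trailing .CEL, .TXT,
 * or .TEXT extension (any case).  Any other extension is left untouched. */
void strip_known_extension(char *name)
{
    static const char *const known[] = { ".cel", ".txt", ".text" };
    char *dot;
    size_t i;

    assert(name != NULL);

    dot = strrchr(name, '.');
    if (dot == NULL)
        return;

    for (i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (equals_ignore_case(dot, known[i])) {
            *dot = '\0';
            return;
        }
    }
}
```

Under this rule a sample named sample.v2 keeps its full name, whereas unconditionally cutting at the last dot would shorten it to "sample".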
libaffy-2.1.tar.gz [Released 2013-04-01]
- Fixed crash in IRON when median polish was mixed with non-RMA background subtraction
- Fixed initialization bug that resulted in incorrect IRON normalization within the iron_generic program
- Fixed issue where IRON could remove all points from the training set when N points is small
- Fixed rare case of reading beyond IRON fit equation window bounds, particularly when N points is small
- Fixed --norm-none to include disabling mean probeset scaling
- Added --iron-weight-exponent=N option to IRON software to control pseudo-density weighting
- Added brief README documentation for make-like installation commands
- Extended IRON --norm-quantile to apply probeset-level quantile normalization
- Renamed --iron-linear flag to --iron-global-scaling to more properly describe its functionality
- Text file input is faster, due to fewer reallocs, at minimal cost in extra memory usage
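The trade-off in the last item (fewer reallocs in exchange for some extra memory) is the usual geometric-growth buffer pattern. A generic sketch follows; the struct, the function name, the initial capacity of 1024, and the doubling factor are all assumptions of this example, and libaffy's text reader may grow its buffers differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative growable array of doubles. */
struct double_vec {
    double *data;
    size_t  len;   /* number of values stored   */
    size_t  cap;   /* number of values allocated */
};

/* Appends value, doubling the capacity whenever the array is full, so that
 * n appends trigger only O(log n) calls to realloc().
 * Returns 0 on success, -1 if memory could not be allocated. */
int double_vec_push(struct double_vec *v, double value)
{
    assert(v != NULL);

    if (v->len == v->cap) {
        size_t new_cap = (v->cap != 0) ? v->cap * 2 : 1024;
        double *p = realloc(v->data, new_cap * sizeof *p);

        if (p == NULL)
            return -1;      /* the old buffer is still valid */
        v->data = p;
        v->cap = new_cap;
    }
    v->data[v->len++] = value;
    return 0;
}
```

Start from a zero-initialized struct (data = NULL, len = 0, cap = 0) and free(v.data) when done. At most half of the allocated capacity is left unused, which is the "extra memory" side of the trade.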
libaffy-2.0.tar.gz [Released 2012-09-21]
- Addition of IRON pair-wise normalization and related tools
  (iron, iron_generic, findmedian, pairgen)
- Added support for Calvin format binary CEL files
- Implemented MAS5-style Present/Marginal/Absent calls
- Preliminary support for Affymetrix whole exon arrays via the text version of the official "unsupported" CDF files
  (transcript-level probesets are not present in the CDF files)
- Added support for "incremental RMA", to allow saving an RMA model and applying it to new CEL files
- Various fixes and features
- Bioconductor compatibility flag is deprecated and has not been regression tested
- Switched from Make to SCons (Python-based) build environment
  (overall gain in portability, the previous Makefiles were brittle)
- Improved support for compiling under Win32 GCC 3.x Cygwin+MinGW
- Documentation has not been updated since v1.3 (2006), run programs with --help for current options
libaffy-1.3.tar.gz [Released 2006-11-30]
- Added option to ignore AFFX control probes during RMA normalization
- Added option for quantile normalization to MAS5
libaffy-1.2.tar.gz [Released 2006-07-11]
- Improved agreement in MAS5.0 code with actual MAS5.0 results
libaffy-1.1.tar.gz [Released 2006-07-07]
- Fixed endian conflict in Linux
- Strips pathname and extension when writing expressions
libaffy-1.0.tar.gz [Released 2005-11-15]

Binaries

*** WARNING *** versions prior to 2.1.5 generated incorrectly normalized output for exon arrays in which a probe can be a member of multiple probesets

libaffy-2.3.0
- Linux 64-bit, statically linked
libaffy-2.2.0
libaffy-2.1.9.1
libaffy-2.1.9
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked
libaffy-2.1.8
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked
libaffy-2.1.7.1
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked
libaffy-2.1.7
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked
libaffy-2.1.6
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked
libaffy-2.1.5
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked
libaffy-2.1.4
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked
libaffy-2.1.3
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked
libaffy-2.1.2
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked
libaffy-2.1.1
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked
libaffy-2.1
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked
libaffy-2.0
- Linux 64-bit, statically linked
- Windows 32-bit, statically linked

Documentation

Note: Documentation linked here is always built from the latest release tag in the source control repository, and may be out of sync with older versions. Documentation may also be out of date and not reflect newer versions which have not been properly documented yet.

Background

When we first began experimenting with Affymetrix microarrays, much of the useful software for accessing this data from the raw files was not available. The excellent Bioconductor project has been most notable in filling that gap. However, we feel that having a library of routines available for public consumption would allow for greater flexibility. In addition, many of the data structures and algorithms have been implemented with an eye both towards readability and performance (particularly memory usage). In this, we feel libaffy succeeds very well.

Publications

Welsh et al.; BMC Bioinformatics, 2013. 14:153
Eschrich et al.; Engineering in Medicine and Biology Society, 2008. EMBS 2008. 30th Annual International Conference of the IEEE
Eschrich SA, Hoerter AM; Bioinformatics, 2007. Jun 15;23(12):1562-4

Contacts

Eric A. Welsh (eric.welsh@moffitt.org)
Steven A. Eschrich (steven.eschrich@moffitt.org)