AlphaPept: Workflow and Files

Core

The core function of AlphaPept is interface.run_complete_workflow(). This function requires a settings dictionary that contains the settings and file paths. On disk, settings are stored as *.yaml files. When called, the core function runs a complete workflow based on the given settings.
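A minimal sketch of calling the core function from Python. It assumes the settings are read with PyYAML (yaml.safe_load) and passed on as a plain dictionary; the helper name run_from_yaml and the file name in the usage note are hypothetical, and AlphaPept may offer its own helpers for loading settings.

```python
def run_from_yaml(settings_path):
    """Load a *.yaml settings file and hand the dictionary to AlphaPept."""
    # Imports are local so the sketch can be read without AlphaPept installed.
    import yaml  # PyYAML; an assumption, any YAML reader would do
    import alphapept.interface

    with open(settings_path, "r") as handle:
        # The settings file becomes a plain dictionary of settings and file paths.
        settings = yaml.safe_load(handle)

    # Run the complete workflow described by the settings.
    return alphapept.interface.run_complete_workflow(settings)
```

For example, run_from_yaml("my_experiment.yaml") would process the files listed in that (hypothetical) settings file.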

GUI

When starting the AlphaPept GUI via the shortcut created by the one-click installer or via Python (python -m alphapept gui), the AlphaPept server is started. It can be accessed via a browser and provides a graphical user interface (GUI) to the AlphaPept functionality. The server extends the core function to a processing framework. It is centered around three folders, Queue, Failed, and Finished, which are created in the .alphapept-folder in the user’s home directory. Whenever a new *.yaml-file is found in the Queue-folder, the server hands it over to the core function and starts processing. There are three ways to add files to the Queue-folder:

  1. Via the New experiment-tab in the GUI

  2. Manually copying a *.yaml-file into the Queue-folder

  3. Automatically via the File watcher

The File watcher can be set up to monitor a folder; whenever a new file matching pre-defined settings is copied to the folder, it will create a *.yaml-file and add it to the Queue-folder.

Whenever an experiment succeeds, summary information about the experiment is appended to the *.yaml-file, which is then moved to the Finished-folder. As the *.yaml-file is very small (~kB), it is intended to serve as a history of processed files.

Whenever an experiment fails, the *.yaml-file will be moved to the Failed-folder. It can be moved from there to the Queue-folder for reprocessing.
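Moving a failed experiment back for reprocessing can also be scripted. The following is a minimal sketch using only the Python standard library. The folder names Queue and Failed are taken from the description above and are passed as parameters because the exact spelling on disk may differ; the function name requeue_failed is hypothetical and not part of AlphaPept.

```python
import shutil
from pathlib import Path


def requeue_failed(yaml_name, home=None, failed="Failed", queue="Queue"):
    """Move a *.yaml-file from the Failed-folder back to the Queue-folder."""
    # Default to the .alphapept-folder in the user's home directory.
    base = Path(home) if home is not None else Path.home() / ".alphapept"
    source = base / failed / yaml_name
    target = base / queue / yaml_name

    if not source.is_file():
        raise FileNotFoundError(source)

    # Create the Queue-folder if it does not exist yet, then move the file.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target
```

Once the file is back in the Queue-folder, a running server should pick it up like any other new *.yaml-file.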

FASTA files

When FASTA files are downloaded via the AlphaPept GUI, the files are stored in the FASTA folder in the alphapept home directory (e.g. C:\Users\username\.alphapept\fasta).

History and Results

AlphaPept screens all *.yaml-files in the Finished-folder and plots a run history based on the summary information. This is especially useful for QC or comparison purposes. Additionally, the *.yaml-files can be used to investigate the results of a run.

Output Files

For each run, AlphaPept creates several output files:

  - For each raw file, there will be a .ms_data.hdf-file with raw-specific data, such as feature_table, first_search, second_search, peptide_fdr and identifications. Note that the table peptide_fdr does not apply the set FDR cutoff.

  - For the entire experiment, there will be a results.hdf (name can be defined in the settings), which contains experiment-specific data, such as protein_fdr and the protein_table (containing quantified proteins over all files).

  - In addition to the results.hdf, there will be a *.yaml-file which contains the run settings and summary information of the run. This *.yaml-file can serve as a template to rerun other files with the same settings.

  - If a database is created from FASTA-files, there will be a database.hdf (name can be defined in the settings). It contains theoretical spectra and can be reused for other experiments, which speeds up the total analysis time.

The ms_data.hdf, results.hdf and database containers can be accessed via the alphapept.io library. The GUI also allows exploring these files. Additionally, the results.hdf can be loaded directly with the pandas package (e.g. pd.read_hdf('results.hdf', 'protein_table')).

For easier access, AlphaPept directly exports the most relevant tables as *.csv:

  - For each raw file: a _ids.csv-file with the best peptide-spectrum match per spectrum.

  - For each raw file: if calibration was successful, a _calibration.png to show the fragment calibration.

  - results_peptides.csv: The identified peptides after protein FDR.

  - results_proteins.csv: A table containing quantified proteins per file. Each column that additionally ends with _LFQ has the LFQ intensity. This is after delayed normalization and extraction of optimal protein ratios (see the MaxLFQ paper). The column without _LFQ has the protein intensity after delayed normalization. Note: When LFQ is disabled, there is only one column per file, and it is without delayed normalization. This leads to different intensities in the non-LFQ columns when comparing results with and without LFQ enabled.

  - results_protein_summary.csv: A table containing quantified proteins per file. This contains additional summary information, e.g. the number of sequences that were found to identify the protein.
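The _LFQ naming convention of results_proteins.csv makes it easy to separate the two kinds of intensity columns. Below is a minimal sketch with the standard csv module. It assumes that the first line of the file is a header with one or two columns per file and that an empty header cell (such as an unnamed index column) should be skipped; the function name is hypothetical.

```python
import csv


def split_intensity_columns(csv_path, suffix="_LFQ"):
    """Return (lfq_columns, plain_columns) from the header of a proteins table."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle))  # only the header row is needed

    # Skip empty header cells, e.g. an unnamed index column.
    named = [name for name in header if name]

    lfq_columns = [name for name in named if name.endswith(suffix)]
    plain_columns = [name for name in named if not name.endswith(suffix)]
    return lfq_columns, plain_columns
```

The returned column names can then be used to select either the LFQ intensities or the plain protein intensities in a downstream tool.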

Column headers

Below is a description of the column headers in the output files.

protein_fdr

Name Description
delta_m_ppm_abs absolute value of delta_m in ppm
b-H2O_hits b-ion hit with a water loss
b-NH3_hits b-ion hit with an NH3 loss
hits_b number of b ion hits
charge charge of the peptide
db_idx index to the theoretical database
decoy is the sequence a decoy or a hit (Yes / No)
decoys_cum cumulative number of decoys in table (used for FDR calculation)
delta_m mean mass delta when comparing experimental fragments to theoretical fragments when searching
delta_m_ppm delta_m in ppm
dist a metric used to measure the distance of an MS1 feature (quantification) to a matching MS2 spectrum (identification). This is important in the mapping of MS1 features to MS2 spectra
fasta_index index to the fasta file that you use for searching
fdr calculated false discovery rate value for this peptide in the table. As the PSM score decreases, more decoys are found and the FDR increases until the FDR threshold cutoff is reached, below which no further hits are counted.
feature_idx index to feature table from feature finding
feature_rank multiple MS1 features can be mapped to a single MS2 spectrum. The rank indicates how close the feature was to the spectrum in comparison to other features in close distance.
fwhm fwhm of the feature
hits total number of b- and y-ion hits. A hit occurs when a theoretical fragment can be found within the frag_tol of an experimentally recorded fragment
ms1_int_apex intensity at feature apex
fragments_int_ratio mean intensity ratio: experimental fragment intensity divided by theoretical intensity (if no db intensity is available db intensity is set to 1) for each matched ion
ms1_int_sum summed intensity of the MS1-feature
fragment_ion_idx index to ion dataframe for this PSM
fragment_ion_int intensity of each matched ion
fragment_ion_type type of the matched ion
mass mass
fragments_matched_int_sum sum of the intensity of fragments found in the PSM
fragments_matched_int_ratio ratio of the fragments_matched_int_sum to the total intensity in a spectrum
fragments_matched_n_ratio number of ions that were found (hits) divided by the number of AAs in the matched sequence
mz mz
n_AA number of AAs in the sequence
n_internal number of internal cleavage sites
n_fragments_matched number of matched ions
n_missed number of missed cleavages
sequence_naked naked sequence (i.e. without modifications)
prec_offset mass offset from theoretical precursor mass to actual precursor mass
prec_offset_ppm prec_offset in ppm
prec_offset_raw_ppm prec_offset_raw in ppm
prec_offset_raw mass offset from theoretical precursor mass to actual precursor mass (before recalibration)
precursor peptide + charge
q_value q_value for FDR estimation
query_idx index to query (for the spectrum index use raw_idx)
score_rank one query can have multiple matching PSMs. This is the score_rank of the PSM to the query with respect to the score. See also raw_rank
precursor_rank if a precursor is measured multiple times, this is the score_rank with respect to the score
raw_idx index to the experimental spectrum
raw_rank one spectrum can have multiple matching PSMs. This is the score_rank of the PSM to the spectrum with respect to the score.
rt retention time of the feature
rt_apex retention time of the feature (at maximum intensity)
rt_end end timepoint of feature
rt_start start timepoint of feature
scan_no scan number
score score used for FDR estimation
sequence matched sequence
target boolean flag if sequence is target or decoy
target_cum cumulative number of targets (used for FDR estimation)
fragments_int_sum summed intensity of all fragments in experimental spectrum
x_tandem x_tandem score
y-H2O_hits number of y-ion hits with water loss
y-NH3_hits number of y-ion hits with ammonia loss
hits_y number of y-ion hits
filename full filename of raw data
shortname short filename of raw data (w/o folder path)
protein identified protein
protein_group identified protein_group
razor flag if peptide was used for protein grouping with razor approach
protein_idx index to protein (with respect to FASTA)
decoy_protein flag if protein is decoy or not
n_possible_proteins number of possible proteins for this peptide
score_protein_group score of the protein group
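To illustrate how the columns target_cum, decoys_cum, fdr and q_value relate to each other, here is a minimal sketch of a generic target-decoy estimate. It is the textbook procedure and not necessarily the exact formula AlphaPept uses; the function name and the input format (a list of (score, is_decoy) pairs) are assumptions for illustration.

```python
def target_decoy_fdr(psms):
    """Compute cumulative counts, FDR and q-values for (score, is_decoy) pairs."""
    # Best-scoring matches first.
    ordered = sorted(psms, key=lambda psm: psm[0], reverse=True)

    rows = []
    target_cum = 0
    decoys_cum = 0
    for score, is_decoy in ordered:
        if is_decoy:
            decoys_cum += 1
        else:
            target_cum += 1
        rows.append({
            "score": score,
            "decoy": is_decoy,
            "target_cum": target_cum,
            "decoys_cum": decoys_cum,
            # Estimated FDR when accepting everything down to this score.
            "fdr": decoys_cum / max(target_cum, 1),
        })

    # q-value: the lowest FDR at which this match would still be accepted,
    # i.e. the running minimum of the FDR from the worst score upwards.
    running_min = float("inf")
    for row in reversed(rows):
        running_min = min(running_min, row["fdr"])
        row["q_value"] = running_min
    return rows
```

Filtering the returned rows by q_value at the chosen threshold corresponds to the cutoff described for the fdr column, under the assumptions stated above.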

Downstream analysis

AlphaPept offers some basic plots in the results section (e.g., volcano, heatmap, and PCA). The *.csv-format should be generic enough to use with multiple other tools. Feel free to reach out in case you have ideas for plots or find that the output format is not supported or is missing required columns. To reach out, report an issue here or send an email to opensource@alphapept.com.

Using with Perseus

Perseus offers a generic table import, so you can directly use the results_proteins.csv.

Example: Volcano-Plot

An excellent tutorial for creating volcano-plots with Perseus can be found here.

Below is a quickstart for using AlphaPept with Perseus (tested with 1.6.15.0). The file used here is PXD006109 from the test runner (multi-species quantification test) with six files (three per group).

  1. Open Perseus.

  2. Drag and drop the results_proteins.csv in the central pane of Perseus. The Generic matrix upload-window will open.

  3. Select the appropriate columns (e.g., LFQ for LFQ-intensities) and assign them to Main with the >-Button. The first row is empty; assign it to Text. Click OK, and the table should be loaded.

  4. Click on the f(x)-button and press OK on the window that opens to apply a log2(x)-transformation.

  5. Click on Annot. rows > Categorical annotation rows to assign a group for each file. Select multiple entries and click on the checkmark to assign multiple groups at the same time. Click OK to close the window.

  6. Click on the Volcano plot-symbol in the upper right Analysis-column. For the tutorial, we keep the standard settings and press OK.

  7. You can double-click on the small volcano plot to show the plot.

Enjoy your volcano-plot.