AlphaPept: Workflow and Files
Core
The core function of AlphaPept is interface.run_complete_workflow(). This function requires settings, passed as a dictionary containing the settings and file paths. On disk, settings are stored as *.yaml files. When called, the core function runs a complete workflow based on the given settings.
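As a minimal sketch, a call to the core function from Python could look like the following. The helper name run_settings_file is hypothetical, loading the file with PyYAML's yaml.safe_load is an assumption (AlphaPept may provide its own loader), and it is assumed that run_complete_workflow accepts the settings dictionary as its first argument.

```python
from pathlib import Path


def run_settings_file(settings_path):
    """Load a *.yaml settings file and run the complete workflow on it.

    Hypothetical helper for illustration; not part of AlphaPept.
    """
    settings_path = Path(settings_path)
    if not settings_path.is_file():
        raise FileNotFoundError(f"Settings file not found: {settings_path}")

    # Imported lazily so that a missing file is reported before heavy imports.
    import yaml  # PyYAML, assumed to be installed alongside AlphaPept
    from alphapept import interface

    with settings_path.open("r") as handle:
        settings = yaml.safe_load(handle)  # dictionary of settings and file paths

    # Assumption: the settings dictionary is passed as the first argument.
    return interface.run_complete_workflow(settings)
```

Calling run_settings_file("my_experiment.yaml") would then process the files listed in that (hypothetical) settings file.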
GUI
When the AlphaPept GUI is started via the shortcut created by the one-click installer or via Python (python -m alphapept gui), the AlphaPept server is started. It can be accessed via a browser and provides a graphical user interface (GUI) to the AlphaPept functionality. The server extends the core function to a processing framework. It is centered around three folders, Queue, Failed, and Finished, which are created in the .alphapept folder in the user’s home directory. Whenever a new *.yaml file is found in the Queue folder, the server hands it over to the core function and starts processing. There are three ways to add files to the Queue folder:
1. Via the New experiment tab in the GUI
2. Manually copying a *.yaml file into the Queue folder
3. Automatically via the File watcher
The File watcher can be set up to monitor a folder; whenever a new file matching pre-defined settings is copied to the folder, it creates a *.yaml file and adds it to the Queue folder.
Whenever an experiment succeeds, summary information of the experiment is appended to the *.yaml file, and the file is moved to the Finished folder. As the *.yaml file is very small (a few kB), it is intended to serve as a history of processed files.
Whenever an experiment fails, the *.yaml file is moved to the Failed folder. It can be moved from there to the Queue folder for reprocessing.
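The manual route (copying a *.yaml file into the Queue folder, or moving failed runs back for reprocessing) can be sketched with the standard library. The function names are hypothetical, and the default folder locations are assumptions based on the description above; check the exact folder names in your own .alphapept folder.

```python
import shutil
from pathlib import Path

# Assumed location and folder names; verify them on your system.
ALPHAPEPT_HOME = Path.home() / ".alphapept"


def enqueue(settings_file, queue_dir=ALPHAPEPT_HOME / "Queue"):
    """Copy a *.yaml settings file into the Queue folder."""
    queue_dir = Path(queue_dir)
    queue_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(settings_file, queue_dir))


def requeue_failed(failed_dir=ALPHAPEPT_HOME / "Failed",
                   queue_dir=ALPHAPEPT_HOME / "Queue"):
    """Move every *.yaml file from the Failed folder back into the Queue folder."""
    failed_dir, queue_dir = Path(failed_dir), Path(queue_dir)
    queue_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for yaml_file in sorted(failed_dir.glob("*.yaml")):
        target = queue_dir / yaml_file.name
        shutil.move(str(yaml_file), str(target))
        moved.append(target)
    return moved
```

Copying (rather than moving) in enqueue keeps the original settings file available as a template.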
FASTA files
When FASTA files are downloaded via the AlphaPept GUI, they are stored in the FASTA folder in the AlphaPept home directory (e.g. C:\Users\username\.alphapept\fasta).
History and Results
AlphaPept screens all *.yaml files in the Finished folder and plots a run history based on the summary information. This is especially useful for QC or comparison purposes. Additionally, the *.yaml files can be used to investigate the results of a run.
Output Files
For each run, AlphaPept creates several output files:
- For each raw file, there will be a .ms_data.hdf file with raw-specific data, such as feature_table, first_search, second_search, peptide_fdr and identifications. Note that the table peptide_fdr does not apply the set FDR cutoff.
- For the entire experiment, there will be a results.hdf (the name can be defined in the settings), which contains experiment-specific data, such as protein_fdr and the protein_table (containing quantified proteins over all files).
- In addition to the results.hdf, there will be a *.yaml file that contains the run settings and summary information of the run. This *.yaml file can serve as a template to rerun other files with the same settings.
- If a database is created from FASTA files, there will be a database.hdf (the name can be defined in the settings). It contains theoretical spectra and can be reused for other experiments, which speeds up the total analysis time.
The ms_data.hdf, results.hdf and database containers can be accessed via the alphapept.io library. The GUI also allows exploring these files. Additionally, the results.hdf can be loaded directly with the pandas package (e.g. pd.read_hdf('results.hdf', 'protein_table')).
For easier access, AlphaPept directly exports the most relevant tables as *.csv:
- For each raw file: a _ids.csv file with the best peptide-spectrum match per spectrum.
- For each raw file: if calibration was successful, a _calibration.png that shows the fragment calibration.
- results_peptides.csv: The identified peptides after protein FDR.
- results_proteins.csv: A table containing quantified proteins per file. Each column that additionally ends with _LFQ has the LFQ intensity. This is after delayed normalization and extraction of optimal protein ratios (see the MaxLFQ paper). The column without _LFQ has the protein intensity after delayed normalization. Note: When LFQ is disabled, there is only one column per file, and this is without delayed normalization. This leads to different intensities when comparing the non-LFQ table between results with and without LFQ enabled.
- results_protein_summary.csv: A table containing quantified proteins per file. This contains additional summary information, e.g. the number of sequences that were found to identify the protein.
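To work with the _LFQ naming convention of results_proteins.csv, the column names can be paired up per file. The sketch below uses only the convention described above; the function name and the example column names are hypothetical.

```python
def pair_lfq_columns(columns, suffix="_LFQ"):
    """Map each file name to its (non-LFQ column, LFQ column) pair.

    Columns without a matching LFQ column (e.g. an index column, or all
    columns when LFQ is disabled) do not appear in the result.
    """
    columns = list(columns)
    pairs = {}
    for name in columns:
        if name.endswith(suffix):
            base = name[: -len(suffix)]
            # The non-LFQ column may be absent; record None in that case.
            pairs[base] = (base if base in columns else None, name)
    return pairs
```

For example, passing the header of a table loaded with pandas (list(df.columns)) yields one entry per quantified file, and an empty dictionary when LFQ was disabled.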
Column headers
Below is a description of the column headers in the output files.
protein_fdr
Name | Description |
---|---|
delta_m_ppm_abs | absolute value of delta_m in ppm |
b-H2O_hits | number of b-ion hits with water loss |
b-NH3_hits | number of b-ion hits with ammonia loss |
hits_b | number of b ion hits |
charge | charge of the peptide |
db_idx | index to the theoretical database |
decoy | is the sequence a decoy or a hit (Yes / No) |
decoys_cum | cumulative number of decoys in table (used for FDR calculation) |
delta_m | mean mass delta when comparing experimental fragments to theoretical fragments when searching |
delta_m_ppm | delta_m in ppm |
dist | a metric used to measure the distance of an MS1 feature (quantification) to a matching MS2 spectrum (identification). This is important in the mapping of MS1 features to MS2 spectra |
fasta_index | index to the fasta file that you use for searching |
fdr | calculated false discovery rate value for this peptide in the table. As the PSM score decreases, more decoys are found and the FDR increases until the FDR threshold is reached; hits below this cutoff are not counted. |
feature_idx | index to feature table from feature finding |
feature_rank | multiple MS1 features can be mapped to a single MS2 spectrum. The feature_rank indicates how close the feature was to the spectrum in comparison to other features in close distance. |
fwhm | fwhm of the feature |
hits | total number of b- and y-ion hits. A hit occurs when a theoretical fragment can be found within the frag_tol of an experimentally recorded fragment |
ms1_int_apex | intensity at feature apex |
fragments_int_ratio | mean intensity ratio: experimental fragment intensity divided by theoretical intensity (if no db intensity is available db intensity is set to 1) for each matched ion |
ms1_int_sum | summed intensity of the MS1-feature |
fragment_ion_idx | index to ion dataframe for this PSM |
fragment_ion_int | intensity of each matched ion |
fragment_ion_type | type of the matched ion |
mass | mass |
fragments_matched_int_sum | sum of the intensity of fragments found in the PSM |
fragments_matched_int_ratio | ratio of the fragments_matched_int_sum to the total intensity in a spectrum |
fragments_matched_n_ratio | number of ions that were found (hits) divided by the number of AAs in the matched sequence |
mz | mz |
n_AA | number of AAs in the sequence |
n_internal | number of internal cleavage sites |
n_fragments_matched | number of matched ions |
n_missed | number of missed cleavages |
sequence_naked | naked sequence (i.e. without modifications) |
prec_offset | mass offset from theoretical precursor mass to actual precursor mass |
prec_offset_ppm | prec_offset in ppm |
prec_offset_raw_ppm | prec_offset_raw in ppm |
prec_offset_raw | mass offset from theoretical precursor mass to actual precursor mass (before recalibration) |
precursor | peptide + charge |
q_value | q_value for FDR estimation |
query_idx | index to query (for the spectrum index use raw_idx) |
score_rank | one query can have multiple matching PSMs. This is the score_rank of the PSM to the query with respect to the score. See also raw_rank |
precursor_rank | if a precursor is measured multiple times, this is the rank of the PSM with respect to the score |
raw_idx | index to the experimental spectrum |
raw_rank | one spectrum can have multiple matching PSMs. This is the rank of the PSM to the spectrum with respect to the score. |
rt | retention time of the feature |
rt_apex | retention time of the feature (at maximum intensity) |
rt_end | end timepoint of feature |
rt_start | start timepoint of feature |
scan_no | scan number |
score | score used for FDR estimation |
sequence | matched sequence |
target | boolean flag if sequence is target or decoy |
target_cum | cumulative number of targets (used for FDR estimation) |
fragments_int_sum | summed intensity of all fragments in experimental spectrum |
x_tandem | x_tandem score |
y-H2O_hits | number of y-ion hits with water loss |
y-NH3_hits | number of y-ion hits with ammonia loss |
hits_y | number of y-ion hits |
filename | full filename of raw data |
shortname | short filename of raw data (w/o folder path) |
protein | identified protein |
protein_group | identified protein_group |
razor | flag if peptide was used for protein grouping with razor approach |
protein_idx | index to protein (with respect to FASTA) |
decoy_protein | flag if protein is decoy or not |
n_possible_proteins | number of possible proteins for this peptide |
score_protein_group | score of the protein group |
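The relationship between the target, target_cum, decoys_cum, fdr and q_value columns can be illustrated with a textbook target-decoy estimate. This is a sketch under assumptions, not a statement of how AlphaPept computes these columns: it assumes the PSMs are already sorted by descending score, estimates the FDR as decoys_cum / target_cum, and takes the q-value as the smallest FDR at this or any lower-scoring position. The function name is hypothetical.

```python
def target_decoy_fdr(is_target):
    """Compute cumulative counts, FDR and q-values for score-sorted PSMs.

    is_target: list of booleans (True = target, False = decoy), sorted by
    descending score.
    """
    target_cum, decoys_cum, fdr = [], [], []
    n_target = n_decoy = 0
    for flag in is_target:
        if flag:
            n_target += 1
        else:
            n_decoy += 1
        target_cum.append(n_target)
        decoys_cum.append(n_decoy)
        # Illustrative choice: report an FDR of 1.0 while no target was seen.
        fdr.append(n_decoy / n_target if n_target else 1.0)

    # q-value: minimum FDR over this position and all lower-scoring ones.
    q_value = fdr[:]
    for i in range(len(q_value) - 2, -1, -1):
        q_value[i] = min(q_value[i], q_value[i + 1])
    return target_cum, decoys_cum, fdr, q_value
```

Unlike the raw FDR, the q-value never decreases as the score decreases, which makes it suitable for applying a single cutoff.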
Downstream analysis
AlphaPept offers some basic plots in the results section (e.g., volcano, heatmap, and PCA). The *.csv format should be generic enough to use with many other tools. Feel free to reach out if you have ideas for plots or find that the output format is not supported or is missing required columns. To reach out, report an issue here or send an email to opensource@alphapept.com.
Using with Perseus
Perseus offers a generic table import, so you can directly use the results_proteins.csv.
Example: Volcano-Plot
An excellent tutorial for creating volcano-plots with Perseus can be found here.
Below is a quickstart for using AlphaPept with Perseus (tested with 1.6.15.0). The file used here is PXD006109 from the test runner (multi-species quantification test) with six files (three per group).
1. Open Perseus.
2. Drag and drop the results_proteins.csv in the central pane of Perseus. The Generic matrix upload window will open.
3. Select the appropriate columns (e.g., LFQ for LFQ intensities) and select them for Main with the > button. The first row is empty. Assign this for text. Click OK, and the table should be loaded.
4. Click on the f(x) button and press OK on the window that opens to apply a log2(x) transformation.
5. Click on Annot. rows > Categorical annotation rows to assign a group for each file. Select multiple entries and click on the checkmark to assign multiple groups at the same time. Click OK to close the window.
6. Click on the Volcano plot symbol in the upper right Analysis column. For the tutorial, we keep the standard settings and press OK.
7. You can double-click on the small volcano plot to show the plot.
Enjoy your volcano-plot.
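For orientation, the quantity on the x-axis of such a volcano plot is a difference of mean log2 intensities between the two groups. The sketch below shows only that part (no statistical test) and is not Perseus' implementation; the function name, the direction of the difference, and the handling of missing values are illustrative assumptions.

```python
import math
from statistics import mean


def log2_fold_change(group_a, group_b):
    """Difference of mean log2 intensities (group_b minus group_a).

    Non-positive intensities are treated as missing and skipped; if a group
    has no valid value, statistics.mean raises StatisticsError.
    """
    log_a = [math.log2(x) for x in group_a if x > 0]
    log_b = [math.log2(x) for x in group_b if x > 0]
    return mean(log_b) - mean(log_a)
```

With three intensities per group, as in the six-file example, a protein that is four times more abundant in the second group gives a value of 2.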