---
title: General comments about the infrastructure
author:
  name: Adriano Rutz
  orcid: 0000-0003-0443-9902
citation:
  url: https://taxonomicallyinformedannotation.github.io/tima/vignettes/tima.html
comments:
  giscus:
    repo: taxonomicallyinformedannotation/tima
  hypothesis:
    showHighlights: always
creative_commons: CC BY
date: today
format:
  html:
    smooth-scroll: true
google-scholar: true
knitr:
  opts_chunk:
    collapse: true
    comment: "#>"
lang: en
license: CC BY
opengraph:
  image:
    src: https://github.com/taxonomicallyinformedannotation/tima/blob/main/man/figures/logo.svg
    alt: Taxonomically Informed Metabolite Annotation
vignette: >
  %\VignetteIndexEntry{General comments about the infrastructure}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

Before using this vignette, install `tima` from r-universe and run
`tima::install_tima()` once in an interactive session.

This vignette describes the philosophy behind the infrastructure of **TIMA**.

## Philosophy

Our main goals were **flexibility** and **reproducibility**.

### Flexibility

To ensure flexibility, we split the process into as many small parts as needed.
This way, you can decide whether to skip an optional part, add your own processing, etc.
We tried to cover most use cases, but the list is of course not exhaustive.
If you feel that something useful to other users is missing, please open an [issue](https://github.com/taxonomicallyinformedannotation/tima/issues).

### Reproducibility

After some time using TIMA, you will probably wonder:
"*What were the parameters I used to generate this file?*" ...
Or a collaborator might ask you to share your data and parameters.
Writing them down each time is time-consuming and not in line with modern computational approaches.
Therefore, we chose to implement (almost) all parameters of all steps as YAML files.
They are human-readable and can be used in batches.
If you do not like YAML, the parameters of each step can also be given as command-line arguments.
They will then be saved as YAML files that you can share.
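
The round trip described above can be sketched with the `yaml` package.
This is a minimal illustration, not TIMA's implementation: the parameter names (`ms`, `polarity`, `organisms`, `taxon`) and the file name are hypothetical and do not reflect TIMA's actual YAML schema.

```{r yaml_roundtrip_sketch, eval=FALSE, include=TRUE}
# Hypothetical parameters: names are illustrative, not TIMA's schema.
params <- list(
  ms = list(polarity = "pos"),
  organisms = list(taxon = "Gentiana lutea")
)

# Save the parameters to a YAML file so they can be shared later.
params_file <- file.path(tempdir(), "my_params.yaml")
yaml::write_yaml(x = params, file = params_file)

# A collaborator (or future you) can read them back as a nested list.
params_reloaded <- yaml::read_yaml(file = params_file)
params_reloaded$organisms$taxon
```

Keeping such a file next to the results is what makes it possible to answer the question above months later.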

To ensure reproducibility and avoid endlessly re-computing steps that did not change, we built a [{targets}](https://books.ropensci.org/targets/) pipeline.
Each step of the whole pipeline will be described next.


```{r targets, echo=FALSE, eval=FALSE, message=FALSE, warning=FALSE, out.width="100%"}
try(
  expr = {
    library(targets)
    Sys.setenv(TAR_WARN = "false")
    targets::tar_visnetwork(
      names = starts_with("ann"),
      exclude = c(
        "benchmark",
        "par_",
        "paths",
        "_is",
        "_exp"
      ) |>
        tidyselect::contains(),
      targets_only = TRUE,
      degree_from = 8
    )
  },
  silent = TRUE
)
```
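
To illustrate why a {targets} pipeline avoids redundant work, here is a toy two-step pipeline run twice in a temporary directory.
It is a sketch only: the target names `raw_values` and `total` are hypothetical and unrelated to TIMA's own targets.

```{r targets_sketch, eval=FALSE, include=TRUE}
# Work in a temporary directory so that no existing project is touched.
sketch_dir <- file.path(tempdir(), "targets_sketch")
dir.create(sketch_dir, showWarnings = FALSE)
old_wd <- setwd(sketch_dir)

# Write a toy pipeline to `_targets.R` (hypothetical target names).
targets::tar_script(
  {
    list(
      tar_target(name = raw_values, command = c(1, 2, 3)),
      tar_target(name = total, command = sum(raw_values))
    )
  },
  ask = FALSE
)

targets::tar_make() # First run: both targets are built.
targets::tar_make() # Second run: nothing changed, so both are skipped.
total_value <- targets::tar_read(total)

# Restore the original working directory.
setwd(old_wd)
```

Only targets whose code or upstream dependencies changed are rebuilt, which is the behavior the TIMA pipeline relies on.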

## Use

All following steps assume that you have already installed `tima`:

```{r install, eval=FALSE, include=TRUE}
install.packages(
  "tima",
  repos = c(
    "https://taxonomicallyinformedannotation.r-universe.dev",
    "https://bioconductor.org/packages/release/bioc",
    "https://cloud.r-project.org"
  )
)
tima::install_tima()
tima::get_example_files()
```

We now recommend reading the following vignettes:

- <https://taxonomicallyinformedannotation.github.io/tima/vignettes/articles/0-validating.html> **Start here!** Validate your data before running the pipeline
- <https://taxonomicallyinformedannotation.github.io/tima/vignettes/articles/I-gathering.html>
- <https://taxonomicallyinformedannotation.github.io/tima/vignettes/articles/II-preparing.html>
- <https://taxonomicallyinformedannotation.github.io/tima/vignettes/articles/III-processing.html>
- <https://taxonomicallyinformedannotation.github.io/tima/vignettes/articles/IV-benchmarking.html>

### tl;dr

**Important:** Always validate your data to catch issues early!

```{r validate_first, eval=FALSE, include=TRUE}
tima::validate_inputs(
  features = "data/source/example_features.csv",
  spectra = "data/source/example_spectra.mgf",
  metadata = "data/source/example_metadata.tsv",
  sirius = "data/interim/annotations/example_sirius.zip",
  feature_col = "row ID",
  filename_col = "filename",
  organism_col = "ATTRIBUTE_species"
)
```

If you do not feel like going through all the steps, just run 🚀:

```{r run_app, eval=FALSE, include=TRUE}
tima::run_app()
```

If you do not even need a GUI ☠️:

```{r run_tima, eval=FALSE, include=TRUE}
tima::run_tima()
```

If you just want to change a few parameters between jobs, a convenience function is available:

```{r change_params, eval=FALSE, include=TRUE}
tima::change_params_small(
  fil_pat = "myExamplePattern",
  fil_fea_raw = "myExampleDir/myExampleFeatures.csv",
  fil_met_raw = "myExampleDir2SomeWhereElse/myOptionalMetadata.tsv",
  fil_sir_raw = "myExampleDir3/myAwesomeSiriusProject.zip",
  fil_spe_raw = "myBeautifulSpectra.mgf",
  ms_pol = "pos",
  org_tax = "Gentiana lutea",
  hig_evi = TRUE,
  summarize = FALSE
)
```


## Biological Weighting and Core Metabolism

TIMA's core principle is that candidates reported from organisms taxonomically close to your sample's origin receive higher biological scores.
For example, if you are analyzing a plant extract (*Gentiana lutea*),
metabolites reported from plants will score higher than those only reported from bacteria or animals.

### The Special Case: "Biota" Superdomain

However, some metabolites are *universal*: they are part of the shared core metabolism found across all domains of life.
To ensure these universal metabolites are always considered in annotations, regardless of sample taxonomy,
TIMA implements a special "Biota" superdomain.
When preparing structure-organism pair libraries (e.g., from BiGG metabolic models),
metabolites present in all model organisms are assigned to the special "Biota" organism with:

- `organism_taxonomy_01domain = "Biota"`
- `organism_taxonomy_ottid = 0`

Note: this Biota placeholder is different from the PubChem Lite exposomics
placeholder used for xenobiotics (`organism_taxonomy_ottid = "93302"`, cellular organisms), which
is only a label for that library source and should not be interpreted as the
Biota superdomain.

During biological weighting, any candidate from the Biota domain receives the maximal biological score.
This ensures that core metabolic pathways are never inappropriately filtered out due to taxonomic mismatches,
while still maintaining the taxonomic prioritization for organism-specific specialized metabolism.
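
The weighting logic described in this section can be sketched in a few lines of base R.
This is a simplified illustration under stated assumptions, not TIMA's actual scoring code: the function `score_biological()`, the rank names, the weight values, and the simplified taxonomies are all hypothetical.

```{r biological_weighting_sketch, eval=FALSE, include=TRUE}
# Hypothetical weights per shared taxonomic rank (not TIMA's defaults).
rank_weights <- c(
  domain = 0.1, kingdom = 0.2, family = 0.5, genus = 0.8, species = 1
)

score_biological <- function(candidate, sample, weights = rank_weights) {
  # Candidates attached to the "Biota" placeholder get the maximal score.
  if (identical(candidate[["domain"]], "Biota")) {
    return(max(weights))
  }
  # Otherwise, keep the weight of the most specific rank shared with the sample.
  ranks <- names(weights)
  shared <- ranks[which(candidate[ranks] == sample[ranks])]
  if (length(shared) == 0) 0 else max(weights[shared])
}

# Simplified taxonomies, for illustration only.
sample_taxon <- c(
  domain = "Eukaryota", kingdom = "Plantae", family = "Gentianaceae",
  genus = "Gentiana", species = "Gentiana lutea"
)
plant_candidate <- c(
  domain = "Eukaryota", kingdom = "Plantae", family = "Brassicaceae",
  genus = "Arabidopsis", species = "Arabidopsis thaliana"
)
bacterial_candidate <- c(
  domain = "Bacteria", kingdom = NA, family = "Enterobacteriaceae",
  genus = "Escherichia", species = "Escherichia coli"
)
biota_candidate <- c(
  domain = "Biota", kingdom = NA, family = NA, genus = NA, species = NA
)

scores <- c(
  plant = score_biological(plant_candidate, sample_taxon),
  bacterium = score_biological(bacterial_candidate, sample_taxon),
  biota = score_biological(biota_candidate, sample_taxon)
)
scores
```

With these toy values, the Biota candidate scores as high as a species-level match, the plant candidate gets an intermediate score, and the bacterial candidate gets none.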