The HiSim Graph Tool
Why Use the HiSim Graph Tool
- The HiSim Graph Tool groups functional genomic datasets based on the genes they contain. For example, a user may want to determine what a set of experiments on alcohol preference have in common, and what distinguishes the experiments from one another. Alternatively, one may wish to take a large set of studies of related phenomena and identify their shared or distinct substrates. In this situation one may want to know whether there is a shared biological basis for addiction and learning, and if so, what the substrate is. The user might also want to examine studies of a large number of related disorders and determine whether a more appropriate biologically based classification can be constructed.
- The HiSim Graph Tool is designed to address these goals; it presents a tree of hierarchical relationships for a set of input GeneSets. The structure is determined solely from the gene overlaps of every combination of GeneSets.
Understanding the HiSim Graph Tool
- It helps to understand set intersections before using the HiSim Graph Tool:
- If GeneSet A contains Gene A, Gene B, and Gene C, and GeneSet B contains Gene A, Gene B, and Gene D, then the intersection of GeneSet A and GeneSet B contains Gene A and Gene B, because the intersection of a group of sets consists of the elements present in every one of those sets.
- In terms of GeneSets, the smallest intersections (fewest GeneSets, most genes) are at the lower levels, and the largest intersections (most GeneSets, fewest genes) are at the top. When thinking about the genes in all the GeneSets, the roles are reversed (smallest number of genes at the top, largest number of genes at the bottom).
- HiSim Graphs must be interpreted in the context of the input GeneSets. The above example represents differentially expressed genes in multiple brain regions of alcohol-preferring rats from a single study. The highest intersection represents a gene differentially expressed in all 5 brain regions. In this case, the highest intersection represents the greatest degree of correspondence between data sets. As you descend the hierarchy, genes become more specific to the brain regions tested.
If one were to start with multiple alcohol preference measures from different studies, the top of the HiSim Graph would represent the correspondence between the experiments (such as well-characterized alcohol preference genes), and as you descend the graph the intersections would describe more specific features shared between experiments (such as stress response or tissue source).
When starting with more loosely related inputs, interpretation becomes more difficult. If one started with alcohol preference, nicotine dependence, and traumatic brain injury data (Figure 2), the top of the HiSim Graph would represent more generic processes such as neural plasticity in this case.
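The set intersections described in this section can be illustrated with a short Python sketch. It is not the algorithm GeneWeaver uses to build a HiSim Graph; it is a brute-force enumeration under stated assumptions. The GeneSet contents (including the invented "GeneSet C" and "Gene E") and the function name `all_intersections` are hypothetical and exist only for illustration.

```python
from itertools import combinations

# Hypothetical GeneSets; the first two mirror the GeneSet A / GeneSet B example.
genesets = {
    "GeneSet A": {"Gene A", "Gene B", "Gene C"},
    "GeneSet B": {"Gene A", "Gene B", "Gene D"},
    "GeneSet C": {"Gene A", "Gene E"},
}


def all_intersections(genesets):
    """Map each combination of two or more GeneSet names to its shared genes."""
    result = {}
    names = sorted(genesets)
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            shared = set.intersection(*(genesets[name] for name in combo))
            if shared:  # keep only populated (non-empty) intersections
                result[combo] = shared
    return result


intersections = all_intersections(genesets)

# Print from the top of the hierarchy (most GeneSets, fewest genes) downward.
for combo, genes in sorted(intersections.items(), key=lambda kv: -len(kv[0])):
    print(len(combo), "GeneSets:", ", ".join(combo), "->", sorted(genes))
```

In this toy input, the three-way intersection holds only Gene A, while the pairwise intersection of GeneSet A and GeneSet B holds Gene A and Gene B, which matches the pattern of fewer genes near the top and more genes near the bottom.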
Using the HiSim Graph Tool
- Access the HiSim Graph Tool through the My Projects tab under the Analyze Genesets option.
- To generate a HiSim Graph, you must first select gene sets from a project. Projects may be created and updated by uploading GeneSets, searching the GeneWeaver database, or through the use of other tools in the GeneWeaver system. See the documentation for uploading GeneSets, Search, or Manage GeneSets to learn more about these functions. To select an entire project or multiple projects for analysis, check the box next to the project name. To select individual GeneSets within a project, click on the ‘+’ beside the project name and check individual GeneSets using the check boxes. Next, click on the HiSim Graph icon in the Analysis tools box to the left of the project list.
- The HiSim Graph opens in a new tab, and can be interactively panned and zoomed with the mouse. To view more details of an intersection, click on its node in the tree. A link at the bottom of the frame lets you download the PDF.
There are a number of options available to optimize the HiSim Graph analyses. You may access the following options on the Analyze Genesets page by clicking on the green '+' symbol to the left of the HiSim Graph Tool.
- When the resulting HiSim Graph is very large, a bootstrapping filter is applied to reduce the output size. This step removes edges that are weakly supported by the underlying data, such as partitions of GeneSet subgroups that are driven by a single-gene difference between the groups. If you would like the full, unfiltered graph instead, set this option to True to disable bootstrapping. Note that the unfiltered graph may be considerably larger.
- Include homology to integrate multi-species data. This is done by using HomoloGene mappings to relate identifiers across species. If homology is excluded, data from multiple species will be segregated into separate trees.
- The minimum number of genes for an intersection. The default of 1 means that all intersections will be displayed. Increasing the value means that intersections with fewer genes will not be displayed in the output, decreasing noise and displaying more robust correspondence between GeneSets. This generally has the effect of removing the topmost nodes.
- The maximum number of genes to display in a node to enhance readability. This option does not change the resulting structure of the map. The default of 4 means that if there are 5 or more genes in an intersection, the node will simply say, for example, “5 genes” instead of listing all 5. You can increase this number to display more information at once.
- The minimum intersection size (the number of GeneSets in an intersection) you want to see in the HiSim Graph. The default is 1 to show everything. If you were not interested in pairwise intersections, for example, you would set this value to 3. This has the effect of “cutting” the tree at the specified height from the bottom. Since lower-order intersections are typically much more numerous than multi-way intersections, this option can greatly reduce the number of nodes left to examine.
- The HiSim Graph can ultimately address questions among highly curated data, such as how much dimension reduction gene overlap provides. For example, one may take a large set of gene sets associated with mood disorders and ask whether the data are similar enough to group together, i.e., of all possible subset intersections, how many are populated, and is this result better than chance?
- The maximum number of permutations to run. This option is disabled (0) by default since it can take a long time to run for large input sets. The genes contained in each GeneSet are permuted over the union of all genes in the input sets, controlling for the size of each GeneSet. The permutation tests measure the likelihood of getting a similar tree structure (Parsimony) or of getting a similar aggregation of genes in each intersection (Gene Aggregation). Note that this is a maximum value since the actual results may be fewer due to the time limit.
- Parsimony is a simple measure of the percentage of observed intersections out of all possible intersections. This is mathematically defined as:
- Gene Aggregation is a measure of the total node/tree probability. Each node is scored based on the intersection of genes and gene sets. Then the product of these scores is used to assign an overall tree aggregation probability:
- The maximum amount of time to spend doing permutations. For example, if Permutations is set to 100,000 and this value is 5 minutes, the result will either have 100,000 permutations (if they finished within 5 minutes) or be truncated to the number of permutations that were able to finish within 5 minutes. The more time you give to PermutationTimeLimit, the more permutations can complete (up to the Permutations maximum), and the more precise your results will be.
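The options above can be sketched in Python to show how they interact. This is a minimal illustration under stated assumptions, not GeneWeaver's implementation. All function and parameter names (`node_label`, `populated_intersections`, `permute`, `run_permutations`, `max_in_node`, `min_genes`, `min_sets`, `time_limit_s`) are hypothetical, as is the toy data. The statistic collected for each permutation is a plain count of populated intersections, used here as a stand-in; it is not the Parsimony or Gene Aggregation formula, and the bootstrapping filter and homology mapping are not modeled.

```python
import random
import time
from itertools import combinations


def node_label(genes, max_in_node=4):
    """List the genes, or summarize as 'N genes' when there are more than max_in_node."""
    genes = sorted(genes)
    if len(genes) > max_in_node:
        return f"{len(genes)} genes"
    return ", ".join(genes)


def populated_intersections(genesets, min_genes=1, min_sets=1):
    """Intersections of at least min_sets GeneSets that share at least min_genes genes."""
    names = sorted(genesets)
    found = {}
    for k in range(max(min_sets, 1), len(names) + 1):
        for combo in combinations(names, k):
            shared = set.intersection(*(genesets[n] for n in combo))
            if len(shared) >= min_genes:
                found[combo] = shared
    return found


def permute(genesets, rng):
    """Redraw each GeneSet from the union of all genes, keeping its size fixed."""
    universe = sorted(set().union(*genesets.values()))
    return {name: set(rng.sample(universe, len(genes)))
            for name, genes in genesets.items()}


def run_permutations(genesets, max_permutations, time_limit_s, seed=0):
    """Run up to max_permutations, stopping early once time_limit_s has elapsed."""
    rng = random.Random(seed)
    deadline = time.monotonic() + time_limit_s
    counts = []
    while len(counts) < max_permutations and time.monotonic() < deadline:
        shuffled = permute(genesets, rng)
        counts.append(len(populated_intersections(shuffled, min_sets=2)))
    return counts


# Hypothetical input data for illustration only.
genesets = {
    "GeneSet A": {"Gene A", "Gene B", "Gene C"},
    "GeneSet B": {"Gene A", "Gene B", "Gene D"},
    "GeneSet C": {"Gene A", "Gene E"},
}

observed = populated_intersections(genesets, min_genes=1, min_sets=2)
null_counts = run_permutations(genesets, max_permutations=1000, time_limit_s=5.0)
print(len(observed), "observed intersections;", len(null_counts), "permutations completed")
```

Because the loop checks both the permutation count and the deadline, the number of completed permutations is a maximum rather than a guarantee, which mirrors the behavior described for the permutation and time limit options. Comparing the observed count against `null_counts` is one simple way to ask whether the observed structure differs from chance.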