System Architecture

Overview

From a user’s perspective, the PATRIC systems are accessed primarily through the PATRIC website using a standard modern web browser. The PATRIC website server software is designed to be hosted by industry-standard application containers and is deployable in a number of different configurations. The server software and the client interface (browser application) rely upon a number of additional systems to build, deploy, and provide querying and analysis services to these applications.

Direct support for the website application is provided by a number of different databases and services. These services typically support the interactive capabilities of the website and its users. For example, Solr database instances provide all of the scientific querying capabilities against the PATRIC data. The PATRIC website application aggregates the data and capabilities of the PATRIC services and presents them interactively to the user.

Several other key components are critical to the PATRIC project but do not directly support the PATRIC website in production. These include the data analysis services, software, and databases used to collect, analyze, and annotate PATRIC data prior to release and deployment to the production PATRIC website. Software and scripts to manage the data and services, and to load and extract database content, are also required.

Software Architecture

The Software Architecture section of this document describes the general use and interaction of the components that make up the PATRIC website, together with the components that support it directly and indirectly. Some components of the architecture are third-party components; their architectures and deployment are not detailed here except where relevant to understanding the overall architecture described by the PATRIC Systems Documentation.

PATRIC Website

The Browser Application

The user’s web browser hosts the entire PATRIC website, which is more accurately described as a Web Application. Logically, the set of pages that the PATRIC Web Servers provide makes up the entirety of the PATRIC Web Application. A user’s state is maintained across all of these pages, so from the user’s perspective they are navigating through the interactive space of PATRIC. Each page can be considered to host one or more individual applications that communicate with the web server to provide an interactive experience.

The browser application is written in ECMAScript (JavaScript, JS) using DojoJS. It communicates with the server via HTTP requests (AJAX). The browser application is part of the PATRIC Server application and is intermingled with other content on pages generated by the PATRIC Server. However, the browser application runs in the user’s web browser, on a different network endpoint from the server; it may be restarted at any time (when a page reloads) and may be composed of “mashup” data from external sites. It therefore requires consideration independent of the server side of the PATRIC website.

The browser application fully supports common modern web browsers, with specific UI functionality degrading when the underlying browser does not support it. Rather than requiring all users to conform to a specific set of browsers, we prefer to provide the best support possible for modern browsers and to support older browsers via fallback mechanisms or degraded functionality. Browsers currently known to work are Chrome, Firefox, Safari, and IE7+. Some applications (pages) may require Flash for full functionality.

Source Code: https://github.com/PATRIC3/p3_web

Web Application Server

This component serves the web content to client browsers. It currently comprises an ExpressJS application running in a NodeJS web server. It serves HTML, CSS, JavaScript, and images to client browsers. The bulk of the user interface is implemented in the Browser Application, which itself is built upon the Dojo JavaScript library.

Source Code: https://github.com/PATRIC3/p3_web

Static Content

Static content refers to electronic documents containing website use cases and tutorials, command-line interface use cases and tutorials, user guides, and PATRIC news. These documents are served independently of the main web server software and are publicly accessible. The static content site provides an RSS feed, which the main website application consumes and displays on its front page. Files are converted to HTML using the Python-based Sphinx documentation generator. The source files are stored in the PATRIC GitHub repository.

Source Code: https://github.com/PATRIC3/p3_docs
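To illustrate how a consumer of the RSS feed mentioned above might extract news items for a front page, here is a minimal sketch using only the Python standard library. The function name, the sample feed, and its URLs are hypothetical; the actual website consumes the feed in its own JavaScript code, which is not shown in this document.

```python
import xml.etree.ElementTree as ET


def latest_items(rss_xml, limit=3):
    """Return the first `limit` items of an RSS 2.0 feed as title/link dicts."""
    root = ET.fromstring(rss_xml)
    items = []
    # RSS 2.0 places <item> elements inside a single <channel> element.
    for item in root.findall("./channel/item")[:limit]:
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


# Hypothetical feed content, used only to exercise the function.
SAMPLE_FEED = (
    "<rss version='2.0'><channel><title>News</title>"
    "<item><title>Release notes</title><link>http://example.org/a</link></item>"
    "<item><title>Workshop</title><link>http://example.org/b</link></item>"
    "</channel></rss>"
)

for entry in latest_items(SAMPLE_FEED):
    print(entry["title"], entry["link"])
```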

Workspace

The Workspace is an online document-based data store where data is organized into user-owned directories, analogous to Dropbox or Google Drive. Any top-level directory may be shared with multiple users to enable collaborative work on uploaded data.

Source Code: https://github.com/PATRIC3/Workspace

Workspace API:

The Workspace is connected to the rest of the PATRIC tools and website via a programmatic JSON RPC API.

The API has 11 commands:

  • create: allows for the creation of a directory or a data object itself
  • get: allows for retrieval of an object from the workspace
  • ls: list the objects present in a particular directory of the workspace
  • copy: copy an object from one location to another
  • delete: delete an object
  • set_permissions: set permissions on a top-level directory to share with another user
  • list_permissions: list permissions currently set for a top-level directory
  • get_download_url: allows for retrieval of a RESTful URL to download an object
  • get_archive_url: allows for retrieval of a RESTful URL to download an archive of multiple objects
  • update_metadata: allows for the manipulation of metadata associated with an object or directory in the Workspace
  • update_auto_meta: an internal function enabling the update of automated-metadata for an object

The associated resource is: https://p3.theseed.org/services/Workspace
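As an illustration of how a client might address the commands listed above, the following minimal sketch builds a JSON RPC request body for the ls command. The "Workspace." method prefix, the "1.1" version string, the shape of the parameters (a paths list), and the example path are assumptions made for illustration and are not confirmed by this document; consult the Workspace source repository for the actual interface.

```python
import itertools
import json

# Service URL taken from the text above.
WORKSPACE_URL = "https://p3.theseed.org/services/Workspace"

_request_ids = itertools.count(1)


def build_rpc_request(command, params):
    """Return a JSON RPC request body (as a dict) for a Workspace command.

    Assumption: methods are namespaced as "Workspace.<command>" and the
    service expects a "version" field of "1.1". Neither detail is stated
    in this document.
    """
    return {
        "id": str(next(_request_ids)),
        "method": f"Workspace.{command}",
        "version": "1.1",
        "params": [params],
    }


# Hypothetical parameters for listing a user's home directory.
request = build_rpc_request("ls", {"paths": ["/user@patricbrc.org/home"]})
body = json.dumps(request)
print(body)
```

In a real client, the serialized body would be sent as an HTTP POST to the Workspace URL, together with an authentication token when private data is involved.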

Data formats:

Objects of any type may be stored in the workspace, but most typically objects are simple text files, often stored in JSON format. Additionally, all objects are assigned a type (e.g., Genome, Model, FeatureSet), and this type indicates how the object is treated when viewed on the PATRIC website, as well as the handling of the object by automated processing scripts built into the workspace. The types accepted by the workspace are configurable and completely extensible.

Database structure:

The workspace uses MongoDB to store the directory structure, directory permissions, object lists, and object metadata. The objects themselves are stored either in Shock (typically for very large objects) or in a simple file-system. Because of its connection to Shock, the workspace supports federated data storage, which enables the handling of big data.

Object processing:

When an object is saved to the workspace, it always undergoes a processing step, the specific actions of which depend on the type of the object. This step computes automated metadata for the object to facilitate object query and summary, but it can also handle other tasks as needed (e.g., indexing in Solr).
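The type-dependent processing step can be pictured as a registry of handlers keyed by object type. The sketch below is an illustration only: the handler names, the metadata fields, and the assumption that a FeatureSet is JSON with a top-level "features" list are all hypothetical and do not describe the actual Workspace implementation.

```python
import json

# Registry mapping an object type to the function that computes its
# automated metadata. The handlers registered below are hypothetical.
PROCESSORS = {}


def processor(object_type):
    """Decorator that registers a processing function for one object type."""
    def register(func):
        PROCESSORS[object_type] = func
        return func
    return register


@processor("FeatureSet")
def summarize_feature_set(data):
    # Assumes the object is JSON with a top-level "features" list.
    return {"feature_count": len(json.loads(data).get("features", []))}


def default_processor(data):
    # Fallback for types without a dedicated handler: record the size only.
    return {"size_bytes": len(data.encode("utf-8"))}


def process_on_save(object_type, data):
    """Run the type-specific step and return the automated metadata."""
    handler = PROCESSORS.get(object_type, default_processor)
    return handler(data)


print(process_on_save("FeatureSet", '{"features": ["f1", "f2"]}'))
```

Because new types only require registering another handler, a design along these lines would be consistent with the configurable, extensible type system described under Data formats.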

Download service:

To support transparent and efficient downloading of data files from the workspace, the Download Service allows the PATRIC website to provide URL-based access to private files in the workspace. Access to these URLs does not require a password; to ensure privacy, the URLs contain un-guessable hashes and are valid only for a short time.
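The combination of un-guessable identifiers and a short validity window can be sketched as follows. This is not the Download Service implementation: the five-minute lifetime, the in-memory token store, and the function names are assumptions chosen for illustration.

```python
import secrets
import time

# Assumed value; the document only says the URLs are valid "for a short time".
TOKEN_LIFETIME_SECONDS = 300

# In-memory stand-in for whatever store a real service would use.
_tokens = {}


def issue_download_token(object_path, now=None):
    """Create an un-guessable token that grants access to one object."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe encoded
    _tokens[token] = (object_path, now + TOKEN_LIFETIME_SECONDS)
    return token


def resolve_download_token(token, now=None):
    """Return the object path for a valid token, or None if unknown or expired."""
    now = time.time() if now is None else now
    entry = _tokens.get(token)
    if entry is None:
        return None
    object_path, expires_at = entry
    if now >= expires_at:
        del _tokens[token]  # expired tokens are discarded
        return None
    return object_path
```

A download URL would then embed the token in its path, so that possession of the URL is the only credential needed until the token expires.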

Data API

The Data API provides querying, retrieval, and indexing of public PATRIC data and of private annotated data. It exposes a REST interface to the rich data PATRIC provides. Data can be retrieved directly by ID, or queried using either Resource Query Language (RQL) syntax or Solr syntax. As queries are submitted to the API, they are modified and forwarded to the backend data sources (Solr) so that only data visible to the user is retrieved. Users are able to view public data, any data they own, and any data that another user has shared with them.

Source Code: https://github.com/PATRIC3/p3_api
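The query modification described above, which limits results to data the requesting user may see, can be sketched as adding a filter query to the Solr request parameters. The field names "public", "owner", and "user_read" and the example user ID are assumptions for illustration; the actual rewriting performed by the API is not specified in this document.

```python
def visibility_filter(user_id=None):
    """Build a Solr filter clause limiting results to data the user may see.

    The field names "public", "owner", and "user_read" are assumed.
    """
    clauses = ["public:true"]  # anonymous users see public data only
    if user_id is not None:
        # Quote the ID and escape characters that would end the quoted phrase.
        escaped = user_id.replace("\\", "\\\\").replace('"', '\\"')
        quoted = f'"{escaped}"'
        clauses.append(f"owner:{quoted}")      # data the user owns
        clauses.append(f"user_read:{quoted}")  # data shared with the user
    return " OR ".join(clauses)


def restrict_query(params, user_id=None):
    """Return a copy of Solr request parameters with the filter appended."""
    restricted = dict(params)
    filters = list(restricted.get("fq", []))
    filters.append(visibility_filter(user_id))
    restricted["fq"] = filters
    return restricted


print(restrict_query({"q": "genus:Brucella"}, "alice@patricbrc.org"))
```

Placing the restriction in a filter query rather than in the main query leaves the user's own query text untouched, which is one reasonable way to keep the rewriting transparent.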

Data API:

The data API has two functions for each data type:

  • get()
  • query()

The associated resources are, respectively:

In addition to the API for querying and retrieving data, there is also an API endpoint for submitting new data to the system to be indexed in the database.

Command-line Interface (CLI)

PATRIC is an integration of different types of data and software tools that support research on bacterial pathogens. The typical biologist seeking access to the PATRIC data and tools will usually explore the web-based user interface. However, there are many instances in which programmatic or command-line interfaces are more suitable, especially for querying data or submitting jobs in batch mode. For users who want command-line access to PATRIC, we provide the tools described in this document, which we call the P3-scripts. They are intended to run on your machine and access the services provided by PATRIC over the network.

Source Code and Client Application: https://github.com/PATRIC3/PATRIC-distribution/

Currently, the following commands are available to the community:

p3-abstract-clusters      p3-get-feature-sequence       p3-put-genome-group
p3-aggregate-sss          p3-get-features-by-sequence   p3-rast
p3-aggregates-to-html     p3-get-genome-contigs         p3-related-by-clusters
p3-all-drugs              p3-get-genome-data            p3-rep-prots
p3-all-genomes            p3-get-genome-drugs           p3-rm
p3-blast                  p3-get-genome-expression      p3-rmdir
p3-build-kmer-db          p3-get-genome-features        p3-role-matrix
p3-closest-seqs           p3-get-genome-group           p3-sequence-profile
p3-co-occur               p3-gto                        p3-set-to-relation
p3-collate                p3-gto-dna                    p3-signature-clusters
p3-config                 p3-gto-fasta                  p3-signature-families
p3-count                  p3-gto-scan                   p3-signature-peginfo
p3-count-families         p3-head                       p3-similar-proteins-by-blast
p3-cp                     p3-identical-dna              p3-similar-proteins-by-family
p3-drug-amr-data          p3-identical-proteins         p3-sort
p3-echo                   p3-identify-clusters          p3-stats
p3-extract                p3-inAandB                    p3-submit-genome-annotation
p3-extract-gto            p3-inAnotB                    p3-submit-genome-assembly
p3-feature-gap            p3-inAorB                     p3-tbl-to-fasta
p3-feature-upstream       p3-job-status                 p3-tbl-to-html
p3-file-filter            p3-join                       p3-tests
p3-find-couples           p3-kmer-compare               p3-whoami
p3-find-features          p3-list-feature-groups
p3-find-in-clusters       p3-list-genome-groups
p3-format-results         p3-login
p3-function-to-role       p3-logout
p3-generate-close-roles   p3-ls
p3-generate-clusters      p3-mass-cluster-run
p3-genome-amr-data        p3-match
p3-genome-fasta           p3-merge
p3-genus-species          p3-mkdir
p3-get-contig-data        p3-pick
p3-get-drug-genomes       p3-project-subsystems
p3-get-family-data        p3-put-feature-group
p3-get-family-features
p3-get-feature-data
p3-get-feature-group

Databases

PATRIC data is stored in Solr and indexed in its entirety (all fields) as PATRIC releases data. Solr then provides read-only search services to both the server and browser sides of PATRIC via HTTP requests. A standard Solr 6 installation can host the PATRIC data, but Solr can be deployed in a number of different ways, and the choice can have a dramatic impact on performance for many PATRIC activities. The performance of the Solr service is heavily memory dependent. At a minimum, the entire set of data indexes should fit into memory. Caches and other tunable parameters can require additional memory. In any deployment, the physical memory available is likely to be one of the key factors defining Solr configuration and performance.

Source Code: https://github.com/PATRIC3/patric_solr_cloud
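To make the memory constraint above concrete, the following sketch estimates how much memory remains for index data once the Solr JVM heap and an operating system reserve are set aside. It assumes a simplified model in which index files are served from memory outside the JVM heap. The index sizes, heap size, and reserve are illustrative assumptions, not measurements from the PATRIC deployment; only the 760 GB RAM figure comes from the Hardware Deployment section.

```python
def memory_headroom_gb(index_sizes_gb, ram_gb, jvm_heap_gb, os_reserve_gb=4.0):
    """Return the memory (GB) left after all indexes are held in memory.

    A negative result means the indexes cannot all fit, so some index
    reads would have to go to disk.
    """
    available_for_indexes = ram_gb - jvm_heap_gb - os_reserve_gb
    return available_for_indexes - sum(index_sizes_gb)


# Hypothetical index sizes for three collections on one 760 GB server.
headroom = memory_headroom_gb([120.0, 80.0, 40.0], ram_gb=760, jvm_heap_gb=31)
print(f"headroom: {headroom} GB")
```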

User Service

The user service provides user profile management and authentication for the PATRIC system. The user system provides a REST interface to read and modify a user’s profile. It also provides authentication services for the PATRIC web application and related components. The backend services consume authentication tokens that are generated by the user service.

Source Code: https://github.com/PATRIC3/p3_user

Web/Proxy Server

All PATRIC websites and web applications run behind a web server, which is used to host static files, proxy requests to the underlying application servers, and in some cases balance load among web server instances. This component is not strictly required for a basic deployment of the PATRIC infrastructure, but it greatly simplifies deployment and is the current method used for load balancing. NGINX is deployed on hosts with websites on the standard HTTP and HTTPS ports (80, 443), while the underlying applications are deployed on unused ports. NGINX is then configured to proxy requests to these local applications using its name-based virtual hosting.
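The routing decision behind name-based virtual hosting can be illustrated with a small sketch: the proxy inspects the Host header of each request and forwards it to the local port of the matching application. The host names and port numbers below are hypothetical, and this Python function only models the lookup; in the actual deployment this is done by the web server's own configuration.

```python
# Hypothetical mapping from public host name to the local port of the
# application that serves it. Real host names and ports will differ.
UPSTREAMS = {
    "www.example.org": 3000,
    "api.example.org": 3001,
    "user.example.org": 3002,
}


def select_upstream(host_header):
    """Return the local URL a request should be proxied to, or None."""
    # Strip an optional ":port" suffix and normalise case, since host
    # names are case-insensitive.
    hostname = host_header.split(":", 1)[0].strip().lower()
    port = UPSTREAMS.get(hostname)
    if port is None:
        return None
    return f"http://127.0.0.1:{port}"


print(select_upstream("api.example.org"))
```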

App Service

The PATRIC resource supports a number of computational services (e.g., genome assembly and annotation, and model production). These services are hosted on an extensible set of computational resources at Argonne. The App Service is the interface between the user’s interaction with the PATRIC website and these computational resources. It presents a unified view of all supported services, allowing the user to submit requests, monitor progress, and view results within a common framework on the PATRIC website. For developers, the App Service enables the development of new applications without the need to handle the details of process execution and management.

Source Code: https://github.com/PATRIC3/app_service

App Service API:

The App Service is connected to the rest of the PATRIC tools and website via a programmatic JSON RPC API. The API has 6 commands:

  • enumerate_apps
  • start_app
  • query_tasks
  • query_task_summary
  • query_task_details
  • enumerate_tasks

The associated resource is: https://p3.theseed.org/services/app_service
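The submit, monitor, and view-results workflow described above implies that a client polls for task status after starting an app. The sketch below shows one way such a polling loop might look. The status names, the layout of the task record, and the function itself are assumptions for illustration; the callable passed in stands for a wrapper around the query_tasks command, whose actual response format is not given in this document.

```python
import time

# Assumed status names; the real service may use different values.
TERMINAL_STATES = {"completed", "failed", "deleted"}


def wait_for_task(query_tasks, task_id, poll_seconds=30.0, max_polls=120,
                  sleep=time.sleep):
    """Poll until the task reaches a terminal state and return its status.

    `query_tasks` is any callable that takes a list of task ids and returns
    a dict mapping each id to a record with a "status" field. In a real
    client it would wrap the query_tasks command of the App Service.
    """
    for _ in range(max_polls):
        record = query_tasks([task_id])[task_id]
        if record["status"] in TERMINAL_STATES:
            return record["status"]
        sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")


# Stand-in for the remote call, used only to exercise the loop.
_demo_states = iter(["queued", "in-progress", "completed"])


def fake_query_tasks(task_ids):
    return {task_ids[0]: {"status": next(_demo_states)}}


print(wait_for_task(fake_query_tasks, "task-1", sleep=lambda seconds: None))
```

Passing the query function and the sleep function as arguments keeps the loop independent of any particular transport and makes it straightforward to test without network access.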

Hardware Deployment

The hardware supporting the PATRIC services, hosted at Argonne National Laboratory on behalf of the University of Chicago’s bioinformatics computing core, is as follows:

  • Production support services
    • 24 x E5-2620 CPUs
    • 256 GB RAM
  • Production support services
    • 40 x E5-2640 CPUs
    • 768 GB RAM
  • User Data Management and Compute Scheduling
    • 12 x E5-2620 CPUs
    • 256 GB RAM
  • Solr Cloud servers (x3)
    • 32 x Xeon Gold 6134 CPUs
    • 760 GB RAM
    • 5.3 TB SSD storage
  • ARAST Server and Primary Compute
    • 12 x E5-2620 CPUs
    • 256 GB RAM
  • Compute server
    • 12 x E5-2620 CPUs
    • 256 GB RAM
  • Compute servers (x3)
    • 32 x Xeon Gold 6134 CPUs
    • 786 GB RAM
  • Loadbalanced / Failover Proxy Server
    • 2 systems, each with 4 CPUs, 64 GB RAM, 10 Gb network

Storage is provided to the above systems through Fibre Channel SAN storage. The Solr portion of PATRIC and the FTP site currently consume approximately 10 TB of storage.