Systems Architecture

Overview

From a user’s perspective, the PATRIC systems are accessed primarily through the PATRIC website using a standard modern web browser. The PATRIC website server software is designed to be hosted by industry-standard application containers and can be deployed in a number of different configurations. Both the server software and the client interface (the browser application) rely upon a number of additional systems to build, deploy, and provide querying and analysis services to these applications.

Direct support for the website application is provided by a number of different databases and services. These services typically support the interactive capabilities of the website and its users. For example, Solr database instances provide all of the scientific querying capabilities against the PATRIC data. The PATRIC website and application aggregate the data and capabilities of the PATRIC services and present them interactively to the user.

Several other key components are critical to the PATRIC project but do not directly support the PATRIC website itself in production. These include the data analysis services, software, and databases used to collect, analyze, and annotate PATRIC data prior to release and deployment to the production PATRIC website. Additionally, software and scripts are required to manage the data, the services, and database loading and extraction.

Software Architecture

The Software Architecture section of this document describes the general use and interaction of the components that make up the PATRIC website and its direct and indirect components. Some components of the architecture are third-party components and their architectures and deployment will not be detailed here except where relevant to the understanding of the overall architecture described by the PATRIC Systems Documentation.

PATRIC Website

The Browser Application

The user’s web browser is the host of the entire PATRIC website, which is more accurately described as a Web Application. Logically, the set of pages that the PATRIC Web Servers provide makes up the entirety of the PATRIC Web Application. A user’s state is maintained across all of these pages at any one time, and from the user’s perspective they are navigating through the interactive space of PATRIC. Each page can be considered to be hosting one or more individual applications that communicate with the web server to provide an interactive experience.

The browser application is written in ECMAScript (JavaScript, JS) with DojoJS. It communicates with the server via HTTP requests (AJAX). The browser application is part of the PATRIC Server application and is intermingled with other content on pages generated by the PATRIC Server. However, the browser application runs in the user’s web browser, on a different network endpoint from the server; it may be restarted at any time (when a page reloads); and it may be composed of “mashup” data from external sites. It therefore requires consideration independent of the server side of the PATRIC website.

The browser application fully supports common modern web browsers, with specific UI functionality degrading when the underlying browser does not support it. Instead of requiring all users to conform to a specific set of browsers to use the website at all, we prefer to provide the best support possible for modern browsers, and to support older browsers via fallback mechanisms or degraded functionality. Browsers currently known to work are Chrome, Firefox, Safari, and IE7+. Some applications (pages) may require Flash for full functionality.

Web Application Server

This component serves the web content to client browsers. It currently consists of an ExpressJS application running on a NodeJS web server. It serves HTML, CSS, JavaScript, and images to client browsers. The bulk of the user interface is implemented in JavaScript, which is itself built upon the Dojo JavaScript library.

Static Content

Static content refers to electronic documents containing the website tutorial, the command line interface tutorial, user guides, and PATRIC eNews. These documents are served independently of the main web server software and are publicly accessible. The static content site provides an RSS feed, which the main website application consumes and displays on its front page. Files are converted to HTML using the Python-based Sphinx documentation generator and are stored in the PATRIC GitHub repository.

Workspace

Description:

The Workspace is an online document-based data store where data is organized into user-owned directories, analogous to Dropbox or Google Drive. As in those services, any top-level directory may be shared with multiple users to enable collaborative work on uploaded data.

API:

The Workspace is connected to the rest of the PATRIC tools and website via a programmatic JSON RPC API.

The API has 11 commands:

  • create: allows for the creation of a directory or a data object itself
  • get: allows for retrieval of an object from the workspace
  • ls: list the objects present in a particular directory of the workspace
  • copy: copy an object from one location to another
  • delete: delete an object
  • set_permissions: set permissions on a top-level directory to share with another user
  • list_permissions: list permissions currently set for a top-level directory
  • get_download_url: allows for retrieval of a RESTful URL to download an object
  • get_archive_url: allows for retrieval of a RESTful URL to download an archive of multiple objects
  • update_metadata: allows for the manipulation of metadata associated with an object or directory in the Workspace
  • update_auto_meta: an internal function enabling the update of automated-metadata for an object
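
As an illustration of how a client might address these commands, the following is a minimal sketch of building a JSON RPC request body. The source states only that the API is JSON RPC; the "Workspace.<command>" method naming, the "version" field, the single parameter object, and the example path and parameter name are all assumptions made for illustration.

```python
import itertools
import json

# Assumption: methods are namespaced as "Workspace.<command>" and each call
# takes a single parameter object; the source only says "JSON RPC".
_request_ids = itertools.count(1)


def build_workspace_request(command, params):
    """Return a JSON RPC request body for one Workspace command."""
    return {
        "id": str(next(_request_ids)),
        "method": f"Workspace.{command}",
        "version": "1.1",
        "params": [params],
    }


def encode(request):
    """Serialize the request as bytes suitable for an HTTP POST body."""
    return json.dumps(request).encode("utf-8")


# Hypothetical path and parameter name, for illustration only.
ls_request = build_workspace_request("ls", {"paths": ["/user@example.org/home"]})
```

The encoded body would then be sent by HTTP POST to the Workspace service endpoint.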

The associated resource is:

Data formats:

Objects of any type may be stored in the workspace, but most typically objects are simple text files, often stored in JSON format. Additionally, all objects are assigned a type (e.g., Genome, Model, FeatureSet), and this type indicates how the object is treated when viewed on the PATRIC website, as well as the handling of the object by automated processing scripts built into the workspace. The types accepted by the workspace are configurable and completely extensible.

Database structure:

The workspace uses MongoDB to store the directory structure, directory permissions, object lists, and object metadata. The objects themselves are stored either in Shock (typically for very large objects) or in a simple file-system. Because of its connection to Shock, the workspace supports federated data storage, which enables the handling of big data.

Object processing:

When an object is saved to the workspace, it always undergoes a processing step, the specific actions of which depend on the type of the object. This step computes automated metadata for the object to facilitate object query and summary, but it can also handle other tasks as needed (e.g., indexing in Solr).
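
One way to picture type-dependent processing is a registry that maps each object type to a handler. The sketch below is not the Workspace implementation; the registry, the handler names, and the assumed JSON layout of a FeatureSet are hypothetical.

```python
import json

# Hypothetical registry mapping an object type to a function that derives
# automated metadata from the raw object bytes.
AUTO_META_HANDLERS = {}


def handles(object_type):
    """Register a metadata handler for one object type."""
    def register(func):
        AUTO_META_HANDLERS[object_type] = func
        return func
    return register


@handles("FeatureSet")
def feature_set_meta(data):
    # Assumption: a FeatureSet is a JSON document with a "features" list.
    return {"feature_count": len(json.loads(data).get("features", []))}


def process_on_save(object_type, data):
    """Compute automated metadata; every object gets at least its size."""
    meta = {"size": len(data)}
    handler = AUTO_META_HANDLERS.get(object_type)
    if handler is not None:
        meta.update(handler(data))
    return meta
```

Supporting a new type then only requires registering one more handler, which is in the spirit of the configurable, extensible type list described under Data formats.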

Download service:

To support transparent and efficient downloading of data files from the workspace, the Download Service allows the PATRIC website to provide URL-based access to private files in the workspace. Access to these URLs does not require a password; to ensure privacy, the URLs contain un-guessable hashes and are valid only for a short time.
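
The idea of short-lived, un-guessable download URLs can be sketched as follows. This is an illustration rather than the Download Service code: it uses random tokens from Python's `secrets` module in place of hashes, keeps them in memory, and assumes a five-minute lifetime, none of which is specified in the source.

```python
import secrets
import time

TOKEN_LIFETIME_SECONDS = 300  # assumption: the source only says "a short time"


class DownloadTokens:
    """In-memory store mapping random tokens to workspace paths."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._tokens = {}  # token -> (workspace path, expiry timestamp)

    def issue(self, path):
        # 32 random bytes from the OS CSPRNG make the token impractical to guess.
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (path, self._clock() + TOKEN_LIFETIME_SECONDS)
        return token

    def resolve(self, token):
        """Return the path for a live token, or None if unknown or expired."""
        entry = self._tokens.get(token)
        if entry is None:
            return None
        path, expires_at = entry
        if self._clock() >= expires_at:
            del self._tokens[token]
            return None
        return path
```

The token would be embedded in the download URL, so possession of the URL is the only credential, and the expiry limits the damage if a URL leaks.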

Data API

The data API provides querying, retrieval, and indexing of public PATRIC data and of private annotated data. It exposes a REST interface to the rich data PATRIC provides. Data can be retrieved directly by ID, or it can be queried using either Resource Query Language (RQL) syntax or Solr syntax. As queries are submitted to the API, they are modified and then forwarded to the backend data sources (Solr) so that only data visible to the user is returned. Users can view public data, any data they own, and any data that another user has shared with them.
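
The query modification step can be pictured as appending a visibility filter to each Solr request. The sketch below assumes hypothetical field names (`public`, `owner`, `user_read`) and a plain dictionary of Solr parameters; the actual schema and rewriting logic are not described in this document.

```python
def visibility_filter(user_id=None):
    """Build a Solr filter query limiting results to what a user may see."""
    clauses = ["public:true"]
    if user_id is not None:
        # Quote the value so characters such as "@" are treated literally.
        quoted = '"' + user_id.replace("\\", "\\\\").replace('"', '\\"') + '"'
        clauses.append(f"owner:{quoted}")
        clauses.append(f"user_read:{quoted}")
    return " OR ".join(clauses)


def rewrite_query(params, user_id=None):
    """Return a copy of the Solr params with the visibility filter appended."""
    rewritten = dict(params)
    rewritten["fq"] = list(params.get("fq", [])) + [visibility_filter(user_id)]
    return rewritten
```

An anonymous request is restricted to public records, while an authenticated request also matches records the user owns or has been granted read access to.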

API:

The data API has two functions for each data type:

  • get()
  • query()

The associated resources are, respectively:

In addition to the API for querying and retrieving data, there is also an API endpoint for submitting new data to the system to be indexed in the database.

The data API is now available through a command line interface (CLI). Currently, the following commands are available to the community:

  • p3-abstract-clusters
  • p3-all-genomes
  • p3-config
  • p3-echo
  • p3-extract
  • p3-get-family-data
  • p3-get-family-features
  • p3-get-feature-group
  • p3-get-genome-data
  • p3-get-genome-features
  • p3-get-genome-group
  • p3-list-feature-groups
  • p3-list-genome-groups
  • p3-login
  • p3-logout
  • p3-match
  • p3-pick
  • p3-put-feature-group
  • p3-put-genome-group
  • p3-related-by-clusters
  • p3-signature-clusters
  • p3-signature-families
  • p3-signature-peginfo
  • p3-whoami

Databases

PATRIC data is stored in Solr and indexed in its entirety (all fields) as PATRIC releases data. Solr then provides read-only search services to both the server and browser sides of the PATRIC website via HTTP requests. A standard Solr 6 installation can host the PATRIC data, but Solr can be deployed in a number of different ways, and the choice can have a dramatic impact on performance for many PATRIC activities.

The performance of the Solr service is heavily memory dependent. It is important, at a minimum, to be able to fit the entire set of data indexes into memory. Additionally, cache and other such tunable parameters can require additional memory. In any deployment, this physical limitation of the available resources is likely to be one of the key defining factors for Solr configuration and performance.
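
The sizing constraint amounts to simple arithmetic, sketched below. The function and the operating system reserve default are illustrative assumptions; the split into index size and JVM heap reflects the fact that Solr index files are normally served from the operating system page cache rather than from the Java heap.

```python
def solr_memory_headroom_gb(ram_gb, index_gb, jvm_heap_gb, os_reserve_gb=8):
    """Return RAM left after the index (page cache), JVM heap, and OS reserve.

    A negative result means the indexes cannot be held fully in memory.
    """
    return ram_gb - (index_gb + jvm_heap_gb + os_reserve_gb)


def index_fits_in_memory(ram_gb, index_gb, jvm_heap_gb, os_reserve_gb=8):
    """True when the whole index can stay resident alongside heap and OS."""
    return solr_memory_headroom_gb(ram_gb, index_gb, jvm_heap_gb, os_reserve_gb) >= 0
```

For instance, with hypothetical figures of 384 GB RAM, a 300 GB index, and a 32 GB heap, the headroom is 44 GB; any growth in caches or other tunable parameters has to come out of that remainder.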

User Service

The user service provides user profile management and authentication for the PATRIC system. The user system provides a REST interface to read and modify a user’s profile. It also provides authentication services for the PATRIC web application and related components. The backend services consume authentication tokens that are generated by the user service.

Web/Proxy Server

All PATRIC websites and web applications run behind a web server that hosts static files, proxies requests to the underlying application servers, and in some cases balances load among web server instances. This component is not strictly required to deploy the PATRIC infrastructure in its basic form, but it greatly simplifies deployment and is the current method used for load balancing.

NGINX is deployed on hosts with websites on the standard HTTP and HTTPS ports (80 and 443), while the underlying applications are deployed on unused ports. NGINX is then configured to proxy requests to these local applications using its name-based virtual hosting.
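
The routing decision that name-based virtual hosting makes can be modeled in a few lines. This is a conceptual sketch in Python, not NGINX configuration, and the hostnames and ports are hypothetical.

```python
# Hypothetical hostnames and ports; the real mapping lives in the NGINX config.
UPSTREAMS = {
    "www.example.org": ("127.0.0.1", 3000),
    "api.example.org": ("127.0.0.1", 3001),
}
DEFAULT_UPSTREAM = ("127.0.0.1", 3000)


def select_upstream(host_header):
    """Pick the local application address from the request's Host header."""
    # Drop an optional ":port" suffix and normalize case before the lookup.
    hostname = host_header.split(":", 1)[0].strip().lower()
    return UPSTREAMS.get(hostname, DEFAULT_UPSTREAM)
```

Every site shares ports 80 and 443 on the public side, and the requested hostname alone determines which local application port receives the proxied request.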

App Service

The PATRIC resource supports a number of computational services (e.g., genome assembly and annotation, model production). These services are hosted on an extensible set of computational resources at Argonne. The interface between the user’s interaction with the PATRIC website and the computational resources is called the App Service. The App Service presents a unified view of all supported services, allowing the user to submit requests, monitor progress, and view results within a common framework on the PATRIC website. For developers, the App Service enables the development of new applications without the need to handle the details of process execution and management.

API:

The App Service is connected to the rest of the PATRIC tools and website via a programmatic JSON RPC API.

The API has 6 commands:

  • enumerate_apps
  • start_app
  • query_tasks
  • query_task_summary
  • query_task_details
  • enumerate_tasks

The associated resource is:

  • https://p3.theseed.org/services/app_service
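
The monitor-progress pattern can be sketched as a polling loop over `query_tasks`. Only the command name and the service URL come from this document; the "AppService.<command>" method naming, the shape of the result (a mapping from task ID to a record with a "status" field), and the status names are assumptions.

```python
import time

APP_SERVICE_URL = "https://p3.theseed.org/services/app_service"


def rpc_body(command, params):
    # Assumption: commands are namespaced as "AppService.<command>".
    return {"id": "1", "version": "1.1",
            "method": f"AppService.{command}", "params": params}


def wait_for_task(call, task_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll query_tasks until the task reaches a final state.

    `call` is any function that sends a request body to APP_SERVICE_URL and
    returns the decoded result; it is injected so the loop can run offline.
    """
    final_states = {"completed", "failed"}  # hypothetical status names
    for _ in range(max_polls):
        result = call(rpc_body("query_tasks", [[task_id]]))
        status = result[task_id]["status"]
        if status in final_states:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not finish")
```

Injecting the transport keeps the loop independent of any particular HTTP library and makes it straightforward to exercise with a stub in place of the live service.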

Hardware Deployment

The hardware hosted at Argonne National Laboratory on behalf of the University of Chicago’s bioinformatics computing core to support the PATRIC services is as follows:

  • Production support services
    • 12 x E5-2620 CPUs
    • 256 GB RAM
  • Production support services
    • 12 x E5-2620 CPUs
    • 256 GB RAM
  • User Data Management and Compute Scheduling
    • 12 x E5-2620 CPUs
    • 256 GB RAM
  • Solr server
    • 160 CPUs
    • 1.5 TB RAM
    • 4.4TB SSD storage
  • ARAST Server and Primary Compute
    • 12 x E5-2620 CPUs
    • 256 GB RAM
  • Compute server
    • 12 x E5-2620 CPUs
    • 256 GB RAM
  • Load balanced / Failover Proxy Server
    • 2 systems, each 4 CPUs, 64GB RAM, 10Gb network

The main server hardware supporting this application at Virginia Tech is as follows:

  • Primary SOLR Server
    • 48 Cores
    • 384 GB RAM
  • Secondary SOLR Server
    • 16 Cores
    • 48 GB RAM
  • Private Cloud Infrastructure
    • Head Nodes
      • 4 Nodes * 32 cores
      • 128 GB RAM
    • Hypervisors
      • 16 Nodes * 16 Cores (256 Cores)
      • 1024 GB RAM

Storage is provided to the above systems through Fibre Channel SAN storage. The SOLR portion of PATRIC and the FTP site are currently consuming approximately 10 TB of storage.