Generic Runsheet Generation Guide
Integration Overview
Many instruments require, or can intake, a “runsheet” or “sample manifest” file. Often, these files are simple plain-text files that just need a few values plugged in dynamically. The “RunSheet2” API in the client/content is designed to facilitate this process in a way that is flexible but still acceptably performant.
Rationale
Although it is possible to create a basic runsheet using python and espclient directly, the generic runsheet generation facilities were written to:
Ensure that clients have the ability to change how runsheet data is populated via the UI (configuration) rather than needing to modify a python script.
Ensure best practices around data fetch: one-off python scripts often access data inefficiently instead of using bulk data fetch. For instance, calling the expression column_value_for_uuid with a single entity UUID at a time instead of calling it once for all entities.
Ensure best practices around edge-case handling, such as the same entity being present more than once in a single worksheet (i.e. from different experiments). These sorts of use cases are frequently missed when writing one-offs and are often not caught during testing, but are correctly handled by the generic runsheet facilities.
Integration Requirements
Internal Requirements
espclient installed in the environment that will be generating the runsheet
Fundamentals
The general purpose runsheet support consists of three pieces:
The esp.data.access API for configurable random data access to ESP
A configuration file that defines how to lay out the runsheet
The espclient.runsheet module and the classes contained therein for actually generating the runsheet
esp.data.access
The esp.data.access module provides a functional API for flexible and reasonably efficient[1] random access data retrieval. Under the hood, it uses the standard espclient OO models for much of the heavy data lifting, but also makes use of server-side expressions and queries for batched data lookups.
The two primary entry points into the module are the build_accessor function and the esp.data.access.EntityDataAccessor protocol[2]. For full details, see the python documentation for that module, but briefly:
build_accessor receives an “accessor string” and returns an object that is a subclass of, or compatible with, esp.data.access.EntityDataAccessor. The accessor strings allow you to declare a type of lookup you eventually want to perform on a set of entities. For example:
  build_accessor('tag:esp:qcstatus') will return an accessor equivalent to {{tagged_value(['esp:qcstatus'])}} in LIMS.
  build_accessor('sheet:QC Results.QC Status') will return an accessor equivalent to {{cell('QC Status', 'QC Results')}}.
The accessor string has a full BNF grammar documented in esp.data.access, but additional operations include:
  The ability to “chain” lookups, where the first non-null value will be the resolved value. E.g.: tag:mytag;fixed:N/A will use the tagged value mytag if found, but if no value is found or the value is null, it will “fall back” to the fixed string “N/A”.
  The ability to transform lookup values. E.g.: tag:mytag|null_to_empty|strip will transform a null value to the empty string and also ensure that leading/trailing whitespace is trimmed off.
  The ability to resolve different “generations” of entities for entity-centric lookups. E.g.: tag:mytag@@Patient
  The ability to use a specific sample to resolve a particular accessor even when resolution would normally be tracked to a different sample/set of samples. E.g.: slm>>>sheet:Run Info.Run Name would pull the value from the Run Name field of the Run Info protocol, but it would pull it for the sample in the slm sample set.
esp.data.access.EntityDataAccessor defines a protocol for fetching (and in some cases setting) data on a batch of entities. The primary method of an accessor is get, as:

    from esp.data.access import build_accessor
    from esp.models import Sample

    myentities = Sample.search(names=['entity1', 'entity2'])
    accessor = build_accessor("sampleinfo:name")
    mynames = accessor.get(myentities)
    print(mynames)  # ['entity1', 'entity2']

The get method may also be called with a single entity, as accessor.get(my_entity), and will return data in the same shape as the input arguments (i.e.: pass a single entity, receive a single value; pass a list of entities, receive a list of values of the same length as the input list). The data accessor API is used heavily by the general runsheet machinery but has many additional uses. For instance, the default bartender integration uses it to fill in the csv “database” of entity data needed for a given label template.
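The chaining and shape-preserving behaviours described above can be illustrated without an ESP server. The following is a minimal pure-Python sketch, not the esp.data.access implementation: ChainAccessor, tag, and fixed are hypothetical names, and entities are modeled as plain dictionaries.

```python
from typing import Any, Callable, List, Optional, Sequence, Union

Lookup = Callable[[dict], Optional[str]]


def tag(name: str) -> Lookup:
    """Hypothetical lookup: read a value from an entity's tag dictionary."""
    return lambda entity: entity.get("tags", {}).get(name)


def fixed(value: str) -> Lookup:
    """Hypothetical lookup: always return the same string."""
    return lambda entity: value


class ChainAccessor:
    """Toy accessor that resolves the first non-null lookup per entity."""

    def __init__(self, *lookups: Lookup) -> None:
        self.lookups = lookups

    def _resolve(self, entity: dict) -> Optional[str]:
        for lookup in self.lookups:
            value = lookup(entity)
            if value is not None:
                return value
        return None

    def _get_list(self, entities: Sequence[dict]) -> List[Optional[str]]:
        # A real accessor would fetch values for all entities in bulk here.
        return [self._resolve(entity) for entity in entities]

    def get(self, entities: Union[dict, Sequence[dict]]) -> Any:
        if isinstance(entities, dict):
            # Single entity in, single value out.
            return self._get_list([entities])[0]
        # List in, list of the same length out.
        return self._get_list(list(entities))


# Roughly the behaviour of the accessor string `tag:mytag;fixed:N/A`.
accessor = ChainAccessor(tag("mytag"), fixed("N/A"))
tagged = {"name": "entity1", "tags": {"mytag": "PASS"}}
untagged = {"name": "entity2", "tags": {}}

single_value = accessor.get(untagged)
list_values = accessor.get([tagged, untagged])
print(single_value)
print(list_values)
```

Keeping the per-entity work inside a single _get_list call is what leaves room for a bulk fetch; the single-entity form of get is only a convenience wrapper around it.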
Out-of-box data accessors
The following data accessors are available out-of-box. The “Accessor” column lists the prefix used for build_accessor and the data.access class name of the accessor in parentheses. A * after the prefix indicates that the @@ suffix is supported for the accessor prefix, which is used to fetch data from a generation other than the entity passed to get. Unless otherwise noted, the value after @@ supports anything available via the entity_generation expression.
Accessor | Description | Examples |
---|---|---|
| Get primary entity properties from Entity objects. Supported properties: | |
(SampleFieldValue) Alias: | Get the value of an entity’s custom field. Note that fields are currently resolved by field name, not field ID. Formally, fields are resolved via the ESP Client’s “entity.variables” property, which currently (ESP <= 3.0) uses the field name as the dictionary key. | |
(ProtocolValue) | Get the most recently set non-null value for a specific field of a specific protocol. Uses | |
(SheetValue) | Retrieve a value from the active SampleSheet. This accessor attempts to resolve the appropriate row given the input sample data + the protocol of the sheet. If the same entity is present multiple times from different experiments, this accessor will raise an Error unless | |
| Get the most recently set, non-null value based on one or more tags (relies on | |
| Get a property of the active SampleSheet. Supported properties: | |
| Get a property of the active Experiment. Supported properties: | |
| Get values from metadata about the active user. Supported properties: | |
| Get a fixed value/hard-coded string | |
| Get a value from a param group | |
| Return values by evaluating L7\|ESP expressions on the server. The expression must return one of: The data accessor processing will automatically add the | |
| Given an entity and the name of a Workflow, return information from the most recent matching WorkflowInstance for the Sample+Workflow. The difference between WorkflowInstanceValue and ExperimentInfoValue is similar to the difference between ProtocolValue and SheetValue, respectively. That is, ExperimentInfoValue is tied to a specific SampleSheet whereas WorkflowInstanceValue is the most recently created workflow instance for the entities with matching workflow name. Otherwise, they are functionally equivalent (all the same properties are available). Note: this accessor may be used for prototyping purposes but should not be used in production as the implementation has not yet been optimized. | |
Custom data accessors
It is possible to write your own data accessors that tie back into the rest of the runsheet and data accessor machinery. Full details are documented in the esp.data.access module and are beyond the scope of this document, but a basic accessor can be implemented quickly, with the only required methods being from_string and _get_list:

    import esp.data.access as access
    from esp.models import Entity  # 3.0. Use Sample if you need 2.4 + 3.0 compatibility.


    @access.dataaccessor('divider')
    class ExampleDataAccessor(access.EntityDataAccessor):
        """
        Simple data accessor that makes a divider string.
        """

        def __init__(self, char, mult, *args, **kwargs):
            self.char = char
            self.mult = mult
            # IMPORTANT - YOU MUST CALL THE SUPER init! (it's also just good practice,
            # but for data accessors, it is mandatory)
            super(ExampleDataAccessor, self).__init__(*args, **kwargs)

        @classmethod
        @access.strip_specid
        def from_string(cls, string, label=None):
            char, mult = string.split(',')
            mult = int(mult)
            return cls(char, mult, label)

        def _get_list(self, samples, params=None):
            # Use the values stored on the instance by __init__.
            val = self.char * self.mult
            return [val] * len(samples)


    example = access.build_accessor('divider:=,5')
    example.get([Entity('one'), Entity('two')])  # returns ['=====', '=====']

In a custom python script, implementing the DataAccessor as above is sufficient to register it for use in build_accessor. Custom data accessors can also be registered via the l7.espclient.extensions entry point. For more information, see also “Extending ESP” in the user guide.
Out-of-box transformations
The client ships with several transformations available out-of-box. As of 3.0.0, the following transformations are available:
error_on_missing: If a value for a given entity is None or the empty string, raise an exception during processing. The default behavior is to simply return the None or empty value for that entity, so use this transform for cases where the value must not be empty. For instance, in an Illumina runsheet, the value for Index might be specified as samplefield:I7 Index|error_on_missing to ensure the I7 Index is not blank.
formatted_location: Given a LIMS location data value, return a “pretty” value. E.g.: sheet:Protocol.Location|formatted_location might return “My Container:A1”.
location_slot: Given a LIMS location data value, return the slot or, if multiple slots, a comma-separated list of slots. E.g.: sheet:Protocol.Location|location_slot might return “A1” or “A1,A2”.
location_name: Given a LIMS location data value, return the container name. E.g.: sheet:Protocol.Location|location_name might return “My Container”.
sub(pattern, replacement, count=0): Run a regular expression-based search-and-replace on the string. E.g. if sheet:Protocol.Field returns the string “abcefgabc”, then sheet:Protocol.Field|sub('abc', 'xyz', 1) would return xyzefgabc and sheet:Protocol.Field|sub('abc', 'xyz') would return xyzefgxyz.
null_to_empty: transform a null value to an empty string.
strip: strip whitespace from the ends of a string.
int: convert a string value to an integer.
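The semantics of a few of these transformations can be sketched in plain Python. This is an illustration only, not the client's implementation: it assumes that sub behaves like Python's re.sub, that transforms are applied left to right, and that error_on_missing raises a ValueError (the source only says "an exception"). The helper apply_transforms is a hypothetical name.

```python
import re


def null_to_empty(value):
    """Turn None into the empty string; pass other values through."""
    return "" if value is None else value


def strip(value):
    """Trim leading and trailing whitespace."""
    return value.strip()


def sub(pattern, replacement, count=0):
    """Build a regex search-and-replace transform; count=0 replaces all matches."""
    return lambda value: re.sub(pattern, replacement, value, count=count)


def error_on_missing(value):
    """Raise if the value is None or the empty string."""
    if value is None or value == "":
        raise ValueError("a required value is missing")
    return value


def apply_transforms(value, transforms):
    """Apply transforms left to right, like `lookup|t1|t2`."""
    for transform in transforms:
        value = transform(value)
    return value


first_only = apply_transforms("abcefgabc", [sub("abc", "xyz", 1)])
all_matches = apply_transforms("abcefgabc", [sub("abc", "xyz")])
cleaned = apply_transforms(None, [null_to_empty, strip])
trimmed = apply_transforms("  A1 ", [null_to_empty, strip])

try:
    apply_transforms("", [error_on_missing])
    missing_raised = False
except ValueError:
    missing_raised = True
```

Ordering matters in a pipeline like this: placing null_to_empty before strip avoids calling a string method on None.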
Runsheet Configuration
A runsheet configuration is a simple yaml (or JSON) key-value mapping that can be processed by the RunSheet2 class of the espclient.runsheet module. It must contain at least the sections key. The sections detail how to map information from L7|ESP to the output runsheet file using the data accessor strings above. The espclient.runsheet module is then used to transform the configuration to the output file, filling in the L7|ESP data along the way. Although specific instrument/application connectors may require additional keys in the configuration (e.g.: Illumina support uses a params section), no additional keys are required in the general case. A fully-specified section (i.e. not relying on defaults) has the following keys:

    - name: My Section
      type: table
      samples: primary
      suppress_name: False
      show_headers: True
      name_format: "[{}]"
      values:
        - Source: sheet:Current.Well
      prepad_section: False
      postpad_section: False

OR

    {
      "type": "table",
      "name": "My Section",
      "samples": "primary",
      "values": [
        { "Source": "sheet:Current.Well" }
      ],
      "name_format": "[{}]",
      "postpad_section": false,
      "prepad_section": false,
      "show_headers": true,
      "suppress_name": false
    }

with meanings as detailed below. Unless otherwise specified, the keys are optional with defaults as indicated.
type (required, choices=[table,key-value,value])
There are three types of sections: key-value, table, and value. The configuration keys for each section type are the same. The section type controls how the final values are rendered.
A key-value section renders one row per entity per value with two values per row: the key and the value. For instance, the Illumina runsheet “Settings” section is key-value, outputting data such as:

    Experiment,My Experiment
    Date,2020/10/05

A value section renders one row per entity per value with one value per row. For instance, the Illumina runsheet “Reads” section is a value section, producing data such as:

    150
    150
    10
    10

A table section renders one row per entity plus a header row, with the number of columns equal to the number of values. For instance, the Illumina runsheet “Data” section is a table section, producing data such as:

    Sample,I7_Index_ID,index
    SAM001,A1,AGCTCGT
    SAM002,A2,AGCTGCG

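To make the difference between the three section types concrete, here is a small sketch that renders the same resolved rows in each shape. It uses only the python csv module; render_section is a hypothetical helper written for this illustration and is not part of espclient.runsheet.

```python
import csv
import io


def render_section(section_type, rows, show_headers=True):
    """Render resolved rows (one dict per entity) as CSV text.

    Hypothetical helper: `rows` stands in for a section's resolved data.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if section_type == "table":
        # One header row, then one row per entity.
        if show_headers and rows:
            writer.writerow(rows[0].keys())
        for row in rows:
            writer.writerow(row.values())
    elif section_type == "key-value":
        # One row per entity per value: key, value.
        for row in rows:
            for key, value in row.items():
                writer.writerow([key, value])
    elif section_type == "value":
        # One row per entity per value: value only.
        for row in rows:
            for value in row.values():
                writer.writerow([value])
    else:
        raise ValueError("unknown section type: {}".format(section_type))
    return buffer.getvalue()


rows = [
    {"Sample": "SAM001", "index": "AGCTCGT"},
    {"Sample": "SAM002", "index": "AGCTGCG"},
]
table_text = render_section("table", rows)
key_value_text = render_section("key-value", rows[:1])
value_text = render_section("value", rows[:1])
print(table_text)
print(key_value_text)
print(value_text)
```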
name (required)
Every section must have a name to uniquely identify the section.
samples (required)
Every section must define the name of the “sample set” used to fill in the data. The available names are application-dependent, but commonly there is a primary sample set, and often this is the only available sample set. The Illumina module makes both primary and slm sets available, with libraries as an alias for primary for backwards compatibility. See the documentation of the particular application to know what sets are available. In the context of data accessors, the sample set is the list of entities passed to the get method of the data accessors.
suppress_name
If true, the section name is not output to the runsheet. If it is false (the default), each section is preceded by a line in the shape determined by name_format.
name_format
The name_format determines the format of the section name when output. It is specified as a python format string that will be provided the section name as the sole input string. The default is [{}]. For instance, a section named Data would have a section name of [Data].
postpad_section
If true, an extra newline will be placed after the section, before the next section is rendered. The default is false.
prepad_section
If true, an extra newline will be placed before the section when it is rendered. The default is false.
show_headers
Only used for type=table. If true (the default), the table header line(s) will be rendered. Otherwise, the table header line(s) will be suppressed.
values (required)
values is a list of key-value pairs. The key must be unique within the list. The value is a data accessor string. For instance:

    values:
      - Sample Name: sampleinfo:name
      - File Name Convention: fixed:GlobalFiler
      - Results Group: fixed:GlobalFiler
      - Sample Type: fixed:Sample
      - Field 1: 'sampleinfo:name@@Individual|null_to_empty'
      - Field 2: 'sampleinfo:name@@Family|null_to_empty'

Given an entity hierarchy such as:
Family 1
    Individual 1
        Sample 1
    Individual 2
        Sample 2
Using the values configuration above with the inputs of Sample 1 and Sample 2 and a section type of table would yield:

    Sample Name,File Name Convention,Results Group,Sample Type,Field 1,Field 2
    Sample 1,GlobalFiler,GlobalFiler,Sample,Individual 1,Family 1
    Sample 2,GlobalFiler,GlobalFiler,Sample,Individual 2,Family 1

The same hierarchy and values configuration with a section type of key-value and only supplying Sample 1:

    Sample Name,Sample 1
    File Name Convention,GlobalFiler
    Results Group,GlobalFiler
    Sample Type,Sample
    Field 1,Individual 1
    Field 2,Family 1

The same hierarchy and values configuration with a section type of value and only supplying Sample 1:

    Sample 1
    GlobalFiler
    GlobalFiler
    Sample
    Individual 1
    Family 1

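The required keys and defaults described in this section can be summarized in a short validation sketch. This is not how espclient.runsheet processes a configuration; normalize_section and the constant names are hypothetical, and the defaults are simply the ones stated above.

```python
REQUIRED_KEYS = ("type", "name", "samples", "values")
SECTION_TYPES = ("table", "key-value", "value")
DEFAULTS = {
    "suppress_name": False,
    "show_headers": True,
    "name_format": "[{}]",
    "prepad_section": False,
    "postpad_section": False,
}


def normalize_section(section):
    """Hypothetical helper: validate one section and fill in defaults."""
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise ValueError("section is missing keys: {}".format(missing))
    if section["type"] not in SECTION_TYPES:
        raise ValueError("unknown section type: {}".format(section["type"]))
    # Each entry of `values` is a one-item mapping of label -> accessor string.
    labels = [label for pair in section["values"] for label in pair]
    if len(labels) != len(set(labels)):
        raise ValueError("value keys must be unique within a section")
    normalized = dict(DEFAULTS)
    normalized.update(section)
    return normalized


section = normalize_section({
    "name": "Data",
    "type": "table",
    "samples": "primary",
    "values": [{"Sample Name": "sampleinfo:name"}, {"Sample Type": "fixed:Sample"}],
})
# The section name line, as shaped by name_format.
header_line = section["name_format"].format(section["name"])
print(header_line)
```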
Using the espclient.runsheet models.
The espclient.runsheet module holds the classes used to generate run sheets given the configuration. The entry point to this functionality is the RunSheetTemplate class. A RunSheetTemplate is a programmatic representation of the configuration, where all data accessors have been resolved. Templates can be configured programmatically, but if you already have the configuration as yaml or json, you can easily load those into python and use the from_config class method as:

    from esp.data.access import build_accessor
    from esp.models import Configuration
    from espclient.runsheet import RunSheetTemplate

    conf = Configuration('My Runsheet Configuration')
    template = RunSheetTemplate.from_config(conf.config, accessor=build_accessor)

from_config supports up to three arguments: the configuration, an optional visitor, and the optional accessor argument. For backwards compatibility, accessor defaults to an older, deprecated implementation of data accessors, so build_accessor should normally be passed to accessor. The visitor is for advanced usage scenarios and is an implementation of the visitor pattern (see also https://en.wikipedia.org/wiki/Visitor_pattern). Once you have a template, it can be used with a list of entities to create a runsheet object:

    # for ESP < 3.0, use Sample instead of Entity.
    # Sample also works for 3.0 for backwards compatibility.
    from esp.models import Entity, SampleSheet

    entities = Entity.search(names=['ESP000001', 'ESP000002'])
    ss = SampleSheet('My Sheet')
    runsheet = template.resolve_runsheet(
        {'primary': entities},
        ss,
        {'worksheet': ss},  # Not required
        mode='v2'
    )

In the example above, we also provided an L7|ESP SampleSheet object. This isn’t strictly necessary, but is required for accessors such as SheetInfo, ExperimentInfo, and SheetValue. As with accessor=build_accessor, mode should be set to v2 to use the new implementation of data accessors (which you should use because it is generally much faster, with cleaner internals, than the deprecated implementation).
Once you have the runsheet object, there are a variety of operations that can be performed. The object is a wrapper around a list of “RunSheetSection” objects, each of which has section-type-dependent capabilities.
You can:
Generate a text output (note: for safe/properly-escaped csv files, ESP uses the python csv module):
  sheet = runsheet.to_csv(): export to csv with the default separator (','); the resulting string is stored in sheet.
  sheet = runsheet.to_csv(sep='\t'): export to csv with a tab separator; the resulting string is stored in sheet.
  runsheet.to_csv('path/to/file.csv'): export to csv, with contents written to the path/to/file.csv output file.
Get the resolved data for a given section: runsheet.sections[0].resolved_data. This is a list of (ordered) dictionaries. The rows are in the same order as the entities passed to resolve_runsheet, but note that if any accessors returned lists of lists, there will be more rows in resolved_data than entities.
Convert a single section to csv: runsheet.sections[0].to_csv(writer).
For table sections, convert to a pandas data frame: runsheet.sections[0].to_data_frame()
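The note about the python csv module matters as soon as a resolved value contains a separator or a quote character. The sketch below uses only the standard library and made-up sample values to show why naive string joining is unsafe while csv.writer output round-trips cleanly; it does not call any ESP API.

```python
import csv
import io

rows = [
    ["Sample Name", "Comment"],
    ["ESP000001", 'contains, a comma and a "quote"'],
]

# Naive joining produces a line that parses into too many columns.
naive_line = ",".join(rows[1])
naive_fields = next(csv.reader([naive_line]))

# csv.writer quotes the field and doubles the embedded quote characters.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
escaped_text = buffer.getvalue()

# Reading the escaped text back recovers the original rows exactly.
parsed = list(csv.reader(io.StringIO(escaped_text)))
print(escaped_text)
```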
The ability to interact with the RunSheet in an object-oriented way makes it possible to combine the configuration-driven approach of generic runsheet with programmatic manipulation of the resulting structure for advanced use-cases. For instance, the runsheet configuration might specify a table of values, so given the starting code:

    # for ESP < 3.0, use Sample instead of Entity.
    # Sample also works for 3.0 for backwards compatibility.
    from esp.models import Entity, SampleSheet

    entities = Entity.search(names=['ESP000001', 'ESP000002'])
    ss = SampleSheet('My Sheet')
    runsheet = template.resolve_runsheet({'primary': entities}, ss, {'worksheet': ss}, mode='v2')

a simple application might do:
runsheet.to_csv('output_file.csv')
whereas a more complex application might produce a custom output:

    resolved_data = runsheet.sections[0].to_data_frame()
    # use pandas functions for additional data manipulations.
    resolved_data['Complex Column'] = resolved_data.apply(
        lambda row: float(row.A) * float(row.B), axis=1)
    with open('output_file.txt', 'w') as f:
        f.write('PREAMBLE\n')
        f.write('========\n')
        resolved_data.to_csv(f)
        f.write('========\n')
        f.write('POSTFIX\n')

Considerations for multiple output values for one or more configured values
Although data accessors usually resolve one value for each entity provided, they may also resolve a list of values for each entity. That is, given the list of entities [s1, s2, s3], a call to accessor.get(entities) would usually return [v1, v2, v3], but may also return [[v1.1, v1.2], [v2.1], [v3.1, v3.2, v3.3]]. When constructing a runsheet, each data accessor used in table generation must, for any given entity, either return a single result or return the same number of results as every other accessor that returns a list for that entity. That is, given entities s1, s2, and s3 and data accessors “X1” and “X2”, the following returns are valid or not as indicated in the table.
X1 returns | X2 returns | Resulting Rows |
---|---|---|
[1, 2, 3] | [4, 5, 6] | [[1, 4], [2, 5], [3, 6]] |
[[1,2], [3], [4, 5, 6]] | [7, 8, 9] | [[1, 7], [2, 7], [3, 8], [4, 9], [5, 9], [6, 9]] |
[[1,2], [3], [4, 5, 6]] | [[7, 8, 9], [10], [11, 12, 13]] | None - an error will be raised because the values for each element of both X1 and X2 are lists, but the number of elements for entity |
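The row-expansion rule in the table above can be expressed as a short function. This is a sketch of the rule as documented, not the runsheet implementation; expand_rows is a hypothetical name and the inputs are the X1/X2 values from the table.

```python
def expand_rows(columns):
    """Combine per-accessor results into table rows.

    `columns` holds one result list per accessor, each with one item per
    entity. An item is either a single value or a list of values.
    """
    rows = []
    for per_entity in zip(*columns):
        # All list-valued results for one entity must agree in length.
        lengths = {len(v) for v in per_entity if isinstance(v, list)}
        if len(lengths) > 1:
            raise ValueError("accessors returned lists of different lengths")
        n_rows = lengths.pop() if lengths else 1
        for i in range(n_rows):
            # Single values are repeated on every row for that entity.
            rows.append([v[i] if isinstance(v, list) else v for v in per_entity])
    return rows


scalar_rows = expand_rows([[1, 2, 3], [4, 5, 6]])
mixed_rows = expand_rows([[[1, 2], [3], [4, 5, 6]], [7, 8, 9]])

try:
    expand_rows([[[1, 2], [3], [4, 5, 6]], [[7, 8, 9], [10], [11, 12, 13]]])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(scalar_rows)
print(mixed_rows)
```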
Putting it together
Many liquid handlers accept a “run sheet” which is a listing of source wells, source volumes, dest wells, and dest values. The following configuration shows how you might quickly write a configuration for a liquid handler pooling operation: putting all entities into a single target well. For the sake of simplicity, this example assumes a worksheet with protocol named “Current” that has fields “Well” (the source well) and “Volume” (volume to transfer).
Configuration:

    Liquid Handler:
      sections:
        - name: Main Table
          suppress_name: True
          type: table
          samples: primary
          values:
            - Source: sheet:Current.Well
            - Source Volume: sheet:Current.Volume
            - Dest: fixed:A12

With the script written as:

    import argparse

    from esp.data.access import build_accessor
    from esp.models import Configuration
    from esp.models.analysis import LimsAnalysis
    from espclient.runsheet import RunSheetTemplate

    parser = argparse.ArgumentParser(description='Generate a simple liquid handler picklist')
    parser.add_argument("output_path", help="path to write the output file to")
    opts = parser.parse_args()

    la = LimsAnalysis.current()
    template = RunSheetTemplate.from_config(
        Configuration('Liquid Handler').config, accessor=build_accessor)
    runsheet = template.resolve_runsheet(
        {'primary': la.samples},
        la.sample_sheet,
        {'worksheet': la.sample_sheet},
        mode='v2').to_csv()

    # Open the output in write mode; open() defaults to read-only.
    with open(opts.output_path, 'w') as f:
        f.write(runsheet)

Footnotes
[1] The data access API is on par with SampleSheet save performance in that it performs optimizations such as caching certain client-side objects and (re)using them from cache, and bulk fetching data values for all entities for a given accessor. Formally, it’s possible to write custom code or a custom query to fetch all needed data values in a single API call with greater computational efficiency, but at greater implementation time and maintenance cost.
[2] In python, a protocol defines the set of methods an object is expected to support, along with the arguments those methods accept and the type of values that must be returned. In some languages, this is called an Interface. Note that prior to python 3.8, protocol was more of an abstract concept in Python, either implemented via abstract base classes or purely via documentation. As of python 3.8, there is dedicated language support for protocols, but the client supports python >= 3.5 (ESP < 3.0) and python >= 3.7 (ESP >= 3.0), so we haven't updated the code with this formal support for protocols yet.