GenEpiO Design
GenEpiO aims to provide a single, open-source, globally accessible set of terms for use in databases and software user interfaces; consequently it is called an “application ontology” (as opposed to a taxonomic ontology that specializes in a single domain). It is currently being incorporated into the Integrated Rapid Infectious Disease Analysis (IRIDA) project, www.irida.ca.
GenEpiO uses the OWL capability to organize terms in a main hierarchy according to the Basic Formal Ontology (BFO) and Ontology for Biomedical Investigations (OBI) schemas. This provides a degree of plug-and-play compatibility with other ontologies in the OBO Foundry family.
In our first pass, most of the terms we have collected pertain to measurements and observables (data) within laboratory practice, genomic analysis, and epidemiology/clinical records. We have focused on placing these terms within BFO/OBI under the “information content entity” category. We provide detail here to encourage feedback about logic implications, interoperability, and relative simplicity.
Here individual measurables are described. At the very least, a measurable's basic data type is a stable and fundamental characteristic, and organizing measurables by it is immediately useful for data exchange and user interface construction. The data type categorization facet is a bottom-up ontology approach, but we expect it to be accompanied by other higher-level semantic categorizations. A logical definition for each term within its narrower domain (e.g. a lab test within the context of its inputs and outputs, or its technology) could be provided as well, especially for those terms GenEpiO is responsible for introducing, but writing those definitions will be an ongoing process.
Many clinical and environmental measurables, and process/event-related named points in time (e.g. “exposure start” and “symptom onset”), have been placed under OBI categorical, scalar and time-measurement categories. All categorical items like the disease or symptom hierarchies have a basic data type of “URI”, meaning that to select a categorical value is to select some vocabulary item that must have a globally accessible URI (a URI essentially enables an entity to be a categorical datum, and an ontology is at the very least a data dictionary of such things). Categorical measurables like “Symptom” are marked as an OBI “categorical value specification”, but moreover as a GenEpiO “categorical tree specification”, which allows us to list (in a hierarchy if desired) particular pick list choices for these items, with choices usually imported from other ontologies. We will also provide a tool (much like this search widget) to enable users to select more fine-grained choices if permitted.
Other non-categorical datums have a GenEpiO “has primitive value spec” relation that points directly to a primitive decimal, integer, string or date-time data type. We introduced this relation since we could not find a sufficient distinction between references to complex data types, which are described by quantifier relations in OWL to other entities, and the primitive XML Schema data types that appear directly in RDF statements (literal, long, int, nonNegativeInteger, etc.). More on the complex data type issue follows.
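The pick-list idea behind a “categorical tree specification” can be sketched as a small tree of URI-identified choices that a user interface flattens into an indented selection list. This is only an illustration: the `Choice` class, the `flatten` helper, and the example.org URIs are assumptions made for the sketch, not GenEpiO terms or identifiers.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Choice:
    """One pick-list entry: a vocabulary term identified by a URI."""
    uri: str
    label: str
    children: list[Choice] = field(default_factory=list)


def flatten(choice, depth=0):
    """Yield (depth, label, uri) rows in display order for a pick list."""
    yield depth, choice.label, choice.uri
    for child in choice.children:
        yield from flatten(child, depth + 1)


# Hypothetical terms; the example.org URIs are placeholders, not real ontology IDs.
symptom = Choice("http://example.org/symptom", "symptom", [
    Choice("http://example.org/fever", "fever"),
    Choice("http://example.org/gi", "gastrointestinal symptom", [
        Choice("http://example.org/diarrhea", "diarrhea"),
        Choice("http://example.org/vomiting", "vomiting"),
    ]),
])

rows = list(flatten(symptom))
for depth, label, uri in rows:
    # Indentation mirrors the hierarchy; the stored value is always the URI.
    print("  " * depth + label, "->", uri)
```

Whatever label is shown, the value recorded for the datum is the URI, which is what makes the selection a categorical datum.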
Many other needs arise when it comes to the particulars of form data entry but we will keep most of that configurable functionality separate in an Interface Model Ontology. Only with consensus of term use can we move towards a future where querying across ontology-driven databases is simplified.
The OBI Data Representational Model area content reflects an exploration of how more complicated GenEpiO entities could be harmonized. Currently this shows IRIDA-specific data structures, but after a round of feedback these can be generalized. Individual users may find that their existing systems don’t have equivalent mappings, or that entities like “symptom record” need extending; we look to Consortium participation to determine this. We view GenEpiO as the platform to make decisions on preferred global data types in cases where consensus is strong.
To describe an entity composed of various bits of information, in other words to describe its “complex data type”, we use one or more OBI “has member” relations to point to another entity as an informational component; and we reserve “has primitive value spec” for pointing to those entities that only have a primitive data type, i.e. they point to a “measurement data item”. The decomposition of GenEpiO “data representation model” classes always ends with entities having primitive data types. Practically, a datum like an “age” measurement having a primitive data type like “positiveInteger” shows up in a user interface data entry form as a single form input. A datum having a complex “geo coordinate” data type appears as two angular measurements, latitude and longitude, each with a decimal input.
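As a minimal sketch of this decomposition, assume each entity is described either by a primitive data type or by a list of members. The dictionary layout, the `form_inputs` function, and the “collection record” entity are hypothetical and chosen only for illustration; “age” and “geo coordinate” follow the examples in the text.

```python
# Each entity has either a primitive value spec (a leaf) or "has member" parts.
MODEL = {
    "age": {"primitive": "positiveInteger"},
    "latitude": {"primitive": "decimal"},
    "longitude": {"primitive": "decimal"},
    "geo coordinate": {"members": ["latitude", "longitude"]},
    # Hypothetical composite entity, used only to show nesting.
    "collection record": {"members": ["age", "geo coordinate"]},
}


def form_inputs(entity, model, path=()):
    """Decompose an entity into (field path, primitive type) form inputs."""
    spec = model[entity]
    here = path + (entity,)
    if "primitive" in spec:
        # Decomposition always ends at an entity with a primitive data type.
        return [("/".join(here), spec["primitive"])]
    inputs = []
    for member in spec["members"]:
        inputs.extend(form_inputs(member, model, here))
    return inputs


for field_path, datatype in form_inputs("collection record", MODEL):
    print(field_path, ":", datatype)
```

A primitive entity yields one input, while a complex entity yields one input per primitive leaf beneath it.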
In GenEpiO we are including standards and their entities (fields) for services that are central to the genomics effort – namely the NCBI library entities like BioProject and BioSample that control sequence submissions. NCBI related specifications are under the “NCBI model” term. As well, “GenEpiO model” contains a number of building block elements like a symptom record (i.e. one or more symptoms and the date they occurred on).
We have placed an initial list of units here (e.g. celsius, meter, degree, megabyte, gigabyte) that most measurables (aside from categorical data) need in order to support data entry and exchange. This list can be heavily extended to enable automatic unit conversion. We envisage being able to mark particular units as preferred such that incoming data (by hand or by network) can be converted on the fly. In the ontology, any measurable datum that a unit applies to is marked with an IAO “has measurement unit label” relation to that unit (the label is a bit misleading, since it is the measurement unit itself we want to refer to).
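The on-the-fly conversion to preferred units envisaged above might look roughly like the following. The `UNITS` and `PREFERRED` tables and the function names are assumptions for this sketch; GenEpiO itself does not define them, and the gigabyte factor assumes the decimal (SI) convention.

```python
# unit -> (dimension, factor, offset), where base_value = value * factor + offset
UNITS = {
    "meter":      ("length", 1.0, 0.0),
    "kilometer":  ("length", 1000.0, 0.0),
    "celsius":    ("temperature", 1.0, 0.0),
    "kelvin":     ("temperature", 1.0, -273.15),
    "fahrenheit": ("temperature", 5.0 / 9.0, -160.0 / 9.0),
    "megabyte":   ("data size", 1.0, 0.0),
    "gigabyte":   ("data size", 1000.0, 0.0),  # decimal (SI) convention assumed
}

# Hypothetical site-level choice of preferred unit per dimension.
PREFERRED = {"length": "meter", "temperature": "celsius", "data size": "megabyte"}


def convert(value, from_unit, to_unit):
    """Convert between two units of the same dimension via the base unit."""
    dim_from, f_from, o_from = UNITS[from_unit]
    dim_to, f_to, o_to = UNITS[to_unit]
    if dim_from != dim_to:
        raise ValueError(f"cannot convert {from_unit} to {to_unit}")
    base = value * f_from + o_from
    return (base - o_to) / f_to


def to_preferred(value, unit):
    """Normalize an incoming value to the preferred unit of its dimension."""
    target = PREFERRED[UNITS[unit][0]]
    return convert(value, unit, target), target


print(to_preferred(1.5, "kilometer"))
print(to_preferred(2, "gigabyte"))
```

Keeping a dimension alongside each unit lets the converter reject nonsensical requests, such as converting meters to celsius, before any arithmetic is done.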
There are ambiguities around how OBI handles complex data types, including units, which a few ontologies (QUDT, OM, OU) have tried to resolve by having a data type’s validation rules defined explicitly. Part of the issue stems from the historical fact that the RDF/XML data format allows one to select a primitive data type for the value of a triple, but this metadata is implicit knowledge that OWL reasoners don’t really consider (as we understand it), so unit/data type ontologies are added to enable more reasoning. Unfortunately, since a simple but robust data type schema is not harmonized in the aforementioned ontologies or specified via OWL or BFO or OBI, we have to make decisions about what GenEpiO should do.
We take the approach that when a datum (a measurement, say) is known to be an instance of a class of entity, the entity is the proper (generalized) place for defining the data type or unit of the datum, so that the data type (unit) doesn’t literally need to be stored with the datum. Restated, some theoretical entities like Pi have a universal value that one can attach directly to their concept. Most other measurable concepts like pH (or even the speed of light) have measurable instances we call datums, and for these we can pair the data type with the measurable concept rather than the datum instance. (This is done often in engineering “dimensional analysis”, in which one checks that units are compatible through a formula transformation that ignores actual quantities or named variables.) When we know a datum is a pH measurement there is no need to include an RDF statement that specifies the datum’s unit; the unit is inferred from the pH measurement class definition.
Also included is an experimental “n-dimensional specification” that should provide the framework for geospatial analysis of genomic or other place-related data. One can see geospatial properties being inherited down to subclasses, i.e. a state or province inherits a number of properties as a result of being (in turn) a “categorical location datum”, a “geopolitical region”, and ultimately, an “identified area on Earth”. We expect this will make it much easier to query and transform data to yield epidemiological statistics and visualizations.
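A small sketch of pairing the unit with the measurable class rather than with each datum follows. The class and field names, the “water temperature” measurable, and the unit labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MeasurableClass:
    """The concept: the data type and unit are defined once, here."""
    label: str
    datatype: str
    unit: Optional[str]


@dataclass(frozen=True)
class Datum:
    """An instance: it stores only its class and its value, never a unit."""
    measurable: MeasurableClass
    value: float


# Hypothetical class definitions; the unit labels are placeholders.
PH = MeasurableClass("pH measurement", "decimal", "pH")
WATER_TEMP = MeasurableClass("water temperature", "decimal", "celsius")


def unit_of(datum):
    """Infer the unit from the datum's class definition."""
    return datum.measurable.unit


def compatible(a, b):
    """Dimensional-analysis style check that ignores the actual quantities."""
    return unit_of(a) == unit_of(b)


reading = Datum(PH, 7.2)
print(unit_of(reading), compatible(reading, Datum(WATER_TEMP, 18.0)))
```

Because the check looks only at class-level units, two datums can be compared for compatibility without inspecting their values.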
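The inheritance described here can be illustrated with a subclass chain in which each class contributes its own distinguishing properties. The class names come from the text, but reading them as a single chain, and every property listed, are assumptions made for this sketch rather than statements about the actual GenEpiO axioms.

```python
# child -> parent, following the chain named in the text (assumed to be linear).
SUBCLASS_OF = {
    "state or province": "categorical location datum",
    "categorical location datum": "geopolitical region",
    "geopolitical region": "identified area on Earth",
}

# Hypothetical properties stated on each class on its own.
DIFFERENTIA = {
    "identified area on Earth": ["has latitude/longitude extent"],
    "geopolitical region": ["has governing body"],
    "categorical location datum": ["has URI"],
    "state or province": ["part of some country"],
}


def inherited_properties(term):
    """Collect properties from the root class down to the given term."""
    props = []
    while term is not None:
        props = DIFFERENTIA.get(term, []) + props  # ancestors come first
        term = SUBCLASS_OF.get(term)
    return props


for prop in inherited_properties("state or province"):
    print(prop)
```

A query tool could use the accumulated list to decide, for instance, that any state or province can be placed on a map because some ancestor supplies a spatial extent.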
A note on the utility of data and object properties: A data property is the quickest OWL way to associate a value of a certain type with an entity; and an object property is the quickest way to make known some association between two concepts. However, the more we define the structure of a concept (like density, a ratio between a countable entity/population and a spatial region/area/volume) with respect to its parts, the closer we get to a future that can automatically associate the resources we need to answer a query about the concept. By defining population density as having a numerator of some population count, and a denominator as some region, a search for appropriate population counts and related region areas at a given time-point can theoretically be conducted. Defining density just as a name of a property doesn’t help in that campaign. In that respect properties are shortcuts that don’t add semantic value. Instead, we are pursuing a model in which a quality of an entity at some point in time (like population density of a given population) has a “has value” data property relation to its observed value.
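To make the numerator/denominator idea concrete, here is a rough sketch in which a density definition names the kinds of observation it needs, and a lookup finds matching values for a subject at a time-point. The data layout, function names, and the “Region A” figures are invented for illustration only.

```python
# A structural definition of the concept, rather than just a property name.
POPULATION_DENSITY = {
    "numerator": "population count",
    "denominator": "region area",
}

# (quality, subject, year, value); hypothetical figures, area in square km.
OBSERVATIONS = [
    ("population count", "Region A", 2015, 50000),
    ("region area", "Region A", 2015, 250.0),
    ("population count", "Region A", 2016, 51000),
]


def lookup(quality, subject, year, observations):
    """Return the observed value ("has value") of a quality at a time-point."""
    for q, s, y, value in observations:
        if (q, s, y) == (quality, subject, year):
            return value
    return None


def evaluate_ratio(definition, subject, year, observations):
    """Resolve both parts of a ratio concept, then compute it if possible."""
    numerator = lookup(definition["numerator"], subject, year, observations)
    denominator = lookup(definition["denominator"], subject, year, observations)
    if numerator is None or denominator in (None, 0):
        return None  # the resources needed to answer the query are missing
    return numerator / denominator


print(evaluate_ratio(POPULATION_DENSITY, "Region A", 2015, OBSERVATIONS))
```

Because the definition says which parts are required, the function can report that a density is unanswerable for 2016 (no area was recorded) instead of silently returning a wrong number.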
There will be debate about whether some measurables should be categorical or not. For example, to provide categorical entries for cities/towns within a particular region one would require an ontology like the GEO or GAZETTEER ontology to be comprehensive enough. If they aren’t then we must allow for free text entry in addition to or instead of a selection list of city names. We expect the consortium’s needs will require both possibilities.
The above inheritance behaviour illustrates a design approach in which subclasses are differentiated from each other by features unique to them, and which essentially provide meaning. To describe an entity’s context (e.g. X is part of Y) yields a list of facts or properties that form expectations about its information-bearing content, and about how it can be applied in processes. Protege shows this inheritance automatically as long as each differentia is stated on its own in the class hierarchy view “SubClass Of” section rather than as an equivalent one-line conjunction, so we try to avoid the latter.
One note: work on the food ontology requirement for GenEpiO is being carried out separately in the FOODON project.
One can still find numerous ontology examples in which a primitive measurable datum is stated as an instance of a class, with “has value specification” relations leading to nodes representing both the unit and the value, but we discourage unit description at this level unless units can’t be normalized at a higher entity.
Secondly, there are a number of cases where a node’s “has value specification” points to a value alone, with no data type. For this purpose we prefer the data property “has specified value” relation (aka SIO “has value”). Note though that while “has specified value” is defined as “A relation between a value specification and a number that quantifies it.”, we think it is more appropriate to constrain this to “number” further down in the hierarchy, e.g. when defining “has scalar value specification”. Currently GenEpiO states no measurable datums and, to date, no genomic epidemiology constants.
We recognize that a formal ontology needs to respect the various upper level concepts at play, and for this a term’s label and description will remain as posted from the original ontologies that GenEpiO imports from. However, for those users who are not in the ontology business, and especially those for whom an application ontology surfaces only in software user interface (UI) form inputs, reports and help screens, we are providing a preferred UI label and, where suitable, a preferred UI description for plain language use. Import of labels is still an awkward practice because of the number of alternative label/definition options available: ‘alternative term’, ‘editor preferred term’, ‘definition’ vs ‘description’, etc. The key point is that most of those labels may not target the particular type of user that the software must cater to. The GenEpiO UI label attempts to cover this generally.
If a term is imported, and has ancestors and descendants, we won’t import the ancestors unless they provide a convenient navigational role in some user or search interface, or a common categorical heritage for a number of ontology terms. Similarly we won’t import children of used terms unless we want to prime categorical selection lists with their values, or have them on hand for data import/export reasons (we can provide child term dynamic lookups instead). In other words, we are interested in a GenEpiO that is internally consistent, and which imports from other ontologies minimally to satisfy that consistency, rather than importing third-party ontologies as a whole in case that helps identify axiomatic contradictions. Ideally there would be a “world-consistency” engine that works out the existence of any contradictions that an ontology has with respect to the inclusion of third-party ontologies (and their imports too, etc.). We realize we are trading logical prowess for simplicity. We would like to have a casual user browse GenEpiO and be able to see that each term falls within the constellation of Genomic Epidemiology phenomena.
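One way a user interface might choose between the imported label and the plain-language one is a simple fallback, sketched below. The dictionary keys (`label`, `ui_label`, `definition`, `ui_definition`) and the example terms are hypothetical stand-ins, not the actual annotation property names.

```python
def ui_text(term, ui_key, original_key):
    """Prefer the plain-language UI annotation; else keep the imported one."""
    return term.get(ui_key) or term[original_key]


def ui_label(term):
    return ui_text(term, "ui_label", "label")


def ui_description(term):
    return ui_text(term, "ui_definition", "definition")


# Hypothetical terms: the first carries UI annotations, the second does not.
with_ui = {
    "label": "specimen collection process",
    "definition": "A planned process that yields a specimen.",
    "ui_label": "sample collection",
    "ui_definition": "How and when the sample was taken.",
}
without_ui = {
    "label": "body temperature",
    "definition": "The temperature of a body.",
}

for term in (with_ui, without_ui):
    print(ui_label(term), "|", ui_description(term))
```

The original label and definition are never overwritten, so the formal ontology view and the plain-language view can coexist on the same term.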