Project to enhance

Structured Data Linter

by Ankita Dhandha

Presentation Notes

This is a living document. “Lorem ipsum” is placeholder for future content.

opens links on buttons & in attachments.

Attachments ↓ provide additional detail.

Table of Contents is upper left 🡔 navigate from there.

Left‐side content ← previews target content & usually is scrollable.

Presentation works on 💻 & 📱.

Living Documents

SDL Announcements

2011

2020

SDL Agenda and Product Plan

Release–0.3.9:  SDL 3.9 baseline from Gregg Kellogg

Release–1.0:  SDL 3.9 running under AWS/Lambda

Release–2.0:  SDL running under AWS/EC2

Release–3.0:  SDL with new client‐side features and UX

Release–4.0:  SDL with new server‐side support for SHACL/ShEx

Release–5.0:  SDL with new server‐side support for ontologies

Release 1.0 Design Goals

Implement SDL on AWS/Lambda

Learn Lambda application limits to configure SDL as a Lambda application

Learn SDL/Lambda processing limits to determine graph size and complexity for SDL analysis

Test SDL/Lambda jobs that are too large/complex to run on SDL/Heroku (Gregg’s native implementation)

Begin to prepare SDL customers for two platforms: one for simple jobs on AWS/Lambda; one for complex jobs (future release)

Example 1-21

Define a simple JSON–LD @Graph

Test simple @Graph on Schema.org Markup Validator (SMV)

Use A/B Testing: compare SMV and SDL reports using identical @Graph (SMV/Graph sameAs SDL/Graph) (≡)

On SMV report, click an @Type (PublicationIssue and/or ScholarlyArticle) to see report detail

Run Schema Validator
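A minimal sketch of what such a simple @Graph might look like, built in Python with the standard json module. Only the two types (PublicationIssue and ScholarlyArticle) come from this example; the @id values, names, and issue number are hypothetical placeholders, not the graph used in the presentation.

```python
import json

# Hypothetical example graph; the @id values and literal values are placeholders.
graph = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@id": "https://example.org/issue/1",
            "@type": "PublicationIssue",
            "issueNumber": "1",
        },
        {
            "@id": "https://example.org/article/1",
            "@type": "ScholarlyArticle",
            "name": "Example article",
            "isPartOf": {"@id": "https://example.org/issue/1"},
        },
    ],
}


def types_in_graph(doc):
    """Return the sorted set of @type values declared by nodes in @graph."""
    return sorted({node["@type"] for node in doc.get("@graph", []) if "@type" in node})


# Serialize once so the identical text can be submitted to both validators.
payload = json.dumps(graph, indent=2)
print(types_in_graph(graph))
```

Serializing the graph once and pasting the same text into both tools keeps the A/B comparison honest: any difference in the reports then comes from the validators, not from the input.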

Example 1-22

begin A/B testing

JSON–LD test on AWS/Lambda running SDL

SDL “search results preview” is same information embedded in SMV analysis but served in human-readable format

On SDL report, scroll down to see JSON–LD graph processing and analysis

JSON–LD on AWS/Lambda

Example 1-31

continuing A/B testing

JSON–LD graph defines Natural Languages used to present content to readers (and intelligent devices) in language of their choice

@Language graph is more complex than previous A/B test

Select SMV report about @Types [e.g. Class (33 items) and/or DefinedTerm (3 items)] for detail

Run Schema Validator

Example 1-32

continuing A/B testing

@Language graph on AWS/Lambda

SDL “search results preview” is same information in SMV but served in human‐readable format

Scroll down to see JSON–LD graph processing and analysis

When processed on AWS/Lambda, SDL generates report about 2,204 “triples” defined in @Language graph

AWS/Lambda (be patient …)

Example 1-41

continuing A/B testing

This JSON–LD graph is the Ontomatica Knowledge Graph

Knowledge Graph uses ~20 @Type objects such as @Corporation, @Product, @Offer and @Dataset

SMV integrates (links) valid @Type and @Property relationships to create a single view: “Corporation”

Run Schema Validator

Example 1-42

continuing A/B testing

@Language graph on AWS/Lambda

SDL “search results preview” is same information in SMV but served in human‐readable format

Scroll down to see JSON–LD graph processing and analysis

When processed on AWS/Lambda, SDL generates report about 2,204 “triples” defined in @Language graph

AWS/Lambda (be patient …)

Example 1-51

Force Directed Graph (FDG) of Ontomatica’s Knowledge Graph

Facts (entities & relationships) in FDG are identical to JSON–LD facts in SMV & SDL reports

Rotate/zoom/move FDG to see specific entities & relationships

Link highlighted in red features main entities on pages [ mainEntityOfPage ]

Knowledge Graph

Release 1.0 Issues

AWS/Lambda “duration window” limits file size for SDL processing

AWS/Lambda “size window” limits integration of optional SDL features

SDL/AWS/Lambda will process larger & more complex graphs than SDL/Heroku (Gregg’s SDL platform)

SMV server is faster than default AWS/Lambda server & will process graph files up to 2.5 MB

Rel-1 Issues on GitHub
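A size limit like the 2.5 MB figure quoted above can be checked before a graph is submitted. The sketch below is one possible pre-check, not part of SDL; the function names are hypothetical, and it assumes 1 MB means 1,000,000 bytes, which the slides do not specify.

```python
import json

# Assumption: 1 MB is taken as 1,000,000 bytes; the slides do not say which
# definition of "MB" applies.
LIMIT_BYTES = int(2.5 * 1_000_000)


def payload_size(doc):
    """Size in bytes of the compact UTF-8 JSON serialization of doc."""
    text = json.dumps(doc, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def fits_limit(doc, limit_bytes=LIMIT_BYTES):
    """True when the serialized graph is no larger than limit_bytes."""
    return payload_size(doc) <= limit_bytes


# Hypothetical tiny graph used only to exercise the check.
small = {"@context": "https://schema.org", "@graph": [{"@type": "Thing", "name": "x"}]}
print(payload_size(small), fits_limit(small))
```

A check of this kind could also route jobs between the two platforms described in the design goals, sending only graphs above the threshold to the larger server.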

Release 2.0 Design Goals

Use client–side methods to add features to SDL reports

Use CSS grid to create cells for specific SDL features

Use CSS lightbox to preview “cell + content”

Upon cell selection, lightbox displays “cell + content” preview in full screen

Build out “cell + content” design with existing SDL features such as table analysis, error messages and reasoner messages

Build out “cell + content” design with new features such as graph visualization

Attachments include information that complements slide content. The following uses playful “lorem ipsum” style text to illustrate presentation of additional information.


Abbervail Dream
Blue Diamonds
Bright amazing and wonderful
Dancing around the flames
Everybody knows bird is word
Caramel Sensation
Dairy Cream
Frosty the snowman is a boss
Girls just want to have fun
Got some popsicles in the cellar
Elusive Enchantment
Fat Chance Cinnamon
Insomnia gives me time to
Inspiration slaps me in the face
Last chance for one last dance
Good Luck Charm
Hershey's Kiss
Laugh all day for no reason
Life is a box of chocolates
Live like there is no tomorrow
Ice Cream Mix
Jack Daniels
Make it up as you go
Moms cookies make everything
My room is an organized mess
Kitty Hawk
Last Man Standing
Pluto is still a planet
Six words can mean the world
Sleeping with a giant bear
Made You Look
Nabisco Cracker
Sour candy makes me twitch
The sky is not the limit
There always gonna be another
One in a Million
Peach Blossom
There no place like grandmas
Why whisper what you shout
Your the apple to my pie

Release 2.0 Design Notes

Jarno van Driel proposed new SDL features

New features are presented & discussed on Google Docs

Preview document using link below

Jarno’s Doc page

Example 2-21

Feature SDL table (current example injects sample data from Wikidata)

Feature one or more visualized graphs using processors e.g. D3.JS

Feature hierarchical view of structured data, similar to Schema Markup Validator

Feature parser statistics

Feature reasoner analysis (snippets)

Feature warnings & errors (here preview shows ~10% of full page content)

Rel-2 prototype full

Example 2-22

Production version of Jarno design

Sample uses simple case from SDL/AWS/Lambda Example 1-22

Cells feature:  (1) search results preview  (2) RDF  (3) TTL  (4) RDFa  (5) JSON–LD beautified  (6) RDF Grapher  (7) tabular report  (8) parser statistics  (9) linter message from reasoner

Footer includes link to SDL Release 2.0 prototype running on AWS/Lambda

Rel-2 production page

Example 2-31

On following pages are seven views of a single JSON data source

Example 2-31: Circle Packing

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Example 2-32: Sunburst (next)

Circle Packing page

Example 2-32

Sunburst

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Example 2-33: Sunburst Zoom (next)

Sunburst page

Example 2-33

Sunburst Zoom with Labels

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Example 2-34: Collapsible Boxes (next)

Sunburst Zoom page

Example 2-34

Collapsible Boxes

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Example 2-35: Node-Link Tree (next)

Collapsible Boxes page

Example 2-35

Node-Link Tree

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Example 2-36: Treemap (next)

Node-Link Tree page

Example 2-36

Treemap

With server–side assistance, a newly generated JSON data structure could be similarly visualized

Treemap page

Example 2-41

A force directed graph (FDG) visualizes schema.org @Type and @Property specifications & relationships

Source data conforms to subject–predicate–object (?s ?p ?o) format

In contrast, flare.json structure (Examples 2-31 to 2-37) uses hierarchical structure based on rdfs:subClassOf

With server–side assistance, a newly generated JSON data structure could be similarly visualized

schema.org FDG page
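The difference between the two structures can be illustrated by converting subject–predicate–object triples into a nested, flare-style hierarchy. This is a sketch under assumptions: the "name"/"children" keys follow the layout commonly used for D3 hierarchy data, the sample triples are a hand-picked illustration, and the function name to_flare is hypothetical rather than anything in SDL.

```python
import json
from collections import defaultdict

SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical sample triples (subject, predicate, object). The class names are
# schema.org types, but this short list is only an illustration.
triples = [
    ("CreativeWork", SUBCLASS_OF, "Thing"),
    ("Article", SUBCLASS_OF, "CreativeWork"),
    ("ScholarlyArticle", SUBCLASS_OF, "Article"),
    ("Organization", SUBCLASS_OF, "Thing"),
    ("Article", "rdfs:label", "Article"),  # ignored: not a subclass statement
]


def to_flare(triples, root):
    """Build a nested {"name", "children"} tree from subClassOf triples."""
    children = defaultdict(list)
    for s, p, o in triples:
        if p == SUBCLASS_OF:
            children[o].append(s)

    def build(name, seen):
        node = {"name": name}
        # "seen" holds the ancestors on the current path and guards against cycles.
        kids = [build(c, seen | {c}) for c in sorted(children[name]) if c not in seen]
        if kids:
            node["children"] = kids
        return node

    return build(root, {root})


tree = to_flare(triples, "Thing")
print(json.dumps(tree, indent=2))
```

A class with more than one superclass would appear once under each parent in this sketch, which is one reason a graph layout can suit triple data better than a strict tree.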

Item 2-51

Develop consensus in SDL community & among interested parties about final design for Release 2.0 interface

Will need SDL server–side changes to generate JSON structure for D3.JS processing

Will need SDL server–side changes to generate JSON structure for Force Directed Graph processing

Rel-2 Issues on GitHub
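As a rough idea of the server-side output a Force Directed Graph might need, the sketch below turns triples into a "nodes" and "links" structure. The key names (id, source, target, label) follow a layout commonly used with D3 force simulations and are an assumption here, as are the sample triples and the function name to_node_link.

```python
import json

# Hypothetical sample triples; the type and property names exist in schema.org,
# but the selection is only an illustration.
triples = [
    ("Corporation", "rdfs:subClassOf", "Organization"),
    ("Offer", "itemOffered", "Product"),
    ("Product", "manufacturer", "Organization"),
]


def to_node_link(triples):
    """Convert (s, p, o) triples into a {"nodes": [...], "links": [...]} dict."""
    ids = []
    for s, _, o in triples:
        for term in (s, o):
            if term not in ids:
                ids.append(term)  # keep first-seen order, no duplicates
    nodes = [{"id": term} for term in ids]
    links = [{"source": s, "target": o, "label": p} for s, p, o in triples]
    return {"nodes": nodes, "links": links}


fdg = to_node_link(triples)
print(json.dumps(fdg, indent=2))
```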

Release 3.0 Design Goals

Create SDL preparation methods & production platform to analyze large graphs

SDL processor & reasoner objective:
analyze @Graph with 10 million statements (“triples”)

Example 3-21

Refactored USDA National Agricultural Library Thesaurus (NALT) in schema.org

NALT/JSON–LD size: 6.84 MB

NALT/JSON–LD exceeds SMV 2.5 MB limit (no A/B analysis)

Alternative: configured SDL on AWS/EC2 server

SDL/NALT report size: 31.6 MB

SDL/NALT/AWS/EC2 processing time: 5 hours

NALT “triples”: 515,530

SDL/NALT (be patient …)

Example 3-22

Ontomatica’s Web Enabled Directed Graph Engine (WEDGE) Reference Library is an application of National Agricultural Library Thesaurus

Research papers are mapped to schema.org JSON–LD structure in SDL report

Research papers are annotated using schema.org @Type and @Property grammar

WEDGE Reference Library contains information about 200,000+ papers

NALT “triples”: 515,530

WEDGE NALT Library

Example 3-23

Visualization does not include Taxa which is included in SDL report (Example 3-21)

Visualization uses same JSON–LD structure as used in SDL Release 2.0 design and prototype

NALT Sunburst

Example 3-31

Refactored US NIH National Cancer Institute Thesaurus (NCIT) in schema.org

NCIT/JSON–LD size: 13.7 MB
(no A/B analysis with Schema Markup Validator)

SDL/NCIT report size: 76.9 MB

SDL/NCIT/AWS/EC2 processing time: 9 hours

NCIT “triples”: 946,520

NCIT (be patient …)
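The figures in Examples 3-21 and 3-31 give a rough sense of how far the current setup is from the Release 3.0 objective of 10 million triples. The arithmetic below assumes processing time scales linearly with triple count and treats the quoted hours as exact, both of which are simplifications.

```python
# Figures quoted in Examples 3-21 and 3-31.
runs = {
    "NALT": {"triples": 515_530, "hours": 5},
    "NCIT": {"triples": 946_520, "hours": 9},
}
TARGET_TRIPLES = 10_000_000  # Release 3.0 objective


def triples_per_second(triples, hours):
    """Average throughput over a run."""
    return triples / (hours * 3600)


def hours_for(target, rate_per_second):
    """Hours needed for `target` triples, assuming the rate stays constant."""
    return target / rate_per_second / 3600


for name, run in runs.items():
    rate = triples_per_second(run["triples"], run["hours"])
    estimate = hours_for(TARGET_TRIPLES, rate)
    print(f"{name}: {rate:.1f} triples/s, about {estimate:.0f} h for {TARGET_TRIPLES:,} triples")
```

Both runs work out to roughly 29 triples per second, so under the linear assumption a 10 million triple graph would take about four days on a single process.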

Example 3-32

ChEMATIC (Chemical Entities with Medical Applications, Therapeutic Indications & Consequences) is an application of data from NIH NCIT & NIH Medical Subject Headings (MeSH)

Several other ontologies complement NCIT & MeSH JSON-LD structures

Biochemicals are mapped to hierarchical JSON-LD structures

Total ChEMATIC “triples” (structures and object maps): 700+ million

WEDGE ChEMATIC

Release 3.0 Issues

SDL/AWS/ECS is configured as a Docker container but improved methods will be needed to install SDL on best–available AWS/EC2 server

To reduce processing duration, need methods to use multiple CPU cores

SDL/AWS/EC2 is expensive to run — need to implement a business model to offset operating expenses

Rel-3 Issues on GitHub

Release 4.0 Design Goals

Support Shapes Constraint Language (SHACL) — a specification for validating graph–based data against a set of conditions

Support Shape Expressions (ShEx) — an RDF language for identifying predicates and their associated cardinalities and datatypes

Schemarama on GitHub

Item 4-21

Tim Berners‐Lee on SHACL & ShEx:

Shapes explain to machines what data should look like, independently of how that data is displayed to a user

Forms are a user interface allowing people to read and write data in a specific shape

Footprints explain to machines where new data should be stored

TBL presentation

Item 4-22

Ruben Verborgh on Shapes & Linked Data:

Apps should be coded against shapes [and] Linked Data so other apps can reuse them

[Where] vocabularies provide a list of possible attributes, shapes mandate a specific structure for data, combining attributes from vocabularies in a certain way

Footprints explain to machines where new data should be stored

Ruben Verborgh article

Item 4-23

Key findings in the US PubMed/NCBI article “Automatic Generation of SHACL Shapes from Ontologies”

OWL and SHACL are not equivalent in their interpretation

There are differences in how OWL interprets restrictions (for inferencing) and how SHACL interprets constraints (for validation)

PubMed SHACL article

Item 4-31

Glucosinolates are natural components of many pungent plants such as broccoli, mustard, cabbage, and horseradish

US NIH NCI review of links between cruciferous vegetable intake & lung cancer risk concluded that high intake may decrease risk in a range of 17–23%

Other studies report similar risk reductions for colorectal, breast, kidney, esophageal, & oropharyngeal (mouth & throat) cancers

NLM Glucs article

Example 4-32

American Food Data Systems Institute (AFDSI) & Ontomatica participate in food & agriculture research projects

One WEDGE project integrated & synthesized glucosinolate data from many studies

WEDGE–Glucosinolates enables Principal Investigators & researchers to visualize relationships that otherwise are difficult to understand & analyze

WEDGE–Glucosinolates

Example 4-33

With an objective of creating a Knowledge Graph, glucosinolate data was difficult to synthesize & integrate

Observations & measurement methods were irregular

Plant taxa & genetic variety data was regular, but ‘part of plant’ designations were irregular

Research process would have been easier & more accurate if shape data had been enforced during preparations & observations

Glucs “fingerprint data”

Example 4-34

Force Directed Graph represents integration of data specifications (from ontologies) & data constraints (to ensure data quality)

“Ontology part” of graph (taxa & ‘part of plant’) is visible in WEDGE–Glucosinolates

“Shape part” of graph (represented as SHACL in TTL format) is in footer

Force Directed Graph page

Item 4-41

Diabetes is a debilitating & life-threatening disease

Research about & remedies for diabetes depend on precise information where “the devil is in the details”

This NCBI article is an overview

NCBI Diabetes article

Example 4-42

ChEMATIC is a WEDGE application to visualize relationships among biochemistry, factor inputs & human conditions

ChEMATIC does not document opinions (something is good or bad); it only documents items & their relationships

Medical & nutrition experts use ChEMATIC information to express opinions & advice

This graph visualizes data about Diabetes Mellitus, Type 2

ChEMATIC Diabetes Mellitus graph

Example 4-43

Diabetes observation & monitoring are key parts of a personalized remedy

First we need to specify the shape of glucose observations

Then we need to integrate observation shapes with monitored glucose data

Example 4-44 illustrates an observation graph for glucose

Dexcom Diabetes Monitor

Example 4-44: Graph of ShEx Observation for glucose
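To make the idea of an observation shape concrete, here is a plain-Python analogue of the cardinality and datatype checks a shape expresses. It is not ShEx or SHACL syntax and does not use a shape engine; the property names, the sample readings, and the validate function are all hypothetical.

```python
from datetime import datetime

# Hypothetical shape: property -> (expected Python type, min count, max count).
# This mimics the cardinality and datatype constraints a shape expresses;
# it is not ShEx or SHACL syntax.
GLUCOSE_SHAPE = {
    "value": (float, 1, 1),
    "unit": (str, 1, 1),
    "observedAt": (datetime, 1, 1),
    "note": (str, 0, 1),
}


def validate(observation, shape):
    """Return a list of human-readable violations; empty means conforming."""
    problems = []
    for prop, (expected, lo, hi) in shape.items():
        raw = observation.get(prop, [])
        values = raw if isinstance(raw, list) else [raw]
        if not lo <= len(values) <= hi:
            problems.append(f"{prop}: expected {lo}..{hi} values, found {len(values)}")
        for v in values:
            if not isinstance(v, expected):
                problems.append(f"{prop}: {v!r} is not {expected.__name__}")
    # Treat the shape as closed: properties it does not list are violations.
    for prop in observation:
        if prop not in shape:
            problems.append(f"{prop}: not allowed by shape")
    return problems


# Hypothetical readings, for illustration only.
good = {"value": 5.6, "unit": "mmol/L", "observedAt": datetime(2023, 1, 1, 8, 0)}
bad = {"value": "high", "observedAt": datetime(2023, 1, 1, 9, 0), "mood": "ok"}
print(validate(good, GLUCOSE_SHAPE))
print(validate(bad, GLUCOSE_SHAPE))
```

Rejecting the second reading at capture time is the kind of enforcement that would let monitored data be integrated without later cleanup.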

Example 4-45

Visualizing Dexcom Observation Data - Hourly

Visualize hourly data

Example 4-46

Visualizing Dexcom Observation Data - Daily

Visualize daily data

Example 4-47

Visualizing Dexcom Observation Data - Histogram

Visualize histogram

Twitter: ShEx SHACL Announcements

Item 4-61

Develop specification & design for implementing SHACL & ShEx in SDL

Simplify workflow that involves at least 2 source files (ontology & shape) & possibly more than one data structure (JSON-LD & TTL)

Explain at least three conditions: ontology messages, shape messages, & ontology/shape integration messages

Reconcile irregularity between ontology constraints & shape constraints

Patel-Schneider-SHACL

Release 5.0 Design Goals

Support other ontologies — in addition to schema.org

In addition to @Context registration of vocabulary terms, support reasoning about ontology–specific grammar

Enable vocabulary & reasoning for SKOS–based datasets

Enable vocabulary & reasoning for OWL–based datasets
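One example of ontology-specific reasoning for SKOS is deriving every ancestor of a concept from its direct skos:broader links, which is what skos:broaderTransitive captures. The sketch below shows the idea with a tiny hand-made concept scheme; the concept labels and the function name are hypothetical and are not taken from any of the datasets named in this presentation.

```python
from collections import defaultdict

# Hypothetical concept scheme: (narrower concept, broader concept) pairs,
# standing in for skos:broader statements.
broader = [
    ("broccoli", "brassica"),
    ("cabbage", "brassica"),
    ("brassica", "vegetables"),
    ("vegetables", "plant products"),
]


def broader_transitive(pairs):
    """Map each concept to the set of all its ancestors via skos:broader."""
    direct = defaultdict(set)
    for narrow, broad in pairs:
        direct[narrow].add(broad)

    def ancestors(concept, seen):
        for parent in direct.get(concept, ()):
            if parent not in seen:  # also protects against cycles
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    return {c: ancestors(c, set()) for c in list(direct)}


closure = broader_transitive(broader)
print(sorted(closure["broccoli"]))
```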

Example 5-21

UN FAO AgroVoc is a SKOS–based dataset

AgroVoc is a multilingual controlled vocabulary covering all areas of interest to the Food & Agriculture Organization of the United Nations, including food, nutrition, agriculture, fisheries, forestry & the environment.

UN FAO – Brassica

Example 5-22

US Library of Congress Subject Headings (LCSH) is a SKOS–based dataset

The Library of Congress Subject Headings (LCSH) comprise a thesaurus (controlled vocabulary) of subject headings, maintained by the United States Library of Congress, for use in bibliographic records

US LOC – herbicide

Example 5-31

Plant Ontology is an OWL–based dataset

“archegonium head” is referenced in WEDGE–Glucosinolates

PO – “archegonium head”

Example 5-32

Avocado Ontology is an OWL–based dataset

Avocado is a popular food & popular ingredient in other foods

Avocado ontology

Example 5-41

US NIH PubChem is a multi–ontology dataset

PubChem is a database of chemical molecules & their activities against biological assays

Author: National Center for Biotechnology Information (NCBI); partOf United States National Institutes of Health (NIH)

More than 80 database vendors contribute to PubChem

PubChem RDF

Example 5-51

Wedge–FNDDS (Food & Nutrient Database for Dietary Studies) is a multi–ontology dataset

FNDDS includes foods & beverages nutrition data reported in “What We Eat in America”

FNDDS is an application of OWL–based ontologies including AFDSI’s Vocal (acronym for the phrase “Vocabularium Alimentarum — Vocabulary of Food”)

foods made with brassicas on Wedge–FNDDS

Release 5.0 Issues

Production issues will be more complicated than Release 3.0

May be difficult to load an SDL–instance configured with schema.org–based datasets + SKOS–based datasets + OWL–based datasets

Processing duration could be long (days!)

Provenance and Document Properties

author
Ankita Dhandha
organization
Ontomatica
date published
22-12-01
date modified
22-12-10
modification note
Added Section 4
date modified
22-12-16
modification note
Added Section 5
date modified
23-01-01
release note
Added information based on Gregg Kellogg email
date modified
23-07-07
release note
Updated to use Schema.org Markup Validator (SMV)
date modified
23-12-10
release note
updated story format
date modified
24-01-11
release note
more future information