Home
Welcome to the SoilWise Knowledge Base!
LLA
Design Document: Link Liveliness Assessment
Introduction
Component Overview and Scope
The linkchecker component is designed to evaluate the validity and availability of links within metadata records advertised via an OGC API - Records endpoint.
A link in a metadata record points to one of:
- another metadata record
- a downloadable instance (pdf/zip/sqlite/mp4/pptx) of the resource
- the resource itself
- documentation about the resource
- identifier of the resource (DOI)
- a webservice or API (sparql, openapi, graphql, ogc-api)
For a set of metadata records, the linkchecker evaluates whether:
- the links to external sources are valid
- the links within the repository are valid
- the link metadata accurately represents the resource (MIME type, size, data model, access constraints)
If the endpoint is an API, some sanity checks can be performed on it:
- Identify whether the API adopts a known API standard
- If a standard is adopted, whether the API supports the basic operations of that standard
- Whether the metadata correctly mentions the standard
The component returns an HTTP status code, e.g. 200 OK, 401 Unauthorized, 404 Not Found, 500 Server Error, and a timestamp.
The component runs an evaluation for a single resource on request, or runs tests at intervals to build a history of availability.
The results of the evaluation can be retrieved via an API. The API is built with the FastAPI framework and can be deployed as a Docker container.
The evaluation process runs as a scheduled CI/CD pipeline in GitLab and uses 5 worker threads to improve performance.
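A minimal sketch of how such a threaded batch run could look, assuming only the standard library (function names and the worker pool setup are illustrative, not the actual pipeline code):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def check_status(url, timeout=10):
    """Return the HTTP status code for a single URL (None on network failure)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:        # the server answered with an error code
        return err.code
    except URLError:                # timeout, DNS failure, connection refused, ...
        return None

def run_batch(urls, workers=5):
    """Check a batch of URLs with the 5 worker threads mentioned above."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(check_status, urls)))
```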
Users
- all data users (authorised + unauthorised)
- can see the results of link evaluation in the Catalogue (if XX is integrated) or access the API directly to retrieve reports
- administrators
- can configure and manually start the evaluation process
- can see the history of link evaluations
References
Requirements
Functional Requirements
Users want to understand the availability of a resource before they click a link; it helps them anticipate whether the click will be successful. Availability can be indicated as: not available (404), sometimes available (as a percentage), authorisation required, deprecated (failed too many consecutive tests), or unknown (due to timeouts or other failures). A sketch of how these categories could be derived is given below.
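A minimal sketch of how these availability categories could be derived from a link's test history (the thresholds, labels and data shape are illustrative assumptions):

```python
from typing import List, Optional

def classify_availability(history: List[Optional[int]], deprecation_threshold: int = 10) -> str:
    """Map a list of HTTP status codes (None = timeout or other failure) to an availability label."""
    if not history:
        return "unknown"
    recent = history[-deprecation_threshold:]
    if len(recent) == deprecation_threshold and all(s is None or s >= 400 for s in recent):
        return "deprecated"                      # failed in every one of the last N tests
    last = history[-1]
    if last == 401:
        return "authorisation required"
    if last == 404:
        return "not available"
    ok = sum(1 for s in history if s is not None and 200 <= s < 300)
    if ok == 0:
        return "unknown"
    if ok == len(history):
        return "available"
    return f"sometimes available ({ok / len(history):.0%})"
```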
Non-functional Requirements
- Should respect rules in robots.txt
- Should not introduce vulnerabilities
Architecture
Technological Stack
- Core Language: Python with the standard urllib library, used for the linkchecker, the API, and database interactions.
- Database: PostgreSQL, used for storing and managing link and validation information.
- Backend Framework: FastAPI, used to create and expose REST API endpoints, benefiting from its efficiency and auto-generated components such as Swagger documentation.
- Frontend: see Integrations & Interfaces.
- Containerization: Docker, used to containerize the linkchecker application, ensuring consistent deployment and execution across environments.
Overview of Key Features
- Link validation: Returns HTTP status codes for each link, along with other important information such as the parent URL, any warnings, and the date and time of the test. Additionally, the tool enhances link analysis by identifying various metadata attributes, including file format type (e.g., image/jpeg, application/pdf, text/html), file size (in bytes), and last modification date. This provides users with valuable insights about the resource before accessing it.
- Broken link categorization: Identifies and categorizes broken links based on status codes, including Redirection Errors, Client Errors, and Server Errors.
- Deprecated links identification: Flags links as deprecated if they have failed for X consecutive tests; in our case X equals 10. Deprecated links are excluded from future tests to optimize performance.
- Timeout management: Allows the identification of URLs that exceed a timeout threshold, which can be set manually as a parameter in the linkchecker's properties.
- Availability monitoring: When run periodically, the tool builds a history of availability for each URL, enabling users to view the status of links over time.
- OWS services (WMS, WFS, WCS, CSW) typically return an HTTP 500 error when called without the necessary parameters. Handling for these services has been added to detect them and include the necessary parameters before checking; a sketch of the link check including this handling follows this list.
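A minimal sketch of a single link check along these lines, combining a HEAD request via urllib with the metadata attributes and the OWS workaround described above (the detection of OWS endpoints and the exact parameters are assumptions, not the component's actual code):

```python
from urllib.error import HTTPError, URLError
from urllib.parse import parse_qs, urlparse
from urllib.request import Request, urlopen

OWS_SERVICES = ("wms", "wfs", "wcs", "csw")

def prepare_url(url: str) -> str:
    """Append a GetCapabilities request to bare OWS endpoints, which otherwise answer with HTTP 500."""
    query_keys = {k.lower() for k in parse_qs(urlparse(url).query)}
    if "request" in query_keys:
        return url
    for service in OWS_SERVICES:
        if service in url.lower():
            separator = "&" if "?" in url else "?"
            return f"{url}{separator}service={service.upper()}&request=GetCapabilities"
    return url

def check_link(url: str, timeout: int = 10) -> dict:
    """Return the HTTP status plus the metadata attributes described above."""
    try:
        request = Request(prepare_url(url), method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return {
                "url": url,
                "status": response.status,
                "link_type": response.headers.get("Content-Type"),    # e.g. image/jpeg, application/pdf
                "link_size": response.headers.get("Content-Length"),  # bytes, if reported by the server
                "last_modified": response.headers.get("Last-Modified"),
            }
    except HTTPError as err:
        return {"url": url, "status": err.code}
    except URLError as err:
        return {"url": url, "status": None, "error": str(err.reason)}
```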
Component Diagrams
flowchart LR
H["Harvester"]-- "writes" -->MR[("Record Table")]
MR-- "reads" -->LAA["Link Liveliness Assessment"]
MR-- "reads" -->CA["Catalogue"]
LAA-- "writes" -->LLAL[("Links Table")]
LAA-- "writes" -->LLAVH[("Validation History Table")]
CA-- "reads" -->API["**API**"]
LLAL-- "writes" -->API
LLAVH-- "writes" -->API
Sequence Diagram
sequenceDiagram
participant Linkchecker
participant DB
participant Catalogue
Linkchecker->>DB: Establish Database Connection
Linkchecker->>Catalogue: Extract Relevant URLs
loop URL Processing
Linkchecker->DB: Check URL Existence
Linkchecker->DB: Check Deprecation Status
alt URL Not Deprecated
Linkchecker-->DB: Insert/Update Records
Linkchecker-->DB: Insert/Update Links with file format type, size, last_modified
Linkchecker-->DB: Update Validation History
else URL Deprecated
Linkchecker-->DB: Skip Processing
end
end
Linkchecker->>DB: Close Database Connection
Database Design
classDiagram
Links <|-- Validation_history
Links <|-- Records
Links : +Int ID
Links : +Int fk_records
Links : +String Urlname
Links : +String deprecated
Links : +String link_type
Links : +Int link_size
Links : +DateTime last_modified
Links : +String Consecutive_failures
class Records{
+Int ID
+String Records
}
class Validation_history{
+Int ID
+Int fk_link
+String Statuscode
+String isRedirect
+String Errormessage
+Date Timestamp
}
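A minimal SQL sketch of the three tables from the diagram, executed from Python with psycopg2 (the column types and the driver are assumptions; the deployed schema may differ):

```python
import psycopg2  # assumption: psycopg2 is the PostgreSQL driver in use

DDL = """
CREATE TABLE IF NOT EXISTS records (
    id      SERIAL PRIMARY KEY,
    records TEXT
);
CREATE TABLE IF NOT EXISTS links (
    id                   SERIAL PRIMARY KEY,
    fk_records           INTEGER REFERENCES records (id),
    urlname              TEXT,
    deprecated           TEXT,
    link_type            TEXT,       -- media type, e.g. application/pdf
    link_size            BIGINT,     -- bytes
    last_modified        TIMESTAMP,
    consecutive_failures INTEGER
);
CREATE TABLE IF NOT EXISTS validation_history (
    id           SERIAL PRIMARY KEY,
    fk_link      INTEGER REFERENCES links (id),
    statuscode   TEXT,
    is_redirect  TEXT,
    errormessage TEXT,
    timestamp    TIMESTAMP
);
"""

def create_schema(dsn: str) -> None:
    """Create the tables from the class diagram above (illustrative, not the deployed schema)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(DDL)
```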
Integrations & Interfaces
- Visualisation of the evaluation in the Metadata Catalogue; the assessment report is retrieved using AJAX from each record page
- FastAPI now incorporates additional metadata for links, including file format type, size, and last modified date.
Key Architectural Decisions
Initially we started with the linkchecker library, but performance was very slow because it tested the same links again for each page.
We decided to only test the links section of OGC API - Records responses, which means that links within, for example, the metadata abstract are no longer tested.
OGC OWS services make up a substantial portion of the links; these services return error 500 if called without parameters. For this scenario we created a dedicated script.
If tests for a resource fail a number of consecutive times, the resource is no longer tested and is tagged as deprecated.
Links via a facade, such as a DOI, are followed to the page they refer to, so the LLA tool can understand the relation between the DOI and the page it resolves to.
For each link it is known in which record(s) it is mentioned, so if a link breaks, a contact to notify can be found in the record.
For the second release we have enhanced the link liveliness assessment tool to collect more information about the resources:
- File type format (media type) to help users understand what format they'll be accessing (e.g., image/jpeg, application/pdf, text/html)
- File size to inform users about download expectations
- Last modification date to indicate how recent the resource is
API Updates
The API has been extended to include the newly tracked metadata fields:
- link_type: the file format type of the resource (e.g., image/jpeg, application/pdf)
- link_size: the size of the resource in bytes
- last_modified: the timestamp at which the resource was last modified
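A minimal FastAPI sketch of how these fields could be exposed (the route and model names are illustrative; they are not necessarily the actual endpoints):

```python
from datetime import datetime
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class LinkStatus(BaseModel):
    urlname: str
    statuscode: int
    link_type: Optional[str] = None      # e.g. image/jpeg, application/pdf
    link_size: Optional[int] = None      # bytes
    last_modified: Optional[datetime] = None

@app.get("/links/{link_id}", response_model=LinkStatus)
def get_link(link_id: int) -> LinkStatus:
    # Illustrative only: the real endpoint reads the links table in PostgreSQL.
    return LinkStatus(
        urlname="https://example.org/data.pdf",
        statuscode=200,
        link_type="application/pdf",
        link_size=1048576,
        last_modified=datetime(2024, 1, 1),
    )
```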
Next Steps
We plan to enhance the link liveliness assessment tool to include geospatial attributes such as field details and spatial information from various data formats (e.g., GeoTIFF, Shapefile, CSV). This will be accomplished using GDAL and OWSLib to enable efficient retrieval without the need to download the full files.
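A possible approach for the GDAL part, using its /vsicurl/ virtual file system to read remote raster metadata through HTTP range requests without downloading the full file (a sketch of the idea, not the planned implementation):

```python
from osgeo import gdal  # assumption: the GDAL Python bindings are available

def remote_raster_info(url: str) -> dict:
    """Read basic geospatial attributes of a remote GeoTIFF via /vsicurl/ (no full download)."""
    dataset = gdal.Open(f"/vsicurl/{url}")
    if dataset is None:
        return {"url": url, "error": "could not open resource"}
    return {
        "url": url,
        "size": (dataset.RasterXSize, dataset.RasterYSize),
        "bands": dataset.RasterCount,
        "projection": dataset.GetProjection(),
        "geotransform": dataset.GetGeoTransform(),
    }
```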
Risks & Limitations
TBD
MD Augmentation
Intro
The metadata augmentation repository contains a number of modules, each performing specific tasks on the metadata. For each module a dedicated design document is provided.
Design Document: NER Augmentation
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Deduplication
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: EUOS-high-value dataset tagging
Introduction
Component Overview and Scope
The EUSO high-value datasets are those with substantial potential to assess soil health status, as detailed on the EUSO dashboard. This framework includes the concept of metadata-based identification and tagging of soil degradation indicators. Each dataset (possibly only those with a supra-national spatial scope - under discussion) will be annotated with the potential soil degradation indicator for which it might be utilised. Users can then filter these datasets according to their specific needs.
The EUSO soil degradation indicators employ specific methodologies and thresholds to determine soil health status.
Users
- authorised data users
- unauthorised data users
- administrators
References
Requirements
Functional Requirements
- For critical / high-value resources, SoilWise will either link with them or store them
Non-functional Requirements
- Labelling data results as high-value data sets will rely on EUSO/ESDAC criteria, which are not identical to the High Value Data directive criteria.
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
flowchart TD
subgraph ic[Indicators Search]
ti([Title Check]) ~~~ ai([Abstract Check])
ai ~~~ ki([Keywords Check])
end
subgraph Codelists
sd ~~~ si
end
subgraph M[Methodologies Search]
tiM([Title Check]) ~~~ aiM([Abstract Check])
kl([Links check]) ~~~ kM([Keywords Check])
end
m[(Metadata Record)] --> ic
m --> M
ic-- + ---M
sd[(Soil Degradation Codelist)] --> ic
si[(Soil Indicator Codelist)] --> ic
em[(EUSO Soil Methodologies list)] --> M
M --> et{{EUSO High-Value Dataset Tag}}
et --> m
ic --> es{{Soil Degradation Indicator Tag}}
es --> m
th[(Thesauri)]-- synonyms ---Codelists
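A minimal sketch of the title/abstract/keywords checks against a soil degradation codelist (the codelist terms and record fields are placeholders, not the actual EUSO lists):

```python
# Placeholder codelist: the real terms come from the EUSO soil degradation and indicator codelists.
SOIL_DEGRADATION_TERMS = {"soil erosion", "soil compaction", "salinisation", "loss of organic matter"}

def match_indicators(record: dict) -> set:
    """Return the soil degradation indicators mentioned in title, abstract or keywords."""
    text = " ".join([
        record.get("title", ""),
        record.get("abstract", ""),
        " ".join(record.get("keywords", [])),
    ]).lower()
    return {term for term in SOIL_DEGRADATION_TERMS if term in text}

# A record would be tagged as an EUSO high-value dataset candidate when at least one indicator matches.
```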
Database Design
Integrations & Interfaces
- Spatial scope analyser
- Catalogue
- Knowledge graph
Key Architectural Decisions
Risks & Limitations
Design Document: Keyword Matcher
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Spatial Locator
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Spatial Scope Analyser
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Translation
Introduction
Some imported records are in a non-English language. In order to improve their discoverability, the component uses the EU translation service to translate the title and abstract to English. The keywords are not translated, because they are handled by the keyword matcher.
Component Overview and Scope
Users
References
Requirements
Functional Requirements
- Identify whether the record benefits from translation
- Identify the source language (see the sketch after this list)
- Request the translation
- Receive the translation (asynchronous)
- Update the catalogue with translated content
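A minimal sketch of the first two steps, assuming a language-detection library such as langdetect (the actual component may instead rely on the metadata language element):

```python
from langdetect import detect  # assumption: langdetect is used for language identification

def needs_translation(record: dict) -> bool:
    """Decide whether title/abstract should be sent to the EU translation service."""
    text = f"{record.get('title', '')} {record.get('abstract', '')}".strip()
    if not text:
        return False
    source_language = detect(text)   # e.g. 'nl', 'de', 'fr'
    record["source_language"] = source_language
    return source_language != "en"
```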
Non-functional Requirements
- Prevent the service from being misused as a proxy to the EU translation service
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Validation
Abstract test Suite
The elements below are tested for availability in the suggested cardinality and type; a sketch of such checks follows the table.
Element DC | Element ISO | Cardinality | Type | Codelist | Comment |
---|---|---|---|---|---|
identifier | fileidentifier | 1-n | string | - | |
title | title | 1-n | string | - | |
language | language | 0-n | string | - | 2/3/5-letter iso? |
description | abstract | 0-n | string | | |
date | date | 0-n | date | | |
distribution | distributioninfo | 0-n | str or uri | | |
contributor | contact#? | 0-n | str or uri | | |
creator | contact#author | 0-n | str or uri | | |
publisher | contact#distributor | 0-n | str or uri | | |
coverage-temporal | extent#temporal | 0-n | date-period | | |
coverage-spatial | extent#spatial | 0-n | str, uri or bbox | | |
rights | constraints | 0-1 | str or uri | | |
license | constraints | 0-1 | str or uri | | |
subject | keyword/topiccategory | 0-n | str or uri | | |
type | hierarchylevel | 1-1 | str or uri | | |
format | format | 0-1 | str or uri | | |
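A minimal sketch of how the cardinality and type checks in this table could be automated (the rule set is abbreviated and the record structure is an assumption):

```python
# Each rule: element name -> (minimum occurrences, maximum occurrences or None for unbounded, expected type)
RULES = {
    "identifier": (1, None, str),
    "title":      (1, None, str),
    "language":   (0, None, str),
    "type":       (1, 1, str),
    "license":    (0, 1, str),
}

def validate_record(record: dict) -> list:
    """Return a list of findings for elements that violate cardinality or type."""
    findings = []
    for element, (minimum, maximum, expected_type) in RULES.items():
        values = record.get(element, [])
        if not isinstance(values, list):
            values = [values]
        if len(values) < minimum:
            findings.append(f"{element}: expected at least {minimum} value(s), found {len(values)}")
        if maximum is not None and len(values) > maximum:
            findings.append(f"{element}: expected at most {maximum} value(s), found {len(values)}")
        findings.extend(
            f"{element}: value {value!r} is not of type {expected_type.__name__}"
            for value in values if not isinstance(value, expected_type)
        )
    return findings
```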
Design Document: Metadata validator
Introduction
Component Overview and Scope
The SoilWise Repository aims to harvest and register as much as possible. Catalogues which capture metadata authored by data custodians typically show a wide range of metadata completeness and accuracy. Therefore, the SoilWise Repository employs metadata validation mechanisms to provide additional information about metadata completeness, conformance and integrity. Information resulting from the validation process is stored together with each metadata record in a relational database and updated after a new metadata version is registered.
It is important to understand that this activity runs on unprocessed records at entry to the system. Since there is an n:1 relation between ingested records and the deduplicated, enhanced records in the SoilWise repository, a validation result should always be evaluated by its combination of record identifier, source and moment of harvesting. Validation results can best be visualised at record level, by showing a historic list of the sources which contributed to the record.
Users
- authorised data users
- see validation results (log) per each metadata record
- unauthorised data users
- administrators
- manually run validation
- see summary validation results
- monitor validation process
References
Requirements
Functional Requirements
- regular automated validation, at least once per six months after the first project year
- validation result is stored in database related to the harvested record
- metadata of datasets, documents, journal articles, ...
- metadata in ISO19139:2007, Dublin Core, ...
- information about metadata completeness (are elements populated)
- information about metadata conformance (schema, profile, ...)?
- information about metadata integrity (are elements consistent with each other)
- Link liveliness assessment
Nice to have
- Provide means for validation of data
- Data structure
- Content integrity
- Data duplication
- Capture results of manual tests (in case not all test cases can be automated)
Non-functional Requirements
- follow the ATS/ETS approach and implement AI/ML enrichment
- reach TRL 7
- adopt the ISO 19157 data quality measures
- JRC does not want to discourage data providers from publishing metadata by visualising that they are not conformant with a set of rules
Architecture Abstract test suites
Technological Stack
Markdown file on GitHub
Architecture Executable test suites
Technological Stack
Harvested records are stored in the 'harvest.items' table (Postgres database), where they are identified by their hash. Multiple editions of the same record are available, each of which can have a different validation result.
A scheduled process (CI/CD, Argo Workflows) triggers a task to determine which items have not been validated before and validates them.
Results are stored in the validation result tables described under Database Design.
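A minimal sketch of the selection step, assuming Postgres is queried via psycopg2 and that harvested items and validation results are related via the record hash (table and column names are assumptions based on the text above and the Database Design section):

```python
import psycopg2  # assumption: the scheduled task talks to Postgres via psycopg2

UNVALIDATED_QUERY = """
SELECT i.hash
FROM harvest.items AS i
LEFT JOIN validation_results AS v ON v.record_hash = i.hash
WHERE v.record_hash IS NULL;          -- items that have no validation result yet
"""

def fetch_unvalidated(dsn: str) -> list:
    """Return the harvested items that still need to be validated."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(UNVALIDATED_QUERY)
        return cur.fetchall()
```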
Overview of Key Features
- Link Liveliness Assessment: validates whether the referenced link is currently working or deprecated. This is a separate component, which returns an HTTP status code, e.g. 200 OK, 401 Unauthorized, 404 Not Found, 500 Server Error, and a timestamp.
Component Diagrams
Sequence Diagram
Database Design
Two tables, related via the record hash, because the result is stored per indicator; the main table holds an overall summary.
Validation-results
record-hash (str) | result-summary (int) | date (date) |
---|---|---|
UaE4GeF | 64 | 2025-01-12T11:06:34Z |
Validation-by-indicator
record-hash (str) | indicator (str) | result (str) |
---|---|---|
UaE4GeF | completeness | 77 |
Integrations & Interfaces
- Validation components run as a scheduled process, using the harvest.items table as a source
- Link Liveliness Assessment runs as a scheduled process, using public.records as a source
- The Harvester stores the metadata in harvest.items as-is, so that validation results at data entry are clean
- The Catalogue may provide an interface to the validation results, but this requires authentication
- Storage
Key Architectural Decisions
- for the first iteration, Hale Studio was selected, with restricted access to the validation results
- SHACL validation was discussed to be implemented in next iterations (for GeoDCAT-AP and Dublin Core)
- minimal SoilWise profile was discussed to indicate compliance with SoilWise functionality
- EUSO Metadata profile was discussed
- two-step validation of metadata was discussed, at first using harvested metadata, and next using SoilWise-augmented metadata
Risks & Limitations
- Hale Studio currently does not support Dublin Core
- Users may expect a single validation result and not a historic list of validated sources
- Records from some sources require processing before they can be tested as ISO 19139 or Dublin Core; there is a risk that metadata errors are introduced in pre-processing of the records, in which case the test validates the pre-processing software more than the metadata itself