Home
Welcome to the SoilWise Knowledge Base!
LLA
Design Document: Link Liveliness Assessment
Introduction
Component Overview and Scope
The linkchecker component is designed to evaluate the validity and availability of links within metadata records advertised via an OGC API - Records endpoint.
A link in a metadata record points to one of:
- another metadata record
- a downloadable instance (pdf/zip/sqlite/mp4/pptx) of the resource
- the resource itself
- documentation about the resource
- identifier of the resource (DOI)
- a webservice or API (sparql, openapi, graphql, ogc-api)
For a set of metadata records, the linkchecker evaluates whether:
- the links to external sources are valid
- the links within the repository are valid
- the link metadata accurately represents the resource (MIME type, size, data model, access constraints)
If the endpoint is an API, some sanity checks can be performed on it:
- Identify whether the API adopts a known API standard
- If a standard is adopted, whether the API supports the basic operations of that standard
- Whether the metadata correctly mentions the standard
The component returns an HTTP status code, e.g. 200 OK, 401 Unauthorized, 404 Not Found, 500 Server Error, and a timestamp.
The component runs an evaluation for a single resource on request, or runs tests at intervals to build a history of availability.
The results of the evaluation can be retrieved via an API. The API is built with the FastAPI framework and can be deployed as a Docker container.
The evaluation process runs as a scheduled CI/CD pipeline in GitLab and uses 5 worker threads to improve performance.
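A minimal sketch of how such a threaded batch run could look, assuming only the standard library (function names and the worker pool setup are illustrative, not the actual pipeline code):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def check_status(url, timeout=10):
    """Return the HTTP status code for a single URL (None on network failure)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:        # the server answered with an error code
        return err.code
    except URLError:                # timeout, DNS failure, connection refused, ...
        return None

def run_batch(urls, workers=5):
    """Check a batch of URLs with the 5 worker threads mentioned above."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(check_status, urls)))
```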
Users
- all data users (authorised + unauthorised)
- can see the results of link evaluation in the Catalogue (if XX is integrated) or access the API directly to retrieve reports
- administrators
- can configure and manually start the evaluation process
- can see the history of link evaluations
References
Requirements
Functional Requirements
Users want to understand the availability of a resource before they click a link; it helps them anticipate whether the click will be successful. Availability can be indicated as: not available (404), sometimes available (as a percentage), authorisation required, deprecated (failed too many consecutive tests), or unknown (due to timeouts or other failures). A sketch of how these categories could be derived is given below.
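A minimal sketch of how these availability categories could be derived from a link's test history (the thresholds, labels and data shape are illustrative assumptions):

```python
from typing import List, Optional

def classify_availability(history: List[Optional[int]], deprecation_threshold: int = 10) -> str:
    """Map a list of HTTP status codes (None = timeout or other failure) to an availability label."""
    if not history:
        return "unknown"
    recent = history[-deprecation_threshold:]
    if len(recent) == deprecation_threshold and all(s is None or s >= 400 for s in recent):
        return "deprecated"                      # failed in every one of the last N tests
    last = history[-1]
    if last == 401:
        return "authorisation required"
    if last == 404:
        return "not available"
    ok = sum(1 for s in history if s is not None and 200 <= s < 300)
    if ok == 0:
        return "unknown"
    if ok == len(history):
        return "available"
    return f"sometimes available ({ok / len(history):.0%})"
```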
Non-functional Requirements
- Should respect rules in robots.txt
- Should not introduce vulnerabilities
Architecture
Technological Stack
- Core Language: Python with the standard urllib library, used for the linkchecker, the API, and database interactions.
- Database: PostgreSQL, used for storing and managing link and validation information.
- Backend Framework: FastAPI, used to create and expose REST API endpoints, benefiting from its efficiency and auto-generated components such as Swagger documentation.
- Frontend: see Integrations & Interfaces.
- Containerization: Docker, used to containerize the linkchecker application, ensuring consistent deployment and execution across environments.
Overview of Key Features
- Link validation: Returns HTTP status codes for each link, along with other important information such as the parent URL, any warnings, and the date and time of the test. Additionally, the tool enhances link analysis by identifying various metadata attributes, including file format type (e.g., image/jpeg, application/pdf, text/html), file size (in bytes), and last modification date. This provides users with valuable insights about the resource before accessing it.
- Broken link categorization: Identifies and categorizes broken links based on status codes, including Redirection Errors, Client Errors, and Server Errors.
- Deprecated links identification: Flags links as deprecated if they have failed for X consecutive tests; in our case X equals 10. Deprecated links are excluded from future tests to optimize performance.
- Timeout management: Allows the identification of URLs that exceed a timeout threshold, which can be set manually as a parameter in the linkchecker's properties.
- Availability monitoring: When run periodically, the tool builds a history of availability for each URL, enabling users to view the status of links over time.
- OWS services (WMS, WFS, WCS, CSW) typically return an HTTP 500 error when called without the necessary parameters. Handling for these services has been added to detect them and include the necessary parameters before checking; a sketch of the link check including this handling follows this list.
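A minimal sketch of a single link check along these lines, combining a HEAD request via urllib with the metadata attributes and the OWS workaround described above (the detection of OWS endpoints and the exact parameters are assumptions, not the component's actual code):

```python
from urllib.error import HTTPError, URLError
from urllib.parse import parse_qs, urlparse
from urllib.request import Request, urlopen

OWS_SERVICES = ("wms", "wfs", "wcs", "csw")

def prepare_url(url: str) -> str:
    """Append a GetCapabilities request to bare OWS endpoints, which otherwise answer with HTTP 500."""
    query_keys = {k.lower() for k in parse_qs(urlparse(url).query)}
    if "request" in query_keys:
        return url
    for service in OWS_SERVICES:
        if service in url.lower():
            separator = "&" if "?" in url else "?"
            return f"{url}{separator}service={service.upper()}&request=GetCapabilities"
    return url

def check_link(url: str, timeout: int = 10) -> dict:
    """Return the HTTP status plus the metadata attributes described above."""
    try:
        request = Request(prepare_url(url), method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return {
                "url": url,
                "status": response.status,
                "link_type": response.headers.get("Content-Type"),    # e.g. image/jpeg, application/pdf
                "link_size": response.headers.get("Content-Length"),  # bytes, if reported by the server
                "last_modified": response.headers.get("Last-Modified"),
            }
    except HTTPError as err:
        return {"url": url, "status": err.code}
    except URLError as err:
        return {"url": url, "status": None, "error": str(err.reason)}
```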
Component Diagrams
flowchart LR
H["Harvester"]-- "writes" -->MR[("Record Table")]
MR-- "reads" -->LAA["Link Liveliness Assessment"]
MR-- "reads" -->CA["Catalogue"]
LAA-- "writes" -->LLAL[("Links Table")]
LAA-- "writes" -->LLAVH[("Validation History Table")]
CA-- "reads" -->API["**API**"]
LLAL-- "writes" -->API
LLAVH-- "writes" -->API
Sequence Diagram
sequenceDiagram
participant Linkchecker
participant DB
participant Catalogue
Linkchecker->>DB: Establish Database Connection
Linkchecker->>Catalogue: Extract Relevant URLs
loop URL Processing
Linkchecker->DB: Check URL Existence
Linkchecker->DB: Check Deprecation Status
alt URL Not Deprecated
Linkchecker-->DB: Insert/Update Records
Linkchecker-->DB: Insert/Update Links with file format type, size, last_modified
Linkchecker-->DB: Update Validation History
else URL Deprecated
Linkchecker-->DB: Skip Processing
end
end
Linkchecker->>DB: Close Database Connection
Database Design
classDiagram
Links <|-- Validation_history
Links <|-- Records
Links : +Int ID
Links : +Int fk_records
Links : +String Urlname
Links : +String deprecated
Links : +String link_type
Links : +Int link_size
Links : +DateTime last_modified
Links : +String Consecutive_failures
class Records{
+Int ID
+String Records
}
class Validation_history{
+Int ID
+Int fk_link
+String Statuscode
+String isRedirect
+String Errormessage
+Date Timestamp
}
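A minimal SQL sketch of the three tables from the diagram, executed from Python with psycopg2 (the column types and the driver are assumptions; the deployed schema may differ):

```python
import psycopg2  # assumption: psycopg2 is the PostgreSQL driver in use

DDL = """
CREATE TABLE IF NOT EXISTS records (
    id      SERIAL PRIMARY KEY,
    records TEXT
);
CREATE TABLE IF NOT EXISTS links (
    id                   SERIAL PRIMARY KEY,
    fk_records           INTEGER REFERENCES records (id),
    urlname              TEXT,
    deprecated           TEXT,
    link_type            TEXT,       -- media type, e.g. application/pdf
    link_size            BIGINT,     -- bytes
    last_modified        TIMESTAMP,
    consecutive_failures INTEGER
);
CREATE TABLE IF NOT EXISTS validation_history (
    id           SERIAL PRIMARY KEY,
    fk_link      INTEGER REFERENCES links (id),
    statuscode   TEXT,
    is_redirect  TEXT,
    errormessage TEXT,
    timestamp    TIMESTAMP
);
"""

def create_schema(dsn: str) -> None:
    """Create the tables from the class diagram above (illustrative, not the deployed schema)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(DDL)
```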
Integrations & Interfaces
- Visualisation of the evaluation in the Metadata Catalogue; the assessment report is retrieved using AJAX from each record page
- FastAPI now incorporates additional metadata for links, including file format type, size, and last modified date.
Key Architectural Decisions
Initially we started with the linkchecker library, but performance was very slow because it tested the same links again for each page.
We decided to only test the links section of OGC API - Records responses, which means that links within, for example, the metadata abstract are no longer tested.
OGC OWS services make up a substantial portion of the links; these services return error 500 if called without parameters. For this scenario we created a dedicated script.
If tests for a resource fail a number of consecutive times, the resource is no longer tested and is tagged as deprecated.
Links via a facade, such as a DOI, are followed to the page they refer to, so the LLA tool can understand the relation between the DOI and the page it resolves to.
For each link it is known in which record(s) it is mentioned, so if a link breaks, a contact to notify can be found in the record.
For the second release we have enhanced the link liveliness assessment tool to collect more information about the resources:
- File type format (media type) to help users understand what format they'll be accessing (e.g., image/jpeg, application/pdf, text/html)
- File size to inform users about download expectations
- Last modification date to indicate how recent the resource is
API Updates
The API has been extended to include the newly tracked metadata fields:
- link_type: the file format type of the resource (e.g., image/jpeg, application/pdf)
- link_size: the size of the resource in bytes
- last_modified: the timestamp at which the resource was last modified
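A minimal FastAPI sketch of how these fields could be exposed (the route and model names are illustrative; they are not necessarily the actual endpoints):

```python
from datetime import datetime
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class LinkStatus(BaseModel):
    urlname: str
    statuscode: int
    link_type: Optional[str] = None      # e.g. image/jpeg, application/pdf
    link_size: Optional[int] = None      # bytes
    last_modified: Optional[datetime] = None

@app.get("/links/{link_id}", response_model=LinkStatus)
def get_link(link_id: int) -> LinkStatus:
    # Illustrative only: the real endpoint reads the links table in PostgreSQL.
    return LinkStatus(
        urlname="https://example.org/data.pdf",
        statuscode=200,
        link_type="application/pdf",
        link_size=1048576,
        last_modified=datetime(2024, 1, 1),
    )
```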
Next Steps
We plan to enhance the link liveliness assessment tool to include geospatial attributes such as field details and spatial information from various data formats (e.g., GeoTIFF, Shapefile, CSV). This will be accomplished using GDAL and OWSLib to enable efficient retrieval without the need to download the full files.
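A possible approach for the GDAL part, using its /vsicurl/ virtual file system to read remote raster metadata through HTTP range requests without downloading the full file (a sketch of the idea, not the planned implementation):

```python
from osgeo import gdal  # assumption: the GDAL Python bindings are available

def remote_raster_info(url: str) -> dict:
    """Read basic geospatial attributes of a remote GeoTIFF via /vsicurl/ (no full download)."""
    dataset = gdal.Open(f"/vsicurl/{url}")
    if dataset is None:
        return {"url": url, "error": "could not open resource"}
    return {
        "url": url,
        "size": (dataset.RasterXSize, dataset.RasterYSize),
        "bands": dataset.RasterCount,
        "projection": dataset.GetProjection(),
        "geotransform": dataset.GetGeoTransform(),
    }
```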
Risks & Limitations
TBD
MD Augmentation
Intro
The metadata augmentation repository contains a number of modules, each performing specific tasks on the metadata. For each module a dedicated design document is provided.
Design Document: NER Augmentation
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Deduplication
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: EUOS-high-value dataset tagging
Introduction
Component Overview and Scope
The EUSO high-value datasets are those with substantial potential to assess soil health status, as detailed on the EUSO dashboard. This framework includes the concept of metadata-based identification and tagging of soil degradation indicators. Each dataset (possibly only those with a supra-national spatial scope - under discussion) will be annotated with the potential soil degradation indicator for which it might be utilised. Users can then filter these datasets according to their specific needs.
The EUSO soil degradation indicators employ specific methodologies and thresholds to determine soil health status.
Users
- authorised data users
- unauthorised data users
- administrators
References
Requirements
Functional Requirements
- For critical / high-value resources, SoilWise will either link with them or store them
Non-functional Requirements
- Labelling data results as high-value data sets will rely on EUSO/ESDAC criteria, which are not identical to the High Value Data directive criteria.
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
flowchart TD
subgraph ic[Indicators Search]
ti([Title Check]) ~~~ ai([Abstract Check])
ai ~~~ ki([Keywords Check])
end
subgraph Codelists
sd ~~~ si
end
subgraph M[Methodologies Search]
tiM([Title Check]) ~~~ aiM([Abstract Check])
kl([Links check]) ~~~ kM([Keywords Check])
end
m[(Metadata Record)] --> ic
m --> M
ic-- + ---M
sd[(Soil Degradation Codelist)] --> ic
si[(Soil Indicator Codelist)] --> ic
em[(EUSO Soil Methodologies list)] --> M
M --> et{{EUSO High-Value Dataset Tag}}
et --> m
ic --> es{{Soil Degradation Indicator Tag}}
es --> m
th[(Thesauri)]-- synonyms ---Codelists
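A minimal sketch of the title/abstract/keywords checks against a soil degradation codelist (the codelist terms and record fields are placeholders, not the actual EUSO lists):

```python
# Placeholder codelist: the real terms come from the EUSO soil degradation and indicator codelists.
SOIL_DEGRADATION_TERMS = {"soil erosion", "soil compaction", "salinisation", "loss of organic matter"}

def match_indicators(record: dict) -> set:
    """Return the soil degradation indicators mentioned in title, abstract or keywords."""
    text = " ".join([
        record.get("title", ""),
        record.get("abstract", ""),
        " ".join(record.get("keywords", [])),
    ]).lower()
    return {term for term in SOIL_DEGRADATION_TERMS if term in text}

# A record would be tagged as an EUSO high-value dataset candidate when at least one indicator matches.
```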
Database Design
Integrations & Interfaces
- Spatial scope analyser
- Catalogue
- Knowledge graph
Key Architectural Decisions
Risks & Limitations
Design Document: Keyword Matcher
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Spatial Locator
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Spatial Scope Analyser
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Translation
Introduction
Some imported records are in a non-English language. In order to improve their discoverability, the component uses the EU translation service to translate the title and abstract to English. The keywords are not translated, because they are handled by the keyword matcher.
Component Overview and Scope
Users
References
Requirements
Functional Requirements
- Identify whether the record benefits from translation
- Identify the source language (see the sketch after this list)
- Request the translation
- Receive the translation (asynchronous)
- Update the catalogue with translated content
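A minimal sketch of the first two steps, assuming a language-detection library such as langdetect (the actual component may instead rely on the metadata language element):

```python
from langdetect import detect  # assumption: langdetect is used for language identification

def needs_translation(record: dict) -> bool:
    """Decide whether title/abstract should be sent to the EU translation service."""
    text = f"{record.get('title', '')} {record.get('abstract', '')}".strip()
    if not text:
        return False
    source_language = detect(text)   # e.g. 'nl', 'de', 'fr'
    record["source_language"] = source_language
    return source_language != "en"
```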
Non-functional Requirements
- Prevent the service from being misused as a proxy to the EU translation service
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Validation
Abstract test Suite
The elements below are tested for availability in the suggested cardinality and type; a sketch of such checks follows the table.
Element DC | Element ISO | Cardinality | Type | Codelist | Comment |
---|---|---|---|---|---|
identifier | fileidentifier | 1-n | string | - | |
title | title | 1-n | string | - | |
language | language | 0-n | string | - | 2/3/5-letter iso? |
description | abstract | 0-n | string | | |
date | date | 0-n | date | | |
distribution | distributioninfo | 0-n | str or uri | | |
contributor | contact#? | 0-n | str or uri | | |
creator | contact#author | 0-n | str or uri | | |
publisher | contact#distributor | 0-n | str or uri | | |
coverage-temporal | extent#temporal | 0-n | date-period | | |
coverage-spatial | extent#spatial | 0-n | str, uri or bbox | | |
rights | constraints | 0-1 | str or uri | | |
license | constraints | 0-1 | str or uri | | |
subject | keyword/topiccategory | 0-n | str or uri | | |
type | hierarchylevel | 1-1 | str or uri | | |
format | format | 0-1 | str or uri | | |
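A minimal sketch of how the cardinality and type checks in this table could be automated (the rule set is abbreviated and the record structure is an assumption):

```python
# Each rule: element name -> (minimum occurrences, maximum occurrences or None for unbounded, expected type)
RULES = {
    "identifier": (1, None, str),
    "title":      (1, None, str),
    "language":   (0, None, str),
    "type":       (1, 1, str),
    "license":    (0, 1, str),
}

def validate_record(record: dict) -> list:
    """Return a list of findings for elements that violate cardinality or type."""
    findings = []
    for element, (minimum, maximum, expected_type) in RULES.items():
        values = record.get(element, [])
        if not isinstance(values, list):
            values = [values]
        if len(values) < minimum:
            findings.append(f"{element}: expected at least {minimum} value(s), found {len(values)}")
        if maximum is not None and len(values) > maximum:
            findings.append(f"{element}: expected at most {maximum} value(s), found {len(values)}")
        findings.extend(
            f"{element}: value {value!r} is not of type {expected_type.__name__}"
            for value in values if not isinstance(value, expected_type)
        )
    return findings
```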
Design Document: Metadata validator
Introduction
Component Overview and Scope
The SoilWise Repository aims to harvest and register as much as possible. Catalogues which capture metadata authored by data custodians typically show a wide range of metadata completeness and accuracy. Therefore, the SoilWise Repository employs metadata validation mechanisms to provide additional information about metadata completeness, conformance and integrity. Information resulting from the validation process is stored together with each metadata record in a relational database and updated after a new metadata version is registered.
It is important to understand that this activity runs on unprocessed records at entry to the system. Since there is an n:1 relation between ingested records and the deduplicated, enhanced records in the SoilWise repository, a validation result should always be evaluated by its combination of record identifier, source and moment of harvesting. Validation results can best be visualised at record level, by showing a historic list of the sources which contributed to the record.
Users
- authorised data users
- see validation results (log) per each metadata record
- unauthorised data users
- administrators
- manually run validation
- see summary validation results
- monitor validation process
References
Requirements
Functional Requirements
- regular automated validation, at least once per six months after the first project year
- validation result is stored in database related to the harvested record
- metadata of datasets, documents, journal articles, ...
- metadata in ISO19139:2007, Dublin Core, ...
- information about metadata completeness (are elements populated)
- information about metadata conformance (schema, profile, ...)?
- information about metadata integrity (are elements consistent with each other)
- Link liveliness assessment
Nice to have
- Provide means for validation of data
- Data structure
- Content integrity
- Data duplication
- Capture results of manual tests (in case not all test cases can be automated)
Non-functional Requirements
- follow the ATS/ETS approach and implement AI/ML enrichment
- reach TRL 7
- adopt the ISO 19157 data quality measures
- JRC does not want to discourage data providers from publishing metadata by visualising that they are not conformant with a set of rules
Architecture Abstract test suites
Technological Stack
Markdown file on GitHub
Architecture Executable test suites
Technological Stack
Harvested records are stored in the 'harvest.items' table (Postgres database), where they are identified by their hash. Multiple editions of the same record are available, each of which can have a different validation result.
A scheduled process (CI/CD, Argo Workflows) triggers a task to determine which items have not been validated before and validates them.
Results are stored in the validation result tables described under Database Design.
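A minimal sketch of the selection step, assuming Postgres is queried via psycopg2 and that harvested items and validation results are related via the record hash (table and column names are assumptions based on the text above and the Database Design section):

```python
import psycopg2  # assumption: the scheduled task talks to Postgres via psycopg2

UNVALIDATED_QUERY = """
SELECT i.hash
FROM harvest.items AS i
LEFT JOIN validation_results AS v ON v.record_hash = i.hash
WHERE v.record_hash IS NULL;          -- items that have no validation result yet
"""

def fetch_unvalidated(dsn: str) -> list:
    """Return the harvested items that still need to be validated."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(UNVALIDATED_QUERY)
        return cur.fetchall()
```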
Overview of Key Features
- Link Liveliness Assessment: validates whether the referenced link is currently working or deprecated. This is a separate component, which returns an HTTP status code, e.g. 200 OK, 401 Unauthorized, 404 Not Found, 500 Server Error, and a timestamp.
Component Diagrams
Sequence Diagram
Database Design
Two tables, related via the record hash, because the result is stored per indicator; the main table holds an overall summary.
Validation-results
record-hash (str) | result-summary (int) | date (date) |
---|---|---|
UaE4GeF | 64 | 2025-01-12T11:06:34Z |
Validation-by-indicator
record-hash (str) | indicator (str) | result (str) |
---|---|---|
UaE4GeF | completeness | 77 |
Integrations & Interfaces
- Validation components run as a scheduled process, using the harvest.items table as a source
- Link Liveliness Assessment runs as a scheduled process, using public.records as a source
- The Harvester stores the metadata in harvest.items as-is, so that validation results at data entry are clean
- The Catalogue may provide an interface to the validation results, but this requires authentication
- Storage
Key Architectural Decisions
- for the first iteration, Hale Studio was selected, with restricted access to the validation results
- SHACL validation was discussed to be implemented in next iterations (for GeoDCAT-AP and Dublin Core)
- minimal SoilWise profile was discussed to indicate compliance with SoilWise functionality
- EUSO Metadata profile was discussed
- two-step validation of metadata was discussed, at first using harvested metadata, and next using SoilWise-augmented metadata
Risks & Limitations
- Hale Studio currently does not support Dublin Core
- Users may expect a single validation result and not a historic list of validated sources
- Records from some sources require processing before they can be tested as ISO 19139 or Dublin Core; there is a risk that metadata errors are introduced in pre-processing of the records, in which case the test validates the pre-processing software more than the metadata itself