Table of Contents
Home
Welcome to the SoilWise Knowledge Base!
- Guidance on FAIR data & knowledge publication for the Soil Community is available at soilwise-fair-strategy
- Developer documentation
- User documentation about using the hub
- Administrator documentation
- Frequently asked questions
SoilWise repository - Administrator guidance
Quick start
SWR, by design, does not facilitate direct metadata record modification, removal or addition.
Metadata can be included (and excluded) by adding and refining a system of harvesters.
Harvesters are micro-tasks which run at intervals and synchronise the catalogue with remote sources.
Each harvester task is configured by its type, the endpoint to harvest, the interval and a filter to subset the query.
As an administrator you maintain the harvesters, with the goal of optimizing the content of the catalogue.
Administration is managed through a GitLab instance running at Wageningen University.
Three distinct activities can be identified:
- Schedule and monitor harvester tasks - Task monitoring and scheduling is managed through the GitLab admin interface of the harvester project
- Create and adjust harvester tasks - Configuration for a harvester task is defined within the CI folder of the harvester repository.
- Create and adjust harvester types - Various catalogue interfaces exist, which each require dedicated code for interaction. The required code is maintained in the harvester repository.
Schedule and monitor harvester tasks
Harvester tasks run as scheduled pipelines in a GitLab environment.
From the GitLab web-based user interface you can evaluate previous runs of the harvester tasks, and adjust and schedule new tasks.
Make sure you are a member with at least the Developer role in the soilwise-he/harvester GitLab project.
The GitLab project is a clone of the public GitHub harvester project.
Create and adjust harvester tasks
Configuration for a harvester task is defined within the CI folder of the harvester repository. Each file in that folder represents a harvesting task. Notice that each file is referenced from the main gitlab-ci.yaml file. Each file typically has two entries, one for the develop and one for the production environment. Dedicated harvesters have dedicated parameters, but all include the configuration of the database into which the harvested records are persisted. Some parameters reference GitLab variables, which are modified separately through the GitLab admin interface.
```yaml
# example of use of a gitlab variable
- export POSTGRES_HOST=$POSTGRES_HOST_TEST
```
Most harvesters have a parameter to indicate the endpoint and a filter. The filter typically has a {key:value} structure. Notice the quotes around the filter.
```yaml
# example of use of an endpoint and filter configuration
- export HARVEST_URL=https://inspire-geoportal.ec.europa.eu/srv/eng/csw
- export HARVEST_FILTER='[{"th_httpinspireeceuropaeutheme-theme.link":"http://inspire.ec.europa.eu/theme/so"}]'
```
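Taken together, a full harvester task entry in the CI folder might look like the following sketch. This is illustrative only: the job name, stage, scheduling rule and entry-point script are assumptions, not taken from the actual repository; only the `export` lines mirror the examples above.

```yaml
# Hypothetical harvester task entry (job name, stage and entry point are illustrative)
harvest-inspire-develop:
  stage: harvest
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # run only as a scheduled pipeline
  script:
    - export POSTGRES_HOST=$POSTGRES_HOST_TEST  # GitLab variable, set via the admin interface
    - export HARVEST_URL=https://inspire-geoportal.ec.europa.eu/srv/eng/csw
    - export HARVEST_FILTER='[{"th_httpinspireeceuropaeutheme-theme.link":"http://inspire.ec.europa.eu/theme/so"}]'
    - python harvest.py                         # hypothetical entry point
```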
Create and adjust harvester types
Various catalogue interfaces exist, which each require dedicated code for interaction. The required code is maintained in the harvester repository.
A generic harvester type, CSW, is available for catalogues such as INSPIRE, ISRIC, FAO, EEA and BonaRes.
Other repositories use dedicated APIs, for which dedicated harvesters are provided, such as Prepsoil, Impact4soil, ESDAC and OpenAIRE.
Code changes are needed to adjust the operation of a specific harvester.
SoilWise Catalogue — User Guide
1. Quick overview
SoilWise is a FAIR-focused data & knowledge repository that helps you discover datasets and knowledge resources about soil and soil health across Europe. This guide explains how to find relevant resources using the catalogue search page and how to read and access a resource from its detail page.
2. Quick start
- Open the Search page.
- Type a keyword in the search box and press Enter (or click Search).
- Narrow results with filters on the left.
- Select a result title to open the Detail page where you can view metadata.
- Follow a link to access the resource.
3. Search page — interface elements & how to use them
3.1 Search box (keywords)
- Purpose: free-text search across titles, descriptions, keywords and metadata.
- Tips:
  - Use short phrases like `soil organic carbon`, `soil health indicators`, `Europe`.
  - Quotes for exact phrases: `"soil organic carbon"`.
  - Boolean-style hints (if supported): `AND`, `OR`, `-` to exclude terms (if not supported, present examples only).
Microcopy suggestion:
Search by keyword, title, author, or description. Try: "soil organic carbon", or use multiple keywords separated by spaces.
3.2 Filters (left-hand panel)
Common filter types you may see:
- Resource type (dataset, document, model, policy brief)
- Subject / Topic (soil organic matter, erosion, biodiversity...)
- Geographic scope (country, region)
- Temporal coverage / Date (year collected or published)
- License / Access (open / restricted / embargoed)
- Provider / Source (institution or repository)
How to use:
- Click a filter category to expand it. Check the boxes matching your interest to narrow results.
- Filters combine as AND across categories and as OR within a category (standard behaviour). Example: checking `Dataset` and `Policy brief` in Resource type (if multiple allowed) will return both types.
Microcopy suggestion on top of filters panel:
Refine your results — select one or more options. Filters stack: selections in different groups narrow the list.
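The AND-across / OR-within behaviour described above can be modelled in a few lines of Python. This is a simplified sketch, not the catalogue's implementation; the field names are illustrative.

```python
def matches(resource, active_filters):
    """Return True if the resource satisfies the active filters.

    active_filters maps a filter category to the set of selected values.
    Selections within one category combine as OR; categories combine as AND.
    """
    for category, selected in active_filters.items():
        if not selected:          # no selection in this category: no constraint
            continue
        if resource.get(category) not in selected:
            return False          # AND across categories
    return True

resources = [
    {"type": "Dataset", "access": "Open"},
    {"type": "Policy brief", "access": "Restricted"},
    {"type": "Model", "access": "Open"},
]
# OR within "type" (Dataset or Policy brief), AND with "access" = Open
hits = [r for r in resources
        if matches(r, {"type": {"Dataset", "Policy brief"}, "access": {"Open"}})]
```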
3.3 Explode filters (show related/expanded items)
What it does: “Explode” expands a selected filter to include narrower or related terms. Use it when you want broader coverage that includes child concepts (e.g., exploding “soil health” to include individual indicators like porosity or organic matter).
How to present/explain:
- Next to hierarchical filters show a small toggle/icon: `▸ Expand` or `+ Explode`.
- When the user toggles it, include a popup explanation: Including narrower terms (e.g., 'soil health' → 'soil organic carbon', 'microbial biomass', ...).
Example microcopy for the toggle:
Explode: include narrower/related terms
Behavior notes (recommended implementation):
- Default: off (keeps search precise).
- When on: filter expands to include child concepts; results will increase. Show a small badge `exploded` near the active filter list and include an undo/clear option.
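The explode behaviour amounts to a transitive expansion over a concept hierarchy. A minimal sketch (the hierarchy content below is illustrative, not the actual SoilWise thesaurus):

```python
# Illustrative narrower-term hierarchy (not the actual SoilWise thesaurus)
NARROWER = {
    "soil health": ["soil organic carbon", "microbial biomass", "porosity"],
    "soil organic carbon": ["soil organic matter"],
}

def explode(term):
    """Return the term plus all narrower terms, transitively."""
    terms, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in terms:
            terms.add(t)
            stack.extend(NARROWER.get(t, []))
    return terms
```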
3.4 Sorting
Common sort options:
- Relevance (default when you search)
- Newest first (publication or data collection date)
- Alphabetical (title or provider)
Microcopy suggestion:
Sort by: Relevance | Date | Title (A–Z)
3.5 Pagination & results per page
- Provide clear pagination controls at top and bottom of results: `Prev` / page numbers / `Next`.
- Allow the user to change results per page (e.g., 10 / 25 / 50 / 100). Show the current range (`Showing 1–25 of 342 results`).
- Keep filters and sort selections while switching pages.
- For long result lists, consider offering an Export option (CSV / JSON) for result metadata.
Microcopy for pagination:
Showing 1–25 of 342 results. Change per page: [10 ▾]
3.6 Result items — what each list item should show
Each result listing (compact) should include:
- Title (clickable link to Detail page)
- Resource type badge (e.g., Dataset, Report)
- Short description (1–2 lines)
- Geographic coverage or tags (if present)
- Date (publication or dataset date)
- Provider / source name
- Access indicator (Open / Restricted) and an icon/link to the resource if fast-access is available
- Optional: star or save-to-favorites, and a small `i` for a quick view modal
Example compact item:
Title
Dataset · Soil organic carbon (France) · 2021 · INRAE
Short sentence describing it… [Open] [Save] [Share]
4. Detail page — reading resource metadata & accessing the resource
4.1 Header area
- Full title
- Resource type and provider
- Badges: Access (Open/Restricted), License (CC BY, ODbL…), Format (CSV, GeoTIFF, PDF)
4.2 Key metadata (display prominently)
- Abstract / Description (full)
- Authors / Contributors
- Date of publication or data collection
- Geographic coverage (map thumbnail if coordinates exist)
- Keywords / Subjects
- Spatial and temporal extent
- Licensing and access conditions (open / request / restricted): include exact license text or a short summary and a `View full license` link
- DOI / Persistent identifier (if present)
- Contact / provider link
- Related resources (links to datasets, publications, or tools that reference the resource)
4.3 Access link / Download
- Prominently present a clear CTA button: `Access resource` or `Download data` (if file) or `Request access` (when restricted).
- If external: add `Opens in new tab` microcopy.
- If it requires login or an access request: show the steps and contact details.
Microcopy for CTA:
Access resource — external site (opens in new tab)
4.4 Additional features
- Citable record: provide citation text in common citation styles (APA, BibTeX) and a `Copy citation` button.
- Provenance / lineage: short timeline of the dataset’s creation/updates.
- Version history (if multiple versions exist).
- Usage metrics: downloads/views (if available).
- Feedback / report issue button to flag broken links or incorrect metadata.
5. Example workflows
A. Find open datasets about soil organic carbon in Spain
- Search: `soil organic carbon`
- Filter: `Geographic scope → Spain`
- Filter: `Access → Open`
- Sort: `Newest first`
- Click a dataset title → press `Download data`
B. Broaden a topic using explode
- Search: `soil health`
- Under `Topic`, toggle `Explode` to include narrower indicators (organic matter, microbial biomass)
- Results will increase; refine by date or provider.
6. Troubleshooting & FAQ (short)
Q: I clicked “Access resource” and it fails.
A: Check if the link opened in a new tab (pop-ups blocked?), verify your network. If the resource is restricted, follow the Request access instructions on the detail page. Use Report issue to notify us.
Q: My search returns no results.
A: Try fewer filters, use broader terms, or enable Explode for hierarchical topics. Remove date constraints.
Q: How current is the data?
A: Check the Date and Version fields on the resource detail page; contact the provider listed in the metadata for confirmation.
LLA
Design Document: Link Liveliness Assessment
Introduction
Component Overview and Scope
The linkchecker component is designed to evaluate the validity and availability of links within metadata records advertised via an OGC API - Records API.
A link in a metadata record points to one of:
- another metadata record
- a downloadable instance (pdf/zip/sqlite/mp4/pptx) of the resource
- the resource itself
- documentation about the resource
- identifier of the resource (DOI)
- a webservice or API (sparql, openapi, graphql, ogc-api)
For a set of metadata records, the linkchecker evaluates whether:
- the links to external sources are valid
- the links within the repository are valid
- the link metadata accurately represents the resource (mime type, size, data model, access constraints)
If the endpoint is an API, some sanity checks can be performed on the API:
- Identify if the API adopted any API-standard
- If an API standard is adopted, does the API support basic operations of that API
- Does the metadata correctly mention the standard
The component returns an HTTP status (e.g., `200 OK`, `401 Unauthorized`, `404 Not Found`, `500 Server Error`) and a timestamp.
The component runs an evaluation for a single resource on request, or runs tests at intervals, providing a history of availability.
The results of the evaluation can be extracted via an API. The API is based on the FastAPI framework and can be deployed using a Docker container.
The evaluation process runs as a scheduled CI/CD pipeline in GitLab. It uses five threads to increase performance.
Users
- all data users (authorised + unauthorised)
  - can see the results of link evaluation in the Catalogue (if XX is integrated) or access the API directly to retrieve reports
- administrators
  - can configure and manually start the evaluation process
  - can see the history of link evaluations
References
Requirements
Functional Requirements
Users want to understand the availability of a resource before they click a link: it helps them anticipate whether the click will be (un)successful. Availability can be indicated as not available (404), sometimes available (as a percentage), authorisation required, deprecated (failed too many times), or unknown (due to timeouts or other failures).
Non-functional Requirements
- Should respect rules in robots.txt
- Should not introduce vulnerabilities
Architecture
Technological Stack
- Core Language:
  - Python + the core library urllib: used for the linkchecker, API, and database interactions.
- Database:
  - PostgreSQL: utilized for storing and managing link information.
- Backend Framework:
  - FastAPI: employed to create and expose REST API endpoints, utilizing its efficiency and auto-generated components like Swagger.
- Frontend:
  - See Integrations & Interfaces
- Containerization:
  - Docker: used to containerize the linkchecker application, ensuring consistent deployment and execution across different environments.
Overview of Key Features
- Link validation: Returns HTTP status codes for each link, along with other important information such as the parent URL, any warnings, and the date and time of the test. Additionally, the tool enhances link analysis by identifying various metadata attributes, including file format type (e.g., image/jpeg, application/pdf, text/html), file size (in bytes), and last modification date. This provides users with valuable insights about the resource before accessing it.
- Broken link categorization: Identifies and categorizes broken links based on status codes, including Redirection Errors, Client Errors, and Server Errors.
- Deprecated links identification: flags links as deprecated if they have failed for X consecutive tests (in our case X = 10). Deprecated links are excluded from future tests to optimize performance.
- Timeout management: Allows the identification of URLs that exceed a timeout threshold which can be set manually as a parameter in linkchecker's properties.
- Availability monitoring: When run periodically, the tool builds a history of availability for each URL, enabling users to view the status of links over time.
- OWS services (WMS, WFS, WCS, CSW) typically return an HTTP 500 error when called without the necessary parameters. Dedicated handling detects these services and includes the necessary parameters before the check.
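The OWS handling above can be sketched as follows: before checking, a GetCapabilities request is appended for bare OWS endpoints. This is a simplified sketch under stated assumptions; the detection of the service type and the chosen default versions are illustrative, not the component's actual logic.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Common default versions per OWS service (illustrative choices)
OWS_VERSIONS = {"WMS": "1.3.0", "WFS": "2.0.0", "WCS": "2.0.1", "CSW": "2.0.2"}

def prepare_ows_url(url, service):
    """Append the parameters an OWS endpoint needs to answer without a 500.

    Bare OWS endpoints typically fail with HTTP 500; a GetCapabilities
    request is a safe, parameterised probe of availability.
    """
    scheme, netloc, path, query, frag = urlsplit(url)
    params = dict(parse_qsl(query))
    params.setdefault("service", service)
    params.setdefault("request", "GetCapabilities")
    params.setdefault("version", OWS_VERSIONS[service])
    return urlunsplit((scheme, netloc, path, urlencode(params), frag))
```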
Component Diagrams
flowchart LR
H["Harvester"]-- "writes" -->MR[("Record Table")]
MR-- "reads" -->LAA["Link Liveliness Assessment"]
MR-- "reads" -->CA["Catalogue"]
LAA-- "writes" -->LLAL[("Links Table")]
LAA-- "writes" -->LLAVH[("Validation History Table")]
CA-- "reads" -->API["**API**"]
LLAL-- "writes" -->API
LLAVH-- "writes" -->API
Sequence Diagram
sequenceDiagram
participant Linkchecker
participant DB
participant Catalogue
Linkchecker->>DB: Establish Database Connection
Linkchecker->>Catalogue: Extract Relevant URLs
loop URL Processing
Linkchecker->DB: Check URL Existence
Linkchecker->DB: Check Deprecation Status
alt URL Not Deprecated
Linkchecker-->DB: Insert/Update Records
Linkchecker-->DB: Insert/Update Links with file format type, size, last_modified
Linkchecker-->DB: Update Validation History
else URL Deprecated
Linkchecker-->DB: Skip Processing
end
end
Linkchecker->>DB: Close Database Connection
Database Design
classDiagram
Links <|-- Validation_history
Links <|-- Records
Links : +Int ID
Links : +Int fk_records
Links : +String Urlname
Links : +String deprecated
Links : +String link_type
Links : +Int link_size
Links : +DateTime last_modified
Links : +String Consecutive_failures
class Records{
+Int ID
+String Records
}
class Validation_history{
+Int ID
+Int fk_link
+String Statuscode
+String isRedirect
+String Errormessage
+Date Timestamp
}
Integrations & Interfaces
- Visualisation of the evaluation in the Metadata Catalogue; the assessment report is retrieved via AJAX from each record page
- FastAPI now incorporates additional metadata for links, including file format type, size, and last modified date.
Key Architectural Decisions
Initially we started with the linkchecker library, but performance was slow because it re-tested the same links for every page.
We decided to only test the links section of ogc-api:records; this means that links within, for example, the metadata abstract are no longer tested.
OGC OWS services are a substantial portion of links; these services return error 500 if called without parameters. For this scenario we created a dedicated script.
If tests for a resource fail a number of times, the resource is no longer tested and is tagged as deprecated.
Links via a facade, such as DOI, are followed to the page they refer to. This means the LLA tool can understand the relation between a DOI and the page it refers to.
For each link it is known on which record(s) it is mentioned, so if a broken link occurs, we can find a contact in the record to notify.
For the second release we have enhanced the link liveliness assessment tool to collect more information about the resources:
- File type format (media type) to help users understand what format they'll be accessing (e.g., image/jpeg, application/pdf, text/html)
- File size to inform users about download expectations
- Last modification date to indicate how recent the resource is
API Updates
The API has been extended to include the newly tracked metadata fields:
- link_type: the file format type of the resource (e.g., image/jpeg, application/pdf)
- link_size: the size of the resource in bytes
- last_modified: the timestamp when the resource was last modified
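A response for a single link might therefore look like the following. This is a hypothetical payload for illustration; the exact field set and endpoint are not specified here.

```json
{
  "urlname": "https://example.org/data/soil-map.tif",
  "statuscode": 200,
  "link_type": "image/tiff",
  "link_size": 10485760,
  "last_modified": "2024-11-03T09:15:00Z",
  "deprecated": false
}
```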
Next Steps
We plan to enhance the link liveliness assessment tool to include geospatial attributes such as field details and spatial information from various data formats (e.g., GeoTIFF, Shapefile, CSV). This will be accomplished using GDAL and OWSLib to enable efficient retrieval without downloading the full files.
Risks & Limitations
TBD
MD Augmentation
Intro
The metadata augmentation repository contains a number of modules each performing specific tasks on the metadata. For each module a dedicated design document is provided.
Design Document: NER Augmentation
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Deduplication
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: EUOS-high-value dataset tagging
Introduction
Component Overview and Scope
The EUSO high-value datasets are those with substantial potential to assess soil health status, as detailed on the EUSO dashboard. This framework includes metadata-based identification and tagging of soil degradation indicators. Each dataset (possibly only those with a supra-national spatial scope - under discussion) will be annotated with the potential soil degradation indicator(s) for which it might be utilised. Users can then filter these datasets according to their specific needs.
The EUSO soil degradation indicators employ specific methodologies and thresholds to determine soil health status.
Users
- authorised data users
- unauthorised data users
- administrators
References
Requirements
Functional Requirements
- For critical / high-value resources, SoilWise will either link with them or store them
Non-functional Requirements
- Labelling data results as high-value data sets will rely on EUSO/ESDAC criteria, which are not identical to the High Value Data directive criteria.
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
flowchart TD
subgraph ic[Indicators Search]
ti([Title Check]) ~~~ ai([Abstract Check])
ai ~~~ ki([Keywords Check])
end
subgraph Codelists
sd ~~~ si
end
subgraph M[Methodologies Search]
tiM([Title Check]) ~~~ aiM([Abstract Check])
kl([Links check]) ~~~ kM([Keywords Check])
end
m[(Metadata Record)] --> ic
m --> M
ic-- + ---M
sd[(Soil Degradation Codelist)] --> ic
si[(Soil Indicator Codelist)] --> ic
em[(EUSO Soil Methodologies list)] --> M
M --> et{{EUSO High-Value Dataset Tag}}
et --> m
ic --> es{{Soil Degradation Indicator Tag}}
es --> m
th[(Thesauri)]-- synonyms ---Codelists
Database Design
Integrations & Interfaces
- Spatial scope analyser
- Catalogue
- Knowledge graph
Key Architectural Decisions
Risks & Limitations
Design Document: Keyword Matcher
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Spatial Locator
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Spatial Scope Analyser
Introduction
Component Overview and Scope
Users
References
Requirements
Functional Requirements
Non-functional Requirements
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Design Document: Translation
Introduction
Some imported records are in a non-English language. To improve their discoverability, the component uses the EU translation service to translate the title and abstract to English. Keywords are not translated, because they are handled by the keyword matcher.
Component Overview and Scope
Users
References
Requirements
Functional Requirements
- Identify if the record benefits from translation
- Identify the source language
- Request the translation
- Receive the translation (asynchronous)
- Update the catalogue with translated content
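The first two steps above can be sketched as a simple gate. This is a hedged sketch: the word-frequency heuristic is a naive placeholder and the record fields are assumptions; a real implementation would use a proper language-detection library before calling the EU translation service.

```python
# Very naive language gate: a real implementation would use a language
# detection library before requesting a translation.
ENGLISH_HINTS = {"the", "and", "of", "soil", "data", "with", "for"}

def needs_translation(record):
    """Return True if the record's title+abstract look non-English."""
    text = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
    words = text.split()
    if not words:
        return False
    hits = sum(1 for w in words if w.strip(".,;:") in ENGLISH_HINTS)
    return hits / len(words) < 0.05   # few English function words: translate
```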
Non-functional Requirements
- Prevent the service from being misused as a proxy to the EU translation service
Architecture
Technological Stack
Overview of Key Features
Component Diagrams
Sequence Diagram
Database Design
Integrations & Interfaces
Key Architectural Decisions
Risks & Limitations
Validation
Abstract test Suite
The elements below are tested on being available in the suggested cardinality and type.
| Element DC | Element ISO | Cardinality | Type | Codelist | Comment |
|---|---|---|---|---|---|
| identifier | fileidentifier | 1-n | string | - | |
| title | title | 1-n | string | - | |
| language | language | 0-n | string | - | 2/3/5-letter iso? |
| description | abstract | 0-n | string | | |
| date | date | 0-n | date | | |
| distribution | distributioninfo | 0-n | str or uri | | |
| contributor | contact#? | 0-n | str or uri | | |
| creator | contact#author | 0-n | str or uri | | |
| publisher | contact#distributor | 0-n | str or uri | | |
| coverage-temporal | extent#temporal | 0-n | date-period | | |
| coverage-spatial | extent#spatial | 0-n | str, uri or bbox | | |
| rights | constraints | 0-1 | str or uri | | |
| license | constraints | 0-1 | str or uri | | |
| subject | keyword/topiccategory | 0-n | str or uri | | |
| type | hierarchylevel | 1-1 | str or uri | | |
| format | format | 0-1 | str or uri | | |
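The cardinality and type rules above can be executed against a parsed record along these lines. This is a sketch, not the actual validator: the rule subset uses the Dublin Core element names, and the `{element: [values]}` record structure is an assumption.

```python
# Cardinality/type rules from the abstract test suite (subset, DC names)
RULES = {
    "identifier": {"min": 1, "type": str},
    "title":      {"min": 1, "type": str},
    "language":   {"min": 0, "type": str},
    "type":       {"min": 1, "max": 1, "type": str},
}

def validate(record):
    """Return a list of violations for a record of {element: [values]}."""
    errors = []
    for element, rule in RULES.items():
        values = record.get(element, [])
        if len(values) < rule["min"]:
            errors.append(f"{element}: expected at least {rule['min']} value(s)")
        if len(values) > rule.get("max", float("inf")):
            errors.append(f"{element}: too many values")
        if any(not isinstance(v, rule["type"]) for v in values):
            errors.append(f"{element}: wrong type")
    return errors
```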
Design Document: Metadata validator
Introduction
Component Overview and Scope
The SoilWise Repository aims to harvest and register as much as possible. Catalogues that capture metadata authored by data custodians typically show a wide range of metadata completeness and accuracy. Therefore, the SoilWise Repository employs metadata validation mechanisms to provide additional information about metadata completeness, conformance and integrity. Information resulting from the validation process is stored together with each metadata record in a relational database and updated after a new metadata version is registered.
It is important to understand that this activity runs on unprocessed records at entry to the system. Because there is an n:1 relation between ingested records and the deduplicated, enhanced records in the SoilWise Repository, a validation result should always be evaluated by its combination of record identifier, source and moment of harvesting. Validation results are best visualised at record level, by showing a historic list of the sources that contributed to the record.
Users
- authorised data users
  - see validation results (log) per metadata record
- unauthorised data users
- administrators
  - manually run validation
  - see summary validation results
  - monitor the validation process
References
Requirements
Functional Requirements
- regular automated validation, at least once per six months after the first project year
- validation result is stored in database related to the harvested record
- metadata of datasets, documents, journal articles, ...
- metadata in ISO19139:2007, Dublin Core, ...
- information about metadata completeness (are elements populated)
- information about metadata conformance (schema, profile, ...)?
- information about metadata integrity (are elements consistent with each other)
- Link liveliness assessment
Nice to have
- Provide means for validation of data
- Data structure
- Content integrity
- Data duplication
- Capture results of manual tests (in case not all test case can be automated)
Non-functional Requirements
- follow the ATS/ETS approach and implement AI / ML enrichment
- reach TRL 7
- adopt the ISO 19157 data quality measures
- JRC does not want to discourage data providers from publishing metadata by visualizing that they are not conformant with a set of rules
Architecture Abstract test suites
Technological Stack
Markdown file on GitHub
Architecture Executable test suites
Technological Stack
Harvested records are stored in the 'harvest.items' table (Postgres DB); they are identified by their hash. Multiple editions of the same record are available, each of which can have a different validation result.
A scheduled process (CI/CD, Argo Workflows) triggers a task to determine which items have not been validated before, and validates them.
Results are stored in the validation tables described under Database Design.
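The core selection step of that scheduled task (which hashes still need validation) can be sketched without a database. This is a simplified model; the real implementation queries the harvest.items table and the stored validation results.

```python
def pending_validation(harvested_hashes, validated_hashes):
    """Return the record hashes that have no validation result yet.

    harvested_hashes: hashes currently in harvest.items
    validated_hashes: hashes already present in the results tables
    """
    return sorted(set(harvested_hashes) - set(validated_hashes))

# Hypothetical hashes for illustration
todo = pending_validation(
    ["UaE4GeF", "Xy12abc", "Qq9zzzz"],
    ["UaE4GeF"],
)
```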
Overview of Key Features
- Link Liveliness Assessment: validates whether the referenced link is currently working or deprecated. This is a separate component, which returns an HTTP status (`200 OK`, `401 Unauthorized`, `404 Not Found`, `500 Server Error`) and a timestamp.
Component Diagrams
Sequence Diagram
Database Design
Two tables, related via the record hash, because results are stored per indicator; the main table holds an overall summary.
Validation-results
| record-hash (str) | result-summary (int) | date (date) |
|---|---|---|
| UaE4GeF | 64 | 2025-01-12T11-06-34Z |
Validation-by-indicator
| record-hash (str) | indicator (str) | result (str) |
|---|---|---|
| UaE4GeF | completeness | 77 |
Integrations & Interfaces
- Validation components run as a scheduled process, using the `harvest.items` table as a source
- Link Liveliness Assessment runs as a scheduled process, using `public.records` as a source
- Harvester prepares the metadata in `harvest.items` as is, to give a clean validation result at data entry
- Catalogue may provide an interface to the validation results, but it requires authentication
- Storage
Key Architectural Decisions
- for the first iteration, Hale Studio was selected, with restricted access to the validation results
- SHACL validation was discussed for implementation in next iterations (for GeoDCAT-AP and Dublin Core)
- minimal SoilWise profile was discussed to indicate compliance with SoilWise functionality
- EUSO Metadata profile was discussed
- two-step validation of metadata was discussed, at first using harvested metadata, and next using SoilWise-augmented metadata
Risks & Limitations
- Hale Studio currently does not support Dublin Core
- Users may expect a single validation result and not a historic list of validated sources
- Records from some sources require processing before they can be tested as ISO 19139 or Dublin Core; there is a risk that metadata errors are introduced in pre-processing of the records, in which case the test validates the software more than the metadata itself