Identification

Title

Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model

Abstract

This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with relations between them such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were extracted from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model and https://zenodo.org/records/4181488). The data can be used to fine-tune a pre-trained large language model using transfer learning, to create a model that can be used in inference mode to create the labels automatically, thereby producing structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, which is the export format of the doccano open source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published in PDF image form. These latter documents had to undergo optical character recognition (OCR), which produced lower quality text and therefore lower quality training data. The majority of the labelled data is from the higher quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold standard labelled dataset, and it is of insufficient volume to build a performant model.
The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, project 10083604.
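
As an illustration of how the JSONL(Relation) export described above might be read, the following is a minimal Python sketch rather than part of the published scripts. It assumes each line is a JSON object with a "text" field, an "entities" list (with "id", "label", "start_offset" and "end_offset") and a "relations" list (with "from_id", "to_id" and "type"), which is the shape commonly produced by doccano; check the field names against the actual files before relying on it. The sentence, the entity label FORMATION and the function name relation_triples are hypothetical and are not taken from the dataset.

```python
import json


def relation_triples(jsonl_lines):
    """Return (head text, relation type, tail text) for every labelled relation.

    Assumes the doccano JSONL(Relation) field names described in the lead-in.
    """
    triples = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record["text"]
        # Map each entity id to the substring of the sentence it covers.
        spans = {
            ent["id"]: text[ent["start_offset"]:ent["end_offset"]]
            for ent in record.get("entities", [])
        }
        for rel in record.get("relations", []):
            triples.append((spans[rel["from_id"]], rel["type"], spans[rel["to_id"]]))
    return triples


# Hypothetical record, built here so that the character offsets are consistent.
text = "The Alpha Formation overlies the Beta Sandstone."
head, tail = "Alpha Formation", "Beta Sandstone"
sample_line = json.dumps({
    "id": 1,
    "text": text,
    "entities": [
        {"id": 10, "label": "FORMATION",
         "start_offset": text.index(head), "end_offset": text.index(head) + len(head)},
        {"id": 11, "label": "FORMATION",
         "start_offset": text.index(tail), "end_offset": text.index(tail) + len(tail)},
    ],
    "relations": [{"id": 100, "from_id": 10, "to_id": 11, "type": "overlies"}],
})

for triple in relation_triples([sample_line]):
    print(triple)
```

To apply the same function to a downloaded file, pass an open file handle (iterating over a file yields its lines) in place of the single-element list.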

Resource type

nonGeographicDataset

Resource locator

https://github.com/BritishGeologicalSurvey/princeton-nlp-relation-extraction

name: BGS github repository

function: download

https://webapps.bgs.ac.uk/services/ngdc/accessions/index.html#item186633

name: Data

function: download

https://doi.org/10.5285/afba2d1d-8a5d-4b96-a6fa-c13b5d8d32cd

name: Digital Object Identifier (DOI)

function: information

Unique resource identifier

code

http://data.bgs.ac.uk/id/dataHolding/13608217

codeSpace

Dataset language

eng

Spatial reference system

code identifying the spatial reference system

Classification of spatial data and services

Topic category

geoscientificInformation

Keywords

Keyword set

keyword value

originating controlled vocabulary

title

GEMET - INSPIRE themes, version 1.0

reference date

date type

publication

effective date

2008-06-01

Keyword set

keyword value

NGDC Deposited Data

Physical properties

Mathematical programming

data.gov.uk (non-INSPIRE)

Citable Data

Stratigraphic unit

originating controlled vocabulary

title

BGS Thesaurus of Geosciences

reference date

date type

revision

effective date

2022

Keyword set

keyword value

NERC_DDC

Geographic location

West bounding longitude

East bounding longitude

North bounding latitude

South bounding latitude

Temporal reference

Temporal extent

Begin position

2023-11-01

End position

2024-02-15

Dataset reference date

date type

creation

effective date

2024-02-15

Frequency of update

notPlanned

Quality and validity

Lineage

Data was sourced from three corpora already available under open licences. Sentences from BGS memoirs and technical reports were sourced from previously labelled data at https://github.com/BritishGeologicalSurvey/geo-ner-model/blob/main/bgs.3class.geo-all-data.txt. This was converted to a new data format, and additional labels were manually added to a subset of the data using the doccano open source text annotation software. A small sample of sentences was taken from selected DECC/OGA onshore hydrocarbons well reports (http://data.bgs.ac.uk/id/dataHolding/13607542) and from selected Mineral Reconnaissance Programme reports (http://data.bgs.ac.uk/id/dataHolding/13605457). These reports were processed by: 1. converting to machine readable text; 2. splitting into pages and sentences/paragraphs; 3. converting to doccano import JSON Lines format; 4. manually annotating to label a range of geological concepts; 5. manually adding relation labels to indicate how those concepts relate to each other; 6. exporting in doccano JSONL(Relation) format and also converting to the format required by https://github.com/princeton-nlp/PURE
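
The final conversion step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the script used to produce the dataset. It assumes the doccano field names "text", "entities" ("id", "label", "start_offset", "end_offset") and "relations" ("from_id", "to_id", "type"), and a PURE-style target record with "doc_key", "sentences", "ner" and "relations", where spans are token indices with an inclusive end. The regex tokeniser, the function name doccano_to_pure, the label FORMATION and the sample sentence are all hypothetical choices made for the example.

```python
import re


def doccano_to_pure(record, doc_key):
    """Convert one doccano record (character offsets) to a PURE-style record
    (token indices, inclusive end). Treats the whole text as one sentence."""
    text = record["text"]
    # Simple tokeniser (an assumption): runs of word characters, or single punctuation marks.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]

    def token_span(start, end):
        # Indices of all tokens that overlap the character range [start, end).
        hits = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
        return hits[0], hits[-1]

    spans = {}
    ner = []
    for ent in record.get("entities", []):
        first, last = token_span(ent["start_offset"], ent["end_offset"])
        spans[ent["id"]] = (first, last)
        ner.append([first, last, ent["label"]])

    relations = []
    for rel in record.get("relations", []):
        head, tail = spans[rel["from_id"]], spans[rel["to_id"]]
        relations.append([head[0], head[1], tail[0], tail[1], rel["type"]])

    return {
        "doc_key": doc_key,
        "sentences": [[tok for tok, _, _ in tokens]],
        "ner": [ner],
        "relations": [relations],
    }


# Hypothetical record; offsets are computed so they match the text.
text = "The Alpha Formation overlies the Beta Sandstone."
a, b = "Alpha Formation", "Beta Sandstone"
sample_record = {
    "text": text,
    "entities": [
        {"id": 1, "label": "FORMATION",
         "start_offset": text.index(a), "end_offset": text.index(a) + len(a)},
        {"id": 2, "label": "FORMATION",
         "start_offset": text.index(b), "end_offset": text.index(b) + len(b)},
    ],
    "relations": [{"id": 1, "from_id": 1, "to_id": 2, "type": "overlies"}],
}

print(doccano_to_pure(sample_record, "doc-0"))
```

The sketch does not handle entities whose offsets cover only whitespace, and a production conversion would normally use the same tokeniser as the model being fine-tuned, since token indices depend on it.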

Conformity

Conformity report

specification

title

INSPIRE Implementing rules laying down technical arrangements for the interoperability and harmonisation of Geology

reference date

date type

publication

effective date

2011

degree

false

explanation

See the referenced specification

Conformity report

specification

title

Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing Directive 2007/2/EC of the European Parliament and of the Council as regards interoperability of spatial data sets and services

reference date

date type

publication

effective date

2010-12-08

degree

false

explanation

See http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2010:323:0011:0102:EN:PDF

Data format

name of format

jsonlines

version of format

doccano JSONL(Relation)

Constraints related to access and use

Constraint set

Limitations on public access

The copyright of materials derived from the British Geological Survey's work is vested in the Natural Environment Research Council [NERC]. No part of this work may be reproduced or transmitted in any form or by any means, or stored in a retrieval system of any nature, without the prior permission of the copyright holder, via the BGS Intellectual Property Rights Manager. Use by customers of information provided by the BGS is at the customer's own risk. In view of the disparate sources of information at BGS's disposal, including such material donated to BGS, that BGS accepts in good faith as being accurate, the Natural Environment Research Council (NERC) gives no warranty, expressed or implied, as to the quality or accuracy of the information supplied, or to the information's suitability for any use. NERC/BGS accepts no liability whatever in respect of loss, damage, injury or other occurrence however caused.

Responsible organisations

Responsible party

organisation name

British Geological Survey

full postal address

Environmental Science Centre, Nicker Hill, Keyworth

NOTTINGHAM

NG12 5GG

United Kingdom

telephone number

0115 936 3143

facsimile number

0115 936 3276

email address

enquiries@bgs.ac.uk

responsible party role

distributor

Responsible party

organisation name

British Geological Survey

full postal address

Environmental Science Centre, Nicker Hill, Keyworth

NOTTINGHAM

NG12 5GG

United Kingdom

telephone number

0115 936 3143

facsimile number

0115 936 3276

email address

enquiries@bgs.ac.uk

responsible party role

originator

Responsible party

organisation name

British Geological Survey

full postal address

Environmental Science Centre, Nicker Hill, Keyworth

NOTTINGHAM

NG12 5GG

United Kingdom

telephone number

0115 936 3143

facsimile number

0115 936 3276

email address

enquiries@bgs.ac.uk

responsible party role

pointOfContact

Responsible party

organisation name

British Geological Survey

full postal address

Environmental Science Centre, Nicker Hill, Keyworth

NOTTINGHAM

NG12 5GG

United Kingdom

telephone number

0115 936 3143

facsimile number

0115 936 3276

email address

enquiries@bgs.ac.uk

responsible party role

principalInvestigator

Metadata on metadata

Metadata point of contact

organisation name

British Geological Survey

full postal address

Environmental Science Centre, Keyworth

NOTTINGHAM

NG12 5GG

United Kingdom

telephone number

+44 115 936 3100

email address

enquiries@bgs.ac.uk

responsible party role

pointOfContact

Metadata date

2025-03-10

Metadata language

eng