Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model
This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model; both are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were extracted from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model and https://zenodo.org/records/4181488 ). The data can be used to fine-tune a pre-trained large language model using transfer learning, producing a model that can be run in inference mode to create the labels automatically, thereby generating structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, the export format of the doccano open source text annotation software (https://doccano.github.io/doccano/) that was used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published in PDF image form. These documents had to undergo optical character recognition (OCR), which resulted in lower-quality text and lower-quality training data. The majority of the labelled data is from the higher-quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so it should not be treated as a gold-standard labelled dataset, and it is of insufficient volume to build a performant model.
The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
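As a minimal sketch of how a doccano JSONL(Relation) export might be consumed, the Python below reads one record and resolves each relation to the text of its two entities. The field names (entities, relations, start_offset, end_offset, from_id, to_id, type) are assumed from doccano's relation export and should be checked against the deposited files. The sample sentence, the FORMATION label and the function name relation_triples are hypothetical and are not taken from the dataset.

```python
import json

# Hypothetical record: the sentence and the FORMATION label are illustrative
# only. Field names are an assumption based on doccano's relation export.
SAMPLE_LINE = json.dumps({
    "id": 1,
    "text": "The Mercia Mudstone overlies the Sherwood Sandstone.",
    "entities": [
        {"id": 10, "label": "FORMATION", "start_offset": 4, "end_offset": 19},
        {"id": 11, "label": "FORMATION", "start_offset": 33, "end_offset": 51},
    ],
    "relations": [
        {"id": 100, "from_id": 10, "to_id": 11, "type": "overlies"},
    ],
})


def relation_triples(line):
    """Return (subject text, relation type, object text) for one JSONL line."""
    record = json.loads(line)
    text = record["text"]
    # Map each entity id to the character span it labels in the sentence.
    surface = {
        entity["id"]: text[entity["start_offset"]:entity["end_offset"]]
        for entity in record.get("entities", [])
    }
    return [
        (surface[rel["from_id"]], rel["type"], surface[rel["to_id"]])
        for rel in record.get("relations", [])
    ]


if __name__ == "__main__":
    for triple in relation_triples(SAMPLE_LINE):
        print(triple)
```

To process a whole file, the same function could be applied to each non-empty line, since JSON Lines stores one record per line.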
nonGeographicDataset
https://github.com/BritishGeologicalSurvey/princeton-nlp-relation-extraction
name: BGS github repository
function: download
https://webapps.bgs.ac.uk/services/ngdc/accessions/index.html#item186633
name: Data
function: download
https://doi.org/10.5285/afba2d1d-8a5d-4b96-a6fa-c13b5d8d32cd
name: Digital Object Identifier (DOI)
function: information
http://data.bgs.ac.uk/id/dataHolding/13608217
eng
geoscientificInformation
publication
2008-06-01
NGDC Deposited Data
Physical properties
Mathematical programming
data.gov.uk (non-INSPIRE)
Citable Data
Stratigraphic unit
revision
2022
NERC_DDC
2023-11-01
2024-02-15
creation
2024-02-15
notPlanned
Data was sourced from three corpora already available under open licences. Sentences from BGS memoirs and technical reports were sourced from previously labelled data at https://github.com/BritishGeologicalSurvey/geo-ner-model/blob/main/bgs.3class.geo-all-data.txt. This was converted to a new data format, and additional labels were manually added to a subset of the data using the doccano open source text annotation software. A small sample of sentences was taken from selected DECC/OGA onshore hydrocarbons well reports ( http://data.bgs.ac.uk/id/dataHolding/13607542 ) and from selected Mineral Reconnaissance Programme reports ( http://data.bgs.ac.uk/id/dataHolding/13605457 ). These reports were processed by: 1. converting to machine-readable text; 2. splitting into pages and sentences/paragraphs; 3. converting to the doccano import JSON Lines format; 4. manually annotating to label a range of geological concepts; 5. manually adding relation labels to indicate how those concepts relate to each other; 6. exporting in doccano JSONL(Relation) format and also converting to the format required by PURE ( https://github.com/princeton-nlp/PURE ).
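The final conversion step can be illustrated with a rough sketch. It is not the project's own script (see the BGS GitHub repository for that). It assumes that doccano records carry character offsets (start_offset, end_offset) and that the PURE input uses inclusive token indices under the keys doc_key, sentences, ner and relations; both assumptions should be checked against the respective documentation. The regular-expression tokeniser and the function name to_pure_style are hypothetical choices made for this example.

```python
import re


def to_pure_style(record, doc_key):
    """Convert one doccano-style record (character offsets) to token spans."""
    text = record["text"]
    # Simple tokeniser (an assumption): runs of word characters, or single
    # punctuation marks. A real pipeline may tokenise differently.
    matches = list(re.finditer(r"\w+|[^\w\s]", text))
    tokens = [m.group() for m in matches]

    def token_span(start, end):
        # Indices of all tokens overlapping the character range [start, end).
        hits = [i for i, m in enumerate(matches)
                if m.start() < end and m.end() > start]
        if not hits:
            raise ValueError(f"no token overlaps characters {start}-{end}")
        return hits[0], hits[-1]  # inclusive first and last token index

    spans = {}
    ner = []
    for entity in record.get("entities", []):
        first, last = token_span(entity["start_offset"], entity["end_offset"])
        spans[entity["id"]] = (first, last)
        ner.append([first, last, entity["label"]])

    relations = [
        [*spans[rel["from_id"]], *spans[rel["to_id"]], rel["type"]]
        for rel in record.get("relations", [])
    ]
    # One sentence per record, so each top-level list has a single element.
    return {"doc_key": doc_key, "sentences": [tokens],
            "ner": [ner], "relations": [relations]}
```

Treating each doccano record as a one-sentence document keeps token indices local to the record; a multi-sentence document would need the indices offset by the number of preceding tokens.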
publication
2011
false
See the referenced specification
publication
2010-12-08
false
See http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2010:323:0011:0102:EN:PDF
jsonlines
doccano JSONL(Relation)
The copyright of materials derived from the British Geological Survey's work is vested in the Natural Environment Research Council [NERC]. No part of this work may be reproduced or transmitted in any form or by any means, or stored in a retrieval system of any nature, without the prior permission of the copyright holder, via the BGS Intellectual Property Rights Manager. Use by customers of information provided by the BGS is at the customer's own risk. In view of the disparate sources of information at BGS's disposal, including such material donated to BGS, that BGS accepts in good faith as being accurate, the Natural Environment Research Council (NERC) gives no warranty, expressed or implied, as to the quality or accuracy of the information supplied, or to the information's suitability for any use. NERC/BGS accepts no liability whatever in respect of loss, damage, injury or other occurrence however caused.
British Geological Survey
Environmental Science Centre, Nicker Hill, Keyworth
NOTTINGHAM
NG12 5GG
United Kingdom
0115 936 3143
0115 936 3276
distributor
British Geological Survey
Environmental Science Centre, Nicker Hill, Keyworth
NOTTINGHAM
NG12 5GG
United Kingdom
0115 936 3143
0115 936 3276
originator
British Geological Survey
Environmental Science Centre, Nicker Hill, Keyworth
NOTTINGHAM
NG12 5GG
United Kingdom
0115 936 3143
0115 936 3276
pointOfContact
British Geological Survey
Environmental Science Centre, Nicker Hill, Keyworth
NOTTINGHAM
NG12 5GG
United Kingdom
0115 936 3143
0115 936 3276
principalInvestigator
British Geological Survey
Environmental Science Centre, Keyworth
NOTTINGHAM
NG12 5GG
United Kingdom
+44 115 936 3100
pointOfContact
2025-03-10