Guest blog for the Open Knowledge Foundation
Anthony Beck was recently asked to write a guest blog piece for the Open Knowledge Foundation (OKF).
This was broken into three articles:
Part 1: DART and STAR Projects
Part 2: Ethics and open data
Part 3: A view of the future
I’ve collated the unabridged versions here. This also shows how easy it is to re-use the illustrations and other imagery that we host in Flickr.
Dig the new breed: how open approaches can empower archaeologists
Jonathan Gray (a blog on DART would be interesting) and Jo Walsh (a blog crystallising the ethics bit) both asked me to do a guest blog on the DART project and other general issues surrounding how open approaches can impact on archaeology. In order to rise to this challenge, and at least provide a starting point, I took the title from the presentation I gave at the Open Knowledge Conference 2010. I did this for two reasons: It’s a terrible play on words (dig is employed as a synonym for “excavation” and “To like”) and I like name-checking “The Jam”. Also, as this post has taken form, it’s changed from being a piece about Open Science and Ethics into something about how disruptive technologies can be implemented to transform how the heritage sector operates.
The thrust is about archaeology, archaeologists, data, process and open approaches. I’ll ramble about a number of things including the DART project, STAR project, ethics and my vision for the future of archaeological data.
STAR & STELLAR – Anyone for linked heritage data?
I recently attended a STAR project workshop and saw a glimpse of the future. The Semantic Technologies for Archaeological Resources (STAR) project investigated “the potential of semantic terminology tools for widening and improving access to heritage resources, exploring the possibilities of combining a high level, core ontology with domain thesauri and natural language processing techniques”. The project has looked at extracting structured knowledge from “grey literature” using Natural Language Processing (NLP) tools – all very worthy and interesting but not something I’m directly excited by as “grey literature” is essentially tertiary data (an extraction of synthetic data derived from the primary record). In addition they have developed an RDF based approach to query data stored in heterogeneous excavation databases. WOW!
And in case you missed that…. querying data stored in heterogeneous excavation databases. Essentially they have resolved syntactic (platform/format), schematic (structural) and semantic (language) heterogeneities by generating mappings of key fields (i.e. a sub-set of the source data) to the English Heritage extension of the CIDOC Conceptual Reference Model (CRM), extracting the data as RDF and providing semantic interoperability through Knowledge Organization Systems (KOS) represented in SKOS format from standard heritage thesauri. In essence, they extract RDF from relational databases using hand-crafted mappings to both SKOS and an ontology, which articulate semantics and canonical concepts respectively.
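To make the idea of hand-crafted mappings concrete, here is a minimal sketch in plain Python. It is not the STAR implementation: the database names, column names, `ex:` predicates and `concept:` identifiers are all hypothetical stand-ins for the real ontology and thesaurus terms. It only shows how two differently structured records can end up as comparable triples.

```python
# Hypothetical per-database mappings: local column name -> shared predicate.
# The "ex:" predicates are placeholders, not real CIDOC CRM identifiers.
MAPPINGS = {
    "site_a": {"ctx_no": "ex:contextId", "ctx_type": "ex:contextType"},
    "site_b": {"ContextNumber": "ex:contextId", "Kind": "ex:contextType"},
}

# Hypothetical local vocabulary -> shared thesaurus concept (the SKOS role).
CONCEPTS = {
    "site_a": {"PH": "concept:posthole", "DI": "concept:ditch"},
    "site_b": {"post hole": "concept:posthole", "ditch": "concept:ditch"},
}


def row_to_triples(source, row):
    """Convert one database row into (subject, predicate, object) triples."""
    mapping = MAPPINGS[source]
    # Find which local column holds the context identifier.
    id_column = next(c for c, p in mapping.items() if p == "ex:contextId")
    subject = f"{source}/context/{row[id_column]}"
    triples = []
    for column, predicate in mapping.items():
        value = row[column]
        if predicate == "ex:contextType":
            # Replace the local code with the shared concept.
            value = CONCEPTS[source][value]
        triples.append((subject, predicate, str(value)))
    return triples


triples_a = row_to_triples("site_a", {"ctx_no": 101, "ctx_type": "PH"})
triples_b = row_to_triples("site_b", {"ContextNumber": 7, "Kind": "post hole"})
```

Both records now describe their context type with the same concept, even though the source databases used different column names and different local codes.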
The combination of RDF, ontology and SKOS has allowed the team to produce a demonstrator capable of cross-searching different excavations with “difficult queries”. The team demonstrated that they could address questions such as, show me contexts that satisfy the following criteria:
- Roman corn drying ovens with palaeobotanical analysis
- Charred plant remains and charcoal from 4 post structures
- Post holes that contain ritual deposits
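A query such as the last one can be sketched over a small set of triples. This is only an illustration under stated assumptions: the triples, the `ex:` predicates and the `concept:` identifiers are invented, and the `BROADER` table stands in for a thesaurus hierarchy (the role that `skos:broader` plays in SKOS). A real system would use a triplestore and a query language rather than Python loops.

```python
# Invented triples drawn from two hypothetical excavation databases.
TRIPLES = {
    ("site_a/context/101", "ex:contextType", "concept:posthole"),
    ("site_a/context/101", "ex:contains", "site_a/deposit/5"),
    ("site_a/deposit/5", "ex:depositType", "concept:ritual_deposit"),
    ("site_b/context/7", "ex:contextType", "concept:posthole"),
    ("site_b/context/7", "ex:contains", "site_b/deposit/2"),
    ("site_b/deposit/2", "ex:depositType", "concept:backfill"),
    ("site_b/context/9", "ex:contextType", "concept:structural_posthole"),
    ("site_b/context/9", "ex:contains", "site_b/deposit/3"),
    ("site_b/deposit/3", "ex:depositType", "concept:ritual_deposit"),
}

# Hypothetical thesaurus hierarchy: narrower concept -> broader concept.
BROADER = {"concept:structural_posthole": "concept:posthole"}


def falls_under(concept, target):
    """True if concept is target or a narrower term of it."""
    while concept is not None:
        if concept == target:
            return True
        concept = BROADER.get(concept)
    return False


def objects(subject, predicate):
    """All objects of triples matching a subject and predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def contexts_of_type_containing(context_type, deposit_type):
    """Contexts of a given type that contain a deposit of a given type."""
    hits = set()
    for s, p, o in TRIPLES:
        if p == "ex:contextType" and falls_under(o, context_type):
            for deposit in objects(s, "ex:contains"):
                types = objects(deposit, "ex:depositType")
                if any(falls_under(t, deposit_type) for t in types):
                    hits.add(s)
    return sorted(hits)


result = contexts_of_type_containing("concept:posthole", "concept:ritual_deposit")
```

Note that the context recorded with the narrower term is found too, and the answer spans both databases. That is the part the thesaurus contributes.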
Granted there are limitations: it currently supports a sub-set of the data collected during excavation and the RDF model is viewed as an interim tool with users going back to the source databases to conduct further analysis. However, the concept has been definitively demonstrated. Great stuff! The impact of this work is profound: the SKOS and ontology will allow inferencing/reasoning over the data which will transform the way the data can be re-used, analysed and generalised (more on this in a bit).
The Glamorgan team have a follow on project called Semantic Technologies Enhancing Links and Linked data for Archaeological Resources (STELLAR) funded by the AHRC. One of the aims of STELLAR is to develop “best practice guidelines and tools ….. both for mapping/extracting archaeological data as RDF and for generating archaeological Linked Data”. This will take the research developed in STAR and provide tools so that it can be deployed to mainstream archaeological data. I’m really looking forward to seeing the roll-out of this technology.
DART and Open Science
DART is an acronym for Detection of Archaeological Residues using remote sensing Techniques. DART is a three year Science and Heritage initiative (funded by the Arts and Humanities Research Council (AHRC) and the Engineering and Physical Sciences Research Council (EPSRC)) and led by the School of Computing at the University of Leeds. The project aims to improve the understanding of the physical, chemical, biological and environmental factors that determine whether an archaeological feature (pit, ditch, posthole etc.) can be detected by a sensor (camera, Ground Penetrating Radar, etc.). DART has attracted a consortium consisting of 25 key heritage and industry organisations and academic consultants and researchers from the areas of computer vision, geophysics, remote sensing, knowledge engineering and soil science.
Archaeological sites and features are created by localised formation and deformation processes. There is a range of imaging instruments that can be used to detect these archaeological residues, although the knowledge required to determine what, when, how and why to use each different type of sensor is patchy. Seasonal, environmental and vegetation dynamics play a part, although the complexities of interaction, and how they modify “contrast signatures” derived from the existing formation and deformation processes, are uncertain.
This is important so I will provide an example: as a mud-brick built farmstead erodes, the silt, sand, clay, large clasts and organics in the mud-brick along with other anthropogenic debris are incorporated into the soil. This produces a localised variation in soil size and structure. This in turn impacts on drainage and localised crop stress and vigour. These localised variations can all provide measurable differences, or contrasts, that indicate the presence of archaeology. To focus further on drainage: in most years a localised drainage differential produces no discernible contrast (no soil colour or crop stress/vigour difference). However, in extreme wet/dry periods the different drainage characteristics result in different soil moisture retention properties which can produce localised variations in crop stress/vigour which can be observed as differences in crop height or crop colour (essentially crop marks). Archaeological contrasts can be expressed through, for example, variations in chemistry, magnetic field, resistance, topography, temperature and spectral reflectance.
Hence, archaeological features produce localised contrasts which can be detected using an appropriate sensor under appropriate conditions. However, little is known about how different archaeological residues contrast with their local environment, how these contrasts are expressed in the electromagnetic spectrum, or how environmental, and other localised factors such as soil or vegetation, impact on contrast magnitude (over space and time). This requires an understanding of both the nature of the residues and the landscape matrix within which they exist.
The programme of research has been designed specifically to identify physical, chemical and biological contrast factors that may allow the detection of archaeological residues (both directly and by proxy) under different land-use and environmental conditions. To determine contrast factors samples and measurements will be taken on and around different sub-surface archaeological features at different times of the day and year to ensure that a representative range of conditions is covered. Field measurements will include geophysical and hyperspectral surveys, thermal profiling, soil moisture and spectral reflectance. Laboratory analysis of samples will include geochemistry and particle size. Models will be developed that translate these physical values into spectral and magnetic measures in order to determine detection parameters. The key will be to understand the dynamic interaction between soils, vegetation and archaeological residues and how these affect detection with sensing devices. This requires understanding how the archaeology differs from, and dynamically interacts with, the localised soils and vegetation and how these differences can be detected. This will allow DART to address the following research issues:
- What are the factors that produce archaeological contrasts?
- How do these contrast processes vary over space and time?
- What processes cause these variations?
- How can we best detect these contrasts (sensors and conditions)?
DART is committed to open science principles and aims to act as an exemplar for how data, tools, and analysis can be made available to the wider academic, heritage and general community. Data, software, algorithms and services developed throughout the project will be made available for re-use with appropriate open licences. Licensing is an issue as licence incompatibility can severely restrict re-use. Science Commons is establishing protocols in this area. Publicly accessible dissemination is preferred; however, where necessary, domain-specific or institutional repositories will be utilised for long-term preservation. Cameron Neylon is part of the project consortium and provides a steer on these issues.
The whole point of taking an open science position on this project is so that we can maximise the benefit and impact. The research problem is large and complex: one project will not solve it. Inevitably the science will need refining in areas with different geologies, land-use patterns, environmental dynamics and past archaeological exploitation repertoires. Adequate articulation will require long term data collection under different conditions, followed by iterative hypothesis testing and modelling. The challenge is to get this information in the quickest, cheapest and easiest ways. An Open Science approach means that DART is openly collaborating with researchers and individuals throughout the world. The body of work developed within DART can be easily re-used by others: our results can be tested as the data and algorithms will be in the public domain, which means that they can be rapidly evaluated and easily re-used. Unlocking the “body of knowledge” and “know how” surrounding a programme of research should significantly reduce the barriers to re-use. This may generate a critical mass of surrounding research, which can only improve the underlying models and science. Providing scientists with the methodology of how to make the wheel will not only stop us reinventing it, but will also improve the manufacturing process.
Ethics and archaeological data
Open Science provides the framework for producing transparent and reproducible science by providing open access to raw data, transformative algorithms and synthetic interpretations. STAR and STELLAR provide the foundation from which fine granularity excavation data can be made available as part of the semantic web and feed into Open Science analysis. This provides answers to the questions of how and why we should have open access to archaeological data. However, it does not provide answers to what data should be opened or if archaeological data should be opened at all. We move into the sphere of ethics and open archaeology.
Recently I have chatted to a number of people and organisations who want to open up heritage data. The conversations tend to have an ethical component. Like other disciplines, such as ecology, there are potential ethical issues in making heritage data open. The oft touted reason, in the UK at least, is that if access is given to this information then it will be exploited by “night hawkers” (irresponsible metal-detectorists) and other “treasure hunters” and sites (a term I don’t really like) will be destroyed. This argument is polarised and plays to the lowest common denominator: it is based on the premise that “accessible knowledge will inevitably be abused” and eschews any of the benefits that data sharing can provide. Nor does it consider the nuanced ethical arguments concerning re-appropriation of artifacts collected under imperialist regimes or the ethical conundrum surrounding research into aboriginal or other indigenous communities (which, now that I’ve raised them, I won’t comment on further). The Portable Antiquities Scheme has done much to improve this argument.
The elephant in the room in this debate concerns those archaeologists who have sat on their archive for decades. We know of its significance but it is not available for academic and research analysis and does not inform the planning process. This has enormous impact on local planning policy, public and academic understanding, theory, practice etc. Since the introduction of Planning Policy Guidance 16 (PPG16: essentially commercial archaeology) in the UK in November 1990 there has been less of this approach. PPG16 has recently been replaced by Planning Policy Statement 5 (PPS5).
I find the situation somewhat paradoxical. The UK curatorial systems expect that a generalised summary, or synthesis, of any investigation is deposited with the regional curatorial officers. This data is entered into the Sites and Monuments Record (SMR) and is publicly accessible. Therefore, the public has access to a generalised dataset. The expectations for primary, or raw, data are different: it’s considered ethically appropriate to deposit fine granularity data (i.e. non-generalised, primary, data, such as those from excavation) with the Archaeology Data Service (ADS); however, there are issues raised if an individual wants to do this outside such formal structures (although the Perry Oaks Project have released redacted versions of their site data). Is this an issue of ethics, or where formal and informal work practices collide, or is this simply an issue of cost: individuals and organisations have the will but not the finances? Finally, and possibly most likely, do archaeologists just feel uncomfortable making their fine grained data available to a mass audience without going through a representative authority such as the ADS? My feeling is that within the archaeology domain there is an informal belief that if data is deposited with a repository then the repository also takes the ethical responsibility if the data is released. Deposition so that data is available in perpetuity is part of business and academic best practice; however, deposition does not necessarily mean release and subsequent consumption by other parties (public or otherwise).
Whatever the answer the point remains: archaeologists, for right or wrong, consider the implications of placing fine grained data in the public domain and “Ethical considerations” have been identified as a “barrier” to deposition. However, there appears to be limited guidance as to how to resolve these issues. This means that many archaeologists are re-inventing the wheel. The challenge is to provide some supporting “thing” that makes it easy for individuals and organisations to get to a clear, and hopefully unambiguous, ethical position. Such a “thing” will reduce uncertainty thereby removing one of the barriers to data sharing. The current default position is the equivalent of doing nothing: surely this must change. Supporting “stuff” which is recognised and approved by national heritage organisations and standards bodies will act as important lubricant to help individuals and groups to release data through informal channels. It should be recognised that the relationship between the “citizen”, the archaeologists and heritage data will change: citizen science and citizen data, will play more of a role in heritage than ever before. Hence, a focus on the informal is important: we don’t want more grey data do we? The Portable Antiquities Scheme is the “poster boy” for archaeological approaches to citizen science – although they do have a range of different user access levels.
I raised this as a topic for the Archaeology working group at the Open Knowledge Foundation. Response so far has been positive and has spilled over to colleagues in the curatorial sector and beyond (the discussion thread can be found here). I will be setting up a meeting to start to discuss these issues later this year. Both the Archaeology Data Service and the University of Leeds have kindly offered a venue. Watch this space…..
Wrapping it all up – or, I have a “linked data” dream…
OK, to recap we have:
- A scientific movement that advocates open approaches to data, theory and practice
- Emerging foundational interoperability using semantic web technology
- The potential to remove a barrier and facilitate the submission of primary data
These are three powerful movements that could prove to be highly disruptive. In combination they have the potential to turn archaeological data and data repositories from static siloed islands (containing data that is increasingly stale) into an interlinked network of data nodes that reflect changes dynamically.
The linchpin is the use of triplestores (RDF databases) that provide persistent identifiers. Persistent identifiers allow us to refer to a digital object (a statement, a file or set of files) in perpetuity, even if the underlying storage location moves. This means links between objects are persistent: therefore, when an observation or interpretation changes its effects are propagated through to all the data/events that link to it. I see organisations such as the ADS, Talis (an innovating semantic web technology provider which provides the Talis Platform, which includes a free RDF hosting service for open data) and national heritage bodies providing such services.
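The indirection behind a persistent identifier can be sketched in a few lines. This is a toy illustration, not how any particular triplestore or repository works: the `Resolver` class, the identifier string and the storage URLs are all hypothetical.

```python
class Resolver:
    """Toy resolver: maps a persistent identifier to its current location."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, location):
        """Record where the object behind an identifier is first stored."""
        self._locations[pid] = location

    def move(self, pid, new_location):
        """Update the storage location; the identifier itself never changes."""
        if pid not in self._locations:
            raise KeyError(f"unknown identifier: {pid}")
        self._locations[pid] = new_location

    def resolve(self, pid):
        """Return the current location for an identifier."""
        return self._locations[pid]


resolver = Resolver()
resolver.register("pid:context-101", "https://old-host.example/archive/101")

# Other records link to the identifier, never to the storage location.
link = ("pid:report-7", "ex:cites", "pid:context-101")

resolver.move("pid:context-101", "https://new-host.example/data/101")
current = resolver.resolve(link[2])
```

The link held by the citing record is untouched by the move; only the resolver's table changes.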
Some open science projects are likely to adopt RDF as their de-facto data sharing format. RDF triples (subject, predicate, object) provide a schema transparent mechanism for data storage. They are not ideal for all data types (raster data structures for example) but when used with Ontology and SKOS, as demonstrated by STAR, they are powerful analytical, search and inference tools.
So, what is the importance of storing heritage data in RDF? Well, it depends which point of view you take. From a data management perspective there is no longer any need to migrate data formats. However, to facilitate re-use different “views” of the RDF model can be generated and incorporated into traditional analytical software, such as GIS. Importantly, analysis stops being a “knowledge backwater”: new knowledge can be appended back into the triplestore.
From a data curation, re-use and analysis perspective the data quality has the potential to be dramatically improved. Deposition is no longer the final act of the excavation process: rather it is where the dataset can be integrated with other digital resources and analysed as part of the complex tapestry of heritage data. The data does not have to go stale: as the source data is re-interpreted and interpretation frameworks change these are dynamically linked through to the archives, hence, the data sets retain their integrity in light of changes in the surrounding and supporting knowledge system.
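A minimal sketch of why linked data need not go stale, assuming a shared classification (for instance a typology) that archived records point to rather than copy. The type identifiers, date ranges and context names are invented for illustration, and the "envelope" calculation is deliberately crude.

```python
# Hypothetical shared typology: type identifier -> (earliest, latest) year AD.
TYPOLOGY = {
    "type:A": (150, 250),
    "type:B": (200, 300),
}

# Archived records link to type identifiers; they do not copy the dates.
CONTEXT_FINDS = {
    "context/101": ["type:A"],
    "context/102": ["type:A", "type:B"],
}


def date_envelope(context):
    """Earliest start and latest end across the types linked to a context."""
    ranges = [TYPOLOGY[t] for t in CONTEXT_FINDS[context]]
    return (min(start for start, _ in ranges), max(end for _, end in ranges))


before = date_envelope("context/101")

# The typology is refined; the archived context record is not edited.
TYPOLOGY["type:A"] = (180, 230)

after = date_envelope("context/101")
```

Because the dates are derived through the link at the moment they are asked for, a revision to the shared framework shows up in every dependent record without any of them being re-deposited.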
An example is probably useful at this juncture: In addition to many other things pottery provides essential dating evidence for archaeological contexts. However, pottery sequences are developed on a local basis by individuals with imperfect knowledge of the global situation. This means there is overlap, duplication and conflict between different pottery sequences which are periodically reconciled (your Type IIb sherd is the same as my Type IVd sherd and we can refine the dating range…… Hurrah… now let’s have another beer). This is the perennial process of lumping and splitting inherent in any classification system. The semantic integration of localised sequences can potentially support more robust pottery frameworks. In addition, as the pottery data is linked (and not decoupled and stale), updated pottery classifications and dating implications immediately update the dating probability density function for a context or group. To make it even better one can reason over the data to find out which contexts, relationships and groups are impacted by a change in the dating sequences either by proxy or by logical inference (a change in the date of a context produces a logical inconsistency with a stratigraphically related group). While we’re on the topic of stratigraphy, an area of notorious tedium and poor quality data (often with conflicting relationships), RDF allows rapid logical consistency checking as stratigraphic relationships are basically a graph and a set of RDF triples is itself a graph. Publicly deposited RDF data should be linked data: this means that all the primary data archives are linked to their supporting knowledge frameworks (such as a pottery sequence). When a knowledge framework changes the implications are propagated through to the related data dynamically. This means that policy, development control and research decisions are based upon data that reflects the most up-to-date information and knowledge….. cool huh.
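The stratigraphic consistency check mentioned above comes down to looking for a cycle in a directed graph: if context A is later than B, B later than C, and C later than A, something has been recorded wrongly. Here is a minimal sketch using a depth-first search. The context numbers are invented, and the `(later, earlier)` pair format is an assumption made for illustration rather than any recording system's actual schema.

```python
def find_cycle(relations):
    """Return one cycle of contexts as a list, or None if consistent.

    relations: iterable of (later, earlier) pairs, read as
    "later is stratigraphically above earlier".
    """
    graph = {}
    for later, earlier in relations:
        graph.setdefault(later, []).append(earlier)
        graph.setdefault(earlier, [])

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                # Reached a context already on the current path: a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


consistent = [("101", "102"), ("102", "103")]
conflicting = consistent + [("103", "101")]

ok_result = find_cycle(consistent)
bad_result = find_cycle(conflicting)
```

The returned list names the contexts involved in the conflict, which is what an excavator would need in order to go back and check the records.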
Incorporating excavation data into RDF means that ontology and SKOS can be used to dynamically repurpose the data for policy formulation, planning impact, regional heritage control and mitigation purposes in conjunction with the SMR records. Raw data can be integrated from multiple different sources with different degrees of spatial and attribute granularity and, where appropriate, generalised so that the data is fit for the end users’ purpose. From a policy perspective curatorial officers no longer have to battle to stop datasets becoming stale and add new datasets to the local SMR. The SMR will remain an essential dataset: even though it is a generalised resource it is the only location of a digital record for resources that are unlikely to be digitised in the future (unless there is a very unlikely reverse in funding patterns). This will allow the curatorial officer to develop more effective regional research agendas based upon up-to-date and accurate data. This has the potential to change the way Historic Environment Information Resources (HEIRs) are managed by curatorial officers and transform how developers (property and software), policy makers and the general public engage with and consume any data. Instead curatorial officers can become the filters through which a dynamic, multi-scale heritage resource can be effectively exploited. They will be able to frame, execute and develop a dynamic regional research agenda supported by a suite of primary linked data resources. They will be able to support innovative access to this data for researchers, planners and most importantly the public. This is a significant and important change in role. In addition the heritage data can be mashed up with other data resources to produce tailor made resources for different end-user communities. This has been successfully employed by data.gov.uk.
Data re-use and mashups are also important for those undertaking research and analysis. The big difference will be for those who undertake research or collect data that transcends different traditional analytical scales. For example, the National Mapping Programme which aims to “enhance the understanding of past human settlement, by providing primary information and synthesis for all archaeological sites and landscapes visible on aerial photographs or other airborne remote sensed data” will provide deeper insights when it is integrated with other data. However, this integration can occur in real time and add tangible interpretative depth. If an interpreter is digitising data from an aerial photograph and they see two ditches cutting one another they are unlikely to be able to tell the relative stratigraphic sequence of the two features. Direct access to excavation or other data will allow the full relationships and their interpretative relevance to be deduced during data collection.
In the longer term consumers of archaeological data will be more used to dealing with primary data and will become more aware of its potential and demand more of the resource. This should produce a ground up re-appraisal of recording systems and a better understanding of archaeological hermeneutics. The interpretative interplay between theory, practice and data as part of a dynamic knowledge system is essential. Although this has been recognised, in reality theory, practice and data have never really been joined up. We don’t have to use a one size fits all approach to conducting excavations (in England this is PPS5, MORPHE and the MoLA (née MoLAS) recording system where excavation is routinely decoupled from analysis) but we can tailor bespoke systems that address local, regional and national research challenges. We can generate interesting and provocative data that can be used to test theory and inform practice and move away from recording systems mired in the theoretical and intellectual paradigms of the mid-1970s. The virtuous circle is re-established: theory will influence practice, which will change the nature of the data, which will impact on interpretative frameworks, which will provide a body of knowledge against which theory can be tested.
There is a new breed: there are people and organisations who don’t want to do what’s always been done. People who are empowered and don’t believe that established institutions and hierarchies are the gatekeepers of progress: organisations that can, and want to, change the way we ‘play the game’, people who want to collaborate. Organisations that want to share. Open approaches can help to make all this happen. This is all facilitated by disruptive technology which is increasingly mature, broadly available for free (or at a low cost) and with low barriers of use and re-use. In the nearly twenty years of studying and working in the heritage sector I’ve seen it change dramatically. I feel we are on the cusp of changing the way we engage with our data which could profoundly alter the way we understand the past, how we can communicate this in the present and how we can sustainably manage a complex resource for the future.