We are delighted to be able to present the Angela Dappert Memorial Award for excellence in the theory and practice of digital preservation in 2024. The award is made in collaboration with the International Digital Preservation Conference, iPres. The 2024 iPres Conference will be held in Belgium on 16-20 September. I am looking forward to seeing the excellent papers short-listed by the iPres Programme Committee.
2024 Angela Dappert Memorial Award Winners
The 2024 Angela Dappert Memorial Award in Digital Preservation has gone to Klaus Rechert, Dragan Espenschied, Rafael Gieschke, and Wendy Hagenmaier for the work reported in their iPres 2024 paper “Preserving Users’ Knowledge of Contemporary and Legacy Computer Systems.”
I feel that this paper signals a shift in the way the digital preservation community is thinking about emulation technology, as well as about capturing knowledge of legacy computer systems and software. Emulation methods have become more common in long-term digital archiving environments. They make it possible to run and interact with legacy software and data that ran on machines that may be long since retired and difficult to find outside of a computing museum.
But as the people who knew how to use those systems become older and, sadly, rarer, a new challenge emerges: how can we capture and preserve their knowledge of how to use these systems? Most of the approaches that people have explored rely on extensive manual documentation and video recordings. Rechert and his co-authors have taken a different approach. They record interaction events between a user and a system and then use large language models to reformulate and summarize these event records. This could dramatically reduce the cost of preserving this legacy knowledge. I’m looking forward to their next steps, as well as to learning whether these techniques for uncovering tacit knowledge apply in other areas.
I was able to present the award at the 2024 iPres conference in Ghent, Belgium.
Video: AI and Archives Part 2: Hazards
This is the second in our video series done in collaboration with the eArchiving Initiative. Organisations, including archives, are driving ahead towards deploying AI-based systems. What hazards stand in our way? Adam Farquhar, founder of Digital Lifecycle Management, discusses some of the risks, including the new EU AI Act.
Video: AI and Archives Part 1: Opportunities
This is the first in our video series done in collaboration with the eArchiving Initiative. It looks at how archives can take advantage of artificial intelligence, machine learning, and large language models to enhance core business processes.
The videos are available through the e-Ark Consortium YouTube Channel.
Archives and AI Part 1: Opportunities
Archives are knowledge-focused organizations, but they lag in adopting Artificial Intelligence and Machine Learning technology. This article outlines specific AI and ML use-cases and opportunities for archives. My goal is to stimulate archivists to think about possibilities and technologists to consider new opportunities.
Many opportunities arise only when archival materials are in digital form. Every year, a larger portion of archival materials are born digital. Archives can also digitise materials when they are physical. Digitisation by itself can enable researchers to access materials without a journey to the archive, but when it is coupled with AI tools like handwriting recognition or image analysis, benefits really start to flow.
In this article, I’m focusing on the distinctive business processes of archives. I take these broadly to include national, regional, city, university, business, and cultural heritage institutions that preserve significant records for current and future use. For now, I’ll consider five core business processes:
- Appraisal and Acquisition: Determining which records to add to the collection based on their value and the archive’s mission, policy, and resources. This applies to gifts and loans as well.
- Arrangement and Description: Organizing and describing materials for findability and understanding. Arrangement follows archival principles like provenance and original order. Description creates finding aids, catalogue records, and other discovery tools.
- Preservation: Ensuring that holdings survive for the long term through proper storage, monitoring, and preventive measures. This includes disaster planning and recovery for both digital and physical materials.
- Access: Providing access through research guides, discovery tools, reference services, and digital access services.
- Outreach: Actively promoting collections and their value through exhibitions, educational programs, and publications.
Modern AI tools, such as LLMs (large language models) including Google’s Gemini, OpenAI’s ChatGPT, Anthropic’s Claude, or Meta’s Llama, can provide benefits for every one of these processes. The first three processes are oriented towards archivists. Depending on the organisation, access is often oriented towards researchers, while outreach is oriented towards the general public.
Appraisal and acquisition
Let’s consider two situations.
- A national archive receives materials from a government department on a scheduled basis. One of the main tasks here is to ensure that the archive receives everything that it is supposed to; another is to ensure that it does not receive things that it shouldn’t. It is challenging for an archivist to keep up to date on the major projects or even the organisational components of a department. But an AI system could read departmental and government web sites to identify projects, organisational components, and key people. It could then cross-check that the relevant material is included.
- An archivist is considering whether to accept a private donation. She has expert staff, but she realizes that it will be years before they have time to organise and describe the content. That means it will be years before any researcher will be able to use it, if ever. If it can’t be used, it has little value. If the material is digital or digitised, an AI system could carry out the second step – providing an initial description that would enable archivists and researchers to better understand it.
For me, this is one of the most important contributions. By reducing the cost of description, AI can reduce the cost of acquisition. I used to hate it when I worked at the British Library and we had to turn down fascinating donations – often the archive of an author, researcher, or artist – because we couldn’t afford to make them accessible. Even worse was when we could have raised money to cover the cost of digitisation but not enough for description.
I think that the acquisition process can also be supported by AI, but those use cases are broadly shared by other organizations that negotiate over and price materials such as auction houses, galleries, or many other businesses.
Arrangement and description
This is an area with huge potential – especially if material is in digital form or can be digitised. Current AI systems are very capable at summarising documents, identifying people, places, and projects, building up timelines, and many other tasks that lead to useful descriptions. They can even do a decent job of identifying things like key people or important projects. There have also been major improvements over the last few years in handling images, audio, and video material. This could make material that was previously completely inaccessible, or very expensive to describe, a full part of the archive.
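To make this more concrete, here is a minimal sketch of how a first-pass description might be drafted with an LLM. It assumes a locally hosted model served by Ollama (which I write about in another post) and the ollama Python package; the prompt, model choice, and sample text are purely illustrative rather than a finished cataloguing workflow:

import ollama  # pip install ollama; assumes a local Ollama server with the phi3 model pulled

def draft_description(document_text: str, model: str = "phi3") -> str:
    # An illustrative prompt for a first-pass archival description.
    prompt = (
        "You are assisting an archivist. Read the document below and provide:\n"
        "1. A two-sentence summary.\n"
        "2. The people, places, organisations, and projects mentioned.\n"
        "3. Any dates that could anchor a timeline.\n\n"
        f"Document:\n{document_text}"
    )
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

# A hypothetical item, just to show the call.
sample = "Letter from J. Smith to the Survey Office, 12 March 1923, regarding the Dartmoor mapping project."
print(draft_description(sample))

In practice, of course, an archivist would review and correct this kind of draft before it went anywhere near a finding aid.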
Preservation
For now, off-the-shelf AI components may have less to contribute to this area. There are, however, many possibilities. For example, a drone could take images or video of (open) storage areas that could be automatically reviewed to identify damage, rot, or signs of rodents. People working with obsolete software systems could consult a system that has distilled knowledge from manuals, user forums, and YouTube videos and can show them how to accomplish their tasks. Photographs could be digitised and minor damage like scratches repaired.
Access
AI tools are absolutely revolutionising how researchers interrogate and interact with content. We are all doing it! Summarise this document! What is this project about? Explain this long email chain! Who are the key people involved? Tell me about this artist’s broader career! What was the broader social context for this work? Describe the photos in this folder! Who is speaking on this recording? Translate this to a language that I speak!
Researchers need a new generation of AI-enabled tools to work with and understand archival materials. This has the potential not only to accelerate their work, but also to enable them to use material that they would never have considered before because it was too much additional work to find, access, or understand. This could dramatically increase the real value of archival material and of archives themselves.
Of course, when a summary is created – just as when a researcher reads an item – there are opportunities for misunderstanding and unexpected biases to be introduced. In a later article, I’ll discuss how I like to think about the contributions that these sorts of tools make and point to some new research in this area.
Outreach
Archival collections can bring value to diverse, distributed communities. AI tools have the potential to lower the cost of interacting with collections in ways that people find meaningful – in a language and style that is natural for them. Sometimes this is literally language. Current LLMs can do a brilliant job of translating text from one language to another. But they are also good at paraphrasing in a way that makes material easier to understand. For example, I can ask the system to “explain it like I’m five” or “paraphrase this without the math.” With historical English texts, I can ask for a paraphrase in modern English. All of these methods can introduce errors or misunderstandings, but they also give people an enriched starting point.
Current LLMs also make it easier to embed archival items in a broader business, cultural, and historical context. People can have open-ended conversations that approximate what they might have with a curator. AI tools also open up collections to people who are not comfortable asking questions directly to a person or coming into a traditional archive building. They never get tired of answering the same question or explaining something in a new way.
Conclusion
My goal in this article was to explore AI and ML use-cases for archives and help to build a bridge between archivists and technologists. Archivists need to develop a realistic vision of how AI can benefit them. Technologists need to understand more about what archives do so that they can develop helpful solutions.
The next article in this series will talk about some of the barriers to introducing AI technology in Archives and how to address them. After that, we will look at the intersection of possibilities and technology and build up a timeline to guide thinking about implementations.
Running a (not-so) large LLM locally
Today, it is possible to run a surprisingly capable large language model (LLM) at home on a modest consumer-grade machine. The full LLMs that power services like ChatGPT, Gemini, Bard, or CoPilot have hundreds of billions or even trillions of parameters. This means that they require an absurd amount of memory to run.
When I was attending the Fully Connected conference in May, I spoke with a couple of people who turned me on to ways to run LLMs locally. The methods ranged from insanely easy to rather easy. Once I became aware of them, I started to see articles and YouTube videos talking about them as well. I thought that I would share my experience using Ollama, which is insanely easy to get started with.
Ollama is an open source framework to download and run LLMs on your own machine. I love two things about this. First, it means that you can preserve your privacy. All of the data and computation remains in an environment that you control completely. Second, using a local machine means that you know the costs – and they are limited to electricity (and what you might have already spent on your hardware). I hate waking up in the morning and realizing that my cloud job ran overnight and resulted in a big compute bill.
I’ve been using Ollama on Linux extensively, but for this article I thought that I would check whether it works just as well under Windows 11.
Getting started really is this easy:
- Download and run the Windows installer from https://ollama.com/download/OllamaSetup.exe
- Open a terminal and run Ollama, telling it which LLM you would like to use.
- Chat away!
To use the rather nice 4B-parameter Phi3 model from Microsoft, it is as easy as:
PS C:\...> ollama run phi3
The first time that you do this, Ollama will download the model and print out some information while you wait. This version of the Phi3 model is around 2.4GB in total. You may have time for a coffee and to think about ordering a bigger disk!
pulling manifest
pulling b26e6713dc74... 100% ▕█████████████████████████████████████████████████████▏ 2.4 GB
pulling fa8235e5b48f... 100% ▕█████████████████████████████████████████████████████▏ 1.1 KB
pulling 542b217f179c... 100% ▕█████████████████████████████████████████████████████▏ 148 B
pulling 8dde1baf1db0... 100% ▕█████████████████████████████████████████████████████▏ 78 B
pulling f91db7a2deb9... 100% ▕█████████████████████████████████████████████████████▏ 485 B
verifying sha256 digest
writing manifest
removing any unused layers
success
Once the model is downloaded, Ollama sits waiting for your question. Let’s ask it why grass is green.
>>> why is grass green?
Grass appears green to us primarily due to the way it reflects light. The color of an object, in a basic sense, depends on which wavelengths of visible light it absorbs and which ones it reflects. Plants contain chlorophyll, a pigment crucial for photosynthesis—the process by which plants convert sunlight into energy.
Chlorophyll is particularly efficient at absorbing blue and red light frequencies but does not absorb green wavelengths effectively. Instead, it reflects these green light waves back to our eyes more than other colors do. This reflection of the green spectrum makes grass appear predominantly green.
It's also worth mentioning that different types of plants have varying shades of green based on their chlorophyll content and can sometimes contain other pigments, which might give them slightly different hues under certain conditions (e.g., autumn leaves are not purely green because they reflect more red light as well).
This mechanism is an evolutionary adaptation that helps plants optimize the process of photosynthesis while also providing a deterrent to herbivores, who may find bright green foliage less appetizing than other colors.
>>> Send a message (/? for help)
To me, this seems like a pretty high-quality answer to the question. The Phi3 model is not a specialist in botany or physics. It has rather a lot of general knowledge that you can now access locally. Asking it about topics like digital preservation – or even relationships and dating – gives remarkably thoughtful and complete answers.
I am super impressed with the quality of the Ollama implementation. It is very easy to download, install, and use. It is also very sophisticated and supports a wide range of applications. For example, it will use a GPU to run a model more efficiently, but it will seamlessly blend this with the CPU and regular RAM if the model doesn’t fit in the GPU. That means that your machine can run larger models than you might expect.
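Beyond the interactive prompt, Ollama also exposes a local REST API (on port 11434 by default), which makes it easy to call a model from your own scripts. Here is a minimal sketch in Python using the requests library; the question is just an example, and the details reflect my current understanding of the API rather than a specification:

import requests  # pip install requests

# Ollama's default local endpoint; no API key needed, and nothing leaves your machine.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "phi3", "prompt": "Why is grass green?", "stream": False},
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])  # the model's answer as a single string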
I want to highlight that a model with 4, 8, or 13 billion parameters is not as capable as one with hundreds of billions or more. Phi3 or a 13B Llama is not the same as Claude 3.5 or GPT-4o. But they are surprisingly capable, simplified versions of these larger models and benefit substantially from their big brothers.
I may follow up with a longer post on using Ollama and my experiences with the various LLMs that are in the Ollama library including the multimodal ones that can handle a mix of images and text. There is also a growing ecosystem of tools and systems to work with a mix of cloud and local models. I’ll be writing about some of them over the coming weeks.
Navigating the Evolving AI Landscape
In May, I had the opportunity to attend the Fully Connected conference in London. The conference was run by Weights & Biases, a company that provides tools and services that help ML (machine learning) engineers manage the complexity of training and deploying machine learning models.
They’ve expanded their view of the market for ML tools and services to reflect the ways that it is maturing. Some of these ideas are influencing the way that I think about how companies develop and integrate ML.
I’ve been following them and using their tools off and on since shortly after they began in 2017. Their initial tools were a great support for people training machine learning models. At that time, there had been huge progress in vision problems, like classification and object detection using deep convolutional neural nets (CNNs). Foundational vision models were available in a flourishing array of architectures. They provided a starting point and could recognize many types of objects, like cats, flowers, buses, or people. Starting from one of these pre-trained foundation vision models, fine-tuning can be relatively straightforward.
For many practical tasks, you can get good results by fine-tuning with hundreds or a few thousand training examples. Of course there is more to it. In addition to the training data, there is a large space of hyperparameters that you can select from to try to improve on your current best result. For example, you can select different model architectures, model sizes, learning rates, optimizers, batch size, number of epochs, size of input images, and more.
I guess that just about everyone who was training models in 2017 spent time wandering through the hyperparameter space – losing their notes, forgetting which combinations of hyperparameters and models they had run before, which datasets they had trained on, and wondering which combinations were most effective. In any case, I know that I did! To be fair, there was a lot to track, and not everything that needed tracking was obvious.
Weights & Biases provides services that help ML engineers manage this complexity. It made a huge difference in making sense of model training. It only took a few lines of code, and you could readily log and monitor a training run. They captured most of the relevant hyperparameters and logged both ML information (e.g., loss, epoch) and system information (e.g., GPU usage and temperature). They also captured thumbnails of sampled input and output images. This helped a lot to understand what was going on and to catch certain sorts of bugs. They also have hyperparameter search capabilities to help you explore the space of possibilities systematically. One of the things that I liked best about their tools was how robust they were and how little impact they had on the servers that I used to train the models.
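For anyone who hasn’t seen it, the basic pattern looks roughly like the sketch below. The project name, hyperparameters, and metrics are purely illustrative, the two stub functions stand in for a real training and evaluation loop, and it assumes you have a wandb account and have already run wandb login:

import random
import wandb  # pip install wandb

def train_one_epoch() -> float:
    return random.random()  # stand-in for a real training step

def evaluate() -> float:
    return random.random()  # stand-in for a real evaluation step

# Start a run and record the hyperparameters you want to keep track of.
run = wandb.init(
    project="vision-finetune",  # illustrative project name
    config={"learning_rate": 1e-4, "batch_size": 32, "epochs": 10},
)

for epoch in range(run.config["epochs"]):
    wandb.log({
        "epoch": epoch,
        "train_loss": train_one_epoch(),
        "val_accuracy": evaluate(),
    })

run.finish()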
Weights & Biases have made substantial progress since those early days, and I was impressed by the clarity with which they see the market developing. They used to target people who were training small models or perhaps fine-tuning medium-sized ones. Today they look at their market as having three components:
- Foundation model builders. These are the people who train large models from scratch. They typically work for large corporations or well-funded university research groups. Today, training these large models on internet-scale datasets can cost millions of dollars. Perhaps there are a few thousand people in this prestige market, including engineers at Google, Meta, OpenAI, and Anthropic.
- Specialized model builders. These are the people who take a foundation model (either an LLM – large language model – or a vision model) and adapt it for a specialized setting. For example, taking a vision model and further training it to identify cancerous growths from scans or to segment an image into subject and background. These settings often have stringent operational requirements and need both additional training and rigorous evaluation. You might see specialized models being developed for many markets, including hospitals, law firms, or retailers. Perhaps there are a hundred thousand people in this market addressing specific sector and organisational needs.
- Application builders. This is the fastest-growing market – people building applications that include an off-the-shelf ML/AI component as just another widget that helps them to solve a business problem or make their users happier. There could eventually be millions of these people building everything from in-house tools to scalable software-as-a-service offerings.
Providers targeting ML developers will benefit from recognizing these as three distinct markets.
How LLMs Work Part 1: The Basic Loop
Machine learning systems based on LLMs (large language models) are everywhere! Many of us use systems like OpenAI’s ChatGPT, Anthropic’s Claude, Meta’s Llama, Google’s Gemini, or Microsoft’s CoPilot directly every day. Even more of us encounter them indirectly through services or systems that we work with – whether that is searching the web, composing an email, or dealing with customer services. And some of us are working on building them or integrating them into new applications.
But most of us know very little about what is going on behind the curtain or inside the black box. Interacting with them sometimes feels like interacting with a human, but they also have some very non-human behaviours and quirks. Understanding even a little about how they work helps us to make sense of their surprising behaviours and enables us to use these systems more effectively.
In this article, I’m going to slide back the lid of the black box to shed some light on what’s inside. I will try to be clear and make this accessible for anyone who wants to learn about how LLMs work. There won’t be any math, calculus, or complex algorithms here. If you are looking for a technical discussion, this is not the place! I will also be focusing on LLMs that deal with language only, rather than multi-modal systems that can work with audio, images, or video. I will also leave an explanation of how these models are trained for another day.
Over the course of a few articles, we’re going to look at:
- The basic loop an LLM uses to create an answer.
- The output text and how it is chosen.
- The prompt and how it is often structured.
- The differences between words, tokens, and embeddings.
This should help us to understand how they give different responses to the same prompt, how the response is highly sensitive to the first words generated, why long answers take more time to generate, how a chatbot remembers your discussion, and even some tips for making better prompts.
The Basic Loop
All of the current LLMs and chatbots work the same way. They receive a prompt. For the moment, think of this as some instructions combined with a question. They produce a sequence of words in response. We read those words and interpret them as an answer to our question or the next step in a longer conversation. Most of the time the words that come back are sensible and the sort of thing that a well-informed person might say. This reinforces the impression that they are truly intelligent.
There are several simplifications in this account! We’re going to unpack four of them in this article. I’m going to save the biggest for the end – the input and output aren’t really words. But for the moment, let’s pretend that they are.
One of the remarkable and most non-human aspects of an LLM is that it is not an understanding machine but a prediction engine. It predicts the next word in its response. At each step, it has an input with two parts – the prompt and the words that form its response so far. At the beginning, the response is empty. You can imagine how this goes. Suppose the prompt is “Space, the final frontier.” On the first iteration, the model might predict “These” as a highly likely next word. On the next iteration, the input would be “Space, the final frontier. These”. In response to this, it might predict “are” to be highly likely – followed by “the”, “voyages”, “of”, “the”, “Starship”, and so on.
In fact, the LLM doesn’t just predict the most likely next word given the input. It predicts likelihoods for every word that it knows of being the next word. So, “Enterprise” will be predicted with a high likelihood; “banana” less likely, and “gastroenterologist” even less so.
It is worth noting that the core LLM is deterministic. That is, for a given prompt and partial response, it will predict the same likelihoods for the next word every time. There is no creativity going on here. It gives the same predictions every time! Totally fixed!
There is a separate component that looks at the likelihoods and selects the next word to add to the response. It could just pick the most likely next word, but that would be super boring. Imagine if every high school student who had ChatGPT write an essay for them turned in the exact same one.
To make the responses more diverse, LLM implementations have some dials, called hyper-parameters, that influence how varied and surprising a response is. Three commonly used ones are:
- Temperature. If the temperature is set to zero, the most likely word is always picked. A setting of one uses the model’s original distribution of likelihoods. So, if the most likely word has a likelihood of 25%, the second 15%, and the third 10%, then the second word will be selected 15% of the time. A setting of more than one flattens the distribution, so the most likely words are selected less often and the less likely ones are chosen (relatively) more often. Cranking this dial up too high can give some nutty or malformed responses.
- Top P. This limits the choices to a portion of the distribution. If the value is 1.0, then all of the words are included. If the value is 0.5, then only the most likely words whose likelihoods add up to 50% of the distribution are considered.
You can see how these two work nicely together. Using a lower value for Top P and a higher value for Temperature allows lots of creative choice within a limited (hopefully sensible) subset of the possible words.
- Repetition penalty. This can result in a word being rejected if it has already been used in the response. It helps to reduce redundancy and matches our expectations for good writing.
To summarise: the output of an LLM system is composed over many iterations. Each time through the loop, the LLM predicts the likelihoods of possible next words, a separate component picks one, and that word is added to the partial response. The prompt and the partial response are then fed back into the LLM.
The loop stops either because the next word is a special one that signals the response is complete or because the maximum length has been reached. We’ll come back to special tokens like this later.
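To make the loop concrete, here is a toy sketch in Python. The next_word_probabilities function is a made-up stand-in for a real model (it ignores its input and knows only four “words”), but the temperature scaling, Top P filtering, sampling, and stop condition follow the logic described above:

import random

# A made-up "model": it returns fixed next-word likelihoods and ignores its input.
def next_word_probabilities(prompt: str, response: list[str]) -> dict[str, float]:
    return {"Enterprise": 0.4, "galaxy": 0.3, "<end>": 0.2, "banana": 0.1}

def choose_word(probs: dict[str, float], temperature: float = 0.8, top_p: float = 0.9) -> str:
    # Temperature: values above one flatten the distribution, values below one sharpen it.
    scaled = {word: p ** (1.0 / temperature) for word, p in probs.items()}
    total = sum(scaled.values())
    scaled = {word: p / total for word, p in scaled.items()}
    # Top P: keep only the most likely words whose combined likelihood reaches top_p.
    kept: dict[str, float] = {}
    cumulative = 0.0
    for word, p in sorted(scaled.items(), key=lambda item: item[1], reverse=True):
        kept[word] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Sample one word from the remaining candidates, weighted by likelihood.
    return random.choices(list(kept), weights=list(kept.values()), k=1)[0]

def generate(prompt: str, max_words: int = 20) -> str:
    response: list[str] = []
    while len(response) < max_words:
        probs = next_word_probabilities(prompt, response)
        word = choose_word(probs)
        if word == "<end>":  # the special word that signals the response is complete
            break
        response.append(word)
    return " ".join(response)

print(generate("Space, the final frontier. These are the voyages of the Starship"))

A real LLM, of course, predicts likelihoods over tens of thousands of tokens conditioned on the whole input, but the shape of the loop is the same.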
So now we know quite a bit about the inner loop of an LLM. We also know about some of the dials that we can use to influence its output. There is one additional thing to mention. You may have seen some LLM implementations offer several different responses. They do this by generating several responses at once. Remember – the predictions are completely determined by the input, so it is easy to keep a list of partial responses and switch between them as you like. They will have made some different choices early on, so they become increasingly different as they go along. With the prompt “Once upon a time a”, the next word might be “princess” or “dragon” or “boy” – and each would give rise to a very different continuation.
If you’ve enjoyed this article, please let me know in the comments. If you have any questions or topics that you would like us to write up, let us know. If there is anything that we could do to make this clearer or easier to understand, that would be especially helpful to hear.
In the next article in this series, we’re going to look at tokens and embeddings.
Digital Lifecycle Management Ltd is now a Member of the DLM Forum
We are delighted to now be a member of the DLM Forum. I’m looking forward to engaging with the Forum’s international community as we look at new challenges and opportunities for data, archive, and document lifecycle management. I’m particularly excited about exploring the opportunities for applying AI/ML methods in these settings.
Digital Lifecycle Management Ltd to join eArchiving Initiative
We are looking forward to joining the eArchiving Initiative. The initiative is taking the next step on a ten-year journey to make it easier for organisations to preserve, migrate, reuse, and trust digital information over the long term. The team is developing software, tools, and standards that support interoperability across archiving systems. They are also providing a conformance specification and implementing tools to check system conformance. In addition to the web pages on the EC’s Digital Strategy site, there is more information on the e-Ark Foundation site.