Automating OSINT pipelines with LLMs

Contents

  1. Intended audience
  2. What we want to achieve
  3. The overall design principles of the system
  4. The pipeline methodology
  5. Proof of concept practical guide for non-developers
  6. Summary

1. Intended audience

Excluding the truly stubborn or unlucky, there probably aren’t too many intelligence professionals who haven’t already embraced sensible LLM use to improve their capabilities or make their job easier.

But how can this be automated? This post is aimed at the non-developer and the developer looking to scale a production system that processes millions of articles/documents per day. I will run through a simplified version of the process we use at Exo-Sig and the lessons learned from automating LLM use over the past three years.

The final section, Proof of concept practical guide for non-developers, is pretty self-explanatory. It will hopefully enable a non-developer to put together a proof of concept to demonstrate value within their organisation.

2. What we want to achieve

This guide is focused on processing text; some of it may carry over to imagery, video and audio, but they aren't the focus.

For this example, we want to automate the daily processing of news articles or longish text documents: collating, processing and filtering them down to the high-value articles/documents for review by a human analyst.

The example is a proof of concept with a small budget ($50) and no development team, with the intention of transitioning to a production-level system at a later date. The principles and methodology scale well from proof of concept to a large production system.

3. The overall design principles of the system


3.1 Principle one: Inverted pyramid to slash data fees

To cut data fees by over 90%, we adopt the inverted pyramid methodology: small, cheap LLMs handle the large query volumes at the top, working down to large, expensive LLMs for a very small query volume at the bottom. Each step filters out articles that we don't think will be relevant to the analyst. The prices shown below are approximate as of 29 September 2025.

1,000,000 daily articles
  ↓ Keyword filter, for example kidnap or kidnapping
50,000 daily articles
  ↓ Small LLM (7B – 14B) – input $0.04 per 750,000 words, output $0.10 per 750,000 words
2,000 daily articles
  ↓ Medium LLM (30B – 70B) – input $0.10 per 750,000 words, output $0.28 per 750,000 words
200 daily articles
  ↓ Large LLM (200B+) – input $0.30 per 750,000 words, output $2.90 per 750,000 words
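
To see what this pyramid costs in practice, here is a rough back-of-envelope calculation. It is a sketch only: the ~500 input words and ~50 output words per article are assumptions for illustration, plugged into the approximate prices above.

```python
# Back-of-envelope daily cost of the pyramid. Assumes ~500 words per
# article and ~50 words of output per article (both illustrative),
# with the approximate per-750,000-word prices quoted above.
WORDS_PER_ARTICLE = 500
OUTPUT_WORDS_PER_ARTICLE = 50  # short JSON answers

TIERS = [
    # (tier, articles/day, input $ per 750k words, output $ per 750k words)
    ("small",  50_000, 0.04, 0.10),
    ("medium",  2_000, 0.10, 0.28),
    ("large",     200, 0.30, 2.90),
]

total = 0.0
for name, articles, in_price, out_price in TIERS:
    cost = (articles * WORDS_PER_ARTICLE / 750_000) * in_price \
         + (articles * OUTPUT_WORDS_PER_ARTICLE / 750_000) * out_price
    print(f"{name}: ${cost:.2f}/day")
    total += cost
print(f"total: ${total:.2f}/day")
```

Under these assumptions, even the full production volumes above come to only about $2 a day in LLM fees; a proof of concept processing a fraction of that volume costs pennies.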


Or, for another way of visualising it:

  • Small LLM → Junior analyst
  • Medium LLM → Analyst
  • Large LLM → Senior analyst

3.2 Principle two: Design a system that assumes LLM failure

LLMs are fantastic, but in three years I haven't seen a single task they score 100% on. So when you design your system, make sure it doesn't require perfect performance to add significant value to the end user. We aim for 95% accuracy on core functionality and accept lower accuracy on less important tasks. Also consider who you are competing with: the human. Ask a human to review 2,000 documents a day in a very boring, formulaic way and the human won't score 100% either.

3.3 Principle three: Have a robust testing methodology and test often

A whole post could be written on this, but it has a few really simple principles.

  1. Create a test that is representative of the real-world task. If your solution should be global, select a global dataset to test on. If it is multi-language, test all the languages.
  2. Make sure the sample is statistically significant. For a proof of concept, test at least 500 examples; for a production system, at least 2,000.
  3. For starters, pick examples from the very beginning of the pipeline, such as a news article that should be flagged to an analyst, and check the end of the pipeline to see whether it was. This is quick and fits proofs of concept well (a minimal harness is sketched below).
  4. Test often. Testing can't be done only once; it needs to recur weekly or monthly, with an audit trail.
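
As a concrete example of point 3, here is a minimal harness. It is a sketch that assumes a hand-labelled CSV of article IDs with a 1/0 should_flag column, and an article_was_flagged() helper (hypothetical) that checks whether the pipeline surfaced the article:

```python
import csv

def article_was_flagged(article_id: str) -> bool:
    """Hypothetical helper: query your final table and return whether
    the pipeline surfaced this article to an analyst."""
    raise NotImplementedError

# labelled_set.csv needs both positives and negatives: article_id,should_flag
tp = fp = fn = tn = 0
with open("labelled_set.csv", newline="") as f:
    for row in csv.DictReader(f):
        expected = row["should_flag"] == "1"
        actual = article_was_flagged(row["article_id"])
        if expected and actual:
            tp += 1
        elif actual:
            fp += 1
        elif expected:
            fn += 1
        else:
            tn += 1

print(f"false positive rate: {fp / (fp + tn):.1%}")
print(f"false negative rate: {fn / (fn + tp):.1%}")
```

These two rates are exactly the quantifiable answers you will need when presenting the proof of concept (see Section 5.3).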

4. The pipeline methodology


We want to identify relevant content, classify it, extract key characteristics and assess its importance. Below is a simplified flow of how Exo-Sig does this. We have 36 stages, but for this example we will use nine to avoid confusion.

4.1 Stage one: Keyword match from API

Provide the news API with a list of keywords that appear frequently in articles or posts about your event or topic of interest.

4.2 Stage two: Are you the type of thing I am interested in?

Small LLM - one question

Write a short paragraph describing the type of event or topic you are interested in; for example, if it is a riot, describe what you classify as a riot. Ask the LLM for a yes or no answer and include the news article/document in the query. This will filter out a large amount of irrelevant content.
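
A minimal sketch of this stage, assuming an OpenAI-compatible chat endpoint; the base URL, model name and the riot definition are placeholders to swap for your own:

```python
from openai import OpenAI

# Placeholder base URL, key and model name; use your provider's values.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

DEFINITION = (
    "A riot is a violent disturbance of the peace by a crowd..."  # your paragraph
)

def is_relevant(article_text: str) -> bool:
    """Stage 2 filter: one yes/no question to a small LLM."""
    response = client.chat.completions.create(
        model="small-7b-model",  # placeholder
        messages=[{
            "role": "user",
            "content": (
                f"{DEFINITION}\n\n"
                "Does the following article primarily describe such an event? "
                "Answer with exactly one word: yes or no.\n\n"
                f"{article_text}"
            ),
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```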

4.3 Stage three: Are you the exact type of thing I am interested in?

Medium LLM - 10 questions

The real world is complex, and as you scale your system up you will find all kinds of content that loosely covers the right event but that you are not interested in. For example, here are a few questions we use:

  1. Is the article covering a future event?
  2. Is the article covering a court case about the event rather than the event itself?
  3. Is the article covering an event that occurred more than one year ago?

4.4 Stage four: Extract the key characteristics of the event

Medium LLM - 20 questions

By this point you should have filtered out the vast majority of false positives, so it is time to extract the key characteristics. If the event of interest is an insurgent attack, these might be the number of fighters involved, the weapon systems used or the number of casualties. In a production system you won't want to ask all 20 questions in one batch, as accuracy will suffer. Split them into smaller batches, and for key characteristics such as the date it is a good idea to ask that question on its own.
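
As an illustration, a small batch from the insurgent-attack example might look like the sketch below; the JSON field names are made up for the example:

```python
# Illustrative Stage 4 question batch; the JSON field names are made up.
EXTRACTION_QUESTIONS = (
    "You are a risk intelligence analyst. Answer the following questions "
    "about the article below. Respond only with JSON in exactly this format:\n"
    '{"fighters": <integer or null>, "weapon_systems": [<strings>], '
    '"casualties": <integer or null>}\n\n'
    "1. How many fighters were involved? Use null if not stated.\n"
    "2. Which weapon systems were used? Use an empty list if none are named.\n"
    "3. How many casualties were there? Use null if not stated.\n\n"
    "Article:\n"
)

article_text = "..."  # the document that survived Stages 2 and 3
prompt = EXTRACTION_QUESTIONS + article_text
```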

4.5 Stage five: Data normalisation

Not the most interesting subject, but very important. You have now received answers from the LLM that you are going to put in front of the end user, so it is important to make sure they pass some rules. For example, if you asked multiple-choice questions, you'll want to verify that the model gave a valid answer. The easiest way to deal with LLM questions and answers is always to ask for the answer as JSON.
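
A sketch of the kind of rule this stage enforces, assuming the LLM answered in JSON and one question was multiple-choice; the option list is illustrative:

```python
import json

# Illustrative option list for a multiple-choice question
ALLOWED_EVENT_TYPES = {"kidnapping", "riot", "armed attack", "other"}

def normalise_answer(raw: str) -> dict | None:
    """Return a clean answer dict, or None if the reply fails the rules."""
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON: reject (and optionally re-ask)
    if answer.get("event_type") not in ALLOWED_EVENT_TYPES:
        return None  # the model answered outside the multiple-choice options
    return answer
```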

4.6 Stage six: Quality control

Medium LLM - one question

LLMs make mistakes, so it is sensible to put in some QC steps: for example, ask the model to rate from 1 to 5 how certain it is that the article's primary subject is a kidnapping event. This will allow you to further reduce your false positive rate. You can also apply this QC step to key answers from Stage 4.
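
If you store that 1-to-5 rating as a column alongside each article, the gate itself is a single line of pandas; the column name and threshold are illustrative:

```python
# Keep only articles the model rated 4 or 5 out of 5 for being a kidnapping event
df = df[df["kidnap_certainty"] >= 4]
```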

4.7 Stage seven: Grouping duplicate content

Medium LLM - one question

For a lot of use cases you will want to group or cluster articles/posts that talk about the same event. For example, if in Stage Four you asked the LLM to generate a short summary of each event, you can send all the summaries for the same country and the same date to the Stage Seven LLM and ask it to return the IDs of duplicate events. This sounds simple, but it tends to be one of the most frustrating steps and the one that requires the most testing.
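
A sketch of how the de-duplication question can be built, assuming the Stage 4 summaries for one country and one date sit in a dict keyed by article ID; the prompt wording is illustrative:

```python
def build_dedup_prompt(summaries: dict[str, str]) -> str:
    """summaries maps article ID -> Stage 4 summary, pre-filtered
    to a single country and a single date."""
    listing = "\n".join(f"[{aid}] {text}" for aid, text in summaries.items())
    return (
        "Below are summaries of news events, each prefixed with its ID in "
        "square brackets. Group together the IDs that describe the same "
        "real-world event. Respond only with JSON: a list of lists of IDs, "
        'for example [["a1", "a7"], ["a3"]].\n\n' + listing
    )

# Send build_dedup_prompt(summaries) to your medium LLM and json.loads() the reply.
```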

4.8 Stage eight: Assessments

Large LLM - five questions

Here we can ask the model to assess the impact of an event or how interesting it will be to an analyst. All LLM steps need good examples to achieve good performance, but for this stage it is imperative. Aim for at least 20 good examples in the prompt and take your time writing them. If you are using a reasoning model (you should be), make sure to save the reasoning trace: it is useful and interesting to see how the model arrives at its assessments.
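
Capturing the trace depends entirely on your provider: some OpenAI-compatible endpoints return it as an extra field on the message, others not at all, so treat the sketch below as an assumption to verify against your provider's docs.

```python
# Reusing the client from the Stage 2 sketch; the model name is a placeholder.
assessment_prompt = "..."  # your Stage 8 prompt with at least 20 good examples
response = client.chat.completions.create(
    model="large-reasoning-model",
    messages=[{"role": "user", "content": assessment_prompt}],
)
msg = response.choices[0].message
assessment = msg.content
# Some providers expose the trace as an extra attribute on the message;
# the field name varies, so check your provider's documentation.
reasoning_trace = getattr(msg, "reasoning", None)
# Store both: the trace shows how the model arrived at its assessment.
```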

4.9 Stage nine: Dissemination

For a proof of concept this can be as simple as email alerts. For a production system having a frontend platform allows users to get the most use out of the data. The great thing is that with current LLMs it is perfectly feasible for a non-developer to knock together a frontend for a proof of concept within a week.

5. Proof of concept practical guide for non-developers


5.1 Proof of concept guide for stage one: Keyword match from API

Let’s go with a news API. One of the providers Exo-Sig uses is Event Registry. It is attractive for our proof of concept because it has a free tier. Sign up and grab the API key.

Now to explain, for a non-developer audience, the overall theory of how this will work. These stages really just boil down to:

  1. Get some rows of data from somewhere (API or database table)
  2. Ask an LLM a question/set of questions through an API on each row of data (i.e. news article)
  3. Modify or add to each row of data in some way
  4. Insert the data rows into a database table

That is it; it is super simple, like editing Excel spreadsheets. I suggest Python for the language and Postgres for the database. Don't worry if you don't know Python, and it's fine if you choose a different database. Start by asking a decent LLM the following question.

"I am not a developer; I want to write a simple Python script that will query the Event Registry (or other provider) for keywords XXX and YYY. I want you to keep your example super simple and please explain to me what it is doing. Try and keep it to the minimal amount of code required and avoid anything complicated. Please print the return from the API. At a later stage we will insert it into a database. You will also need to explain to me how to run the file."

Now, when you are happy with this, ask the LLM to help you import the results into a database. This should be no more than about 10 lines of code, and you will be shocked at how easy it is. Remember to ask the LLM to keep it simple and use the easiest database. The best thing about LLMs is that they are like a teacher who never gets fed up with your questions, and you don't need to feel embarrassed about asking them.
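
For reference, the import really is that small. Here is a sketch assuming the API results are a list of Python dicts and you are writing to a local Postgres database; the connection string, table and field names are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; adjust user, password and database name.
engine = create_engine("postgresql://user:password@localhost/osint")

# Articles as returned by the news API script above (fields are illustrative)
articles = [{"id": "a1", "title": "...", "body": "...", "published": "2025-09-29"}]

pd.DataFrame(articles).to_sql("raw_articles", engine, if_exists="append", index=False)
```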

5.2 Proof of concept guide for LLM stages two to eight: Asking questions to LLMs

OK, now we start using LLMs. The same pattern repeats for every LLM stage.

To recap what we will achieve in stages two to eight:

  1. Get the relevant rows from a database table (like a permanent Excel spreadsheet)
  2. Send some questions to an LLM API for each row (article or document)
  3. Process the answer/s for each row (like a temporary Excel spreadsheet)
  4. Import each row into a new database table (like a permanent Excel spreadsheet)

A) Get the data

First, ask an LLM to give you six lines of code to get the data out of the table from Stage 1, loading it into a pandas DataFrame. This is an easy way for beginners to manipulate data: a pandas DataFrame is basically an Excel spreadsheet you can't visually interact with. A good way for a beginner to visualise this whole process is as automating work on a series of Excel spreadsheets: the database tables are permanent spreadsheets, and the pandas DataFrames are short-term ones where you make a few changes before saving back to a permanent spreadsheet.
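
Those six lines come out roughly like this, a sketch assuming the Stage 1 table is called raw_articles and has a processed flag column (both placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/osint")  # placeholder

# Load unprocessed articles into a DataFrame: an invisible spreadsheet
df = pd.read_sql("SELECT * FROM raw_articles WHERE processed IS NOT TRUE", engine)
print(df.head())
```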

B) Send the questions to an API

LLM vendor selection

Now, if your organisation already has a relationship with an LLM vendor, just use that API. It will be simpler.

If not, a cheap option is OpenRouter. Don't put anything sensitive into it, as your queries could be running on a server in someone's basement unless you specifically select for data centres. Going with an established company like Deep Infra is a safer option; we use Deep Infra as one fallback in case Exo-Sig's GPU servers fail and AWS GPUs are unavailable. This is not an endorsement of Deep Infra's security practices or employee vetting: do your own research if that is important to you. The most expensive option is a provider like OpenAI, but for a proof of concept it isn't a significant cost. A provider like OpenAI will have an increased security and compliance focus due to its government and Fortune 500 contracts and its budget. Again, this is not an endorsement; please do your own research.

Write the questions

Different LLMs have different best practices, but here is an overall guide to what to include, paragraph by paragraph:

  • Paragraph 1: A summary of the task and of what the LLM is, i.e. you may tell the LLM it is a risk intelligence analyst.
  • Paragraph 2: The list of questions to answer; make sure to include the options where needed.
  • Paragraph 3: The format you want the answer in; JSON is a good idea.
  • Paragraph 4: 5-20 examples of input (i.e. a news article) and the answer you would expect.
  • Paragraph 5: The news article or document you want the LLM to analyse.
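
Put together, a skeleton following that five-paragraph structure might look like the sketch below; everything in angle brackets is a placeholder for you to fill in:

```python
# Skeleton prompt; everything in angle brackets is a placeholder.
PROMPT_TEMPLATE = """You are a <risk intelligence analyst>. Your task is to <one-sentence summary>.

Questions:
1. <question one> Options: <option a>, <option b>, <option c>, <option d>.
2. <question two>

Respond only with JSON in exactly this format:
{"q1": "<option>", "q2": "<answer>"}

Examples (aim for 5-20):
<example article>
<expected JSON answer>

Article:
"""
```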

Sending the question

Tell an LLM which provider you are using and ask it for some Python code that makes a series of API queries over the data in your pandas DataFrame from step A. Tell the LLM you want to load the responses into a pandas DataFrame (invisible Excel) and join it to your original DataFrame from step A.
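
A sketch of that loop, wrapping the provider call from the Stage 2 sketch in a hypothetical ask_llm() helper and reusing the PROMPT_TEMPLATE and df from above:

```python
import json
import pandas as pd

def ask_llm(prompt: str) -> str:
    """Hypothetical helper wrapping the chat-completion call from the
    Stage 2 sketch; returns the model's raw JSON answer as a string."""
    raise NotImplementedError

answers = []
for _, row in df.iterrows():  # df is the DataFrame from step A
    raw = ask_llm(PROMPT_TEMPLATE + row["body"])
    answers.append(json.loads(raw))  # validate properly in step C

# One column per question, joined back onto the original rows
df = df.join(pd.DataFrame(answers, index=df.index))
```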

C) Process the answers

Before importing, you probably want to check the answers and their format. For example, if one of the questions was multiple-choice with four options, check that the model picked one of the four (the Stage 5 normalisation sketch shows this kind of check).

D) Import the data

Super easy: ask the LLM for six lines of code to import your pandas DataFrame into an existing or new table. I prefer importing into new tables (spreadsheets), as it makes it easier to trace errors across the stages. To avoid processing the same data again and again, add a flag in a new column on the original table to indicate that this stage has been processed for each article.
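
A sketch of the import plus the processed flag, reusing the engine and df from the earlier sketches; the table and column names remain placeholders:

```python
from sqlalchemy import text

# Save this stage's output to its own table (a new permanent spreadsheet)
df.to_sql("stage2_results", engine, if_exists="append", index=False)

# Flag the source rows as processed so the next run skips them
with engine.begin() as conn:
    conn.execute(
        text("UPDATE raw_articles SET processed = TRUE WHERE id = ANY(:ids)"),
        {"ids": list(df["id"])},
    )
```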

5.3 Proof of concept guide for stage nine: Disseminating the data

Before moving on to dissemination, don't forget Principle Three: a robust testing methodology. If you have got this far, you don't want to discredit yourself in the eyes of your target audience by being unable to give quantifiable answers on quality. For example: "the proof of concept has a 15% false positive rate and an 8% false negative rate for topic A."

For your proof of concept I recommend using a frontend app instead of email. After Stage 8 you should have your finished data in your final table. All you need to do is ask an LLM to help you visualise that data in a frontend application. You probably want to add a few graphs/infographics too. I won't recommend specific applications to use, but if you tell the LLM that it should be quick and easy to use, you should be able to get a working proof-of-concept dashboard within a couple of days. You will probably be surprised by how little code is needed to achieve this.

6. Summary


Hopefully you found this interesting or useful, feel free to reach out if you have any questions.