<p><em>Simon Smith — <a href="https://www.simonsmith.ca/feed.xml">simonsmith.ca feed</a> (generated by Jekyll, 2023-09-06). The website of Simon Smith of Toronto, Ontario, Canada.</em></p>

<h1><a href="https://www.simonsmith.ca/2023/09/06/pylitsense">PyLitSense: An easy way to try biomedical sentence embeddings</a> (2023-09-06)</h1>

<p><a href="https://arxiv.org/abs/2005.11401">Retrieval augmented generation</a> can ground large language models to improve their response accuracy, recency, and referenceability. This can be particularly important in biomedical research, as you want up-to-date, non-hallucinated, referenced information.</p>
<p>For example, ask ChatGPT something like “Does metformin reduce COVID severity?” Many of the articles on this topic were published after its knowledge cutoff, so to perform best, it needs to search the literature and use the results to inform its response. And since we don’t only want keyword-based results (for example, we want matches that say metformin “lessens” or “minimizes” severity, not just those that use the word “reduce”), we need <a href="https://en.wikipedia.org/wiki/Sentence_embedding">sentence embeddings</a>.</p>
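<p>The difference between keyword and semantic matching comes down to comparing vectors: sentences with similar meanings get embeddings that point in similar directions. As a rough illustration (using made-up three-dimensional vectors, not real model output, which has hundreds of dimensions), cosine similarity scores nearby meanings higher:</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy "embeddings" (illustrative values only).
reduces = [0.9, 0.1, 0.2]
lessens = [0.85, 0.15, 0.25]   # semantically close to "reduces"
increases = [-0.8, 0.3, 0.1]   # semantically distant

print(cosine_similarity(reduces, lessens) > cosine_similarity(reduces, increases))  # True
```

<p>A keyword search would miss the “lessens” sentence entirely; an embedding search ranks it near the top.</p>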
<p>Unfortunately, creating these embeddings on a large number of sentences can be expensive and time-consuming. And there are <em>billions</em> of sentences in biomedical papers. Fortunately, the US National Center for Biotechnology Information created <a href="https://www.ncbi.nlm.nih.gov/research/litsense/">LitSense</a> to help. It allows you to query against hundreds of millions of sentences from PubMed abstracts, and some full-text articles.</p>
<p>I think this is an underutilized resource. So, to help people explore its potential, I’ve created the <a href="https://pypi.org/project/pylitsense/">pylitsense</a> Python package as a wrapper around the LitSense API. Here’s how to use it:</p>
<h2 id="install">Install</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>pylitsense
</code></pre></div></div>
<h2 id="use">Use</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pylitsense.pylitsense</span> <span class="kn">import</span> <span class="n">PyLitSense</span>
<span class="c1"># Initialize
</span><span class="n">pls</span> <span class="o">=</span> <span class="n">PyLitSense</span><span class="p">()</span>
<span class="c1"># Query
</span><span class="n">results</span> <span class="o">=</span> <span class="n">pls</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"your query here"</span><span class="p">)</span>
<span class="c1"># Print results
</span><span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">text</span><span class="p">,</span> <span class="n">result</span><span class="p">.</span><span class="n">score</span><span class="p">)</span>
</code></pre></div></div>
<p>Try it out, and add any issues or feature requests to the <a href="https://github.com/simonmesmith/pylitsense">GitHub repo</a>.</p>

<h1><a href="https://www.simonsmith.ca/2023/08/07/agentflow">Introducing Agentflow: Execute complex LLM workflows with simple JSON</a> (2023-08-07)</h1>

<p>Large language models (LLMs) are powerful tools, but implementing complex workflows with them can be a challenge.</p>
<p>Yes, tools like <a href="https://github.com/Significant-Gravitas/Auto-GPT">Auto-GPT</a> and <a href="https://github.com/yoheinakajima/babyagi">BabyAGI</a> allow LLMs to execute multiple steps, but <em>autonomously</em>—the LLMs plan and then execute tasks themselves. Because of this, in my experience with Auto-GPT, things can quickly get out of control.</p>
<p>What I want is to have LLMs execute multiple steps, but under my control, following a predefined path. So I scratched my own itch and built <a href="https://github.com/simonmesmith/agentflow">Agentflow</a>, an open source solution that lets you execute complex workflows with simple JSON.</p>
<p>With Agentflow, you can:</p>
<h2 id="1-write-workflows-in-plain-english">1. Write workflows in plain English</h2>
<p>Just add tasks in a JSON file like this:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"system_message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Optional guiding message"</span><span class="p">,</span><span class="w">
</span><span class="nl">"tasks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Step one."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Step two."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"..."</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
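<p>To make the execution model concrete, a runner for a task list like this can be quite small. The sketch below is an assumption for illustration, not Agentflow’s actual implementation, and <code>fake_llm</code> is a hypothetical stand-in for a real LLM call: each task’s action goes to the model along with the system message, and outputs accumulate in order.</p>

```python
import json

def fake_llm(system_message, prompt):
    """Stand-in for a real LLM call (hypothetical, for this sketch only)."""
    return f"[response to: {prompt}]"

def run_workflow(workflow_json):
    """Execute tasks in order, collecting each task's output."""
    workflow = json.loads(workflow_json)
    system_message = workflow.get("system_message", "")
    outputs = []
    for task in workflow["tasks"]:
        outputs.append(fake_llm(system_message, task["action"]))
    return outputs

workflow = '{"system_message": "Be helpful.", "tasks": [{"action": "Step one."}, {"action": "Step two."}]}'
print(run_workflow(workflow))  # ['[response to: Step one.]', '[response to: Step two.]']
```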
<h2 id="2-add-variables-for-dynamic-outputs">2. Add variables for dynamic outputs</h2>
<p>You can include variables in {curly braces} that you populate when running a workflow. For example, <code class="language-plaintext highlighter-rouge">target_market</code> is a variable here:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"system_message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are an innovative entrepreneur."</span><span class="p">,</span><span class="w">
</span><span class="nl">"tasks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Generate 10 product ideas for {target_market}"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"..."</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
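<p>Substitution like this maps directly onto Python’s built-in <code>str.format</code>. This is a sketch of the general technique, not necessarily how Agentflow implements it; the variable value is hypothetical:</p>

```python
action = "Generate 10 product ideas for {target_market}"
variables = {"target_market": "remote workers"}  # hypothetical runtime values

filled = action.format(**variables)
print(filled)  # Generate 10 product ideas for remote workers
```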
<h2 id="3-create-and-use-custom-functions">3. Create and use custom functions</h2>
<p>Custom functions expand LLMs’ capabilities beyond text generation. Easily define new functions by inheriting from the <code class="language-plaintext highlighter-rouge">BaseFunction</code> class. Specify functions to run using <code class="language-plaintext highlighter-rouge">function_call</code> as shown here:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"system_message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are a creative artist."</span><span class="p">,</span><span class="w">
</span><span class="nl">"tasks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Brainstorm 10 painting ideas for {painting_subject}."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Choose the best idea."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Write a prompt for an AI art generator to produce an image of the painting."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Generate the painting image using the prompt."</span><span class="p">,</span><span class="w">
</span><span class="nl">"settings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"function_call"</span><span class="p">:</span><span class="w"> </span><span class="s2">"create_image"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"..."</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
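<p>The pattern behind <code>BaseFunction</code> and <code>function_call</code> can be sketched as a small class hierarchy plus a name-based registry. This is a hedged illustration of the general pattern only; the method names below are assumptions, so check the repo for the real interface:</p>

```python
from abc import ABC, abstractmethod

class BaseFunction(ABC):
    """Minimal sketch of a pluggable function interface."""
    name = ""

    @abstractmethod
    def execute(self, prompt):
        ...

class CreateImage(BaseFunction):
    name = "create_image"

    def execute(self, prompt):
        # A real implementation would call an image-generation API here.
        return f"image generated from: {prompt}"

# Dispatch by the name given in a task's "function_call" setting.
REGISTRY = {cls.name: cls() for cls in (CreateImage,)}
print(REGISTRY["create_image"].execute("a sunset over a lake"))
```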
<h2 id="4-run-workflows-with-a-simple-command">4. Run workflows with a simple command</h2>
<p>To run a workflow, just use the command line like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> run <span class="nt">--flow</span><span class="o">=</span>workflow_name
</code></pre></div></div>
<p>Or, for workflows with variables, like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> run <span class="nt">--flow</span><span class="o">=</span>workflow_with_variables_name <span class="nt">--variables</span> <span class="s1">'variable_1_name=value1'</span> <span class="s1">'variable_2_name=value2'</span>
</code></pre></div></div>
<p>Agentflow executes the specified workflow and provides a link to a folder with all outputs, including a JSON file containing all of the LLM’s responses.</p>
<h2 id="get-started-with-agentflow">Get started with Agentflow!</h2>
<p>Check out the <a href="https://github.com/simonmesmith/agentflow">installation instructions</a>, explore <a href="https://github.com/simonmesmith/agentflow/issues">ideas and open issues</a>, and feel free to contribute to expanding Agentflow’s capabilities.</p>

<h1><a href="https://www.simonsmith.ca/2023/07/26/use-openai-streaming-with-functions">Use OpenAI API streaming with functions</a> (2023-07-26)</h1>

<p>The <a href="https://platform.openai.com/docs/api-reference">OpenAI API</a> offers several features to facilitate using powerful language models like GPT-4 and GPT-3.5.</p>
<p>Two very useful features are streaming and <a href="https://openai.com/blog/function-calling-and-other-api-updates">function calling</a>. With streaming, you give users results from the API as they’re generated, which is a better user experience because users don’t have to wait for an entire response at once. With function calling, you expand GPTs’ capabilities with functions that you define.</p>
<p>But in building with the OpenAI API, I’ve found it challenging to combine streaming with function calling. The main reason is that GPTs stream the function calls as well as the content! Even worse, they stream function calls in pieces. So to combine streaming with function calling, you need to monitor what the models stream, output content if it’s content, and build and execute function calls iteratively when it’s function calls.</p>
<p>I created <a href="https://gist.github.com/simonmesmith/bbeb894fc4ae954b246125eb2902800b">this gist</a> to do just that, and will walk through it at a high level here:</p>
<h1 id="1-install-and-configure-the-openai-library">1. Install and configure the OpenAI library</h1>
<p>First, perhaps obviously, you’ll need to install the OpenAI library, then configure it with your API key.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">pip install openai
</span></code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">openai</span>
<span class="n">openai</span><span class="p">.</span><span class="n">api_key</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">"OPENAI_API_KEY"</span><span class="p">]</span>
</code></pre></div></div>
<p>(Here, we set the API key from environment variables for security.)</p>
<h1 id="2-define-functions">2. Define functions</h1>
<p>To tell GPTs about available functions, you must define them in a way that conforms with <a href="https://json-schema.org/">JSON Schema</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">FUNCTIONS</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"count_string"</span><span class="p">:</span> <span class="p">{</span>
<span class="s">"name"</span><span class="p">:</span> <span class="s">"count_string"</span><span class="p">,</span>
<span class="s">"description"</span><span class="p">:</span> <span class="s">"Counts the number of characters in a string."</span><span class="p">,</span>
<span class="s">"parameters"</span><span class="p">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="p">:</span> <span class="s">"object"</span><span class="p">,</span>
<span class="s">"properties"</span><span class="p">:</span> <span class="p">{</span>
<span class="s">"string_to_count"</span><span class="p">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="p">:</span> <span class="s">"string"</span><span class="p">,</span>
<span class="s">"description"</span><span class="p">:</span> <span class="s">"The string whose characters you want to count."</span><span class="p">,</span>
<span class="p">},</span>
<span class="p">},</span>
<span class="s">"required"</span><span class="p">:</span> <span class="p">[</span><span class="s">"string_to_count"</span><span class="p">],</span>
<span class="p">},</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">FUNCTIONS_FOR_API</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">FUNCTIONS</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>
</code></pre></div></div>
<p>In this snippet, I define a function called <code class="language-plaintext highlighter-rouge">count_string</code> that counts the number of characters in a string.</p>
<p>Note that I’ve put functions into a dictionary to make it easier to work with in <code class="language-plaintext highlighter-rouge">call_function</code> below, but also into a list, which the OpenAI API needs.</p>
<h1 id="3-create-functions">3. Create functions</h1>
<p>Having defined your functions, you now need to implement them. In this example, I implement the simple character-counting function defined above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">count_string</span><span class="p">(</span><span class="n">string_to_count</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="s">"""Counts the number of characters in a string."""</span>
<span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">string_to_count</span><span class="p">))</span>
</code></pre></div></div>
<h1 id="4-handle-called-functions">4. Handle called functions</h1>
<p>Next, you need some way to call functions GPTs want to execute. Here, I create a <code class="language-plaintext highlighter-rouge">call_function</code> utility. This function verifies whether the requested function is defined, validates its arguments, and calls the function, returning its result.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">call_function</span><span class="p">(</span><span class="n">function_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">function_arguments</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="s">"""Calls a function and returns the result."""</span>
<span class="p">...</span>
</code></pre></div></div>
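<p>Based on that description, the elided body might look roughly like the sketch below. This is an assumption, not the gist’s exact code; <code>IMPLEMENTATIONS</code> is a helper I’ve introduced here to map function names to callables:</p>

```python
import json

# Trimmed-down versions of the schema registry and function from earlier steps.
FUNCTIONS = {"count_string": {"parameters": {"required": ["string_to_count"]}}}

def count_string(string_to_count):
    return str(len(string_to_count))

IMPLEMENTATIONS = {"count_string": count_string}

def call_function(function_name, function_arguments):
    """Validates a requested function call and returns its result as a string."""
    if function_name not in FUNCTIONS:
        return f"Error: function {function_name} is not defined."
    arguments = json.loads(function_arguments)
    required = FUNCTIONS[function_name]["parameters"]["required"]
    missing = [name for name in required if name not in arguments]
    if missing:
        return f"Error: missing required arguments {missing}."
    return IMPLEMENTATIONS[function_name](**arguments)

print(call_function("count_string", '{"string_to_count": "hello"}'))  # 5
```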
<h1 id="5-manage-openai-responses">5. Manage OpenAI responses</h1>
<p>To handle the responses from OpenAI, we define a function that checks for text or function call responses, and executes function calls as needed. Comments in the code go into greater detail about how this works.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_response</span><span class="p">(</span><span class="n">messages</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]])</span> <span class="o">-></span> <span class="n">Generator</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">]:</span>
<span class="s">"""Gets the response from OpenAI, updates the messages array, yields
content, and calls functions as needed."""</span>
<span class="p">...</span>
</code></pre></div></div>
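<p>The core trick in <code>get_response</code> is telling content apart from function-call fragments and accumulating the latter until the stream says the call is complete. The sketch below uses simplified chunk dictionaries to show the logic; the real API yields response objects with a slightly different shape (e.g. <code>chunk.choices[0].delta</code>), and the gist differs in detail:</p>

```python
def handle_stream(chunks):
    """Yield content as it arrives; assemble function-call pieces first."""
    function_name, function_arguments = "", ""
    for chunk in chunks:
        delta = chunk["delta"]
        if "content" in delta:
            yield delta["content"]  # content can be shown immediately
        if "function_call" in delta:
            call = delta["function_call"]
            function_name += call.get("name", "")
            function_arguments += call.get("arguments", "")
        if chunk.get("finish_reason") == "function_call":
            # Only now is the call complete enough to execute.
            yield f"[call {function_name}({function_arguments})]"

# Simulated stream: the call's arguments arrive split across chunks.
chunks = [
    {"delta": {"function_call": {"name": "count_string"}}, "finish_reason": None},
    {"delta": {"function_call": {"arguments": '{"string_to'}}, "finish_reason": None},
    {"delta": {"function_call": {"arguments": '_count": "hi"}'}}, "finish_reason": None},
    {"delta": {}, "finish_reason": "function_call"},
]
print("".join(handle_stream(chunks)))
```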
<h1 id="6-bring-it-all-together">6. Bring it all together</h1>
<p>Finally, we use the <code class="language-plaintext highlighter-rouge">get_response</code> function to enable a conversation with streaming output and function calls.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="p">...</span>
</code></pre></div></div>
<p>In this main loop, we take user input, add it to the messages array, and stream the response. If the response involves a function call, we handle it accordingly.</p>
<p>It’s possible there’s an easier way to combine streaming and function calls. It feels like there should be. But if there is, I haven’t found it. Check out <a href="https://gist.github.com/simonmesmith/bbeb894fc4ae954b246125eb2902800b">the code</a> and let me know if you have any suggestions.</p>

<h1><a href="https://www.simonsmith.ca/2022/09/09/fine-tune-t5-with-hugging-face">Fine-tune T5 with Hugging Face (as of September 9, 2022)</a></h1>

<p><a href="https://huggingface.co/">Hugging Face</a> is a great resource for streamlining the use of machine learning in applications. It can be challenging, however, to know what documentation and examples are the most up-to-date.</p>
<p>Take the case of fine-tuning a <a href="https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html">T5</a> model. If you search online for “fine tune a T5 model with Hugging Face” you’ll get thousands of results. Many of these are outdated, referring to older versions of the Hugging Face API, which has rapidly evolved. But it’s hard to know which results are outdated.</p>
<p>If you’re in the same boat, you can hopefully <a href="https://gist.github.com/simonmesmith/0334cef17d06d23ca5fa50c78a956d57">save some trouble with this Gist</a>. This should be accurate as of September 9, 2022. Alternatively, here are the steps I used:</p>
<h1 id="1-install-dependencies">1. Install dependencies</h1>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">pip install datasets pandas transformers
</span></code></pre></div></div>
<h1 id="2-import-libraries-and-modules">2. Import libraries and modules</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">Dataset</span><span class="p">,</span> <span class="n">DatasetDict</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">T5Tokenizer</span><span class="p">,</span> <span class="n">T5ForConditionalGeneration</span><span class="p">,</span> <span class="n">DataCollatorForSeq2Seq</span><span class="p">,</span> <span class="n">Seq2SeqTrainingArguments</span><span class="p">,</span> <span class="n">Seq2SeqTrainer</span>
</code></pre></div></div>
<h1 id="3-set-model-tokenizer-and-data_collator-variables">3. Set model, tokenizer, and data_collator variables</h1>
<p>Note: You can use other versions of T5 too. <a href="https://huggingface.co/docs/transformers/model_doc/t5">See your options here</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">T5Tokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"t5-base"</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">T5ForConditionalGeneration</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"t5-base"</span><span class="p">)</span>
<span class="n">data_collator</span> <span class="o">=</span> <span class="n">DataCollatorForSeq2Seq</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">)</span>
</code></pre></div></div>
<h1 id="4-get-data-and-divide-into-train-eval-and-test-sets">4. Get data and divide into train, eval, and test sets</h1>
<p>Note: Replace the dataframe with your own, but make sure it has “source_text” and “target_text” columns or you’ll need to modify other code below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"source_text"</span><span class="p">:</span> <span class="p">[],</span> <span class="s">"target_text"</span><span class="p">:</span> <span class="p">[]})</span>
<span class="n">train_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="mf">0.8</span><span class="p">)</span>
<span class="n">eval_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">train_df</span><span class="p">.</span><span class="n">index</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="n">test_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">train_df</span><span class="p">.</span><span class="n">index</span><span class="p">).</span><span class="n">drop</span><span class="p">(</span><span class="n">eval_df</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
</code></pre></div></div>
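<p>The <code>sample</code>/<code>drop</code> calls above give a roughly 80/10/10 split: 80% for training, then half of the remaining 20% for eval, with the rest for test. The same logic in a stdlib-only sketch (plain lists instead of dataframes):</p>

```python
import random

random.seed(0)  # for reproducibility
rows = list(range(100))

train = random.sample(rows, int(0.8 * len(rows)))          # 80 rows
remaining = [r for r in rows if r not in train]            # 20 rows left
eval_rows = random.sample(remaining, len(remaining) // 2)  # 10 rows
test_rows = [r for r in remaining if r not in eval_rows]   # 10 rows

print(len(train), len(eval_rows), len(test_rows))  # 80 10 10
```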
<h1 id="5-create-a-dataset-dict-from-the-dataframes">5. Create a dataset dict from the dataframes</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset</span> <span class="o">=</span> <span class="n">DatasetDict</span><span class="p">({</span>
<span class="s">"train"</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">train_df</span><span class="p">),</span>
<span class="s">"eval"</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">eval_df</span><span class="p">),</span>
<span class="s">"test"</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">test_df</span><span class="p">),</span>
<span class="p">})</span>
</code></pre></div></div>
<h1 id="6-tokenize-the-dataset">6. Tokenize the dataset</h1>
<p>Note: Change the max_length to whatever makes the most sense for your data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">source_texts</span><span class="p">,</span> <span class="n">target_texts</span><span class="p">):</span>
<span class="n">model_inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">source_texts</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">as_target_tokenizer</span><span class="p">():</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">target_texts</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">model_inputs</span><span class="p">[</span><span class="s">"labels"</span><span class="p">]</span> <span class="o">=</span> <span class="n">labels</span><span class="p">[</span><span class="s">"input_ids"</span><span class="p">]</span>
<span class="k">return</span> <span class="n">model_inputs</span>
<span class="n">tokenized_dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">tokenize</span><span class="p">,</span> <span class="n">input_columns</span><span class="o">=</span><span class="p">[</span><span class="s">"source_text"</span><span class="p">,</span> <span class="s">"target_text"</span><span class="p">],</span> <span class="n">remove_columns</span><span class="o">=</span><span class="p">[</span><span class="s">"source_text"</span><span class="p">,</span> <span class="s">"target_text"</span><span class="p">])</span>
</code></pre></div></div>
<h1 id="7-set-training-arguments">7. Set training arguments</h1>
<p>Note: Change “output_directory” to where you want, and update other parameters as makes sense. <a href="https://huggingface.co/docs/transformers/v4.21.3/en/main_classes/trainer#transformers.TrainingArguments">Here’s the documentation for this</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">training_arguments</span> <span class="o">=</span> <span class="n">Seq2SeqTrainingArguments</span><span class="p">(</span>
<span class="s">"output_directory"</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.0001</span><span class="p">,</span>
<span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span>
<span class="n">fp16</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">per_device_eval_batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">gradient_accumulation_steps</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">num_train_epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
<span class="n">evaluation_strategy</span><span class="o">=</span><span class="s">"epoch"</span><span class="p">,</span>
<span class="n">report_to</span><span class="o">=</span><span class="s">"all"</span>
<span class="p">)</span>
</code></pre></div></div>
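<p>One detail worth noting: <code>per_device_train_batch_size=4</code> combined with <code>gradient_accumulation_steps=2</code> means gradients from two forward/backward passes are accumulated before each optimizer step, for an effective batch size of 8 per device (assuming a single GPU here):</p>

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_devices = 1  # assumption: a single GPU

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 8
```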
<h1 id="8-create-a-trainer">8. Create a trainer</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">trainer</span> <span class="o">=</span> <span class="n">Seq2SeqTrainer</span><span class="p">(</span>
<span class="n">model</span><span class="p">,</span>
<span class="n">training_arguments</span><span class="p">,</span>
<span class="n">train_dataset</span><span class="o">=</span><span class="n">tokenized_dataset</span><span class="p">[</span><span class="s">"train"</span><span class="p">],</span>
<span class="n">eval_dataset</span><span class="o">=</span><span class="n">tokenized_dataset</span><span class="p">[</span><span class="s">"eval"</span><span class="p">],</span>
<span class="n">data_collator</span><span class="o">=</span><span class="n">data_collator</span><span class="p">,</span>
<span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span>
<span class="p">)</span>
</code></pre></div></div>
<h1 id="9-train-the-model">9. Train the model</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">trainer</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
</code></pre></div></div>
<h1 id="10-save-the-tokenizer-and-model">10. Save the tokenizer and model</h1>
<p>Note: Update “output_directory” to wherever you want to save everything.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="s">"output_directory"</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="s">"output_directory"</span><span class="p">)</span>
</code></pre></div></div>
<p>And that’s it! Again, <a href="https://gist.github.com/simonmesmith/0334cef17d06d23ca5fa50c78a956d57">all the code is in this Gist</a>. And if you have any issues, you should probably check to make sure nothing has changed with Hugging Face’s API since I wrote this.</p>Simon SmithHugging Face is a great resource for streamlining the use of machine learning in applications. It can be challenging, however, to know what documentation and examples are the most up-to-date.Use offset to work with unindexed arrays in BigQuery2022-08-13T00:00:00+00:002022-08-13T00:00:00+00:00https://www.simonsmith.ca/2022/08/13/use-offset-to-work-with-unindexed-arrays-in-bigquery<p>Recently I faced a challenge of working with multilevel nested arrays in BigQuery. The table I was working with had a structure somewhat like this:</p>
<ul>
<li>id</li>
<li>level_one_struct_array
<ul>
<li>name</li>
<li>…</li>
<li>level_two_struct_array
<ul>
<li>name</li>
<li>…</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>I needed to unnest the arrays, change values within the level two array, and then reaggregate everything.</p>
<p>The trouble is, using <code class="language-plaintext highlighter-rouge">UNNEST</code> in BigQuery doesn’t preserve order. So if I unnested each array and then reaggregated them, I wouldn’t necessarily get things back in the right order. And in my use case, order mattered.</p>
<p>The solution: use <code class="language-plaintext highlighter-rouge">OFFSET</code> to add indexes to the arrays, somewhat as follows:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">level_one_flattened</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">id</span><span class="p">,</span>
<span class="n">level_one_offset</span><span class="p">,</span>
<span class="n">level_one_struct</span><span class="p">.</span><span class="o">*</span>
<span class="k">FROM</span> <span class="k">table_name</span><span class="p">,</span>
<span class="k">table_name</span><span class="p">.</span><span class="n">level_one_struct_array</span> <span class="k">AS</span> <span class="n">level_one_struct</span>
<span class="k">WITH</span> <span class="k">OFFSET</span> <span class="k">AS</span> <span class="n">level_one_offset</span>
<span class="p">),</span>
<span class="n">level_two_flattened</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">id</span><span class="p">,</span>
<span class="n">level_one_offset</span><span class="p">,</span>
<span class="n">level_two_offset</span><span class="p">,</span>
<span class="n">level_two_struct</span><span class="p">.</span><span class="o">*</span>
<span class="k">FROM</span> <span class="n">level_one_flattened</span><span class="p">,</span>
<span class="n">level_one_flattened</span><span class="p">.</span><span class="n">level_two_struct_array</span> <span class="k">AS</span> <span class="n">level_two_struct</span>
<span class="k">WITH</span> <span class="k">OFFSET</span> <span class="k">AS</span> <span class="n">level_two_offset</span>
<span class="p">),</span>
<span class="p">...</span>
</code></pre></div></div>
<p>With this done, I could work with <code class="language-plaintext highlighter-rouge">level_one_flattened</code> and <code class="language-plaintext highlighter-rouge">level_two_flattened</code>, then reaggregate everything at the end in the appropriate order using the offset-generated indexes.</p>
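<p>The reaggregation step might look roughly like this (a sketch with simplified column lists, not my exact query): <code class="language-plaintext highlighter-rouge">ARRAY_AGG</code> with <code class="language-plaintext highlighter-rouge">ORDER BY</code> on the offset column is what restores the original array order.</p>

```sql
-- Rebuild each level-two array in its original order (sketch; add any
-- other level-two fields to the STRUCT as needed).
SELECT
  id,
  level_one_offset,
  ARRAY_AGG(
    STRUCT(name)
    ORDER BY level_two_offset
  ) AS level_two_struct_array
FROM level_two_flattened
GROUP BY id, level_one_offset
```

<p>A second <code class="language-plaintext highlighter-rouge">ARRAY_AGG ... ORDER BY level_one_offset</code>, grouped by <code class="language-plaintext highlighter-rouge">id</code>, rebuilds the level-one array the same way.</p>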
<p>It’s not rocket science, and I’m sure people with much greater expertise in SQL than me are very familiar with this. But it wasn’t something I needed to use until recently, when it came in very handy.</p>
<p><a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unnest_operator">Read more about <code class="language-plaintext highlighter-rouge">UNNEST</code> and <code class="language-plaintext highlighter-rouge">OFFSET</code> in BigQuery’s docs here</a>.</p>Simon SmithRecently I faced a challenge of working with multilevel nested arrays in BigQuery. The table I was working with had a structure somewhat like this:Extract structured data from unstructured text using language models like GPT-32022-08-05T00:00:00+00:002022-08-05T00:00:00+00:00https://www.simonsmith.ca/2022/08/05/extract-structured-data-from-unstructured-text-using-language-models-like-gpt-3<p>Recently I faced a common challenge: extracting structured information from millions of unstructured text documents.</p>
<p>Neither regular expression extraction nor part-of-speech tagging would scale, because we had multiple categories of content and inconsistent phrasing within them. We would have had to write tailored regular expressions or part-of-speech extraction rules for every new paragraph topic, and account for a long tail of edge cases.</p>
<p>Having experimented with large language models like <a href="https://beta.openai.com">GPT-3</a>, I was curious as to whether we could simply train one to extract the information we wanted into a structured format like JSON. Then we could validate the JSON and load it directly into BigQuery.</p>
<p>I was thrilled to see how well this worked, and if you have access to GPT-3 you can immediately try it yourself. For example, first, enter this one-shot training example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>John bought a bag of peanuts for $8. He thought they were delicious.
{"person": "John", "product": "peanuts", "cost": "$8", "sentiment": "positive"}
</code></pre></div></div>
<p>Then enter some similar examples and see how well GPT-3 manages them.</p>
<p>Example 1:</p>
<blockquote>
<p>“When she arrived at the store, Sarah purchased a bottle of water. It cost $4.50. She was pissed that it was so expensive!”</p>
</blockquote>
<p>GPT-3’s response:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"person"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Sarah"</span><span class="p">,</span><span class="w"> </span><span class="nl">"product"</span><span class="p">:</span><span class="w"> </span><span class="s2">"water"</span><span class="p">,</span><span class="w"> </span><span class="nl">"cost"</span><span class="p">:</span><span class="w"> </span><span class="s2">"$4.50"</span><span class="p">,</span><span class="w"> </span><span class="nl">"sentiment"</span><span class="p">:</span><span class="w"> </span><span class="s2">"negative"</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Example 2:</p>
<blockquote>
<p>“After a long day at work, Frank went shopping for some new clothes. He bought a suit and tie. It cost $1,500. He didn’t mind, as he considered it a cost of doing business.”</p>
</blockquote>
<p>GPT-3’s response:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"person"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Frank"</span><span class="p">,</span><span class="w"> </span><span class="nl">"product"</span><span class="p">:</span><span class="w"> </span><span class="s2">"suit and tie"</span><span class="p">,</span><span class="w"> </span><span class="nl">"cost"</span><span class="p">:</span><span class="w"> </span><span class="s2">"$1,500"</span><span class="p">,</span><span class="w"> </span><span class="nl">"sentiment"</span><span class="p">:</span><span class="w"> </span><span class="s2">"neutral"</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>As you can see from these two examples, GPT-3 generalizes extremely well.</p>
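<p>In code, the loop of “build prompt, call the model, validate the JSON” might be wired up roughly as follows. This is a sketch: the commented-out <code class="language-plaintext highlighter-rouge">openai.Completion.create</code> call and its engine name reflect the API as it existed at the time, and the prompt-building and validation functions are the parts actually shown working.</p>

```python
import json

# One-shot training example, matching the one above.
ONE_SHOT = (
    "John bought a bag of peanuts for $8. He thought they were delicious.\n"
    '{"person": "John", "product": "peanuts", "cost": "$8", "sentiment": "positive"}\n'
)

def build_prompt(text: str) -> str:
    """Prepend the one-shot example so the model continues in kind."""
    return ONE_SHOT + text + "\n"

def parse_response(raw: str) -> dict:
    """Validate the model's output before loading it into BigQuery."""
    record = json.loads(raw)
    missing = {"person", "product", "cost", "sentiment"} - record.keys()
    if missing:
        raise ValueError(f"Missing keys: {missing}")
    return record

# The completion call itself (not run here: it needs an API key, and the
# engine name is a placeholder):
# completion = openai.Completion.create(
#     engine="text-davinci-002", prompt=build_prompt(text), max_tokens=100)

print(parse_response(
    '{"person": "Sarah", "product": "water", "cost": "$4.50", "sentiment": "negative"}'
))
```

<p>Validating before loading means malformed generations fail fast instead of polluting the table.</p>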
<p>There’s more work to do to scale this up. But so far it’s quite exciting to find a new way to solve such a common challenge.</p>Simon SmithRecently I faced a common challenge: extracting structured information from millions of unstructured text documents.Get colors from images as swatches2022-07-28T00:00:00+00:002022-07-28T00:00:00+00:00https://www.simonsmith.ca/2022/07/28/get-colors-from-images-as-swatches<p>A few weeks ago I was playing with scientific figures and wondering how I might extract insights from them. One idea I had was to find all the colors in scientific images and rank them.</p>
<p>Given that different cells, cell parts, and tissues often have different colors—especially when stained to do so—that could be a productive path. For example, if cancer cells are stained a different color in an image than healthy cells, the higher the percentage of the cancer color, the worse it is.</p>
<p>Turns out this isn’t an easy problem to solve. Why? Because while we see a few colors in an image, there are actually many variations of those colors which are imperceptible to us.</p>
<p>The solution is to cluster an image’s colors. For example, group all the reddish colors together, then all the bluish ones, and so on. And then you can determine the relative amounts of each color.</p>
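<p>To make the idea concrete, here’s a minimal sketch (not the Colorgram code itself) that groups near-identical colors by coarse quantization rather than proper clustering, then computes each group’s share of the image:</p>

```python
from collections import Counter

def color_proportions(pixels, bucket=64):
    """Group near-identical RGB colors by coarse quantization and return
    each group's share of the image. A simplified stand-in for clustering."""
    def quantize(rgb):
        # Snap each channel to the bottom of its bucket, so e.g.
        # (250, 10, 10) and (240, 5, 20) both become (192, 0, 0).
        return tuple((channel // bucket) * bucket for channel in rgb)
    counts = Counter(quantize(p) for p in pixels)
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}

# Two reddish and two bluish pixels collapse into two groups of 50% each.
print(color_proportions([(250, 10, 10), (240, 5, 20), (10, 10, 250), (15, 5, 245)]))
```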
<p>I wrote some code to do this, which I’ve called “<a href="https://gist.github.com/simonmesmith/a1c3fdef3d8e9a03cd170cc3e7a5a596">Colorgram</a>” and uploaded as a Gist. Here’s an example of it working on <a href="https://labs.openai.com/e/lsqFtxWjDEoJ7knsg5H24L0L/1hvQqqxamd1HlBcDECc2pSQz">a Dall-e image I generated</a>:</p>
<h2 id="input-image">Input image</h2>
<p><img src="/assets/images/colorful-modern-living-room.png" alt="Colorful modern living room" /></p>
<h2 id="colorgrammed-image">Colorgrammed image</h2>
<p><img src="/assets/images/colorful-modern-living-room-colorgrammed.png" alt="Colorful modern living room colorgrammed" /></p>
<p>PS: I owe a debt to various places where I learned how to do this and copied some code snippets. I don’t remember them all but now wish I did. If you recognize this as something you’ve worked on, and want credit, please let me know as I’m happy to give it!</p>Simon SmithA few weeks ago I was playing with scientific figures and wondering how I might extract insights from them. One idea I had was to find all the colors in scientific images and rank them.Extract table data from images with pure Pytesseract2022-07-21T00:00:00+00:002022-07-21T00:00:00+00:00https://www.simonsmith.ca/2022/07/21/extract-table-data-from-images-with-pure-pytesseract<p>When extracting data from documents, one common challenge is processing text in images. This can be particularly difficult when the text is in tables. You don’t just want the text, but want it structured in relation to other text.</p>
<p>You can find solutions to this problem by Googling, but many seem brittle and overcomplicated. They may be brittle because they rely on image recognition libraries like OpenCV to find gridlines that might not always be present. And they may be overcomplicated: shouldn’t an OCR tool like Tesseract alone be able to return the information needed to reconstruct table data from the attributes it provides?</p>
<p>That was my hypothesis, anyway. Since Tesseract gives you information on x and y coordinates of text, and since tables follow a fairly standard format, I thought that we should be able to extract table text and structure using <em>only</em> Tesseract.</p>
<p>So I tested the idea.</p>
<h2 id="the-toy-problem-a-simple-table-in-an-image">The toy problem: A simple table in an image</h2>
<p>To test my hypothesis, I created a very simple toy problem using this table in an image:</p>
<p><img src="/assets/images/toy-table-image.png" alt="Toy table image" /></p>
<h2 id="the-code">The code</h2>
<p>Using this toy problem, here’s how I approached a solution, step by step.</p>
<h3 id="1-import-key-libraries">1. Import key libraries</h3>
<p>I was hoping to use the fewest libraries possible, in keeping with my goal of simplicity. But I ended up having to import several, shown below.</p>
<p>Importantly, note the use of <a href="https://pypi.org/project/pytesseract/">Pytesseract</a>. This is a Python wrapper around Tesseract, which you’ll need to install. See the instructions for doing so by following that link.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">cv2</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">pytesseract</span>
<span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">AgglomerativeClustering</span>
</code></pre></div></div>
<h3 id="2-create-a-function-to-preprocess-the-image">2. Create a function to preprocess the image</h3>
<p>Tesseract works better when you do even basic preprocessing on images. I wrote a function to handle that. Note that it explicitly strips out gridlines, which some other image table extractors I’ve seen need in order to determine rows and columns, as mentioned above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">preprocess</span><span class="p">(</span><span class="n">image_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">:</span>
<span class="c1"># Get the image.
</span> <span class="n">img</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">imread</span><span class="p">(</span><span class="n">image_path</span><span class="p">)</span>
<span class="c1"># Convert the image to grayscale.
</span> <span class="n">gray_img</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">cvtColor</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">COLOR_BGR2GRAY</span><span class="p">)</span>
<span class="c1"># Remove backgrounds.
</span> <span class="n">bg_free_img</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">threshold</span><span class="p">(</span><span class="n">gray_img</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">THRESH_OTSU</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
<span class="c1"># Create an inverse image to use for removing lines.
</span> <span class="n">inverted_img</span> <span class="o">=</span> <span class="o">~</span> <span class="n">bg_free_img</span>
<span class="c1"># Remove horizontal lines.
</span> <span class="c1"># TODO: Set line thickness dynamically.
</span> <span class="n">horizontal_kernel</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">getStructuringElement</span><span class="p">(</span><span class="n">cv2</span><span class="p">.</span><span class="n">MORPH_RECT</span><span class="p">,</span> <span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">remove_horizontal</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">morphologyEx</span><span class="p">(</span><span class="n">inverted_img</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">MORPH_OPEN</span><span class="p">,</span> <span class="n">horizontal_kernel</span><span class="p">,</span> <span class="n">iterations</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">cnts</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">findContours</span><span class="p">(</span><span class="n">remove_horizontal</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">RETR_EXTERNAL</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">CHAIN_APPROX_SIMPLE</span><span class="p">)</span>
<span class="n">cnts</span> <span class="o">=</span> <span class="n">cnts</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">cnts</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span> <span class="k">else</span> <span class="n">cnts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">cnts</span><span class="p">:</span> <span class="n">cv2</span><span class="p">.</span><span class="n">drawContours</span><span class="p">(</span><span class="n">bg_free_img</span><span class="p">,</span> <span class="p">[</span><span class="n">c</span><span class="p">],</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># Remove vertical lines.
</span> <span class="c1"># TODO: Set line thickness dynamically.
</span> <span class="n">vertical_kernel</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">getStructuringElement</span><span class="p">(</span><span class="n">cv2</span><span class="p">.</span><span class="n">MORPH_RECT</span><span class="p">,</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">40</span><span class="p">))</span>
<span class="n">remove_vertical</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">morphologyEx</span><span class="p">(</span><span class="n">inverted_img</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">MORPH_OPEN</span><span class="p">,</span> <span class="n">vertical_kernel</span><span class="p">,</span> <span class="n">iterations</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">cnts</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">findContours</span><span class="p">(</span><span class="n">remove_vertical</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">RETR_EXTERNAL</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">CHAIN_APPROX_SIMPLE</span><span class="p">)</span>
<span class="n">cnts</span> <span class="o">=</span> <span class="n">cnts</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">cnts</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span> <span class="k">else</span> <span class="n">cnts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">cnts</span><span class="p">:</span> <span class="n">cv2</span><span class="p">.</span><span class="n">drawContours</span><span class="p">(</span><span class="n">bg_free_img</span><span class="p">,</span> <span class="p">[</span><span class="n">c</span><span class="p">],</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># Return the output image.
</span> <span class="k">return</span> <span class="n">bg_free_img</span>
</code></pre></div></div>
<h3 id="3-create-a-function-to-group-text-by-inferred-row">3. Create a function to group text by inferred row</h3>
<p>Tesseract isn’t so helpful as to return information on a text’s row in a table; it doesn’t have a concept of tables or rows. It does, however, tell you each piece of text’s “top” value, which is effectively its y coordinate. Using this, we can cluster text into distinct rows.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_row_max_tops</span><span class="p">(</span><span class="n">img_df</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span> <span class="n">distance_threshold</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">:</span>
<span class="c1"># Create coordinates to use for clustering top values for rows. Note that
</span> <span class="c1"># we use (0, y), where y is "top." We specify 0 for x because we don't
</span> <span class="c1"># care here about the left value, only the top value.
</span> <span class="n">row_coordinates</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">0</span><span class="p">,</span> <span class="n">row</span><span class="p">[</span><span class="s">"top"</span><span class="p">])</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">img_df</span><span class="p">.</span><span class="n">iterrows</span><span class="p">()]</span>
<span class="c1"># Cluster rows by top values.
</span> <span class="n">row_clusters</span> <span class="o">=</span> <span class="n">AgglomerativeClustering</span><span class="p">(</span>
<span class="n">n_clusters</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
<span class="n">affinity</span><span class="o">=</span><span class="s">"manhattan"</span><span class="p">,</span>
<span class="n">linkage</span><span class="o">=</span><span class="s">"complete"</span><span class="p">,</span>
<span class="n">distance_threshold</span><span class="o">=</span><span class="n">distance_threshold</span><span class="p">)</span>
<span class="n">row_clusters</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">row_coordinates</span><span class="p">)</span>
<span class="c1"># Create max row tops values using row clusters and sort ascending.
</span> <span class="n">row_max_tops</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">row_index</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">row_clusters</span><span class="p">.</span><span class="n">labels_</span><span class="p">):</span>
<span class="n">row_coordinate_indexes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">row_clusters</span><span class="p">.</span><span class="n">labels_</span> <span class="o">==</span> <span class="n">row_index</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">row_max_top</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="n">row_coordinates</span><span class="p">[</span><span class="n">row_coordinate_index</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">row_coordinate_index</span> <span class="ow">in</span> <span class="n">row_coordinate_indexes</span><span class="p">])</span>
<span class="n">row_max_tops</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">row_max_top</span><span class="p">)</span>
<span class="n">row_max_tops</span><span class="p">.</span><span class="n">sort</span><span class="p">()</span>
<span class="c1"># Return the row index and max top for each row.
</span> <span class="k">return</span> <span class="p">[(</span><span class="n">i</span><span class="p">,</span> <span class="n">row_max_top</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">row_max_top</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">row_max_tops</span><span class="p">)]</span>
</code></pre></div></div>
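<p>To see the row-grouping idea in isolation, here’s a simplified sketch that uses a plain gap threshold on sorted “top” values rather than scikit-learn’s agglomerative clustering (the results can differ at the margins, but the intuition is the same):</p>

```python
def group_tops(tops, distance_threshold=25.0):
    """Group word "top" (y) values into rows: a new row starts whenever the
    gap from the previous value exceeds the threshold. A simplified stand-in
    for the agglomerative clustering used above."""
    rows = []
    for top in sorted(tops):
        if rows and top - rows[-1][-1] <= distance_threshold:
            rows[-1].append(top)  # close enough: same row
        else:
            rows.append([top])  # big vertical gap: new row
    return rows

# Words at y of 10-12 form one row; words at y of 80-82 form another.
print(group_tops([12, 10, 11, 80, 82]))  # → [[10, 11, 12], [80, 82]]
```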
<h3 id="4-extract-table-data-from-the-preprocessed-image-using-table-row-clusters">4. Extract table data from the preprocessed image using table row clusters</h3>
<p>With the functions above to preprocess an image and cluster text by row, we’re ready to rock. The last function we need does the following:</p>
<ol>
<li>Preprocess the image</li>
<li>Cluster text into rows</li>
<li>Use Tesseract’s “left” and “word_num” attributes to sort text into appropriate columns</li>
<li>Return everything as a dataframe</li>
</ol>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">read</span><span class="p">(</span><span class="n">image_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">distance_threshold</span><span class="o">=</span><span class="mf">25.0</span><span class="p">)</span> <span class="o">-></span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
<span class="c1"># Preprocess the image.
</span> <span class="n">img</span> <span class="o">=</span> <span class="n">preprocess</span><span class="p">(</span><span class="n">image_path</span><span class="p">)</span>
<span class="c1"># Read the image into a Pytesseract data frame.
</span> <span class="n">img_df</span> <span class="o">=</span> <span class="n">pytesseract</span><span class="p">.</span><span class="n">image_to_data</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">output_type</span><span class="o">=</span><span class="s">"data.frame"</span><span class="p">)</span>
<span class="c1"># Drop any blank text.
</span> <span class="n">img_df</span><span class="p">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Add row numbers to the dataframe. We do this by clustering rows according
</span> <span class="c1"># to their "top" value. We then determine the max "top" value for each row.
</span> <span class="c1"># Then we assign row numbers to the dataframe based on top values.
</span> <span class="n">row_max_tops</span> <span class="o">=</span> <span class="n">get_row_max_tops</span><span class="p">(</span><span class="n">img_df</span><span class="p">,</span> <span class="n">distance_threshold</span><span class="p">)</span>
<span class="n">img_df</span><span class="p">[</span><span class="s">"row_number"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">([],</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">object</span><span class="p">)</span>
<span class="k">for</span> <span class="n">row_number</span><span class="p">,</span> <span class="n">row_max_top</span> <span class="ow">in</span> <span class="n">row_max_tops</span><span class="p">:</span>
<span class="k">if</span> <span class="n">row_number</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span> <span class="n">lower_bound</span> <span class="o">=</span> <span class="n">row_max_tops</span><span class="p">[</span><span class="n">row_number</span> <span class="o">-</span> <span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span> <span class="c1"># E.g. if the prior row has a max top of 50, the lower bound for the next row is 51
</span> <span class="k">else</span><span class="p">:</span> <span class="n">lower_bound</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">upper_bound</span> <span class="o">=</span> <span class="n">row_max_top</span>
<span class="n">img_df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">img_df</span><span class="p">[</span><span class="s">"top"</span><span class="p">].</span><span class="n">between</span><span class="p">(</span><span class="n">lower_bound</span><span class="p">,</span> <span class="n">upper_bound</span><span class="p">),</span> <span class="s">"row_number"</span><span class="p">]</span> <span class="o">=</span> <span class="n">row_number</span>
<span class="c1"># Sort the dataframe by row number, left, and word_num so we can build table content logically.
</span> <span class="n">img_df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">"row_number"</span><span class="p">,</span> <span class="s">"left"</span><span class="p">,</span> <span class="s">"word_num"</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Build the table content.
</span> <span class="n">table_content</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">row_number</span> <span class="ow">in</span> <span class="n">img_df</span><span class="p">[</span><span class="s">"row_number"</span><span class="p">].</span><span class="n">unique</span><span class="p">():</span>
<span class="n">row_content</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cell_content</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">img_df</span><span class="p">[</span><span class="n">img_df</span><span class="p">[</span><span class="s">"row_number"</span><span class="p">]</span> <span class="o">==</span> <span class="n">row_number</span><span class="p">].</span><span class="n">iterrows</span><span class="p">():</span>
<span class="k">if</span> <span class="n">word</span><span class="p">[</span><span class="s">"word_num"</span><span class="p">]</span> <span class="o">==</span> <span class="mi">1</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">cell_content</span><span class="p">)</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="n">row_content</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">" "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">cell_content</span><span class="p">))</span>
<span class="n">cell_content</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cell_content</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">word</span><span class="p">[</span><span class="s">"text"</span><span class="p">])</span>
<span class="n">row_content</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">" "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">cell_content</span><span class="p">))</span>
<span class="n">table_content</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">row_content</span><span class="p">)</span>
<span class="c1"># Convert the table content to a dataframe, and return it.
</span> <span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">table_content</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="the-result-pretty-good">The result? Pretty good!</h2>
<p>To extract table data from an image as a Pandas dataframe, all you need to run now is this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">img</span> <span class="o">=</span> <span class="s">"/path-to-your-image.jpg"</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">read</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
</code></pre></div></div>
<p>Below is the output you’ll get for the toy table. As you can see, the approach works fairly well, though it puts the column headers one column too far to the left.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 Column 1 header Column 2 header Column 3 header None
1 Row 1 header Row 1 column 1 Row 1 column 2 Row 1 column 3
2 Row 2 header Row 2 column 1 Row 2 column 2 Row 2 column 3
3 Row 3 header Row 3 column 1 Row 3 column 2 Row 3 column 3
4 Row 4 header Row 4 column 1 Row 4 column 2 Row 4 column 3
5 Row 5 header Row 5 column 1 Row 5 column 2 Row 5 column 3
6 Row 6 header Row 6 column 1 Row 6 column 2 Row 6 column 3
</code></pre></div></div>
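<p>Since the misalignment seems to be a consistent one-column shift in the first row, a small post-processing step could correct it. Here’s a sketch, assuming the header row is always the first row and always off by exactly one column (the function name <code class="language-plaintext highlighter-rouge">shift_header_row</code> is my own, not part of the code above):</p>

```python
import pandas as pd


def shift_header_row(df: pd.DataFrame) -> pd.DataFrame:
    """Shift the first (header) row one column to the right.

    Assumes the OCR step dropped the empty top-left cell, pulling the
    column headers one position too far left, as in the toy output.
    """
    fixed = df.copy()
    # Series.shift(1) moves values right by one position and fills the
    # now-empty leftmost cell with NaN (matching the blank corner cell).
    fixed.iloc[0] = fixed.iloc[0].shift(1)
    return fixed
```

This keeps the fix out of the extraction logic itself, so it can be applied (or skipped) depending on whether a given table has a blank corner cell.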
<h2 id="beyond-the-toy-problem-how-does-it-do-in-the-real-world">Beyond the toy problem: How does it do in the real world?</h2>
<p>In the real world, of course, tables aren’t as neat and tidy as the toy problem. And indeed, when I try this approach on dirtier inputs, the results are nowhere near as clean. But as proofs of concept go, I think this works pretty well and has the potential for further improvement. Maybe you have some additional ideas?</p>
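<p>One improvement worth trying on noisier inputs: drop low-confidence OCR words before building rows, since stray marks and lines often come back as garbage words with low confidence scores. A sketch, assuming the word dataframe has a <code class="language-plaintext highlighter-rouge">conf</code> column like the one pytesseract’s <code class="language-plaintext highlighter-rouge">image_to_data</code> output provides (the threshold of 60 is an arbitrary starting point, not a recommendation):</p>

```python
import pandas as pd


def filter_low_confidence(img_df: pd.DataFrame, min_conf: float = 60.0) -> pd.DataFrame:
    """Drop OCR words whose confidence falls below a threshold.

    Assumes a "conf" column as in pytesseract's image_to_data output,
    where values may arrive as strings and -1 marks non-word rows.
    """
    conf = img_df["conf"].astype(float)
    return img_df[conf >= min_conf].reset_index(drop=True)
```

Running this before the row- and cell-building loop should remove much of the noise, at the cost of occasionally dropping genuine but blurry words, so the threshold likely needs tuning per document.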
<h2 id="download-all-the-code">Download all the code</h2>
<p>If you’d like to download this, try it yourself, and maybe further improve it, please <a href="https://gist.github.com/simonmesmith/73face2a11e226f1cae2481a3927edf7">download the code from this Gist</a>.</p>
<p>PS: Thanks to <a href="https://pyimagesearch.com/2022/02/28/multi-column-table-ocr/">PyImageSearch</a> and <a href="https://stackoverflow.com/questions/33949831/how-to-remove-all-lines-and-borders-in-an-image-while-keeping-text-programmatica">Stack Overflow</a> for some guidance as I worked through this problem.</p>Simon SmithWhen extracting data from documents, one common challenge is processing text in images. This can be particularly difficult when the text is in tables. You don’t just want the text, but want it structured in relation to other text.