<p><em>Simon Smith — <a href="https://www.simonsmith.ca/feed.xml">simonsmith.ca feed</a> (generated by Jekyll, 2023-09-06). The website of Simon Smith of Toronto, Ontario, Canada.</em></p>

<h1><a href="https://www.simonsmith.ca/2023/09/06/pylitsense">PyLitSense: An easy way to try biomedical sentence embeddings</a> (2023-09-06)</h1>

<p><a href="https://arxiv.org/abs/2005.11401">Retrieval augmented generation</a> can ground large language models to improve their response accuracy, recency, and referenceability. This can be particularly important in biomedical research, as you want up-to-date, non-hallucinated, referenced information.</p>
<p>For example, ask ChatGPT something like “Does metformin reduce COVID severity?” Many of the articles on this topic were published after its knowledge cutoff, so to perform best, it needs to search the literature and use the results to inform its response. And since we don’t only want keyword-based results (for example, we want matches that say metformin “lessens” or “minimizes” severity, not just those that use the word “reduce”), we need <a href="https://en.wikipedia.org/wiki/Sentence_embedding">sentence embeddings</a>.</p>
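<p>The difference between keyword and semantic matching comes down to comparing vectors: sentences with similar meanings get embeddings that point in similar directions. As a rough illustration (using made-up three-dimensional vectors, not real model output, which has hundreds of dimensions), cosine similarity scores nearby meanings higher:</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy "embeddings" (illustrative values only).
reduces = [0.9, 0.1, 0.2]
lessens = [0.85, 0.15, 0.25]   # semantically close to "reduces"
increases = [-0.8, 0.3, 0.1]   # semantically distant

print(cosine_similarity(reduces, lessens) > cosine_similarity(reduces, increases))  # True
```

<p>A keyword search would miss the “lessens” sentence entirely; an embedding search ranks it near the top.</p>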
<p>Unfortunately, creating these embeddings on a large number of sentences can be expensive and time-consuming. And there are <em>billions</em> of sentences in biomedical papers. Fortunately, the US National Center for Biotechnology Information created <a href="https://www.ncbi.nlm.nih.gov/research/litsense/">LitSense</a> to help. It allows you to query against hundreds of millions of sentences from PubMed abstracts, and some full-text articles.</p>
<p>I think this is an underutilized resource. So, to help people explore its potential, I’ve created the <a href="https://pypi.org/project/pylitsense/">pylitsense</a> Python package as a wrapper around the LitSense API. Here’s how to use it:</p>
<h2 id="install">Install</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>pylitsense
</code></pre></div></div>
<h2 id="use">Use</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pylitsense.pylitsense</span> <span class="kn">import</span> <span class="n">PyLitSense</span>
<span class="c1"># Initialize
</span><span class="n">pls</span> <span class="o">=</span> <span class="n">PyLitSense</span><span class="p">()</span>
<span class="c1"># Query
</span><span class="n">results</span> <span class="o">=</span> <span class="n">pls</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"your query here"</span><span class="p">)</span>
<span class="c1"># Print results
</span><span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">text</span><span class="p">,</span> <span class="n">result</span><span class="p">.</span><span class="n">score</span><span class="p">)</span>
</code></pre></div></div>
<p>Try it out, and add any issues or feature requests to the <a href="https://github.com/simonmesmith/pylitsense">GitHub repo</a>.</p>

<h1><a href="https://www.simonsmith.ca/2023/08/07/agentflow">Introducing Agentflow: Execute complex LLM workflows with simple JSON</a> (2023-08-07)</h1>

<p>Large language models (LLMs) are powerful tools, but implementing complex workflows with them can be a challenge.</p>
<p>Yes, tools like <a href="https://github.com/Significant-Gravitas/Auto-GPT">Auto-GPT</a> and <a href="https://github.com/yoheinakajima/babyagi">BabyAGI</a> allow LLMs to execute multiple steps, but <em>autonomously</em>—the LLMs plan and then execute tasks themselves. Because of this, in my experience with Auto-GPT, things can quickly get out of control.</p>
<p>What I want is to have LLMs execute multiple steps, but under my control, following a predefined path. So I scratched my own itch and built <a href="https://github.com/simonmesmith/agentflow">Agentflow</a>, an open source solution that lets you execute complex workflows with simple JSON.</p>
<p>With Agentflow, you can:</p>
<h2 id="1-write-workflows-in-plain-english">1. Write workflows in plain English</h2>
<p>Just add tasks in a JSON file like this:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"system_message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Optional guiding message"</span><span class="p">,</span><span class="w">
</span><span class="nl">"tasks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Step one."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Step two."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"..."</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
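<p>To make the execution model concrete, a runner for a task list like this can be quite small. The sketch below is an assumption for illustration, not Agentflow’s actual implementation, and <code>fake_llm</code> is a hypothetical stand-in for a real LLM call: each task’s action goes to the model along with the system message, and outputs accumulate in order.</p>

```python
import json

def fake_llm(system_message, prompt):
    """Stand-in for a real LLM call (hypothetical, for this sketch only)."""
    return f"[response to: {prompt}]"

def run_workflow(workflow_json):
    """Execute tasks in order, collecting each task's output."""
    workflow = json.loads(workflow_json)
    system_message = workflow.get("system_message", "")
    outputs = []
    for task in workflow["tasks"]:
        outputs.append(fake_llm(system_message, task["action"]))
    return outputs

workflow = '{"system_message": "Be helpful.", "tasks": [{"action": "Step one."}, {"action": "Step two."}]}'
print(run_workflow(workflow))  # ['[response to: Step one.]', '[response to: Step two.]']
```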
<h2 id="2-add-variables-for-dynamic-outputs">2. Add variables for dynamic outputs</h2>
<p>You can include variables in {curly braces} that you populate when running a workflow. For example, <code class="language-plaintext highlighter-rouge">target_market</code> is a variable here:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"system_message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are an innovative entrepreneur."</span><span class="p">,</span><span class="w">
</span><span class="nl">"tasks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Generate 10 product ideas for {target_market}"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"..."</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
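<p>Substitution like this maps directly onto Python’s built-in <code>str.format</code>. This is a sketch of the general technique, not necessarily how Agentflow implements it; the variable value is hypothetical:</p>

```python
action = "Generate 10 product ideas for {target_market}"
variables = {"target_market": "remote workers"}  # hypothetical runtime values

filled = action.format(**variables)
print(filled)  # Generate 10 product ideas for remote workers
```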
<h2 id="3-create-and-use-custom-functions">3. Create and use custom functions</h2>
<p>Custom functions expand LLMs’ capabilities beyond text generation. Easily define new functions by inheriting from the <code class="language-plaintext highlighter-rouge">BaseFunction</code> class. Specify functions to run using <code class="language-plaintext highlighter-rouge">function_call</code> as shown here:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"system_message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are a creative artist."</span><span class="p">,</span><span class="w">
</span><span class="nl">"tasks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Brainstorm 10 painting ideas for {painting_subject}."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Choose the best idea."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Write a prompt for an AI art generator to produce an image of the painting."</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Generate the painting image using the prompt."</span><span class="p">,</span><span class="w">
</span><span class="nl">"settings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"function_call"</span><span class="p">:</span><span class="w"> </span><span class="s2">"create_image"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"..."</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
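<p>The pattern behind <code>BaseFunction</code> and <code>function_call</code> can be sketched as a small class hierarchy plus a name-based registry. This is a hedged illustration of the general pattern only; the method names below are assumptions, so check the repo for the real interface:</p>

```python
from abc import ABC, abstractmethod

class BaseFunction(ABC):
    """Minimal sketch of a pluggable function interface."""
    name = ""

    @abstractmethod
    def execute(self, prompt):
        ...

class CreateImage(BaseFunction):
    name = "create_image"

    def execute(self, prompt):
        # A real implementation would call an image-generation API here.
        return f"image generated from: {prompt}"

# Dispatch by the name given in a task's "function_call" setting.
REGISTRY = {cls.name: cls() for cls in (CreateImage,)}
print(REGISTRY["create_image"].execute("a sunset over a lake"))
```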
<h2 id="4-run-workflows-with-a-simple-command">4. Run workflows with a simple command</h2>
<p>To run a workflow, just use the command line like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> run <span class="nt">--flow</span><span class="o">=</span>workflow_name
</code></pre></div></div>
<p>Or, for workflows with variables, like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> run <span class="nt">--flow</span><span class="o">=</span>workflow_with_variables_name <span class="nt">--variables</span> <span class="s1">'variable_1_name=value1'</span> <span class="s1">'variable_2_name=value2'</span>
</code></pre></div></div>
<p>Agentflow executes the specified workflow and provides a link to a folder with all outputs, including a JSON file containing all of the LLM’s responses.</p>
<h2 id="get-started-with-agentflow">Get started with Agentflow!</h2>
<p>Check out the <a href="https://github.com/simonmesmith/agentflow">installation instructions</a>, explore <a href="https://github.com/simonmesmith/agentflow/issues">ideas and open issues</a>, and feel free to contribute to expanding Agentflow’s capabilities.</p>

<h1><a href="https://www.simonsmith.ca/2023/07/26/use-openai-streaming-with-functions">Use OpenAI API streaming with functions</a> (2023-07-26)</h1>

<p>The <a href="https://platform.openai.com/docs/api-reference">OpenAI API</a> offers several features to facilitate using powerful language models like GPT-4 and GPT-3.5.</p>
<p>Two very useful features are streaming and <a href="https://openai.com/blog/function-calling-and-other-api-updates">function calling</a>. With streaming, you give users results from the API as they’re generated, which is a better user experience because users don’t have to wait for an entire response at once. With function calling, you expand GPTs’ capabilities with functions that you define.</p>
<p>But in building with the OpenAI API, I’ve found it challenging to combine streaming with function calling. The main reason is that GPTs stream the function calls as well as the content! Even worse, they stream function calls in pieces. So to combine streaming with function calling, you need to monitor what the models stream, output content if it’s content, and build and execute function calls iteratively when it’s function calls.</p>
<p>I created <a href="https://gist.github.com/simonmesmith/bbeb894fc4ae954b246125eb2902800b">this gist</a> to do just that, and will walk through it at a high level here:</p>
<h1 id="1-install-and-configure-the-openai-library">1. Install and configure the OpenAI library</h1>
<p>First, perhaps obviously, you’ll need to install the OpenAI library, then configure it with your API key.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">pip install openai
</span></code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">openai</span>
<span class="n">openai</span><span class="p">.</span><span class="n">api_key</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">"OPENAI_API_KEY"</span><span class="p">]</span>
</code></pre></div></div>
<p>(Here, we set the API key from environment variables for security.)</p>
<h1 id="2-define-functions">2. Define functions</h1>
<p>To tell GPTs about available functions, you must define them in a way that conforms with <a href="https://json-schema.org/">JSON Schema</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">FUNCTIONS</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"count_string"</span><span class="p">:</span> <span class="p">{</span>
<span class="s">"name"</span><span class="p">:</span> <span class="s">"count_string"</span><span class="p">,</span>
<span class="s">"description"</span><span class="p">:</span> <span class="s">"Counts the number of characters in a string."</span><span class="p">,</span>
<span class="s">"parameters"</span><span class="p">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="p">:</span> <span class="s">"object"</span><span class="p">,</span>
<span class="s">"properties"</span><span class="p">:</span> <span class="p">{</span>
<span class="s">"string_to_count"</span><span class="p">:</span> <span class="p">{</span>
<span class="s">"type"</span><span class="p">:</span> <span class="s">"string"</span><span class="p">,</span>
<span class="s">"description"</span><span class="p">:</span> <span class="s">"The string whose characters you want to count."</span><span class="p">,</span>
<span class="p">},</span>
<span class="p">},</span>
<span class="s">"required"</span><span class="p">:</span> <span class="p">[</span><span class="s">"string_to_count"</span><span class="p">],</span>
<span class="p">},</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">FUNCTIONS_FOR_API</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">FUNCTIONS</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>
</code></pre></div></div>
<p>In this snippet, I define a function called <code class="language-plaintext highlighter-rouge">count_string</code> that counts the number of characters in a string.</p>
<p>Note that I’ve put functions into a dictionary to make it easier to work with in <code class="language-plaintext highlighter-rouge">call_function</code> below, but also into a list, which the OpenAI API needs.</p>
<h1 id="3-create-functions">3. Create functions</h1>
<p>Having defined your functions, you now need to implement them. In this example, I implement the simple character-counting function defined above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">count_string</span><span class="p">(</span><span class="n">string_to_count</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="s">"""Counts the number of characters in a string."""</span>
<span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">string_to_count</span><span class="p">))</span>
</code></pre></div></div>
<h1 id="4-handle-called-functions">4. Handle called functions</h1>
<p>Next, you need some way to call functions GPTs want to execute. Here, I create a <code class="language-plaintext highlighter-rouge">call_function</code> utility. This function verifies whether the requested function is defined, validates its arguments, and calls the function, returning its result.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">call_function</span><span class="p">(</span><span class="n">function_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">function_arguments</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="s">"""Calls a function and returns the result."""</span>
<span class="p">...</span>
</code></pre></div></div>
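<p>Based on that description, the elided body might look roughly like the sketch below. This is an assumption, not the gist’s exact code; <code>IMPLEMENTATIONS</code> is a helper I’ve introduced here to map function names to callables:</p>

```python
import json

# Trimmed-down versions of the schema registry and function from earlier steps.
FUNCTIONS = {"count_string": {"parameters": {"required": ["string_to_count"]}}}

def count_string(string_to_count):
    return str(len(string_to_count))

IMPLEMENTATIONS = {"count_string": count_string}

def call_function(function_name, function_arguments):
    """Validates a requested function call and returns its result as a string."""
    if function_name not in FUNCTIONS:
        return f"Error: function {function_name} is not defined."
    arguments = json.loads(function_arguments)
    required = FUNCTIONS[function_name]["parameters"]["required"]
    missing = [name for name in required if name not in arguments]
    if missing:
        return f"Error: missing required arguments {missing}."
    return IMPLEMENTATIONS[function_name](**arguments)

print(call_function("count_string", '{"string_to_count": "hello"}'))  # 5
```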
<h1 id="5-manage-openai-responses">5. Manage OpenAI responses</h1>
<p>To handle the responses from OpenAI, we define a function that checks for text or function call responses, and executes function calls as needed. Comments in the code go into greater detail about how this works.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_response</span><span class="p">(</span><span class="n">messages</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]])</span> <span class="o">-></span> <span class="n">Generator</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">]:</span>
<span class="s">"""Gets the response from OpenAI, updates the messages array, yields
content, and calls functions as needed."""</span>
<span class="p">...</span>
</code></pre></div></div>
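<p>The core trick in <code>get_response</code> is telling content apart from function-call fragments and accumulating the latter until the stream says the call is complete. The sketch below uses simplified chunk dictionaries to show the logic; the real API yields response objects with a slightly different shape (e.g. <code>chunk.choices[0].delta</code>), and the gist differs in detail:</p>

```python
def handle_stream(chunks):
    """Yield content as it arrives; assemble function-call pieces first."""
    function_name, function_arguments = "", ""
    for chunk in chunks:
        delta = chunk["delta"]
        if "content" in delta:
            yield delta["content"]  # content can be shown immediately
        if "function_call" in delta:
            call = delta["function_call"]
            function_name += call.get("name", "")
            function_arguments += call.get("arguments", "")
        if chunk.get("finish_reason") == "function_call":
            # Only now is the call complete enough to execute.
            yield f"[call {function_name}({function_arguments})]"

# Simulated stream: the call's arguments arrive split across chunks.
chunks = [
    {"delta": {"function_call": {"name": "count_string"}}, "finish_reason": None},
    {"delta": {"function_call": {"arguments": '{"string_to'}}, "finish_reason": None},
    {"delta": {"function_call": {"arguments": '_count": "hi"}'}}, "finish_reason": None},
    {"delta": {}, "finish_reason": "function_call"},
]
print("".join(handle_stream(chunks)))
```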
<h1 id="6-bring-it-all-together">6. Bring it all together</h1>
<p>Finally, we use the <code class="language-plaintext highlighter-rouge">get_response</code> function to enable a conversation with streaming output and function calls.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="p">...</span>
</code></pre></div></div>
<p>In this main loop, we take user input, add it to the messages array, and stream the response. If the response involves a function call, we handle it accordingly.</p>
<p>It’s possible there’s an easier way to combine streaming and function calls. It feels like there should be. But if there is, I haven’t found it. Check out <a href="https://gist.github.com/simonmesmith/bbeb894fc4ae954b246125eb2902800b">the code</a> and let me know if you have any suggestions.</p>

<h1><a href="https://www.simonsmith.ca/2022/09/09/fine-tune-t5-with-hugging-face">Fine-tune T5 with Hugging Face (as of September 9, 2022)</a></h1>

<p><a href="https://huggingface.co/">Hugging Face</a> is a great resource for streamlining the use of machine learning in applications. It can be challenging, however, to know what documentation and examples are the most up-to-date.</p>
<p>Take the case of fine-tuning a <a href="https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html">T5</a> model. If you search online for “fine tune a T5 model with Hugging Face” you’ll get thousands of results. Many of these are outdated, referring to older versions of the Hugging Face API, which has rapidly evolved. But it’s hard to know which results are outdated.</p>
<p>If you’re in the same boat, you can hopefully <a href="https://gist.github.com/simonmesmith/0334cef17d06d23ca5fa50c78a956d57">save some trouble with this Gist</a>. This should be accurate as of September 9, 2022. Alternatively, here are the steps I used:</p>
<h1 id="1-install-dependencies">1. Install dependencies</h1>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">pip install datasets pandas transformers
</span></code></pre></div></div>
<h1 id="2-import-libraries-and-modules">2. Import libraries and modules</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">Dataset</span><span class="p">,</span> <span class="n">DatasetDict</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">T5Tokenizer</span><span class="p">,</span> <span class="n">T5ForConditionalGeneration</span><span class="p">,</span> <span class="n">DataCollatorForSeq2Seq</span><span class="p">,</span> <span class="n">Seq2SeqTrainingArguments</span><span class="p">,</span> <span class="n">Seq2SeqTrainer</span>
</code></pre></div></div>
<h1 id="3-set-model-tokenizer-and-data_collator-variables">3. Set model, tokenizer, and data_collator variables</h1>
<p>Note: You can use other versions of T5 too. <a href="https://huggingface.co/docs/transformers/model_doc/t5">See your options here</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">T5Tokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"t5-base"</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">T5ForConditionalGeneration</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"t5-base"</span><span class="p">)</span>
<span class="n">data_collator</span> <span class="o">=</span> <span class="n">DataCollatorForSeq2Seq</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">)</span>
</code></pre></div></div>
<h1 id="4-get-data-and-divide-into-train-eval-and-test-sets">4. Get data and divide into train, eval, and test sets</h1>
<p>Note: Replace the dataframe with your own, but make sure it has “source_text” and “target_text” columns or you’ll need to modify other code below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"source_text"</span><span class="p">:</span> <span class="p">[],</span> <span class="s">"target_text"</span><span class="p">:</span> <span class="p">[]})</span>
<span class="n">train_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="mf">0.8</span><span class="p">)</span>
<span class="n">eval_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">train_df</span><span class="p">.</span><span class="n">index</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="n">test_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">train_df</span><span class="p">.</span><span class="n">index</span><span class="p">).</span><span class="n">drop</span><span class="p">(</span><span class="n">eval_df</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
</code></pre></div></div>
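<p>The <code>sample</code>/<code>drop</code> calls above give a roughly 80/10/10 split: 80% for training, then half of the remaining 20% for eval, with the rest for test. The same logic in a stdlib-only sketch (plain lists instead of dataframes):</p>

```python
import random

random.seed(0)  # for reproducibility
rows = list(range(100))

train = random.sample(rows, int(0.8 * len(rows)))          # 80 rows
remaining = [r for r in rows if r not in train]            # 20 rows left
eval_rows = random.sample(remaining, len(remaining) // 2)  # 10 rows
test_rows = [r for r in remaining if r not in eval_rows]   # 10 rows

print(len(train), len(eval_rows), len(test_rows))  # 80 10 10
```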
<h1 id="5-create-a-dataset-dict-from-the-dataframes">5. Create a dataset dict from the dataframes</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset</span> <span class="o">=</span> <span class="n">DatasetDict</span><span class="p">({</span>
<span class="s">"train"</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">train_df</span><span class="p">),</span>
<span class="s">"eval"</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">eval_df</span><span class="p">),</span>
<span class="s">"test"</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">test_df</span><span class="p">),</span>
<span class="p">})</span>
</code></pre></div></div>
<h1 id="6-tokenize-the-dataset">6. Tokenize the dataset</h1>
<p>Note: Change the max_length to whatever makes the most sense for your data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">source_texts</span><span class="p">,</span> <span class="n">target_texts</span><span class="p">):</span>
<span class="n">model_inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">source_texts</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">as_target_tokenizer</span><span class="p">():</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">target_texts</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">model_inputs</span><span class="p">[</span><span class="s">"labels"</span><span class="p">]</span> <span class="o">=</span> <span class="n">labels</span><span class="p">[</span><span class="s">"input_ids"</span><span class="p">]</span>
<span class="k">return</span> <span class="n">model_inputs</span>
<span class="n">tokenized_dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">tokenize</span><span class="p">,</span> <span class="n">input_columns</span><span class="o">=</span><span class="p">[</span><span class="s">"source_text"</span><span class="p">,</span> <span class="s">"target_text"</span><span class="p">],</span> <span class="n">remove_columns</span><span class="o">=</span><span class="p">[</span><span class="s">"source_text"</span><span class="p">,</span> <span class="s">"target_text"</span><span class="p">])</span>
</code></pre></div></div>
<h1 id="7-set-training-arguments">7. Set training arguments</h1>
<p>Note: Change “output_directory” to where you want, and update other parameters as makes sense. <a href="https://huggingface.co/docs/transformers/v4.21.3/en/main_classes/trainer#transformers.TrainingArguments">Here’s the documentation for this</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">training_arguments</span> <span class="o">=</span> <span class="n">Seq2SeqTrainingArguments</span><span class="p">(</span>
<span class="s">"output_directory"</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.0001</span><span class="p">,</span>
<span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span>
<span class="n">fp16</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">per_device_eval_batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">gradient_accumulation_steps</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">num_train_epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
<span class="n">evaluation_strategy</span><span class="o">=</span><span class="s">"epoch"</span><span class="p">,</span>
<span class="n">report_to</span><span class="o">=</span><span class="s">"all"</span>
<span class="p">)</span>
</code></pre></div></div>
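<p>One detail worth noting: <code>per_device_train_batch_size=4</code> combined with <code>gradient_accumulation_steps=2</code> means gradients from two forward/backward passes are accumulated before each optimizer step, for an effective batch size of 8 per device (assuming a single GPU here):</p>

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_devices = 1  # assumption: a single GPU

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 8
```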
<h1 id="8-create-a-trainer">8. Create a trainer</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">trainer</span> <span class="o">=</span> <span class="n">Seq2SeqTrainer</span><span class="p">(</span>
<span class="n">model</span><span class="p">,</span>
<span class="n">training_arguments</span><span class="p">,</span>
<span class="n">train_dataset</span><span class="o">=</span><span class="n">tokenized_dataset</span><span class="p">[</span><span class="s">"train"</span><span class="p">],</span>
<span class="n">eval_dataset</span><span class="o">=</span><span class="n">tokenized_dataset</span><span class="p">[</span><span class="s">"eval"</span><span class="p">],</span>
<span class="n">data_collator</span><span class="o">=</span><span class="n">data_collator</span><span class="p">,</span>
<span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span>
<span class="p">)</span>
</code></pre></div></div>
<h1 id="9-train-the-model">9. Train the model</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">trainer</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
</code></pre></div></div>
<h1 id="10-save-the-tokenizer-and-model">10. Save the tokenizer and model</h1>
<p>Note: Update “output_directory” to wherever you want to save everything.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="s">"output_directory"</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="s">"output_directory"</span><span class="p">)</span>
</code></pre></div></div>
<p>And that’s it! Again, <a href="https://gist.github.com/simonmesmith/0334cef17d06d23ca5fa50c78a956d57">all the code is in this Gist</a>. And if you have any issues, you should probably check to make sure nothing has changed with Hugging Face’s API since I wrote this.</p>Simon SmithHugging Face is a great resource for streamlining the use of machine learning in applications. It can be challenging, however, to know what documentation and examples are the most up-to-date.Use offset to work with unindexed arrays in BigQuery2022-08-13T00:00:00+00:002022-08-13T00:00:00+00:00https://www.simonsmith.ca/2022/08/13/use-offset-to-work-with-unindexed-arrays-in-bigquery<p>Recently I faced a challenge of working with multilevel nested arrays in BigQuery. The table I was working with had a structure somewhat like this:</p>
<ul>
<li>id</li>
<li>level_one_struct_array
<ul>
<li>name</li>
<li>…</li>
<li>level_two_struct_array
<ul>
<li>name</li>
<li>…</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>I needed to unnest the arrays, change values within the level two array, and then reaggregate everything.</p>
<p>The trouble is, using <code class="language-plaintext highlighter-rouge">UNNEST</code> in BigQuery doesn’t preserve order. So if I unnested each array and then reaggregated them, I wouldn’t necessarily get things back in the right order. And in my use case, order mattered.</p>
<p>The solution: use <code class="language-plaintext highlighter-rouge">OFFSET</code> to add indexes to the arrays, somewhat as follows:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">level_one_flattened</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">id</span><span class="p">,</span>
<span class="n">level_one_offset</span><span class="p">,</span>
<span class="n">level_one_struct</span><span class="p">.</span><span class="o">*</span>
<span class="k">FROM</span> <span class="k">table_name</span><span class="p">,</span>
<span class="k">table_name</span><span class="p">.</span><span class="n">level_one_struct_array</span> <span class="k">AS</span> <span class="n">level_one_struct</span>
<span class="k">WITH</span> <span class="k">OFFSET</span> <span class="k">AS</span> <span class="n">level_one_offset</span>
<span class="p">),</span>
<span class="n">level_two_flattened</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">id</span><span class="p">,</span>
<span class="n">level_one_offset</span><span class="p">,</span>
<span class="n">level_two_offset</span><span class="p">,</span>
<span class="n">level_two_struct</span><span class="p">.</span><span class="o">*</span>
<span class="k">FROM</span> <span class="n">level_one_flattened</span><span class="p">,</span>
<span class="n">level_one_flattened</span><span class="p">.</span><span class="n">level_two_struct_array</span> <span class="k">AS</span> <span class="n">level_two_struct</span>
<span class="k">WITH</span> <span class="k">OFFSET</span> <span class="k">AS</span> <span class="n">level_two_offset</span>
<span class="p">),</span>
<span class="p">...</span>
</code></pre></div></div>
<p>With this done, I could work with <code class="language-plaintext highlighter-rouge">level_one_flattened</code> and <code class="language-plaintext highlighter-rouge">level_two_flattened</code>, then reaggregate everything at the end in the appropriate order using the offset-generated indexes.</p>
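<p>The reaggregation step might look roughly like this (a sketch with simplified column lists, not my exact query): <code class="language-plaintext highlighter-rouge">ARRAY_AGG</code> with <code class="language-plaintext highlighter-rouge">ORDER BY</code> on the offset column is what restores the original array order.</p>

```sql
-- Rebuild each level-two array in its original order (sketch; add any
-- other level-two fields to the STRUCT as needed).
SELECT
  id,
  level_one_offset,
  ARRAY_AGG(
    STRUCT(name)
    ORDER BY level_two_offset
  ) AS level_two_struct_array
FROM level_two_flattened
GROUP BY id, level_one_offset
```

<p>A second <code class="language-plaintext highlighter-rouge">ARRAY_AGG ... ORDER BY level_one_offset</code>, grouped by <code class="language-plaintext highlighter-rouge">id</code>, rebuilds the level-one array the same way.</p>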
<p>It’s not rocket science, and I’m sure people with much greater expertise in SQL than me are very familiar with this. But it wasn’t something I needed to use until recently, when it came in very handy.</p>
<p><a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unnest_operator">Read more about <code class="language-plaintext highlighter-rouge">UNNEST</code> and <code class="language-plaintext highlighter-rouge">OFFSET</code> in BigQuery’s docs here</a>.</p>Simon SmithRecently I faced a challenge of working with multilevel nested arrays in BigQuery. The table I was working with had a structure somewhat like this:Extract structured data from unstructured text using language models like GPT-32022-08-05T00:00:00+00:002022-08-05T00:00:00+00:00https://www.simonsmith.ca/2022/08/05/extract-structured-data-from-unstructured-text-using-language-models-like-gpt-3<p>Recently I faced a common challenge: extracting structured information from millions of unstructured text documents.</p>
<p>Neither regular expression extraction nor part-of-speech tagging would scale, because we had multiple categories of content and inconsistent phrasing within them. We would have had to write tailored regular expressions or part-of-speech extraction rules for every new paragraph topic, and account for a long tail of edge cases.</p>
<p>Having experimented with large language models like <a href="https://beta.openai.com">GPT-3</a>, I was curious as to whether we could simply train one to extract the information we wanted into a structured format like JSON. Then we could validate the JSON and load it directly into BigQuery.</p>
<p>I was thrilled to see how well this worked, and if you have access to GPT-3 you can immediately try it yourself. For example, first, enter this one-shot training example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>John bought a bag of peanuts for $8. He thought they were delicious.
{"person": "John", "product": "peanuts", "cost": "$8", "sentiment": "positive"}
</code></pre></div></div>
<p>Then enter some similar examples and see how well GPT-3 manages them.</p>
<p>Example 1:</p>
<blockquote>
<p>“When she arrived at the store, Sarah purchased a bottle of water. It cost $4.50. She was pissed that it was so expensive!”</p>
</blockquote>
<p>GPT-3’s response:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"person"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Sarah"</span><span class="p">,</span><span class="w"> </span><span class="nl">"product"</span><span class="p">:</span><span class="w"> </span><span class="s2">"water"</span><span class="p">,</span><span class="w"> </span><span class="nl">"cost"</span><span class="p">:</span><span class="w"> </span><span class="s2">"$4.50"</span><span class="p">,</span><span class="w"> </span><span class="nl">"sentiment"</span><span class="p">:</span><span class="w"> </span><span class="s2">"negative"</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Example 2:</p>
<blockquote>
<p>“After a long day at work, Frank went shopping for some new clothes. He bought a suit and tie. It cost $1,500. He didn’t mind, as he considered it a cost of doing business.”</p>
</blockquote>
<p>GPT-3’s response:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"person"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Frank"</span><span class="p">,</span><span class="w"> </span><span class="nl">"product"</span><span class="p">:</span><span class="w"> </span><span class="s2">"suit and tie"</span><span class="p">,</span><span class="w"> </span><span class="nl">"cost"</span><span class="p">:</span><span class="w"> </span><span class="s2">"$1,500"</span><span class="p">,</span><span class="w"> </span><span class="nl">"sentiment"</span><span class="p">:</span><span class="w"> </span><span class="s2">"neutral"</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>As you can see from these two examples, GPT-3 generalizes extremely well.</p>
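<p>In code, the loop of “build prompt, call the model, validate the JSON” might be wired up roughly as follows. This is a sketch: the commented-out <code class="language-plaintext highlighter-rouge">openai.Completion.create</code> call and its engine name reflect the API as it existed at the time, and the prompt-building and validation functions are the parts actually shown working.</p>

```python
import json

# One-shot training example, matching the one above.
ONE_SHOT = (
    "John bought a bag of peanuts for $8. He thought they were delicious.\n"
    '{"person": "John", "product": "peanuts", "cost": "$8", "sentiment": "positive"}\n'
)

def build_prompt(text: str) -> str:
    """Prepend the one-shot example so the model continues in kind."""
    return ONE_SHOT + text + "\n"

def parse_response(raw: str) -> dict:
    """Validate the model's output before loading it into BigQuery."""
    record = json.loads(raw)
    missing = {"person", "product", "cost", "sentiment"} - record.keys()
    if missing:
        raise ValueError(f"Missing keys: {missing}")
    return record

# The completion call itself (not run here: it needs an API key, and the
# engine name is a placeholder):
# completion = openai.Completion.create(
#     engine="text-davinci-002", prompt=build_prompt(text), max_tokens=100)

print(parse_response(
    '{"person": "Sarah", "product": "water", "cost": "$4.50", "sentiment": "negative"}'
))
```

<p>Validating before loading means malformed generations fail fast instead of polluting the table.</p>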
<p>There’s more work to do to scale this up. But so far it’s quite exciting to find a new way to solve such a common challenge.</p>Simon SmithRecently I faced a common challenge: extracting structured information from millions of unstructured text documents.Get colors from images as swatches2022-07-28T00:00:00+00:002022-07-28T00:00:00+00:00https://www.simonsmith.ca/2022/07/28/get-colors-from-images-as-swatches<p>A few weeks ago I was playing with scientific figures and wondering how I might extract insights from them. One idea I had was to find all the colors in scientific images and rank them.</p>
<p>Given that different cells, cell parts, and tissues often have different colors—especially when stained to do so—that could be a productive path. For example, if cancer cells are stained a different color in an image than healthy cells, the higher the percentage of the cancer color, the worse it is.</p>
<p>Turns out this isn’t an easy problem to solve. Why? Because while we see a few colors in an image, there are actually many variations of those colors which are imperceptible to us.</p>
<p>The solution is to cluster an image’s colors. For example, group all the reddish colors together, then all the bluish ones, and so on. And then you can determine the relative amounts of each color.</p>
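<p>To make the idea concrete, here’s a minimal sketch (not the Colorgram code itself) that groups near-identical colors by coarse quantization rather than proper clustering, then computes each group’s share of the image:</p>

```python
from collections import Counter

def color_proportions(pixels, bucket=64):
    """Group near-identical RGB colors by coarse quantization and return
    each group's share of the image. A simplified stand-in for clustering."""
    def quantize(rgb):
        # Snap each channel to the bottom of its bucket, so e.g.
        # (250, 10, 10) and (240, 5, 20) both become (192, 0, 0).
        return tuple((channel // bucket) * bucket for channel in rgb)
    counts = Counter(quantize(p) for p in pixels)
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}

# Two reddish and two bluish pixels collapse into two groups of 50% each.
print(color_proportions([(250, 10, 10), (240, 5, 20), (10, 10, 250), (15, 5, 245)]))
```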
<p>I wrote some code to do this, which I’ve called “<a href="https://gist.github.com/simonmesmith/a1c3fdef3d8e9a03cd170cc3e7a5a596">Colorgram</a>” and uploaded as a Gist. Here’s an example of it working on <a href="https://labs.openai.com/e/lsqFtxWjDEoJ7knsg5H24L0L/1hvQqqxamd1HlBcDECc2pSQz">a Dall-e image I generated</a>:</p>
<h2 id="input-image">Input image</h2>
<p><img src="/assets/images/colorful-modern-living-room.png" alt="Colorful modern living room" /></p>
<h2 id="colorgrammed-image">Colorgrammed image</h2>
<p><img src="/assets/images/colorful-modern-living-room-colorgrammed.png" alt="Colorful modern living room colorgrammed" /></p>
<p>PS: I owe a debt to various places where I learned how to do this and copied some code snippets. I don’t remember them all but now wish I did. If you recognize this as something you’ve worked on, and want credit, please let me know as I’m happy to give it!</p>Simon SmithA few weeks ago I was playing with scientific figures and wondering how I might extract insights from them. One idea I had was to find all the colors in scientific images and rank them.Extract table data from images with pure Pytesseract2022-07-21T00:00:00+00:002022-07-21T00:00:00+00:00https://www.simonsmith.ca/2022/07/21/extract-table-data-from-images-with-pure-pytesseract<p>When extracting data from documents, one common challenge is processing text in images. This can be particularly difficult when the text is in tables. You don’t just want the text, but want it structured in relation to other text.</p>
<p>You can find solutions to this problem by Googling, but many seem brittle and overcomplicated. They may be brittle because they rely on image recognition libraries like OpenCV to find gridlines that might not always be present. And they may be overcomplicated: shouldn’t an OCR tool like Tesseract alone be able to return the information needed to reconstruct table data from the attributes it provides?</p>
<p>That was my hypothesis, anyway. Since Tesseract gives you information on x and y coordinates of text, and since tables follow a fairly standard format, I thought that we should be able to extract table text and structure using <em>only</em> Tesseract.</p>
<p>So I tested the idea.</p>
<h2 id="the-toy-problem-a-simple-table-in-an-image">The toy problem: A simple table in an image</h2>
<p>To test my hypothesis, I created a very simple toy problem using this table in an image:</p>
<p><img src="/assets/images/toy-table-image.png" alt="Toy table image" /></p>
<h2 id="the-code">The code</h2>
<p>Using this toy problem, here’s how I approached a solution, step by step.</p>
<h3 id="1-import-key-libraries">1. Import key libraries</h3>
<p>I was hoping to use the fewest libraries possible, in keeping with my goal of simplicity. But I ended up having to import several, shown below.</p>
<p>Importantly, note the use of <a href="https://pypi.org/project/pytesseract/">Pytesseract</a>. This is a Python wrapper around Tesseract, which you’ll need to install. See the instructions for doing so by following that link.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">cv2</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">pytesseract</span>
<span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">AgglomerativeClustering</span>
</code></pre></div></div>
<h3 id="2-create-a-function-to-preprocess-the-image">2. Create a function to preprocess the image</h3>
<p>Tesseract works better when you do even basic preprocessing on images. I wrote a function to handle that. Note that it explicitly strips out gridlines, which some other image table extractors I’ve seen need in order to determine rows and columns, as mentioned above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">preprocess</span><span class="p">(</span><span class="n">image_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">:</span>
<span class="c1"># Get the image.
</span> <span class="n">img</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">imread</span><span class="p">(</span><span class="n">image_path</span><span class="p">)</span>
<span class="c1"># Convert the image to grayscale.
</span> <span class="n">gray_img</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">cvtColor</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">COLOR_BGR2GRAY</span><span class="p">)</span>
<span class="c1"># Remove backgrounds.
</span> <span class="n">bg_free_img</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">threshold</span><span class="p">(</span><span class="n">gray_img</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">THRESH_OTSU</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
<span class="c1"># Create an inverse image to use for removing lines.
</span> <span class="n">inverted_img</span> <span class="o">=</span> <span class="o">~</span> <span class="n">bg_free_img</span>
<span class="c1"># Remove horizontal lines.
</span> <span class="c1"># TODO: Set line thickness dynamically.
</span> <span class="n">horizontal_kernel</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">getStructuringElement</span><span class="p">(</span><span class="n">cv2</span><span class="p">.</span><span class="n">MORPH_RECT</span><span class="p">,</span> <span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">remove_horizontal</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">morphologyEx</span><span class="p">(</span><span class="n">inverted_img</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">MORPH_OPEN</span><span class="p">,</span> <span class="n">horizontal_kernel</span><span class="p">,</span> <span class="n">iterations</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">cnts</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">findContours</span><span class="p">(</span><span class="n">remove_horizontal</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">RETR_EXTERNAL</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">CHAIN_APPROX_SIMPLE</span><span class="p">)</span>
<span class="n">cnts</span> <span class="o">=</span> <span class="n">cnts</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">cnts</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span> <span class="k">else</span> <span class="n">cnts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">cnts</span><span class="p">:</span> <span class="n">cv2</span><span class="p">.</span><span class="n">drawContours</span><span class="p">(</span><span class="n">bg_free_img</span><span class="p">,</span> <span class="p">[</span><span class="n">c</span><span class="p">],</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># Remove vertical lines.
</span> <span class="c1"># TODO: Set line thickness dynamically.
</span> <span class="n">vertical_kernel</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">getStructuringElement</span><span class="p">(</span><span class="n">cv2</span><span class="p">.</span><span class="n">MORPH_RECT</span><span class="p">,</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">40</span><span class="p">))</span>
<span class="n">remove_vertical</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">morphologyEx</span><span class="p">(</span><span class="n">inverted_img</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">MORPH_OPEN</span><span class="p">,</span> <span class="n">vertical_kernel</span><span class="p">,</span> <span class="n">iterations</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">cnts</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">findContours</span><span class="p">(</span><span class="n">remove_vertical</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">RETR_EXTERNAL</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">CHAIN_APPROX_SIMPLE</span><span class="p">)</span>
<span class="n">cnts</span> <span class="o">=</span> <span class="n">cnts</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">cnts</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span> <span class="k">else</span> <span class="n">cnts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">cnts</span><span class="p">:</span> <span class="n">cv2</span><span class="p">.</span><span class="n">drawContours</span><span class="p">(</span><span class="n">bg_free_img</span><span class="p">,</span> <span class="p">[</span><span class="n">c</span><span class="p">],</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># Return the output image.
</span> <span class="k">return</span> <span class="n">bg_free_img</span>
</code></pre></div></div>
<h3 id="3-create-a-function-to-group-text-by-inferred-row">3. Create a function to group text by inferred row</h3>
<p>Tesseract isn’t so helpful as to return information on a text’s row in a table; it doesn’t have a concept of tables or rows. It does, however, tell you each piece of text’s “top” value, which is effectively its y coordinate. Using this, we can cluster text into distinct rows.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_row_max_tops</span><span class="p">(</span><span class="n">img_df</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span> <span class="n">distance_threshold</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span><span class="p">:</span>
<span class="c1"># Create coordinates to use for clustering top values for rows. Note that
</span> <span class="c1"># we use (0, y), where y is "top." We specify 0 for x because we don't
</span> <span class="c1"># care here about the left value, only the top value.
</span> <span class="n">row_coordinates</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">0</span><span class="p">,</span> <span class="n">row</span><span class="p">[</span><span class="s">"top"</span><span class="p">])</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">img_df</span><span class="p">.</span><span class="n">iterrows</span><span class="p">()]</span>
<span class="c1"># Cluster rows by top values.
</span> <span class="n">row_clusters</span> <span class="o">=</span> <span class="n">AgglomerativeClustering</span><span class="p">(</span>
<span class="n">n_clusters</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
<span class="n">affinity</span><span class="o">=</span><span class="s">"manhattan"</span><span class="p">,</span>
<span class="n">linkage</span><span class="o">=</span><span class="s">"complete"</span><span class="p">,</span>
<span class="n">distance_threshold</span><span class="o">=</span><span class="n">distance_threshold</span><span class="p">)</span>
<span class="n">row_clusters</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">row_coordinates</span><span class="p">)</span>
<span class="c1"># Create max row tops values using row clusters and sort ascending.
</span> <span class="n">row_max_tops</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">row_index</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">row_clusters</span><span class="p">.</span><span class="n">labels_</span><span class="p">):</span>
<span class="n">row_coordinate_indexes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">row_clusters</span><span class="p">.</span><span class="n">labels_</span> <span class="o">==</span> <span class="n">row_index</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">row_max_top</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="n">row_coordinates</span><span class="p">[</span><span class="n">row_coordinate_index</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">row_coordinate_index</span> <span class="ow">in</span> <span class="n">row_coordinate_indexes</span><span class="p">])</span>
<span class="n">row_max_tops</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">row_max_top</span><span class="p">)</span>
<span class="n">row_max_tops</span><span class="p">.</span><span class="n">sort</span><span class="p">()</span>
<span class="c1"># Return the row index and max top for each row.
</span> <span class="k">return</span> <span class="p">[(</span><span class="n">i</span><span class="p">,</span> <span class="n">row_max_top</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">row_max_top</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">row_max_tops</span><span class="p">)]</span>
</code></pre></div></div>
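<p>To see the row-grouping idea in isolation, here’s a simplified sketch that uses a plain gap threshold on sorted “top” values rather than scikit-learn’s agglomerative clustering (the results can differ at the margins, but the intuition is the same):</p>

```python
def group_tops(tops, distance_threshold=25.0):
    """Group word "top" (y) values into rows: a new row starts whenever the
    gap from the previous value exceeds the threshold. A simplified stand-in
    for the agglomerative clustering used above."""
    rows = []
    for top in sorted(tops):
        if rows and top - rows[-1][-1] <= distance_threshold:
            rows[-1].append(top)  # close enough: same row
        else:
            rows.append([top])  # big vertical gap: new row
    return rows

# Words at y of 10-12 form one row; words at y of 80-82 form another.
print(group_tops([12, 10, 11, 80, 82]))  # → [[10, 11, 12], [80, 82]]
```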
<h3 id="4-extract-table-data-from-the-preprocessed-image-using-table-row-clusters">4. Extract table data from the preprocessed image using table row clusters</h3>
<p>With the functions above to preprocess an image and cluster text by row, we’re ready to rock. The last function we need does the following:</p>
<ol>
<li>Preprocess the image</li>
<li>Cluster text into rows</li>
<li>Use Tesseract’s “left” and “word_num” attributes to sort text into appropriate columns</li>
<li>Return everything as a dataframe</li>
</ol>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">read</span><span class="p">(</span><span class="n">image_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">distance_threshold</span><span class="o">=</span><span class="mf">25.0</span><span class="p">)</span> <span class="o">-></span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
<span class="c1"># Preprocess the image.
</span> <span class="n">img</span> <span class="o">=</span> <span class="n">preprocess</span><span class="p">(</span><span class="n">image_path</span><span class="p">)</span>
<span class="c1"># Read the image into a Pytesseract data frame.
</span> <span class="n">img_df</span> <span class="o">=</span> <span class="n">pytesseract</span><span class="p">.</span><span class="n">image_to_data</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">output_type</span><span class="o">=</span><span class="s">"data.frame"</span><span class="p">)</span>
<span class="c1"># Drop any blank text.
</span> <span class="n">img_df</span><span class="p">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Add row numbers to the dataframe. We do this by clustering rows according
</span> <span class="c1"># to their "top" value. We then determine the max "top" value for each row.
</span> <span class="c1"># Then we assign row numbers to the dataframe based on top values.
</span> <span class="n">row_max_tops</span> <span class="o">=</span> <span class="n">get_row_max_tops</span><span class="p">(</span><span class="n">img_df</span><span class="p">,</span> <span class="n">distance_threshold</span><span class="p">)</span>
<span class="n">img_df</span><span class="p">[</span><span class="s">"row_number"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">([],</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">object</span><span class="p">)</span>
<span class="k">for</span> <span class="n">row_number</span><span class="p">,</span> <span class="n">row_max_top</span> <span class="ow">in</span> <span class="n">row_max_tops</span><span class="p">:</span>
<span class="k">if</span> <span class="n">row_number</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span> <span class="n">lower_bound</span> <span class="o">=</span> <span class="n">row_max_tops</span><span class="p">[</span><span class="n">row_number</span> <span class="o">-</span> <span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span> <span class="c1"># E.g. if the prior row has a max top of 50, the lower bound for the next row is 51
</span> <span class="k">else</span><span class="p">:</span> <span class="n">lower_bound</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">upper_bound</span> <span class="o">=</span> <span class="n">row_max_top</span>
<span class="n">img_df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">img_df</span><span class="p">[</span><span class="s">"top"</span><span class="p">].</span><span class="n">between</span><span class="p">(</span><span class="n">lower_bound</span><span class="p">,</span> <span class="n">upper_bound</span><span class="p">),</span> <span class="s">"row_number"</span><span class="p">]</span> <span class="o">=</span> <span class="n">row_number</span>
<span class="c1"># Sort the dataframe by row number, left, and word_num so we can build table content logically.
</span> <span class="n">img_df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">"row_number"</span><span class="p">,</span> <span class="s">"left"</span><span class="p">,</span> <span class="s">"word_num"</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Build the table content.
</span> <span class="n">table_content</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">row_number</span> <span class="ow">in</span> <span class="n">img_df</span><span class="p">[</span><span class="s">"row_number"</span><span class="p">].</span><span class="n">unique</span><span class="p">():</span>
<span class="n">row_content</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cell_content</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">img_df</span><span class="p">[</span><span class="n">img_df</span><span class="p">[</span><span class="s">"row_number"</span><span class="p">]</span> <span class="o">==</span> <span class="n">row_number</span><span class="p">].</span><span class="n">iterrows</span><span class="p">():</span>
<span class="k">if</span> <span class="n">word</span><span class="p">[</span><span class="s">"word_num"</span><span class="p">]</span> <span class="o">==</span> <span class="mi">1</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">cell_content</span><span class="p">)</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="n">row_content</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">" "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">cell_content</span><span class="p">))</span>
<span class="n">cell_content</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cell_content</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">word</span><span class="p">[</span><span class="s">"text"</span><span class="p">])</span>
<span class="n">row_content</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">" "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">cell_content</span><span class="p">))</span>
<span class="n">table_content</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">row_content</span><span class="p">)</span>
<span class="c1"># Convert the table content to a dataframe, and return it.
</span> <span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">table_content</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="the-result-pretty-good">The result? Pretty good!</h2>
<p>To extract table data from an image as a Pandas dataframe, all you need to run now is this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">img</span> <span class="o">=</span> <span class="s">"/path-to-your-image.jpg"</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">read</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
</code></pre></div></div>
<p>Below is the output you’ll get for the toy table. As you can see, the approach works fairly well, though it puts the column headers one column too far to the left.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 Column 1 header Column 2 header Column 3 header None
1 Row 1 header Row 1 column 1 Row 1 column 2 Row 1 column 3
2 Row 2 header Row 2 column 1 Row 2 column 2 Row 2 column 3
3 Row 3 header Row 3 column 1 Row 3 column 2 Row 3 column 3
4 Row 4 header Row 4 column 1 Row 4 column 2 Row 4 column 3
5 Row 5 header Row 5 column 1 Row 5 column 2 Row 5 column 3
6 Row 6 header Row 6 column 1 Row 6 column 2 Row 6 column 3
</code></pre></div></div>
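<p>Since the misalignment seems to be a consistent one-column shift in the first row, a small post-processing step could correct it. Here’s a sketch, assuming the header row is always the first row and always off by exactly one column (the function name <code class="language-plaintext highlighter-rouge">shift_header_row</code> is my own, not part of the code above):</p>

```python
import pandas as pd


def shift_header_row(df: pd.DataFrame) -> pd.DataFrame:
    """Shift the first (header) row one column to the right.

    Assumes the OCR step dropped the empty top-left cell, pulling the
    column headers one position too far left, as in the toy output.
    """
    fixed = df.copy()
    # Series.shift(1) moves values right by one position and fills the
    # now-empty leftmost cell with NaN (matching the blank corner cell).
    fixed.iloc[0] = fixed.iloc[0].shift(1)
    return fixed
```

This keeps the fix out of the extraction logic itself, so it can be applied (or skipped) depending on whether a given table has a blank corner cell.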
<h2 id="beyond-the-toy-problem-how-does-it-do-in-the-real-world">Beyond the toy problem: How does it do in the real world?</h2>
<p>In the real world, of course, tables aren’t as neat and tidy as the toy problem. And indeed, when I try this approach on dirtier inputs, the results are nowhere near as clean. But as proofs of concept go, I think this works pretty well and has the potential for further improvement. Maybe you have some additional ideas?</p>
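<p>One improvement worth trying on noisier inputs: drop low-confidence OCR words before building rows, since stray marks and lines often come back as garbage words with low confidence scores. A sketch, assuming the word dataframe has a <code class="language-plaintext highlighter-rouge">conf</code> column like the one pytesseract’s <code class="language-plaintext highlighter-rouge">image_to_data</code> output provides (the threshold of 60 is an arbitrary starting point, not a recommendation):</p>

```python
import pandas as pd


def filter_low_confidence(img_df: pd.DataFrame, min_conf: float = 60.0) -> pd.DataFrame:
    """Drop OCR words whose confidence falls below a threshold.

    Assumes a "conf" column as in pytesseract's image_to_data output,
    where values may arrive as strings and -1 marks non-word rows.
    """
    conf = img_df["conf"].astype(float)
    return img_df[conf >= min_conf].reset_index(drop=True)
```

Running this before the row- and cell-building loop should remove much of the noise, at the cost of occasionally dropping genuine but blurry words, so the threshold likely needs tuning per document.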
<h2 id="download-all-the-code">Download all the code</h2>
<p>If you’d like to download this, try it yourself, and maybe further improve it, please <a href="https://gist.github.com/simonmesmith/73face2a11e226f1cae2481a3927edf7">download the code from this Gist</a>.</p>
<p>PS: Thanks to <a href="https://pyimagesearch.com/2022/02/28/multi-column-table-ocr/">PyImageSearch</a> and <a href="https://stackoverflow.com/questions/33949831/how-to-remove-all-lines-and-borders-in-an-image-while-keeping-text-programmatica">Stack Overflow</a> for some guidance as I worked through this problem.</p>Simon SmithWhen extracting data from documents, one common challenge is processing text in images. This can be particularly difficult when the text is in tables. You don’t just want the text, but want it structured in relation to other text.