Day 18 - henkei Gem - Read Text and Meta Data from Word, PowerPoint, and PDF Files

Written by Matt Swanson

Contrarian-in-training. Building products. Karl Pilkington is my spirit animal. Hacking on Boring Rails.

Searching within uploaded files

If you've ever built an application that involves file uploads, you will inevitably be asked to make those files searchable.

While there are plenty of articles and tools for implementing full-text search with Ruby, nearly all of these examples are for searching your database records. But what if you need to search the contents of a PDF? Or a Microsoft Word document? Or even a PowerPoint presentation? Sounds like a nightmare.

The basic strategy for this problem is to extract as much textual content from the file as you can, break it into chunks (by page or by paragraph, for example), and then index those chunks in a tool like Elasticsearch, Algolia, or PgSearch.
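To make the "break it into chunks" step concrete, here is a minimal sketch of paragraph-based chunking in plain Ruby. It assumes paragraphs are separated by blank lines in the extracted text, which will not hold for every document. The method name `paragraph_chunks` and the `max_chars` limit are made up for illustration and are not part of henkei.

```ruby
# Group paragraphs of extracted text into chunks of roughly max_chars characters.
# Hypothetical helper: the name and the max_chars default are assumptions.
def paragraph_chunks(text, max_chars: 500)
  # Assume paragraphs are separated by one or more blank lines.
  paragraphs = text.split(/\n\s*\n/).map { |para| para.strip.gsub(/\s+/, " ") }
  paragraphs = paragraphs.reject(&:empty?)

  chunks = []
  current = +""
  paragraphs.each do |paragraph|
    # Start a new chunk when adding this paragraph would exceed the limit.
    if !current.empty? && current.size + paragraph.size + 1 > max_chars
      chunks << current
      current = +""
    end
    current << " " unless current.empty?
    current << paragraph
  end
  chunks << current unless current.empty?
  chunks
end
```

A single paragraph longer than `max_chars` ends up as its own oversized chunk here, so you may want a fallback that splits long paragraphs by sentence or word count.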

But how do you get the text out of these files? It's not as simple as reading a .txt file.

Enter henkei

The henkei gem is a small wrapper around the Apache Tika project.

You can extract the text of any supported file using a common interface:

require 'henkei'

data = File.read('TPS Report v2.docx')
text = Henkei.read(:text, data)

Here are some of the formats supported:

  • Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
  • OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
  • Apple iWork Formats
  • Rich Text Format (.rtf)
  • Portable Document Format (.pdf)

For the complete list of supported formats, please visit the Apache Tika Supported Document Formats page.

How it works in practice

In most cloud environments, files are stored on an external service. Henkei can also open a file from a URL:

henkei = Henkei.new 'http://my-bucket.s3.aws.com/uploads/2020-projections.pptx'
text = henkei.text

Now that you have the file's text in one big Ruby String, you can split it into chunks however you like and feed those chunks into your full-text search tool.

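# Uses String#squish and #blank? from ActiveSupport (available in Rails).
# Assumed value: characters per chunk; tune this for your search tool.
MAX_CHUNK_SIZE = 1_000

def extract_text_chunks(s3_url)
  raw_text = Henkei.new(s3_url).text
  chunks = []
  chunk = ""

  # Split on runs of non-word characters, then rebuild space-separated chunks.
  raw_text.split(/[^[[:word:]]]+/).each do |word|
    chunk += "#{word} "
    if chunk.size > MAX_CHUNK_SIZE
      chunks << chunk.squish
      chunk = ""
    end
  end

  # Keep the final partial chunk instead of silently dropping it.
  chunks << chunk.squish
  chunks.reject(&:blank?)
end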
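Once you have chunks, the next step is shaping them into records your search tool can ingest. The sketch below is one way to do that in plain Ruby. The method name `build_index_records` and the field names (`id`, `source_url`, `position`, `content`) are hypothetical and not tied to any particular search tool's schema.

```ruby
require "digest"

# Turn text chunks into plain hashes that a search indexer could ingest.
# Hypothetical helper: adapt the field names to your search tool's schema.
def build_index_records(source_url, chunks)
  chunks.each_with_index.map do |content, position|
    {
      # Deterministic id, so re-indexing the same file overwrites old records
      # instead of creating duplicates.
      id: Digest::SHA256.hexdigest("#{source_url}:#{position}")[0, 16],
      source_url: source_url,
      position: position,
      content: content
    }
  end
end
```

Keeping `source_url` and `position` on each record lets a search hit link back to the original file and show roughly where in the document the match occurred.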

Installation Note

Because this gem wraps the Apache Tika library, you will need a Java runtime in your environment to use it. Adding a Java runtime should not be a problem on most hosting providers, but be aware of this dependency.
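If you want to catch a missing Java runtime early, for example at boot, a small check like the following may help. This is a sketch that assumes a Unix-like environment where the executable is named `java` and sits on the `PATH`. The helper name `java_available?` is made up for illustration and is not part of henkei.

```ruby
# Returns true if an executable file named `java` exists in any PATH directory.
# Hypothetical helper; on Windows the executable is java.exe, so adjust as needed.
def java_available?(path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).reject(&:empty?).any? do |dir|
    candidate = File.join(dir, "java")
    File.file?(candidate) && File.executable?(candidate)
  end
end
```

You could call it from an initializer, for instance `warn "Java runtime not found; text extraction will fail" unless java_available?`.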

Find Out More

References