Class: Heathen::Processor

Inherits:
Object
Defined in:
lib/heathen/processor.rb,
lib/heathen/processor_methods/pdftotext.rb,
lib/heathen/processor_methods/tesseract.rb,
lib/heathen/processor_methods/htmltotext.rb,
lib/heathen/processor_methods/wkhtmltopdf.rb,
lib/heathen/processor_methods/libreoffice.rb,
lib/heathen/processor_methods/convert_image.rb

Overview

The Processor is the heart of the Heathen conversion process. Mixed into it are all of the processing steps available to Heathen (see the processor_methods directory).

See [Task] for how it is used.

Mixin methods (defined in processor_methods/) are expected to make their changes to the Processor#job object, setting content or other values as necessary.

Constructor Details

#initialize(job:, logger: Logger.new(STDOUT), base_tmpdir: '/tmp') ⇒ Processor

Creates a new processor.

Parameters:

  • job (Job)

    the job to be performed.

  • logger (Logger) (defaults to: Logger.new(STDOUT))

    an optional logger.

  • base_tmpdir (String) (defaults to: '/tmp')

    the base directory for all temporary (sandbox_dir) files


# File 'lib/heathen/processor.rb', line 20

def initialize( job:, logger: Logger.new(STDOUT), base_tmpdir:'/tmp' )
  @job = job
  @logger = logger
  @executioner = Heathen::Executioner.new(@logger)
  @sandbox_dir = Dir.mktmpdir( "heathen", base_tmpdir.to_s )
  job.sandbox_dir = @sandbox_dir
end
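
The sandbox lifecycle set up by the constructor (and later torn down by #clean_up) can be sketched in isolation as follows. This is a minimal sketch using only the Ruby standard library; `with_sandbox` is a hypothetical helper that is not part of Heathen, and it defaults to `Dir.tmpdir` rather than '/tmp'.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of Heathen): creates a uniquely named
# sandbox directory, yields it, and always removes it afterwards.
def with_sandbox(base_tmpdir = Dir.tmpdir)
  # Same call the constructor makes: the prefix is "heathen" and the
  # directory is created under the given base directory.
  sandbox = Dir.mktmpdir('heathen', base_tmpdir.to_s)
  yield sandbox
ensure
  # Counterpart of #clean_up: remove the directory and its contents.
  FileUtils.remove_entry(sandbox) if sandbox && File.exist?(sandbox)
end

created = with_sandbox do |dir|
  File.write(File.join(dir, 'scratch.txt'), 'temporary data')
  dir
end
puts "sandbox #{created} still exists? #{File.exist?(created)}"
```

The Processor itself does not wrap the work in a block like this; the caller is expected to invoke #clean_up once processing has finished.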

Instance Attribute Details

#executioner ⇒ Object (readonly)

Returns the value of attribute executioner


# File 'lib/heathen/processor.rb', line 13

def executioner
  @executioner
end

#job ⇒ Object (readonly)

Returns the value of attribute job


# File 'lib/heathen/processor.rb', line 12

def job
  @job
end

#sandbox_dir ⇒ Object (readonly)

Returns the value of attribute sandbox_dir


# File 'lib/heathen/processor.rb', line 14

def sandbox_dir
  @sandbox_dir
end

Instance Method Details

#clean_up ⇒ Object

Removes the sandbox directory and every temporary file in it. Call this at the end of processing.


# File 'lib/heathen/processor.rb', line 42

def clean_up
  FileUtils.remove_entry @sandbox_dir
end

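#config_file(name) ⇒ Object

Returns the path of the named file inside the 'config' directory, which is located three levels above lib/heathen/processor.rb.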


# File 'lib/heathen/processor.rb', line 51

def config_file name  # I don't like this. Change for C_ ? - I'd like to keep colore bits out so I can gemify heathen

  Pathname.new(__FILE__).realpath.parent.parent.parent + 'config' + name
end
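
The path arithmetic in #config_file can be traced with a small standalone sketch. `config_path` is a hypothetical function, the file names are made up for illustration, and `realpath` is left out so that the example does not need the files to exist.

```ruby
require 'pathname'

# Hypothetical stand-in for #config_file: climbs three directory levels
# from the given source file, then descends into 'config'.
def config_path(source_file, name)
  Pathname.new(source_file).parent.parent.parent + 'config' + name
end

# '/app/lib/heathen/processor.rb' -> '/app/lib/heathen' -> '/app/lib' -> '/app'
puts config_path('/app/lib/heathen/processor.rb', 'example.yml')
```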

#convert_image(to: 'tiff', params: nil) ⇒ Object

Converts an image to a different image format. This is done by running the 'convert' utility from ImageMagick. Sets the job content to the new format.

Parameters:

  • to (String) (defaults to: 'tiff')

    the format to convert to (suffix)

  • params (Array) (defaults to: nil)

    optional parameters to pass to the convert program.

Raises:

  • (ConversionFailed)

    if the convert program exits with a non-zero status.


# File 'lib/heathen/processor_methods/convert_image.rb', line 7

def convert_image to: 'tiff', params: nil
  # expect_mime_type takes a regex pattern, not a glob
  expect_mime_type 'image/.*'

  target_file = temp_file_name '', ".#{to}"
  executioner.execute(
    *[ 'convert',
    # params may be nil, a String of space-separated options, or an Array
    Array(params).map { |p| p.to_s.split(/ +/) },
    job.content_file,
    target_file ].flatten
  )
  raise ConversionFailed.new if executioner.last_exit_status != 0
  job.content = File.read(target_file)
  File.unlink(target_file)
end
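
How the argument list for 'convert' is assembled can be sketched on its own. `convert_command` is a hypothetical function written for illustration; it accepts nil, a space-separated String, or an Array for the optional parameters, which is an assumption about what callers pass.

```ruby
# Hypothetical helper: builds the argv for ImageMagick's 'convert'
# without running anything.
def convert_command(source, target, params = nil)
  extra = Array(params)                        # nil -> [], String -> [String]
          .flat_map { |p| p.to_s.split(/ +/) } # split on runs of spaces
          .reject(&:empty?)                    # drop blanks from leading spaces
  ['convert', *extra, source, target]
end

p convert_command('in.png', 'out.tiff')
p convert_command('in.png', 'out.tiff', '-density 300')
p convert_command('in.png', 'out.tiff', ['-density', '300'])
```

Passing each option as a separate array element avoids any shell quoting, at the cost of not supporting option values that contain spaces when a single String is given.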

#expect_mime_type(pattern) ⇒ Object

Compares the MIME type of the job's current content with the given pattern, raising InvalidMimeTypeInStep if it does not match. The pattern is interpreted as a regular expression. This is a helper method for mixin methods.

Parameters:

  • pattern (String)

    a regex pattern, e.g. “image/.*”

Raises:

  • (InvalidMimeTypeInStep)

    if the job's MIME type does not match the pattern.


# File 'lib/heathen/processor.rb', line 31

def expect_mime_type pattern
  raise InvalidMimeTypeInStep.new(pattern,job.mime_type) unless job.mime_type =~ %r[#{pattern}]
end
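
Because the pattern is interpolated into a regular expression and is not anchored, it matches anywhere in the MIME type string. The sketch below isolates that check; `mime_type_matches?` is a hypothetical function, not part of Heathen.

```ruby
# Hypothetical stand-in for the test inside #expect_mime_type.
def mime_type_matches?(pattern, mime_type)
  # Same construction as the original: the String becomes a Regexp.
  !!(mime_type =~ %r[#{pattern}])
end

puts mime_type_matches?('image/.*', 'image/tiff')       # regex wildcard
puts mime_type_matches?('image/.*', 'application/pdf')
# Unanchored: a pattern also matches a longer string that contains it.
puts mime_type_matches?('text/html', 'text/html; charset=utf-8')
# Anchors can be added by the caller when an exact match is wanted.
puts mime_type_matches?('\Atext/html\z', 'text/html; charset=utf-8')
```

Note that a glob such as 'image/*' is not equivalent to 'image/.*': as a regex it means "image" followed by zero or more slashes.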

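#htmltotext ⇒ Object

Extracts the text of an HTML document's body after removing its script and link elements. Sets the job content to the extracted text. Raises ConversionFailed if the document cannot be parsed.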


# File 'lib/heathen/processor_methods/htmltotext.rb', line 5

def htmltotext
  expect_mime_type 'text/html'

  begin
    doc = Nokogiri::HTML(File.open(job.content_file))

    # Strip JS / CSS from the file so it doesn't appear in the output
    doc.css('script, link').each { |node| node.remove }

    text = doc.css('body').text
  rescue Nokogiri::SyntaxError => e
    raise ConversionFailed.new(e)
  end

  job.content = text
end

#libreoffice(format:) ⇒ Object

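Converts office documents to their counterpart (e.g. MS Word -> LibreOffice Writer, or MS Excel -> LibreOffice Calc), to PDF, or to plain text. Calls the external 'libreoffice' utility for document conversions and 'tika' for text extraction. Sets the job content to the converted document.

Parameters:

  • format (String)

    the output format. Must be one of:

    pdf      - convert to PDF (any LibreOffice-readable format)
    msoffice - corresponding Microsoft format
    ooffice  - corresponding LibreOffice format
    txt      - plain text, extracted with tika

Raises:

  • (InvalidParameterInStep)

    if format is not one of the values above.

  • (InvalidMimeTypeInStep)

    if the job's MIME type has no counterpart in the requested format.

  • (ConversionFailed)

    if the converted file cannot be found or the external program exits with a non-zero status.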

# File 'lib/heathen/processor_methods/libreoffice.rb', line 11

def libreoffice( format: )
  suffixes = {
    'pdf' => {
      '.*' => 'pdf',
    },
    'msoffice' => {
      'application/vnd.openxmlformats-officedocument.wordprocessingml.document' => 'docx',
      'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' => 'xlsx',
      'application/vnd.openxmlformats-officedocument.presentationml.presentation' => 'pptx',
      'application/vnd.oasis.opendocument.text' => 'docx',
      'application/vnd.oasis.opendocument.spreadsheet' => 'xlsx',
      'application/vnd.oasis.opendocument.presentation' => 'pptx',
      'application/zip' => 'docx',
    },
    'ooffice' => {
      'application/msword' => 'odt',
      'application/vnd.ms-word' => 'odt',
      'application/vnd.ms-excel' => 'ods',
      'application/vnd.ms-office' => 'odt',
      'application/vnd.ms-powerpoint' => 'odp',
      'application/vnd.openxmlformats-officedocument.wordprocessingml.document' => 'odt',
      'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' => 'ods',
      'application/vnd.openxmlformats-officedocument.presentationml.presentation' => 'odp',
    },
    'txt' => {
      '.*' => 'txt'
    }
  }

  raise InvalidParameterInStep.new('format', format) unless suffixes[format.to_s]
  to_suffix = nil
  suffixes[format.to_s].each do |k,v|
    to_suffix = v if job.mime_type =~ /#{k}/
  end
  raise InvalidMimeTypeInStep.new('(various document formats)', job.mime_type) unless to_suffix

  output = nil

  if to_suffix == 'txt'
    executioner.execute(
      'tika',
      '--text',
      job.content_file,
      binary: true
    )

    output = executioner.stdout
  else
    target_file = "#{job.content_file}.#{to_suffix}"
    executioner.execute(
      'libreoffice',
      '--convert-to', to_suffix,
      '--outdir', sandbox_dir,
      job.content_file,
      '--headless',
    )

    unless File.exist? target_file
      raise ConversionFailed.new("Cannot find converted file (looking for #{File.basename(target_file)})" )
    end

    output = File.read(target_file)
    File.unlink(target_file)
  end

  raise ConversionFailed.new(executioner.last_messages) if executioner.last_exit_status != 0

  job.content = output
end
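
The suffix lookup at the top of #libreoffice can be studied separately from the external calls. The sketch below uses a trimmed copy of the table; `SUFFIXES` and `target_suffix` are hypothetical names, and ArgumentError stands in for InvalidParameterInStep, which is not defined here.

```ruby
# Trimmed, illustrative copy of the format -> (MIME pattern -> suffix) table.
SUFFIXES = {
  'pdf' => { '.*' => 'pdf' },
  'ooffice' => {
    'application/msword' => 'odt',
    'application/vnd.ms-excel' => 'ods'
  }
}.freeze

# Hypothetical helper: returns the target suffix, or nil when the MIME
# type has no counterpart in the requested format.
def target_suffix(format, mime_type)
  table = SUFFIXES[format.to_s]
  raise ArgumentError, "unknown format: #{format}" unless table

  match = nil
  # As in the original loop, keys are regex patterns and the last
  # matching entry wins.
  table.each { |pattern, suffix| match = suffix if mime_type =~ /#{pattern}/ }
  match
end

p target_suffix('pdf', 'application/msword')
p target_suffix(:ooffice, 'application/vnd.ms-excel')
p target_suffix('ooffice', 'image/png')
```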

#log ⇒ Object

Returns the logger that was passed to the constructor.


# File 'lib/heathen/processor.rb', line 56

def log
  @logger
end

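#pdftotext ⇒ Object

Extracts the text of a PDF document by running the external 'tika' utility with the --text option. Sets the job content to the extracted text.

Raises:

  • (ConversionFailed)

    if tika exits with a non-zero status.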


# File 'lib/heathen/processor_methods/pdftotext.rb', line 3

def pdftotext
  expect_mime_type 'application/pdf'

  executioner.execute(
    'tika',
    '--text',
    job.content_file,
    binary: true
  )
  raise ConversionFailed.new(executioner.last_messages) if executioner.last_exit_status != 0

  job.content = executioner.stdout
end

#perform_task(action) ⇒ Object

Performs a sub-task, defined by action. See [Task] for details.


# File 'lib/heathen/processor.rb', line 36

def perform_task action
  task_proc = Task.find(action, job.mime_type)[:proc]
  self.instance_eval &task_proc
end
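
The `instance_eval` call is what lets a task block invoke processing steps as if they were written inside the Processor. A minimal sketch of that mechanism follows; `TASKS` and `MiniProcessor` are hypothetical stand-ins, since the real Task class is not shown on this page.

```ruby
# Hypothetical registry mapping an action name to a block of steps.
TASKS = {
  'ocr' => proc do
    convert_image
    tesseract
  end
}.freeze

# Hypothetical miniature processor that only records which steps ran.
class MiniProcessor
  attr_reader :steps

  def initialize
    @steps = []
  end

  def convert_image
    @steps << :convert_image
  end

  def tesseract
    @steps << :tesseract
  end

  def perform_task(action)
    task_proc = TASKS.fetch(action)
    # Runs the block with self set to this instance, so the bare method
    # names inside the block resolve to this object's methods.
    instance_eval(&task_proc)
  end
end

processor = MiniProcessor.new
processor.perform_task('ocr')
p processor.steps
```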

#temp_file_name(prefix = '', suffix = '') ⇒ Object

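Generates a unique temporary file name inside the sandbox directory. The file itself is not created; the caller (or an external program) is expected to write to the returned path.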


# File 'lib/heathen/processor.rb', line 47

def temp_file_name prefix='', suffix=''
  Dir::Tmpname.create( [prefix,suffix], @sandbox_dir ){}
end
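
The behaviour of `Dir::Tmpname.create` with an empty block can be checked with a short standalone sketch. `unique_name` is a hypothetical function that takes the directory as an argument instead of reading @sandbox_dir.

```ruby
require 'tmpdir'

# Hypothetical stand-in for #temp_file_name with an explicit directory.
def unique_name(dir, prefix = '', suffix = '')
  # The empty block means no file is created; only the generated path
  # is returned.
  Dir::Tmpname.create([prefix, suffix], dir) {}
end

Dir.mktmpdir('sandbox') do |dir|
  name = unique_name(dir, 'page-', '.tiff')
  puts name
  puts "exists yet? #{File.exist?(name)}"
end
```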

#tesseract(format: nil) ⇒ Object

Performs OCR on the input document, which must be in TIFF format. Calls the 'tesseract' program to achieve this. Sets the job content to the result.

Parameters:

  • format (String) (defaults to: nil)

    the output format. Possibilities are nil, hocr and pdf:

    nil  - creates a text version
    hocr - creates a .hocr XML file preserving letter position
    pdf  - creates a .pdf file, consisting of the image backed by the text

# File 'lib/heathen/processor_methods/tesseract.rb', line 10

def tesseract format: nil
  expect_mime_type 'image/tiff'

  # Tesseract is given a three-letter language code, so look up the
  # job's language in ISO 639 first
  lang = ISO_639.find job.language
  raise InvalidLanguageInStep.new(job.language) if lang.nil?

  target_file = temp_file_name
  executioner.execute(
    *[ 'tesseract',
    job.content_file,
    target_file,
    '-l', lang.alpha3.downcase,
    format ].compact  # omit the format argument when it is nil
  )
  raise ConversionFailed.new(executioner.last_messages) if executioner.last_exit_status != 0
  suffix = format ? format : 'txt'
  target_file = "#{target_file}.#{suffix}"
  job.content = File.read(target_file)
  File.unlink(target_file)
end
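
The relationship between the format parameter, the command line and the output file name can be sketched without running tesseract. `tesseract_plan` is a hypothetical function; dropping a nil format from the argument list is an assumption about how the command should be built.

```ruby
# Hypothetical helper: returns the argv and the file tesseract is
# expected to write, without executing anything.
def tesseract_plan(input, output_base, lang, format = nil)
  # compact drops the format argument when it is nil.
  command = ['tesseract', input, output_base, '-l', lang, format].compact
  # tesseract appends the suffix to the output base name itself.
  suffix = format || 'txt'
  { command: command, result_file: "#{output_base}.#{suffix}" }
end

p tesseract_plan('scan.tiff', '/tmp/out', 'eng')
p tesseract_plan('scan.tiff', '/tmp/out', 'eng', 'hocr')
```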

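#wkhtmltopdf(params = '') ⇒ Object

Converts an HTML document to PDF by running the external 'wkhtmltopdf' utility. Sets the job content to the resulting PDF.

Parameters:

  • params (String) (defaults to: '')

    optional command-line parameters, separated by spaces.

Raises:

  • (ConversionFailed)

    if wkhtmltopdf exits with a non-zero status.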


# File 'lib/heathen/processor_methods/wkhtmltopdf.rb', line 5

def wkhtmltopdf params=''
  expect_mime_type 'text/html'

  target_file = temp_file_name
  wkhtmltopdf = Colore::C_.wkhtmltopdf_path || 'wkhtmltopdf'
  executioner.execute(
    *[wkhtmltopdf, '-q',
    _wkhtmltopdf_options(job.content),
    params.split(/ +/),
    job.content_file('.html'),
    target_file,
    ].flatten
  )
  @logger.error(executioner.last_messages[:stderr])
  raise ConversionFailed.new('PDF converter rejected the request') if executioner.last_exit_status != 0
  job.content = File.read(target_file)
  File.unlink(target_file)
end