Class: MartSearch::IndexBuilder
- Inherits:
- Object
- Includes:
- MartSearch, IndexBuilderUtils, Utils
- Defined in:
- lib/martsearch/index_builder.rb
Overview
This class is responsible for building and updating a Solr search index for use with a MartSearch application.
Constant Summary
Constants included from MartSearch
Instance Attribute Summary
- (Object) builder_config (readonly)
  Returns the value of attribute builder_config.
- (Object) document_cache (readonly)
  Returns the value of attribute document_cache.
- (Object) index_config (readonly)
  Returns the value of attribute index_config.
- (Object) log (readonly)
  Returns the value of attribute log.
Instance Method Summary
- (Object) fetch_datasets
  Function to control the dataset download process.
- (IndexBuilder) initialize (constructor)
  A new instance of IndexBuilder.
- (Object) process_datasets
  Function to control the processing of the dataset downloads.
- (Object) save_solr_document_xmls
  Function to build and store the XML files needed to update a Solr index, based on the @document_cache stored in the current instance.
- (Object) send_xml_to_solr
  Function to send all of the XML files to the Solr instance.
Methods included from IndexBuilderUtils
#extract_value_to_index, #index_concatenated_ontology_terms, #index_extracted_attributes, #index_grouped_attributes, #index_ontology_terms, #new_document, #open_daily_directory, #process_attribute_map, #setup_and_move_to_work_directory, #solr_document_xml
Methods included from Utils
#build_http_client, #convert_array_to_hash
Constructor Details
- (IndexBuilder) initialize
A new instance of IndexBuilder
# File 'lib/martsearch/index_builder.rb', line 16

def initialize()
  ms_config           = MartSearch::Controller.instance().config
  @index_config       = ms_config[:index]
  @builder_config     = ms_config[:index_builder]
  @datasources_config = ms_config[:datasources]

  @builder_config[:number_of_docs_per_xml_file] = 1000

  @log                 = Logger.new(STDOUT)
  @log.level           = Logger::DEBUG
  @log.datetime_format = "%Y-%m-%d %H:%M:%S "

  # Create a document cache, and a helper lookup variable
  @file_based_cache      = false
  @document_cache        = {}
  @document_cache_keys   = {}
  @document_cache_lookup = {}

  # Set up an in-memory ontology cache - this will reduce the amount
  # of repetitive graph traversal and computation we need to do
  @ontology_cache = {}
end
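The logger setup in the constructor can be tried in isolation. The following is a minimal sketch using only Ruby's standard `Logger`; the `build_logger` helper and its `io` parameter are hypothetical additions (not part of MartSearch) so that the output can be captured instead of going to STDOUT.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper: builds a logger with the same level and
# timestamp format that the constructor applies to @log.
# `io` is an assumption added here; the constructor always uses STDOUT.
def build_logger(io = STDOUT)
  log = Logger.new(io)
  log.level = Logger::DEBUG
  log.datetime_format = "%Y-%m-%d %H:%M:%S "
  log
end

# Capture one message in memory and print the formatted line.
buffer = StringIO.new
build_logger(buffer).info "Starting dataset downloads..."
puts buffer.string
```

Because the level is `Logger::DEBUG`, every message the builder emits is written; raising the level to `Logger::INFO` would be one way to quieten a long indexing run.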
Instance Attribute Details
- (Object) builder_config (readonly)
Returns the value of attribute builder_config
# File 'lib/martsearch/index_builder.rb', line 14

def builder_config
  @builder_config
end
- (Object) document_cache (readonly)
Returns the value of attribute document_cache
# File 'lib/martsearch/index_builder.rb', line 14

def document_cache
  @document_cache
end
- (Object) index_config (readonly)
Returns the value of attribute index_config
# File 'lib/martsearch/index_builder.rb', line 14

def index_config
  @index_config
end
- (Object) log (readonly)
Returns the value of attribute log
# File 'lib/martsearch/index_builder.rb', line 14

def log
  @log
end
Instance Method Details
- (Object) fetch_datasets
Function to control the dataset download process. Determines whether each dataset needs to be downloaded (based on the age of its current dump file and the 'days_between_downlads' configuration option), then downloads only the datasets that need it.
# File 'lib/martsearch/index_builder.rb', line 42

def fetch_datasets
  @log.info "Starting dataset downloads..."

  pwd = Dir.pwd
  setup_and_move_to_work_directory()

  # First see which datasets we need to download (based on the age
  # of the 'current' dump file).
  Dir.chdir('dataset_dowloads/current')
  datasets_to_download = []

  @builder_config[:datasets_to_index].each do |ds|
    ds_conf = @builder_config[:datasets][ds.to_sym]

    if File.exists?("#{ds}.marshal")
      # NOTE: the two local variable names below are missing from this
      # listing; file_mtime and now are stand-in names.
      file_mtime       = File.new("#{ds}.marshal").mtime
      now              = Time.now()
      file_age_in_days = ( ( ( ( now - file_mtime ).round / 60 ) / 60 ) / 24 )

      if file_age_in_days >= ds_conf[:indexing][:days_between_downlads]
        datasets_to_download.push(ds)
      end
    else
      datasets_to_download.push(ds)
    end
  end

  open_daily_directory( 'dataset_dowloads', false )

  Parallel.each( datasets_to_download, :in_threads => 10 ) do |ds|
    # datasets_to_download.each do |ds|
    # puts " - #{ds}: requesting data"
    @log.info " - #{ds}: requesting data"
    results = fetch_dataset( ds )
    # puts " - #{ds}: #{results[:data].size} rows of data returned"
    @log.info " - #{ds}: #{results[:data].size} rows of data returned"
  end

  @log.info "Dataset downloads completed."
  Dir.chdir(pwd)
end
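The download decision above comes down to one age check per dump file. Here is a minimal standalone sketch of that check, assuming nothing beyond the Ruby standard library; `file_age_in_days`, `needs_download?` and the `now` parameter are hypothetical names introduced for illustration, not MartSearch methods. It floors the age to whole days, which may differ from the listing's rounded integer arithmetic by at most a second at a day boundary.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Whole days elapsed between the file's modification time and `now`.
# `now` is injectable so the calculation can be checked deterministically.
def file_age_in_days(path, now = Time.now)
  ((now - File.mtime(path)) / SECONDS_PER_DAY).floor
end

# A dataset needs downloading when its dump file is missing, or when the
# dump is at least `max_age_days` old (the role played by the
# 'days_between_downlads' option in the listing above).
def needs_download?(path, max_age_days, now = Time.now)
  return true unless File.exist?(path)
  file_age_in_days(path, now) >= max_age_days
end
```

With a threshold of 0, every existing file satisfies the `>=` comparison, so every dataset would be downloaded again on each run.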
- (Object) process_datasets
Function to control the processing of the dataset downloads. Once the processing is complete it will also save the @document_cache to disk.
# File 'lib/martsearch/index_builder.rb', line 85

def process_datasets
  @log.info "Starting dataset processing..."

  pwd = Dir.pwd
  setup_and_move_to_work_directory()
  Dir.chdir('dataset_dowloads/current')

  @builder_config[:datasets_to_index].each do |ds|
    @log.info " - #{ds}: processing results"
    process_dataset(ds)
    clean_document_cache()
    @log.info " - #{ds}: processing results complete"
  end

  @log.info "Finished dataset processing."

  @log.info "Saving @document_cache to disk."
  save_document_cache()

  Dir.chdir(pwd)
end
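The method above saves the working directory in `pwd` and restores it on the last line, so an exception raised part-way through would leave the process in the work directory. As a sketch of one alternative (not how MartSearch does it), the block form of Ruby's `Dir.chdir` restores the previous directory when the block exits, including on an exception. The `in_directory` helper name is hypothetical.

```ruby
# Hypothetical helper: run the given block with `dir` as the working
# directory. Dir.chdir's block form switches back to the previous
# directory when the block exits, whether normally or via an exception.
def in_directory(dir)
  Dir.chdir(dir) { yield }
end
```

A call such as `in_directory('dataset_dowloads/current') { ... }` would then wrap the processing loop without a separate `Dir.chdir(pwd)` at the end.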
- (Object) save_solr_document_xmls
Function to build and store the XML files needed to update a Solr index, based on the @document_cache stored in the current instance.
# File 'lib/martsearch/index_builder.rb', line 109

def save_solr_document_xmls
  pwd = Dir.pwd
  open_daily_directory( 'solr_xml' )

  batch_size = @builder_config[:number_of_docs_per_xml_file]
  @log.info "Creating Solr XML files (#{batch_size} docs per file)..."

  open_stored_document_cache if @document_cache_keys.empty?

  doc_chunks      = @document_cache_keys.keys.chunk( batch_size )
  doc_chunks_size = doc_chunks.size - 1

  Parallel.each( (0..doc_chunks_size), :in_threads => 5 ) do |chunk_number|
    @log.info " - writing solr-xml-#{chunk_number+1}.xml"

    doc_names = doc_chunks[chunk_number]
    docs      = []
    doc_names.each do |name|
      docs.push( get_document( name ) )
    end

    file = File.open( "solr-xml-#{chunk_number+1}.xml", "w" )
    file.print solr_document_xml(docs)
    file.close
  end

  Dir.chdir(pwd)
end
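The batching step above splits the cached document names into groups of `batch_size` and writes one numbered file per group. In the listing, `chunk( batch_size )` is presumably an Array extension defined elsewhere in MartSearch, since Ruby's built-in `Enumerable#chunk` takes a block rather than a size. The sketch below shows the same grouping with the standard `each_slice`; `xml_batches` is a hypothetical name used only for illustration.

```ruby
# Hypothetical helper: pair each batch of document names with the
# file name it would be written to (numbering starts at 1, as above).
def xml_batches(doc_names, batch_size)
  doc_names.each_slice(batch_size).each_with_index.map do |names, index|
    ["solr-xml-#{index + 1}.xml", names]
  end
end

# Five names in batches of two give three files; the last holds one name.
xml_batches(%w[doc_a doc_b doc_c doc_d doc_e], 2).each do |file_name, names|
  puts "#{file_name}: #{names.join(', ')}"
end
```

With the constructor's value of 1000 documents per file, a cache of 2500 documents would produce three files under this scheme.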
- (Object) send_xml_to_solr
Function to send all of the XML files to the Solr instance.
# File 'lib/martsearch/index_builder.rb', line 138

def send_xml_to_solr
  pwd = Dir.pwd
  open_daily_directory( 'solr_xml', false )

  client    = build_http_client()
  index_url = "#{@index_config[:builder_url]}/update"
  url       = URI.parse( index_url )

  client.start( url.host, url.port ) do |http|
    @log.info "Sending XML files to Solr (#{index_url})"
    Dir.glob("solr-xml-*.xml").each do |file|
      @log.info " - #{file}"
      data = File.read( file )
      res  = http.post( url.path, data, { 'Content-type' => 'text/xml; charset=utf-8' } )

      if res.code.to_i != 200
        raise "Error uploading #{file} to index!\ncode: #{res.code}\nbody: #{res.body}"
      end
    end

    @log.info " - commiting and optimising updates"
    ['<commit/>','<optimize/>'].each do |task|
      res = http.post( url.path, task, { 'Content-type' => 'text/xml; charset=utf-8' } )

      if res.code.to_i != 200
        raise "Error sending #{task} instruction to index!\ncode: #{res.code}\nbody: #{res.body}"
      end
    end
  end

  Dir.chdir(pwd)
end
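Each upload above is a POST of an XML body to the index's `/update` path, followed by `<commit/>` and `<optimize/>` messages sent the same way. The sketch below only builds those requests with Ruby's standard `Net::HTTP::Post` and does not send anything, so it can be inspected without a running Solr instance. The helper names and the example URL are hypothetical; the real builder obtains its client from `build_http_client` and its URL from `@index_config[:builder_url]`.

```ruby
require 'net/http'
require 'uri'

SOLR_XML_HEADERS = { 'Content-Type' => 'text/xml; charset=utf-8' }.freeze

# Hypothetical helper: build (but do not send) a POST request carrying
# an XML update message for the index at `index_url`.
def solr_update_request(index_url, xml)
  uri = URI.parse("#{index_url}/update")
  request = Net::HTTP::Post.new(uri.path, SOLR_XML_HEADERS)
  request.body = xml
  request
end

# Hypothetical helper: the two follow-up messages sent after the
# document files, in the same order as in the listing above.
def solr_finalize_requests(index_url)
  ['<commit/>', '<optimize/>'].map { |task| solr_update_request(index_url, task) }
end

# Example with an assumed local Solr URL.
request = solr_update_request('http://localhost:8983/solr', '<commit/>')
puts "#{request.method} #{request.path}"
```

Sending such a request would use `Net::HTTP.start(host, port) { |http| http.request(request) }`, after which the response code can be compared against 200 as the method above does.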