Skip to content

bgfeldm/Csv2Lucene

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Csv2Lucene

Note: code is largely a proof of concept.

Description:
   Csv2Lucene's goal is to bulk index a large amount of huge CSV files quickly.
   The focus is on the record level and not the file level when creating threads. 
   Multi-Threading on record lines instead of files has advantages when it comes
   to speed as well as recovery.

Requirements:
   - Quickly index a large amount of huge CSV files.
   - Thread on record lines not on files.

Advantages of threading on the record level instead of by file:
   - Working on a single huge database dump file.
   - Faster to index until the very end, keeping all threads busy until the last few lines
   - Simpler recovery from abrupt application halts, since we are reading a smaller set of 
   files at a time. 

Note: More than one file can be read at a time, when reading the tail end of one file
the beginning of the next file is read, keeping a continuous flow of record lines 
until the last record of the last file is read.

About

Fast CSV file indexer for Lucene Index

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published