GitHub - bgfeldm/Csv2Lucene: Fast CSV file indexer for Lucene Index

Csv2Lucene

Note: The code is largely a proof of concept.

Description:
   Csv2Lucene's goal is to bulk index a large number of very large CSV files
   quickly. Threads are created at the record level, not the file level.
   Multithreading on record lines instead of files has advantages for both
   speed and recovery.

Requirements:
   - Quickly index a large number of very large CSV files.
   - Thread on record lines, not on files.

Advantages of threading on the record level instead of by file:
   - A single huge database dump file can still be indexed by all threads.
   - Indexing stays fast until the very end, because all threads are kept busy
     until the last few lines.
   - Simpler recovery from abrupt application halts, since a smaller set of
     files is being read at any one time.
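
The record-level threading described above can be sketched roughly as follows.
This is a minimal illustration, not the project's actual code: the class name
RecordLevelIndexer, the method indexRecords, the queue size, and the handler
callback are all hypothetical. The handler stands in for whatever turns a CSV
line into a Lucene document and adds it to the index; no Lucene calls are made
here so that the sketch stays self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: one reader feeds record lines to a pool of workers. */
public class RecordLevelIndexer {
    // Sentinel object telling a worker that no more records will arrive.
    // Compared by identity, so a data line with the same text is not mistaken for it.
    private static final String POISON = new String("<<EOF>>");

    /** Reads every line from reader, hands each to a worker, and returns the line count. */
    public static int indexRecords(BufferedReader reader, int workers, Consumer<String> handler)
            throws IOException, InterruptedException {
        // Bounded queue (size is an arbitrary assumption) so the reader cannot outrun the workers.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    for (String line = queue.take(); line != POISON; line = queue.take()) {
                        try {
                            handler.accept(line); // placeholder for the real indexing step
                        } catch (RuntimeException e) {
                            // Keep the worker alive so the reader never blocks on a full queue.
                            System.err.println("Failed to index record: " + e);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            queue.put(line); // blocks while all workers are busy and the queue is full
            count++;
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON); // one sentinel per worker
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return count;
    }
}
```

Because work is handed out one line at a time, every worker stays busy until
the queue drains, regardless of how many files the lines came from.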

Note: More than one file can be read at a time. While the tail end of one file
is being read, the beginning of the next file is read as well, keeping a
continuous flow of record lines until the last record of the last file is read.
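
One way to get such a continuous flow of record lines is to present several
files as a single stream of lines. The sketch below is an assumption about how
this could be done, not the project's implementation; the class name
ContinuousRecordStream and the method lines are hypothetical, and UTF-8 input
is assumed. It does not treat CSV header rows specially.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical sketch: expose many CSV files as one continuous stream of lines. */
public class ContinuousRecordStream {

    /** Returns the lines of all files, in order, as one stream. Close the stream when done. */
    public static Stream<String> lines(List<Path> files) {
        // flatMap closes each per-file stream once its lines have been consumed.
        return files.stream().flatMap(ContinuousRecordStream::open);
    }

    private static Stream<String> open(Path file) {
        try {
            return Files.lines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Stream lambdas cannot throw checked exceptions, so wrap it.
            throw new UncheckedIOException(e);
        }
    }
}
```

A consumer of this stream never sees file boundaries, so worker threads can
still be finishing the last records of one file while the reader has already
moved on to the next.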