file mon module

Sri Harsha Boda edited this page Sep 15, 2017 · 1 revision

Create a new module called file-ingestion in the IM repo.

Requirement:

When started, the program scans the entire source directory, uploads all matching files to HDFS, registers each uploaded file in the FILE table, and enqueues it for the downstream processes. After the initial scan and load, it keeps monitoring the source directory and uploads new matching files to HDFS. It must upload a file only after the file has been completely written to disk (no partial uploads). Once started, the program must run continuously and must not terminate on its own. After a successful upload to HDFS, the source file is either deleted or renamed to _archived.

https://docs.oracle.com/javase/tutorial/essential/io/notification.html
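
The tutorial linked above covers java.nio.file.WatchService, which fits the scan-then-monitor requirement. The following is only a minimal sketch of that flow, not the module's actual implementation. The class name FileMonSketch, the method names, the one-second stability wait, and the handler callback (standing in for the HDFS upload, FILE table registration, and enqueue steps) are all assumptions made for illustration. The "completely written" check is a heuristic: a file is treated as complete when its size and modification time stay unchanged across a short wait.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; names and timings are assumptions, not part of the requirement.
final class FileMonSketch {

    /** Initial scan: every regular file in dir whose name matches the glob. */
    static List<Path> initialScan(Path dir, String glob) throws IOException {
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    matches.add(p);
                }
            }
        }
        return matches;
    }

    /** Heuristic: complete if size and mtime are unchanged across the wait. */
    static boolean looksComplete(Path file, long waitMillis)
            throws IOException, InterruptedException {
        long size = Files.size(file);
        long mtime = Files.getLastModifiedTime(file).toMillis();
        Thread.sleep(waitMillis);
        return Files.size(file) == size
                && Files.getLastModifiedTime(file).toMillis() == mtime;
    }

    /** Waits until the file looks complete, then hands it to the handler. */
    private static void deliver(Path file, Consumer<Path> handler)
            throws InterruptedException {
        try {
            while (!looksComplete(file, 1000)) {
                // still being written; looksComplete sleeps between checks
            }
            handler.accept(file);
        } catch (IOException e) {
            // Log and keep running: the monitor must not terminate on its own.
            System.err.println("Skipping " + file + ": " + e);
        }
    }

    /** Scans once, then blocks and delivers newly created matching files. */
    static void monitor(Path dir, String glob, Consumer<Path> handler)
            throws IOException, InterruptedException {
        PathMatcher matcher = dir.getFileSystem().getPathMatcher("glob:" + glob);
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            // Register before scanning so files created during the scan are
            // not missed; the handler may then see such a file twice.
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            for (Path p : initialScan(dir, glob)) {
                deliver(p, handler);
            }
            while (true) {
                WatchKey key = watcher.take();
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        continue; // events were lost; a real module might rescan here
                    }
                    Path name = (Path) event.context();
                    Path file = dir.resolve(name);
                    if (matcher.matches(name) && Files.isRegularFile(file)) {
                        deliver(file, handler);
                    }
                }
                if (!key.reset()) {
                    throw new IOException("Directory no longer accessible: " + dir);
                }
            }
        }
    }
}
```

A caller could start it with something like `FileMonSketch.monitor(Paths.get("/data/incoming"), "*.csv", path -> upload(path))`, where the directory, the pattern, and `upload` are placeholders. Because a file can be reported by both the initial scan and the watcher, the handler should tolerate being called twice for the same file.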

The program will accept the following arguments:

  • --monitoring-dir (picked from the properties table; the directory that the filemon program monitors constantly)
  • --hdfs-upload-dir (picked from the properties table; the HDFS directory to which matching files are uploaded)
  • --delete-copied-source <true | false> (if true, the source file is deleted after it has been uploaded to HDFS)
  • --file-name-pattern (only source files matching this pattern are uploaded to HDFS)
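
As a rough illustration of how these arguments might be consumed, the sketch below parses them as `--name value` pairs and applies --delete-copied-source after an upload. Several things here are assumptions rather than requirements: the class name FileMonArgs, reading all four values from the command line (the properties-table lookup is not shown), and interpreting "renamed to _archived" as appending the suffix _archived to the file name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for the four arguments listed above.
final class FileMonArgs {
    final Path monitoringDir;
    final String hdfsUploadDir;
    final boolean deleteCopiedSource;
    final String fileNamePattern;

    private FileMonArgs(Map<String, String> opts) {
        monitoringDir = Paths.get(require(opts, "--monitoring-dir"));
        hdfsUploadDir = require(opts, "--hdfs-upload-dir");
        deleteCopiedSource =
                Boolean.parseBoolean(require(opts, "--delete-copied-source"));
        fileNamePattern = require(opts, "--file-name-pattern");
    }

    /** Parses "--name value" pairs; every argument is treated as mandatory. */
    static FileMonArgs parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Expected --name value pairs");
        }
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            opts.put(args[i], args[i + 1]);
        }
        return new FileMonArgs(opts);
    }

    private static String require(Map<String, String> opts, String name) {
        String value = opts.get(name);
        if (value == null) {
            throw new IllegalArgumentException("Missing argument: " + name);
        }
        return value;
    }

    /**
     * Call only after a successful upload. Deletes the source, or renames it
     * by appending "_archived" (assumed naming). Returns the archived path,
     * or null if the source was deleted.
     */
    Path disposeOfSource(Path source) throws IOException {
        if (deleteCopiedSource) {
            Files.delete(source);
            return null;
        }
        Path archived = source.resolveSibling(source.getFileName() + "_archived");
        return Files.move(source, archived, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

With --delete-copied-source false and this naming assumption, a source file data.csv would end up as data.csv_archived in the same directory. Whether the archived name should still be excluded by --file-name-pattern is left open by the requirement.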