Merge pull request #2 from microsoft/init-release
Init release
toherman-msft authored Oct 26, 2024
2 parents 7ee1b04 + f20c2d5 commit c1ee29f
Showing 678 changed files with 80,733 additions and 44 deletions.
90 changes: 88 additions & 2 deletions .gitignore
@@ -1,7 +1,10 @@
## Ignore Visual Studio temporary files, build results, and
## files generated by popular Visual Studio add-ons.
##
## Get latest from https://github.com/github/gitignore/blob/main/VisualStudio.gitignore
## Get latest from `dotnet new gitignore`

# dotenv files
.env

# User-specific files
*.rsuser
@@ -57,11 +60,14 @@ dlldata.c
# Benchmark Results
BenchmarkDotNet.Artifacts/

# .NET Core
# .NET
project.lock.json
project.fragment.lock.json
artifacts/

# Tye
.tye/

# ASP.NET Scaffolding
ScaffoldingReadMe.txt

@@ -396,3 +402,83 @@ FodyWeavers.xsd

# JetBrains Rider
*.sln.iml
.idea/

##
## Visual studio for Mac
##


# globs
Makefile.in
*.userprefs
*.usertasks
config.make
config.status
aclocal.m4
install-sh
autom4te.cache/
*.tar.gz
tarballs/
test-results/

# Mac bundle stuff
*.dmg
*.app

# content below from: https://github.com/github/gitignore/blob/main/Global/macOS.gitignore
# General
.DS_Store
.AppleDouble
.LSOverride

# Icon must end with two \r
Icon


# Thumbnails
._*

# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk

# content below from: https://github.com/github/gitignore/blob/main/Global/Windows.gitignore
# Windows thumbnail cache files
Thumbs.db
ehthumbs.db
ehthumbs_vista.db

# Dump file
*.stackdump

# Folder config file
[Dd]esktop.ini

# Recycle Bin used on file shares
$RECYCLE.BIN/

# Windows Installer files
*.cab
*.msi
*.msix
*.msm
*.msp

# Windows shortcuts
*.lnk

# Vim temporary swap files
*.swp
50 changes: 50 additions & 0 deletions App/Data_Processing.md
@@ -0,0 +1,50 @@
## Content Processing
Additional details about how content processing is handled in the solution, including the workflow steps and how to use your own data.

### Workflow

1. <u>Document upload</u><br/>
Documents are added to blob storage. Processing is triggered when a file is checked in.

2. <u>Text extraction, context extraction (image)</u><br/>
Based on the file type, an appropriate processing pipeline is selected.

3. <u>Summarization</u><br/>
LLM summarization of the extracted content.

4. <u>Keyword and entity extraction</u><br/>
Keywords are extracted from the full document through an LLM prompt. If the document is too large, keywords are extracted from the summarization instead.

5. <u>Text chunking from text extraction results</u><br/>
Chunk size is aligned with the embedding model's input size.

6. <u>Vectorization</u><br/>
Creation of embeddings from chunked text using the text-embedding-3-large model.

7. <u>Save results to Azure AI Search index</u><br/>
Data is added to the index, including vectorized fields, text chunks, keywords, and entity-specific metadata.
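
The chunking and vectorization steps above (steps 5-7) can be pictured roughly as follows. This is an illustrative pure-Python sketch, not the solution's actual implementation: the word-window chunker and the `embed` stub stand in for the real tokenizer-aligned chunking and the text-embedding-3-large calls.

```python
# Illustrative sketch of chunking + vectorization (steps 5-7).
# A simple word-window chunker and a stub embedding stand in for
# the real tokenizer-aligned chunking and embedding-model calls.

def chunk_text(text: str, max_words: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
        if start + max_words >= len(words):
            break
    return chunks

def embed(chunk: str) -> list[float]:
    """Stand-in for an embedding call (e.g. text-embedding-3-large)."""
    # A real implementation would call the Azure OpenAI embeddings API.
    return [float(len(chunk)), float(chunk.count(" "))]

def build_index_documents(doc_id: str, text: str) -> list[dict]:
    """Produce the shape of records pushed to the search index."""
    return [
        {"id": f"{doc_id}-{i}", "content": chunk, "vector": embed(chunk)}
        for i, chunk in enumerate(chunk_text(text))
    ]

docs = build_index_documents("doc1", "word " * 1000)
print(len(docs))  # number of overlapping chunks produced
```

The overlap between consecutive chunks helps preserve context that would otherwise be cut at a chunk boundary.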

### Customizing With Your Own Documents

There are two methods to use your own data in this solution. It takes roughly 10-15 minutes for a file to be processed and appear in the index and in results on the web app.

1. <u>Web App - UI Uploading</u><br/>
Through the user interface, you can upload files that you would like processed. These files are uploaded to blob storage, processed, and added to the Azure AI Search index. File uploads are limited to 500MB and restricted to the following file formats: Office files, TXT, PDF, TIFF, JPG, PNG.

2. <u>Bulk File Processing</u><br/>
You can use bulk file processing by adding files directly to blob storage, since the web app saves uploaded files there as well. This is ideal for uploading a large number of documents or files that are large in size.

### Modifying Processing Prompts

Prompt-based processing is used for context extraction, summarization, and keyword/entity extraction. Modifications to the prompts will change what is extracted for the related workflow step.

You can find the prompt configuration text files for **summarization** and **keyword/entity** extraction in this folder:
```
\App\kernel-memory\service\Core\Prompts\
```

**Context extraction** requires a code re-compile. You can modify the prompt in this code file on <u>line 56</u>:

```
\App\kernel-memory\service\Core\DataFormats\Image\ImageContextDecoder.cs
```
52 changes: 52 additions & 0 deletions App/Technical_Architecture.md
@@ -0,0 +1,52 @@
## Technical Architecture

Additional details about the technical architecture of the Document Knowledge Mining solution accelerator. This describes the purpose and additional context of each component in the solution.

![image](../Images/readme/architecture.png)


### Ingress Controller
The solution uses Azure's Application Gateway Ingress Controller for Kubernetes, allowing for load balancing and dynamic traffic management across the application layer.

### Azure Kubernetes
Using Azure Kubernetes Service, the application is deployed as a managed containerized app. This is ideal for deploying a highly available, scalable, and portable application to multiple regions.

### Container Registry
Using Azure Container Registry, container images are built, stored, and managed in a private registry. These container images include the Document Processor, AI Service, and Web App.

### Web App
Using Azure App Service, a web app acts as the UI for the solution. The app is built with React and TypeScript and acts as an API client, providing document search, an easy-to-use upload and processing interface, and an LLM-powered conversational user interface.

### Service - Document Processor
Internal Kubernetes service for the document processing pods.

### Document Processor Pods
API endpoints to facilitate processing of documents that are stored in blob storage. This Azure Kubernetes pod handles saving document chunks, vectors, and keywords to Azure AI Search and blob storage. It extracts content and context from images in order to derive knowledge, keywords, topics, and summarizations. Based on the file type, different processing pipelines are run to extract the data in the appropriate steps.
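
The per-file-type pipeline selection can be sketched as a simple dispatch table. The extension-to-pipeline mapping and pipeline names here are illustrative assumptions; the actual service chooses its pipelines internally.

```python
# Illustrative dispatch of files to processing pipelines by type.
# The extension-to-pipeline mapping is an assumption for this
# sketch, not the service's actual routing table.
from pathlib import Path

PIPELINES = {
    ".pdf": "ocr_text_extraction",
    ".tiff": "ocr_text_extraction",
    ".jpg": "image_context_extraction",
    ".png": "image_context_extraction",
    ".txt": "plain_text",
    ".docx": "office_extraction",
}

def select_pipeline(filename: str) -> str:
    """Pick a pipeline from the file extension (case-insensitive)."""
    ext = Path(filename).suffix.lower()
    return PIPELINES.get(ext, "unsupported")

print(select_pipeline("scan.TIFF"))  # ocr_text_extraction
```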

### Service - AI Service
Internal Kubernetes service for the AI service pods.

### AI Service Processor Pods
This Azure Kubernetes pod acts as the solution's orchestration layer (with Semantic Kernel) for interaction with the LLM for the web app. It also includes chat endpoints (synchronous and asynchronous) to stream chat conversations on the web app and to save chat history. It facilitates saving document metadata, keywords, and summarizations to Cosmos DB to show them through the web app's user interface.
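
The streaming chat endpoint pattern can be pictured with a minimal async generator. This is a sketch of the pattern only: the token source is a stub standing in for the Semantic Kernel / Azure OpenAI streaming call, and the history list stands in for Cosmos DB persistence.

```python
# Minimal sketch of a streaming chat endpoint: an async generator
# yields tokens as they arrive, and the full reply plus the user
# turn are appended to chat history afterward. The token source is
# a stub standing in for the real LLM streaming call.
import asyncio

async def stream_completion(prompt: str):
    for token in ["Hello", " ", "world"]:  # stub LLM token stream
        await asyncio.sleep(0)
        yield token

async def chat(prompt: str, history: list[dict]) -> str:
    parts = []
    async for token in stream_completion(prompt):
        parts.append(token)  # a real endpoint would flush each token
    reply = "".join(parts)
    history.append({"role": "user", "content": prompt})
    history.append({"role": "assistant", "content": reply})
    return reply

history: list[dict] = []
print(asyncio.run(chat("hi", history)))  # Hello world
```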

### App Configuration
Using Azure App Configuration, app settings and configurations are centralized and used with the Document Processor Service, AI Service, and Web App.

### Storage Queue
Using Azure Storage Queue, pipeline work steps and processing jobs are added to the storage queue to be picked up and run as their respective jobs. Uploaded files are queued while being saved to blob storage, and their queue messages are removed after successful completion.
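
The queue-driven processing pattern described above can be sketched with Python's standard-library queue standing in for Azure Storage Queue; the message shape is an illustrative assumption.

```python
# Sketch of the queue-driven processing pattern. Python's stdlib
# queue stands in for Azure Storage Queue: an upload enqueues a job
# message, a worker dequeues and processes it, and the message is
# marked done only after successful completion.
import json
import queue

jobs: "queue.Queue[str]" = queue.Queue()

def enqueue_upload(blob_name: str) -> None:
    """Enqueue a processing job for a newly uploaded blob."""
    jobs.put(json.dumps({"step": "extract", "blob": blob_name}))

def work_one() -> dict:
    """Dequeue one job, process it, then acknowledge it."""
    message = json.loads(jobs.get())
    # ... run the pipeline step for message["blob"] here ...
    jobs.task_done()  # remove the message only after success
    return message

enqueue_upload("contract.pdf")
print(work_one()["blob"])  # contract.pdf
```

Acknowledging only after success means a failed worker leaves the message available for a retry, which is the same at-least-once semantics Azure Storage Queue provides.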

### Azure AI Search
Processed and extracted document information is added to an Azure AI Search vectorized index. This vectorized index includes columns relevant to the document set and is integrated with the web app to power the document search and document chat experience.
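
Conceptually, a vectorized-index query scores stored chunk vectors against a query vector and returns the closest chunks. The pure-Python cosine-similarity sketch below illustrates the idea only; the real solution does this server-side in Azure AI Search over text-embedding-3-large vectors, and the chunk texts and vectors here are made up.

```python
# Illustration of what a vectorized-index query does: rank stored
# chunk vectors against a query vector by cosine similarity. Real
# scoring happens server-side in Azure AI Search.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy "index" of chunks with 2-D stand-in vectors.
index = [
    {"chunk": "lease terms", "vector": [1.0, 0.0]},
    {"chunk": "payment schedule", "vector": [0.0, 1.0]},
]

def vector_search(query_vec: list[float], top: int = 1) -> list[str]:
    ranked = sorted(index, key=lambda d: cosine(query_vec, d["vector"]),
                    reverse=True)
    return [d["chunk"] for d in ranked[:top]]

print(vector_search([0.9, 0.1]))  # ['lease terms']
```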

### Azure Document Intelligence
One step of the data processing workflow where documents have Optical Character Recognition (OCR) applied to extract data. This includes text and handwriting extraction from documents.

### GPT 4o mini
Using Azure OpenAI, a deployment of the GPT 4o mini model (version 2024-07-18) is used during the data processing workflow to extract content, context, keywords, knowledge, topics, and summarizations. This model is also used in the web app's chat experience. It can be swapped for a different Azure OpenAI model if desired, but this has not been thoroughly tested and may be affected by output token limits.

### Blob Storage
Using Azure Blob Storage, unprocessed documents are stored as blobs. The data processing workflow reads each file and saves JSON, text chunks, markdown, embedded text, and metadata (including keywords and a summarization) of the processed data back to blob storage. Files uploaded through the web app's upload capabilities are stored here.


### Cosmos DB for MongoDB
Using Azure Cosmos DB for MongoDB, the processing results for each processed document are saved to a collection, and the web app chat experience saves chat history to another. The processed document results and chat history are used to inform prompt recommendations and answers.
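
The two kinds of records described above can be pictured as simple documents. The field names below are illustrative assumptions, not the solution's actual schema; with `pymongo` they would be inserted via something like `db["results"].insert_one(...)`.

```python
# Sketch of the two record shapes saved to Cosmos DB for MongoDB:
# per-document processing results and per-session chat turns.
# All field names are illustrative assumptions.
from datetime import datetime, timezone

def processing_result(doc_id: str, keywords: list[str], summary: str) -> dict:
    return {
        "documentId": doc_id,
        "keywords": keywords,
        "summary": summary,
        "processedAt": datetime.now(timezone.utc).isoformat(),
    }

def chat_turn(session_id: str, question: str, answer: str) -> dict:
    return {
        "sessionId": session_id,
        "question": question,
        "answer": answer,
        "createdAt": datetime.now(timezone.utc).isoformat(),
    }

record = processing_result("doc1", ["lease", "tenant"], "A lease agreement.")
print(record["keywords"])  # ['lease', 'tenant']
```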
30 changes: 30 additions & 0 deletions App/backend-api/.dockerignore
@@ -0,0 +1,30 @@
**/.classpath
**/.dockerignore
**/.env
**/.git
**/.gitignore
**/.project
**/.settings
**/.toolstarget
**/.vs
**/.vscode
**/*.*proj.user
**/*.dbmdl
**/*.jfm
**/azds.yaml
**/bin
**/charts
**/docker-compose*
**/Dockerfile*
**/node_modules
**/npm-debug.log
**/obj
**/secrets.dev.yaml
**/values.dev.yaml
LICENSE
README.md
!**/.gitignore
!.git/HEAD
!.git/config
!.git/packed-refs
!.git/refs/heads/**
