Merge pull request #2 from microsoft/init-release
Init release
toherman-msft authored Oct 26, 2024
2 parents 7ee1b04 + f20c2d5 commit c1ee29f
Showing 678 changed files with 80,733 additions and 44 deletions.
90 changes: 88 additions & 2 deletions .gitignore
@@ -1,7 +1,10 @@
## Ignore Visual Studio temporary files, build results, and
## files generated by popular Visual Studio add-ons.
##
## Get latest from https://github.com/github/gitignore/blob/main/VisualStudio.gitignore
## Get latest from `dotnet new gitignore`

# dotenv files
.env

# User-specific files
*.rsuser
@@ -57,11 +60,14 @@ dlldata.c
# Benchmark Results
BenchmarkDotNet.Artifacts/

# .NET Core
# .NET
project.lock.json
project.fragment.lock.json
artifacts/

# Tye
.tye/

# ASP.NET Scaffolding
ScaffoldingReadMe.txt

@@ -396,3 +402,83 @@ FodyWeavers.xsd

# JetBrains Rider
*.sln.iml
.idea/

##
## Visual studio for Mac
##


# globs
Makefile.in
*.userprefs
*.usertasks
config.make
config.status
aclocal.m4
install-sh
autom4te.cache/
*.tar.gz
tarballs/
test-results/

# Mac bundle stuff
*.dmg
*.app

# content below from: https://github.com/github/gitignore/blob/main/Global/macOS.gitignore
# General
.DS_Store
.AppleDouble
.LSOverride

# Icon must end with two \r
Icon


# Thumbnails
._*

# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk

# content below from: https://github.com/github/gitignore/blob/main/Global/Windows.gitignore
# Windows thumbnail cache files
Thumbs.db
ehthumbs.db
ehthumbs_vista.db

# Dump file
*.stackdump

# Folder config file
[Dd]esktop.ini

# Recycle Bin used on file shares
$RECYCLE.BIN/

# Windows Installer files
*.cab
*.msi
*.msix
*.msm
*.msp

# Windows shortcuts
*.lnk

# Vim temporary swap files
*.swp
50 changes: 50 additions & 0 deletions App/Data_Processing.md
@@ -0,0 +1,50 @@
## Content Processing
Additional details about how content processing is handled in the solution, including the workflow steps and how to use your own data.

### Workflow

1. <u>Document upload</u><br/>
Documents are added to blob storage. Processing is triggered when a file is checked in.

2. <u>Text extraction, context extraction (image)</u><br/>
Based on the file type, an appropriate processing pipeline is selected.

3. <u>Summarization</u><br/>
LLM summarization of the extracted content.

4. <u>Keyword and entity extraction</u><br/>
Keywords are extracted from the full document through an LLM prompt. If the document is too large, keywords are extracted from the summarization instead.

5. <u>Text chunking from text extraction results</u><br/>
Chunk size is aligned with the embedding model's input size.

6. <u>Vectorization</u><br/>
Creation of embeddings from chunked text using the text-embedding-3-large model.

7. <u>Save results to Azure AI Search index</u><br/>
Data is added to the index, including vectorized fields, text chunks, keywords, and entity-specific metadata.
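
The chunking and vectorization steps above (steps 5-7) can be pictured roughly as follows. This is an illustrative pure-Python sketch, not the solution's actual implementation: the word-window chunker and the `embed` stub stand in for the real tokenizer-aligned chunking and the text-embedding-3-large calls.

```python
# Illustrative sketch of chunking + vectorization (steps 5-7).
# A simple word-window chunker and a stub embedding stand in for
# the real tokenizer-aligned chunking and embedding-model calls.

def chunk_text(text: str, max_words: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
        if start + max_words >= len(words):
            break
    return chunks

def embed(chunk: str) -> list[float]:
    """Stand-in for an embedding call (e.g. text-embedding-3-large)."""
    # A real implementation would call the Azure OpenAI embeddings API.
    return [float(len(chunk)), float(chunk.count(" "))]

def build_index_documents(doc_id: str, text: str) -> list[dict]:
    """Produce the shape of records pushed to the search index."""
    return [
        {"id": f"{doc_id}-{i}", "content": chunk, "vector": embed(chunk)}
        for i, chunk in enumerate(chunk_text(text))
    ]

docs = build_index_documents("doc1", "word " * 1000)
print(len(docs))  # number of overlapping chunks produced
```

The overlap between consecutive chunks helps preserve context that would otherwise be cut at a chunk boundary.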

### Customizing With Your Own Documents

There are two methods to use your own data in this solution. It takes roughly 10-15 minutes for a file to be processed and appear in the index and in results on the web app.

1. <u>Web App - UI Uploading</u><br/>
Through the user interface, you can upload files that you would like processed. These files are uploaded to blob storage, processed, and added to the Azure AI Search index. File uploads are limited to 500MB and restricted to the following file formats: Office files, TXT, PDF, TIFF, JPG, PNG.

2. <u>Bulk File Processing</u><br/>
You can use bulk file processing by adding files directly to blob storage, since the web app saves uploaded files there as well. This is ideal for uploading a large number of documents or files that are large in size.

### Modifying Processing Prompts

Prompt-based processing is used for context extraction, summarization, and keyword/entity extraction. Modifications to the prompts will change what is extracted for the related workflow step.

You can find the prompt configuration text files for **summarization** and **keyword/entity** extraction in this folder:
```
\App\kernel-memory\service\Core\Prompts\
```

**Context extraction** requires a code re-compile. You can modify the prompt in this code file on <u>line 56</u>:

```
\App\kernel-memory\service\Core\DataFormats\Image\ImageContextDecoder.cs
```
52 changes: 52 additions & 0 deletions App/Technical_Architecture.md
@@ -0,0 +1,52 @@
## Technical Architecture

Additional details about the technical architecture of the Document Knowledge Mining solution accelerator. This describes the purpose and additional context of each component in the solution.

![image](../Images/readme/architecture.png)


### Ingress Controller
The solution uses Azure's Application Gateway Ingress Controller for Kubernetes, allowing for load balancing and dynamic traffic management across the application layer.

### Azure Kubernetes
Using Azure Kubernetes Service, the application is deployed as a managed containerized app. This is ideal for deploying a highly available, scalable, and portable application to multiple regions.

### Container Registry
Using Azure Container Registry, container images are built, stored, and managed in a private registry. These container images include the Document Processor, AI Service, and Web App.

### Web App
Using Azure App Service, a web app acts as the UI for the solution. The app is built with React and TypeScript and acts as an API client, providing document search, an easy-to-use upload and processing interface, and an LLM-powered conversational user interface.

### Service - Document Processor
Internal Kubernetes service for the document processing pods.

### Document Processor Pods
API endpoints to facilitate processing of documents that are stored in blob storage. This Azure Kubernetes pod handles saving document chunks, vectors, and keywords to Azure AI Search and blob storage. It extracts content and context from images in order to derive knowledge, keywords, topics, and summarizations. Based on the file type, different processing pipelines are run to extract the data in the appropriate steps.
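
The per-file-type pipeline selection can be sketched as a simple dispatch table. The extension-to-pipeline mapping and pipeline names here are illustrative assumptions; the actual service chooses its pipelines internally.

```python
# Illustrative dispatch of files to processing pipelines by type.
# The extension-to-pipeline mapping is an assumption for this
# sketch, not the service's actual routing table.
from pathlib import Path

PIPELINES = {
    ".pdf": "ocr_text_extraction",
    ".tiff": "ocr_text_extraction",
    ".jpg": "image_context_extraction",
    ".png": "image_context_extraction",
    ".txt": "plain_text",
    ".docx": "office_extraction",
}

def select_pipeline(filename: str) -> str:
    """Pick a pipeline from the file extension (case-insensitive)."""
    ext = Path(filename).suffix.lower()
    return PIPELINES.get(ext, "unsupported")

print(select_pipeline("scan.TIFF"))  # ocr_text_extraction
```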

### Service - AI Service
Internal Kubernetes service for the AI service pods.

### AI Service Processor Pods
This Azure Kubernetes pod acts as the solution's orchestration layer (with Semantic Kernel) for interaction with the LLM for the web app. It also includes chat endpoints (synchronous and asynchronous) to stream chat conversations on the web app and to save chat history. It facilitates saving document metadata, keywords, and summarizations to Cosmos DB to show them through the web app's user interface.
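
The streaming chat endpoint pattern can be pictured with a minimal async generator. This is a sketch of the pattern only: the token source is a stub standing in for the Semantic Kernel / Azure OpenAI streaming call, and the history list stands in for Cosmos DB persistence.

```python
# Minimal sketch of a streaming chat endpoint: an async generator
# yields tokens as they arrive, and the full reply plus the user
# turn are appended to chat history afterward. The token source is
# a stub standing in for the real LLM streaming call.
import asyncio

async def stream_completion(prompt: str):
    for token in ["Hello", " ", "world"]:  # stub LLM token stream
        await asyncio.sleep(0)
        yield token

async def chat(prompt: str, history: list[dict]) -> str:
    parts = []
    async for token in stream_completion(prompt):
        parts.append(token)  # a real endpoint would flush each token
    reply = "".join(parts)
    history.append({"role": "user", "content": prompt})
    history.append({"role": "assistant", "content": reply})
    return reply

history: list[dict] = []
print(asyncio.run(chat("hi", history)))  # Hello world
```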

### App Configuration
Using Azure App Configuration, app settings and configurations are centralized and used with the Document Processor Service, AI Service, and Web App.

### Storage Queue
Using Azure Storage Queue, pipeline work steps and processing jobs are added to the storage queue to be picked up and run as their respective jobs. Uploaded files are queued while being saved to blob storage, and their queue messages are removed after successful completion.
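
The queue-driven processing pattern described above can be sketched with Python's standard-library queue standing in for Azure Storage Queue; the message shape is an illustrative assumption.

```python
# Sketch of the queue-driven processing pattern. Python's stdlib
# queue stands in for Azure Storage Queue: an upload enqueues a job
# message, a worker dequeues and processes it, and the message is
# marked done only after successful completion.
import json
import queue

jobs: "queue.Queue[str]" = queue.Queue()

def enqueue_upload(blob_name: str) -> None:
    """Enqueue a processing job for a newly uploaded blob."""
    jobs.put(json.dumps({"step": "extract", "blob": blob_name}))

def work_one() -> dict:
    """Dequeue one job, process it, then acknowledge it."""
    message = json.loads(jobs.get())
    # ... run the pipeline step for message["blob"] here ...
    jobs.task_done()  # remove the message only after success
    return message

enqueue_upload("contract.pdf")
print(work_one()["blob"])  # contract.pdf
```

Acknowledging only after success means a failed worker leaves the message available for a retry, which is the same at-least-once semantics Azure Storage Queue provides.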

### Azure AI Search
Processed and extracted document information is added to an Azure AI Search vectorized index. This vectorized index includes columns relevant to the document set and is integrated with the web app to power the document search and document chat experience.
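
Conceptually, a vectorized-index query scores stored chunk vectors against a query vector and returns the closest chunks. The pure-Python cosine-similarity sketch below illustrates the idea only; the real solution does this server-side in Azure AI Search over text-embedding-3-large vectors, and the chunk texts and vectors here are made up.

```python
# Illustration of what a vectorized-index query does: rank stored
# chunk vectors against a query vector by cosine similarity. Real
# scoring happens server-side in Azure AI Search.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy "index" of chunks with 2-D stand-in vectors.
index = [
    {"chunk": "lease terms", "vector": [1.0, 0.0]},
    {"chunk": "payment schedule", "vector": [0.0, 1.0]},
]

def vector_search(query_vec: list[float], top: int = 1) -> list[str]:
    ranked = sorted(index, key=lambda d: cosine(query_vec, d["vector"]),
                    reverse=True)
    return [d["chunk"] for d in ranked[:top]]

print(vector_search([0.9, 0.1]))  # ['lease terms']
```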

### Azure Document Intelligence
One step of the data processing workflow where documents have Optical Character Recognition (OCR) applied to extract data. This includes text and handwriting extraction from documents.

### GPT 4o mini
Using Azure OpenAI, a deployment of the GPT 4o mini model (version 2024-07-18) is used during the data processing workflow to extract content, context, keywords, knowledge, topics, and summarizations. This model is also used in the web app's chat experience. It can be swapped for a different Azure OpenAI model if desired, but this has not been thoroughly tested and may be affected by output token limits.

### Blob Storage
Using Azure Blob Storage, unprocessed documents are stored as blobs. The data processing workflow reads each file and saves JSON, text chunks, markdown, embedded text, and metadata (including keywords and a summarization) of the processed data back to blob storage. Files uploaded through the web app's upload capabilities are stored here.


### Cosmos DB for MongoDB
Using Azure Cosmos DB for MongoDB, the processing results for each processed document are saved to a collection, and the web app chat experience saves chat history to another. The processed document results and chat history are used to inform prompt recommendations and answers.
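
The two kinds of records described above can be pictured as simple documents. The field names below are illustrative assumptions, not the solution's actual schema; with `pymongo` they would be inserted via something like `db["results"].insert_one(...)`.

```python
# Sketch of the two record shapes saved to Cosmos DB for MongoDB:
# per-document processing results and per-session chat turns.
# All field names are illustrative assumptions.
from datetime import datetime, timezone

def processing_result(doc_id: str, keywords: list[str], summary: str) -> dict:
    return {
        "documentId": doc_id,
        "keywords": keywords,
        "summary": summary,
        "processedAt": datetime.now(timezone.utc).isoformat(),
    }

def chat_turn(session_id: str, question: str, answer: str) -> dict:
    return {
        "sessionId": session_id,
        "question": question,
        "answer": answer,
        "createdAt": datetime.now(timezone.utc).isoformat(),
    }

record = processing_result("doc1", ["lease", "tenant"], "A lease agreement.")
print(record["keywords"])  # ['lease', 'tenant']
```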
30 changes: 30 additions & 0 deletions App/backend-api/.dockerignore
@@ -0,0 +1,30 @@
**/.classpath
**/.dockerignore
**/.env
**/.git
**/.gitignore
**/.project
**/.settings
**/.toolstarget
**/.vs
**/.vscode
**/*.*proj.user
**/*.dbmdl
**/*.jfm
**/azds.yaml
**/bin
**/charts
**/docker-compose*
**/Dockerfile*
**/node_modules
**/npm-debug.log
**/obj
**/secrets.dev.yaml
**/values.dev.yaml
LICENSE
README.md
!**/.gitignore
!.git/HEAD
!.git/config
!.git/packed-refs
!.git/refs/heads/**
