<!DOCTYPE html>
<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title>HiCAL - A System for Efficient High-Recall Retrieval</title>
<!-- Bootstrap & Font Awesome CSS -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0-alpha.6/css/bootstrap.min.css" integrity="sha384-rwoIResjU2yc3z8GV/NPeZWAv56rSmLldC3R/AZzGRnGxQQKnKkoFVhFQhNUwEyJ" crossorigin="anonymous">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css">
<link rel="stylesheet" href="static/style.css">
<link rel="shortcut icon" href="images/favicon.ico" type="image/x-icon">
<link rel="icon" href="images/favicon.ico" type="image/x-icon">
<script src="https://code.jquery.com/jquery-3.1.1.slim.min.js" integrity="sha384-A7FZj7v+d/sdmMqp/nOQwliLvUsJfDHW+k9Omg/a/EheAdgtzNs3hpfag6Ed950n" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/tether/1.4.0/js/tether.min.js" integrity="sha384-DztdAPBWPRXSA/3eYEEUWrWCy7G5KFbe8fFjk5JAIxUYHKkDx6Qin1DkWx51bBrb" crossorigin="anonymous"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0-alpha.6/js/bootstrap.min.js" integrity="sha384-vBWWzlZJ8ea9aCX4pEW3rVHjgjt7zpkNpZk+02D9phzyeVkE+jo0ieGizqPLForn" crossorigin="anonymous"></script>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-122235632-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-122235632-1');
</script>
</head>
<body>
<div class="navbar-container">
<nav class="navbar navbar-toggleable-md navbar-light ml-4 mr-4">
<a class="navbar-brand" href="index.html"><img src="images/hical.png" width="100px"></a>
<span class="navbar-text mb-3 mr-5">
<span class=" mr-1"></span> A System for Efficient High-Recall Retrieval.
</span>
<div class=" navbar-collapse-right" id="navbarTogglerDemo03">
<ul class="navbar-nav mr-auto mb-3">
<li class="nav-item">
<a class="nav-link" href="index.html">Install</a>
</li>
<li class="nav-item">
<a class="nav-link" href="usage.html">Usage</a>
</li>
<li class="nav-item active">
<a class="nav-link" href="#">Documentation<span class="sr-only">(current)</span></a>
</li>
<li class="nav-item">
<a class="nav-link" href="paper.html">Paper</a>
</li>
<li class="nav-item">
<a class="nav-link" href="https://github.com/hical">Github</a>
</li>
</ul>
</div>
</nav>
</div>
<div class="content">
<h2>Architecture</h2>
<p>The figure below shows the high-level architecture of the system.</p>
<div class="row">
<div class="col-md-6">
<figure class="figure">
<img src="images/architecure.png" class="figure-img img-fluid rounded p-1 border gray-background">
</figure>
</div>
<div class="col-md-6">
<p>Figure 1: The platform server (i.e. <a href="http://www.github.com/hical/hical/HiCALWeb">HiCAL-web</a>) is responsible for displaying content and relaying requests and responses between components.
The CAL component (i.e. <a href="">HiCAL-engine</a>) is responsible for training and ranking documents.
The Search component is a query-based search engine responsible for retrieving and ranking documents.
</p>
</div>
</div>
<h2>Documentation</h2>
<p>Click for more details.</p>
<div class="row" style="text-align: center">
<div class="col-sm-6"><h5><a href="#HiCAL-web">HiCAL web</a></h5></div>
<div class="col-sm-6"><h5><a href="#HiCAL-engine">HiCAL engine</a></h5></div>
</div>
<h4 class="mt-5"><a name="HiCAL-web">HiCAL-web</a></h4>
<p>The details below are for setting up HiCAL-web (no docker).</p>
<h5>Setting up</h5>
<p>
The system requires a database for storing document judgments, user profiles and topics.
Install PostgreSQL and start it, then create a database called <code>HiCAL</code> (or whatever name you have set in <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/settings/base.py">settings</a>, <code>config/settings/base.py</code>).
</p>
<p>
To create a db, use this command:
</p>
<pre><code>$ createdb HiCAL </code></pre>
<p>Create a Python 3 virtual environment and pip install everything in <code>requirements/base.txt</code>:</p>
<pre><code>$ pip install -r requirements/base.txt</code></pre>
<p>If you are running it locally, you might also want to install everything in <code>requirements/local.txt</code>:</p>
<pre><code>$ pip install -r requirements/local.txt</code></pre>
<p>
Your local db is initially empty, and you will need to create the tables according to the models.
To do this, run <code>python manage.py makemigrations</code> and then <code>python manage.py migrate</code>.
</p>
<p>
To run a uwsgi instance, use <code>uwsgi --socket 127.0.0.1:8001 --module config.wsgi --master --process 2 --threads 4</code>.
</p>
<p>To run locally, use <code>python manage.py runserver</code>.</p>
<p>This will start the web server on the default port 8000. You can access the server by going to <a href="http://localhost:8000/">http://localhost:8000/</a>.</p>
<h5>Setting Up Your Users</h5>
<p>Each document judgment is associated with a user profile.</p>
<p>To create a normal user account, go to Sign Up and fill out the form.</p>
<p>To create a superuser account, use this command:</p>
<pre><code>$ python manage.py createsuperuser</code></pre>
<p>For convenience, you can keep your normal user logged in on Chrome and your superuser logged in on Firefox (or similar), so that you can see how the site behaves for both kinds of users. A superuser can access the platform's admin page by going to <code>/admin</code> (e.g. <a href="http://localhost:8000/admin">http://localhost:8000/admin</a> ) or whatever is set in the settings.</p>
<h6>Configuring Signup Form</h6>
<p><code>users.models.User</code> contains the custom user model that you can modify to suit your needs. <code>allauth.forms.SignupForm</code> is a custom signup form that you can change to include extra information that will be saved in the user model.</p>
<p>Registration and account management are handled by django-allauth.</p>
<h5>Component Settings</h5>
<p>Communication to and from the components is done through HTTP requests.</p>
<p>Make sure you update the IPs for each component in the <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/settings/base.py"><code>config/settings/base.py</code></a> file.</p>
<pre>
<code>
# HiCAL COMPONENTS IPS *REQUIRED*
# ------------------------------------------------------------------------------
CAL_SERVER_IP = '127.0.0.0'
CAL_SERVER_PORT = '9001' # The port for CAL engine
SEARCH_SERVER_IP = '127.0.0.0'
SEARCH_SERVER_PORT = '80'
DOCUMENTS_URL = '127.0.0.0:9000/doc'
PARA_URL = '127.0.0.0:9000/para'
</code>
</pre>
<h5><a name="IntegratingSearch">Integrating Search</a></h5>
<p>Once you have a search engine server running, make sure you update the <code>SEARCH_SERVER_IP</code> and <code>SEARCH_SERVER_PORT</code> in <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/settings/base.py"><code>config/settings/base.py</code></a>.</p>
<p>You also need to update the <code>url</code> variable in the method <code>get_documents</code> in <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/hicalweb/interfaces/SearchEngine/functions.py"><code>interfaces/SearchEngine/functions.py</code></a>.</p>
<p>The <code>get_documents</code> method will send a <code>GET</code> request to the <code>url</code> of the search engine.</p>
<p>The parameters of the request are:</p>
<ul>
<li><b>'start'</b>: Integer - Start from rank (keep 0 if you don't know).</li>
<li><b>'numdisplay'</b>: Integer - The number of SERP items to return. </li>
<li><b>'query'</b>: String - The query to submit to the search engine.</li>
</ul>
<p>You can change the request as you like. Update <code>get_documents</code> to parse the response of your search engine, and make sure the method returns the following:</p>
<pre>
<code>
result: A list of documents, where each document is composed of
{
"rank": rank, # rank of document
"docno": docno, # doc id
"title": title, # Title of SERP item
"snippet": snippet # Snippet of SERP item
}
doc_ids: A list of document ids in the same order as appeared in result
time_taken: str of the time taken to complete the search request
</code>
</pre>
<h6><a name="IntegratingSearchExample">Example</a></h6>
<p>Suppose the response from the search server (using indri) is an XML of the following format: </p>
<pre>
<code>
&lt;search-response&gt;
  &lt;raw-query&gt;foobar&lt;/raw-query&gt;
  &lt;total-time&gt;0.17908906936646&lt;/total-time&gt;
  &lt;indri-query&gt;foobar&lt;/indri-query&gt;
  &lt;indri-sentence-query&gt;#combine[sentence]( foobar )&lt;/indri-sentence-query&gt;
  &lt;results&gt;
    &lt;result&gt;
      &lt;rank&gt;1&lt;/rank&gt;
      &lt;score&gt;-8.00303&lt;/score&gt;
      &lt;url/&gt;
      &lt;docno&gt;774090&lt;/docno&gt;
      &lt;document-id&gt;1055435&lt;/document-id&gt;
      &lt;title&gt;
        What Goes On?; Don't Speak Unix? Oh? Try http:www.fooledyou
      &lt;/title&gt;
      &lt;snippet&gt;
        Purists would say, "lower-case aich tee tee pee colon forward slash forward slash double you double you double you dot &lt;strong&gt;foobar&lt;/strong&gt; dot com," and might add that the letters signify a type of server computer that recognizes data transfers based on the hypertext transport protocol, followed by the branches of the specific data directories where Web pages can be found. Whew!
      &lt;/snippet&gt;
    &lt;/result&gt;
  &lt;/results&gt;
&lt;/search-response&gt;
</code>
</pre>
<p>Here is an example of what the file <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/hicalweb/interfaces/SearchEngine/functions.py"><code>interfaces/SearchEngine/functions.py</code></a> can look like:</p>
<pre>
<code>
from collections import OrderedDict
import urllib.parse

import httplib2
import xmltodict

from config.settings.base import SEARCH_SERVER_IP
from config.settings.base import SEARCH_SERVER_PORT


def get_documents(query, start=0, numdisplay=20):
    """
    :param query: The query to submit
    :param start: Start from
    :param numdisplay: Number of documents to return
    :return:
        result: A list of documents, where each document is composed of
            {
                "rank": rank,        # rank of document
                "docno": docno,      # doc id
                "title": title,      # Title of SERP item
                "snippet": snippet   # Snippet of SERP item
            }
        doc_ids: A list of document ids in the same order as appeared in result
        time_taken: str of the time taken to complete the search request
    """
    h = httplib2.Http()
    # Update this url to whatever your search engine supports
    url = "http://{}:{}/websearchapi/search.php?{}"
    parameters = {'start': start, 'numdisplay': numdisplay, 'query': query}
    parameters = urllib.parse.urlencode(parameters)
    resp, content = h.request(url.format(SEARCH_SERVER_IP,
                                         SEARCH_SERVER_PORT,
                                         parameters),
                              method="GET")

    if resp and resp.get("status") == "200":
        xmlDict = xmltodict.parse(content)
        try:
            xmlResult = xmlDict['search-response']['results']['result']
        except (KeyError, TypeError):
            # No results element, or an empty one
            return None, None, None

        doc_ids = []
        result = OrderedDict()
        # A single result is parsed as a dict rather than a list
        if not isinstance(xmlResult, list):
            xmlResult = [xmlResult]

        for doc in xmlResult:
            docno = doc["docno"].zfill(7)
            parsed_doc = {
                "rank": doc["rank"],
                "docno": docno,
                "title": doc["title"],
                "snippet": doc["snippet"]
            }
            result[docno] = parsed_doc
            doc_ids.append(docno)

        return result, doc_ids, xmlDict['search-response']['total-time']

    return None, None, None
</code>
</pre>
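<p>If you would rather not depend on <code>xmltodict</code>, the parsing step above can also be sketched with the standard library alone. This is a minimal sketch rather than the project's implementation: the helper name <code>parse_search_response</code> is hypothetical, and it assumes the response follows the XML layout shown in the example. Unlike <code>xmltodict</code>, it flattens inline markup inside the snippet (such as the <code>strong</code> element) to plain text.</p>

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict


def _text(node, tag):
    """Return the stripped text of a child element, or '' if it is missing."""
    child = node.find(tag)
    if child is None:
        return ""
    # itertext() also collects text nested inside inline elements
    return "".join(child.itertext()).strip()


def parse_search_response(xml_text):
    """Hypothetical helper: turn the XML response into (result, doc_ids, time_taken)."""
    root = ET.fromstring(xml_text)  # root element is <search-response>
    result = OrderedDict()
    doc_ids = []
    for node in root.findall("./results/result"):
        docno = _text(node, "docno").zfill(7)
        result[docno] = {
            "rank": _text(node, "rank"),
            "docno": docno,
            "title": _text(node, "title"),
            "snippet": _text(node, "snippet"),
        }
        doc_ids.append(docno)
    return result, doc_ids, _text(root, "total-time")
```

<p>An empty <code>results</code> element simply yields an empty mapping and an empty list, so callers that expect <code>None</code> in that case would need to check for it themselves.</p>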
<h5>Logging</h5>
<p>We use Python’s built-in logging module for system and analytics logging. You can configure logging from the project settings in <code>config/settings/base.py</code>.</p>
<p>The repo currently has one logging handler. Logs are saved in <code>logs/web.log</code>.</p>
<p>Check <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/settings/base.py"><code>config/settings/base.py</code></a> for more logging configurations.</p>
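<p>Django passes its <code>LOGGING</code> setting to <code>logging.config.dictConfig</code> by default, so a file handler can be sketched as a plain dictionary. The following is an illustrative sketch only: the handler name <code>web_file</code>, the logger name <code>hical_example</code>, the format string and the temporary log path are assumptions, not the values used in the repository.</p>

```python
import json
import logging
import logging.config
import os
import tempfile


def build_logging_config(log_path):
    """Return a dictConfig-style dictionary with a single file handler."""
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "plain": {"format": "%(asctime)s %(name)s %(levelname)s %(message)s"},
        },
        "handlers": {
            # Hypothetical handler name; writes every record to log_path
            "web_file": {
                "class": "logging.FileHandler",
                "filename": log_path,
                "formatter": "plain",
            },
        },
        "loggers": {
            # Hypothetical logger name used only for this example
            "hical_example": {"handlers": ["web_file"], "level": "INFO"},
        },
    }


# Write to a temporary file here; a project would point this at logs/web.log
log_path = os.path.join(tempfile.mkdtemp(), "web.log")
logging.config.dictConfig(build_logging_config(log_path))
logging.getLogger("hical_example").info(
    json.dumps({"user": "alice", "event": "judgment"}))
```

<p>In a Django settings file the same dictionary would be assigned to <code>LOGGING</code> instead of being passed to <code>dictConfig</code> by hand.</p>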
<h6>Using the logger</h6>
<p>Once you have configured your logger, you can start placing logging calls into your code. This is very simple to do. Here’s an example:</p>
<pre>
<code>
import json
import logging

logger = logging.getLogger(__name__)


class JudgmentAJAXView(views.CsrfExemptMixin, views.LoginRequiredMixin,
                       views.JsonRequestResponseMixin,
                       generic.View):
    # some code
    ...
    # build the log message (inside a view method, where self.request
    # and client_time are available)
    log_message = {
        "user": self.request.user.username,
        "client_time": client_time,
        # ...
    }
    # write the message to the log as a JSON string
    logger.info("{}".format(json.dumps(log_message)))
</code>
</pre>
<p>We suggest you follow a simple logging style that you can easily parse and understand. Place logging messages wherever you need to log or track changes in the system or in user behaviour.</p>
<h5>Components</h5>
<p>Our system currently supports two retrieval components, CAL and Search.
All components in the architecture are stand-alone and interact with each other via HTTP API calls.
You can also add your own component if you wish.
</p>
<h6>Adding a component</h6>
<p>Let’s go through an example of adding a new component. For simplicity, the component will simply return a list of predefined documents that need to be judged. The component will allow users to access and judge these documents in the order they are defined. We’ll call this component Iterative.</p>
<p>Let’s add a new app in our project by running this command:</p>
<pre><code>$ python manage.py startapp iterative</code></pre>
<p>This will create a folder with all the files we need. You might need to move the directory. Make sure the newly created app is in the right directory. It should be in <code>/HiCAL</code>.</p>
<p>Let’s add the new app to our settings file so that our server is aware of it. You can do this by going to the <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/settings/base.py"><code>config/settings/base.py</code></a> file and adding <code>hical.iterative</code> to <code>LOCAL_APPS</code>:</p>
<pre>
<code>
# Apps specific for this project go here.
LOCAL_APPS = [
    # custom users app
    ...
    'hical.iterative',
]
</code>
</pre>
<p>Now add a specific url for the app in <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/urls.py"><code>config/urls.py</code></a>:</p>
<pre>
<code>
urlpatterns = [
    ...
    # add this
    url(r'^iterative/', include('hical.iterative.urls', namespace='iterative')),
]
</code>
</pre>
<p>and create a new <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/hicalweb/iterative/urls.py"><code>urls.py</code></a> file under the iterative app directory:</p>
<pre>
<code>
from django.conf.urls import url

from hical.iterative import views

urlpatterns = [
    url(r'^$', views.HomePageView.as_view(), name='main'),

    # Ajax views
    url(r'^post_log/$', views.MessageAJAXView.as_view(), name='post_log_msg'),
    url(r'^get_docs/$', views.DocAJAXView.as_view(), name='get_docs'),
]
</code>
</pre>
<p>This will allow access to the component from the url <code>/iterative/</code>.</p>
<p>We will use <code>get_docs/</code> url pattern to link to our view <code>DocAJAXView</code>, which will be responsible for getting the documents to judge. Please look at <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/hicalweb/iterative/views.py"><code>iterative/views.py</code></a> for more details on what these views do.</p>
<p>Let’s look at <code>DocAJAXView</code> view as an example:</p>
<pre>
<code>
import json

from braces import views  # django-braces mixins
from django.http import HttpResponse
from django.views import generic

from interfaces.DocumentSnippetEngine import functions as DocEngine
from interfaces.Iterative import functions as IterativeEngine
# `helpers` is the project module that provides remove_judged_docs


class DocAJAXView(views.CsrfExemptMixin, views.LoginRequiredMixin,
                  views.JsonRequestResponseMixin,
                  views.AjaxResponseMixin, generic.View):
    """
    View to get a list of documents to judge
    """
    require_json = False

    def render_timeout_request_response(self, error_dict=None):
        # Return the error as JSON with a 502 status code
        if error_dict is None:
            error_dict = self.error_response_dict
        json_context = json.dumps(
            error_dict,
            cls=self.json_encoder_class,
            **self.get_json_dumps_kwargs()
        ).encode('utf-8')
        return HttpResponse(
            json_context, content_type=self.get_content_type(), status=502)

    def get_ajax(self, request, *args, **kwargs):
        try:
            current_task = self.request.user.current_task
            # ids of the documents defined for the current topic
            docs_ids = IterativeEngine.get_documents(current_task.topic.number)
            # skip documents this user has already judged for this task
            docs_ids = helpers.remove_judged_docs(docs_ids,
                                                  self.request.user,
                                                  current_task)
            # fetch the content of the remaining documents
            documents = DocEngine.get_documents(docs_ids, query=None)
            return self.render_json_response(documents)
        except TimeoutError:
            error_dict = {u"message": u"Timeout error. Please check status of servers."}
            return self.render_timeout_request_response(error_dict)
</code>
</pre>
<p>The method <code>get_ajax</code> will be called from our template to retrieve a list of documents. The variable <code>docs_ids</code> contains the list of document ids whose content we need to retrieve. We put all functions associated with component retrieval in the <code>/interfaces</code> directory.</p>
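<p>To make the role of the interface functions more concrete, here is a minimal sketch of what a module such as <code>interfaces/Iterative/functions.py</code> could contain. It is illustrative only: the <code>PREDEFINED_DOCS</code> mapping, its topic number and document ids, and the <code>remove_judged</code> helper are hypothetical and are not taken from the repository.</p>

```python
# Hypothetical mapping from topic number to the documents to judge,
# listed in the order in which they should be shown.
PREDEFINED_DOCS = {
    "401": ["0000123", "0000456", "0000789"],
}


def get_documents(topic_number):
    """Return the predefined document ids for a topic, in their defined order."""
    return list(PREDEFINED_DOCS.get(str(topic_number), []))


def remove_judged(doc_ids, judged_ids):
    """Drop ids that were already judged while keeping the original order."""
    judged = set(judged_ids)
    return [doc_id for doc_id in doc_ids if doc_id not in judged]
```

<p>In the real view the set of judged documents comes from the database for the current user and task; the list passed to <code>remove_judged</code> here only stands in for that lookup.</p>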
<p>The variable <code>documents</code> will contain a list of documents with their content, in the following format:</p>
<pre>
<code>
[
    {
        'doc_id': "012345",
        'title': "Document Title",
        'content': "Body of document",
        'date': "Date of document"
    },
    ...
]
</code>
</pre>
<p><code>documents</code> will be passed as context to our template, and loaded in the browser.</p>
<p>The associated HTML is under <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/hicalweb/iterative/templates/iterative/iterative.html"><code>iterative/templates/iterative/iterative.html</code></a>. The <code>HomePageView</code> view, which the url pattern <code>^$</code> (that is, <code>/iterative/</code>) points to, will render this page. If you look at the <code>HomePageView</code> view, you will see that <code>template_name = 'iterative/iterative.html'</code>.</p>
<p>You can modify the <code>iterative.html</code> file as you like. The <code>iterative.html</code> page will call <code>iterative:get_docs</code> url once the page is loaded. You can customize the behaviour by modifying the javascript code in <code>iterative.html</code> to meet your needs.</p>
<p>The judgment buttons in the interface will call the <code>send_judgment()</code> function, which will send a call to the url <code>judgment:post_judgment</code> in <code>judgments/url.py</code> and run the view <code>JudgmentAJAXView</code> that will save the document's judgment to the database. If you would like to include more information to be saved with each judgment, you can modify the <code>send_judgment()</code> in <code>iterative.html</code>. You will also need to modify the database model, <code>Judgement</code>, associated with each judgment instance. The <code>Judgement</code> class is located in <code>judgment/models.py</code>.</p>
<h6>CAL and Search</h6>
<p>Both components follow a pattern similar to the Iterative component described above. The difference is that both run as separate services. Communication to and from each component (e.g. getting the documents we need to judge) is done through HTTP calls. The IPs for each component must be set in <code>config/settings/base.py</code>. Take a look at the methods used to communicate with the components (found in <code>/interfaces/</code>) for further details.</p>
<p>The HTML associated with each component is found under the templates folder under the component directory.</p>
<h4><a name="HiCAL-engine">HiCAL engine</a></h4>
<h5 id="requirements">Requirements</h5>
<ul>
<li>libfcgi</li>
<li>libarchive</li>
<li>g++</li>
<li>make</li>
<li>spawn-fcgi (to run <code>bmi_fcgi</code>)</li>
</ul>
<h5 id="command-line-tool">Command Line Tool</h5>
<pre><code>$ make bmi_cli
$ ./bmi_cli --help
Command line flag options:
--help Show Help
--async-mode Enable greedy async mode <span class="hljs-keyword">for</span> classifier <span class="hljs-keyword">and</span> rescorer,
overrides --judgment-per-iteration <span class="hljs-keyword">and</span> --num-iterations
--df Path <span class="hljs-keyword">of</span> the <span class="hljs-keyword">file</span> <span class="hljs-keyword">with</span> list <span class="hljs-keyword">of</span> terms <span class="hljs-keyword">and</span> their document
frequencies. The <span class="hljs-keyword">file</span> contains space-separated word <span class="hljs-keyword">and</span>
df <span class="hljs-keyword">on</span> every line. Specify only when df information <span class="hljs-keyword">is</span>
<span class="hljs-keyword">not</span> encoded <span class="hljs-keyword">in</span> the document features <span class="hljs-keyword">file</span>.
--jobs Number <span class="hljs-keyword">of</span> concurrent jobs (topics)
--threads Number <span class="hljs-keyword">of</span> threads <span class="hljs-keyword">to</span> use <span class="hljs-keyword">for</span> scoring
--doc-features Path <span class="hljs-keyword">of</span> the <span class="hljs-keyword">file</span> <span class="hljs-keyword">with</span> list <span class="hljs-keyword">of</span> document features
--para-features Path <span class="hljs-keyword">of</span> the <span class="hljs-keyword">file</span> <span class="hljs-keyword">with</span> list <span class="hljs-keyword">of</span> paragraph features (BMI_PARA)
--qrel Qrel <span class="hljs-keyword">file</span> <span class="hljs-keyword">to</span> use <span class="hljs-keyword">for</span> judgment
--max-effort <span class="hljs-keyword">Set</span> max effort (number <span class="hljs-keyword">of</span> judgments)
--max-effort-factor <span class="hljs-keyword">Set</span> max effort <span class="hljs-keyword">as</span> a factor <span class="hljs-keyword">of</span> recall
--num-iterations <span class="hljs-keyword">Set</span> max number <span class="hljs-keyword">of</span> refresh iterations
--training-iterations <span class="hljs-keyword">Set</span> number <span class="hljs-keyword">of</span> training iterations
--judgments-per-iteration Number <span class="hljs-keyword">of</span> docs <span class="hljs-keyword">to</span> judge per iteration (-<span class="hljs-number">1</span> <span class="hljs-keyword">for</span> BMI <span class="hljs-keyword">default</span>)
--query Path <span class="hljs-keyword">of</span> the <span class="hljs-keyword">file</span> <span class="hljs-keyword">with</span> queries (one seed per line; each
line <span class="hljs-keyword">is</span> &lt;topic_id&gt; &lt;rel&gt; &lt;<span class="hljs-keyword">string</span>&gt;; can have multiple seeds per topic)
--judgment-logpath Path <span class="hljs-keyword">to</span> log judgments. Specify a directory within which
topic-specific logs will be generated.
--mode <span class="hljs-keyword">Set</span> strategy: (<span class="hljs-keyword">default</span>) BMI_DOC, BMI_PARA, BMI_PARTIAL_RANKING,
BMI_ONLINE_LEARNING, BMI_PRECISION_DELAY, BMI_RECENCY_WEIGHTING, BMI_FORGET
... (Strategy specific arguments truncated)
</code></pre><ul>
<li>The document frequency data is encoded within the document features bin file.
<code>--df</code> shouldn't be used unless you are using the old document feature format.</li>
</ul>
<h5 id="corpus-parser">Corpus Parser</h5>
<p>This tool generates document features of a given corpus.</p>
<p>The tool takes as input an archived corpus (<code>tar.gz</code>). Each file in the corpus is treated
as a separate document with the name of the file (excluding the directories) as its ID.
The files are processed as plaintext documents. This gives users freedom to clean and
form corpora in different ways. In order to use the <code>BMI_PARA</code> mode in the main tool,
the user needs to split each document such that each new document is a single paragraph.
The name of these paragraph files should be <code>&lt;doc-id&gt;.&lt;para-id&gt;</code> (this is used by the tool
to get the parent document for a given paragraph). For this reason, avoid using the character <code>.</code> in
the <code>&lt;doc-id&gt;</code>.</p>
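<p>The paragraph naming convention can be illustrated with a short script that builds a paragraph-level archive. This is a sketch under stated assumptions, not part of the HiCAL tools: splitting on blank lines, numbering paragraphs from 0, and the function names are all choices made for this example.</p>

```python
import io
import re
import tarfile


def split_paragraphs(text):
    """Split a document on blank lines (an assumed paragraph boundary)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def build_paragraph_corpus(docs, out_path):
    """Write every paragraph of every document as its own file in a tar.gz.

    docs maps a document id to its plain text. Each archive member is named
    "doc-id.para-id", so document ids must not contain a dot.
    """
    for doc_id in docs:
        if "." in doc_id:
            raise ValueError("doc id must not contain '.': {}".format(doc_id))

    names = []
    with tarfile.open(out_path, "w:gz") as tar:
        for doc_id, text in docs.items():
            for para_id, para in enumerate(split_paragraphs(text)):
                name = "{}.{}".format(doc_id, para_id)
                data = para.encode("utf-8")
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
                names.append(name)
    return names
```

<p>The resulting archive can then be given to <code>corpus_parser</code> through <code>--in</code> to produce the paragraph features.</p>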
<p>The <code>svmlight</code> output <code>--type</code> is not used by the main tool and is intended only for debugging purposes.
<code>--out-df</code> only works when <code>--type</code> is <code>svmlight</code>.</p>
<p>The bin output format begins with the document frequency data followed by the document feature
data. The document frequency data begins with a <code>uint32_t</code> specifying the number of <code>dfreq</code> records which follow.
Each <code>dfreq</code> record begins with a null terminated string (word) followed by a <code>uint32_t</code> document frequency.
The document feature data begins with a <code>uint32_t</code> specifying the number of <code>dfeat</code> records which follow.
Each <code>dfeat</code> record begins with a null terminated string (document ID) followed by a <code>feature_list</code>. <code>feature_list</code>
begins with a <code>uint32_t</code> specifying the number of <code>feature_pair</code> which follow. Each <code>feature_pair</code>
is a <code>uint32_t</code> feature ID followed by a <code>float</code> feature weight.</p>
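<p>The layout described above can be read back with a few lines of Python, which may help when inspecting a features file. This is a sketch under assumptions: the byte order is not stated here, so little-endian is assumed, strings are assumed to be UTF-8, <code>float</code> is assumed to be a 32-bit IEEE value, and the function names are made up for this example.</p>

```python
import struct


def _read_u32(stream):
    """Read one uint32_t (assumed little-endian)."""
    data = stream.read(4)
    if len(data) != 4:
        raise EOFError("unexpected end of file while reading a uint32")
    return struct.unpack("<I", data)[0]


def _read_cstring(stream):
    """Read a null-terminated string (assumed UTF-8)."""
    chars = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of file inside a string")
        if byte == b"\x00":
            return chars.decode("utf-8")
        chars += byte


def read_features(stream):
    """Parse the bin format into (dfreq, dfeat) dictionaries."""
    dfreq = {}
    num_dfreq = _read_u32(stream)
    for _ in range(num_dfreq):
        word = _read_cstring(stream)
        dfreq[word] = _read_u32(stream)

    dfeat = {}
    num_dfeat = _read_u32(stream)
    for _ in range(num_dfeat):
        doc_id = _read_cstring(stream)
        num_pairs = _read_u32(stream)
        pairs = []
        for _ in range(num_pairs):
            # each feature_pair is a uint32 feature ID and a float weight
            feature_id, weight = struct.unpack("<If", stream.read(8))
            pairs.append((feature_id, weight))
        dfeat[doc_id] = pairs
    return dfreq, dfeat
```

<p>Open the file in binary mode, for example <code>read_features(open(path, "rb"))</code>. Large corpora produce large files, so this in-memory reader is meant for inspection rather than production use.</p>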
<pre><code>$ make corpus_parser
$ ./corpus_parser --<span class="hljs-keyword">help</span>
Command <span class="hljs-keyword">line</span> flag options:
--<span class="hljs-keyword">help</span> Show <span class="hljs-keyword">Help</span>
--<span class="hljs-keyword">in</span> <span class="hljs-keyword">Input</span> corpus archive
--<span class="hljs-keyword">out</span> Output feature <span class="hljs-keyword">file</span>
--<span class="hljs-keyword">out</span>-df Output document frequency <span class="hljs-keyword">file</span>
--<span class="hljs-keyword">type</span> Output <span class="hljs-keyword">file</span> <span class="hljs-keyword">format</span>: bin (default) or svmlight
</code></pre><h5 id="fastcgi-based-web-server">FastCGI based web server</h5>
<pre><code>$ make bmi_fcgi
$ ./bmi_fcgi --help
Command line flag <span class="hljs-attribute">options</span>:
--df Path <span class="hljs-keyword">of</span> the file <span class="hljs-keyword">with</span> <span class="hljs-built_in">list</span> <span class="hljs-keyword">of</span> terms and their <span class="hljs-built_in">document</span> frequencies
--doc-features Path <span class="hljs-keyword">of</span> the file <span class="hljs-keyword">with</span> <span class="hljs-built_in">list</span> <span class="hljs-keyword">of</span> <span class="hljs-built_in">document</span> features
--help Show Help
--para-features Path <span class="hljs-keyword">of</span> the file <span class="hljs-keyword">with</span> <span class="hljs-built_in">list</span> <span class="hljs-keyword">of</span> paragraph features
--threads <span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> threads to use <span class="hljs-keyword">for</span> scoring
</code></pre><p>The <code>fcgi</code> libraries need to be present on the system. <code>bmi_fcgi</code> uses <code>libfcgi</code> to communicate
with any FastCGI capable web server like nginx. <code>bmi_fcgi</code> only supports modes <code>BMI</code> and <code>BMI_PARA</code>.
Install <code>spawn-fcgi</code> on your system and run:</p>
<pre><code>$ spawn-fcgi -p <span class="hljs-number">8002</span> -n -- bmi_fcgi --doc-features <span class="hljs-regexp">/path/</span>to<span class="hljs-regexp">/doc/</span>features --df <span class="hljs-regexp">/path/</span>to<span class="hljs-regexp">/df</span>
</code></pre><p>You can interact with the bmi_fcgi server through the HTTP API or the python bindings in <code>api.py</code>.</p>
<h5 id="http-api-spec">HTTP API Spec</h5>
<h6 id="begin-a-session">Begin a session</h6>
<pre><code>POST /begin
Data Params:
session_id=[string]
seed_query=[string]
async=[bool]
mode=[para,doc]
seed_judgments=doc1:rel1,doc2:rel2
judgments_per_iteration=[<span class="hljs-string">-1</span> or positive integer]
<span class="hljs-keyword">Success </span>Response:
Code: 200
Content: {'session-id': [string]}
<span class="hljs-keyword">Error </span>Response:
Code: 400
Content: {'error': 'session already exists'}
</code></pre><p>The server can run multiple sessions, each of which is identified by a unique <code>session_id</code>.
The <code>seed_query</code> is the initial relevant string used for training. <code>async</code> determines if
the server should wait to respond while refreshing. If <code>async</code> is set to true, the server
immediately returns the next document to judge from the old ranklist while the refresh is
ongoing in the background. <code>mode</code> sets whether to judge on paragraphs or documents.
<code>seed_judgments</code> can be used to initialize the session with some pre-determined judgments.
It can also be used to restore the state of sessions when restarting the bmi server.
Use a positive integer for relevant and a negative integer (or zero) for non-relevant.</p>
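<p>As an illustration of the parameters above, the request to <code>/begin</code> can be assembled with the standard library. This is a sketch, not the bundled <code>api.py</code>: the function names and the base URL are hypothetical, the body is assumed to be form-encoded, and sending the boolean as the strings <code>true</code> and <code>false</code> is an assumption about what the server accepts.</p>

```python
import urllib.parse
import urllib.request


def format_seed_judgments(judgments):
    """Encode {doc_id: rel} in the doc1:rel1,doc2:rel2 form."""
    return ",".join("{}:{}".format(doc_id, rel)
                    for doc_id, rel in judgments.items())


def build_begin_request(base_url, session_id, seed_query, mode="doc",
                        use_async=False, seed_judgments=None,
                        judgments_per_iteration=-1):
    """Build (but do not send) a POST request that begins a session."""
    params = {
        "session_id": session_id,
        "seed_query": seed_query,
        "async": "true" if use_async else "false",  # assumed encoding
        "mode": mode,
        "judgments_per_iteration": judgments_per_iteration,
    }
    if seed_judgments:
        params["seed_judgments"] = format_seed_judgments(seed_judgments)
    data = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(base_url.rstrip("/") + "/begin",
                                  data=data, method="POST")
```

<p>Passing the returned object to <code>urllib.request.urlopen</code> sends the request. The path under which <code>bmi_fcgi</code> is reachable depends on how your web server is configured, so adjust <code>base_url</code> accordingly.</p>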
<h6 id="get-document-to-judge">Get document to judge</h6>
<pre><code>GET /get_docs
URL Params:
session_id=[string]
max_count=[int]
<span class="hljs-keyword">Success </span>Response:
Code: 200
Content: {'session-id': [string], 'docs': ["doc<span class="hljs-string">-1001</span>", "doc<span class="hljs-string">-1002</span>", "doc<span class="hljs-string">-1010</span>"]}
<span class="hljs-keyword">Error </span>Response:
Code: 404
Content: {'error': 'session not found'}
</code></pre><h6 id="submit-judgment">Submit Judgment</h6>
<pre><code>POST /judge
Data Params:
session_id=[string]
doc_id=[string]
rel=[<span class="hljs-string">-1</span>,1]
<span class="hljs-keyword">Success </span>Response:
Code: 200
Content: {'session-id': 'xyz', 'docs': ["doc<span class="hljs-string">-1001</span>", "doc<span class="hljs-string">-1002</span>", "doc<span class="hljs-string">-1010</span>"]}
<span class="hljs-keyword">Error </span>Response:
Code: 404
Content: {'error': 'session not found'}
Code: 404
Content: {'error': 'doc not found'}
Code: 400
Content: {'error': 'invalid judgment'}
</code></pre><p>The <code>docs</code> field in the success response provides the next documents to be judged.
This saves an additional call to <code>/get_docs</code>. If <code>async</code> is set to false and
the judgment triggers a refresh, the server will finish the refresh before responding.</p>
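<p>A review loop can take advantage of the <code>docs</code> field so that <code>/get_docs</code> is only called once. The sketch below keeps the HTTP calls abstract: <code>get_docs</code> and <code>post_judgment</code> are hypothetical callables standing in for <code>GET /get_docs</code> and <code>POST /judge</code>, and <code>assess</code> stands in for the human assessor.</p>

```python
from collections import deque


def review_session(session_id, get_docs, post_judgment, assess, max_judgments):
    """Judge documents until the queue is empty or max_judgments is reached.

    get_docs(session_id) returns a list of doc ids.
    post_judgment(session_id, doc_id, rel) returns the parsed JSON response,
    whose 'docs' field lists the next documents to judge.
    assess(doc_id) returns 1 (relevant) or -1 (non-relevant).
    """
    judged = {}
    queue = deque(get_docs(session_id))
    while queue and len(judged) < max_judgments:
        doc_id = queue.popleft()
        if doc_id in judged:
            continue
        rel = assess(doc_id)
        response = post_judgment(session_id, doc_id, rel)
        judged[doc_id] = rel
        # Replace the local queue with the server's latest suggestion
        queue = deque(d for d in response.get("docs", []) if d not in judged)
    return judged
```

<p>Replacing the queue after every judgment means the assessor always sees documents from the most recent ranking the server was willing to return.</p>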
<h6 id="get-full-ranklist">Get Full Ranklist</h6>
<pre><code>GET /get_ranklist
URL Params:
session_id=[string]
<span class="hljs-keyword">Success </span>Response:
Code: 200
Content (text/plain): newline separated document ids with their scores ordered
from highest to lowest
<span class="hljs-keyword">Error </span>Response:
Code: 404
Content: {'error': 'session not found'}
</code></pre><p>Returns full ranklist based on the classifier from the last refresh.</p>
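<p>The plain-text ranklist can be turned into a list of pairs with a small parser. The exact separator between a document id and its score is not specified above, so this sketch assumes whitespace; the function name is hypothetical.</p>

```python
def parse_ranklist(text):
    """Parse newline-separated 'doc_id score' lines into (doc_id, score) pairs."""
    ranklist = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        ranklist.append((parts[0], float(parts[-1])))
    return ranklist
```

<p>The order of the returned list follows the response, which is already sorted from the highest to the lowest score.</p>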
<h6 id="delete-a-session">Delete a session</h6>
<pre><code>DELETE /delete_session
Data Params:
session_id: [string]
<span class="hljs-keyword">Success </span>Response:
Code: 200
Content: {'session-id': [string]}
<span class="hljs-keyword">Error </span>Response:
Code: 404
Content: {'error': 'session not found'}
</code></pre>
</div>
<div class="content">
<ul class="sm" style="list-style: none;">
<li>
<a href="https://github.com/HiCAL" target="_blank"><i class="fa fa-github"></i></a> <span class="ml-3">/</span>
</li>
<li class="small mt-1">
GitHub code.
</li>
<li class="small mt-1">
GNU General Public License v3.0
</li>
</ul>
<p class="copy"></p>
</div>
</body>
</html>