documentation.html

<!DOCTYPE html>
<html lang="en">
<head>
    <!-- Required meta tags -->
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

    <title>HiCAL - A System for Efficient High-Recall Retrieval</title>

    <!-- Bootstrap & Font Awesome CSS -->
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0-alpha.6/css/bootstrap.min.css" integrity="sha384-rwoIResjU2yc3z8GV/NPeZWAv56rSmLldC3R/AZzGRnGxQQKnKkoFVhFQhNUwEyJ" crossorigin="anonymous">
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css">
    <link rel="stylesheet" href="static/style.css">

    <link rel="shortcut icon" href="images/favicon.ico" type="image/x-icon">
    <link rel="icon" href="images/favicon.ico" type="image/x-icon">

    <script src="https://code.jquery.com/jquery-3.1.1.slim.min.js" integrity="sha384-A7FZj7v+d/sdmMqp/nOQwliLvUsJfDHW+k9Omg/a/EheAdgtzNs3hpfag6Ed950n" crossorigin="anonymous"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/tether/1.4.0/js/tether.min.js" integrity="sha384-DztdAPBWPRXSA/3eYEEUWrWCy7G5KFbe8fFjk5JAIxUYHKkDx6Qin1DkWx51bBrb" crossorigin="anonymous"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0-alpha.6/js/bootstrap.min.js" integrity="sha384-vBWWzlZJ8ea9aCX4pEW3rVHjgjt7zpkNpZk+02D9phzyeVkE+jo0ieGizqPLForn" crossorigin="anonymous"></script>
    <!-- Global site tag (gtag.js) - Google Analytics -->
    <script async src="https://www.googletagmanager.com/gtag/js?id=UA-122235632-1"></script>
    <script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
      gtag('js', new Date());

      gtag('config', 'UA-122235632-1');
    </script>

</head>
<body>
    <div class="navbar-container">
        <nav class="navbar navbar-toggleable-md navbar-light ml-4 mr-4">
            <a class="navbar-brand" href="index.html"><img src="images/hical.png" width="100px"></a>
            <span class="navbar-text mb-3 mr-5">
                <span class=" mr-1"></span> A System for Efficient High-Recall Retrieval.
            </span>

          <div class=" navbar-collapse-right" id="navbarTogglerDemo03">
            <ul class="navbar-nav mr-auto mb-3">
                    <li class="nav-item">
                        <a class="nav-link" href="index.html">Install</a>
                    </li>
                    <li class="nav-item">
                        <a class="nav-link" href="usage.html">Usage</a>
                    </li>
                    <li class="nav-item active">
                        <a class="nav-link" href="#">Documentation<span class="sr-only">(current)</span></a>
                    </li>
                    <li class="nav-item">
                        <a class="nav-link" href="paper.html">Paper</a>
                    </li>
                    <li class="nav-item">
                        <a class="nav-link" href="https://github.com/hical">Github</a>
                    </li>

                </ul>
          </div>
        </nav>
    </div>

    <div class="content">
        <h2>Architecture</h2>
           <p>The figure below shows the high-level architecture of the system.</p>

        <div class="row">
            <div class="col-md-6">
                <figure class="figure">
                  <img src="images/architecure.png" class="figure-img img-fluid rounded p-1 border gray-background">
                </figure>
            </div>
            <div class="col-md-6">
                <p>Figure 1: Platform server (i.e. <a href="http://www.github.com/hical/hical/HiCALWeb">HiCAL-web</a>) responsible for displaying content and communicating requests and responses between components.
                CAL component (i.e. <a href="">HiCAL-engine</a>) responsible for training and ranking documents.
                The Search component is a query-based search engine responsible for retrieving document and ranking documents.
                </p>
            </div>
        </div>

        <h2>Documentation</h2>
        <p>Click for more details.</p>
        <div class="row" style="text-align: center">
            <div class="col-sm-6"><h5><a href="#HiCAL-web">HiCAL web</a></h5></div>
            <div class="col-sm-6"><h5><a href="#HiCAL-engine">HiCAL engine</a></h5></div>
        </div>

        <h4 class="mt-5"><a name="HiCAL-web">HiCAL-web</a></h4>
        <p>The details below are for setting up HiCAL-web (no docker).</p>
        <h5>Setting up</h5>
        <p>
            The system requires a database for storing document judgments, user profiles and topics.
            Install postgres and run it, make sure you create a db called <code>HiCAL</code>, or whatever you have set in <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/settings/base.py">settings</a> (<code>config/settings/base.py</code>).
        </p>
        <p>
            To create a db, use this command:
        </p>
        <pre><code>$ createdb HiCAL </code></pre>

        <p>Create a python3 virtual env, and pip install everything in <code>requirements/base.txt:</code></p>
        <pre><code>$ pip install -r requirements/base.txt</code></pre>

        <p>If you are running it locally, you might also want to install everything in <code>requirements/local.txt:</code></p>

        <pre><code>$ pip install -r requirements/local.txt</code></pre>

        <p>
            Your local db is initially empty, and you will need to create the tables according to the models.
            To do this, run <code>python manage.py makemigrations</code> and then <code>python manage.py migrate.</code>
        </p>

        <p>
            For running a uwsgi instance, <code>uwsgi --socket 127.0.0.1:8001 --module config.wsgi --master --process 2 --threads 4</code>
        </p>

        <p>To run locally, run <code>python manage.py runserver</code></p>
        <p>This will start the web server running in the default port 8000. You can access the server by going to <a href="http://localhost:8000/">http://localhost:8000/</a>.</p>
        <h5>Setting Up Your Users</h5>
            <p>Each document judgments is associated with a user profile.</p>

        <p>To create a normal user account, just go to Sign Up and fill out the form.</p>

        <p>To create an superuser account, use this command:</p>

        <pre><code>$ python manage.py createsuperuser</code></pre>

        <p>For convenience, you can keep your normal user logged in on Chrome and your superuser logged in on Firefox (or similar), so that you can see how the site behaves for both kinds of users. A superuser can access the platform's admin page by going to <code>/admin</code> (e.g. <a href="http://localhost:8000/admin">http://localhost:8000/admin</a> ) or whatever is set in the settings.</p>

        <h6>Configuring Signup Form</h6>

        <p><code>users.models.User</code> contains the custom user model that you can modify to your need. <code>allauth.forms.SignupForm</code> is a custom signup form that you can change to include extra information that will be saved in the user model.</p>

        <p>Registration and account management is done using django-allauth.</p>

        <h5>Component Settings</h5>
        <p>Communication from and to the component are done through http requests.</p>
        <p>Make sure you update the IPs for each component in the <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/settings/base.py"><code>config/settings/base.py</code></a> file.</p>
<pre>
<code>
# HiCAL COMPONENTS IPS *REQUIRED*
# ------------------------------------------------------------------------------
CAL_SERVER_IP = '127.0.0.0'
CAL_SERVER_PORT = '9001' # The port for CAL engine
SEARCH_SERVER_IP = '127.0.0.0'
SEARCH_SERVER_PORT = '80'
DOCUMENTS_URL = '127.0.0.0:9000/doc'
PARA_URL = '127.0.0.0:9000/para'
</code>
</pre>

        <h5><a name="IntegratingSearch">Integrating Search</a></h5>
        <p>Once you have a search engine server running, make sure you update the <code>SEARCH_SERVER_IP</code> and <code>SEARCH_SERVER_PORT</code> in <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/settings/base.py"><code>config/settings/base.py</code></a>.
        <p> You also need to update the <code>url</code> variable in the method <code>get_documents</code> in <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/hicalweb/interfaces/SearchEngine/functions.py"><code>interfaces/SearchEngine/functions.py</code></a>.
        <p>The <code>get_documents</code> method will send a <code>GET</code> request to the <code>url</code> of the search engine.</p>
        <p>The parameters of the request are:</p>
        <ul>
            <li><b>'start'</b>: Integer - Start from rank (keep 0 if you don't know).</li>
            <li><b>'numdisplay'</b>: Integer - The number of SERP items to return. </li>
            <li><b>'query'</b>: String - The query to submit to the search engine.</li>
        </ul>

        <p>You can change the request as you would like. Update the <code>get_documents</code> to parse the response. Make sure the method returns the following:</p>

<pre>
<code>
result: A list of documents, where each document is composed of
    {
        "rank": rank, # rank of document
        "docno": docno, # doc id
        "title": title, # Title of SERP item
        "snippet": snippet # Snippet of SERP item
    }
doc_ids: A list of document ids in the same order as appeared in result
time_taken: str of the time taken to complete the search request
</code>
</pre>
<h6><a name="IntegratingSearchExample">Example</a></h6>

<p>Suppose the response from the search server (using indri) is an XML of the following format: </p>

<pre>
<code>
&lt;search-response&gt;
    &lt;raw-query&gt;foobar&lt;/raw-query&gt;
    &lt;total-time&gt;0.17908906936646&lt;/total-time&gt;
    &lt;indri-query&gt;foobar&lt;/indri-query&gt;
    &lt;indri-sentence-query&gt;#combine[sentence]( foobar )&lt;/indri-sentence-query&gt;
    &lt;results&gt;
        &lt;result&gt;
            &lt;rank&gt;1&lt;/rank&gt;
            &lt;score&gt;-8.00303&lt;/score&gt;
            &lt;url/&gt;
            &lt;docno&gt;774090&lt;/docno&gt;
            &lt;document-id&gt;1055435&lt;/document-id&gt;
            &lt;title&gt;
            What Goes On?; Don't Speak Unix? Oh? Try http:www.fooledyou
            &lt;/title&gt;
            &lt;snippet&gt;
            Purists would say, "lower-case aich tee tee pee colon forward slash forward slash double you double you double you dot &lt;strong&gt;foobar&lt;/strong&gt; dot com," and might add that the letters signify a type of server computer that recognizes data transfers based on the hypertext transport protocol, followed by the branches of the specific data directories where Web pages can be found. Whew!
            &lt;/snippet&gt;
        &lt;/result&gt;
    &lt;/results&gt;
&lt;/search-response&gt;
</code>
</pre>

<p>Here is an example of what the file <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/hicalweb/interfaces/SearchEngine/functions.py"><code>interfaces/SearchEngine/functions.py</code></a> can look like.

<pre>
<code>
from collections import OrderedDict
from config.settings.base import SEARCH_SERVER_IP
from config.settings.base import SEARCH_SERVER_PORT
import urllib.parse

import httplib2
import xmltodict


def get_documents(query, start=0, numdisplay=20):
    """
    :param query: The query to submit
    :param start: Start from
    :param numdisplay: Number of documents to return
    :return:
        result: A list of documents, where each document is composed of
            {
                "rank": rank, # rank of document
                "docno": docno, # doc id
                "title": title, # Title of SERP item
                "snippet": snippet # Snippet of SERP item
            }
        doc_ids: A list of document ids in the same order as appeared in result
        time_taken: str of the time taken to complete the search request
    """
    h = httplib2.Http()
    url = "http://{}:{}/websearchapi/search.php?{}" # Update this url to whatever your search engine supports

    parameters = {'start': start, 'numdisplay': numdisplay, 'query': query}
    parameters = urllib.parse.urlencode(parameters)
    resp, content = h.request(url.format(SEARCH_SERVER_IP,
                                         SEARCH_SERVER_PORT,
                                         parameters),
                              method="GET")

    if resp and resp.get("status") == "200":
        xmlDict = xmltodict.parse(content)
        try:
            xmlResult = xmlDict['search-response']['results']['result']
        except TypeError:
            return None, None, None

        doc_ids = []
        result = OrderedDict()

        if not isinstance(xmlResult, list):
            xmlResult = [xmlResult]

        for doc in xmlResult:
            docno = doc["docno"].zfill(7)
            parsed_doc = {
                "rank": doc["rank"],
                "docno": docno,
                "title": doc["title"],
                "snippet": doc["snippet"]
            }
            result[docno] = parsed_doc
            doc_ids.append(docno)

        return result, doc_ids,  xmlDict['search-response']['total-time']

    return None, None, None
</code>
</pre>


        <h5>Logging</h5>

        <p>We use python’s built in logging module for system and analytics logging. You can configure logging from the project settings in <code>settings/base.py</code>.
        <p>The repo currently has one logging handle. Logs are saved in <code>logs/web.log</code>.</p>
        <p>Check <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/settings/base.py"><code>config/settings/base.py</code></a> for more logging configurations.</p>

        <h6>Using the logger</h6>

        <p>Once you have configured your logger, you can start placing logging calls into your code. This is very simple to do. Here’s an example:</p>
<pre>
<code>
import logging
logger = logging.getLogger(__name__)

class JudgmentAJAXView(views.CsrfExemptMixin, views.LoginRequiredMixin,
                       views.JsonRequestResponseMixin,
                       generic.View):

    # some code
    ...

    # log message
    log_message = {
        "user": self.request.user.username,
        "client_time": client_time,
         # ...
        }
    }
    # log message
    logger.info("{}".format(json.dumps(log_message)))
</code>
</pre>

    <p>We suggest you follow a simple logging style that you can easily parse and understand. Place logging message where ever you need to log/track changes in the systems or user behaviour.</p>

    <h5>Components</h5>

    <p>Our system currently supports two retrieval components, CAL and Search.
        All components in the architecture are stand-alone and interact with each other via HTTP API calls.
        You can also add your own component if you wish.
    </p>

    <h6>Adding a component</h6>
    <p>Let’s go through an example of adding a new component. For simplicity, the component will simply return a list of predefine documents that need to be judged. The component will allow user to access and judge these documents in the order they are define. We’ll call this component Iterative.</p>

    <p>Let’s add a new app in our project by running this command:</p>

<pre><code>$ python manage.py createapp iterative</code></pre>

<p>This will create a folder with all the files we need. You might need to move the directory. Make sure the newly created app is in the right directory. It should be in <code>/HiCAL</code>.</p>
<p>Let’s add the new app to our settings file so that our server is aware of it. You can do this by going to <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/settings/base.py"><code>config/settings/base.py</code></a> file and adding <code>HiCAL.iterative</code> to <code>LOCAL_APPS</code>:

<pre>
<code>
# Apps specific for this project go here.
LOCAL_APPS = [
    # custom users app
    ...
    'hical.iterative',
]
</code>
</pre>

        <p>Now add a specific url for the app in <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/config/urls.py"><code>config/urls.py</code></a>:</p>
<pre>
<code>
urlpatterns = [
    ...
    # add this
    url(r'^iterative/', include('hical.iterative.urls', namespace='iterative')),

]
</code>
</pre>

    <p>and create a new <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/hicalweb/iterative/urls.py"><code>urls.py</code></a> file under the iterative app directory:</p>

<pre>
<code>
from django.conf.urls import url

from hical.iterative import views

urlpatterns = [
    url(r'^$', views.HomePageView.as_view(), name='main'),

    # Ajax views
    url(r'^post_log/$', views.MessageAJAXView.as_view(), name='post_log_msg'),
    url(r'^get_docs/$', views.DocAJAXView.as_view(), name='get_docs'),
]
</code>
</pre>

<p>This will allow access to the component from the url <code>/iterative/</code></p>

<p>We will use <code>get_docs/</code> url pattern to link to our view <code>DocAJAXView</code>, which will be responsible for getting the documents to judge. Please look at <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/hicalweb/iterative/views.py"><code>iterative/views.py</code></a> for more details on what these views do.</p>

<p>Let’s look at <code>DocAJAXView</code> view as an example:</p>

<pre>
<code>
from interfaces.DocumentSnippetEngine import functions as DocEngine
from interfaces.Iterative import functions as IterativeEngine

class DocAJAXView(views.CsrfExemptMixin, views.LoginRequiredMixin,
                  views.JsonRequestResponseMixin,
                  views.AjaxResponseMixin, generic.View):
    """
    View to get a list of documents to judge
    """
    require_json = False

    def render_timeout_request_response(self, error_dict=None):
        if error_dict is None:
            error_dict = self.error_response_dict
        json_context = json.dumps(
            error_dict,
            cls=self.json_encoder_class,
            **self.get_json_dumps_kwargs()
        ).encode('utf-8')
        return HttpResponse(
            json_context, content_type=self.get_content_type(), status=502)

    def get_ajax(self, request, *args, **kwargs):
        try:
            current_task = self.request.user.current_task
            docs_ids = IterativeEngine.get_documents(current_task.topic.number)
            docs_ids = helpers.remove_judged_docs(docs_ids,
                                                  self.request.user,
                                                  current_task)
            documents = DocEngine.get_documents(docs_ids, query=None)
            return self.render_json_response(documents)
        except TimeoutError:
            error_dict = {u"message": u"Timeout error. Please check status of servers."}
            return self.render_timeout_request_response(error_dict)
</code>
</pre>

<p>The method <code>get_ajax</code> will be called from our template to retrieve a list of documents. The variable <code>docs_ids</code> contains the list of documents ids that we need to retrieve their content. We put all functions associated with the component retrieval in <code>/interfaces</code> directory.</p>

<p>The variable <code>documents</code> will contain a list of documents with their content, in the following format:</p>

<pre>
<code>
[
    {
        'doc_id': "012345",
        'title': "Document Title",
        'content': "Body of document",
        'date': "Date of document"
    },
    ...
]
</code>
</pre>

<p><code>documents</code> will be passed as context to our template, and loaded in the browser.</p>

<p>The HTML associated is under <a href="https://github.com/hical/HiCAL/blob/master/HiCALWeb/hicalweb/iterative/templates/iterative/iterative.html"><code>iterative/templates/iterative/iterative.html</code></a>. The <code>HomePageView</code> view, which the url pattern <code>/iterative/^</code> is pointing to, will render this page. If you look at the <code>HomePageView</code> view, you will see that <code>template_name = 'iterative/iterative.html'</code>.</p>

<p>You can modify the <code>iterative.html</code> file as you like. The <code>iterative.html</code> page will call <code>iterative:get_docs</code> url once the page is loaded. You can customize the behaviour by modifying the javascript code in <code>iterative.html</code> to meet your needs.</p>

<p>The judgment buttons in the interface will call the <code>send_judgment()</code> function, which will send a call to the url <code>judgment:post_judgment</code> in <code>judgments/url.py</code> and run the view <code>JudgmentAJAXView</code> that will save the document's judgment to the database. If you would like to include more information to be saved with each judgment, you can modify the <code>send_judgment()</code> in <code>iterative.html</code>. You will also need to modify the database model, <code>Judgement</code>, associated with each judgment instance. The <code>Judgement</code> class is located in <code>judgment/models.py</code>.</p>


        <h6>CAL and Search</h6>
<p>Both components follow a similar pattern as the Iterative component described above. The difference is that both are running separately. Communication to and from each component (e.g. getting the documents we need to judge) is done through HTTP calls. The IPs for each component must be set in <code>config/settings/base.py</code>. Take a look at the methods used to communicate with components (found in <code>/interfaces/</code>) for further details.</p>

<p>The HTML associated with each component is found under the templates folder under the component directory.</p>


    <h4><a name="HiCAL-engine">HiCAL engine</a></h4>
<h5 id="requirements">Requirements</h5>
<ul>
<li>libfcgi</li>
<li>libarchive</li>
<li>g++</li>
<li>make</li>
<li>spawn-fcgi (to run <code>bmi_fcgi</code>)</li>
</ul>
<h5 id="command-line-tool">Command Line Tool</h5>
<pre><code>$ make bmi_cli
$ ./bmi_cli --help
Command line flag options: 
      --help                Show Help
      --async-mode          Enable greedy async mode <span class="hljs-keyword">for</span> classifier <span class="hljs-keyword">and</span> rescorer,
                            overrides --judgment-per-iteration <span class="hljs-keyword">and</span> --num-iterations

      --df                  Path <span class="hljs-keyword">of</span> the <span class="hljs-keyword">file</span> <span class="hljs-keyword">with</span> list <span class="hljs-keyword">of</span> terms <span class="hljs-keyword">and</span> their document
                            frequencies. The <span class="hljs-keyword">file</span> contains space-separated word <span class="hljs-keyword">and</span>
                            df <span class="hljs-keyword">on</span> every line. Specify only when df information <span class="hljs-keyword">is</span>
                            <span class="hljs-keyword">not</span> encoded <span class="hljs-keyword">in</span> the document features <span class="hljs-keyword">file</span>.

      --jobs                Number <span class="hljs-keyword">of</span> concurrent jobs (topics)
      --threads             Number <span class="hljs-keyword">of</span> threads <span class="hljs-keyword">to</span> use <span class="hljs-keyword">for</span> scoring
      --doc-features        Path <span class="hljs-keyword">of</span> the <span class="hljs-keyword">file</span> <span class="hljs-keyword">with</span> list <span class="hljs-keyword">of</span> document features
      --para-features       Path <span class="hljs-keyword">of</span> the <span class="hljs-keyword">file</span> <span class="hljs-keyword">with</span> list <span class="hljs-keyword">of</span> paragraph features (BMI_PARA)
      --qrel                Qrel <span class="hljs-keyword">file</span> <span class="hljs-keyword">to</span> use <span class="hljs-keyword">for</span> judgment
      --max-effort          <span class="hljs-keyword">Set</span> max effort (number <span class="hljs-keyword">of</span> judgments)
      --max-effort-factor   <span class="hljs-keyword">Set</span> max effort <span class="hljs-keyword">as</span> a factor <span class="hljs-keyword">of</span> recall
      --num-iterations      <span class="hljs-keyword">Set</span> max number <span class="hljs-keyword">of</span> refresh iterations
      --training-iterations  <span class="hljs-keyword">Set</span> number <span class="hljs-keyword">of</span> training iterations
      --judgments-per-iteration  Number <span class="hljs-keyword">of</span> docs <span class="hljs-keyword">to</span> judge per iteration (-<span class="hljs-number">1</span> <span class="hljs-keyword">for</span> BMI <span class="hljs-keyword">default</span>)

      --query               Path <span class="hljs-keyword">of</span> the <span class="hljs-keyword">file</span> <span class="hljs-keyword">with</span> queries (one seed per line; each
                            line <span class="hljs-keyword">is</span> &lt;topic_id&gt; &lt;rel&gt; &lt;<span class="hljs-keyword">string</span>&gt;; can have multiple seeds per topic)

      --judgment-logpath    Path <span class="hljs-keyword">to</span> log judgments. Specify a directory within which
                            topic-specific logs will be generated.

      --mode                <span class="hljs-keyword">Set</span> strategy: (<span class="hljs-keyword">default</span>) BMI_DOC, BMI_PARA, BMI_PARTIAL_RANKING,
                            BMI_ONLINE_LEARNING, BMI_PRECISION_DELAY, BMI_RECENCY_WEIGHTING, BMI_FORGET
    ... (Strategy specific arguments truncated)

</code></pre><ul>
<li>The document frequency data is encoded within the document features bin file.
<code>--df</code> shouldn&#39;t be used unless you are using the old document feature format.</li>
</ul>
<h5 id="corpus-parser">Corpus Parser</h5>
<p>This tool generates document features of a given corpus.</p>
<p>The tool takes as input an archived corpus <code>tar.gz</code>. Each file in the corpus is treated
as a separate document with the name of the file (excluding the directories) as its ID.
The files are processed as plaintext document. This gives users freedom to clean and
form corpuses in different ways. In order to use the <code>BMI_PARA</code> mode in the main tool,
the user needs to split each document such that each new document is a single paragraph.
The name of these paragraph files should be <code>&lt;doc-id&gt;.&lt;para-id&gt;</code> (this is used by the tool
to get the parent document for a given paragraph). For this reason, avoid using the character <code>.</code> in
the <code>&lt;doc-id&gt;</code>.</p>
<p><code>--type</code> as <code>svmlight</code> is not used by the main tool and should be used for debugging purposes.
<code>--out-df</code> only works when <code>--type</code> is <code>svmlight</code>.</p>
<p>The bin output format begins with the document frequency data followed by the document feature
data. The document frequency data begins with a <code>uint32_t</code> specifying the number of <code>dfreq</code> records which follow.
Each <code>dfreq</code> record begins with a null terminated string (word) followed by a <code>uint32_t</code> document frequency.
The document feature data begins with a <code>uint32_t</code> specifying the number of <code>dfeat</code> records which follow.
Each <code>dfeat</code> record begins with a null terminated string (document ID) followed by a <code>feature_list</code>. <code>feature_list</code>
begins with a <code>uint32_t</code> specifying the number of <code>feature_pair</code> which follow. Each <code>feature_pair</code>
is a <code>uint32_t</code> feature ID followed by a <code>float</code> feature weight.</p>
<pre><code>$ make corpus_parser
$ ./corpus_parser --<span class="hljs-keyword">help</span>
Command <span class="hljs-keyword">line</span> flag options: 
      --<span class="hljs-keyword">help</span>                Show <span class="hljs-keyword">Help</span>
      --<span class="hljs-keyword">in</span>                  <span class="hljs-keyword">Input</span> corpus archive
      --<span class="hljs-keyword">out</span>                 Output feature <span class="hljs-keyword">file</span>
      --<span class="hljs-keyword">out</span>-df              Output document frequency <span class="hljs-keyword">file</span>
      --<span class="hljs-keyword">type</span>                Output <span class="hljs-keyword">file</span> <span class="hljs-keyword">format</span>:  bin (default) or svmlight
</code></pre><h5 id="fastcgi-based-web-server">FastCGI based web server</h5>
<pre><code>$ make bmi_fcgi
$ ./bmi_fcgi --help
Command line flag <span class="hljs-attribute">options</span>: 
      --df                  Path <span class="hljs-keyword">of</span> the file <span class="hljs-keyword">with</span> <span class="hljs-built_in">list</span> <span class="hljs-keyword">of</span> terms and their <span class="hljs-built_in">document</span> frequencies
      --doc-features        Path <span class="hljs-keyword">of</span> the file <span class="hljs-keyword">with</span> <span class="hljs-built_in">list</span> <span class="hljs-keyword">of</span> <span class="hljs-built_in">document</span> features
      --help                Show Help
      --para-features       Path <span class="hljs-keyword">of</span> the file <span class="hljs-keyword">with</span> <span class="hljs-built_in">list</span> <span class="hljs-keyword">of</span> paragraph features
      --threads             <span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> threads to use <span class="hljs-keyword">for</span> scoring
</code></pre><p><code>fcgi</code> libraries needs to be present in the system. <code>bmi_fcgi</code> uses <code>libfcgi</code> to communicate
with any FastCGI capable web server like nginx. <code>bmi_fcgi</code> only supports modes <code>BMI</code> and <code>BMI_PARA</code>.
Install <code>spawn-fcgi</code> in your system and run </p>
<pre><code>$ spawn-fcgi -p <span class="hljs-number">8002</span> -n -- bmi_fcgi --doc-features <span class="hljs-regexp">/path/</span>to<span class="hljs-regexp">/doc/</span>features --df <span class="hljs-regexp">/path/</span>to<span class="hljs-regexp">/df</span>
</code></pre><p>You can interact with the bmi_fcgi server through the HTTP API or the python bindings in <code>api.py</code>.</p>
<h5 id="http-api-spec">HTTP API Spec</h5>
<h6 id="begin-a-session">Begin a session</h6>
<pre><code>POST /begin
Data Params:
    session_id=[string]
    seed_query=[string]
    async=[bool]
    mode=[para,doc]
    seed_judgments=doc1:rel1,doc2:rel2
    judgments_per_iteration=[<span class="hljs-string">-1</span> or positive integer]

<span class="hljs-keyword">Success </span>Response:
    Code: 200
    Content: {'session-id': [string]}

<span class="hljs-keyword">Error </span>Response:
    Code: 400
    Content: {'error': 'session already exists'}
</code></pre><p>The server can run multiple sessions each of which are identified by a unique <code>sesion_id</code>.
The <code>seed_query</code> is the initial relevant string used for training. <code>async</code> determines if
the server should wait to respond while refreshing. If <code>async</code> is set to false, the server
immediately returns the next document to judge from the old ranklist while the refresh is
ongoing in the background. <code>mode</code> sets whether to judge on paragraphs or documents.
<code>seed_documents</code> can be used to initialize the session with some pre-determined judgments.
It can also be used to restores states of the sessions when restarting the bmi server.
Use positive integer for relevant and negative integer (or zero) for non-relevant.</p>
<h6 id="get-document-to-judge">Get document to judge</h6>
<pre><code>GET /get_docs
URL Params:
    session_id=[string]
    max_count=[int]

<span class="hljs-keyword">Success </span>Response:
    Code: 200
    Content: {'session-id': [string], 'docs': ["doc<span class="hljs-string">-1001</span>", "doc<span class="hljs-string">-1002</span>", "doc<span class="hljs-string">-1010</span>"]}

<span class="hljs-keyword">Error </span>Response:
    Code: 404
    Content: {'error': 'session not found'}
</code></pre><h6 id="submit-judgment">Submit Judgment</h6>
<pre><code>POST /judge
Data Params:
    session_id=[string]
    doc_id=[string]
    rel=[<span class="hljs-string">-1</span>,1]

<span class="hljs-keyword">Success </span>Response:
    Code: 200
    Content: {'session-id': 'xyz', 'docs': ["doc<span class="hljs-string">-1001</span>", "doc<span class="hljs-string">-1002</span>", "doc<span class="hljs-string">-1010</span>"]}

<span class="hljs-keyword">Error </span>Response:
    Code: 404
    Content: {'error': 'session not found'}

    Code: 404
    Content: {'error': 'doc not found'}

    Code: 400
    Content: {'error': 'invalid judgment'}
</code></pre><p>The <code>docs</code> field in the success response provides next documents to be judged.
This is to save an additional call to <code>/get_docs</code>. If <code>async</code> is set to false and
the judgement triggers a refresh, the server will finish the refresh before responding.</p>
<h6 id="get-full-ranklist">Get Full Ranklist</h6>
<pre><code>GET /get_ranklist
URL Params:
    session_id=[string]

<span class="hljs-keyword">Success </span>Response:
    Code: 200
    Content (text/plain): newline separated document ids with their scores ordered
                          from highest to lowest

<span class="hljs-keyword">Error </span>Response:
    Code: 404
    Content: {'error': 'session not found'}
</code></pre><p>Returns full ranklist based on the classifier from the last refresh.</p>
<h6 id="delete-a-session">Delete a session</h6>
<pre><code>DELETE /delete_session
Data Params:
    session_id: [string]

<span class="hljs-keyword">Success </span>Response:
    Code: 200
    Content: {'session-id': [string]}

<span class="hljs-keyword">Error </span>Response:
    Code: 404
    Content: {'error': 'session not found'}
</code></pre>


    </div>

    <div class="content">
        <ul class="sm" style="list-style: none;">
            <li>
                <a href="https://github.com/HiCAL" target="_blank"><i class="fa fa-github"></i></a> <span class="ml-3">/</span>
            </li>
            <li class="small mt-1">
                GitHub code.
            </li>
            <li class="small mt-1">
                GNU General Public License v3.0
            </li>
        </ul>
        <p class="copy"></p>
    </div>

</body>
</html>