Documentation and RAM limit #2

Open: wants to merge 8 commits into base: main (showing changes from 7 commits)
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.idea
prevalence
target
150 changes: 150 additions & 0 deletions README.md
@@ -0,0 +1,150 @@
# Wikispeech-Prerender

## Installing and running

### Requirements

```
apt-get install maven openjdk-11-jdk
```

This service keeps its state in RAM using the [system prevalence pattern](https://en.wikipedia.org/wiki/System_prevalence), rather than a database.
The service has been coded so that the heap hopefully shouldn't grow larger than 1 GB,
but there is no guarantee that this limit won't be exceeded. If the machine has plenty of RAM,
consider increasing the -Xmx value in ```run.sh```.

### For the first time
```
./run.sh            # start service on port 9090
./register-wiki.sh  # [consumer url, defaults to svwp]
```


### Once installed
```
./run.sh
```

### Clear state and start from scratch

The system state is stored in the directory ```prevalence```. To start from scratch,
simply stop the service, delete the ```prevalence``` directory and start the service again.
At this point you will once again have to register the wikis you want to pre-render.

```
rm -rf prevalence
./run.sh
./register-wiki.sh
```


## What this service does

* It finds pages to segment and synthesize by:
  * Polling main page metadata once every minute to detect updates.
    (This could be improved by listening to recent changes, but that requires consideration.)
  * Harvesting wiki links from the main page.
  * Polling for updated pages from recent changes.

All you need to do is register the "consumer URL" of a wiki (e.g. ```https://sv.wikipedia.org/w```), and this service will figure out everything else: languages, voices, etc.

The order in which segments are synthesized is determined by priority settings:

* The further down a segment occurs on a page (the greater the segment index),
the lower the priority the segment receives. This is a minuscule change of priority.
* Pages linked from the main page get a multiplication factor of 5 on all segments.
* The main page gets a multiplication factor of 10 on all segments.

Basically this means the following order when synthesizing:
1. All segments in the main page.
2. The first segment in pages linked from the current main page.
3. The second segment in pages linked from the current main page.
4. ... until all segments in all pages linked from the current main page are synthesized.
5. The first segment in pages found in recent changes.
6. The second segment in pages found in recent changes.
7. ... until all segments in all pages found in recent changes have been synthesized.

Candidates to be synthesized are re-evaluated every five minutes.
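The rules above can be sketched as a single priority score. The multipliers (10 and 5) come from the list above, but the per-index penalty and the method shape are illustrative assumptions, not the service's actual code:

```java
// Illustrative sketch of the priority rules described in this README.
// The 0.0001 per-index penalty is an assumed value for the "minuscule"
// decrease; the real implementation may differ.
class SegmentPriority {

    static double priority(int segmentIndex, boolean onMainPage, boolean linkedFromMainPage) {
        // The further down the page, the (minusculely) lower the priority.
        double base = 1.0 - segmentIndex * 0.0001;
        if (onMainPage) return base * 10;        // main page: factor 10
        if (linkedFromMainPage) return base * 5; // linked from main page: factor 5
        return base;                             // e.g. pages from recent changes
    }
}
```

With this scoring, every main-page segment outranks linked-page segments, which in turn outrank recent-changes segments, matching the numbered order above.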

### Automatic flushing of segments

As the number of candidates to be synthesized can grow very large in a rather short time,
a flushing mechanism kicks in when there are more than 100,000 candidates in the queue,
removing those with the lowest priority and retaining the top 100,000.

Segment flushing exists in order to save RAM, as the application state is kept on the heap.
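The flushing described above amounts to a top-N selection. A minimal sketch using a bounded min-heap, assuming priorities are plain doubles (illustrative, not the service's implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative top-N retention: once more than `limit` candidates exist,
// the lowest-priority ones are discarded.
class CandidateFlusher {

    static List<Double> retainTop(List<Double> priorities, int limit) {
        // Min-heap: the root is always the lowest priority currently retained.
        PriorityQueue<Double> heap = new PriorityQueue<>();
        for (Double p : priorities) {
            heap.add(p);
            if (heap.size() > limit) {
                heap.poll(); // drop the lowest-priority candidate
            }
        }
        return new ArrayList<>(heap);
    }
}
```

With the threshold from the text, `retainTop(candidates, 100_000)` would keep only the highest-priority work in memory.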

### Automatic flushing of pages

After one day of inactivity on a page on a wiki,
the rendering state for that page becomes a candidate for being flushed out.
If there are still segments that have not been synthesized, this occurs after two days instead.

Flushing a page means that if the page changes after the flush,
the complete page will be re-synthesized.
(Re-synthesized as in requested again; the Wikispeech backend might in fact have it cached.)

The main page will never be flushed out.

Pages that are linked to from the main page will not be flushed out until five days after they were last seen on the main page.

Page flushing exists in order to save RAM, as the application state is kept on the heap.

### Failing segment voices

A failing segment voice becomes a candidate to be retried every n hours, where n is the number of previous failures.
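Read literally, this is a linear backoff. A hypothetical sketch of that rule (the method and class names are illustrative, not the service's API):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical reading of the retry rule: the wait before the next attempt
// grows linearly with the number of previous failures.
class RetryPolicy {

    static Instant nextRetry(Instant lastFailure, int previousFailures) {
        return lastFailure.plus(Duration.ofHours(previousFailures));
    }
}
```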

## TODO

* Add feature in Wikispeech to not send audio response on synthesis, in order to minimize network data.
* Add feature in Wikispeech to list all cached segments and voices for a given page, in order to avoid requesting synthesis when not needed.
* Make all hard-coded values mentioned above configurable in a properties file or similar.
* Report state of candidate, flushing, etc to influxdb.

## REST

Most REST calls exist for debugging and development purposes.
As a user of this service, all you need is ```POST /api/wiki```

### POST /api/wiki

Registers a wiki to be monitored and pre-rendered by this service.

* consumerUrl: Script path URL of the consumer wiki to be monitored for pre-rendering (i.e. the ```/w``` part is expected).

* initialLastRecentChangesLimitInMinutes: (default: 60) Number of minutes of initial recent changes backlog to be processed.
* mainPagePriority: (default: 10) Base priority multiplier for segments on the Wiki main page.

* maximumSynthesizedVoiceAgeInDays: (default: 30) Number of days before attempting to re-synthesize segments on this Wiki.

If ```initialLastRecentChangesLimitInMinutes``` is set to 0, then only new recent changes will be processed.


Example: ```POST http://host:port/api/wiki?consumerUrl=https://sv.wikipedia.org/w```

### GET /api/synthesis/queue/candidates
* limit: (default 100) Maximum number of results
* startOffset: (default 0) Start offset for pagination

Queue of Wiki page segments in line to be synthesized using a specific language and voice.

### DELETE /api/synthesis/queue

Clears queue of Wiki page segments in line to be synthesized.

### GET /api/synthesis/errors
* limit: (default 100) Maximum number of results
* startOffset: (default 0) Start offset for pagination

A list of errors that have occurred during synthesis of Wiki page segments.

### GET /api/page
* consumerUrl: Consumer URL of the wiki
* title: Wiki page title

Example: ```GET http://host:port/api/page?consumerUrl=https://sv.wikipedia.org/w&title=Portal:Huvudsida```

Displays status and statistics about a given Wiki page.
* Priority
* Language
* Revision at segmentation
* Timestamp segmented
* Segments
* Voices synthesized
* Timestamp synthesized
* Synthesized revision
* etc
9 changes: 9 additions & 0 deletions register-wiki.sh
@@ -0,0 +1,9 @@
#!/usr/bin/env bash

if [ $# -eq 0 ]; then
consumerUrl="https://sv.wikipedia.org/w"
else
consumerUrl=$1
fi

curl -d "consumerUrl=${consumerUrl}&initialLastRecentChangesLimitInMinutes=0&mainPagePriority=10&maximumSynthesizedVoiceAgeInDays=30" -X POST http://localhost:9090/api/wiki
4 changes: 2 additions & 2 deletions run.sh
@@ -1,4 +1,4 @@
#!/bin/bash
-export MAVEN_OPTS="-Xmx2g"
+export MAVEN_OPTS="-Xmx3g"
mvn clean install
-mvn exec:java -Dinflux.username="" -Dinflux.password="" -Dexec.mainClass="se.wikimedia.wikispeech.prerender.WebApp"
+mvn exec:java -Dinflux.username="" -Dinflux.password="" -Dexec.mainClass="se.wikimedia.wikispeech.prerender.WebApp" -Dserver.port="9090"
@@ -1,6 +1,5 @@
package se.wikimedia.wikispeech.prerender;

-import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

@@ -11,17 +11,24 @@
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.SchedulingConfigurer;
import org.springframework.scheduling.config.ScheduledTaskRegistrar;
import se.wikimedia.wikispeech.prerender.service.Settings;

import java.io.IOException;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

@Configuration
@EnableScheduling
@EnableAutoConfiguration(exclude = {DataSourceAutoConfiguration.class})
@ComponentScan(basePackages = "se.wikimedia.wikispeech.prerender")
-public class WebAppConfiguration {
+public class WebAppConfiguration implements SchedulingConfigurer {

@Bean
-public OkHttpClient okHttpClient() {
+public OkHttpClient okHttpClient(Settings settings) {
return new OkHttpClient.Builder()
.readTimeout(5, TimeUnit.MINUTES)
.addInterceptor(
@@ -33,7 +40,7 @@ public Response intercept(@NotNull Chain chain) throws IOException {
Request requestWithUserAgent = originalRequest
.newBuilder()
.header("Content-Type", "application/json")
-.header("User-Agent", "WMSE Wikispeech API Java client")
+.header("User-Agent", settings.getString("WebAppConfiguration.userAgent", "WMSE Wikispeech Prerender"))
.build();

return chain.proceed(requestWithUserAgent);
@@ -42,4 +49,14 @@ public Response intercept(@NotNull Chain chain) throws IOException {
.build();
}

@Bean
public Executor taskExecutor() {
return Executors.newScheduledThreadPool(10);
}

@Override
public void configureTasks(ScheduledTaskRegistrar taskRegistrar) {
taskRegistrar.setScheduler(taskExecutor());
}

}
@@ -18,6 +18,13 @@
@Component
public class PageApi {

public static void main(String[] args) throws Exception {
PageApi api = new PageApi(new OkHttpClient());
PageInfo mainpage = api.getPageInfo("https://sv.wikipedia.org/w", "Portal:Huvudsida");
PageInfo quisling = api.getPageInfo("https://sv.wikipedia.org/w", "Vidkun_Quisling");
System.currentTimeMillis();
}

private final Logger log = LogManager.getLogger(getClass());

private final OkHttpClient client;
@@ -74,8 +81,9 @@ public okhttp3.Headers getHttpHeaders(String consumerUrl, String title) throws IOException {
return response.headers();
}

+@Autowired
public PageApi(
-@Autowired OkHttpClient client
+OkHttpClient client
) {
this.client = client;
objectMapper = new ObjectMapper()
@@ -0,0 +1,10 @@
package se.wikimedia.wikispeech.prerender.mediawiki;

public class PageUtil {

public static String normalizeTitle(String title) {
title = title.replaceAll("_", " ");
return title;
}

}
@@ -0,0 +1,91 @@
package se.wikimedia.wikispeech.prerender.mediawiki;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
import lombok.Data;
import okhttp3.*;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.io.IOException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;

@Component
public class WikipediaMetricsApi {

public static void main(String[] args) throws Exception {
WikipediaMetricsApi api = new WikipediaMetricsApi(new OkHttpClient());
WikipediaMetricsApi.PageViews pageViews = api.getPageViewsTop("sv.wikipedia", LocalDate.parse("2023-03-14"));
System.currentTimeMillis();

}

private final OkHttpClient client;
private final ObjectMapper objectMapper;

@Autowired
public WikipediaMetricsApi(
OkHttpClient client
) {
this.client = client;
objectMapper = new ObjectMapper()
.registerModule(new JavaTimeModule())
.configure(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS, false);

}

private static final DateTimeFormatter pathSuffixDateFormatter = DateTimeFormatter.ofPattern("yyyy/MM/dd");

public PageViews getPageViewsTop(String wiki, LocalDate date) throws IOException {
// /sv.wikipedia/all-access/2023/03/27
HttpUrl.Builder urlBuilder = HttpUrl.parse("https://wikimedia.org/api/rest_v1/metrics/pageviews/top/" + wiki + "/all-access/" + date.format(pathSuffixDateFormatter)).newBuilder();

Request request = new Request.Builder()
.url(urlBuilder.build())
.build();

Call call = client.newCall(request);
Response response = call.execute();

JsonNode json;
try {

if (response.code() == 404)
// occurs if there is no data for this date, e.g. a future date seen from the tz of the remote server
return null;

if (response.code() != 200) {
throw new IOException("Response " + response);
}

json = objectMapper.readTree(response.body().byteStream());
} finally {
response.close();
}

return objectMapper.convertValue(json.get("items").get(0), PageViews.class);
}

@Data
public static class PageViews {
private String project;
private String access;
private String year;
private String month;
private String day;
private List<PageViewArticle> articles;
}

@Data
public static class PageViewArticle {
private String article;
private int views;
private int rank;

}

}