Download data from MyAnimeList, clean it and load it into datasets

6/7/2022

The goal was to create a simple script that would extract and clean data from the MyAnimeList database and store it in CSV files.

Data was extracted from the MyAnimeList database using the MyAnimeList API
Cleaned and transformed data was stored into CSV files
The resulting datasets were uploaded to Kaggle

Prerequisites

To access the MyAnimeList API, you need an account on the website. Once you have one, create an API application in the API panel of your profile and copy its Client ID, which is sent with every request in place of a token. Visit the MyAnimeList API documentation to learn more about the API.

Acknowledgments

References:

Usage

To generate a README.md file from Jupyter Notebook, use the following command:

jupyter nbconvert --to markdown anime_extract.ipynb --output README.md

Import

Insert your Client ID and the folder in which to save the datasets into the code below.

import pandas as pd
import numpy as np
import requests
import time
import ast

CLIENT_ID = 'YOUR_CLIENT_ID' # Client ID from MyAnimeList API
DATA_FOLDER = 'data/' # Folder to save data to
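
Hard-coding the Client ID in the notebook makes it easy to publish it by accident. The following is a minimal sketch of one alternative, assuming the ID is kept in an environment variable. The variable name `MAL_CLIENT_ID` and the helpers `load_client_id` and `build_headers` are hypothetical and not part of the original notebook.

```python
import os

def load_client_id(env=None, var_name="MAL_CLIENT_ID"):
    """Return the Client ID from the environment, or raise if it is missing.

    `var_name` is an assumed name; use whatever variable you actually set.
    """
    env = os.environ if env is None else env
    client_id = env.get(var_name, "").strip()
    if not client_id or client_id == "YOUR_CLIENT_ID":
        raise RuntimeError(f"Set the {var_name} environment variable to your Client ID")
    return client_id

def build_headers(client_id):
    """Build the header that every request in this notebook sends."""
    return {"X-MAL-CLIENT-ID": client_id}
```

With these in place, `CLIENT_ID = load_client_id()` could replace the hard-coded assignment above.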

MyAnimeList API

To learn how to use MyAnimeList API, visit the official API website: https://myanimelist.net/apiconfig/references/api/v2 and this GitHub page which contains an Unofficial API Specification: https://github.com/SuperMarcus/myanimelist-api-specification#search-anime .

#data = requests.get('https://api.myanimelist.net/v2/anime/16498?fields=id,title,main_picture,alternative_titles,start_date,end_date,synopsis,mean,rank,popularity,num_list_users,num_scoring_users,nsfw,created_at,updated_at,media_type,status,genres,num_episodes,start_season,broadcast,source,average_episode_duration,rating,pictures,background,related_anime,related_manga,recommendations,studios,statistics&limit=4', headers={'X-MAL-CLIENT-ID': CLIENT_ID}) 
data = requests.get('https://api.myanimelist.net/v2/anime/ranking?ranking_type=all&limit=500', headers={'X-MAL-CLIENT-ID': CLIENT_ID}) 
print(data.json().keys()) 
data.json() 
print(data.json()['paging']['next']) # next page

dict_keys(['data', 'paging'])
https://api.myanimelist.net/v2/anime/ranking?offset=500&ranking_type=all&limit=500
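
The `next` URL printed above suggests that paging is driven by the `offset` and `limit` query parameters. As a rough sketch, such URLs can be assembled with the standard library instead of string concatenation. The helper `ranking_url` is hypothetical, and that the server accepts arbitrary offsets is an assumption drawn only from the URL shown above; following `paging.next` remains the safer route.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.myanimelist.net/v2/anime/ranking"

def ranking_url(offset=0, limit=500, ranking_type="all"):
    """Build a ranking URL; the offset is left out for the first page."""
    params = {}
    if offset:
        params["offset"] = offset
    params["ranking_type"] = ranking_type
    params["limit"] = limit
    return BASE_URL + "?" + urlencode(params)

print(ranking_url())            # first page
print(ranking_url(offset=500))  # second page, same shape as the 'next' URL above
```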

Loop through all ranked anime and add anime id and title to a pandas dataframe

The API call has a limit of 500 anime per request.

%%time
data = requests.get('https://api.myanimelist.net/v2/anime/ranking?ranking_type=all&limit=500', headers={'X-MAL-CLIENT-ID': CLIENT_ID}) # get all anime from ranking
df_anime_ids = pd.json_normalize(data.json()['data']).drop(['node.main_picture.medium', 'node.main_picture.large','ranking.rank'], axis=1) # get only anime ids and convert to dataframe
next_url = data.json()['paging'].get('next') # get next page url (the key is absent on the last page)
while next_url is not None: # while there is a next page
    data = requests.get(next_url, headers={'X-MAL-CLIENT-ID': CLIENT_ID}) # get next page
    df_anime_ids = pd.concat([df_anime_ids, pd.json_normalize(data.json()['data']).drop(['node.main_picture.medium', 'node.main_picture.large','ranking.rank'], axis=1)], ignore_index=True) # concatenate dataframe and drop unnecessary columns
    next_url = data.json().get('paging', {}).get('next') # None when there are no more pages
df_anime_ids.head()

CPU times: total: 14.8 s
Wall time: 3min 17s

	node.id	node.title
0	5114	Fullmetal Alchemist: Brotherhood
1	28977	Gintama°
2	9253	Steins;Gate
3	38524	Shingeki no Kyojin Season 3 Part 2
4	9969	Gintama'

df_anime_ids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20521 entries, 0 to 20520
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node.id     20521 non-null  int64 
 1   node.title  20521 non-null  object
dtypes: int64(1), object(1)
memory usage: 320.8+ KB
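
The paging loop used to collect the IDs can also be written as a small generator that follows `paging.next` until it disappears. This is only a sketch: `iter_pages` and the `fetch_json` callable are hypothetical names, and the two fake pages stand in for real API responses so the example runs without network access.

```python
def iter_pages(first_url, fetch_json):
    """Yield the 'data' list of every page, following paging.next until it is absent."""
    url = first_url
    while url:
        payload = fetch_json(url)
        yield payload.get("data", [])
        url = payload.get("paging", {}).get("next")

# Stand-in for the API: two fake pages keyed by URL (illustrative data only).
fake_pages = {
    "page1": {"data": [{"node": {"id": 1}}], "paging": {"next": "page2"}},
    "page2": {"data": [{"node": {"id": 2}}], "paging": {"previous": "page1"}},
}

ids = [
    item["node"]["id"]
    for page in iter_pages("page1", fake_pages.__getitem__)
    for item in page
]
print(ids)  # [1, 2]
```

Against the real API, `fetch_json` could be something like `lambda url: requests.get(url, headers={'X-MAL-CLIENT-ID': CLIENT_ID}).json()`. Collecting all pages first and building the dataframe once also avoids calling `pd.concat` on every iteration.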

List of all fields I want to extract about a specific anime from the MyAnimeList API

fields = 'id,title,main_picture,alternative_titles,start_date,end_date,synopsis,mean,rank,popularity,num_list_users,num_scoring_users,nsfw,created_at,updated_at,media_type,status,genres,my_list_status,num_episodes,start_season,source,average_episode_duration,rating,pictures,background,related_anime,related_manga,recommendations,studios,statistics'

Display a sample request to the API, which returns all information about the specified anime in json format

The data obtained contains all fields specified in the above cell

data = requests.get('https://api.myanimelist.net/v2/anime/' + str(5114) + '?fields=' + fields, headers={'X-MAL-CLIENT-ID': CLIENT_ID})
data.json()

{'id': 5114,
 'title': 'Fullmetal Alchemist: Brotherhood',
 'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1223/96541.jpg',
  'large': 'https://api-cdn.myanimelist.net/images/anime/1223/96541l.jpg'},
 'alternative_titles': {'synonyms': ['Hagane no Renkinjutsushi: Fullmetal Alchemist',
   'Fullmetal Alchemist (2009)',
   'FMA',
   'FMAB'],
  'en': 'Fullmetal Alchemist: Brotherhood',
  'ja': '鋼の錬金術師 FULLMETAL ALCHEMIST'},
 'start_date': '2009-04-05',
 'end_date': '2010-07-04',
 'synopsis': 'After a horrific alchemy experiment goes wrong in the Elric household, brothers Edward and Alphonse are left in a catastrophic new reality. Ignoring the alchemical principle banning human transmutation, the boys attempted to bring their recently deceased mother back to life. Instead, they suffered brutal personal loss: Alphonse\'s body disintegrated while Edward lost a leg and then sacrificed an arm to keep Alphonse\'s soul in the physical realm by binding it to a hulking suit of armor.\n\nThe brothers are rescued by their neighbor Pinako Rockbell and her granddaughter Winry. Known as a bio-mechanical engineering prodigy, Winry creates prosthetic limbs for Edward by utilizing "automail," a tough, versatile metal used in robots and combat armor. After years of training, the Elric brothers set off on a quest to restore their bodies by locating the Philosopher\'s Stone—a powerful gem that allows an alchemist to defy the traditional laws of Equivalent Exchange.\n\nAs Edward becomes an infamous alchemist and gains the nickname "Fullmetal," the boys\' journey embroils them in a growing conspiracy that threatens the fate of the world.\n\n[Written by MAL Rewrite]',
 'mean': 9.14,
 'rank': 1,
 'popularity': 3,
 'num_list_users': 2892519,
 'num_scoring_users': 1843357,
 'nsfw': 'white',
 'created_at': '2008-08-21T03:35:22+00:00',
 'updated_at': '2022-04-18T05:06:13+00:00',
 'media_type': 'tv',
 'status': 'finished_airing',
 'genres': [{'id': 1, 'name': 'Action'},
  {'id': 2, 'name': 'Adventure'},
  {'id': 8, 'name': 'Drama'},
  {'id': 10, 'name': 'Fantasy'},
  {'id': 38, 'name': 'Military'},
  {'id': 27, 'name': 'Shounen'}],
 'num_episodes': 64,
 'start_season': {'year': 2009, 'season': 'spring'},
 'source': 'manga',
 'average_episode_duration': 1460,
 'rating': 'r',
 'pictures': [{'medium': 'https://api-cdn.myanimelist.net/images/anime/13/13738.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/13/13738l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/2/17090.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/2/17090l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/2/17472.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/2/17472l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/5/47603.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/5/47603l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/10/57095.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/10/57095l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/7/74317.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/7/74317l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/1521/94614.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/1521/94614l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/1208/94745.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/1208/94745l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/1223/96541.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/1223/96541l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/1286/96542.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/1286/96542l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/1629/108486.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/1629/108486l.jpg'}],
 'background': '',
 'related_anime': [{'node': {'id': 121,
    'title': 'Fullmetal Alchemist',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/10/75815.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/10/75815l.jpg'}},
   'relation_type': 'alternative_version',
   'relation_type_formatted': 'Alternative version'},
  {'node': {'id': 6421,
    'title': 'Fullmetal Alchemist: Brotherhood Specials',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1493/91571.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/1493/91571l.jpg'}},
   'relation_type': 'side_story',
   'relation_type_formatted': 'Side story'},
  {'node': {'id': 7902,
    'title': 'Fullmetal Alchemist: Brotherhood - 4-Koma Theater',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/3/76154.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/3/76154l.jpg'}},
   'relation_type': 'spin_off',
   'relation_type_formatted': 'Spin-off'},
  {'node': {'id': 9135,
    'title': 'Fullmetal Alchemist: The Sacred Star of Milos',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/2/29550.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/2/29550l.jpg'}},
   'relation_type': 'side_story',
   'relation_type_formatted': 'Side story'}],
 'related_manga': [],
 'recommendations': [{'node': {'id': 11061,
    'title': 'Hunter x Hunter (2011)',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1337/99013.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/1337/99013l.jpg'}},
   'num_recommendations': 101},
  {'node': {'id': 16498,
    'title': 'Shingeki no Kyojin',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/10/47347.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/10/47347l.jpg'}},
   'num_recommendations': 39},
  {'node': {'id': 1482,
    'title': 'D.Gray-man',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/13/75194.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/13/75194l.jpg'}},
   'num_recommendations': 23},
  {'node': {'id': 9919,
    'title': 'Ao no Exorcist',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/10/75195.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/10/75195l.jpg'}},
   'num_recommendations': 17},
  {'node': {'id': 38000,
    'title': 'Kimetsu no Yaiba',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1286/99889.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/1286/99889l.jpg'}},
   'num_recommendations': 15},
  {'node': {'id': 1575,
    'title': 'Code Geass: Hangyaku no Lelouch',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/5/50331.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/5/50331l.jpg'}},
   'num_recommendations': 15},
  {'node': {'id': 2251,
    'title': 'Baccano!',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/3/14547.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/3/14547l.jpg'}},
   'num_recommendations': 15},
  {'node': {'id': 22199,
    'title': 'Akame ga Kill!',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1429/95946.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/1429/95946l.jpg'}},
   'num_recommendations': 10},
  {'node': {'id': 23755,
    'title': 'Nanatsu no Taizai',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/8/65409.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/8/65409l.jpg'}},
   'num_recommendations': 9},
  {'node': {'id': 1735,
    'title': 'Naruto: Shippuuden',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1565/111305.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/1565/111305l.jpg'}},
   'num_recommendations': 9}],
 'studios': [{'id': 4, 'name': 'Bones'}],
 'statistics': {'status': {'watching': '220764',
   'completed': '2095325',
   'on_hold': '98772',
   'dropped': '45327',
   'plan_to_watch': '432363'},
  'num_list_users': 2892551}}
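
To see how a response like the one above turns into dataframe columns, the sketch below runs `pd.json_normalize` on a trimmed-down, made-up record. Nested dictionaries become dotted column names such as `main_picture.medium`, while lists (genres, studios, recommendations) stay in a single cell as Python lists, which is why they are unpacked separately later on.

```python
import pandas as pd

# Trimmed, illustrative record with the same shape as the API response.
sample = {
    "id": 5114,
    "main_picture": {"medium": "m.jpg", "large": "l.jpg"},
    "start_season": {"year": 2009, "season": "spring"},
    "genres": [{"id": 1, "name": "Action"}, {"id": 2, "name": "Adventure"}],
}

flat = pd.json_normalize(sample)
print(sorted(flat.columns))                  # dotted names for nested dicts
print(type(flat.loc[0, "genres"]).__name__)  # the list is kept as one cell
```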

Loop through all anime ids from previous dataframe and extract anime information for each anime

Filter out anime without mean score and rank (adult anime)
Pause for 5 minutes every 500 requests to avoid being blocked

%%time
print("Starting data acquisition for every anime ID in the list")
rq_limit = 500
print("Limited to " + str(rq_limit) + " sequential requests, otherwise server denies requests")

data = requests.get('https://api.myanimelist.net/v2/anime/' + str(df_anime_ids['node.id'][0]) + '?fields=' + fields, headers={'X-MAL-CLIENT-ID': CLIENT_ID}) # get json data for anime with myanimelist api
df_anime = pd.json_normalize(data.json()) # convert json to pandas dataframe

for cnt in range(1, len(df_anime_ids['node.id'])): # loop through all anime IDs
    data = requests.get('https://api.myanimelist.net/v2/anime/' + str(df_anime_ids['node.id'][cnt]) + '?fields=' + fields, headers={'X-MAL-CLIENT-ID': CLIENT_ID})
    try: # if the anime is found
        anim_json = data.json()  # get the json
        if(not np.isnan(anim_json['mean']) and not np.isnan(anim_json['rank'])): # if mean and rank are not nan
            #anim_json.pop('background', None) # remove background          
            df_anime = pd.concat([df_anime, pd.json_normalize(anim_json)], ignore_index=True)#.drop(['node.main_picture.medium', 'node.main_picture.large','ranking.rank'], axis=1)
    except (KeyError, TypeError, ValueError): # anime has no mean or rank, or the response is not valid JSON
        pass
    if (cnt > 1 and cnt % 100 == 0): # print progress every 100 requests
        print(str(cnt) + " requests")
    if (cnt % rq_limit == 0):                           
        print("Waiting for 5 minutes before continuing") # pause the requests for 5 minutes, otherwise, the server will deny requests
        for minutes_left in range(5, 0, -1): # wait for 5 minutes
            print(str(minutes_left) + " minutes left")
            time.sleep(60)
        print("Starting up again")
df_anime.head(5)
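
The pause logic in the loop above can be separated from the download code. Below is a minimal sketch of such a helper. The name `throttled` and its parameters are hypothetical, and the defaults of 500 requests and 300 seconds simply mirror the values used in this notebook rather than any documented rate limit.

```python
import time

def throttled(items, batch_size=500, pause_seconds=300, sleep=time.sleep):
    """Yield items one by one, pausing after every full batch except the last."""
    items = list(items)
    for count, item in enumerate(items, start=1):
        yield item
        if count % batch_size == 0 and count < len(items):
            sleep(pause_seconds)

# Demonstration with a fake sleep function so nothing actually waits.
pauses = []
seen = list(throttled(range(5), batch_size=2, pause_seconds=10, sleep=pauses.append))
print(seen)    # [0, 1, 2, 3, 4]
print(pauses)  # [10, 10]
```

The download loop could then iterate with `for anime_id in throttled(df_anime_ids['node.id']):` and keep only the request and parsing code in its body.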

df_anime.columns

Index(['mal_id', 'title', 'start_date', 'end_date', 'synopsis', 'mean', 'rank',
       'popularity', 'num_list_users', 'num_scoring_users', 'nsfw',
       'created_at', 'updated_at', 'media_type', 'status', 'genres',
       'num_episodes', 'source', 'average_episode_duration', 'rating',
       'pictures', 'background', 'related_anime', 'related_manga',
       'recommendations', 'studios', 'main_picture.medium',
       'main_picture.large', 'alternative_titles.synonyms',
       'alternative_titles.en', 'alternative_titles.ja', 'start_season.year',
       'start_season.season', 'statistics.status.watching',
       'statistics.status.completed', 'statistics.status.on_hold',
       'statistics.status.dropped', 'statistics.status.plan_to_watch',
       'statistics.num_list_users'],
      dtype='object')

Save extracted anime information to a csv file

df_anime.index.name = 'Index' # rename the index
df_anime.rename(columns={"id": "mal_id"}, inplace=True) # rename the id column to mal_id
df_anime = df_anime[df_anime['media_type'].isin(['tv', 'movie', 'ova', 'special', 'ona'])] # only include TV, Movie, OVA, Special, ONA
df_anime.reset_index(drop=True, inplace=True) # reset the index
df_anime.to_csv(DATA_FOLDER + 'anime_extract.csv', sep=';', encoding='utf-8') # save the dataframe as a csv file
df_anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11613 entries, 0 to 11612
Data columns (total 39 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   mal_id                           11613 non-null  int64  
 1   title                            11613 non-null  object 
 2   start_date                       11603 non-null  object 
 3   end_date                         11501 non-null  object 
 4   synopsis                         11410 non-null  object 
 5   mean                             11613 non-null  float64
 6   rank                             11613 non-null  int64  
 7   popularity                       11613 non-null  int64  
 8   num_list_users                   11613 non-null  int64  
 9   num_scoring_users                11613 non-null  int64  
 10  nsfw                             11613 non-null  object 
 11  created_at                       11613 non-null  object 
 12  updated_at                       11613 non-null  object 
 13  media_type                       11613 non-null  object 
 14  status                           11613 non-null  object 
 15  genres                           11576 non-null  object 
 16  num_episodes                     11613 non-null  int64  
 17  source                           10106 non-null  object 
 18  average_episode_duration         11613 non-null  int64  
 19  rating                           11514 non-null  object 
 20  pictures                         11613 non-null  object 
 21  background                       1507 non-null   object 
 22  related_anime                    11613 non-null  object 
 23  related_manga                    11613 non-null  object 
 24  recommendations                  11613 non-null  object 
 25  studios                          11613 non-null  object 
 26  main_picture.medium              11609 non-null  object 
 27  main_picture.large               11609 non-null  object 
 28  alternative_titles.synonyms      11613 non-null  object 
 29  alternative_titles.en            6178 non-null   object 
 30  alternative_titles.ja            11595 non-null  object 
 31  start_season.year                10969 non-null  float64
 32  start_season.season              10969 non-null  object 
 33  statistics.status.watching       11613 non-null  int64  
 34  statistics.status.completed      11613 non-null  int64  
 35  statistics.status.on_hold        11613 non-null  int64  
 36  statistics.status.dropped        11613 non-null  int64  
 37  statistics.status.plan_to_watch  11613 non-null  int64  
 38  statistics.num_list_users        11613 non-null  int64  
dtypes: float64(2), int64(13), object(24)
memory usage: 3.5+ MB

df_anime = pd.read_csv('data/anime_extract.csv', sep=';', encoding='utf-8')
df_anime.index.name = 'Index'
df_anime.drop(df_anime.columns[0], axis=1, inplace=True)
df_anime.rename(columns={"id": "mal_id"}, inplace=True)
df_anime = df_anime[df_anime['media_type'].isin(['tv', 'movie', 'ova', 'special', 'ona'])]
df_anime.reset_index(drop=True, inplace=True)
df_anime.info()
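
Reloading the data from the CSV file has one side effect that the later cells rely on: columns that held Python lists (genres, studios, recommendations, related anime) come back as plain strings. The sketch below, using made-up data and an in-memory buffer instead of a file, shows the round trip and why `ast.literal_eval` is needed afterwards.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({"mal_id": [5114], "genres": [[{"id": 1, "name": "Action"}]]})

buffer = io.StringIO()
df.to_csv(buffer, sep=";", index=False)  # the list is written as its text representation
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep=";")

raw = reloaded.loc[0, "genres"]
print(type(raw).__name__)     # str
parsed = ast.literal_eval(raw)
print(parsed[0]["name"])      # Action
```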

Save anime ids and titles to a csv file

df_anime_titles = df_anime[['mal_id', 'title']].copy() # copy the dataframe columns to a new dataframe
df_anime_titles.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_titles.index.name = "Index" # rename the index
df_anime_titles.to_csv(DATA_FOLDER + 'anime_titles.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_titles.head()

	mal_id	title
Index
0	1	Cowboy Bebop
1	5	Cowboy Bebop: Tengoku no Tobira
2	6	Trigun
3	7	Witch Hunter Robin
4	8	Bouken Ou Beet

Save anime rankings to csv file

df_anime_ranking = df_anime[['mal_id', 'mean', 'rank', 'popularity', 'rating', 'num_scoring_users']].copy() # copy the dataframe columns to a new dataframe
df_anime_ranking.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_ranking.index.name = "Index" # set the index name to "Index"
df_anime_ranking.to_csv(DATA_FOLDER + 'anime_ranking.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_ranking.head()

	mal_id	mean	rank	popularity	rating	num_list_users	num_scoring_users
Index
0	1	8.76	37	42	r	1617259	832701
1	5	8.38	175	566	r	334185	192661
2	6	8.22	310	242	pg_13	659514	328258
3	7	7.26	2708	1678	pg_13	105582	41521
4	8	6.96	4073	4843	pg	14304	6239

Save anime ratings to csv file

df_anime_rating = df_anime[['mal_id', 'rating']].copy() # copy the dataframe columns to a new dataframe
df_anime_rating.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_rating.index.name = "Index" # set the index name to "Index"
df_anime_rating.to_csv(DATA_FOLDER + 'anime_rating.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_rating.head()

Save anime dates to csv file

df_anime_dates = df_anime[['mal_id', 'start_date', 'end_date', 'start_season.year', 'start_season.season']].copy() # copy the dataframe columns to a new dataframe
df_anime_dates.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_dates.index.name = "Index" # set the index name to "Index"

df_anime_dates.rename(columns={"start_season.year": "start_season_year"}, inplace=True) # rename the column
df_anime_dates["start_season_year"] = df_anime_dates["start_season_year"].fillna(0).astype(np.int64) # fill the NaN with 0 and convert to int64

df_anime_dates.to_csv(DATA_FOLDER + 'anime_dates.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_dates.head()

	mal_id	start_date	end_date	start_season_year	start_season.season
Index
0	1	1998-04-03	1999-04-24	1998	spring
1	5	2001-09-01	2001-09-01	2001	summer
2	6	1998-04-01	1998-09-30	1998	spring
3	7	2002-07-03	2002-12-25	2002	summer
4	8	2004-09-30	2005-09-29	2004	fall
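
Filling missing season years with 0 keeps the column an integer, but it also makes 0 look like a real year. As a possible alternative, not used in the original notebook, pandas' nullable `Int64` dtype keeps the values as integers while preserving the gaps. The series below is made-up sample data.

```python
import pandas as pd

years = pd.Series([1998.0, None, 2004.0], name="start_season_year")

filled = years.fillna(0).astype("int64")  # the notebook's approach: 0 marks "unknown"
nullable = years.astype("Int64")          # alternative: the gap stays missing (<NA>)

print(filled.tolist())
print(nullable.isna().tolist())
```

When written with `to_csv`, a missing value in the nullable column ends up as an empty field by default, so downstream readers can still tell unknown years apart from real ones.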

Save anime genres and demographics to csv files

df_anime_genres = pd.DataFrame(columns=['mal_id', 'genre_id']) # create a dataframe with the columns
df_genres_d = pd.DataFrame(columns=['genre_id', 'genre_de']) # create a dataframe with the columns
df_anime_demographic = pd.DataFrame(columns=['mal_id', 'demo_id']) # create a dataframe with the columns
df_demographic_d = pd.DataFrame(columns=['demo_id', 'demo_de']) # create a dataframe with the columns

for row in df_anime.iterrows(): # iterate through the dataframe
    genres_str = row[1]['genres'] 
    if(pd.isna(genres_str)): # if the genres_str is NaN continue
        continue
    genres_str = '{"genres": ' + genres_str.replace("'id'", "'genre_id'").replace("'name'", "'genre_de'") + '}' # rename only the quoted keys, not text inside genre names
    genre_d = ast.literal_eval(genres_str) # convert the string to a dictionary
    genres_d = pd.json_normalize(genre_d['genres']) # normalize the json
    
    for genre in genre_d['genres']: # iterate through the genres and demographics
        if(genre['genre_id'] in [15, 25, 27, 42, 43]): # if the genre is in the list of demographics
            df_anime_demographic.loc[df_anime_demographic.shape[0]] = [row[1]['mal_id'], genre['genre_id']] # add the demographic to the dataframe
            if(genre['genre_id'] not in df_demographic_d['demo_id'].values): # if the demographic is not in the dataframe
                df_demographic_d.loc[df_demographic_d.shape[0]] = [genre['genre_id'], genre['genre_de']] # if the demographic not already in add it to the dataframe
        else: # if the genre is not in the list of demographics
            df_anime_genres.loc[df_anime_genres.shape[0]] = [row[1]['mal_id'], genre['genre_id']] # add the genre to the dataframe
            if(genre['genre_id'] not in df_genres_d['genre_id'].values): # if the genre is not in the dataframe
                df_genres_d.loc[df_genres_d.shape[0]] = [genre['genre_id'], genre['genre_de']] # if the genre not already in add it to the dataframe

df_anime_demographic.sort_values(by=['mal_id', 'demo_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id and demo_id
df_anime_demographic.index.name = "Index" # set the index name to "Index"
df_anime_demographic.to_csv(DATA_FOLDER + 'anime_demographics.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_demographic.head()

	mal_id	demo_id
Index
0	6	27
1	8	27
2	15	27
3	16	43
4	17	27
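
Appending rows one at a time with `.loc[df.shape[0]]` gets slow on large frames. The sketch below shows one way to get the same genre and demographic split by collecting all rows first and filtering once. The function `split_genres` is hypothetical, the sample record is made up, and the demographic IDs are the ones used in the loop above.

```python
import pandas as pd

DEMOGRAPHIC_IDS = {15, 25, 27, 42, 43}  # same IDs as in the loop above

def split_genres(anime_genres):
    """Split (mal_id, list of {'id', 'name'} dicts) pairs into genre and demographic rows."""
    rows = [
        {"mal_id": mal_id, "genre_id": g["id"], "genre_de": g["name"]}
        for mal_id, genres in anime_genres
        for g in genres
    ]
    df = pd.DataFrame(rows, columns=["mal_id", "genre_id", "genre_de"])
    is_demo = df["genre_id"].isin(DEMOGRAPHIC_IDS)
    return df[~is_demo].reset_index(drop=True), df[is_demo].reset_index(drop=True)

sample = [(5114, [{"id": 1, "name": "Action"}, {"id": 27, "name": "Shounen"}])]
genres, demographics = split_genres(sample)
print(genres["genre_id"].tolist(), demographics["genre_id"].tolist())  # [1] [27]
```

The lookup tables could then be derived with `drop_duplicates()` on the `genre_id` and `genre_de` columns of each result.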

df_demographic_d.sort_values(by=['demo_id'], inplace=True, ignore_index=True) # sort the dataframe by the demo_id
df_demographic_d.index.name = "Index" # set the index name to "Index"
df_demographic_d.to_csv(DATA_FOLDER + 'demographics_d.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_demographic_d.head()

	demo_id	demo_de
Index
0	15	Kids
1	25	Shoujo
2	27	Shounen
3	42	Seinen
4	43	Josei

df_genres_d.sort_values(by=['genre_id'], inplace=True, ignore_index=True) # sort the dataframe by the genre_id
df_genres_d.index.name = "Index" # set the index name to "Index"
df_genres_d.to_csv(DATA_FOLDER + 'genres_d.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_genres_d.head()

	genre_id	genre_de
Index
0	1	Action
1	2	Adventure
2	3	Racing
3	4	Comedy
4	5	Avant Garde

df_anime_genres.sort_values(by=['mal_id', 'genre_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id and by genre_id
df_anime_genres.index.name = "Index" # set the index name to "Index"
df_anime_genres.to_csv(DATA_FOLDER + 'anime_genres.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_genres.head()

	mal_id	genre_id
Index
0	1	1
1	1	24
2	1	29
3	1	50
4	5	1

Save anime media types and nsfw to csv file

df_anime_media_type = df_anime[['mal_id', 'media_type', 'nsfw']].copy() # copy the dataframe columns to a new dataframe
df_anime_media_type.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_media_type.index.name = "Index" # set the index name to "Index"
df_anime_media_type.to_csv(DATA_FOLDER + 'media_type_d.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_media_type.head()

	mal_id	media_type	source	nsfw
Index
0	1	tv	original	white
1	5	movie	original	white
2	6	tv	manga	white
3	7	tv	original	white
4	8	tv	manga	white

Save anime sources to csv file

df_anime_source = df_anime[['mal_id', 'source']].copy() # copy the dataframe columns to a new dataframe
df_anime_source.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_source.index.name = "Index" # set the index name to "Index"
df_anime_source.to_csv(DATA_FOLDER + 'anime_source.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_source.head()

Save anime synopses to csv file

df_synopsis_d = df_anime[['mal_id', 'synopsis']].copy() # copy the dataframe columns to a new dataframe
df_synopsis_d.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_synopsis_d.index.name = "Index" # set the index name to "Index"
df_synopsis_d.to_csv(DATA_FOLDER + 'synopsis_d.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_synopsis_d.head()

	mal_id	synopsis
Index
0	1	Crime is timeless. By the year 2071, humanity ...
1	5	Another day, another bounty—such is the life o...
2	6	Vash the Stampede is the man with a $$60,000,0...
3	7	Witches are individuals with special powers li...
4	8	It is the dark century and the people are suff...

Save anime studios to csv file

df_anime_studios = pd.DataFrame(columns=['mal_id', 'studio_id']) # create a dataframe with the columns
df_studios_d = pd.DataFrame(columns=['studio_id', 'studio_de']) # create a dataframe with the columns

cnt = 0
for row in df_anime.iterrows(): # iterate through the dataframe
    studios_str = row[1]['studios']
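    if(pd.isna(studios_str)): # if the studios_str is NaN continue
        continue
    studios_str = '{"studios": ' + studios_str.replace("'id'", "'studio_id'").replace("'name'", "'studio_de'") + '}' # rename only the quoted keys, so studio names containing "id" or "name" stay intact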
    studio_d = ast.literal_eval(studios_str) # convert the string to a dictionary
    studios_d = pd.json_normalize(studio_d['studios']) # convert the dictionary to a dataframe
    
    df_studios_d = pd.concat([df_studios_d, studios_d], ignore_index=True).drop_duplicates() # concatenate the dataframes
    if(studios_d.empty): # if the dataframe is empty
        df_anime_studios.loc[df_anime_studios.shape[0]] = [int(row[1]['mal_id']), np.nan] # add the row to the dataframe and set it to NaN
        continue
    for studio in studio_d['studios']: # iterate through the studios
        df_anime_studios.loc[df_anime_studios.shape[0]] = [int(row[1]['mal_id']), int(studio['studio_id'])] # add the row to the dataframe

df_studios_d.sort_values(by=['studio_id'], inplace=True, ignore_index=True) # sort the dataframe by the studio_id
df_studios_d.index.name = "Index" # set the index name to "Index"
df_studios_d.to_csv(DATA_FOLDER + 'studios_d.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_studios_d.head()

	studio_id	studio_de
Index
0	1	Pierrot
1	2	Kyoto Animation
2	3	Gonzo
3	4	Bones
4	5	Bee Train

df_anime_studios.sort_values(by=['mal_id', 'studio_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id and by studio_id
df_anime_studios.index.name = "Index" # set the index name to "Index"
df_anime_studios["mal_id"] = df_anime_studios["mal_id"].astype(np.int64) # convert the mal_id to an integer
df_anime_studios["studio_id"] = df_anime_studios["studio_id"].fillna(0).astype(np.int64) # convert the studio_id to an integer
df_anime_studios.to_csv(DATA_FOLDER + 'anime_studios.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_studios.head()

	mal_id	studio_id
Index
0	1	14
1	5	4
2	6	11
3	7	14
4	8	18

Save anime recommendations to csv file

df_anime_recommendations = pd.DataFrame(columns=['mal_id', 'mal_id_rd', 'num_recommendations']) # create a dataframe with the columns
for row in df_anime.iterrows(): # iterate through the dataframe
    recommended_str = row[1]['recommendations']
    if(pd.isna(recommended_str)): # if the recommended_str is NaN
        continue
    recommended_str = '{"recommendations": ' + str(recommended_str) + '}'
    recom_d = ast.literal_eval(recommended_str) # convert the string to a dictionary
    for recommendation in recom_d['recommendations']: # iterate through the recommendations
        df_anime_recommendations.loc[df_anime_recommendations.shape[0]] = [row[1]['mal_id'], recommendation['node']['id'], recommendation['num_recommendations']] # add the row to the dataframe

	mal_id	mal_id_rd	num_recommendations
0	5114	11061	101
1	5114	16498	39
2	5114	1482	23
3	5114	9919	17
4	5114	1575	15

df_anime_recommendations.sort_values(by=['mal_id', 'num_recommendations'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id and by num_recommendations
df_anime_recommendations.index.name = "Index" # set the index name to "Index"
df_anime_recommendations.to_csv(DATA_FOLDER + 'anime_recommendations.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_recommendations.head()

	mal_id	mal_id_rd	num_recommendations
Index
0	1	13601	13
1	1	918	14
2	1	2025	16
3	1	4087	18
4	1	2251	30
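
Appending to a dataframe one row at a time with `.loc` can get slow when there are many recommendations. A possible alternative, sketched here, collects plain tuples in a list and builds the dataframe once at the end. The helper `parse_recommendations` and the toy `sample` frame are hypothetical; the string layout is assumed from the keys the loop above reads (`node.id` and `num_recommendations`), and the two ids are taken from the first preview.

```python
import ast
import pandas as pd

def parse_recommendations(mal_id, recommended_str):
    """Hypothetical helper: turn one stringified recommendations list into rows."""
    if pd.isna(recommended_str):
        return []  # anime without recommendations contribute no rows
    recommendations = ast.literal_eval(str(recommended_str))
    return [(mal_id, rec["node"]["id"], rec["num_recommendations"])
            for rec in recommendations]

# Toy stand-in for df_anime; the real column holds longer strings.
sample = pd.DataFrame({
    "mal_id": [5114, 1],
    "recommendations": [
        "[{'node': {'id': 11061}, 'num_recommendations': 101}, "
        "{'node': {'id': 16498}, 'num_recommendations': 39}]",
        float("nan"),
    ],
})

rows = []
for mal_id, recommended_str in zip(sample["mal_id"], sample["recommendations"]):
    rows.extend(parse_recommendations(mal_id, recommended_str))

# Build the dataframe once instead of growing it row by row.
df_recs = pd.DataFrame(rows, columns=["mal_id", "mal_id_rd", "num_recommendations"])
print(df_recs)
```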

Save anime relations to csv file

df_anime_relations = pd.DataFrame(columns=['mal_id', 'mal_id_re', 'relation_type']) # create a dataframe with the columns
for _, row in df_anime.iterrows(): # iterate through the dataframe
    related_str = row['related_anime']
    if pd.isna(related_str): # skip anime without related anime
        continue
    related_l = ast.literal_eval(related_str) # convert the string to a list of dictionaries
    for relation in related_l: # iterate through the relations
        df_anime_relations.loc[df_anime_relations.shape[0]] = [row['mal_id'], relation['node']['id'], relation['relation_type']] # add the row to the dataframe

df_anime_relations.sort_values(by=['mal_id', 'mal_id_re'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id and by mal_id_re
df_anime_relations.index.name = "Index" # set the index name to "Index"
df_anime_relations.to_csv(DATA_FOLDER + 'anime_relations.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_relations.head()

	mal_id	mal_id_re	relation_type
Index
0	1	5	side_story
1	1	4037	summary
2	1	17205	side_story
3	5	1	parent_story
4	6	4106	side_story
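
To show how a file written this way might be read back, the sketch below parses text in the layout that `to_csv(sep=';')` with an index named "Index" should produce, then filters one relation type. The rows are the five preview rows above; the in-memory `csv_text` stands in for `anime_relations.csv` so the example runs without the file.

```python
import io
import pandas as pd

# Text in the layout to_csv(sep=';') would write for the preview rows above.
csv_text = """Index;mal_id;mal_id_re;relation_type
0;1;5;side_story
1;1;4037;summary
2;1;17205;side_story
3;5;1;parent_story
4;6;4106;side_story
"""

# The separator and index column must match the ones used when saving.
relations = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="Index")

# Keep only the side stories.
side_stories = relations[relations["relation_type"] == "side_story"]
print(side_stories)
```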

Save anime cover images to csv file (urls)

df_anime_covers = df_anime[['mal_id', 'main_picture.medium', 'main_picture.large']].copy() # copy the dataframe columns to a new dataframe (the 'pictures' column is left out)
df_anime_covers.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_covers.rename(columns={"main_picture.medium": "main_picture_medium",
                                "main_picture.large": "main_picture_large"}, inplace=True) # rename the columns
df_anime_covers.index.name = "Index" # set the index name to "Index"
df_anime_covers.to_csv(DATA_FOLDER + 'anime_covers.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_covers.head()

	mal_id	main_picture.medium	main_picture.large	pictures
Index
0	1	https://api-cdn.myanimelist.net/images/anime/4...	https://api-cdn.myanimelist.net/images/anime/4...	[{'medium': 'https://api-cdn.myanimelist.net/i...
1	5	https://api-cdn.myanimelist.net/images/anime/1...	https://api-cdn.myanimelist.net/images/anime/1...	[{'medium': 'https://api-cdn.myanimelist.net/i...
2	6	https://api-cdn.myanimelist.net/images/anime/7...	https://api-cdn.myanimelist.net/images/anime/7...	[{'medium': 'https://api-cdn.myanimelist.net/i...
3	7	https://api-cdn.myanimelist.net/images/anime/1...	https://api-cdn.myanimelist.net/images/anime/1...	[{'medium': 'https://api-cdn.myanimelist.net/i...
4	8	https://api-cdn.myanimelist.net/images/anime/7...	https://api-cdn.myanimelist.net/images/anime/7...	[{'medium': 'https://api-cdn.myanimelist.net/i...

Save anime statistics to csv file

df_anime_stats = df_anime[['mal_id', 'num_episodes', 'average_episode_duration', 
                           'statistics.status.watching', 'statistics.status.completed', 
                           'statistics.status.on_hold', 'statistics.status.dropped', 
                           'statistics.status.plan_to_watch', 'statistics.num_list_users']].copy() # copy the dataframe columns to a new dataframe
df_anime_stats.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_stats.rename(columns={ # rename the columns
    "statistics.status.watching": "num_watching",
    "statistics.status.completed": "num_completed",
    "statistics.status.on_hold": "num_on_hold",
    "statistics.status.dropped": "num_dropped",
    "statistics.status.plan_to_watch": "num_plan_to_watch",
    "statistics.num_list_users": "num_list_users",
}, inplace=True)
df_anime_stats.index.name = "Index" # set the index name to "Index"
df_anime_stats.to_csv(DATA_FOLDER + 'anime_stats.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_stats.head()

	mal_id	num_episodes	average_episode_duration	statistics.status.watching	statistics.status.completed	statistics.status.on_hold	statistics.status.dropped	statistics.status.plan_to_watch	statistics.num_list_users
Index
0	1	26	1440	151257	921226	91835	35042	417977	1617337
1	5	1	6911	5606	249064	2472	974	76075	334191
2	6	26	1480	35889	393056	29722	16350	184521	659538
3	7	26	1500	4914	49236	5571	5785	40071	105577
4	8	52	1380	709	7754	801	1178	3861	14303
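
In the preview rows above, the five status counts add up to the total number of list users. A quick consistency check along those lines could look like the sketch below. It uses the renamed column names and only the five preview rows, so whether the same holds for every row of the full dataset is an assumption the check would test, not a guarantee.

```python
import pandas as pd

# Rows copied from the preview above, using the renamed column names.
stats = pd.DataFrame({
    "mal_id": [1, 5, 6, 7, 8],
    "num_watching": [151257, 5606, 35889, 4914, 709],
    "num_completed": [921226, 249064, 393056, 49236, 7754],
    "num_on_hold": [91835, 2472, 29722, 5571, 801],
    "num_dropped": [35042, 974, 16350, 5785, 1178],
    "num_plan_to_watch": [417977, 76075, 184521, 40071, 3861],
    "num_list_users": [1617337, 334191, 659538, 105577, 14303],
})

status_columns = ["num_watching", "num_completed", "num_on_hold",
                  "num_dropped", "num_plan_to_watch"]

# Rows where the per-status counts do not add up to the reported total.
mismatches = stats[stats[status_columns].sum(axis=1) != stats["num_list_users"]]
print(len(mismatches))
```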
