AshesOfPhoenix/myanimelist-database-extract

Download data from MyAnimeList, clean it and load it into datasets

6/7/2022

The goal was to create a simple script that would extract and clean data from the MyAnimeList database and store it in CSV files.

  • Data was extracted from the MyAnimeList database using the MyAnimeList API
  • Cleaned and transformed data was stored into CSV files
  • The resulting datasets were uploaded to Kaggle

Prerequisites

To access the MyAnimeList API, you need a MyAnimeList account. After creating one, register an API application in the API panel of your profile and copy its Client ID, which the script sends with every request. See the MyAnimeList API documentation to learn more about the API.

Acknowledgments

References:

Usage

To generate a README.md file from Jupyter Notebook, use the following command:

  • jupyter nbconvert --to markdown anime_extract.ipynb --output README.md

Import

Insert your Client ID and the folder where the datasets should be saved into the code below.

import pandas as pd
import numpy as np
import requests
import time
import ast

CLIENT_ID = 'YOUR_CLIENT_ID' # Client ID from MyAnimeList API
DATA_FOLDER = 'data/' # Folder to save data to
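
Hard-coding the Client ID works for a local run, but it is easy to commit by accident. A minimal sketch of an alternative, assuming two hypothetical environment variables named MAL_CLIENT_ID and MAL_DATA_FOLDER (neither is part of the original notebook), reads the settings from the environment and creates the output folder if it is missing:

```python
import os


def load_settings(env=os.environ, default_folder='data/'):
    """Return (client_id, data_folder).

    MAL_CLIENT_ID and MAL_DATA_FOLDER are hypothetical variable names
    chosen for this sketch; they are not defined by MyAnimeList.
    """
    client_id = env.get('MAL_CLIENT_ID', '').strip()
    if not client_id:
        raise RuntimeError('Set MAL_CLIENT_ID to the Client ID of your API application')
    data_folder = env.get('MAL_DATA_FOLDER', default_folder)
    os.makedirs(data_folder, exist_ok=True)  # to_csv fails if the folder does not exist
    return client_id, data_folder
```

With this in place, `CLIENT_ID, DATA_FOLDER = load_settings()` could replace the two constants above.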

MyAnimeList API

To learn how to use the MyAnimeList API, see the official API reference at https://myanimelist.net/apiconfig/references/api/v2 and this GitHub page, which contains an unofficial API specification: https://github.com/SuperMarcus/myanimelist-api-specification#search-anime

#data = requests.get('https://api.myanimelist.net/v2/anime/16498?fields=id,title,main_picture,alternative_titles,start_date,end_date,synopsis,mean,rank,popularity,num_list_users,num_scoring_users,nsfw,created_at,updated_at,media_type,status,genres,num_episodes,start_season,broadcast,source,average_episode_duration,rating,pictures,background,related_anime,related_manga,recommendations,studios,statistics&limit=4', headers={'X-MAL-CLIENT-ID': CLIENT_ID}) 
data = requests.get('https://api.myanimelist.net/v2/anime/ranking?ranking_type=all&limit=500', headers={'X-MAL-CLIENT-ID': CLIENT_ID}) 
print(data.json().keys()) 
data.json() 
print(data.json()['paging']['next']) # next page
dict_keys(['data', 'paging'])
https://api.myanimelist.net/v2/anime/ranking?offset=500&ranking_type=all&limit=500

Loop through all ranked anime and add anime id and title to a pandas dataframe

  • The API call has a limit of 500 anime per request.
%%time
data = requests.get('https://api.myanimelist.net/v2/anime/ranking?ranking_type=all&limit=500', headers={'X-MAL-CLIENT-ID': CLIENT_ID}) # get all anime from ranking
df_anime_ids = pd.json_normalize(data.json()['data']).drop(['node.main_picture.medium', 'node.main_picture.large','ranking.rank'], axis=1) # get only anime ids and convert to dataframe
next_url = data.json()['paging'].get('next') # get next page url
while next_url is not None: # while there is a next page
    data = requests.get(next_url, headers={'X-MAL-CLIENT-ID': CLIENT_ID}) # get next page
    df_anime_ids = pd.concat([df_anime_ids, pd.json_normalize(data.json()['data']).drop(['node.main_picture.medium', 'node.main_picture.large','ranking.rank'], axis=1)], ignore_index=True) # concatenate dataframe and drop unnecessary columns
    next_url = data.json().get('paging', {}).get('next') # next page url, or None when there are no more pages
df_anime_ids.head()
CPU times: total: 14.8 s
Wall time: 3min 17s
node.id node.title
0 5114 Fullmetal Alchemist: Brotherhood
1 28977 Gintama°
2 9253 Steins;Gate
3 38524 Shingeki no Kyojin Season 3 Part 2
4 9969 Gintama'
df_anime_ids.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20521 entries, 0 to 20520
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node.id     20521 non-null  int64 
 1   node.title  20521 non-null  object
dtypes: int64(1), object(1)
memory usage: 320.8+ KB
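
The paging logic above can also be sketched as a small generator that follows `paging.next` until the key is absent. The function names and the `fetch_json` parameter are hypothetical, introduced only for this illustration; passing the fetch step in as a callable lets the loop be exercised without touching the network.

```python
def iter_ranking_pages(fetch_json, first_url):
    """Yield each page's 'data' list, following 'paging.next' until it is absent.

    fetch_json is any callable that maps a URL to the decoded JSON dict, e.g.
    lambda url: requests.get(url, headers={'X-MAL-CLIENT-ID': CLIENT_ID}).json()
    """
    url = first_url
    while url:
        page = fetch_json(url)
        yield page.get('data', [])
        url = page.get('paging', {}).get('next')  # None on the last page


def collect_ids_and_titles(fetch_json, first_url):
    """Return a list of (id, title) tuples collected across all pages."""
    rows = []
    for items in iter_ranking_pages(fetch_json, first_url):
        rows.extend((item['node']['id'], item['node']['title']) for item in items)
    return rows
```

The collected tuples can be turned into a dataframe in one step, for example `pd.DataFrame(rows, columns=['node.id', 'node.title'])`, instead of concatenating a dataframe per page.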

List of all fields I want to extract about a specific anime from the MyAnimeList API

fields = 'id,title,main_picture,alternative_titles,start_date,end_date,synopsis,mean,rank,popularity,num_list_users,num_scoring_users,nsfw,created_at,updated_at,media_type,status,genres,my_list_status,num_episodes,start_season,source,average_episode_duration,rating,pictures,background,related_anime,related_manga,recommendations,studios,statistics'

Display a sample request to the API, which returns all information about the specified anime in json format

The data obtained contains all fields specified in the above cell

data = requests.get('https://api.myanimelist.net/v2/anime/' + str(5114) + '?fields=' + fields, headers={'X-MAL-CLIENT-ID': CLIENT_ID})
data.json()
{'id': 5114,
 'title': 'Fullmetal Alchemist: Brotherhood',
 'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1223/96541.jpg',
  'large': 'https://api-cdn.myanimelist.net/images/anime/1223/96541l.jpg'},
 'alternative_titles': {'synonyms': ['Hagane no Renkinjutsushi: Fullmetal Alchemist',
   'Fullmetal Alchemist (2009)',
   'FMA',
   'FMAB'],
  'en': 'Fullmetal Alchemist: Brotherhood',
  'ja': '鋼の錬金術師 FULLMETAL ALCHEMIST'},
 'start_date': '2009-04-05',
 'end_date': '2010-07-04',
 'synopsis': 'After a horrific alchemy experiment goes wrong in the Elric household, brothers Edward and Alphonse are left in a catastrophic new reality. Ignoring the alchemical principle banning human transmutation, the boys attempted to bring their recently deceased mother back to life. Instead, they suffered brutal personal loss: Alphonse\'s body disintegrated while Edward lost a leg and then sacrificed an arm to keep Alphonse\'s soul in the physical realm by binding it to a hulking suit of armor.\n\nThe brothers are rescued by their neighbor Pinako Rockbell and her granddaughter Winry. Known as a bio-mechanical engineering prodigy, Winry creates prosthetic limbs for Edward by utilizing "automail," a tough, versatile metal used in robots and combat armor. After years of training, the Elric brothers set off on a quest to restore their bodies by locating the Philosopher\'s Stone—a powerful gem that allows an alchemist to defy the traditional laws of Equivalent Exchange.\n\nAs Edward becomes an infamous alchemist and gains the nickname "Fullmetal," the boys\' journey embroils them in a growing conspiracy that threatens the fate of the world.\n\n[Written by MAL Rewrite]',
 'mean': 9.14,
 'rank': 1,
 'popularity': 3,
 'num_list_users': 2892519,
 'num_scoring_users': 1843357,
 'nsfw': 'white',
 'created_at': '2008-08-21T03:35:22+00:00',
 'updated_at': '2022-04-18T05:06:13+00:00',
 'media_type': 'tv',
 'status': 'finished_airing',
 'genres': [{'id': 1, 'name': 'Action'},
  {'id': 2, 'name': 'Adventure'},
  {'id': 8, 'name': 'Drama'},
  {'id': 10, 'name': 'Fantasy'},
  {'id': 38, 'name': 'Military'},
  {'id': 27, 'name': 'Shounen'}],
 'num_episodes': 64,
 'start_season': {'year': 2009, 'season': 'spring'},
 'source': 'manga',
 'average_episode_duration': 1460,
 'rating': 'r',
 'pictures': [{'medium': 'https://api-cdn.myanimelist.net/images/anime/13/13738.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/13/13738l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/2/17090.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/2/17090l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/2/17472.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/2/17472l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/5/47603.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/5/47603l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/10/57095.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/10/57095l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/7/74317.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/7/74317l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/1521/94614.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/1521/94614l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/1208/94745.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/1208/94745l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/1223/96541.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/1223/96541l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/1286/96542.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/1286/96542l.jpg'},
  {'medium': 'https://api-cdn.myanimelist.net/images/anime/1629/108486.jpg',
   'large': 'https://api-cdn.myanimelist.net/images/anime/1629/108486l.jpg'}],
 'background': '',
 'related_anime': [{'node': {'id': 121,
    'title': 'Fullmetal Alchemist',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/10/75815.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/10/75815l.jpg'}},
   'relation_type': 'alternative_version',
   'relation_type_formatted': 'Alternative version'},
  {'node': {'id': 6421,
    'title': 'Fullmetal Alchemist: Brotherhood Specials',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1493/91571.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/1493/91571l.jpg'}},
   'relation_type': 'side_story',
   'relation_type_formatted': 'Side story'},
  {'node': {'id': 7902,
    'title': 'Fullmetal Alchemist: Brotherhood - 4-Koma Theater',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/3/76154.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/3/76154l.jpg'}},
   'relation_type': 'spin_off',
   'relation_type_formatted': 'Spin-off'},
  {'node': {'id': 9135,
    'title': 'Fullmetal Alchemist: The Sacred Star of Milos',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/2/29550.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/2/29550l.jpg'}},
   'relation_type': 'side_story',
   'relation_type_formatted': 'Side story'}],
 'related_manga': [],
 'recommendations': [{'node': {'id': 11061,
    'title': 'Hunter x Hunter (2011)',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1337/99013.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/1337/99013l.jpg'}},
   'num_recommendations': 101},
  {'node': {'id': 16498,
    'title': 'Shingeki no Kyojin',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/10/47347.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/10/47347l.jpg'}},
   'num_recommendations': 39},
  {'node': {'id': 1482,
    'title': 'D.Gray-man',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/13/75194.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/13/75194l.jpg'}},
   'num_recommendations': 23},
  {'node': {'id': 9919,
    'title': 'Ao no Exorcist',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/10/75195.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/10/75195l.jpg'}},
   'num_recommendations': 17},
  {'node': {'id': 38000,
    'title': 'Kimetsu no Yaiba',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1286/99889.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/1286/99889l.jpg'}},
   'num_recommendations': 15},
  {'node': {'id': 1575,
    'title': 'Code Geass: Hangyaku no Lelouch',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/5/50331.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/5/50331l.jpg'}},
   'num_recommendations': 15},
  {'node': {'id': 2251,
    'title': 'Baccano!',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/3/14547.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/3/14547l.jpg'}},
   'num_recommendations': 15},
  {'node': {'id': 22199,
    'title': 'Akame ga Kill!',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1429/95946.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/1429/95946l.jpg'}},
   'num_recommendations': 10},
  {'node': {'id': 23755,
    'title': 'Nanatsu no Taizai',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/8/65409.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/8/65409l.jpg'}},
   'num_recommendations': 9},
  {'node': {'id': 1735,
    'title': 'Naruto: Shippuuden',
    'main_picture': {'medium': 'https://api-cdn.myanimelist.net/images/anime/1565/111305.jpg',
     'large': 'https://api-cdn.myanimelist.net/images/anime/1565/111305l.jpg'}},
   'num_recommendations': 9}],
 'studios': [{'id': 4, 'name': 'Bones'}],
 'statistics': {'status': {'watching': '220764',
   'completed': '2095325',
   'on_hold': '98772',
   'dropped': '45327',
   'plan_to_watch': '432363'},
  'num_list_users': 2892551}}
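
The request URL above is assembled by string concatenation. As a rough alternative, the sketch below builds the same URL with `urllib.parse.urlencode`; the helper name `build_anime_url` and the constant `API_BASE` are hypothetical and only serve this example.

```python
from urllib.parse import urlencode

API_BASE = 'https://api.myanimelist.net/v2/anime'  # base path taken from the requests above


def build_anime_url(anime_id, fields):
    """Return the detail URL for one anime.

    fields may be a comma-separated string or an iterable of field names.
    """
    field_list = fields.split(',') if isinstance(fields, str) else list(fields)
    cleaned = ','.join(name.strip() for name in field_list)
    query = urlencode({'fields': cleaned}, safe=',')  # keep commas readable in the query
    return f'{API_BASE}/{int(anime_id)}?{query}'
```

A call such as `requests.get(build_anime_url(5114, fields), headers={'X-MAL-CLIENT-ID': CLIENT_ID})` would then be equivalent to the sample request.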

Loop through all anime ids from previous dataframe and extract anime information for each anime

  • Filter out anime without mean score and rank (adult anime)
  • Pause for 5 minutes every 500 requests to avoid being blocked
%%time
print("Starting data acquisition for every anime ID in the list")
rq_limit = 500
print("Limited to " + str(rq_limit) + " sequential requests, otherwise server denies requests")

data = requests.get('https://api.myanimelist.net/v2/anime/' + str(df_anime_ids['node.id'][0]) + '?fields=' + fields, headers={'X-MAL-CLIENT-ID': CLIENT_ID}) # get json data for anime with myanimelist api
df_anime = pd.json_normalize(data.json()) # convert json to pandas dataframe

for cnt in range(1, len(df_anime_ids['node.id'])): # loop through all anime IDs
    data = requests.get('https://api.myanimelist.net/v2/anime/' + str(df_anime_ids['node.id'][cnt]) + '?fields=' + fields, headers={'X-MAL-CLIENT-ID': CLIENT_ID})
    try: # if the anime is found
        anim_json = data.json()  # get the json
        if anim_json.get('mean') is not None and anim_json.get('rank') is not None: # keep only anime that have both a mean score and a rank
            #anim_json.pop('background', None) # remove background
            df_anime = pd.concat([df_anime, pd.json_normalize(anim_json)], ignore_index=True)
    except ValueError: # the response body was not valid JSON, skip this anime
        pass
    if (cnt > 1 and cnt % 100 == 0): # print progress every 100 requests
        print(str(cnt) + " requests")
    if (cnt % rq_limit == 0):                           
        print("Waiting for 5 minutes before continuing") # pause the requests for 5 minutes, otherwise, the server will deny requests
        for minutes_left in range(5, 0, -1): # wait for 5 minutes, one minute at a time
            print(str(minutes_left) + " minutes left")
            time.sleep(60)
        print("Starting up again")
df_anime.head(5)
df_anime.columns 
Index(['mal_id', 'title', 'start_date', 'end_date', 'synopsis', 'mean', 'rank',
       'popularity', 'num_list_users', 'num_scoring_users', 'nsfw',
       'created_at', 'updated_at', 'media_type', 'status', 'genres',
       'num_episodes', 'source', 'average_episode_duration', 'rating',
       'pictures', 'background', 'related_anime', 'related_manga',
       'recommendations', 'studios', 'main_picture.medium',
       'main_picture.large', 'alternative_titles.synonyms',
       'alternative_titles.en', 'alternative_titles.ja', 'start_season.year',
       'start_season.season', 'statistics.status.watching',
       'statistics.status.completed', 'statistics.status.on_hold',
       'statistics.status.dropped', 'statistics.status.plan_to_watch',
       'statistics.num_list_users'],
      dtype='object')
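
The pause rule used in the loop above (stop for 5 minutes after every 500 requests) can be isolated in a small helper. This is only a sketch: the name `throttle` and its parameters are hypothetical, and the 500 and 300-second defaults simply mirror the values in the notebook rather than any documented API limit.

```python
import time


def throttle(request_count, limit=500, pause_seconds=300, sleep=time.sleep):
    """Pause after every `limit` requests; return True if a pause happened.

    The sleep function is injectable so the rule can be checked without waiting.
    """
    if request_count > 0 and request_count % limit == 0:
        sleep(pause_seconds)
        return True
    return False
```

Inside the extraction loop this would be called once per iteration as `throttle(cnt, rq_limit)`.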

Save extracted anime information to a csv file

df_anime.index.name = 'Index' # rename the index
df_anime.rename(columns={"id": "mal_id"}, inplace=True) # rename the id column to mal_id
df_anime = df_anime[df_anime['media_type'].isin(['tv', 'movie', 'ova', 'special', 'ona'])] # only include TV, Movie, OVA, Special, ONA
df_anime.reset_index(drop=True, inplace=True) # reset the index
df_anime.to_csv(DATA_FOLDER + 'anime_extract.csv', sep=';', encoding='utf-8') # save the dataframe as a csv file
df_anime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11613 entries, 0 to 11612
Data columns (total 39 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   mal_id                           11613 non-null  int64  
 1   title                            11613 non-null  object 
 2   start_date                       11603 non-null  object 
 3   end_date                         11501 non-null  object 
 4   synopsis                         11410 non-null  object 
 5   mean                             11613 non-null  float64
 6   rank                             11613 non-null  int64  
 7   popularity                       11613 non-null  int64  
 8   num_list_users                   11613 non-null  int64  
 9   num_scoring_users                11613 non-null  int64  
 10  nsfw                             11613 non-null  object 
 11  created_at                       11613 non-null  object 
 12  updated_at                       11613 non-null  object 
 13  media_type                       11613 non-null  object 
 14  status                           11613 non-null  object 
 15  genres                           11576 non-null  object 
 16  num_episodes                     11613 non-null  int64  
 17  source                           10106 non-null  object 
 18  average_episode_duration         11613 non-null  int64  
 19  rating                           11514 non-null  object 
 20  pictures                         11613 non-null  object 
 21  background                       1507 non-null   object 
 22  related_anime                    11613 non-null  object 
 23  related_manga                    11613 non-null  object 
 24  recommendations                  11613 non-null  object 
 25  studios                          11613 non-null  object 
 26  main_picture.medium              11609 non-null  object 
 27  main_picture.large               11609 non-null  object 
 28  alternative_titles.synonyms      11613 non-null  object 
 29  alternative_titles.en            6178 non-null   object 
 30  alternative_titles.ja            11595 non-null  object 
 31  start_season.year                10969 non-null  float64
 32  start_season.season              10969 non-null  object 
 33  statistics.status.watching       11613 non-null  int64  
 34  statistics.status.completed      11613 non-null  int64  
 35  statistics.status.on_hold        11613 non-null  int64  
 36  statistics.status.dropped        11613 non-null  int64  
 37  statistics.status.plan_to_watch  11613 non-null  int64  
 38  statistics.num_list_users        11613 non-null  int64  
dtypes: float64(2), int64(13), object(24)
memory usage: 3.5+ MB

df_anime = pd.read_csv('data/anime_extract.csv', sep=';', encoding='utf-8')
df_anime.index.name = 'Index'
df_anime.drop(df_anime.columns[0], axis=1, inplace=True)
df_anime.rename(columns={"id": "mal_id"}, inplace=True)
df_anime = df_anime[df_anime['media_type'].isin(['tv', 'movie', 'ova', 'special', 'ona'])]
df_anime.reset_index(drop=True, inplace=True)
df_anime.info()
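
When the extract is reloaded from anime_extract.csv, nested columns such as genres, studios and recommendations come back as plain strings (the text form of a Python list), which is why the later cells call ast.literal_eval. A minimal sketch of a helper that converts one such cell back into a list follows; the name `parse_nested_cell` is hypothetical.

```python
import ast


def parse_nested_cell(cell):
    """Turn a CSV cell such as "[{'id': 1, 'name': 'Action'}]" back into a list.

    Missing values (NaN is a float, not a str) and empty cells give an empty list.
    """
    if not isinstance(cell, str) or not cell.strip():
        return []
    value = ast.literal_eval(cell)  # evaluates literals only, never arbitrary code
    return value if isinstance(value, list) else [value]
```

Applied to a whole column, this might look like `df_anime['genres'].apply(parse_nested_cell)`.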

Save anime ids and titles to a csv file

df_anime_titles = df_anime[['mal_id', 'title']].copy() # copy the dataframe columns to a new dataframe
df_anime_titles.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_titles.index.name = "Index" # rename the index
df_anime_titles.to_csv(DATA_FOLDER + 'anime_titles.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_titles.head()
mal_id title
Index
0 1 Cowboy Bebop
1 5 Cowboy Bebop: Tengoku no Tobira
2 6 Trigun
3 7 Witch Hunter Robin
4 8 Bouken Ou Beet

Save anime rankings to csv file

df_anime_ranking = df_anime[['mal_id', 'mean', 'rank', 'popularity', 'rating', 'num_scoring_users']].copy() # copy the dataframe columns to a new dataframe
df_anime_ranking.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_ranking.index.name = "Index" # set the index name to "Index"
df_anime_ranking.to_csv(DATA_FOLDER + 'anime_ranking.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_ranking.head()
mal_id mean rank popularity rating num_list_users num_scoring_users
Index
0 1 8.76 37 42 r 1617259 832701
1 5 8.38 175 566 r 334185 192661
2 6 8.22 310 242 pg_13 659514 328258
3 7 7.26 2708 1678 pg_13 105582 41521
4 8 6.96 4073 4843 pg 14304 6239

Save anime ratings to csv file

df_anime_rating = df_anime[['mal_id', 'rating']].copy() # copy the dataframe columns to a new dataframe
df_anime_rating.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_rating.index.name = "Index" # set the index name to "Index"
df_anime_rating.to_csv(DATA_FOLDER + 'anime_rating.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_rating.head()

Save anime dates to csv file

df_anime_dates = df_anime[['mal_id', 'start_date', 'end_date', 'start_season.year', 'start_season.season']].copy() # copy the dataframe columns to a new dataframe
df_anime_dates.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_dates.index.name = "Index" # set the index name to "Index"

df_anime_dates.rename(columns={"start_season.year": "start_season_year"}, inplace=True) # rename the year column so the next line can find it
df_anime_dates["start_season_year"] = df_anime_dates["start_season_year"].fillna(0).astype(np.int64) # fill the NaN with 0 and convert to int64

df_anime_dates.to_csv(DATA_FOLDER + 'anime_dates.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_dates.head()
mal_id start_date end_date start_season_year start_season.season
Index
0 1 1998-04-03 1999-04-24 1998 spring
1 5 2001-09-01 2001-09-01 2001 summer
2 6 1998-04-01 1998-09-30 1998 spring
3 7 2002-07-03 2002-12-25 2002 summer
4 8 2004-09-30 2005-09-29 2004 fall

Save anime genres and demographics to csv files

df_anime_genres = pd.DataFrame(columns=['mal_id', 'genre_id']) # create a dataframe with the columns
df_genres_d = pd.DataFrame(columns=['genre_id', 'genre_de']) # create a dataframe with the columns
df_anime_demographic = pd.DataFrame(columns=['mal_id', 'demo_id']) # create a dataframe with the columns
df_demographic_d = pd.DataFrame(columns=['demo_id', 'demo_de']) # create a dataframe with the columns

for row in df_anime.iterrows(): # iterate through the dataframe
    genres_str = row[1]['genres'] 
    if(pd.isna(genres_str)): # if the genres_str is NaN continue
        continue
    genres_str = '{"genres": ' + genres_str.replace("'id'", "'genre_id'").replace("'name'", "'genre_de'") + '}' # rename only the quoted keys, not text inside genre names
    genre_d = ast.literal_eval(genres_str) # convert the string to a dictionary
    genres_d = pd.json_normalize(genre_d['genres']) # normalize the json
    
    for genre in genre_d['genres']: # iterate through the genres and demographics
        if(genre['genre_id'] in [15, 25, 27, 42, 43]): # if the genre is in the list of demographics
            df_anime_demographic.loc[df_anime_demographic.shape[0]] = [row[1]['mal_id'], genre['genre_id']] # add the demographic to the dataframe
            if(genre['genre_id'] not in df_demographic_d['demo_id'].values): # if the demographic is not in the dataframe
                df_demographic_d.loc[df_demographic_d.shape[0]] = [genre['genre_id'], genre['genre_de']] # if the demographic not already in add it to the dataframe
        else: # if the genre is not in the list of demographics
            df_anime_genres.loc[df_anime_genres.shape[0]] = [row[1]['mal_id'], genre['genre_id']] # add the genre to the dataframe
            if(genre['genre_id'] not in df_genres_d['genre_id'].values): # if the genre is not in the dataframe
                df_genres_d.loc[df_genres_d.shape[0]] = [genre['genre_id'], genre['genre_de']] # if the genre not already in add it to the dataframe
df_anime_demographic.sort_values(by=['mal_id', 'demo_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id and demo_id
df_anime_demographic.index.name = "Index" # set the index name to "Index"
df_anime_demographic.to_csv(DATA_FOLDER + 'anime_demographics.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_demographic.head()
mal_id demo_id
Index
0 6 27
1 8 27
2 15 27
3 16 43
4 17 27
df_demographic_d.sort_values(by=['demo_id'], inplace=True, ignore_index=True) # sort the dataframe by the demo_id
df_demographic_d.index.name = "Index" # set the index name to "Index"
df_demographic_d.to_csv(DATA_FOLDER + 'demographics_d.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_demographic_d.head()
demo_id demo_de
Index
0 15 Kids
1 25 Shoujo
2 27 Shounen
3 42 Seinen
4 43 Josei
df_genres_d.sort_values(by=['genre_id'], inplace=True, ignore_index=True) # sort the dataframe by the genre_id
df_genres_d.index.name = "Index" # set the index name to "Index"
df_genres_d.to_csv(DATA_FOLDER + 'genres_d.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_genres_d.head()
genre_id genre_de
Index
0 1 Action
1 2 Adventure
2 3 Racing
3 4 Comedy
4 5 Avant Garde
df_anime_genres.sort_values(by=['mal_id', 'genre_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id and by genre_id
df_anime_genres.index.name = "Index" # set the index name to "Index"
df_anime_genres.to_csv(DATA_FOLDER + 'anime_genres.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_genres.head()
mal_id genre_id
Index
0 1 1
1 1 24
2 1 29
3 1 50
4 5 1
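
The split between genres and demographics above hinges on the five demographic ids (15, 25, 27, 42, 43). As an illustration, the same rule can be written as a function that returns plain tuples, which can then be turned into dataframes in a single step instead of appending row by row. The names `DEMOGRAPHIC_IDS` and `split_genres` are hypothetical, and the function expects the original `id` key rather than the renamed `genre_id`.

```python
DEMOGRAPHIC_IDS = {15, 25, 27, 42, 43}  # Kids, Shoujo, Shounen, Seinen, Josei


def split_genres(mal_id, genres):
    """Split one anime's genre list into (genre_rows, demographic_rows).

    genres is a list of dicts with an 'id' key, as returned by the API.
    Each row is a (mal_id, id) tuple.
    """
    genre_rows, demo_rows = [], []
    for genre in genres:
        target = demo_rows if genre['id'] in DEMOGRAPHIC_IDS else genre_rows
        target.append((mal_id, genre['id']))
    return genre_rows, demo_rows
```

Rows collected over all anime could then be loaded with `pd.DataFrame(rows, columns=['mal_id', 'genre_id'])`.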

Save anime media types and nsfw to csv file

df_anime_media_type = df_anime[['mal_id', 'media_type', 'nsfw']].copy() # copy the dataframe columns to a new dataframe
df_anime_media_type.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_media_type.index.name = "Index" # set the index name to "Index"
df_anime_media_type.to_csv(DATA_FOLDER + 'media_type_d.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_media_type.head()
mal_id media_type source nsfw
Index
0 1 tv original white
1 5 movie original white
2 6 tv manga white
3 7 tv original white
4 8 tv manga white

Save anime sources to csv file

df_anime_source = df_anime[['mal_id', 'source']].copy() # copy the dataframe columns to a new dataframe
df_anime_source.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_source.index.name = "Index" # set the index name to "Index"
df_anime_source.to_csv(DATA_FOLDER + 'anime_source.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_source.head()

Save anime synopses to csv file

df_synopsis_d = df_anime[['mal_id', 'synopsis']].copy() # copy the dataframe columns to a new dataframe
df_synopsis_d.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_synopsis_d.index.name = "Index" # set the index name to "Index"
df_synopsis_d.to_csv(DATA_FOLDER + 'synopsis_d.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_synopsis_d.head()
mal_id synopsis
Index
0 1 Crime is timeless. By the year 2071, humanity ...
1 5 Another day, another bounty—such is the life o...
2 6 Vash the Stampede is the man with a $$60,000,0...
3 7 Witches are individuals with special powers li...
4 8 It is the dark century and the people are suff...

Save anime studios to csv file

df_anime_studios = pd.DataFrame(columns=['mal_id', 'studio_id']) # create a dataframe with the columns
df_studios_d = pd.DataFrame(columns=['studio_id', 'studio_de']) # create a dataframe with the columns

cnt = 0
for row in df_anime.iterrows(): # iterate through the dataframe
    studios_str = row[1]['studios']
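    if(pd.isna(studios_str)): # if the studios_str is NaN continue
        continue
    studios_str = '{"studios": ' + studios_str.replace("'id'", "'studio_id'").replace("'name'", "'studio_de'") + '}' # rename only the quoted keys, so studio names containing "id" stay intact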
    studio_d = ast.literal_eval(studios_str) # convert the string to a dictionary
    studios_d = pd.json_normalize(studio_d['studios']) # convert the dictionary to a dataframe
    
    df_studios_d = pd.concat([df_studios_d, studios_d], ignore_index=True).drop_duplicates() # concatenate the dataframes
    if(studios_d.empty): # if the dataframe is empty
        df_anime_studios.loc[df_anime_studios.shape[0]] = [int(row[1]['mal_id']), np.nan] # add the row to the dataframe and set it to NaN
        continue
    for studio in studio_d['studios']: # iterate through the studios
        df_anime_studios.loc[df_anime_studios.shape[0]] = [int(row[1]['mal_id']), int(studio['studio_id'])] # add the row to the dataframe    
df_studios_d.sort_values(by=['studio_id'], inplace=True, ignore_index=True) # sort the dataframe by the studio_id
df_studios_d.index.name = "Index" # set the index name to "Index"
df_studios_d.to_csv(DATA_FOLDER + 'studios_d.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_studios_d.head()
studio_id studio_de
Index
0 1 Pierrot
1 2 Kyoto Animation
2 3 Gonzo
3 4 Bones
4 5 Bee Train
df_anime_studios.sort_values(by=['mal_id', 'studio_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id and by studio_id
df_anime_studios.index.name = "Index" # set the index name to "Index"
df_anime_studios["mal_id"] = df_anime_studios["mal_id"].astype(np.int64) # convert the mal_id to an integer
df_anime_studios["studio_id"] = df_anime_studios["studio_id"].fillna(0).astype(np.int64) # convert the studio_id to an integer
df_anime_studios.to_csv(DATA_FOLDER + 'anime_studios.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_studios.head()

| Index | mal_id | studio_id |
|---|---|---|
| 0 | 1 | 14 |
| 1 | 5 | 4 |
| 2 | 6 | 11 |
| 3 | 7 | 14 |
| 4 | 8 | 18 |

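
Since `anime_studios.csv` only holds id pairs, studio names have to be looked up in `studios_d.csv`. A minimal sketch of that lookup, assuming small in-memory samples with the same columns (the anime with `mal_id` 999 is made up to represent an entry without a studio, stored as `studio_id` 0):

```python
import pandas as pd

# Small samples with the same columns as studios_d.csv and anime_studios.csv
df_studios_d = pd.DataFrame({"studio_id": [1, 4], "studio_de": ["Pierrot", "Bones"]})
df_anime_studios = pd.DataFrame({"mal_id": [5, 999], "studio_id": [4, 0]})

# A left join keeps anime whose studio_id is the 0 placeholder; their name stays NaN
df_named = df_anime_studios.merge(df_studios_d, on="studio_id", how="left")
print(df_named)
```

A left join is used here so that anime without a studio are not silently dropped from the result.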

Save anime recommendations to a CSV file

df_anime_recommendations = pd.DataFrame(columns=['mal_id', 'mal_id_rd', 'num_recommendations']) # create a dataframe with the columns
for row in df_anime.iterrows(): # iterate through the dataframe
    recommended_str = row[1]['recommendations']
    if(pd.isna(recommended_str)): # if the recommended_str is NaN
        continue
    recommended_str = '{"recommendations": ' + str(recommended_str) + '}'
    recom_d = ast.literal_eval(recommended_str) # convert the string to a dictionary
    for recommendation in recom_d['recommendations']: # iterate through the recommendations
        df_anime_recommendations.loc[df_anime_recommendations.shape[0]] = [row[1]['mal_id'], recommendation['node']['id'], recommendation['num_recommendations']] # add the row to the dataframe

|  | mal_id | mal_id_rd | num_recommendations |
|---|---|---|---|
| 0 | 5114 | 11061 | 101 |
| 1 | 5114 | 16498 | 39 |
| 2 | 5114 | 1482 | 23 |
| 3 | 5114 | 9919 | 17 |
| 4 | 5114 | 1575 | 15 |

df_anime_recommendations.sort_values(by=['mal_id', 'num_recommendations'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id and by num_recommendations
df_anime_recommendations.index.name = "Index" # set the index name to "Index"
df_anime_recommendations.to_csv(DATA_FOLDER + 'anime_recommendations.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_recommendations.head()

| Index | mal_id | mal_id_rd | num_recommendations |
|---|---|---|---|
| 0 | 1 | 13601 | 13 |
| 1 | 1 | 918 | 14 |
| 2 | 1 | 2025 | 16 |
| 3 | 1 | 4087 | 18 |
| 4 | 1 | 2251 | 30 |

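
Adding rows one at a time with `df.loc[df.shape[0]] = ...` can get slow on large inputs. One possible alternative, sketched here under the assumption that `df_anime` stores recommendations as stringified lists, is to collect plain tuples and build the dataframe once. The two-row `df_anime` below is a stand-in, not the real data.

```python
import ast
import pandas as pd

# Stand-in for df_anime: one anime with a recommendation, one without
df_anime = pd.DataFrame({
    "mal_id": [5114, 1],
    "recommendations": [
        "[{'node': {'id': 11061}, 'num_recommendations': 101}]",
        float("nan"),
    ],
})

rows = []  # plain Python list; appending to it is cheap
for mal_id, rec_str in zip(df_anime["mal_id"], df_anime["recommendations"]):
    if pd.isna(rec_str):  # skip anime without recommendations
        continue
    for rec in ast.literal_eval(rec_str):
        rows.append((mal_id, rec["node"]["id"], rec["num_recommendations"]))

# Build the dataframe in a single step from the collected rows
df_anime_recommendations = pd.DataFrame(
    rows, columns=["mal_id", "mal_id_rd", "num_recommendations"]
)
print(df_anime_recommendations)
```

The resulting columns match the ones used above, so the sorting and `to_csv` steps would not need to change.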

Save anime relations to a CSV file

df_anime_relations = pd.DataFrame(columns=['mal_id', 'mal_id_re', 'relation_type']) # create a dataframe with the columns
for row in df_anime.iterrows(): # iterate through the dataframe
    related_str = row[1]['related_anime']
    if(pd.isna(related_str)): # if the related_str is NaN
        continue
    related_d = ast.literal_eval(related_str) # convert the string to a dictionary
    for relation in related_d: # iterate through the relations
        df_anime_relations.loc[df_anime_relations.shape[0]] = [row[1]['mal_id'], relation['node']['id'], relation['relation_type']] # add the row to the dataframe
df_anime_relations.sort_values(by=['mal_id', 'mal_id_re'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id and by mal_id_re
df_anime_relations.index.name = "Index" # set the index name to "Index"
df_anime_relations.to_csv(DATA_FOLDER + 'anime_relations.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_relations.head()

| Index | mal_id | mal_id_re | relation_type |
|---|---|---|---|
| 0 | 1 | 5 | side_story |
| 1 | 1 | 4037 | summary |
| 2 | 1 | 17205 | side_story |
| 3 | 5 | 1 | parent_story |
| 4 | 6 | 4106 | side_story |

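
Because the files are written with `sep=';'` and a named index, reading them back needs matching arguments. A minimal sketch, using an in-memory string in place of `anime_relations.csv` so that it runs without the generated files:

```python
import io
import pandas as pd

# In-memory stand-in for anime_relations.csv, in the layout to_csv produces
csv_text = (
    "Index;mal_id;mal_id_re;relation_type\n"
    "0;1;5;side_story\n"
    "1;1;4037;summary\n"
)

# sep and index_col mirror the to_csv call: ';' separator, index column named "Index"
df_relations = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="Index")
print(df_relations)
```

With the real files, `io.StringIO(csv_text)` would be replaced by a path such as `DATA_FOLDER + 'anime_relations.csv'`.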

Save anime cover image URLs to a CSV file

df_anime_covers = df_anime[['mal_id', 'main_picture.medium', 'main_picture.large']].copy() # copy the dataframe columns to a new dataframe #'pictures' 
df_anime_covers.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_covers.rename(columns={"main_picture.medium": "main_picture_medium",
                                "main_picture.large": "main_picture_large"}, inplace=True) # rename the columns
df_anime_covers.index.name = "Index" # set the index name to "Index"
df_anime_covers.to_csv(DATA_FOLDER + 'anime_covers.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_covers.head()

| Index | mal_id | main_picture.medium | main_picture.large | pictures |
|---|---|---|---|---|
| 0 | 1 | https://api-cdn.myanimelist.net/images/anime/4... | https://api-cdn.myanimelist.net/images/anime/4... | [{'medium': 'https://api-cdn.myanimelist.net/i... |
| 1 | 5 | https://api-cdn.myanimelist.net/images/anime/1... | https://api-cdn.myanimelist.net/images/anime/1... | [{'medium': 'https://api-cdn.myanimelist.net/i... |
| 2 | 6 | https://api-cdn.myanimelist.net/images/anime/7... | https://api-cdn.myanimelist.net/images/anime/7... | [{'medium': 'https://api-cdn.myanimelist.net/i... |
| 3 | 7 | https://api-cdn.myanimelist.net/images/anime/1... | https://api-cdn.myanimelist.net/images/anime/1... | [{'medium': 'https://api-cdn.myanimelist.net/i... |
| 4 | 8 | https://api-cdn.myanimelist.net/images/anime/7... | https://api-cdn.myanimelist.net/images/anime/7... | [{'medium': 'https://api-cdn.myanimelist.net/i... |

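
Dotted column names such as `main_picture.medium` are what `pd.json_normalize` produces when it flattens nested fields. Assuming `df_anime` was built that way (an assumption, since only the studios step shows `json_normalize` here), the sketch below flattens one made-up record with placeholder URLs and then replaces the dots in a single `rename` call.

```python
import pandas as pd

# Made-up record shaped like one API response; the URLs are placeholders
record = {
    "id": 1,
    "main_picture": {
        "medium": "https://example.com/medium.jpg",
        "large": "https://example.com/large.jpg",
    },
}

# Nested keys are joined with "." by default
df = pd.json_normalize([record])
print(list(df.columns))

# Replace the dots so the column names are easier to use downstream
df = df.rename(columns=lambda c: c.replace(".", "_"))
print(list(df.columns))
```

Passing a function to `rename` handles every dotted column at once, which may be convenient when many nested fields are involved.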

Save anime statistics to a CSV file

df_anime_stats = df_anime[['mal_id', 'num_episodes', 'average_episode_duration', 
                           'statistics.status.watching', 'statistics.status.completed', 
                           'statistics.status.on_hold', 'statistics.status.dropped', 
                           'statistics.status.plan_to_watch', 'statistics.num_list_users']].copy() # copy the dataframe columns to a new dataframe
df_anime_stats.sort_values(by=['mal_id'], inplace=True, ignore_index=True) # sort the dataframe by the mal_id
df_anime_stats.rename(columns={"statistics.status.watching": "num_watching",
                               "statistics.status.completed": "num_completed",
                               "statistics.status.on_hold": "num_on_hold",
                               "statistics.status.dropped": "num_dropped",
                               "statistics.status.plan_to_watch": "num_plan_to_watch",
                               "statistics.num_list_users": "num_list_users"}, inplace=True) # rename the columns
df_anime_stats.index.name = "Index" # set the index name to "Index"
df_anime_stats.to_csv(DATA_FOLDER + 'anime_stats.csv', sep=';', encoding='utf-8') # save the dataframe to a csv file
df_anime_stats.head()

| Index | mal_id | num_episodes | average_episode_duration | statistics.status.watching | statistics.status.completed | statistics.status.on_hold | statistics.status.dropped | statistics.status.plan_to_watch | statistics.num_list_users |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 26 | 1440 | 151257 | 921226 | 91835 | 35042 | 417977 | 1617337 |
| 1 | 5 | 1 | 6911 | 5606 | 249064 | 2472 | 974 | 76075 | 334191 |
| 2 | 6 | 26 | 1480 | 35889 | 393056 | 29722 | 16350 | 184521 | 659538 |
| 3 | 7 | 26 | 1500 | 4914 | 49236 | 5571 | 5785 | 40071 | 105577 |
| 4 | 8 | 52 | 1380 | 709 | 7754 | 801 | 1178 | 3861 | 14303 |