
In the last few years, recommendation systems have taken up more and more space in our lives; they are present in almost everything we do online. They appear in online commerce and in advertising, and they also help us choose movies and music. The purpose of these systems is to suggest relevant items to the user. As a Data Scientist, I also come across tasks that can be solved with recommendation systems. During my two months at home, one such problem came up, unrelated to my work. After randomly stumbling across a movie database, I decided to build a simple recommendation system for myself that suggests new movies based on my ratings of movies I have already watched.
In these tasks, the lack of a suitable database is often the biggest problem. It happens that a good task is given, but there is not enough data for it, or the data is not good enough. This may be because the data is not public, or because acquiring the dataset would cost more than the expected benefit of the project. We also often find that the data is incomplete or inconsistent. Sometimes we have to create the database ourselves, which can be time consuming and expensive. Fortunately, I found a database that has been built up continuously since 2013.
The database I use is MovieTweetings, a dataset of movie ratings from Twitter users. It gathers information from Twitter on a daily basis, based on well-structured tweets that contain the "I rated #IMDb" phrase. The dataset is the result of research by Simon Dooms, described in the study MovieTweetings: a Movie Rating Dataset Collected From Twitter.
Once the database was available, I first checked its usability by looking at how many people rated each film. The most-rated film is Gravity, with 3086 ratings, which is far below its number of IMDb ratings. Many films had very few ratings, so I kept only the movies that received at least 100 ratings, leaving 1644 movies in my database. Then I compared the average scores these films received on Twitter and on IMDb. As the following table and figure show, of the 1644 films only 10 had a difference greater than 1 between the two averages, even though the films received orders of magnitude more ratings on IMDb.
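The usability check described above can be sketched roughly as follows. This is only an illustration on toy data: the ratings, the imdb_avg values and the MIN_RATINGS constant are made up for the example (the article's threshold is 100), and only the column names come from the table described in the text.

```python
import pandas as pd

# Toy stand-in for the Twitter ratings table; column names follow the article.
twitterdataframe = pd.DataFrame({
    "user_id":     [1, 2, 3, 1, 2, 3],
    "movie_title": ["Gravity", "Gravity", "Gravity", "Prisoners", "Prisoners", "Rush"],
    "rating":      [8, 9, 7, 9, 8, 6],
})
# Hypothetical IMDb averages, invented for this illustration only.
imdb_avg = pd.Series({"Gravity": 7.7, "Prisoners": 8.1, "Rush": 8.1}, name="imdb_avg")

MIN_RATINGS = 2  # the article uses 100; lowered so the toy data survives

# Count and average the Twitter ratings of every movie.
stats = twitterdataframe.groupby("movie_title")["rating"].agg(
    n_ratings="count", twitter_avg="mean"
)
# Keep only movies with enough ratings.
popular = stats[stats["n_ratings"] >= MIN_RATINGS]

# Put the two averages side by side and flag gaps larger than 1 point.
comparison = popular.join(imdb_avg)
comparison["abs_diff"] = (comparison["twitter_avg"] - comparison["imdb_avg"]).abs()
large_gap = comparison[comparison["abs_diff"] > 1]

print(comparison)
print("Movies with a gap above 1:", list(large_gap.index))
```

On real data, the length of large_gap would correspond to the count of films whose two averages differ by more than 1.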


So I accepted the database as usable and continued to work with it. I also filtered out users who rated fewer than 20 movies, which left 6883 users. The initial table (twitterdataframe) contained the following columns: user (user_id), movie (movie_title), rating (rating).
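A minimal sketch of the user filter, assuming the three columns named above. The toy rows and the MIN_USER_RATINGS constant are illustrative only (the article's threshold is 20).

```python
import pandas as pd

# Toy table with the three columns described in the article.
twitterdataframe = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2, 3],
    "movie_title": ["Gravity", "Rush", "Prisoners", "Gravity", "Rush", "Gravity"],
    "rating":      [8, 7, 9, 9, 6, 5],
})

MIN_USER_RATINGS = 2  # the article uses 20; lowered for the toy data

# For every row, the total number of ratings given by that row's user.
ratings_per_user = twitterdataframe.groupby("user_id")["rating"].transform("count")

# Keep only rows that belong to sufficiently active users.
active = twitterdataframe[ratings_per_user >= MIN_USER_RATINGS]

print(active["user_id"].nunique(), "users remain")
```

Using transform keeps the result aligned with the original rows, so the filter is a single boolean mask and needs no merge.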

There are three main types of recommendation systems: collaborative, content-based, and hybrid, which is a mixture of the previous two.
The collaborative approach generates new recommendations based solely on previous interactions between users and items. The main idea of this method is that past user-item interactions are sufficient to find similar users and similar items and to make suggestions based on these estimated proximities. The main advantage of the collaborative approach is that it does not require additional information about users or items and can therefore be used in many situations. Moreover, the more users rate the items, the more accurate the new recommendations will be.
Unlike the previous method, the content-based approach also requires additional information about users and items. Such information may include age, gender or any other personal data about the user, as well as category, director, duration or other characteristics of the films (items).
Since the only information available here is how each user rated the films, I worked with the collaborative method.
Of the several varieties of the collaborative method, I used the user-user approach. To provide a new recommendation to a given user, it tries to identify the other users with the most similar tastes. This method is called “user-centered” because it represents users by their interactions with items and measures the distances between them. It calculates a “similarity” between the given user and all other users; this measure treats two users as close if they have similar interactions with the same items. After calculating the similarities, it finds the user's nearest neighbors and recommends new items based on the neighbors' ratings.
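To make the notion of “similarity” concrete, here is a small sketch using cosine similarity. The text does not say which measure was actually used, so cosine is an assumption chosen for illustration, and the three rating vectors are invented.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each vector holds one user's ratings for the same four movies (0 = not rated).
alice = np.array([9, 8, 0, 7])
bob   = np.array([8, 9, 0, 6])   # rates the same movies similarly to alice
carol = np.array([0, 2, 10, 0])  # mostly rated a different movie

sim_alice_bob = cosine_similarity(alice, bob)
sim_alice_carol = cosine_similarity(alice, carol)

print(f"alice-bob:   {sim_alice_bob:.3f}")
print(f"alice-carol: {sim_alice_carol:.3f}")
```

Here bob ends up far closer to alice than carol does, so bob's ratings would carry the recommendation for alice.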
In this task, I wanted to suggest new movies for a specific user. To do this, I first represented each user as a vector of their ratings for the various films. I created the vectors from the table (twitterdataframe) with the pandas.DataFrame.pivot method and filled the missing values with zero.
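The pivot step might look like the sketch below. The toy rows and the name user_movie are illustrative; the column names are the ones listed for twitterdataframe.

```python
import pandas as pd

# Toy long-format table: one row per (user, movie) rating.
twitterdataframe = pd.DataFrame({
    "user_id":     [1, 1, 2, 2, 3],
    "movie_title": ["Gravity", "Rush", "Gravity", "Prisoners", "Rush"],
    "rating":      [8, 7, 9, 10, 6],
})

# One row per user, one column per movie; unrated movies become 0.
user_movie = twitterdataframe.pivot(
    index="user_id", columns="movie_title", values="rating"
).fillna(0)

print(user_movie)
```

Note that pivot raises an error if the same user rated the same movie twice, so duplicates would have to be removed or aggregated first (for example with pivot_table).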

After that, I searched for neighbors with the K-nearest neighbors (K-NN) method. The algorithm aims to find the K users closest to the given user based on movie ratings. The number of neighbors can be chosen freely, taking into account the size of the database and the purpose of the task. The larger this number, the more distant users are included, so the system suggests less and less relevant items; if it is set very small, however, the system may suggest films that perhaps only one person rated as good. I set the number of neighbors to 100 so that the system suggests films that are really good but still relevant. I used sklearn's NearestNeighbors class to search for neighbors and queried the fitted model with the given user, which gave me the nearest 100 neighbors.
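A rough sketch of the neighbor search with sklearn's NearestNeighbors follows. The rating matrix is invented, K is reduced from 100 to 2 to fit the toy data, and the cosine metric is an assumption, since the text does not name the distance used.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows are users, columns are movies, 0 means "not rated" (toy data).
ratings = np.array([
    [9, 8, 0, 7, 0],   # user 0: the user we want recommendations for
    [8, 9, 0, 6, 0],   # user 1
    [9, 7, 0, 8, 10],  # user 2
    [0, 2, 10, 0, 1],  # user 3
    [0, 0, 9, 1, 0],   # user 4
])

K = 2            # the article uses 100
TARGET_USER = 0

# Ask for K + 1 neighbors because the target user is part of the fitted data
# and comes back as its own closest match.
# metric="cosine" is an assumption made for this sketch.
model = NearestNeighbors(n_neighbors=K + 1, metric="cosine", algorithm="brute")
model.fit(ratings)

distances, indices = model.kneighbors(ratings[[TARGET_USER]])

# Drop the target user and keep the K closest remaining users.
neighbor_ids = [int(i) for i in indices[0] if i != TARGET_USER][:K]
print("Nearest neighbors of user 0:", neighbor_ids)
```

With a pandas pivot table instead of a raw array, the same calls work on its .values, and the returned positions can be mapped back to user ids through the table's index.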
Having found the nearest neighbors, I needed some method to select the most popular films among them and then suggest those that the user had not yet seen. The choice depends on the purpose. You can choose the movies that the most people rated 10, but you can also rank them by average score. I weighted the scores as follows: I penalized bad ratings, ignored average ratings, and rewarded good ones. Then I summed the points each film received. From the resulting ranking, I suggested the first 10 films that the user had not yet seen.
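The weighting scheme can be sketched as below. The exact thresholds and point values are not given in the text, so the ones in weight() (4 or lower is bad, 8 or higher is good, one point each way) are assumptions, as are the toy ratings and TOP_N, which stands in for the article's 10.

```python
import pandas as pd

# Ratings of the nearest neighbors (rows) per movie (columns); 0 = not rated.
neighbor_ratings = pd.DataFrame(
    {
        "Prisoners":       [9, 10, 8],
        "Rush":            [6, 7, 0],
        "Gravity":         [10, 9, 9],
        "American Hustle": [2, 9, 3],
    },
    index=["neighbor_1", "neighbor_2", "neighbor_3"],
)
already_seen = {"Gravity"}  # movies the target user has rated
TOP_N = 2                   # the article recommends the first 10

def weight(rating):
    """Assumed scheme: penalize bad, ignore average, reward good ratings."""
    if rating == 0:   # not rated: no contribution
        return 0
    if rating <= 4:   # bad rating
        return -1
    if rating >= 8:   # good rating
        return 1
    return 0          # average rating

# Convert every rating to points, then sum the points per movie.
scores = neighbor_ratings.apply(lambda col: col.map(weight)).sum()

# Remove seen movies, rank by points, keep the best TOP_N.
recommendations = (
    scores.drop(labels=list(already_seen), errors="ignore")
          .sort_values(ascending=False)
          .head(TOP_N)
          .index.tolist()
)
print(recommendations)
```

Summing points rather than averaging favors films that many neighbors liked, which matches the goal of picking films that are popular among the neighbors.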

Once the model was ready, I tested it on the films I had rated myself. The following table lists the 20 movies and the scores on which the nearest-neighbor search was based.

The recommender suggested the following films for me to watch:
Knives Out (2019), Avengers: Endgame (2019), Captain Phillips (2013), The Wolf of Wall Street (2013), The Shawshank Redemption (1994), Hacksaw Ridge (2016), American Hustle (2013), The Imitation Game (2014), Prisoners (2013), The Gentlemen (2019)
Since I had only rated 20 films beforehand, the suggestions included films that I had already seen, which allowed me to test them. I would gladly recommend The Shawshank Redemption, The Imitation Game and Captain Phillips to others, so I am pleased with how the system works. Fortunately, the suggestions also included a couple of films that were new to me.
In most recommendation algorithms, it is necessary to be extremely careful to avoid the “rich get richer” effect for popular products. In other words, the system should not end up suggesting only popular items, nor should users only get recommendations extremely close to what they already liked, leaving them no chance to discover new items. To avoid this, we can increase the number of neighbors or expand the given user's rating list with more varied films.
The other problem arose after I recommended films to several of my colleagues: one of them had seen and rated 1354 of the 1644 films. Since the next most active user in the database rated 869 movies and only 13 users rated more than 500 movies, the algorithm found only very distant neighbors for him. Also, the list of films that could be recommended shrank to 290, so the system may not recommend the most relevant films for him. Solving this problem would require enlarging the database, which would be costly and time consuming, but fortunately this is a rare case.