Posted by code on 2024-04-21 09:55:50 in Technology
CS 7642: Reinforcement Learning and Decision Making Project #3 Overcooked
1 Problem
1.1 Description
For the final project of this course, you have to bring together everything you have learned thus far and solve the multi-agent Overcooked environment (modeled after the popular video game). In this environment, you have control over 2 chefs in a restaurant kitchen who have to collaborate to cook onion soups. To cook a soup, the agents need to put 3 onions into a cooking pot, initiate cooking, wait for the soup to cook, put the soup into a dish, and serve the dish at a serving area. This project serves as a capstone to the course and as such we expect much of the project to be open-ended and self-directed. Your primary goal is to maximize the number of soups delivered within an episode on a variety of layouts ranging from fairly easy to extremely difficult. In your quest to solve these layouts you may discover auxiliary goals or metrics that are worth analyzing.
Our expectation is that you have learned what is significant to include in this type of report from the previous projects and the material we have covered so far. It is thus up to you to define:
• The direction of your project including which aspect(s) you aim to focus upon.
• How you specify and measure such aspects.
• How to train your agents.
• How to structure your report and what graphs to include (in addition to the mandatory graphs discussed later).
Your focus should be on demonstrating your understanding of the algorithm(s)/solution(s), clarifying the rationale behind your experiments, and analyzing their results. Your main goal is to develop an algorithm to solve the environment but you can also use everything else studied in the course such as reward and policy shaping. The environment provides a reward shaping data structure that you are free to use. You may also design your own reward shaping in place of, or in addition to, this default setup. However, all algorithms and solutions used to solve the environment should be your own. We encourage you to start off this project with your Project 2 solution and see how far that model takes you. This will provide context for why multi-agent methods may be necessary for this environment. It will also help to ease your transition into this environment by utilizing an algorithm you’ve already gotten to work.
Figure 1: Visualization of the Overcooked environment. Carroll et al. 2019
1.2 Environment and Task
In this project, you will be training a team of 2 agents to cook onion soups in a kitchen. The objective is always to deliver as many soups as possible within a 400-timestep episode. Each soup takes 20 timesteps to cook and delivering a soup successfully yields a +20 reward. Cooking a soup with fewer than 3 onions, dropping a soup on the ground, or serving the soup on the counter (instead of the designated serving area) yields no reward but hinders progress as agents lose precious time (and starve customers). Episodes are truncated to a 400-step horizon with no termination conditions. You are not permitted to increase the 400-step horizon. You are provided with 5 layouts of varying difficulty, [cramped_room, asymmetric_advantages, coordination_ring, forced_coordination, counter_circuit_o_1order], as shown in Figure 2 (see footnote 1). Your task is to achieve a mean soup delivery count of ≥ 7 per episode across all layouts using a single approach. This means a single algorithm and a single reward-shaping function (if you utilize reward shaping). This also means a single set of hyperparameters. The idea is to build an agent that can solve any layout that is thrown at it, and not just these 5. Having a constant set of parameters also makes reproducibility much easier (something we gained an appreciation for in Project 1). Note that some layouts can be solved by a single agent algorithm and don’t require any collaboration. Other layouts benefit significantly from collaboration and some may only be solvable via collaboration. This means that a successful approach to solving all 5 layouts will likely require an explicit multi-agent approach. We also expect you to develop your approaches and analyze the results by explicitly looking at multi-agent metrics (see Section 1.8).
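Because each delivered soup is worth +20, the soup count for an episode can be recovered from the unshaped (sparse) episode return. The sketch below illustrates that arithmetic and a check against the target of 7. The function names and the example returns are hypothetical, and it assumes the threshold is applied to each layout separately, which is one reading of "across all layouts".

```python
SOUP_REWARD = 20          # reward per delivered soup, from the task description
TARGET_MEAN_SOUPS = 7     # required mean soups per episode


def soups_delivered(sparse_episode_return):
    """Convert an unshaped episode return into a number of delivered soups."""
    return sparse_episode_return / SOUP_REWARD


def mean_soups_by_layout(returns_by_layout):
    """Map each layout name to its mean soups per episode.

    returns_by_layout: dict of layout name -> list of sparse episode returns.
    """
    return {
        layout: sum(soups_delivered(r) for r in returns) / len(returns)
        for layout, returns in returns_by_layout.items()
    }


def meets_target(returns_by_layout):
    """True if every layout reaches the target mean (assumed per-layout check)."""
    means = mean_soups_by_layout(returns_by_layout)
    return all(m >= TARGET_MEAN_SOUPS for m in means.values())


# Made-up returns for two layouts, for illustration only.
example = {"layout_a": [140, 160], "layout_b": [120, 140]}
print(mean_soups_by_layout(example))
print(meets_target(example))
```

Shaped rewards should be excluded from this calculation, since they would inflate the apparent soup count.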
Figure 2: The 5 layouts you are tasked to solve. From left to right they are named [cramped_room, asymmetric_advantages, coordination_ring, forced_coordination, counter_circuit_o_1order]. Carroll et al. 2019
1.3 State Space
This is a fully-observable MDP and both agents have access to the full observation. Therefore, the state and observation spaces are equivalent. By default, the observations are provided as a 96-element vector, customized for each agent. The encoding for player i ∈ {0, 1} contains a player-centric featurized view for the ith player, and is as follows:
[player_i_features, other_player_features, player_i_dist_to_other_player, player_i_position]
The first component, player_i_features, has length 46 and is detailed below. Note that if you add all the feature lengths in the specification below, you will get 36 instead of the expected 46. This is because the five features related to the pot (having a combined length of 10) occur twice, once for each pot, and are concatenated together. Note also that none of our layouts contain tomatoes, so the features corresponding to tomatoes will always be 0. Finally, layouts containing only one cooking pot will have the second pot’s features zeroed out as well.
• p_i_orientation: one-hot-encoding of direction currently facing (length 4)
• p_i_obj: one-hot-encoding of object currently being held ([onion, soup, dish, tomato]) (all 0s if no object held) (length 4)
• p_i_closest_{onion|tomato|dish|soup}: (dx, dy) where dx = x dist to item, dy = y dist to item. (0, 0) if item is currently held (length 8)
• p_i_closest_soup_n_{onions|tomatoes}: int value for number of this ingredient in closest soup (length 2)
• p_i_closest_{serving_area|empty_counter}: (dx, dy) where dx = x dist to item, dy = y dist to item. (length 4)
Footnote 1: The Overcooked environment has dozens of layouts, but for this project we will only be focusing on these 5.
• p_i_closest_pot_j_exists: {0, 1} depending on whether jth closest pot is found. If 0, then all other pot features are 0. Note: can be 0 even if there are more than j pots on layout, if the pot is not reachable by player i (length 1)
• p_i_closest_pot_j_{is_empty|is_full|is_cooking|is_ready}: {0, 1} depending on boolean value for jth closest pot (length 4)
• p_i_closest_pot_j_{num_onions|num_tomatoes}: int value for number of this ingredient in jth closest pot (length 2)
• p_i_closest_pot_j_cook_time: int value for seconds remaining on soup. 0 if no soup is cooking (length 1)
• p_i_closest_pot_j: (dx, dy) to jth closest pot from player i location (length 2)
• p_i_wall_j: {0, 1} boolean value of whether player i has a wall immediately in direction j (length 4)
The remaining components of the observation vector are as follows:
• other_player_features (length 46): ordered concatenation of player_j_features for j ≠ i
• player_i_dist_to_other_player (length 2): [player_j.pos - player_i.pos for j ≠ i]
• player_i_position (length 2)
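To make the layout of the 96-element vector concrete, the following sketch splits one agent's observation into named slices using the lengths listed above. The field names are hypothetical labels, and the code assumes the features appear in the vector in the same order as the bullets in this section; verify this against the provided notebook before relying on it.

```python
import numpy as np

# (name, length) pairs for the non-pot features, in the order listed in the spec.
BASE_FEATURES = [
    ("orientation", 4),
    ("held_object", 4),
    ("closest_items", 8),            # onion, tomato, dish, soup as (dx, dy) each
    ("closest_soup_ingredients", 2),
    ("closest_serving_and_counter", 4),
]
# Per-pot features (combined length 10), repeated once for each of 2 pots.
POT_FEATURES = [
    ("exists", 1),
    ("status", 4),                   # is_empty, is_full, is_cooking, is_ready
    ("ingredients", 2),
    ("cook_time", 1),
    ("offset", 2),
]


def player_fields():
    """Return the (name, length) layout of one 46-element player block."""
    fields = list(BASE_FEATURES)
    for j in range(2):
        fields += [(f"pot{j}_{name}", n) for name, n in POT_FEATURES]
    fields.append(("walls", 4))
    return fields


def split_observation(obs):
    """Split a 96-element observation into a dict of named slices."""
    obs = np.asarray(obs)
    if obs.shape != (96,):
        raise ValueError(f"expected shape (96,), got {obs.shape}")
    parts, start = {}, 0
    for prefix in ("self", "other"):
        for name, n in player_fields():
            parts[f"{prefix}_{name}"] = obs[start:start + n]
            start += n
    parts["offset_to_other"] = obs[start:start + 2]
    parts["position"] = obs[start + 2:start + 4]
    return parts


parts = split_observation(np.zeros(96))
print(sum(len(v) for v in parts.values()))  # total number of elements covered
```

Naming the slices this way can make it easier to inspect what an agent sees, or to build custom features for reward shaping.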
1.4 Action Space
The action space is discrete with six possible actions: up, down, left, right, stay, and “interact,” which is a contextual action determined by the tile the player is facing (e.g. placing an onion when facing a counter). Each layout has one or more onion dispensers and dish dispensers, which provide an unlimited supply of onions and dishes respectively.
1.5 Installation Notes
The environment is officially supported on Python 3.7 and is installed via pip install overcooked-ai. We recommend you run in Anaconda. We require the use of PyTorch if using deep learning methods. You absolutely do not need a GPU to solve any of the layouts in less than 10 hours (in fact, GPUs typically slow RL algorithms down). To help you with getting started, we are providing you with a Jupyter notebook. You may create a copy of this notebook in order to run the starting code. This notebook demonstrates installing, building, interacting with, and visualizing the environment. You are not required to use this notebook in your project, but we encourage you to use it as a companion to this document to better understand the environment.
1.6 IMPORTANT: Reward Shaping Addendum
If you plan on using reward shaping, take a look at how the default shaped rewards are swapped by the agent index in the provided notebook. Upon episode reset, agents are assigned randomly to one of the 2 starting positions. This assignment is only reflected in the official observation that is returned to you by the environment’s step method. Any state variable you obtain from the Overcooked environment that is not in this observation variable (including anything in the info dictionary or the base environment) needs to be similarly swapped. Failure to do this means you will be assigning credit to the wrong agent roughly half the time, crippling your algorithm.
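The swap described above can be sketched as a small helper. This is an illustration rather than the notebook's code: it assumes the environment exposes the current starting-position assignment as an index in {0, 1} (for example an attribute such as `env.agent_idx`) and reports per-player values such as shaped rewards in environment order (for example under an `info` key such as `"shaped_r_by_agent"`). Both names are assumptions to check against the provided notebook.

```python
def align_to_agent_order(values_by_env_player, agent_idx):
    """Reorder a pair of per-player values so index 0 refers to your agent 0.

    values_by_env_player: length-2 sequence in the environment's player order.
    agent_idx: environment player index currently assigned to your agent 0
               (assumed to be re-drawn at every episode reset).
    """
    if agent_idx not in (0, 1):
        raise ValueError("agent_idx must be 0 or 1")
    first, second = values_by_env_player
    # No swap needed when your agent 0 sits in environment slot 0.
    return [first, second] if agent_idx == 0 else [second, first]


# Example: environment-ordered shaped rewards, with the agents swapped this episode.
print(align_to_agent_order([3, 5], agent_idx=1))
```

The same helper would apply to any other per-player quantity read from outside the returned observation, since all of them share the environment's ordering.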
For more details on installation and operation, refer to the GitHub repository - https://github.com/HumanCompatibleAI/overcooked_ai
1.7 Strategy Recommendations
You are free to pursue any multi-agent RL strategies in your soup-cooking quest. For instance, you may pursue a novel reward-shaping technique; however, make sure that the method chosen is relevant to multi-agent RL problems. We strongly recommend that you (1) start with your Project 2 solution adapted to this problem and (2) start with the cramped room and asymmetric advantages layouts. Below are further examples of strategies worth pursuing:
• using reward shaping techniques for improving multi-agent considerations such as collaboration and credit assignment;
• asynchronous methods (Mnih et al. 2016);
• centralizing training and decentralizing execution (Lowe et al. 2017; J. N. Foerster et al. 2017);
• value factorisation (Rashid, Samvelyan, De Witt, et al. 2020);
• employing curriculum learning (some single-agent ideas in this dissertation may be interesting and easy to extend to the multi-agent case, e.g., Narvekar 2017);
• adding communication protocols (J. Foerster et al. 2016);
• improving multi-agent credit assignment (J. N. Foerster et al. 2017; Zhou et al. 2020);
• improving multi-agent exploration (Iqbal and Sha 2019; Wang et al. 2019);
• finding better inductive biases (i.e., choosing the function space for policy/value function approximation) to handle the exponential complexity of multi-agent learning, e.g., graph neural networks (Battaglia et al. 2018; Naderializadeh et al. 2020).
1.8 Procedure
This problem is more sophisticated than anything you have seen so far in this course. Make sure you reserve enough time to consider what an appropriate approach might involve and, of course, enough time to build and train it.
• Clearly define the direction of your project and which aspect(s) you aim to improve upon over your Project 2 baseline, assuming that that baseline was unable to solve all of the layouts. For example, do you want to improve collaboration among your agents?
– This includes why you think your algorithm/procedure will accomplish this and whether or not your results demonstrate success.
• Implement a solution that produces such improvements.
– Use any algorithms/strategy as inspiration for your solution.
– The focus of this project is to try new algorithms/solutions, rather than to simply improve hyper-parameters of the algorithms already implemented. Further, avoid searching for random seeds that happen to work the best as this is inconsequential analysis. Remember that the algorithm/reward-shaping/hyperparameters must be fixed across all 5 layouts.
– Justify the choice of that solution and explain why you expect it to produce these improvements.
– Even if your solution does not solve all of the layouts, you still have the ability to write a solid paper.
– Upload/maintain your code in your private repo at https://github.gatech.edu/gt-omscs-rldm.
• Describe your experiments and create graphs that demonstrate the success/failure of your solution.
– You must provide one graph demonstrating the number of soups made across all five layouts during training. You can combine all five layouts’ plots onto one graph if you wish. Displaying a simple moving average for each layout’s training run is suggested to help with clarity.
– You must provide one graph demonstrating performance of your trained agent on each layout over at least 100 consecutive episodes. Again, you can combine all five layouts’ plots into one graph. If all five of these graphs are flat lines (a possible consequence of using a deterministic algorithm on a deterministic environment), then a bar graph is ok.
– Additionally, you must provide at least two graphs using metrics you decided on that are significant for your hypothesis/goal.
– Analyze your results and explain the reasons for the success/failure of your solution.
– Since graphs are largely decided by you, they should have clear axes, labels, and captions. You will lose points for graphs that do not have any description or label of the information being displayed.
– Example metrics you might consider are number of dish pickups, dropped dishes, incorrect deliveries, or picked up onions. These example metrics and more are built into the environment and are accessible via the info variable at the end of an episode. In your report you should clearly motivate why you are interested in a particular metric. See the provided notebook.
• We’ve created a private Georgia Tech GitHub repository for your code. Push your code to the personal repository found here: https://github.gatech.edu/gt-omscs-rldm.
• The quality of the code is not graded. You do not have to spend countless hours adding comments, etc. However, the TAs will examine code during grading.
• Make sure to include a README.md file for your repository that we can use to run your code.
– Include thorough and detailed instructions on how to run your source code in the README.md.
– If you work in a notebook, like Jupyter, include an export of your code in a .py file along with your notebook.
– The README.md file should be placed in the project 3 folder in your repository.
• You will be penalized by 25 points if you:
– Do not have any code or do not submit your full code to the GitHub repository; or
– Do not include the git hash for your last commit in your paper.
• Write a paper describing your agents and the experiments you ran.
– Include the hash for your last commit to the GitHub repository in the header on the first page of your paper.
– Make sure your graphs are legible and you cite sources properly. While it is not required, we recommend you use a conference paper format. For example: https://www.ieee.org/conferences/publishing/templates.html.
– 5 pages maximum—really, you will lose points for longer papers.
– Explain your algorithm(s).
– Explain your training implementation and experiments.
– An ablation study would be an interesting way to find out the different components of the algorithm that contribute to your metric. (See J. N. Foerster et al. 2017.)
– Graphs highlighting your implementation’s successes and/or failures.
– Explanation of algorithms used: what worked best? what didn’t work? what could have worked better?
– Justify your choices.
∗ Unlike Project 1, there are multiple ways of solving this problem and you have a lot of discretion over the general approach you take as well as experimental design decisions. Explain to the reader why, from amongst the multiple alternatives, you chose the ones you did.
∗ Your focus should be on justifying the algorithm/techniques you implemented.
– Explanation of pitfalls and problems you encountered.
– What would you try if you had more time?
– Save this paper in PDF format.
– Submit to Canvas!
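For the training graph required in the procedure above, a simple moving average of soups per episode can be computed as in the sketch below. The function name and window size are illustrative choices, not requirements of the assignment.

```python
import numpy as np


def simple_moving_average(values, window):
    """Return the simple moving average of `values` over `window` points.

    The result has len(values) - window + 1 entries, and is empty if there
    are fewer than `window` values.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    values = np.asarray(values, dtype=float)
    if values.size < window:
        return np.array([])
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the window fully overlaps the data.
    return np.convolve(values, kernel, mode="valid")


# Made-up soups-per-episode numbers for one layout, for illustration only.
soups_per_episode = [0, 0, 1, 2, 2, 4, 5, 7]
smoothed = simple_moving_average(soups_per_episode, window=4)
print(smoothed)
```

When plotting, the smoothed curve is shorter than the raw series by window - 1 points, so the x-axis should start at the episode where the first full window ends.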
1.9 Resources
1.9.1 Lectures
• Lesson 11A: Game Theory
• Lesson 11B: Game Theory Reloaded
• Lesson 11C: Game Theory Revolutions
1.9.2 Readings
• J. N. Foerster et al. 2017
• Lowe et al. 2017
• Rashid, Samvelyan, Witt, et al. 2018
1.9.3 Talks
• Factored Value Functions for Cooperative Multi-Agent Reinforcement Learning
• Counterfactual Multi-Agent Policy Gradients
• Learning to Communicate with Deep Multi-Agent Reinforcement Learning
• Automatic Curricula in Deep Multi-Agent Reinforcement Learning
1.10 Submission Details
The due date is indicated on the Canvas page for this assignment. Make sure you have set your timezone in Canvas to ensure the deadline is accurate.
Due Date: Indicated as “Due” on Canvas
Late Due Date [20 point penalty per day]: Indicated as “Until” on Canvas
The submission consists of:
• Your written report in PDF format (Make sure to include the git hash of your last commit.)
• Your source code
To complete the assignment, submit your written report to Project 3 under your Assignments on Canvas (https://gatech.instructure.com) and submit your source code to your personal repository on Georgia Tech’s private GitHub.
You may submit the assignment as many times as you wish up to the due date, but we will only consider your last submission for grading purposes. Late submissions will receive a cumulative 20 point penalty per day. That is, any projects submitted after midnight AOE on the due date will receive a 20 point penalty. Any projects submitted after midnight AOE the following day will receive another 20 point penalty (a 40 point penalty in total) and so on. No project will receive a score less than zero no matter what the penalty. Any projects more than 4 days late and any missing submissions will receive a 0.
Please be aware, if Canvas marks your assignment as late, you will be penalized. This means one second late is treated the same as three hours late, and will receive the same penalty as described in the breakdown above. Additionally, if you resubmit your project and your last submission is late, you will incur the penalty corresponding to the time of your last submission. Submit early and often.
Finally, if you have received an exception from the Dean of Students for a personal or medical emergency we will consider accepting your project up to 7 days after the initial due date with no penalty. Students requiring more time should consider taking an incomplete for this semester as we will not be able to grade their project.
1.11 Grading and Regrading
When your assignments, projects, and exams are graded, you will receive feedback explaining your successes and errors in some level of detail. This feedback is for your benefit, both on this assignment and for future assignments. It is considered a part of your learning goals to internalize this feedback. This is one of many learning goals for this course, such as: understanding game theory, random variables, and noise.
If you are convinced that your grade is in error in light of the feedback, you may request a regrade within a week of the grade and feedback being returned to you. A regrade request is only valid if it includes an explanation of where the grader made an error. Create a private Ed Discussion post titled “[Request] Regrade Project 3”. In the Details add sufficient explanation as to why you think the grader made a mistake. Be concrete and specific. We will not consider requests that do not follow these directions.
1.12 Words of Encouragement
We understand this is a daunting project with many possible design directions to consider. As Graduate Students in Computer Science, projects that allow you to challenge and expand your skills in a practical and low-stakes manner are crucial. These projects are ideal for testing the knowledge you have garnered throughout the course and applying yourself to a difficult problem commonly faced when applying reinforcement learning in industry. After completing the course, a project like this can be valuable to highlight during interviews, to demonstrate your newfound knowledge to current employers, or to add a (new) section on your resume. Historically, many students have reported back the positive interactions encountered when discussing their projects, sometimes leading to job offers or promotions. However, please remember not to publicly post your report or code. The project is a good talking point and you would be within the bounds of the GT Honor Code if you were to share it privately with a potential employer (if you so desire); however, making any part of this project publicly available would be a violation of the GT Honor Code.
We encourage you to start early and dive head-first into the project to try as many options as possible. We strongly believe the more successes and failures you experience, the greater your growth and learning will be.
The teaching staff is dedicated to helping as much as possible. We are excited to see how you will approach the problem and have many resources available to help. Over the next several Office Hours, we will be discussing various approaches in detail, as well as diving deeper into approaches on Ed Discussions. We are here to help you and want to see you succeed! With all that said:
Good luck and happy coding!