Understanding DeepSeek R1

DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 especially exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.

The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs. o1's $15) and output tokens at $2.19 per million (vs. o1's $60).

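To make the gap concrete, here is a quick back-of-the-envelope comparison using those prices; the workload size is invented purely for illustration:

```python
# Rough cost comparison at the list prices above (per million tokens).
# The workload (5M input, 1M output tokens) is invented for illustration.
input_m, output_m = 5, 1

r1_cost = input_m * 0.55 + output_m * 2.19   # R1, using the higher (cache-miss) input price
o1_cost = input_m * 15.00 + output_m * 60.00

print(f"DeepSeek-R1: ${r1_cost:.2f}")        # $4.94
print(f"OpenAI o1:   ${o1_cost:.2f}")        # $135.00
print(f"o1 costs ~{o1_cost / r1_cost:.0f}x more for this workload")
```
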
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

The Essentials

The DeepSeek-R1 paper introduced several models, but the main ones are R1 and R1-Zero. Following these is a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 relies on two key ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they perform Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a tag before answering with a final summary.

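As a concrete illustration, here is a minimal way to separate the reasoning from the final answer, assuming the `<think>...</think>` tag format that R1's outputs use (the example output itself is made up):

```python
import re

# Made-up example of an R1-style response: reasoning inside <think> tags,
# followed by the final summary.
output = """<think>
The user asks for 17 * 24. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.
</think>
17 * 24 = 408."""

match = re.search(r"<think>(.*?)</think>\s*(.*)", output, flags=re.DOTALL)
reasoning, answer = match.group(1).strip(), match.group(2).strip()
print("Reasoning:", reasoning)
print("Answer:   ", answer)
```
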
R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is interesting that some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.

Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It shows how they developed such strong reasoning models, and what you can expect from each stage, including the problems the resulting models from each stage have and how they addressed them in the next stage.

It's interesting that their training pipeline deviates from the usual one:

The usual training approach: pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multi-stage training pipeline with multiple SFT and RL stages

Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from.

First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.

Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples. (A minimal sketch of this rejection-sampling step is shown after these stages.)

Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.

Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

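Here is a minimal sketch of the rejection-sampling step mentioned above; `sample_completions` and `passes_checks` are hypothetical placeholders for the RL checkpoint and the correctness/quality filters, not the paper's actual implementation:

```python
import random

def sample_completions(prompt: str, n: int = 8) -> list[str]:
    # Placeholder: in reality, sample n completions from the stage-2 RL checkpoint.
    return [f"candidate {i} for {prompt!r}" for i in range(n)]

def passes_checks(prompt: str, completion: str) -> bool:
    # Placeholder: in reality, rule-based answer checks, test cases, or a judge model.
    return random.random() < 0.3

def rejection_sample(prompts: list[str]) -> list[dict]:
    """Keep only completions that pass the checks; the survivors become SFT data."""
    kept = []
    for prompt in prompts:
        for completion in sample_completions(prompt):
            if passes_checks(prompt, completion):
                kept.append({"prompt": prompt, "completion": completion})
    return kept

sft_data = rejection_sample(["What is 2 + 2?", "Factor x^2 - 1."])
print(f"kept {len(sft_data)} samples")
```
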
They also performed model distillation for several Qwen and Llama models on the reasoning traces to obtain distilled-R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student.
The teacher is typically a larger model than the student.

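In code, the data-generation side of distillation can be sketched roughly like this; `teacher_generate` is a hypothetical placeholder for sampling from DeepSeek-R1, and the resulting pairs would then be used for ordinary supervised fine-tuning of the smaller Qwen/Llama student:

```python
# Hypothetical sketch: distillation here means generating reasoning traces with the
# teacher (DeepSeek-R1) and fine-tuning the student on them with plain SFT.
def teacher_generate(prompt: str) -> str:
    # Placeholder: in reality, sample a full <think>...</think> trace + answer from R1.
    return f"<think>...reasoning about {prompt!r}...</think> final answer"

def build_distillation_dataset(prompts: list[str]) -> list[dict]:
    return [{"prompt": p, "target": teacher_generate(p)} for p in prompts]

dataset = build_distillation_dataset([
    "Prove that the square root of 2 is irrational.",
    "How many primes are there below 100?",
])
print(dataset[0]["target"][:60])
# The student (e.g., a Qwen or Llama base model) is then fine-tuned on these
# prompt/target pairs with a standard cross-entropy (SFT) objective.
```
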
Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected format, and if the language of the answer matches that of the prompt.

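To make that concrete, here is a toy version of such a rule-based reward; the tag names, weights, and the crude language check are assumptions for illustration, not the paper's actual reward:

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: correctness + format + language consistency."""
    reward = 0.0

    # Format: reasoning must be wrapped in <think>...</think> before the answer.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # Correctness: compare the text after </think> against a known reference answer.
    answer = completion.split("</think>")[-1].strip()
    if answer == reference_answer.strip():
        reward += 1.0

    # Language consistency: crude stand-in check ("if the prompt is ASCII-only,
    # the answer should be too").
    if prompt.isascii() == answer.isascii():
        reward += 0.25

    return reward

print(rule_based_reward("What is 2 + 2?", "<think>2 + 2 = 4</think> 4", "4"))  # 1.75
```
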
Not relying on a reward model also means you don't have to spend time and effort training one, and it doesn't take memory and compute away from your main model.

GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates several different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes minor adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.

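Here is a minimal sketch of the group-relative part of steps 2-4, operating on one reward and one log-probability per sampled response (a real implementation works per token and masks padding); it is illustrative, not DeepSeek's code:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Step 3: normalize rewards within the group of responses to one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(new_logprobs, old_logprobs, ref_logprobs, advantages,
              clip_eps=0.2, kl_coef=0.04):
    """Step 4: clipped policy-gradient surrogate plus a KL penalty toward a reference model."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    kl = new_logprobs - ref_logprobs          # crude per-sample KL estimate
    return -(surrogate - kl_coef * kl).mean()

# Toy example: 4 responses to one prompt, scored by a rule-based reward.
rewards = torch.tensor([1.75, 0.50, 0.00, 1.75])
advantages = group_relative_advantages(rewards)
loss = grpo_loss(new_logprobs=torch.tensor([-2.0, -3.1, -2.7, -1.9]),
                 old_logprobs=torch.tensor([-2.1, -3.0, -2.8, -2.0]),
                 ref_logprobs=torch.tensor([-2.2, -3.0, -2.9, -2.1]),
                 advantages=advantages)
print(advantages, loss)
```
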
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a reward when the model correctly uses the expected syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.

Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

"These findings suggest that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it appears that the improvement is attributed to boosting the correct response from TopK rather than to an enhancement of fundamental capabilities."

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the variety of correct answers) is largely already present in the pretrained model.

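A toy way to see this: if the correct answer is already somewhere in the base model's distribution, RL mostly moves it to the top, which shows up as a large pass@1 gain but a small pass@k gain (the probabilities below are invented):

```python
# Probability that a single sampled answer is correct (invented numbers).
p_base, p_rl = 0.3, 0.8   # before vs. after RL fine-tuning

def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1 - (1 - p) ** k

for k in (1, 16, 64):
    print(f"pass@{k:<2}: base={pass_at_k(p_base, k):.3f}  rl={pass_at_k(p_rl, k):.3f}")
# pass@1 jumps from 0.30 to 0.80, while pass@64 is ~1.0 in both cases.
```
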
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!

<br>[Running](https://www.tranna.co.za) DeepSeek-R1<br>
|
||||||
|
I have used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.

671B via llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

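For reference, a comparable partial-offload setup through the llama-cpp-python bindings looks roughly like this (the model path is a placeholder, the KV-cache quantization options are omitted, and the exact settings should be treated as assumptions rather than a recipe):

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA support)

llm = Llama(
    model_path="path/to/DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder path to the Unsloth 1.58-bit GGUF
    n_gpu_layers=29,   # partial offload: 29 layers on the GPU, the rest on CPU
    n_ctx=4096,
)

out = llm("What is 17 * 24? Think step by step.", max_tokens=512)
print(out["choices"][0]["text"])
```
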
29 layers seemed to be the sweet spot given this setup.

Performance:

An r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite usable for any serious work, but it's fun to run these large models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.

70B via Ollama

70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:

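Via the Ollama Python client, running it looks roughly like this (assuming the model has already been pulled under the `deepseek-r1:70b` tag):

```python
import ollama  # pip install ollama; assumes `ollama pull deepseek-r1:70b` has been run

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "What is 17 * 24? Think step by step."}],
)
print(response["message"]["content"])
```
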
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B showcased above.

Resources

- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
- The Illustrated DeepSeek-R1 - by Jay Alammar
- Explainer: What's R1 & Everything Else? - Tim Kellogg
- DeepSeek R1 Explained to your grandmother - YouTube

DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.

Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1, completely open source (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently found and used some of the core ideas the OpenAI team used on the way to o1.

Liked this post? Join the newsletter.