Episode 8
· 21:05
00;00;00;25 - 00;00;22;17
Unknown
Hello and welcome, everyone. It's another Thursday and another big model release. My name is Hendrik. I'm here at the CodeRabbit San Francisco office, joined by our VP of AI, David. Hi. How are you doing? Great. Great to have you back on the pod. Thanks very much. Yeah. One of my favorite guests. David, it's been roughly two months, I think, since we last sat here to talk about the Opus 4.6 release.
00;00;22;19 - 00;00;39;46
Unknown
Right. Today we got Opus 4.7. Yeah, it's a big model release. I think a lot of people are interested to see where it's going. Yeah, I think that's very true. So what was your gut feeling when you first got to play around with it a little bit? So, I mean, I was looking for what changed versus 4.6.
00;00;39;46 - 00;01;01;17
Unknown
Right. So as you're watching these models, a lot of them have been shifting quite radically, I would say, even in terms of how they respond to certain prompts. It's very interesting to see how 4.6 changed pretty dramatically from 4.5, and 4.7 is again a pretty big shift in the way it responds to prompting. So, yeah, clearly they've been working hard at something in there.
00;01;01;17 - 00;01;23;05
Unknown
And the way they've been approaching the problem has been shifting. And we saw some pretty good gains. Yeah. What was maybe one of the first impressions that you had when you got to play around with it? I'd say it's a lot more assertive than previous versions of, generally, the Claude ecosystem. I think of the earlier versions as more of a co-developer, kind of: we should probably look at this.
00;01;23;05 - 00;01;40;41
Unknown
This might be an issue. You should think about fixing it. Having that language, the more inclusive language and more of the, I call it hedging, kind of not taking an affirmative stance like "this is a bug, fix it." It's like, "I think this is probably an issue that we should probably think about fixing." That, you know, hedging a little bit in terms of the language.
00;01;40;45 - 00;02;01;22
Unknown
4.7 seems to have moved a little bit away from that, which is interesting. It's a different tonality, I guess, compared to 4.6. Yeah, that's super interesting. Let's talk numbers a little bit. We obviously run a lot of benchmarks on that, and I think we've seen some massive improvements in certain senses, and there are parts that need to be fine-tuned a little bit more.
00;02;01;23 - 00;02;23;36
Unknown
Yes. What can you tell us about that? Yeah. So I'd say every model that comes out comes with caveats, right? That's why you can't just flip a switch and have the model necessarily work perfectly. It'd be nice if that was the case, but that's not the way it works. And yeah, we saw some improvements. So I'd say raw processing power, the reasoning level, the skill of thinking through multi-step problems and finding solutions.
00;02;23;41 - 00;02;46;01
Unknown
I saw an improvement there. Overall, I'd say we saw about a 15% improvement, give or take, depending on the variability in any given run. And so, yeah, we saw some improvements in recall, just finding bugs, right? Both in our really hard data set, which is more about sheer model power, and in our real-world benchmarking set, which is more about how this model is probably going to behave in the wild.
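The recall figure above, and its usual companion precision, can be made concrete with a small sketch. Everything here is an assumption for illustration: the `RunResult` fields, the idea that each matched comment maps to exactly one known bug, and the numbers, which are placeholders and not CodeRabbit's benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one benchmark run (hypothetical structure)."""
    known_bugs: frozenset    # IDs of bugs labeled in the data set
    flagged_bugs: frozenset  # IDs of known bugs that some comment matched
    total_comments: int      # every comment the reviewer posted


def recall(run: RunResult) -> float:
    """Share of known bugs that were found."""
    if not run.known_bugs:
        return 0.0
    return len(run.known_bugs & run.flagged_bugs) / len(run.known_bugs)


def precision(run: RunResult) -> float:
    """Share of posted comments that matched a known bug."""
    if run.total_comments == 0:
        return 0.0
    return len(run.known_bugs & run.flagged_bugs) / run.total_comments


def relative_gain(old: float, new: float) -> float:
    """Relative change, e.g. 0.15 means 15% better than the old value."""
    return (new - old) / old


# Placeholder numbers: the candidate finds more bugs but posts many more comments.
baseline = RunResult(frozenset(range(20)), frozenset(range(12)), total_comments=20)
candidate = RunResult(frozenset(range(20)), frozenset(range(14)), total_comments=35)

print(f"recall gain: {relative_gain(recall(baseline), recall(candidate)):.1%}")
print(f"precision:   {precision(baseline):.2f} -> {precision(candidate):.2f}")
```

In this made-up run recall rises while precision falls, which is the pattern of extra comments that the conversation goes on to describe.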
00;02;46;06 - 00;03;10;59
Unknown
So yeah, generally speaking, good overall on that front. What we saw in 4.6 did come about again, which is this idea of far more comments. So we found more issues, yes, but we also sort of flooded the space with a lot more comments. And it was actually specific, in the sense that it was labeling a lot more comments in the major and minor categories, not so much more in the critical category.
00;03;11;04 - 00;03;32;45
Unknown
And so that is something we need to deal with, right? Because we don't want to inundate everybody with these excess comments. We need to understand why those comments exist, and whether it's a particular issue that we can then prompt out. Right. And so we did that. We spent some time diving deep into those comments, looking at them, and trying to classify the different areas where it was generating more things than it used to.
00;03;32;50 - 00;03;51;34
Unknown
And so we noticed a few patterns, and we were then able to do some prompt tweaking and sort of filter those out. That got it down to a really good state, I think, where we now have both an improvement in recall and an improvement in precision overall. Oh yeah, that's very interesting. Can you tell our audience a little bit more about what kind of tests we run?
00;03;51;39 - 00;04;10;04
Unknown
Yeah. So we have two sets of tests, right? We have the model benchmarking test of, like, how good at reasoning is this model? I'm just going to give you a really hard problem that requires you to keep a lot of things in mind, walk through multiple steps, and come to some sort of conclusion. And so that's a really hard set, but it's not necessarily going to happen that often in the real world.
00;04;10;09 - 00;04;26;26
Unknown
Not that often do you have a really complex concurrency issue or memory bug in your system. More often you have these simpler bugs that are more pernicious in a certain sense. They happen a lot more frequently, but they're not as, like, brain-teaser kind of things. You just need to know the right information to be able to find that bug.
00;04;26;31 - 00;04;46;18
Unknown
And so that's also a test of our context engineering harness. So we have that other benchmark that tests both the model's ability to handle large amounts of context and our ability to find the right context and solve that problem. And so we test both of them, and we measure both things and try to determine, for any given model and any given changes we might make:
00;04;46;23 - 00;05;04;07
Unknown
Where are we seeing gains, where are we seeing losses, and how do we fix those losses? Right. So on both benchmarks, we saw the recall and the precision go up. The precision went up after we made some fixes. Right. So it's first identifying what went wrong. And that was the comment count going way up.
00;05;04;11 - 00;05;26;23
Unknown
In the major and minor comments, that is. And so, figuring out why. Part of the process can actually be applied to just about anything. Right. So take your conversation with a model inside your system, even if it's a generic system. Say I'm recording all of that, right, including the thought processes that led to a specific moment, and that moment is a comment that shouldn't be there.
00;05;26;28 - 00;05;50;34
Unknown
And then I ask the same model to look through that backtrace, and I say, "This comment is not supposed to be there, and here's why," because I know why. And I say, "What about my prompt, what about the things that were going on, led to this moment and made you make this choice?" And it will then reason about all of the context it has, every bit of your system prompt and user prompt, and give you indicators as to what made it think that was an important thing to tell you.
00;05;50;39 - 00;06;03;55
Unknown
And that's when it can start to point out, oh, maybe my prompt was a little bit vague, or maybe I had something in there that it latched on to that I didn't want it to. Or maybe there's just something missing from my prompt, and then you can ask it to give you ideas of how to update your prompt.
00;06;03;56 - 00;06;23;52
Unknown
Well, and you can then look at those, see which ones make sense, and maybe generalize them a little bit, because they may be too specific to that comment. Then you put that back in, test it again, and you see the difference, right? And if you have that ability, that evaluation harness and suite to measure these things, you can do that iteratively, fairly rapidly, and get to a much better state.
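The diagnosis step described above, handing the same model its own recorded trace and asking what in the prompt led to an unwanted comment, could be assembled roughly like this. It is only a sketch: `RecordedTurn`, its fields, and the wording of the question are all hypothetical, and the actual call to a model is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class RecordedTurn:
    """One recorded interaction (hypothetical structure)."""
    system_prompt: str
    user_prompt: str
    thinking_trace: str   # the model's recorded reasoning, if you store it
    unwanted_output: str  # the comment that should not have been produced


def build_diagnosis_prompt(turn: RecordedTurn, why_unwanted: str) -> str:
    """Assemble a message asking the model to explain its own choice."""
    sections = [
        ("SYSTEM PROMPT", turn.system_prompt),
        ("USER PROMPT", turn.user_prompt),
        ("THINKING TRACE", turn.thinking_trace),
        ("OUTPUT THAT SHOULD NOT EXIST", turn.unwanted_output),
        ("WHY IT IS WRONG", why_unwanted),
    ]
    body = "\n\n".join(f"### {title}\n{text.strip()}" for title, text in sections)
    question = (
        "Which parts of the prompts or context above led to this output? "
        "Point to vague, misleading, or missing instructions, then propose "
        "prompt edits that are general rather than specific to this one case."
    )
    return f"{body}\n\n{question}"


turn = RecordedTurn(
    system_prompt="You are a code reviewer. Report every possible problem.",
    user_prompt="Review this diff: ...",
    thinking_trace="The variable name is short, which could be confusing...",
    unwanted_output="Minor: consider renaming `i` to `index`.",
)
prompt = build_diagnosis_prompt(turn, "Style nitpicks are out of scope.")
print(prompt)
```

The suggestions that come back still need a human pass and a rerun through whatever evaluation set exists before they go into the real prompt.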
00;06;23;57 - 00;06;45;51
Unknown
So it's almost like a self-improving prompting engine. Yes. Yes, it is. And I mean, you have to be a little bit systematic, especially if you're dealing with a very large, very dynamic system, in order to find those wins. Right. Because it's not a small prompt. It's a big system, a big ecosystem. So scanning over that whole thing and understanding where you're going to make these changes is quite dynamic.
00;06;45;51 - 00;07;12;31
Unknown
And so you need some extra help there. Yeah, I can imagine. I mean, there are so many new parameters that also come into play. We also talked a little bit about the different thinking modes that you have and, you know, the new adjustment they released there. Can you talk about that? Yeah. So previously, in Opus 4.5 and prior, we would use something called a thinking budget, where you would have a certain number of tokens that you would be able to use as part of the thinking process that would lead to the output.
00;07;12;36 - 00;07;32;41
Unknown
And you could try to get at that through various means, like: how difficult is this PR, really? How do I figure out how many tokens you might need, try to estimate that, and do some smart logic behind the scenes? Adaptive thinking is something that Anthropic introduced in Opus 4.6, and it is now mandatory in 4.7.
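The older approach of estimating a thinking budget from how hard a PR looks could be as simple as the sketch below. This is a guess at the shape of such logic, not CodeRabbit's implementation: the function name, the inputs, the weights, and the floor and ceiling are all arbitrary assumptions.

```python
def estimate_thinking_budget(
    changed_lines: int,
    files_touched: int,
    floor: int = 1024,     # assumed minimum budget, arbitrary
    ceiling: int = 32000,  # assumed maximum budget, arbitrary
) -> int:
    """Guess a thinking-token budget from rough PR size signals."""
    # Arbitrary weights: bigger and more scattered changes get more tokens.
    raw = floor + 40 * changed_lines + 500 * files_touched
    # Clamp so tiny PRs still get some thinking and huge PRs stay bounded.
    return max(floor, min(ceiling, raw))


print(estimate_thinking_budget(changed_lines=10, files_touched=1))
print(estimate_thinking_budget(changed_lines=5000, files_touched=40))
```

A heuristic like this has to be maintained and retuned by hand, which is the burden adaptive thinking is meant to remove.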
00;07;32;46 - 00;07;52;17
Unknown
And so adaptive thinking is always on. Now you have this effort level parameter: is it low, medium, or high? They added extra high, and you have max. That gives you some control over the maximum amount of thinking it is going to do to try to solve a problem. Now, it's not going to use that entire thing if it doesn't need to.
00;07;52;17 - 00;08;11;12
Unknown
Right. So the point of adaptive is that if it's a simple problem, even if you have it set on max, it's just going to think for a little tiny bit and it's going to answer the question. I think the whole point of it is to improve the average user's experience of those thinking budgets, to make it so you don't have to choose that number.
00;08;11;12 - 00;08;36;16
Unknown
And we can kind of do it more ourselves. Okay. Those more extended thinking sessions also come with a price in terms of time, right? So latency also went up. Yes, that's right. So finding and fine-tuning whatever parameter makes sense for your particular use case is going to improve your latency. So if it can think a lot because it's a very difficult problem, like a review, for example,
00;08;36;20 - 00;09;00;50
Unknown
And you set it to max, then it can overthink. It can think very, very hard and long about a problem when it might not need to. Right. And so having it know that it has a limit, so setting it a little bit lower, it might be a little bit more economical with its thinking. Therefore you get lower latency, because it spends less time churning out these output tokens inside the thinking part.
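One way to pick an effort level for a given use case is to sweep the levels over a benchmark and take the fastest one that still meets a quality target. The sketch below assumes that approach. The level labels mirror the ones mentioned in the conversation but are plain strings here, not real API values, and the recall and latency numbers are placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EffortTrial:
    """Benchmark summary for one effort level (hypothetical structure)."""
    effort: str
    recall: float
    median_latency_s: float


def pick_effort(trials: Sequence[EffortTrial], min_recall: float) -> Optional[EffortTrial]:
    """Return the fastest trial whose recall meets the target, or None."""
    good_enough = [t for t in trials if t.recall >= min_recall]
    if not good_enough:
        return None
    return min(good_enough, key=lambda t: t.median_latency_s)


# Placeholder measurements, invented for illustration only.
trials = [
    EffortTrial("low", recall=0.55, median_latency_s=20.0),
    EffortTrial("medium", recall=0.62, median_latency_s=35.0),
    EffortTrial("high", recall=0.66, median_latency_s=60.0),
    EffortTrial("max", recall=0.67, median_latency_s=140.0),
]

choice = pick_effort(trials, min_recall=0.65)
print(choice.effort if choice else "no level meets the target")
```

With these invented numbers the top level buys very little extra recall for a large latency cost, which is the trade-off being described.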
00;09;00;55 - 00;09;22;08
Unknown
That makes a lot of sense. Yeah, I think I saw on the metrics as well that it used fewer tokens overall in our entire system. Yeah, after tuning. So after tuning and getting some sense of what the right level is when it comes to that parameter, and then after making some of these tweaks to the prompt, we were actually able to get it to think a little bit less and be a bit more economical in terms of token usage.
00;09;22;08 - 00;09;40;38
Unknown
Yeah. What could you recommend to a developer? I mean, there must be thousands or millions of people who might be looking at the new model now, thinking, okay, I have a system running in production. What would be maybe one or two tips you could give somebody to make that step that we did? I think, again, being systematic about the way you're going to think about prompting the system.
00;09;40;38 - 00;10;06;06
Unknown
So first of all, read the prompting guide. It has a lot of information in there that you can take in, distill, and even feed into an LLM to help you: "Here are my prompts. Here's the prompting guide for this new model. Can you help me figure out where I can do better?" And the other one is, again, finding those examples where things don't go well and having that conversation, that interaction with the same model, with the thinking traces in there, at the point of: well, this happened, and it shouldn't have happened.
00;10;06;06 - 00;10;26;05
Unknown
This is why it shouldn't have happened, or this is what's wrong with it. Can you help me update my prompt to fix this issue? And do that for a number of different, varied examples so that you can get some ideas for how to optimize your prompt. That's pretty cool. Yeah, I'll try that for sure. So, let's talk a little bit about general capabilities.
00;10;26;09 - 00;10;46;12
Unknown
So was there something that surprised you, maybe a bug that it found in our data set that the other models historically have not caught, or something that was just, oh, wow, this is a new unlock? Yeah. So like I was saying before, I think to a degree the ability for it to do multi-step reasoning has improved over previous models.
00;10;46;12 - 00;11;07;49
Unknown
And so we did see some gains when it came to finding some of the trickier concurrency issues. Typically those require you to hold multiple things to be true and then follow that logic through to a conclusion, sometimes multiple layers down. That's a very complex problem to solve, and it did this better, actually almost 20% better than previous generations, as a relative amount.
00;11;07;49 - 00;11;31;57
Unknown
Right. And so, yeah, that's a big improvement overall in just raw reasoning capability. I think you're going to find that borne out when you're doing things like coding, right? So being able to, again, hold what is true and then reason through multiple steps, to maybe plan or do some sort of complex logic, it's probably more likely to succeed and not make mistakes in that regard.
00;11;32;02 - 00;11;50;59
Unknown
Yeah. One other thing that stood out to me when I looked through the data points, comparing it against our baseline, was that it found more issues outside of the diff in a given PR. Yeah. So I have to first define for everybody what outside-diff comments are for us. GitHub has a limitation where you can't post a comment unless it's contained entirely within the diff.
00;11;51;13 - 00;12;07;59
Unknown
And so outside-diff comments, for us, mean that the comment might contain part of the lines in the diff, but the bug might also start a little bit higher or a little bit lower. It might even involve more of the entire function, or something else that is going on. So they're more complex to find and more complex to define.
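The distinction drawn here, between a comment that sits entirely within the diff and one that reaches beyond it, can be sketched as a range check against the diff's hunks. This is a simplification under stated assumptions: hunks are modeled as inclusive line ranges in the new file, the three labels are made up for illustration, and GitHub's actual commenting rules have more detail than this.

```python
from typing import List, Tuple

Hunk = Tuple[int, int]  # inclusive (start_line, end_line) in the new file


def classify_comment(start: int, end: int, hunks: List[Hunk]) -> str:
    """Label a comment's line range relative to the diff hunks.

    "inside"  : the whole range fits within a single hunk
    "partial" : the range overlaps a hunk but extends beyond it
    "outside" : the range touches no hunk at all
    """
    if any(h_start <= start and end <= h_end for h_start, h_end in hunks):
        return "inside"
    if any(start <= h_end and h_start <= end for h_start, h_end in hunks):
        return "partial"
    return "outside"


hunks = [(10, 20), (40, 45)]
print(classify_comment(12, 15, hunks))
print(classify_comment(8, 12, hunks))
print(classify_comment(25, 30, hunks))
```

Under the definition given in the conversation, both the "partial" and "outside" cases would count as outside-diff comments, since neither can be posted as an ordinary inline comment.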
00;12;08;04 - 00;12;28;41
Unknown
It's also sort of a little bit about our context engineering, right? So we're bringing in the information that then allows that bug to surface, because it's not contained just in the diff. It's actually outside of the diff, and so you need that information. Once the model has that, it's going to find that bug. Right. And in some cases it's going to notice something in a file that might not actually even be part of your code.
00;12;28;41 - 00;12;49;08
Unknown
Right now, that is. It slipped in in previous years and nobody caught it. And now we're surfacing that information because it's related to your PR, and the model is like, "Well, that's a bug." Yeah. And so it's going to point that out too, and that's also going to be an outside-diff comment that we make. So what will be the difference in the experience for CodeRabbit users?
00;12;49;18 - 00;13;03;46
Unknown
Oh yeah. So I think ultimately we've increased the recall rate. So we're going to find more of those bugs. Long term, your code base is going to be healthier. You're going to have fewer issues, and then you're going to get fewer production outages, which I think is the goal of the entire experience.
00;13;03;46 - 00;13;19;26
Unknown
Right. I know code reviews can have this, like, positive and negative take to them. Like, you have to fix some stuff. But, you know, what I say is: use our "Fix with AI," or use our plugins, and have the systems be fixed for you. Ultimately, at the end of the day, our job is to find as many bugs as we can.
00;13;19;26 - 00;13;49;42
Unknown
So your software is at as high a quality bar as possible, right? Yeah. That's great. Can you give an estimate, maybe, to talk a little bit more practically? How much effort is it actually to bring in a new model like that when a new release comes in? Yeah. It varies, but I'd say it takes at least a couple of weeks of somebody dedicated to the task in order to figure out all the different ways the model differs, all the different prompts that need to be changed, and all the different mechanisms that might alter slightly because of the way it behaves differently.
00;13;49;51 - 00;14;11;01
Unknown
And so it's a process. It definitely requires a lot of hand tuning. For us especially, the quality level of comments is extremely important, right? So it's about making sure that quality bar remains high. We don't want to just switch something on without fully understanding what impact that's going to have on our customers.
00;14;11;01 - 00;14;33;43
Unknown
Right. And so this is a process of measuring the things that we can measure and understanding, before we even roll it out, what the impact is going to be. And that requires effort, right? So what would be some tips that you could give to smaller teams that might not have the same expertise as we do here, with great people like yourself and the great people that work in our engineering org?
00;14;33;48 - 00;14;51;34
Unknown
One tip we heard was the self-improving prompting loop, kind of having the model help you out in improving your system prompt or other prompt assets you might have. But are there some other tips that you have? Yeah. So no matter where you are, when you're building something new, everyone's starting from scratch, right?
00;14;51;47 - 00;15;09;19
Unknown
And so how do you think about building up some sort of data set, some sort of thing that I can then use to understand, when models change, what the impact is, without having to just run it and see what's going on with my users? It's starting to think about, okay, I'm going to do a prompt update, right?
00;15;09;19 - 00;15;33;19
Unknown
And I'm going through this loop and understanding what's happening. If I take those moments and I actually save them, I can then use them later, when models change, to revisit that same loop and see what the effect is, right? So it's getting into this habit: when I'm noticing something interesting, when I see something work well, when I'm noticing a feature that I built start to work or not work, I start to store some of that.
00;15;33;19 - 00;15;53;21
Unknown
And understand that I can then rerun that later. Put it into something, spend some time, spin up Claude Code, and actually build something around this so I can automate and rerun some past examples through there, now with my updated new model and my prompt changes. And let it be a process. Don't think of it as, I have to do the best thing right now.
00;15;53;26 - 00;16;10;14
Unknown
It's a process, right? I'm going to slowly build it up. And if you build it up, even if it's one new example a day or whatnot, over the course of a few months, now you have 100, right? And so the idea is to just make it part of your process of how you're thinking about solving the problem, and then you'll get there.
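The habit described above, saving interesting moments as you go and rerunning them when the model or prompt changes, might start as small as this sketch. The file layout (one JSON object per line), the field names, and the substring check are assumptions chosen for simplicity, and `review` stands in for whatever function calls your actual system.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def save_example(path: Path, example: Dict) -> None:
    """Append one saved moment to a JSON Lines file."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(example) + "\n")


def load_examples(path: Path) -> List[Dict]:
    """Read every saved example back, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def rerun(examples: List[Dict], review: Callable[[str], str]) -> List[Dict]:
    """Run each saved input through `review` and check the expectation."""
    results = []
    for ex in examples:
        output = review(ex["input"])
        results.append({"id": ex["id"], "passed": ex["expect_substring"] in output})
    return results


def fake_review(diff: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return "possible null dereference" if "None" in diff else "looks fine"


examples = [
    {"id": "ex-1", "input": "x = None; x.run()", "expect_substring": "null dereference"},
    {"id": "ex-2", "input": "y = 1 + 1", "expect_substring": "looks fine"},
]
print(rerun(examples, fake_review))
```

Adding one `save_example` call whenever something notable happens is enough to grow the set gradually, in the way the conversation suggests.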
00;16;10;19 - 00;16;30;00
Unknown
How do you measure the quality? I mean, I guess bugs are kind of deterministic. Either you find something or you don't. But maybe comments are not really super deterministic, in the sense that the quality of a comment can vary with the insight I'm getting as the reader, right? How do you measure the quality of a comment from Opus 4.6 against 4.7?
00;16;30;15 - 00;16;51;19
Unknown
Yeah, so there are a lot of different things that we measure when it comes to quality. So there's this readability score that's sort of a more standard one. I think it's called Flesch, the Flesch reading score, essentially. Right. So this is, like, how easily someone's going to be able to digest this information. Looking at how many lines there are in a comment, how many words are being used. If we're suddenly seeing, let's say, for example, right,
00;16;51;19 - 00;17;16;01
Unknown
You found 100 bugs. Yeah. And this other prompt or model found 100 bugs, but it did it in twice as many lines. Then, you know, it's being very verbose, right? It's doing something unnecessary in order to articulate what is wrong. And so you can start to get a feel for that. And then when you see this go out and you test it online, you can see what the impact of this secondary characteristic is on your users.
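The secondary characteristics mentioned here, a readability score plus line and word counts, can be approximated with a short sketch. It assumes the score meant is Flesch Reading Ease, computed with the standard published formula, and it uses a deliberately rough vowel-group heuristic for syllables, so absolute scores are only indicative. The function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of vowels, with a floor of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def comment_stats(comment: str) -> dict:
    """Collect simple verbosity and readability measures for one comment."""
    return {
        "lines": len(comment.splitlines()),
        "words": len(comment.split()),
        "flesch": flesch_reading_ease(comment),
    }


terse = "This loop skips the last item. Use <= here."
wordy = (
    "Upon comprehensive consideration of the iteration semantics, it appears\n"
    "conceivable that the terminating condition inadvertently excludes the\n"
    "concluding element of the collection under examination."
)
print(comment_stats(terse))
print(comment_stats(wordy))
```

Comparing averages of these numbers across two models or two prompts shows whether one is saying the same thing in more lines, before any user-facing test.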
00;17;16;13 - 00;17;31;48
Unknown
Do they like it? Do they not like it? So take the hedging, for example: "This might be an issue, you should think about fixing it." What impact does that actually have on the end user? Do they like it? Do they not? And so you can actually start to see the reactions that people have to these comments. Do they give feedback?
00;17;32;01 - 00;17;52;22
Unknown
Do they implement those comments more often or less often, depending on what's going on there? Do they sign up for your product more or less? And those are the things that you're trying to figure out: what are the secondary characteristics that I can measure and correlate positively with user satisfaction with the product, people signing up, and all these other downstream effects?
00;17;52;22 - 00;18;12;21
Unknown
So it takes time to figure out what those things are. And if you get feedback from people and you talk to your users, you can start to figure out, oh, they don't really like really verbose comments. So, you know, hundred-line comments are probably not a good idea, right? And so you can start to get this notion of what the problems are.
00;18;12;25 - 00;18;32;31
Unknown
And then we know, for example, that in our case having comments with patches, so actual code that fixes the issue, is very well liked. People really like it when you have the solution there. And so when we're evaluating a model, we ask: how many of the comments that it makes actually contain the fix within them? Because we're telling it to create the fix if it can.
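Counting how many comments carry a fix can be sketched as a simple rate. The assumption here is that a patch shows up as a GitHub-style suggestion block inside the comment body. How CodeRabbit actually encodes or detects patches is not stated in the conversation, so the marker and function names are illustrative only.

```python
# Built from pieces so the marker does not close this example's own code fence.
FENCE = "`" * 3
SUGGESTION_MARKER = FENCE + "suggestion"


def has_patch(comment_body: str) -> bool:
    """True if the comment contains a suggestion block (assumed patch format)."""
    return SUGGESTION_MARKER in comment_body


def patch_rate(comments: list) -> float:
    """Fraction of comments that include a patch; 0.0 for an empty list."""
    if not comments:
        return 0.0
    return sum(has_patch(c) for c in comments) / len(comments)


comments = [
    "This can raise on an empty list.",
    "Off-by-one here.\n" + SUGGESTION_MARKER + "\nfor i in range(n + 1):\n" + FENCE,
]
print(f"{patch_rate(comments):.0%} of comments include a patch")
```

Tracked per model, this rate gives one number for the trend of newer models including fixes more often.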
00;18;32;36 - 00;18;51;04
Unknown
The question is, can it all the time, and does it? I've noticed more recent models do that a lot, much more so than last year. Right. So this is something that's been improving over time as well. And we know from talking to users that they like the fact that they can just click a button and have it sort of fixed for them, right?
00;18;51;09 - 00;19;16;36
Unknown
Yeah, that sounds exciting. Talking about the next generation of models, I think. Yeah. Thank you for all those insights. Those were incredibly valuable. Where do you see the next generation going now? I think there might be an Opus 5, a 5.0 or 5.1, coming out soon. Do you see a trend in the way that they've been releasing those models, the emphasis that they had, maybe, on thinking? We already talked about multi-step reasoning.
00;19;16;41 - 00;19;44;20
Unknown
Is that something that you see that they're doubling down on, maybe even differently compared to OpenAI models or Google models? I feel like 4.7 moved onto a general efficiency trajectory, right? I felt like when I went from 4.5 to 4.6, I saw a lot more verbosity in the way that the model was thinking. There was a lot more time spent considering what might be, what might not be, and kind of going through that thought process.
00;19;44;25 - 00;20;09;25
Unknown
I think 4.7 seems to be a little bit more efficient. We saw that in the token usage overall, through our benchmarking. And I think adaptive reasoning is, again, this idea of how do I get the model to adequately judge how much it needs to think about a problem, rather than just letting it potentially ramble, in some cases, around different ideas that might be very unlikely.
00;20;09;30 - 00;20;28;22
Unknown
And so I think it makes sense to me overall, in order to allow us to have better test-time compute. If I say I have a 100,000-token budget, in terms of the time that I'm going to be willing to wait for an answer to come about, then how I use that time is very valuable, right?
00;20;28;22 - 00;20;47;43
Unknown
Whether I want it to go off thinking about all these different things or be efficient. And if we're going to get somewhere where we can do really, really deep, complex reasoning tasks in the time frame that human beings are going to be okay with in this interactive setting, efficiency also needs to come into play: you know, how much the model thinks and how it thinks about problems, right?
00;20;47;48 - 00;20;56;15
Unknown
Well, I think that's a good note to leave it on. Thank you so much for coming on, David. Thanks for having me.
Listen to The Merge (by CodeRabbit) using one of many popular podcasting apps or directories.