Video Summary and Transcription
This talk explores the world of coding assistants powered by large language models (LLMs) and their use cases in software development. It delves into challenges such as understanding big code and developing models for context in LLMs. The importance of ranking and code context is discussed, along with the use of weak supervision signals and fine-tuning models for code completion. The talk also touches on the evaluation of models and future trends in code AI, including automation and the role of tasks, programming languages, and code context.
1. Introduction to Coding Assistants
Today, we're going to talk about what kind of ML problems we face, the components of the system, and how to evaluate and consider the context. We'll also discuss the stack, problems in the context, models, evaluation, and nuances. Finally, we'll summarize the current state of code AI and explore the future, as well as touch on complex task automation.
Cool. Good evening, everyone. Thanks for coming back. Thanks for sticking around. It's like a sparse room. Hopefully people chime in.
So my name is Rishabh. This introduction was dated because I was working with personalization, and now I'm working on generative ML. I'm the head of AI at Sourcegraph.
Just a quick raise of hands, how many of you are aware of Cody? A few? Okay. Nice. How many of you are using any coding assistant, including Copilot or Cody? A few. Nice, great. So, yes, today we're going to talk a bunch about what kind of ML problems do we face, high level, what are the different components of the system and how do we evaluate this, how do we think about context around this, essentially.
A quick background. I grew up in India. I was doing my PhD in search and recommendations. I was at Spotify looking at music recommendations, multi-objective decisioning, and recently I'm working on code ML. So I come in from a machine learning background and applying it to media recommendations and now coding as a domain.
So today we're going to talk about four main themes, five minutes each, approximately. We're going to start by talking about what coding assistants do overall. This is similar to what Cody and Copilot and others do, but I'm going to mention one or two nuances about these problems. We're going to spend most of the time talking about what the stack looks like, what the problems are in the context, in the models, in the evaluation, and what the nuances are over there. In the last five to eight minutes, I'm going to summarize where we are headed: what do the levels of code AI look like, where we are right now, what we think could be coming in the future, and then overall my favourite topic, complex task automation, just sharing some thoughts about that.
Cool. So if you start with Cody, then Cody is a coding assistant. It lives in your browser, in your IDE. So if you're looking at VSCode, JetBrains, Eclipse, any of these IDEs you're working with, it's a coding assistant which helps you be a more productive developer. And specifically what that means is we have features like autocomplete, which says that if you're writing code, then we can help you write code faster.
2. Coding Assistants: Use Cases and Problem Areas
You can code faster using LLMs by utilizing completion, chat, unit test generation, explain code, code smells, security and vulnerability detection, and custom commands. These are the use cases people are currently using Cody for. We'll discuss the overlapping problems and three main areas: context, models, and evaluation.
So you can code faster using LLMs. An example here would be that, hey, I'm writing code, but I'm writing code in a big repository. So there's context about what I'm trying to do. And it's not just about this file; there's some other context in some other part of the repository, or maybe in some other repositories as well. We try to write the correct completion and suggest that. And when you're suggesting, you can make a single-line completion or a multiline completion as well. And then the latencies all kind of factor in over there. So this is what completion looks like.
You can also chat about repositories, about your code. Imagine you're a new developer on a team and you want to understand what this code base is about. So then you'll go in and say, hey, tell me about what this is. Or if I want to do this, just chat about your code base, essentially. You can also start looking at unit test generation. And the problem would be that I select a piece of code and then I can go to Cody and say, can you please generate a unit test for this? You can also look at explain code, just to get up to speed. Similar to chat, but a bit more explanation centric. You can also look at code smells. Code smells is: find me ways in which I can optimize this code a bit more. You can also extend it to security and vulnerability detection and a bunch of those. And the list goes on. You can create custom commands as well, which means that if there is something which you do ten times in a day, you might as well create a custom command and let Cody do that for you. So these are the use cases which right now people are using Cody for. Now the question then becomes: what does it take to build this? And what are we talking about? And part of this, 70% of this, is overlapping with not just coding assistants, but also any copilot you see out there in the world. Even if it's a sales copilot, marketing copilot, risk analyst copilot, all these people are trying to develop agents to help users across the industry. So most of these problems are pretty common over there as well. So we're going to talk about three problems here: the context, the models, and evaluation. And all of these are super important. And this is not the first time we are seeing them.
3. Context, Models, and Evaluation in LLMs
At Spotify, we faced the problem of understanding the user's context and their preferences for music recommendations. We utilized LLMs as models for recommendations. The evaluation process is similar to other recommendation systems. The challenges we face in LLMs apps are not new, as we have been applying machine learning to different domains for the past 15-20 years.
I was developing recommendations at Spotify. At Spotify we had the same problem: I don't know what you want. I want to understand your context. I want to find out which type of music you want to listen to. So the context sources over there are the kind of music and artists you want to listen to. The models we have now are LLMs; the models recommendation people had were ranking models, recommendation models. The evaluation problems are similar to the kind of problems we are thinking about now and the kind of problems recommendation systems were built around. So the point is, the problems we are facing in LLM apps right now, in applied machine learning, are not new. We have done this for the last 15, 20 years across search, across recommendations, and now we are trying to do it in this new application domain.
4. Understanding Big Code and Context in LLMs
Developers face the challenge of understanding big code, especially in code AI tools. The problem arises with repositories containing millions of lines of code, making it difficult to determine relevance. A query is used to find relevant code snippets, leveraging the universe of code available. However, determining relevance is a complex problem. The LLM industry is currently working on reducing hallucinations by identifying the correct context. For example, when adding tags to a completion dataset, the LLM uses context items that may or may not be correct. Solving this problem will require significant industry effort in the coming years.
So let's talk about context first. The part about context, especially in a code AI tool is that we all face this problem of big code. And what big code means is imagine you are a developer at a Fortune 500 bank in the U.S., right? You have 20,000 developers in the company, you have like 15,000 repositories, millions and millions of lines of code. Some of this was written 20 years ago. So how do you even do search and how do you know what is relevant and what is not?
So that is the big code problem. It is different when I say, hey, write me binary search in Rust; then there is no need to understand repository context. But if I'm a developer in a big company, I need to understand what is out there, what kind of APIs I use, what kind of styling guidelines I use, and that needs a lot of code search. So this is where we come to: given a query, and a query could be a chat query, an autocomplete query, I'm trying to do something, right? If I'm trying to do something, then find the relevant pieces of code in the universe for me, right? So the developer says: look at what I'm doing in my IDE, look at the current repositories, look at all the repositories in my organization, and if you feel like it, look at all the code ever written that you have access to. I don't care. I just want you to help me do this job better, right?
So just find whatever is useful and relevant code, use it, send it to the LLM and get a correct answer. The question here becomes: what is relevant? Now, this is one of the hardest problems which everybody working in the LLM industry is facing right now. This is what helps you reduce hallucinations. If you get the right context, the LLM will give the right answer, but you don't know what the right context is. So this is where I'll give you an example, right? The question over here is, hey, how do you add some tags in the completion dataset I'm creating? So I'm in a repository, I fire up Cody, ask this question. It says, hey, based on this, here is what I think is the right answer. If you zoom in, then it picked up eight of these context items, right? And this is the context which we are fetching. I'll talk about what these are and where they are coming from. But the point is, you get these context items, the LLM uses them. Now, how do I know whether each of these eight is correct or not? That is a hard problem. I think this is where the industry is going to spend the next three to four years, regardless of the domain you're working on: I have a set of documents, I have a set of repositories, I have a set of videos, whatever it is across the domains you're working on. Context, RAG (retrieval-augmented generation), means bringing in relevant context. You bring in relevant context, but I don't know if it's correct or not. Why? Because, yeah, maybe it's useful for this query, maybe it's not useful, maybe it's adding noise, maybe the LLM just gets distracted, or maybe it has the exact answer which the LLM needs to make the right generation for you. But this is hard to do in the wild. Let's take a step back.
5. Developing Models for Context in LLMs
In the context industry, there is a lack of feedback loop when using LLMs. Code snippets are sent to the LLM without being shown to the user, hindering model training and evaluation. Developing models that incorporate context is a multi-stage process involving multiple context sources and optimizing for recall.
My point is we've done this before in recommendations. But I can't reuse that. In recommendations, if you look at Spotify, right, how many of you use Spotify? Great. Of course. So, at Spotify, I would recommend music to you, right? I might bring in musicians from your locale; if you're in Amsterdam, I might bring up some Dutch artists, I might bring up some popular or niche music, all of these, right? Each of them is a context source, each of them is providing me a set of artists which I can potentially recommend to you. If I pick up an artist and show it to you, you either skip the song, you like the song, or you save the song. That is feedback which I get. Then I can go back and say, hey, you picked up this artist, the user saved that track to a playlist and streamed it later. Great. Good job. You got a plus one. So we have the feedback loop in the search and recommendation system. Why? Because we pick up the item, we show it to the user, the user likes it or doesn't like it, interacts, skips, and I can train my model. As an ML engineer, I'm happy. Here the problem is I never show these items to the user. I just send them to the LLM. So I never get the feedback loop. This is the biggest problem which people working on context are facing right now. Why? Because you basically ask a question, you bring in some code snippets, you never show them to the user. You send them to the LLM, the LLM just generates an answer and that's it. So we don't ever get this label, and that's why I can't train a model and I can't evaluate how my context sources are doing. And this is a problem which coding assistants, any LLM-based application, are facing right now. And the problem over here is: how do I develop models which can bring in context? The proposal over here is that it's a multi-stage process. You have many, many context sources, right? Imagine Spotify and Netflix. If you look at TikTok, it will bring in short videos about some trend happening in the Netherlands right now, some popular event happening. I know you follow these creators, so I'm going to bring in some videos from those creators. It's going to bring in many, many context sources and then it's going to pick up the top five useful to you and show those. Now, this is a two-step process. You bring in some items, wherein you're optimizing for recall. Recall means that if there is a useful thing, am I bringing it in or not? So now you have three, four candidate sources, you're bringing in many, many items and making sure you don't miss out on any relevant information.
6. Ranking and Code Context in LLMs
The second step involves the ranker selecting the best items from the candidate sources and sending them to the LLM. Different techniques like keyword search, vector embeddings, and analyzing local editor files help bring in relevant code snippets. The challenge lies in ranking the items to pick the top ones for the LLM. This is also why ChatGPT on its own can't provide accurate answers: it lacks code context.
The second step is the ranker. The ranker says, hey, great, you guys have brought enough items; let me select the best ones, pick the top five, and send them to the LLM. So that's the two-step process. You have many, many candidate sources which are going to bring in items, they are optimized for recall, and then the ranker comes in, selects the best and sends them to the LLM. That's the two-step process.
Now, what exactly does that mean in the code AI world, right? And this is what it means. I mean, you do basic keyword search like BM25 or Zoekt indexing or any of the other keyword search you have. So that is exact match, right? Or keyword match or some query reformulation and all. You can do vector embeddings. You can embed the code, you can embed the documentation of the code, you can embed a bunch of different things. And then you embed the query. And then you have the query, you have the code embedded, and then you can do similarity matching and bring in some relevant items. You can also look at local editor files: what kind of files were you editing right now, recently? So there's a large list of things which you should be looking at. All of these are potentially relevant pieces of code which I should be bringing in to help you answer this question. Right?
So they all bring in some items. But the question becomes: how do I rank them? Imagine each of them bringing in 50 or 100 items. Some of them will be overlapping, some not. I have 150, 250 items in total. How do I pick the top 10 and send them to the LLM? That's the ranking problem. And once you do this ranking, you can create this entire beautiful prompt and send it to the LLM. Now, this is exactly the difference between using GPT, ChatGPT, or any of the copilots. Because ChatGPT doesn't know about your code context, it can't bring in the relevant code snippets. It can't give you the right answer because it just doesn't know the context. And here the hope is that you have developed these systems which can really understand context. Now, this is the main problem I mentioned. In music, in short videos, and if you open up Netflix, you either watch the episode or you move on. There you have labeled data; I can train models. Here, I don't have labeled data.
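Before going into that label problem, here is a minimal, hypothetical sketch of the two-step pipeline described above: recall-oriented candidate sources feeding one ranker, whose survivors are assembled into the prompt. Everything in it is an assumption for illustration; the `Snippet` type, the trivial keyword-overlap scoring, and the stand-in sources are placeholders, not Sourcegraph's or Cody's actual code.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Snippet:
    path: str
    text: str
    source: str  # which candidate source produced it

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def _overlap(query: str, text: str) -> int:
    return len(_tokens(query) & _tokens(text))

# Step 1: candidate sources, each tuned for recall ("don't miss anything useful").
def keyword_candidates(query: str, corpus: list[Snippet], k: int = 50) -> list[Snippet]:
    """Stand-in for BM25 / Zoekt-style keyword search."""
    hits = [s for s in corpus if _overlap(query, s.text) > 0]
    return sorted(hits, key=lambda s: -_overlap(query, s.text))[:k]

def embedding_candidates(query: str, corpus: list[Snippet], k: int = 50) -> list[Snippet]:
    """Placeholder for dense retrieval; a real system would embed the query
    and the code and query a vector index."""
    return keyword_candidates(query, corpus, k)

def recently_edited_candidates(open_files: list[Snippet], k: int = 20) -> list[Snippet]:
    """Files the developer touched recently are often highly relevant."""
    return list(open_files)[:k]

# Step 2: the ranker pools candidates, deduplicates, keeps the top few,
# and the survivors are assembled into the prompt.
def rank(query: str, candidates: list[Snippet], top_k: int = 10) -> list[Snippet]:
    unique = list(dict.fromkeys(candidates))  # dedupe, keep first occurrence
    return sorted(unique, key=lambda s: -_overlap(query, s.text))[:top_k]

def build_prompt(query: str, context: list[Snippet]) -> str:
    blocks = [f"# File: {s.path}\n{s.text}" for s in context]
    return "\n\n".join(blocks) + f"\n\n# Task:\n{query}"

corpus = [Snippet("tags.py", "def add_tags(dataset, tags): ...", "index")]
open_files = [Snippet("train.py", "dataset = load_completion_dataset()", "editor")]
query = "how do I add tags to the completion dataset"
pooled = (keyword_candidates(query, corpus)
          + embedding_candidates(query, corpus)
          + recently_edited_candidates(open_files))
print(build_prompt(query, rank(query, pooled)))
```

The important part is the shape, not the scoring: many recall-oriented sources feed one precision-oriented ranker, and only the ranked survivors ever reach the LLM.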
7. Weak Supervision Signals and Heterogeneous Ranking
We rely on weak supervision signals and train on almost-correct, synthetic data. The ranker not only has to rank code but also heterogeneous content from various sources. By bringing in code as well as non-code context sources, we can improve the accuracy of answering questions and completing code. Different ML models are available, and you can choose the LLM based on your requirements and latency tradeoffs.
So then we rely on weak supervision signals. There's an entire area of machine learning which is focused on how do we... I don't have labels. Let's cook up labels. Let's get synthetic data. These are like weak supervision signals. They're not exactly correct, but almost correct. Let's train on them. So that's one.
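As a hedged illustration of what such a weak supervision signal could look like (a generic technique sketched here, not the actual labeling pipeline): treat a retrieved snippet as relevant if the completion the user eventually accepted reuses a noticeable share of the snippet's identifiers.

```python
import re

def identifiers(code: str) -> set[str]:
    """Crude identifier extraction: identifier-like tokens of length >= 3."""
    return {tok for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code) if len(tok) >= 3}

def weak_label(context_snippet: str, accepted_completion: str, threshold: float = 0.2) -> int:
    """Weakly label a context item as relevant (1) or not (0): if the accepted
    completion reuses a noticeable fraction of the snippet's identifiers, the
    snippet probably helped. Noisy, "almost correct" labels like this are what
    you train the ranker on when real user feedback is missing."""
    ctx_ids = identifiers(context_snippet)
    comp_ids = identifiers(accepted_completion)
    if not comp_ids:
        return 0
    overlap = len(ctx_ids & comp_ids) / len(comp_ids)
    return int(overlap >= threshold)

print(weak_label("def add_tags(dataset, tags): ...", "tags = add_tags(dataset, ['train'])"))  # 1
```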
The other problem is I don't just have to rank code. Because if you're in an enterprise, you're not just looking at code. You're looking at Jira issues, documentation, Slack messages, Linear issues, and all the other roadmaps you've created, right? So now my ranker doesn't only have to rank code. It has to rank heterogeneous content. Again, we have seen this problem before.
If you look at TikTok, if you look at WeChat in China, WeChat is ranking apps and short videos and images and long videos and news items all in the same feed. Your Instagram page, right? You go to Instagram, Instagram has a mix of images, short videos, long videos, all of it combined. You go to Spotify, it's ranking podcasts, music, all of it together. In the industry we have tackled heterogeneous ranking before. We have just not done it for LLMs, right? So there's a lot of commonality we can bring across. What that means is I can now start looking at developing relevance functions and ranking modules for other context sources as well. So this is where you start bringing in not just code, but also a bunch of other context sources, all so that I have enough information to answer your question better. And the request may not be chat; it could be writing a better use case for you, a unit test for you, explaining your code better, or even completing your code better, right? So this is what context is about, right?
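A minimal sketch of what heterogeneous ranking could mean in practice, under the assumption (mine, for illustration) that every item type is reduced to a shared text signal plus a per-type prior; in a real system both would be learned, for example from weak labels like the ones above.

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    kind: str    # "code", "doc", "issue", "slack", ...
    title: str
    body: str

# Hypothetical hand-set priors; in practice these weights would be learned.
TYPE_PRIOR = {"code": 1.0, "doc": 0.8, "issue": 0.6, "slack": 0.4}

def score(query: str, item: ContextItem) -> float:
    """One scoring function across heterogeneous items: a shared text-overlap
    signal scaled by a type-level prior, so code, docs, issues and chat
    messages become comparable in a single ranked list."""
    q = set(query.lower().split())
    text = set((item.title + " " + item.body).lower().split())
    return (len(q & text) / max(len(q), 1)) * TYPE_PRIOR.get(item.kind, 0.5)

def rank_heterogeneous(query: str, items: list[ContextItem], top_k: int = 5) -> list[ContextItem]:
    return sorted(items, key=lambda it: -score(query, it))[:top_k]

items = [
    ContextItem("code", "tags.py", 'def add_tags(dataset, tags): """Add tags to the completion dataset."""'),
    ContextItem("issue", "Add tags to completion dataset", "We need tags for evaluation"),
    ContextItem("slack", "random chat", "lunch plans"),
]
for it in rank_heterogeneous("add tags to the completion dataset", items):
    print(it.kind, it.title)
```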
Quickly going into models. Now we have a lot of ML models available, right? And Cody also provides you an option. And this is one of the differences between some of the other coding assistants versus Cody: you can choose which LLM you want, based on your choice, based on latency and all the other requirements. The point is, each of these features demands different latency tradeoffs. When you're completing code, you're not going to wait for two seconds. You need it in 300, 400 milliseconds.
8. Small Models and Fine-Tuning for Code Completion
For different use cases, we need models with different performance and latency tradeoffs. Some smaller models perform better on latency, while larger models are preferred for other cases. We fine-tuned smaller models specifically for code completion by considering context information and using fill in the middle training. This has resulted in significant improvements in code completion. We also observe differences across languages in code metrics.
So there I have to go to small models. I'm not waiting for a Code Llama at 70 billion parameters; it's going to take too long. But then if I'm looking at an agent or chat, I'm okay to wait for maybe 800 milliseconds, maybe 1.2 seconds, because I want you to get it right. Take your time. Take the entire night, but write the right code. My point is these numbers are dummy numbers, but you get the sense of the tradeoff: for some use cases, I need faster models; for some, I need better models that can take their time.
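To make that tradeoff concrete, here is a toy sketch. The model names, latencies, and quality scores are made-up placeholders (not benchmarks of real models, and not how Cody actually routes requests); the point is only that each feature carries its own latency budget and takes the best model that fits it.

```python
from dataclasses import dataclass

@dataclass
class ModelOption:
    name: str
    p50_latency_ms: int   # illustrative numbers, not real benchmarks
    quality: float        # some offline quality metric in [0, 1]

CATALOG = [
    ModelOption("small-completion-model", 250, 0.62),
    ModelOption("medium-chat-model", 900, 0.74),
    ModelOption("large-reasoning-model", 2500, 0.85),
]

# Rough per-feature budgets in the spirit of the talk: completion must be fast,
# chat and agents can wait longer for a better answer.
LATENCY_BUDGET_MS = {"autocomplete": 400, "chat": 1200, "agent": 5000}

def pick_model(feature: str) -> ModelOption:
    """Among models that fit the feature's latency budget, take the highest quality."""
    budget = LATENCY_BUDGET_MS[feature]
    affordable = [m for m in CATALOG if m.p50_latency_ms <= budget]
    return max(affordable, key=lambda m: m.quality)

print(pick_model("autocomplete").name)  # -> small-completion-model
print(pick_model("chat").name)          # -> medium-chat-model
```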
And that means for some use cases smaller LLMs are preferred, and for others larger LLMs are preferred. But also there's a nice curve between performance and latency, right? You start seeing that some of these smaller models do much better on latency. It's not just, hey, what is the best metric overall in quality? No, I have to make this tradeoff accordingly. And the question then becomes: can I pick up these small models and make them better? And that's what we did. We picked up some of these smaller models, fine-tuned them and made them more performant.
And to do that, it's not just, hey, fine-tune for everything. It is task-specific fine-tuning. What I mean by that is, if you're doing code, it's not just left-to-right completion. You have to use the prefix and the suffix. This is fill-in-the-middle. Because if you're writing this code, you have some information above the code and below the code; you have to use all of that, and then fine-tune on this dataset. This is called fill-in-the-middle training. And this just makes code completion much better. I mean, we saw significant gains, and we are rolling this out to Cody users as we speak.
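A rough sketch of how a fill-in-the-middle training example can be constructed. The `<fim_prefix>` / `<fim_suffix>` / `<fim_middle>` sentinel names follow one common open-source convention (StarCoder-style); the exact tokens are model-specific, and this is not the actual fine-tuning code.

```python
import random

def make_fim_example(code: str, rng: random.Random) -> str:
    """Turn an ordinary code snippet into a fill-in-the-middle training string:
    cut out a middle span, keep prefix and suffix as context, and train the
    model to produce the removed span."""
    i, j = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # Sentinel tokens follow a common open-source convention; the exact tokens
    # any given model expects may differ.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

rng = random.Random(0)
print(make_fim_example("def add(a, b):\n    return a + b\n", rng))
```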
Finally, we see the differences across languages as well. I mean, on the Y axis, you have some code metrics. On the X axis, you have languages.
9. Fine-Tuning Models and Evaluation
Some languages are not represented in the pre-training datasets, so we performed language-specific fine-tuning, resulting in over 20% performance improvement for Rust. Fine-tuning models for different tasks and latency requirements is possible. Evaluation involves component-specific and end-to-end evaluations, considering context, ranking, and overall performance. Various evaluation needs include ensuring the right context and knowledge awareness. Both offline and online evaluations are crucial.
You see that some of the popular models are at 0.8 for Python, but at 0.4 or 0.5 for MATLAB and Rust. Same for open-source models as well. There's a big gap, right? I mean, why? Because some of these languages are not represented in the pre-training datasets.
What do we do? Maybe we can do language-specific fine-tuning. And that's what we did. We said, hey, I want to make a model better for Rust. The default models are not good enough on Rust, so let's fine-tune them. And we saw above 20% improvement in performance once you start fine-tuning for Rust only. We wrote a blog post about this, posted a couple of days ago. So again, the point is, in the model land, you can start looking at fine-tuning models for tasks, for different latency requirements and use cases.
Finally, evaluation, right? I'm going to rush through some of this. But in terms of evaluation, you're looking at component and end-to-end evals, right? Again, this is the entire stack for most of the LLM apps out there, not just coding assistants. You have the user query, you're doing some processing, you're fetching context, you're ranking context, sending it to the LLM, getting the response, doing some preprocessing and postprocessing and showing it to the user. Now, I can do component evaluation. How well is my specific context source working? How well am I ranking? Maybe the LLM prompts are not good; let's do some optimization there. That's component-specific evaluation. And then I can look at the overall picture. How is my context engine doing? That's end-to-end context engine evaluation. Or how well is Cody doing overall? Is it helping you or not? So my point is, there's a lot of component-specific and end-to-end evaluation that you have to do. Now, the evaluation needs are varied, right? I need to understand if you're bringing the right context, if you're retrieving the right items, whether the LLM is generally aware of the knowledge it needs. Maybe there's a large-scale data library, and when some model was trained, it wasn't even aware of this library, so it lacks that general know-how. That's an evaluation. There's a bunch of evaluations I have to do, right? The key point is, I have to do it both offline and online.
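For the component-specific side, here is a minimal sketch of how a context engine might be scored offline with recall@k and MRR over a small labeled (or weakly labeled) query set; the query and file names below are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the known-relevant items that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant item (0 if none retrieved)."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

# Tiny illustrative eval set: query -> (ranked retrieved ids, relevant ids).
eval_set = {
    "how do I add tags to the completion dataset": (
        ["labels.py", "README.md", "dataset.py"], {"dataset.py", "labels.py"}),
}
for query, (retrieved, relevant) in eval_set.items():
    print(query, recall_at_k(retrieved, relevant, k=3), mrr(retrieved, relevant))
```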
10. Offline Evaluation and Future Trends
Offline evaluation is crucial for determining the effectiveness and worthiness of new models, prompts, and contexts. It allows for more iterations and saves time compared to relying solely on A-B tests. However, offline-online correlation is essential to ensure the accuracy and reliability of the evaluation. The entire recommendations and search industry has dedicated evaluation researchers for this purpose. Looking ahead, code AI is divided into three categories: human-initiated, AI-initiated, and AI-led, each with increasing levels of autonomy.
By offline evaluation, I mean if I have a new model, new prompt, new context, I need to know, before I ship it to users in an A-B test, is it better? Is it worth the productionization effort or not? That's offline evaluation. Online is, I roll it out in an A-B test: was it useful for my users or not? Completion acceptance rate: I showed you a completion, did you accept it? If you accepted it, did you remove it five minutes later, or did it persist?
The question then becomes, in terms of offline iteration: if I don't have an offline evaluation, everything has to be A-B tested. You're only going to do a few of them, like five in a month. If you have an offline evaluation, you can do 10X, 100X more. So that's the value of offline evaluation. But for that, you need to make sure that you have offline-online correlation. Again, this is not new. The entire recommendations and search industry at Google, YouTube, Netflix, Spotify, they have an army of evaluation researchers who focus on offline-online correlation. Why? Because if your offline evaluation says model B is better than model A by 5%, you try it online, and it turns out, no, model A is better than model B, your offline evaluation is crappy and you should throw it away. I can't trust it. So it has to be directionally correct: if it says this model is better and I try it in an online test, it has to be better. And it has to be sensitive: if it says 5% movement, I should get 4% to 6% movement. If it says it's going to make things better by 5% and online it only makes things better by 0.1%, then it's also useless. So that's the entire point of offline-online correlation.
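A small sketch of what checking offline-online correlation can look like, with purely illustrative numbers: for a set of past experiments, compare the delta the offline evaluation predicted with the delta the A-B test actually showed, and check both directional agreement and correlation. (Uses `statistics.correlation`, available in Python 3.10+.)

```python
from statistics import correlation  # Pearson correlation (Python 3.10+)

# Each entry: (offline delta predicted for a model change, online A-B delta observed).
# The numbers are purely illustrative.
experiments = [
    (+0.05, +0.04),
    (+0.02, +0.01),
    (-0.03, -0.02),
    (+0.04, -0.01),   # an offline "win" that did not replicate online
]

offline = [o for o, _ in experiments]
online = [a for _, a in experiments]

# Directional correctness: how often offline and online at least agree on the sign.
directional = sum((o > 0) == (a > 0) for o, a in experiments) / len(experiments)
print(f"directional agreement: {directional:.0%}")
print(f"offline/online correlation: {correlation(offline, online):.2f}")
```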
In the last minute, I want to focus on where we are headed. This is what's going on right now. Again, I'm pretty sure this is not just coding assistants, not just Git or Cody or Sourcegraph. I think these are common problems across a bunch of LLM-based applications right now, because they all have to iterate on context, evaluation, ranking, LLM fine-tuning, all of this. But where are we headed? In terms of code AI, Quinn, our CEO, wrote a very nice blog post, which is over here, dividing the lens of code AI into three buckets. One is human-initiated, which we crossed maybe one year ago, where we went to code creation and code assistants, and then AI-initiated. Right now we're starting to see these agentic frameworks getting set up, wherein you can have the ML model try to do something for you. You step in, you say, no, this is not correct, let me iterate, and then do it. And then finally, AI-led, which is full autonomy.
11. Agentic Approach and Feedback Loop
We're currently somewhere between levels three and four. Taking an agentic approach to complex tasks and code scenarios requires reinforcement learning, planning, task decomposition, and feedback loops for automation. However, the lack of a feedback loop hinders progress in code development. Once we establish a feedback loop, we can gradually automate tasks. Coding assistants such as Cody enhance developer skills and address context problems. Avoiding info bubbles in code assistants involves focusing on task-driven applications rather than proactive recommendation systems.
We're probably far away from that. So right now, we're somewhere between three and four, based on the kind of feature and the product you work with.
The last point is the thing which I love the most. My PhD topic was complex tasks. So this is where I think, again, if you have to take an agentic approach to some of these code and non-code scenarios, you have to have the ability to break down complex tasks into some planning, some subtasks, attempt some tasks, see whether it's working or not, get that feedback loop, and then wrap it all up together.
So this is a very, very hand-wavy way of saying we need reinforcement learning and planning and task decomposition, attempting each of these subtasks, getting that feedback loop in place, and then training it end to end. Now this is where, again, as a machine learning engineer who has not come from the code domain over the last ten years: I don't have a feedback loop.
There are very few build systems which I can use to get feedback on whether this code works or not. Imagine if you're at any of the big banks, right? I mean, your repository is huge, millions of lines of code. I cannot build it. If I generate a unit test, it takes hours to build, and probably even some developers in your company are unable to build it within a few minutes. The point is, you generate these things, but the feedback loop is missing. So I think that's my entire call: look, once we are able to get that feedback loop in, then we can start automating some of these tasks, one step at a time. So that's where we are.
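A deliberately hand-wavy sketch of that loop, with hypothetical `plan`, `attempt`, and `feedback` callables standing in for an LLM planner, an LLM code generator, and a build/test system respectively; none of this is a real agent implementation.

```python
from typing import Callable

def run_agent_loop(task: str,
                   plan: Callable[[str], list[str]],
                   attempt: Callable[[str], str],
                   feedback: Callable[[str], bool],
                   max_retries: int = 2) -> dict[str, bool]:
    """Decompose the task, attempt each subtask, and use a build/test signal
    as the feedback loop the talk says is missing."""
    results = {}
    for subtask in plan(task):
        ok = False
        for _ in range(max_retries + 1):
            patch = attempt(subtask)          # e.g. an LLM-generated code change
            ok = feedback(patch)              # e.g. does the build / test suite pass?
            if ok:
                break
        results[subtask] = ok
    return results

# Toy stand-ins so the sketch runs; real plan/attempt/feedback would call an LLM
# and a build system.
demo = run_agent_loop(
    "add tags to the completion dataset",
    plan=lambda t: ["write the helper", "add a unit test"],
    attempt=lambda s: f"# patch for: {s}",
    feedback=lambda patch: "unit test" not in patch,  # pretend the test subtask fails
)
print(demo)
```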
I'll wrap up. We looked at coding assistants, especially Cody, and how it helps you be a better developer. We looked at some problems in context, models, and evaluation, some of the wins we have got from fine-tuning, and where we are headed in terms of agentic workflows: how do we look at complex tasks and start automating them one task at a time? Yeah. That's it. All right.
So someone asked: RAG systems have the problem of creating an info bubble. How do you make sure you avoid that for code assistants, so that they do not fixate too much on certain sources? Yeah. That's a good question. I think the question is, like, recommendation systems have a filter bubble, the echo chambers: if you like something, I'll keep bubbling up more of it for you. The difference between LLM applications and recommendation systems is that recommendations are, like, very proactive. They're, like, hey, you come to the feed, you come to the home page, you don't have to ask a query, I'll show you what you want, right? But with LLMs, most of the applications are task driven. You don't go to ChatGPT and it gives you a feed. No, that feed doesn't exist. You have to have a task which the LLM is trying to help you with.
12. Tasks, Programming Language, and Code Context
Because LLM interactions are task-driven, the scope for the model to influence users is limited. Currently, there is no feed around LLMs, unlike social media platforms. LLM applications assist users with the tasks they bring, rather than proactively serving a feed that creates filter bubbles. The idea of a programming language designed for higher predictability and abstraction is intriguing, as it could provide high-level steps for tackling complex tasks; however, how such a language or protocol would be defined is yet to be determined. Code prediction in Cody is not limited to code repository context but extends to other sources like documentation and conversations, making it valuable in multi-repository environments with thousands of developers and millions of lines of code.
So because there's a task, the amount of influencing an LLM can do is limited. At least right now, right? I mean, right now, around the LLMs, there is no feed. And a feed is where I can be lazy: I can just open TikTok and I'll get a stream, I'll just scroll, low effort, and I'll be in my filter bubble. Right now, at least, the setup is not like that; the copilots are not making you do that. They're, like, hey, what are you trying to do? I'll help you do it better. So I think that task orientation and the lack of proactive assistance is what keeps these LLMs from being, like, filter bubbly.
Okay, let's check what we have up next. Okay, so: would you ever want a programming language designed for higher predictability with code assistants? That's a great question. I never thought about whether we need, like, a programming language. I think we can talk about abstractions, right? Look, to tackle a complex task, what I need is high-level steps, right? A sequence of steps which I would follow. Now, right now, the programming languages are, like, yeah, I mean, they're writing syntax for, like, low-level details, right? We have abstracted away from compilers to languages. But I think that's a great point, that maybe there is something, especially as we start looking at the outer loop of things, right? One step away, two steps above. These are, like, the sub-steps of what is needed to tackle a task. Right now, there is no protocol to define that, right? There is no syntax or way to define or create this dataset. So does it look like a language? Does it look like a protocol? Does it look like a sequence of steps? I think the jury is still out, but I think that's a great thought. Something at a higher level of abstraction has to exist to at least attempt, like, RL planning on top of these complex tasks.
Yeah, that makes a lot of sense. So the next person here, the next voted question is... Let me just get it up here. Do I understand correctly that Cody looks at the context of the repository for giving code predictions? Yeah, it looks at the context of your repository and a lot of other sources. Open context is a protocol which the team is developing, which is, again: to answer your question about code, I need not just rely on code repositories. I can look at your documentation, your roadmap, maybe Slack conversations. So it's not just about looking at your code repository context, but everywhere I can get information which helps you solve the task. And again, it's not just your repository. Imagine you're in an enterprise: 20,000 developers, 15,000 repositories, millions of lines of code. That multi-repo world is where this is gonna be really, really helpful. Why? Because that search is harder. Within a repository, you might know there's only a few files, you're aware of them, but in that 15,000 repository world, or in that five repository world, that multi-repo search, that's non-trivial.
13. Multi-Repo Search and Suitable Tasks
In a multi-repo environment, searching across 15,000 repositories is challenging. The most suitable tasks for AI agents to automate are determined by the availability of datasets.
And which common developer tasks do you find the most suitable for AI agents to automate? Great question. Yeah, I don't have a specific answer about which of these tasks are immediately automatable, but again, this is not a code domain expert answer, it's an ML domain expert answer: whichever dataset I can create first, that's the model I can build first.
14. Dataset Importance and Context Sizes
What matters is what goes into that dataset: a great Rust programmer knows the current failure cases and intricacies of the language. Context ranking is essential, and larger context windows are not practical for most companies due to increased latency and cost.
It's more about what is in that dataset. If you're a great Rust programmer, then you know where the current faults are. You know the intricacies of Rust. And that's what your contribution probably is, that, hey, I mean, let's find out the failure cases and get these models to do really well on those.
Okay. And would models with large context windows, like 1 million plus tokens with good needle-in-a-haystack performance, largely solve the context problem by just sending everything? Yeah, I think that's a great point. I mean, so again, the question is, hey, we're thinking about context ranking. It's not just code here. Everybody's thinking about context. And the Gemini models are at like 1 million tokens. Again, we saw that tradeoff between latencies and performance, right? So even if you throw in a million context items, it's gonna take a long time for the inference to happen. It's gonna cost you 100x more. It's gonna be far more latency. So again, if you're able to help the LLM focus the right attention, then a lot of the tasks get easier. So yes, I mean, larger context sizes are helpful, but they're not practical in the use cases which most of these companies are dealing with right now.