Powering Cody Coding Assistant Using LLMs


FAQ

Rishabh is the Head of AI at Sourcegraph.

Rishabh's talk covers four main themes: what coding assistants do, the stack and its problems (context, models, evaluation), the future of code AI, and complex task automation.

Cody offers features such as autocomplete, chatting about repositories, unit test generation, code explanation, code smell detection, security and vulnerability detection, and custom commands.

The main problems faced by coding assistants include understanding and retrieving the right context, model performance and latency trade-offs, and evaluating the effectiveness of these systems.

The "big code" problem refers to the challenge of understanding and searching through large codebases, such as those in a Fortune 500 company, which may have millions of lines of code and thousands of repositories.

Cody handles context by fetching relevant pieces of code from the current repository, other repositories in the organization, and potentially all code the user has access to, to help generate accurate code completions and answers.

Component-specific evaluations focus on individual parts of the system, such as context retrieval or model performance, while end-to-end evaluations assess the overall effectiveness of the coding assistant in helping the user.

Cody uses smaller machine learning models for tasks requiring low latency, such as code completions, to ensure responses are generated within 300-400 milliseconds.

The future direction for code AI includes moving towards agent-led frameworks and eventually achieving full autonomy in coding tasks. This involves breaking down complex tasks into subtasks, planning, and reinforcement learning.

Cody is a coding assistant that lives in your browser or integrated development environment (IDE) like VSCode, JetBrains, or Eclipse. It helps developers be more productive by offering features like autocomplete, chat about repositories, unit test generation, code explanation, and code smell detection.

Rishabh Mehrotra
29 min
15 Jun, 2024

Video Summary and Transcription
This Talk explores the world of coding assistants powered by language models (LLMs) and their use cases in software development. It delves into challenges such as understanding big code and developing models for context in LLMs. The importance of ranking and code context is discussed, along with the use of weak supervision signals and fine-tuning models for code completion. The Talk also touches on the evaluation of models and the future trends in code AI, including automation and the role of tasks, programming languages, and code context.

1. Introduction to Coding Assistants

Short description:

Today, we're going to talk about what kind of ML problems we face, the components of the system, and how to evaluate and consider the context. We'll also discuss the stack, problems in the context, models, evaluation, and nuances. Finally, we'll summarize the current state of code AI and explore the future, as well as touch on complex task automation.

Cool. Good evening, everyone. Thanks for coming back. Thanks for sticking around. It's like a sparse room. Hopefully people chime in.

So my name is Rishabh. This introduction was dated because I was working on personalization, and now I'm working on generative ML. I'm the Head of AI at Sourcegraph.

Just a quick raise of hands, how many of you are aware of Cody? A few? Okay. Nice. How many of you are using any coding assistant, including Copilot or Cody? A few. Nice, great. So, yes, today we're going to talk a bunch about what kind of ML problems do we face, high level, what are the different components of the system and how do we evaluate this, how do we think about context around this, essentially.

A quick background. I grew up in India. I was doing my PhD in search and recommendations. I was at Spotify looking at music recommendations, multi-objective decisioning, and recently I'm working on code ML. So I come in from a machine learning background and applying it to media recommendations and now coding as a domain.

So today we're going to talk about four main themes. Five minutes each, approximately. So we're going to start by talking about overall what coding assistants do. This is similar to what Cody and Copilot and others do, but I'm going to mention one or two nuances about these problems. We're going to spend most of the time talking about what the stack looks like, what the problems are in the context, in the models, in the evaluation, and what the nuances are over there. Towards the end, in the last five to eight minutes, I'm going to summarize where we are headed. What do the levels of code AI look like? Where are we right now, what we think could be coming in the future, and then overall my favourite topic, complex task automation, and just sharing some thoughts about that.

Cool. So if you start with Cody, then Cody is a coding assistant. It lives in your browser, in your IDE. So if you're looking at VSCode, JetBrains, Eclipse, any of these IDEs you're working with, it's a coding assistant which helps you be a more productive developer. And specifically what that means is we have features like autocomplete, which says that if you're writing code, then we can help you write code faster.

2. Coding Assistants: Use Cases and Problem Areas

Short description:

You can code faster using LLMs by utilizing completion, chat, unit test generation, explain code, code smells, security and vulnerability detection, and custom commands. These are the use cases people are currently using Cody for. We'll discuss the overlapping problems and three main areas: context, models, and evaluation.

So you can code faster using LLMs. An example here would be that, hey, I'm writing code, but I'm writing code in a big repository. So there's a context about what I'm trying to do. And it's not just about this file; there's some other context in some other part of the repository, or maybe some other repositories as well. So work out the correct completion and suggest it. And when you're suggesting, you can make a single-line completion or a multi-line completion as well. And then the latencies and all kind of factor in over there. So this is what completion looks like.

You can also chat about repositories, about your code. Imagine you're a new developer on a team and you want to understand what this code base is about. So then you'll go in and say, hey, tell me what this is about. Or if I want to do this, just chat about your code base, essentially. You can also start looking at unit test generation. And the problem would be that I select a piece of code and then I can go to Cody and say, can you please generate a unit test for this? You can also look at explain code. And you can say just get up to speed. Similar to chat, but a bit more explanation centric. You can also look at code smells. So code smells is: find me ways in which I can optimize this code a bit more. You can also extend it to security and vulnerability detection and a bunch of those. And the list goes on. You can create custom commands as well, which means that if there is something which you do ten times in a day, might as well create a custom command and let Cody do that for you. So these are the use cases which people are using Cody for right now. Now the question then becomes, what does it take to build this? And what are we talking about? And part of this, 70% of this, is overlapping not just with coding assistants, but also any copilot you see out there in the world. Even if it's a sales copilot, marketing copilot, risk analyst, all of these teams are trying to develop agents to help people across the industry. So most of these problems are pretty common over there as well. So we're going to talk about three problems here. The context, the models, and evaluation. And all of these are super important. And this is not the first time we are seeing it.

3. Context, Models, and Evaluation in LLMs

Short description:

At Spotify, we faced the same problem of understanding the user's context and their preferences for music recommendations. There the models were ranking and recommendation models; here the models are LLMs. The evaluation questions are similar to those of other recommendation systems. The challenges we face in LLM apps are not new, as we have been applying machine learning to different domains for the past 15-20 years.

I was developing recommendations at Spotify. At Spotify we had the same problem. I don't know what you want. I want to understand your context. I want to find out which type of music you want to listen to. So the context sources over there are the kind of music and artists you want to listen to. The models we have here are LLMs; the models recommendation people had were ranking models, recommendation models. Evaluation is similar: the kind of problems we are thinking about here are the kind of problems recommendation systems were evaluated on. So the point is, the problems we are facing in LLM apps right now, in applied machine learning, are not new. We have done this for the last 15, 20 years across search, across recommendations, and now we are trying to see it in this new application domain.

4. Understanding Big Code and Context in LLMs

Short description:

Developers face the challenge of understanding big code, especially in code AI tools. The problem arises with repositories containing millions of lines of code, making it difficult to determine relevance. A query is used to find relevant code snippets, leveraging the universe of code available. However, determining relevance is a complex problem. The LLM industry is currently working on reducing hallucinations by identifying the correct context. For example, when adding tags to a completion dataset, the LLM uses context items that may or may not be correct. Solving this problem will require significant industry effort in the coming years.

So let's talk about context first. The part about context, especially in a code AI tool is that we all face this problem of big code. And what big code means is imagine you are a developer at a Fortune 500 bank in the U.S., right? You have 20,000 developers in the company, you have like 15,000 repositories, millions and millions of lines of code. Some of this was written 20 years ago. So how do you even do search and how do you know what is relevant and what is not?

So that is the big code problem. It is different when I'm like, hey, write me binary search in Rust; then there is no need to understand repository context. But if I'm a developer in a big company, I need to understand what is out there, what kind of APIs I use, what kind of styling guidelines I use, and that needs a lot of code search. So this is where we come on to, like, hey, given a query (a query could be a chat query or an autocomplete query), I'm trying to do something, right? If I'm trying to do something, then find the relevant pieces of code in the universe for me, right? So, as a developer: look at what I'm doing in my IDE, look at the current repository, look at all the repositories in my organization, and if you feel like it, look at all the code ever written that I have access to. I don't care. I just want you to help me do this job better, right?

So just find whatever useful and relevant code snippets there are, use them, send them to the LLM and get a correct answer. The question here becomes: what is relevant? Now, this is one of the hardest problems which everybody working in the LLM industry is facing right now. This is what helps you reduce hallucinations. If you get the right context, the LLM will give the right answer, but then you don't know what the right context is. So this is where I'll give you an example, right? The question over here is, hey, how do you add some tags to the completion dataset I'm creating? So I'm in a repository, I fire up Cody, ask this question. It says, hey, based on this, here is what I think is the right answer. If you zoom in, then it picked up eight of these context items, right? And this is the context which we are fetching. I'll talk about what these are and where these are coming in from. But the point is, you get these context items, the LLM uses them. Now, how do I know whether each of these eight is correct or not? That is a hard problem. I think this is where the industry is going to spend the next three to four years, solving this problem regardless of the domain you're working on, whether you have a set of documents, a set of repositories, a set of videos, whatever it is across the domains you're working on. Context, RAG (retrieval-augmented generation), means bringing in relevant context. You bring in relevant context but I don't know if it's correct or not. Why? Because, yeah, I mean, maybe it's useful for this query, maybe it's not useful, maybe it's adding noise, maybe the LLM just gets distracted, or maybe this has the exact answer which the LLM needs to make the right generation for you. But then this is hard to do in the wild. Let's take a step back.
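To make that flow concrete, here is a minimal sketch of the retrieval-augmented pattern just described: gather candidate snippets for the query, keep the top few, and prepend them to the question before calling the model. The `Snippet` type, the retriever and LLM callables, and the choice of eight context items are illustrative assumptions, not Cody's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Snippet:
    text: str
    score: float  # relevance score assigned by the retriever

def answer_with_context(
    query: str,
    retrieve: Callable[[str], list[Snippet]],  # hypothetical code-search function
    llm: Callable[[str], str],                 # hypothetical LLM call
    top_k: int = 8,                            # e.g. the eight context items above
) -> str:
    # Fetch candidate snippets for the query and keep the most relevant ones.
    context = sorted(retrieve(query), key=lambda s: s.score, reverse=True)[:top_k]
    # Assemble the prompt: retrieved code first, then the user's question.
    prompt = "\n\n".join(s.text for s in context) + f"\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)
```

The open question the talk raises is exactly the part this sketch glosses over: whether each of those top-k snippets was actually the right context to send.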

5. Developing Models for Context in LLMs

Short description:

In the context industry, there is a lack of feedback loop when using LLMs. Code snippets are sent to the LLM without being shown to the user, hindering model training and evaluation. Developing models that incorporate context is a multi-stage process involving multiple context sources and optimizing for recall.

My point is we've done this before in recommendations. But I can't reuse that. In recommendations, if you look at Spotify, right, how many of you use Spotify? Great. Of course. So, at Spotify, I would recommend music to you, right? I might bring in local musicians; if you're in Amsterdam, I might bring up some Dutch artists, I might bring up some popular or niche music, all of these, right? Each of them is a context source, each of them is providing me a set of artists which I can potentially recommend to you. If I pick up an artist and show it to you, you either skip the song, you like the song, or you save the song. That is feedback which I get. Then I can go back and say, hey, you picked up this artist, the user saved that track to a playlist and streamed it later. Great. Good job. You got a plus one. So we have the feedback loop in the search and recommendation system. Why? Because we pick up the item, we show it to the user, the user likes it or doesn't like it, interacts, skips, and I can train my model. As an ML engineer, I'm happy. Here the problem is I never show these items to the user. I just send them to the LLM. So I never get the feedback loop. This is the biggest problem which people in the context industry are facing right now. Why? Because you basically ask a question, you bring in some code snippets, you never show them to the user. You send them to the LLM, the LLM just generates an answer and that's it. So we don't ever get this label, and that's why I can't train a model and I can't evaluate how my context sources are doing. And this is a problem which coding assistants, any LLM based application, are facing right now. And the problem over here is: how do I develop models which can bring in context? And the proposal over here is that it's a multi-stage process. You have many, many context sources, right? Imagine Spotify and Netflix. If you look at TikTok, it will bring in short videos about some trend happening in the Netherlands right now, some popular event happening. I know you follow these creators, so I'm going to bring in some videos from those creators. It's going to bring in many, many context sources and then it's going to pick up the top five useful to you and then show that. Now, this is a two-step process. You bring in some items, wherein you're optimizing for recall. Recall means that if there is a useful thing, am I bringing it or not? So now you have three, four candidate sources, you're bringing in many, many items and making sure you don't miss out on any relevant information.
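As a concrete illustration of that recall objective for the candidate-generation stage, here is a small sketch assuming we have some (possibly weak) labels saying which snippets actually turned out to be useful; the function name and the example data are made up for illustration.

```python
# Recall@k for a candidate-generation stage: of the items we know were useful,
# how many made it into the top-k pool handed to the ranker?

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of truly relevant items that appear in the top-k candidates."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for item in retrieved_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)

# Example: 3 of the 4 relevant snippets were surfaced among the top 50 candidates.
print(recall_at_k(["a", "b", "c", "x"] + ["y"] * 46, {"a", "b", "c", "d"}, k=50))  # 0.75
```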

6. Ranking and Code Context in LLMs

Short description:

The second step involves the ranker selecting the best items from the candidate sources and sending them to the LLM. Different techniques like keyword search, vector embeddings, and analyzing local editor files help in bringing relevant code snippets. The challenge lies in ranking the items to pick the top ones for the LLM. This is why using GPT or ChatGPT can't provide accurate answers without code context.

The second step of the ranker, ranker says that, hey, great, you guys have brought enough items, let me select the best ones and pick it in the top five and send it to the LLM. So that's the two-step process. You have many, many candidate sources which are going to bring in items, they are optimized for recall and then the ranker comes in, selects the best and then sends it to the LLM. That's the two-step process.

Now, what exactly does that mean in the code AI world, right? And this is what it means. I mean, you do basic keyword search like BM25 or Zoekt indexing or any of the other keyword search you have. So that is like exact match, right? Or keyword match or some query reformulation and all. You can do vector embeddings. You can embed the code, you can embed documentation of the code, you can embed a bunch of different things. And then you embed the query. And then you have the query, you have the code embedded, and then you can do a similarity matching and bring in some relevant items. You can also look at local editor files: what kind of files were you editing right now, recently? So there's a lot, and there's a large list of things which you should be looking at. All of these are potentially relevant pieces of code which I should be bringing in to help you answer this question. Right?

So they all bring in some items. But the question becomes how do I rank them? Because you can bring in... I mean, imagine each of them bringing in 50 or 100 items. Some of them will be overlapping, some of them not overlapping. I have 150, 250 items in total. How do I pick up the top 10 and get them to the LLM? That's the ranking problem. And once you do this ranking, then you can create this entire beautiful prompt and send it to the LLM. Now, this is exactly the difference between using GPT, ChatGPT, or any of the copilots. Because ChatGPT doesn't know about your code context, it can't bring in the relevant code snippets. It can't give you the right answer because it just doesn't know the context. And here the hope is that you have developed these systems which can really understand context. Now, this is the main problem I mentioned: the labels. I mean, in music, in short videos, if you open up Netflix, you either watch the episode or you move on. That is labeled data. I can train models. Here, I don't have labeled data.
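Putting the two steps together, a rough sketch of that retrieve-then-rank pipeline might look like the following; the source names, the `Candidate` type, and the deduplication by file path are simplifying assumptions rather than how Cody actually fuses its retrievers.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Candidate:
    source: str   # "bm25", "embeddings", "recent_files", ...
    path: str
    text: str
    score: float  # score assigned by the retriever that produced it

def rank_context(
    query: str,
    sources: dict[str, Callable[[str], list[Candidate]]],  # recall-oriented retrievers
    ranker: Callable[[str, Candidate], float],              # second-stage scoring model
    top_k: int = 10,
) -> list[Candidate]:
    # Stage 1: every candidate source contributes items; overlaps are de-duplicated.
    pool: dict[str, Candidate] = {}
    for _, retrieve in sources.items():
        for cand in retrieve(query):
            pool.setdefault(cand.path, cand)
    # Stage 2: one ranker scores the merged pool and keeps the best top_k for the prompt.
    return sorted(pool.values(), key=lambda c: ranker(query, c), reverse=True)[:top_k]
```

The hard part, as the next section explains, is where the training signal for that second-stage ranker comes from.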

7. Weak Supervision Signals and Heterogeneous Ranking

Short description:

We rely on weak supervision signals and train on almost correct, synthetic data. The ranker not only has to rank code but also heterogeneous content from various sources. By bringing in not just code but other context sources as well, we can improve the accuracy of answering questions and completing code. Different ML models are available, and you can choose the LLM based on your requirements and latency tradeoffs.

So then we rely on weak supervision signals. There's an entire area of machine learning which is focused on how do we... I don't have labels. Let's cook up labels. Let's get synthetic data. These are like weak supervision signals. They're not exactly correct, but almost correct. Let's train on them. So that's one.
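One way to cook up such weak labels, purely as an illustration and not the exact signal the talk describes, is to call a retrieved snippet useful when the completion the user ended up accepting shares enough identifiers with it:

```python
import re

def weak_label(snippet: str, accepted_completion: str | None, min_overlap: int = 3) -> int:
    """Noisy, almost-correct label: 1 if the snippet looks like it helped, else 0.

    A snippet counts as 'useful' when the completion the user accepted shares
    at least `min_overlap` identifiers with it. Good enough to train a ranker on.
    """
    if not accepted_completion:            # nothing accepted -> no positive signal
        return 0
    def idents(s: str) -> set[str]:
        return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]{2,}", s))
    return int(len(idents(snippet) & idents(accepted_completion)) >= min_overlap)
```

The heuristic is wrong on individual examples but roughly right in aggregate, which is the whole point of weak supervision.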

The other problem is I don't just have to rank code. Because in a world of... If you're in an enterprise, you're just not looking at code. You're looking at Jira issues, documentation, Slack messages, Linear issues, and all the other roadmaps you've created, right? So now my ranker doesn't only have to rank code. It has to rank heterogeneous content. Again, we have seen this problem before.

If you look at TikTok, if you look at WeChat in China, WeChat is ranking apps and short videos and images and long videos and news items all in the same feed. Your Instagram page, right? You go to Instagram, Instagram has a mix of images, short videos, long videos, all of it combined. You go to Spotify, it's ranking podcasts, music, all of it together. In the industry we have tackled heterogeneous ranking before. We have just not done it for the LLMs, right? So there's a lot of commonality we can bring across. What that means is I can now start looking at developing relevance functions and ranking modules for other context sources as well. So this is where you start bringing in not just code but a bunch of other context sources, all so that I have enough information to answer your question better. And the task may not be chat; it could be writing a better unit test for you, explaining your code better, or even completing your code better, right? So this is what context is about, right?

Quickly going into models. Now we have a lot of ML models, right? Available so far. And Cody also provides you an option. And this is one of the differences between some of the other coding assistants versus Cody. You can choose which LLM you want based on your choice, based on latency and all the other requirements. The point is each of these features is demanding different latency tradeoffs. When you're completing code, you're not going to wait for two seconds. You need it in 300, 400 milliseconds.
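A toy sketch of that routing decision follows, with made-up model names, latencies, and quality numbers (the talk's own numbers are explicitly dummies too); it is not Cody's actual configuration.

```python
LATENCY_BUDGET_MS = {
    "autocomplete": 400,   # must feel instant while typing
    "chat": 2000,          # users tolerate a short wait for a better answer
    "unit_test": 5000,     # quality matters more than speed
}

MODELS = [
    # (name, typical latency in ms, rough quality score): dummy numbers
    ("small-code-model", 250, 0.70),
    ("medium-chat-model", 1200, 0.85),
    ("large-frontier-model", 4000, 0.95),
]

def pick_model(feature: str) -> str:
    budget = LATENCY_BUDGET_MS[feature]
    eligible = [m for m in MODELS if m[1] <= budget]
    # Within the latency budget, prefer the highest-quality model.
    return max(eligible, key=lambda m: m[2])[0]

print(pick_model("autocomplete"))  # small-code-model
print(pick_model("unit_test"))     # large-frontier-model
```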

8. Small Models and Fine-Tuning for Code Completion

Short description:

For different use cases, we need models with different performance and latency tradeoffs. Some smaller models perform better on latency, while larger models are preferred for other cases. We fine-tuned smaller models specifically for code completion by considering context information and using fill in the middle training. This has resulted in significant improvements in code completion. We also observe differences across languages in code metrics.

So there I have to go to small models. I'm not waiting for a 70-billion-parameter Code Llama; it's going to take too long. But then if I'm looking at agent or chat, I'm okay to wait for maybe 800 milliseconds, maybe 1.2 seconds, because: get it right. Take your time. Take the entire night, but write the right code. So there's a... My point is these numbers are dummy. But you get the sense of this tradeoff. That, hey, for some use cases, I need faster models. For some, I need better models. Take your time.

And that means there are some smaller LLMs which are preferred for some features, and some larger LLMs which are preferred for others. But also there's a nice curve between performance and latency, right? You start seeing that, hey, some of these smaller models, they do much better on latency. It's not just, like, hey, what is the best metric overall in quality? No, I have to make this tradeoff accordingly. And the question then becomes... Can I pick up these small models and make them better? And that's what we did. We picked up some of these smaller models, fine-tuned them and made them more performant.

And to do that, it's not just, hey, fine-tune for everything. It is task-specific fine-tuning. What I mean by that is, if you're doing code, it's not just left-to-right completion. You have to use the prefix and the suffix. This is fill in the middle. Because if you're writing this code, right, you have some information above the code, below the code; you have to use all of that, and then fine-tune on this dataset. This is called fill-in-the-middle training. And this just makes code completion much better. I mean, we saw significant gains which we are rolling out to Cody users as we speak.
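For reference, a fill-in-the-middle training example is typically built by cutting a span out of a file and conditioning on both sides. The sentinel tokens below follow one common convention and vary by model family, so treat the exact format as an assumption rather than the recipe used here.

```python
def make_fim_example(file_text: str, start: int, end: int) -> dict[str, str]:
    """Mask the span [start, end) and train the model to reproduce it,
    conditioned on both the code above (prefix) and below (suffix)."""
    prefix, middle, suffix = file_text[:start], file_text[start:end], file_text[end:]
    prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
    return {"prompt": prompt, "completion": middle}

# The model learns to produce "return a + b" given the code before and after it.
example = make_fim_example("def add(a, b):\n    return a + b\n", start=19, end=31)
```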

Finally, we see the differences across languages as well. I mean, on the Y axis, you have some code metrics. On the X axis, you have languages.

9. Fine-Tuning Models and Evaluation

Short description:

Some languages are not represented in the pre-training datasets, so we performed language-specific fine-tuning, resulting in over 20% performance improvement for Rust. Fine-tuning models for different tasks and latency requirements is possible. Evaluation involves component-specific and end-to-end evaluations, considering context, ranking, and overall performance. Various evaluation needs include ensuring the right context and knowledge awareness. Both offline and online evaluations are crucial.

You see that some of the popular models are at 0.8 for Python. They're at 0.4, 0.5 for MATLAB and Rust. Same for open-source models as well. There's a big gap. Right? I mean, why? Because some of these languages are not represented in the pre-training datasets.

What do we do? Maybe we can do language-specific fine-tuning. And that's what we did. We said that, hey, I want to make a model better for Rust. The default models are just not good enough on Rust. Let's fine-tune them. And we saw above 20% improvement in performance once you start fine-tuning for Rust only. And we wrote a blog post about this, posted a couple of days ago. So again, the point is, in the model land, right, you can start looking at fine-tuning models for tasks, for different latency requirements and the use cases.

Finally, evaluation, right? I'm going to rush through some of this. But in terms of evaluation, you're looking at component and end-to-end evals, right? Again, this is the entire stack for most of the LLM apps out there, not just coding assistants. You have the user query, you're doing some processing, you're fetching context, you're ranking context, sending it to the LLM, getting the response, doing some preprocessing and postprocessing and showing it to the user. Now, I can do component evaluation. How well is my specific context source working? Or, how well am I ranking? Maybe the LLM prompts are not good; maybe let's do some optimization there. Component-specific evaluation. And then I can look at the overall picture. How is my context engine doing? This is end-to-end, overall context engine evaluation. Or how well is Cody doing overall? Is it helping you better? Is it not? So my point is, there's a lot of component-specific and end-to-end evaluations that you have to do. Now, the evaluation needs are various, right? I need to understand if you're bringing in the right context, if you're retrieving the right items, and whether the LLM is generally aware of the relevant knowledge. Maybe there's a large-scale data library, and when some model was trained, it wasn't even aware of this library. So it's not aware of the general know-how. That's an evaluation too. There's a bunch of evaluation I have to do, right? The key point is, I have to do it both offline and online.
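To make the two layers concrete, here is a minimal sketch of one component-level metric (how early the ranked context surfaces a known-relevant item) next to one end-to-end product metric (completion acceptance rate); the event schema and file names are assumed for illustration.

```python
def context_mrr(ranked_items: list[str], relevant: set[str]) -> float:
    """Component eval: reciprocal rank of the first relevant context item."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def completion_acceptance_rate(events: list[dict]) -> float:
    """End-to-end eval: of the completions shown, how many did the user accept?"""
    shown = [e for e in events if e.get("shown")]
    return sum(bool(e.get("accepted")) for e in shown) / max(len(shown), 1)

print(context_mrr(["utils.py", "auth.py", "tags.py"], {"tags.py"}))         # ~0.33
print(completion_acceptance_rate([{"shown": True, "accepted": True},
                                  {"shown": True, "accepted": False}]))     # 0.5
```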

10. Offline Evaluation and Future Trends

Short description:

Offline evaluation is crucial for determining the effectiveness and worthiness of new models, prompts, and contexts. It allows for more iterations and saves time compared to relying solely on A-B tests. However, offline-online correlation is essential to ensure the accuracy and reliability of the evaluation. The entire recommendations and search industry has dedicated evaluation researchers for this purpose. Looking ahead, code AI is divided into three categories: human-initiated, AI-initiated, and AI-led, each with increasing levels of autonomy.

By offline evaluation, I mean if I have a new model, new prompt, new context, I need to know, before I ship it to users in an A-B test, is it better? Is it worth the productionization effort or not? That's offline evaluation. Online is: I roll it out in an A-B test, was it useful for my users or not? Completion acceptance rate. I showed you a completion, did you accept it? If you accepted it, did you remove it five minutes later, or did it persist as well?

The question then becomes, in terms of offline iteration, if I don't have an offline evaluation, everything has to be A-B tested. You're only going to do a few of them, like five of them in a month. If you have an offline evaluation, you can do 10X, 100X more. So that's the value of offline evaluation. But for that, you need to make sure that you have offline-online correlation. Again, this is not new. The entire recommendations and search industry at Google, YouTube, Netflix, Spotify, they have an army of evaluation researchers who focus on offline-online correlation. Why? Because if your offline evaluation says model B is better than model A by 5%, and you try it online and it turns out, no, model A is better than model B, your offline evaluation is crappy, you should throw it away. I can't trust it. So it has to be directionally correct. If you say this model is better, and I try it in an online test, it has to be better. And it has to be sensitive. If it says 5% movement, I should get 4% to 6% movement. If it says it's going to make it better by 5%, and online it only makes it better by 0.1%, then it's also useless. So that's the entire point of offline-online correlation.
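A minimal sketch of how you might check those two properties over a history of past experiments, where each pair is the offline-predicted lift and the online-measured lift for one model change (numbers are invented for illustration):

```python
def directional_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of experiments where offline and online agree on the sign of the lift."""
    return sum((off > 0) == (on > 0) for off, on in pairs) / len(pairs)

def sensitivity_ratio(pairs: list[tuple[float, float]]) -> float:
    """Average online lift per unit of predicted offline lift (ideally close to 1)."""
    ratios = [on / off for off, on in pairs if off != 0]
    return sum(ratios) / len(ratios)

history = [(0.05, 0.045), (0.02, 0.018), (-0.01, -0.012)]  # (offline, online) lifts
print(directional_accuracy(history))  # 1.0: offline is directionally correct
print(sensitivity_ratio(history))     # ~1.0: offline deltas track online magnitudes
```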

In the last minute, I want to focus on where we are headed. This is what's going on right now. Again, I'm pretty sure this is not just coding assistants. This is not just GitHub or Cody or Sourcegraph. I think these are common problems across a bunch of LLM-based applications right now. Because they all have to iterate on context, evaluation, ranking, LLM fine-tuning, all of this. But where are we headed? In terms of code AI, Quinn, our CEO, wrote a very nice blog post, which is over here, dividing the lens of code AI into three buckets. One is human-initiated, which we crossed maybe one year ago, where we went to code creation and code assistants, and then AI-initiated. Right now we're starting to see these agentic frameworks getting set up, wherein you can have the ML model try to do something for you. You step in, you say, no, this is not correct, let me iterate, and then do it. And then finally, AI-led, which is full autonomy.

11. Agentic Approach and Feedback Loop

Short description:

We're currently somewhere between three and four in terms of the agentic approach to complex tasks and code scenarios. Reinforcement learning, planning, task decomposition, and feedback loops are necessary for automation. However, the lack of a feedback loop hinders progress in code development. Once we establish a feedback loop, we can gradually automate tasks. Coding assistants such as Cody help developers be more productive and address the contextual problems. Avoiding info bubbles in code assistants comes from focusing on task-driven applications rather than proactive recommendation systems.

We're probably far away from that. So right now, we're somewhere between three and four, depending on the kind of feature and the product you work with.

The last point is the thing which I love the most. My PhD topic was complex tasks. So this is where I think, again, if you have to take an agentic approach to some of these code and non-code scenarios, you have to have the ability to break down complex tasks into some planning, some subtasks, attempting some of those tasks, seeing whether they're working or not, getting that feedback loop, and then wrapping it all up together.

So this is a very, very hand-wavy way of saying we need reinforcement learning and planning and task decomposition and attempting each of these subtasks, getting that feedback loop in place, and then training it end-to-end. Now this is where, again, speaking as a machine learning engineer who hasn't belonged to the code domain for the last ten years: I don't have a feedback loop.

There are very few build systems which I can use to get feedback on whether this code works better or not. Imagine if you're at any of the big banks, right? I mean, your repository is huge, millions of lines of code. I cannot build it. If I generate a unit test, it takes me hours to build it, and probably some developers in your company are unable to build it within a few minutes either. The point is you generate these things, and the feedback loop is missing. So I think that's my entire call: look, once we are able to get that feedback loop in, then we can start automating some of these tasks, one step at a time. So that's where we are.
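In the same hand-wavy spirit, here is a sketch of what that loop could look like once a feedback signal (a build, a test run) exists; every callable here is a placeholder, not an existing agent framework or Cody feature.

```python
from typing import Callable

def run_task(
    task: str,
    plan: Callable[[str], list[str]],   # decompose the complex task into subtasks
    attempt: Callable[[str], str],      # e.g. ask an LLM to propose an edit
    feedback: Callable[[str], bool],    # e.g. does the build / do the tests pass?
    max_retries: int = 2,
) -> bool:
    for subtask in plan(task):
        for _ in range(1 + max_retries):
            result = attempt(subtask)
            if feedback(result):        # the feedback loop the talk says is missing
                break                   # subtask verified, move on
        else:
            return False                # a subtask never passed its check
    return True                         # every subtask verified end to end
```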

I'll wrap up. Looking at coding assistants, especially Cody: it helps you be a better developer. We looked at some problems in the context, the models, the evaluation, some of the wins we have gotten from fine-tuning some of these things, and we thought about where we are headed in terms of the agentic workflow, which is how do we look at complex tasks and how do we start automating these one task at a time? Yeah. That's it. All right.

So someone asked: RAG systems have the problem of creating an info bubble. So how do you make sure you avoid that for code assistants, so that they do not fixate too much on certain sources? Yeah. That's a good question. I think the question is, like, recommendation systems create a filter bubble. Echo chambers: if you like something, I'll keep on bubbling it up for you. The difference between LLM applications and recommendation systems is that recommendations are, like, very proactive. They're, like, hey, you come to the feed, you come to the home page, you don't have to ask a query, I'll show you what you want, right? But in the LLMs, most of the applications are task driven. You don't go to ChatGPT and it gives you a feed. No, that feed doesn't exist. You have to have a task which the LLM is trying to help you with.

12. Tasks, Programming Language, and Code Context

Short description:

The amount of influencing LLMs can do is limited because of the presence of tasks. Currently, there is no feed around LLMs, unlike social media platforms. LLM applications help users with the task at hand rather than proactively pushing content, which is what keeps them from creating filter bubbles. The idea of a programming language designed for higher predictability and abstraction is intriguing, as it could provide high-level steps for tackling complex tasks. However, how to define such a language or protocol is yet to be determined. Code prediction in Cody is not limited to code repository context but extends to other sources like documentation and conversations, making it valuable in multi-repository environments with thousands of developers and millions of lines of code.

So because there's a task, the amount of influencing an LLM can do is limited. At least right now, right? I mean, right now, around the LLMs, there is no feed. And a feed means I can be lazy. I can just open TikTok and I'll get a stream, I'll just scroll, low effort, and I'll be in my filter bubble. Right now, at least, the setup is such that the copilots are not making you do that. They're, like, hey, what are you trying to do? I'll help you do it better. So I think that task orientation, that lack of proactive assistance, is what's not making these LLMs, like, be filter bubbly.

Okay, let's check what we have up next. Okay, so would you ever want a programming language designed for higher predictability with code assistants? That's a great question. I never thought about whether we need, like, a programming language. I think we can talk about abstractions, right? Look, to tackle a complex task, what I need is high-level steps, right? A sequence of steps which I would follow. Now, right now, the programming languages are, like, yeah, I mean, they're writing syntax for, like, low-level details, right? We have abstracted away from compilers to languages. But I think that's a great point, that maybe there is, like, something, especially as we start looking at the outer loop of things, right? One step away, two steps above. These are, like, sub-steps of what is needed to tackle a task. Right now, there is no protocol to define that, right? There is no syntax or way to define or create this dataset. So does it look like a language? Does it look like a protocol? Does it look like a sequence of steps? I think the jury is still out, but I think that's a great thought. Something at a higher level of abstraction has to exist to at least attempt, like, RL planning on top of these complex tasks.

Yeah, that makes a lot of sense. So the next person here, the next-voted question is... Let me just get it up here. Do I understand correctly that Cody looks at the context of the repository for giving code predictions? Yeah, it looks at the context of your repository and of a lot of other sources. Open context is a protocol which the team is developing, which is, again: to answer your question about code, I need not just rely on code repositories. I can look at your documentation, your roadmap, maybe Slack conversations. So it's not just about looking at your code repository context, but everywhere I can get information which helps you solve the task. And again, it's not just your repository. Imagine you're in an enterprise, 20,000 developers, 15,000 repositories, millions of lines of code. That multi-repo world is where this is gonna be really, really helpful. Why? Because that search is harder. Within a repository, you might know there are only a few files, you're aware of them, but in that 15,000 repository world, or in that five repository world, that multi-repo search, that's non-trivial.

13. Multi-Repo Search and Suitable Tasks

Short description:

In a multi-repo environment, searching across 15,000 repositories is challenging. The most suitable tasks for AI agents to automate are determined by the availability of datasets. Large context windows could help with the context problem, but they are not practical in current use cases.


And which common developer tasks do you find the most suitable for AI agents to automate? Great question. I... Yeah, I think I don't have a specific answer like which of these tasks are immediately automatable, but I think, again, this is not a code domain expert answer, but an ML domain expert answer: whichever dataset I can create first, that's the model I can build first.

Would models with large context windows, like 1 million plus tokens with good needle-in-a-haystack performance, largely solve the context problem by just sending everything? Yeah, I think that's a great point. I mean, so again, the question is, hey, we're thinking about context ranking. It's not just code here. Everybody's thinking about context. And the Gemini models are at like 1 million tokens. Again, we saw that tradeoff between latency and performance, right? So even if you throw in a million context items, it's gonna take a long time for the inference to happen. It's gonna cost you 100x more. It's gonna have far more latency. So again, if you're able to help the LLM focus the right attention, then a lot of the tasks get easier. So yes, I mean, larger context sizes are helpful, but they're not practical in the use cases which most of these companies are dealing with right now.

14. Dataset Importance

Short description:

What matters most is what is in the dataset. A great Rust programmer knows the language's current faults and intricacies, and that knowledge determines which failure cases the models should be trained on.

It's more about what is in that dataset. If you're a great Rust programmer, then you know where the current faults are. You know the intricacies of Rust. And that's what your contribution probably is, that, hey, I mean, let's find out the failure cases and get these models to do really well on those.

Okay. And would models with large content windows like 1 million plus tokens with good needle haystack performance largely solve the context problem by just sending everything? Yeah, I think that's a great point. I mean, so again, the question is, hey, we're thinking about context ranking. It's not just code here. Everybody's thinking about context.

So even if you throw in a million context items, it's gonna take a long time for the inference to happen. It's gonna cost you 100x more. It's gonna be far more latency. So again, if you're able to help the LLM focus the right attention, then a lot of the tasks get easier. So yes, I mean, larger context sizes are helpful, but they're not practical in the use cases which most of these companies are dealing with right now.
