OneContext Launches (Serverless) Today
This blog post makes liberal use of memes to discuss the problem we're solving. Apologies in advance if you don't like memes.
What's OneContext?
OneContext is a platform that lets machine learning engineers compose and deploy custom RAG pipelines on SOTA infrastructure without having to think about any devops.
You define the entire pipeline in YAML, and OneContext takes care of the infra (SSL certs, DNS, Kubernetes cluster, GPUs, autoscaling, load balancing, etc).
We give you an endpoint that you can hit with a single API request, and your entire pipeline is executed in the cloud, on a single cluster.
Why did we build this?
Because the reality is that most "AI products" are really just a bunch of sequential API requests, which (a) is terribly slow, (b) makes for a terrible user experience, and (c) limits the scope of what ML can do for the end-user.
We address these concerns below, and propose a solution.
Most AI in production today is just request routing
It seems all the rage to "inject AI into the product". Woohoo overnight we are an AI company! We've now 10x'd our valuation. Congrats guys.
The reality is that it's not that easy, and that's mainly down to one reason: the modern ML stack is basically just a web of API requests to third-party services.[1]
For example
Just take the simple example of retrieving the relevant context for a user's query at runtime in your app.
i.e. the user, a curious non-technical product manager, asks a question of a "helpful" chatbot on your text-to-SQL platform:
Please show me a list of all the customers who have spent more than $1000 in physical stores in the Bay Area in the last 30 days
Great. Seems "trivial". But serving this "trivial" request first involves furnishing the language model with the relevant context, e.g. which tables in the database are relevant, which columns in those tables are relevant, what are the datatypes of those columns, how often are those tables updated, etc.
Fortunately, you have some documentation provided by the data engineers, and you can introspect the schemas of the various DBs the company uses. Unfortunately, the "documentation" is a 200-page PDF that hasn't been updated since 2021, and the company actually stores its data in 4 different DBs, one of which sits behind a GraphQL API and one of which is MongoDB, and in the SQL DBs alone there are 84 tables that could potentially be relevant to the user's query. So, serving this request ends up looking something like this:
- Take the user's query and make a request to an embedding model (e.g. OpenAI) for an embedding of the text.
- Wait for the embedding to come back over the wire.
- Take that embedding and send it over the wire to a vector DB (e.g. Pinecone / Milvus) to find the most similar embeddings. (A smart implementation here would separately search for "most relevant table description" and "most relevant pages in the documentation", against two separate indices.)
- Wait for the results to come back over the wire.
- (Probably) filter and rank these results according to some business logic (funky regex).
- Again send the results over the wire to a ReRanker model (for example on Replicate or Hugging Face).
- Wait for the results to come back over the wire from the ReRanker model.
- Finally, send the results to the prompt model (GPT-4o or similar) to start generating the final response for the user.
- Wait TTFT (time to first token) seconds before streaming the prompt model's response back to the user.
This is a grand total of 7 API requests made by the backend before the prompt model even starts thinking about generating a response for the user. And, of course, all of this happens before you start doing anything with the result! You still need to actually execute that SQL query on the database(s), pray the SQL is actually valid[2], and hope the query doesn't take a long time.
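To make the round trips concrete, here is a minimal TypeScript sketch of that chain. Every URL, endpoint, and helper below is an illustrative placeholder, not a real provider API; the point is the shape of the code: each step awaits the previous one, so network latency stacks up.

```typescript
// Illustrative sketch of the chain described above. Every URL and helper
// is a placeholder, not a real provider API; the point is the shape:
// each step awaits the previous one, so network latency stacks up.

async function answerQuery(userQuery: string): Promise<Response> {
  // 1. Embed the user's query (one round trip to the embedding provider).
  const embedding: number[] = await postJson("https://embeddings.example.com/v1/embed", {
    input: userQuery,
  });

  // 2. Nearest-neighbour search against the vector DB (another round trip).
  //    A smarter implementation would hit two indices: table descriptions and docs.
  const candidates: any[] = await postJson("https://vectordb.example.com/v1/query", {
    vector: embedding,
    topK: 50,
  });

  // 3. Filter / rank locally according to business logic (no network hop).
  const filtered = candidates.filter((c) => c.score > 0.5);

  // 4. Re-rank the survivors with a hosted reranker (another round trip).
  const reranked: any[] = await postJson("https://reranker.example.com/v1/rerank", {
    query: userQuery,
    documents: filtered,
  });

  // 5. Hand the context to the prompt model and stream the answer back
  //    (one more round trip before the first token even arrives).
  return fetch("https://llm.example.com/v1/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: userQuery, context: reranked.slice(0, 5) }),
  });
}

// Tiny helper so the sketch stays readable.
async function postJson(url: string, body: unknown): Promise<any> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  return res.json();
}
```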
Latency kills the user experience
Even in the above (really vanilla) example of "cool AI enterprise product", the latency is a killer.
Speed is a feature. Faster sites lead to better user engagement, better user retention, and higher conversions.
"We are not accustomed to measuring our everyday encounters in milliseconds, but studies have shown that most of us will reliably report perceptible “lag” once a delay of over 100–200 milliseconds is introduced into the system. Once the 300 millisecond delay threshold is exceeded, the interaction is often reported as “sluggish,” and at the 1,000 milliseconds (1 second) barrier, many users have already performed a mental context switch while waiting for the response—anything from a daydream to thinking about the next urgent task.
-- Ilya Grigorik, High Performance Browser Networking. See footnotes for source 3.
The thing is, language models will get things wrong. And that's totally fine, because they have users to correct them. If the model gets it wrong but responds immediately, the user can have a back-and-forth dialogue with it, giving it more context and helping it get to the right answer. However, if the model takes 10 seconds to respond, only to come back with the wrong answer, your users think your product is useless.
Latency limits what you can do with ML
If the above example took 1 second start to finish, round trip, there would be loads of cool things you could do to improve your downstream performance! However, if it already takes you 10 seconds start to finish, then you have no room left to improve your pipeline.
Things you can do if you don't have a latency problem
Failure Loop
For example: if the model gets it wrong (i.e. the SQL it generated was invalid), you can just take the stack trace from the SQL executor and feed it back into the model, with some preamble along the lines of:
"The user's original query was _, the model generated the following SQL query _, and the SQL executor threw the following error _, try again".
You could also get more funky and add some agentic workflow logic...
Agentic Workflow
A pipeline deterministically follows one recipe of instructions, which is basically saying to the model:
Do A, then B, then C, then D, then E.
If you view our original pipeline as that kind of recipe, you could change this up and instead tell the model:
Hello model, you have a set of "actions" you can take. You must initially use your abilities in a specific order A --> B --> C; however, after that point, you are free to choose whichever action you like.
A concrete example of this would be to first deterministically go from:
A: Embed --> B: Retrieve --> C: Filter --> D: ReRanker --> E: Action-Chooser-Model.
Then, the Action-Chooser-Model is told:
You have this given context already _, now you can decide where else you want to get context from. You can query context from F (the documentation), G (the user), or H (past queries), or you can decide you already have enough context and go straight to the prompt model I.
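A minimal sketch of that idea; the action names and the chooseAction, fetchContext, and generateAnswer functions are all illustrative assumptions passed in as parameters, not a real API:

```typescript
// Sketch of the Action-Chooser step. The action names and the three
// injected functions are illustrative assumptions, not a real API.
type Action =
  | { kind: "query_documentation"; question: string }
  | { kind: "ask_user"; question: string }
  | { kind: "search_past_queries"; question: string }
  | { kind: "answer" };

async function agenticAnswer(
  initialContext: string[],
  chooseAction: (context: string[]) => Promise<Action>,
  fetchContext: (action: Action) => Promise<string[]>,
  generateAnswer: (context: string[]) => Promise<string>,
  maxSteps = 5,
): Promise<string> {
  let context = [...initialContext];

  for (let step = 0; step < maxSteps; step++) {
    const action = await chooseAction(context);
    if (action.kind === "answer") {
      // The chooser decided it already has enough context.
      return generateAnswer(context);
    }
    // Otherwise gather more context from the chosen source and loop again.
    context = context.concat(await fetchContext(action));
  }
  return generateAnswer(context);
}
```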
Agentic Workflow AND Failure Loop
You could of course also do both of the above. The DAG would get bigger: the prompt model would output a query, which would be fed into the SQL executor. If the query failed, the stack trace would be passed, alongside all the previous context, to an "Evidence Aggregator", which would select the most relevant information about the failure and pass it back to the Action-Chooser-Model to decide what to do (ask the user for help, or query other context sources).
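For completeness, here is a rough sketch of how the two ideas compose; again, every function here is a hypothetical stand-in passed in as a parameter, not part of any real API:

```typescript
// Sketch of the combined loop: the chooser can gather more context or
// commit to generating SQL; a failed execution is distilled by an
// "evidence aggregator" and fed back in. All functions are hypothetical.
async function agenticAnswerWithRecovery(
  context: string[],
  chooseAction: (context: string[]) => Promise<"gather_more" | "generate_sql">,
  gatherMoreContext: (context: string[]) => Promise<string[]>,
  generateSql: (context: string[]) => Promise<string>,
  executeSql: (sql: string) => Promise<unknown[]>,
  aggregateEvidence: (sql: string, error: unknown) => Promise<string>,
  maxSteps = 5,
): Promise<unknown[]> {
  for (let step = 0; step < maxSteps; step++) {
    if ((await chooseAction(context)) === "gather_more") {
      context = context.concat(await gatherMoreContext(context));
      continue;
    }
    const sql = await generateSql(context);
    try {
      return await executeSql(sql);
    } catch (err) {
      // Distill the stack trace plus prior context into "evidence" and
      // let the chooser decide what to do on the next iteration.
      context = context.concat(await aggregateEvidence(sql, err));
    }
  }
  throw new Error("Ran out of steps without a successful query.");
}
```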
But all of these approaches are impossible, unless you fix your latency problem!
All this would be really great, and probably result in a large improvement in your downstream model performance. However, you should not try any of these things unless your latency is already really low! There's no point making your product better if it's going to take 30 seconds to do anything for your users. So, first you need to get your baseline latency down to a reasonable level.
So, how do we fix the latency problem!?
Fixing the latency problem
It doesn't take a rocket scientist to figure out that the latency problem is principally caused by:
- (1) the fact that we are making 7 API requests, and
- (2) the fact that we are waiting for each one to come back before we can make the next one.
Unfortunately, we can't solve (2), i.e. we can't parallelise the requests, because each request is dependent on the previous one. However, we can solve (1), i.e. we can reduce the number of API requests we make.
The naïve solution to (1) is to turn up on Monday and tell your boss that, instead of relying on a different third-party service for each step of your RAG pipeline, you're going to build all of these constituent functions in-house.
It's pretty immediately clear this makes for a bad tym™. You're now responsible for building and maintaining a bunch of different services, each of which has its own devops, its own scaling problems, its own security problems, and its own monitoring problems. Whispering sweet nothings into a GPU cluster's ear is not fun.
So the second solution is to say,
"hey, surely someone is dumb enough to actually have attempted this already?".
And the answer to that question is: yes, that's us. That's exactly what OneContext is.
Enter OneContext
At OneContext, we run every single node of your pipeline in the same VPC. As far as latency between the steps goes, we've got it down to basically zero.
This means you get performance on par with running everything in-house, but without having to go through the pain of actually building all of this yourself.
You define your entire pipeline in one YAML file, and deploy it with one line of code. We take care of everything else.
Examples
TypeScript
```typescript
const queryPipeline: OneContext.PipelineCreateType =
  OneContext.PipelineCreateSchema.parse({
    API_KEY: API_KEY,
    pipelineName: queryPipelineName,
    pipelineYaml: "./query.yaml",
  });

await OneContext.createPipeline(queryPipeline);
```
CLI
```bash
onecli pipeline create --pipeline-name=query --pipeline-yaml=./query.yaml
```
What this unlocks
Speed
Crucially, speed. Below is a Gantt chart of the time taken to execute the basic pipeline. The top section is the implementation spread across N different services, and the bottom section is the OneContext implementation. The x-axis is milliseconds. Each blob is a "step". The red blobs of 400ms each denote the time it takes (on average) to make an HTTP request to/from a third-party server.
The first thing you notice is that the OneContext section has a lot less red waiting time, which is to be expected.
The second thing you notice is that neither section has a red blob at the end, and that's because in both implementations we assume that after TTFT (time-to-first-token) the prompt node starts streaming the response back to the user (makes for a better user experience). In any case, it's not material to the comparison.
The third thing you'll notice is that simply by cutting out the network latency between the steps, OneContext reduces the pipeline execution time by 57%. Even without changing anything else, this is already a far better user experience.
N.B. We haven't even begun to talk about how OneContext can also make the individual steps execute faster than third-party providers (we speed up a lot of the steps with Rust and use a finely tuned GPU cluster to execute them, so in many instances we can be faster than our single-service competitors). Here we are literally just comparing apples to apples and assuming no change in step execution time, i.e. we are only removing the red blobs; we haven't yet touched how we can also shrink the green blobs.
Downstream Performance
We've just halved the time taken to execute your RAG pipeline. You can now use those seconds you've saved!
You can now go and implement those things we discussed above, like agentic workflows, graceful failure-recovery loops, and a load of other methods that can significantly improve the downstream performance of your AI product, without any increase in latency over your baseline implementation!
Version Control
By defining your entire pipeline in a YAML file, it becomes a lot easier to roll back and forward through deployments.
We've worked with a couple of companies (as beta users of our platform) who had the execution graphs of their RAG pipelines implicitly defined through the logic in their codebase, but never explicitly defined anywhere. OneContext works by extracting the specific logic of your RAG pipeline into one declarative file, and then isolating all the infrastructure related to that.
By decoupling RAG (logic and infra) from the rest of your codebase, you can iterate through numerous variations of pipelines without worrying about breaking the rest of your application.
How can I try it out?
We just launched our serverless plan, so you can get started today with either our TypeScript SDK or our CLI tool.
You can sign up online and get 2,500 free credits or 1 week of service to hit our API (whichever runs out first). If you like our product, you can buy more credits on our website.
We also offer a "dedicated" plan, where we deploy a version of our cluster in your compute environment (so it's even faster and more secure). The dedicated plan also comes with some experimental features such as (a) evaluation, and (b) version control. If that's something you're interested in please reach out here. We are currently at capacity, but you will be added to a waitlist, and we've also just raised so our bandwidth should increase in the near future.
If you liked this blog and you want to stay in touch, let us know here!
Don't be a meme. Use OneContext!
Footnotes
[1]: Don't get triggered if you work at Anthropic or Meta or a hardcore ML startup. We know there are also loads of cool companies doing actual ML. The point here is that most companies building cool AI applications for enterprise find themselves drowning in API requests.
[2]: It is still not a given that the SQL will actually be correct. See more here.
[3]: High Performance Browser Networking, Ilya Grigorik. Buy it here.