How AngelList Automates Tax Form Parsing
In 2021, AngelList imported its first fund from another fund services provider onto the platform. Since then, we’ve gone on to port many funds per week onto AngelList.
Apr 4, 2023 — 4 min read
Venture funds, like any financial vehicle, can be represented as a collection of events, transactions, and documents (legal, accounting, finance, tax). Every fund admin provider supports and tracks these vehicles in different ways, which makes porting a cross-team collaboration between operations, tax, finance, accounting, and engineering. This process involves converting the existing representation of the financial vehicle into one that our systems can readily use.
A single fund can have hundreds of tax documents associated with it: K-1s, Schedule Bs, Form 1042s, etc. To offer our all-in-one fund administration with tax services, we have to construct scalable ways to ingest and integrate these tax documents into our systems, verify this data at scale, and identify uncommon fact patterns.
Tax filings are semi-structured and include a labyrinth of vertical and horizontal lines, solid and dotted lines, boxes, X’s, columns, and rows. While these forms are fairly straightforward for the human eye to navigate, they are harder to decompose into a set of consistent machine-interpretable rules. Using a few tools and techniques, we were able to extract the relevant information from these forms and integrate it into our data pipelines.
We began by using AWS Textract to parse these tax documents, but found that Textract regularly failed to recognize the X’s in Yes/No columns. To diagnose the cause of the detection issues, we whited out the X’s, overlaid other letters in various fonts and sizes, and found that letters without sharp edges like C and S were picked up more frequently.
Additionally, Textract was unable to connect row-major line items to their column-major answers. So we integrated with Sensible to extract liberally formatted key-value pairs across various document configurations, all without building a document parser in-house.
However, Sensible is expensive, and sending it hundreds of pages per fund isn’t cost-effective. To cut costs, we added a regex processing step that prunes the partnership tax filing down to a few pages per fund. In the end, we used Sensible to parse the fund-level filings (a fixed number of pages per fund) and Textract to parse the partner-level filings like Schedule K-1s (a variable number of pages per fund, one per partner).
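To make the pruning idea concrete, here is a minimal sketch of a regex page filter. It assumes the text of each page has already been extracted into a list of strings. The patterns, the `prune_pages` helper, and the sample page text are all illustrative assumptions, not the rules used in our pipeline.

```python
import re

# Hypothetical patterns for pages worth forwarding to the more expensive parser.
# These are illustrative only; the real pruning rules are not described in this post.
KEEP_PATTERNS = [
    re.compile(r"\bSchedule\s+B\b", re.IGNORECASE),
    re.compile(r"\bForm\s+1042\b", re.IGNORECASE),
]


def prune_pages(page_texts, patterns=KEEP_PATTERNS):
    """Return the 0-based indices of pages whose text matches any pattern."""
    return [
        index
        for index, text in enumerate(page_texts)
        if any(pattern.search(text) for pattern in patterns)
    ]


# Made-up page text standing in for a multi-page tax packet.
pages = [
    "Form 1065 U.S. Return of Partnership Income",
    "Schedule B Other Information  Yes  No",
    "Statement 3: supporting detail for line 20",
    "Form 1042 Annual Withholding Tax Return",
]
kept = prune_pages(pages)
```

Only the pages listed in `kept` would then be sent on for key-value extraction, so the per-fund cost depends on the number of matching pages rather than on the size of the whole packet.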
Storing and integrating extracted data
To integrate parsed data into our systems, we relied on ops to upload hundreds of tax packets, and along the way we noticed export errors. To handle job requests at scale, we set up a standard combination of SQS, Lambda, and DynamoDB and split extraction jobs at the subsection level, which lets ops easily re-run parsing for a single subsection. When reviewing extracted data by hand, we noticed non-standard fact patterns and the occasional incorrect extraction. To catch suspicious values like these, we implemented a manual review step. Once our in-house tax professionals have completed a manual review of the incoming filings, we can roll a ported fund into our standard tax filing processes.
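As a rough illustration of splitting work at the subsection level, the sketch below builds one job message per subsection of a document. The message fields, the `build_subsection_jobs` helper, the page ranges, and the deterministic ID scheme are all assumptions made for this example. This post does not describe the actual message format, and the sketch deliberately stops short of calling any AWS API.

```python
import hashlib
import json


def build_subsection_jobs(fund_id, document_key, subsections):
    """Build one job message per subsection of a tax document.

    Each job ID is derived from the fund, document, and subsection name,
    so re-running a subsection yields the same ID and its result can
    overwrite the earlier one instead of creating a duplicate record.
    """
    jobs = []
    for name, (first_page, last_page) in subsections.items():
        raw = f"{fund_id}:{document_key}:{name}"
        job_id = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
        jobs.append(
            {
                "job_id": job_id,
                "fund_id": fund_id,
                "document_key": document_key,
                "subsection": name,
                "pages": [first_page, last_page],
            }
        )
    return jobs


# Hypothetical fund, document key, and subsection page ranges.
jobs = build_subsection_jobs(
    "fund-123",
    "packets/fund-123/return.pdf",
    {"schedule_b": (2, 3), "schedule_k": (4, 5)},
)
# JSON bodies of the kind a queue such as SQS could carry, one per subsection.
messages = [json.dumps(job) for job in jobs]
```

With jobs keyed this way, re-running a failed subsection means resubmitting a single message rather than reprocessing the whole packet.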
What’s Next
We’re currently revamping various other parts of our porting process to support the increased demand from funds that want to move onto our platform. If you want to work on challenging problems that contribute towards our mission of accelerating innovation, we’re hiring!