Question / Claim
Do knowledge graphs meaningfully improve LLM numerical reasoning over financial documents?
Key Assumptions
- LLMs struggle with numerical and multi-hop reasoning when the relevant information exists only as unstructured text. (high confidence)
- Financial documents contain implicit structure (tables, periods, units) that is lost when they are flattened to plain text. (high confidence)
- A domain-specific schema can capture most of the financial facts needed for reasoning. (medium confidence)
- Most observed 'math errors' in document QA come from selecting or mixing up the wrong numbers, not from arithmetic mistakes. (high confidence)
- Flattening tables and semi-structured data into text destroys constraints that LLMs do not reliably reconstruct; the sketch after this list makes this concrete. (high confidence)
- Even larger models will continue to make grounding and attribution errors without an explicit structured representation. (medium confidence)
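To make the structure-loss assumption concrete, here is a minimal Python sketch contrasting a table flattened to text with the same facts as typed records. The company, figures, and field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical example: the same income-statement lines represented two ways.
# Flattened to text, the period and unit constraints live only in word order,
# and the model must re-infer them at answer time.
flattened = "Revenue 12,480 11,020 Cost of revenue 7,310 6,450 (in millions) 2023 2022"

# Typed records keep the constraints explicit: every number carries its
# metric, period, and unit, so "which 12,480?" is never ambiguous.
@dataclass(frozen=True)
class FinancialFact:
    metric: str
    period: str
    value: float
    unit: str

facts = [
    FinancialFact("revenue", "FY2023", 12_480, "USD millions"),
    FinancialFact("revenue", "FY2022", 11_020, "USD millions"),
    FinancialFact("cost_of_revenue", "FY2023", 7_310, "USD millions"),
    FinancialFact("cost_of_revenue", "FY2022", 6_450, "USD millions"),
]

# Selecting "revenue for FY2023" is now a lookup with one correct answer,
# not a guess about which of four nearby numbers the sentence meant.
fy2023_revenue = next(f for f in facts if f.metric == "revenue" and f.period == "FY2023")
print(fy2023_revenue.value)  # 12480
```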
Evidence & Observations
- The arXiv paper 'Structure First, Reason Next' reports a ~12% relative improvement on FinQA from a KG-enhanced pipeline, suggesting that structure and grounding, not raw computation, are the bottleneck (see the sketch after this list). (citation)
- The FinQA benchmark paper shows that models most often fail at number selection and multi-step reasoning over tables and text, not at the arithmetic itself. (citation)
- Chain-of-Thought prompting improves arithmetic but still suffers from grounding and retrieval errors in long documents. (citation)
- Tabular reasoning work (TaPas) shows that structure-aware representations significantly outperform text-only models on table QA. (citation)
- RAG and tool-using agents reduce hallucination by grounding models in structured sources, supporting the idea that representation, not computation, is the bottleneck. (citation)
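As a rough illustration of the 'structure first' idea (not the cited paper's actual pipeline), the sketch below stores extracted facts as subject-predicate-object triples and answers lookups against them, failing loudly when a fact is missing rather than guessing. Entity names, predicates, and numbers are hypothetical.

```python
from collections import defaultdict

# Facts extracted from the document as (subject, predicate, object) triples,
# the simplest knowledge-graph representation. All values are hypothetical.
triples = [
    ("AcmeCo", "revenue_FY2023_usd_m", 12_480),
    ("AcmeCo", "revenue_FY2022_usd_m", 11_020),
    ("AcmeCo", "cost_of_revenue_FY2023_usd_m", 7_310),
    ("AcmeCo", "cost_of_revenue_FY2022_usd_m", 6_450),
]

graph: dict[str, dict[str, float]] = defaultdict(dict)
for subj, pred, obj in triples:
    graph[subj][pred] = obj

def lookup(entity: str, predicate: str) -> float:
    """Grounded retrieval: return the stated fact or fail loudly, instead of
    letting the model guess a plausible-looking number from raw text."""
    if predicate not in graph[entity]:
        raise KeyError(f"No stored fact for {entity}/{predicate}")
    return graph[entity][predicate]

print(lookup("AcmeCo", "revenue_FY2023_usd_m"))  # 12480
```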
Open Uncertainties
- How well does this approach generalize beyond FinQA or beyond financial documents?
- Is the cost and complexity of building the knowledge graph worth it compared to just using larger or better LLMs?
- How robust is the pipeline to extraction errors when building the KG?
- To what extent can better table-aware or tool-using models close this gap without a full knowledge graph?
- What is the minimal structure needed to get most of the benefit (KG vs simpler schemas)?
Current Position
LLMs are not inherently bad at math; failures mostly come from poor grounding, structure loss, and faulty information selection in long, messy documents. Providing a structured world model (e.g., a knowledge graph) before reasoning materially improves reliability for multi-step numerical tasks.
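One way to picture 'structure first, reason next' end to end: once facts are grounded, the remaining reasoning can be a short deterministic program, so the model's only job is choosing which facts and which operations to use. The sketch below is self-contained and reuses the same hypothetical figures as above.

```python
# Hypothetical, self-contained sketch: with facts already grounded (a plain dict
# stands in for the knowledge graph here), multi-step numerical reasoning
# reduces to a short deterministic program instead of free-form text arithmetic.
facts = {
    ("AcmeCo", "revenue", "FY2023"): 12_480,          # USD millions, hypothetical
    ("AcmeCo", "revenue", "FY2022"): 11_020,
    ("AcmeCo", "cost_of_revenue", "FY2023"): 7_310,
    ("AcmeCo", "cost_of_revenue", "FY2022"): 6_450,
}

def gross_margin(entity: str, year: str) -> float:
    revenue = facts[(entity, "revenue", year)]
    cost = facts[(entity, "cost_of_revenue", year)]
    return (revenue - cost) / revenue

# "How did gross margin change from FY2022 to FY2023?" becomes two lookups and
# one subtraction, with every input traceable back to a stored fact.
change_pp = (gross_margin("AcmeCo", "FY2023") - gross_margin("AcmeCo", "FY2022")) * 100
print(f"Gross margin change: {change_pp:+.2f} percentage points")
```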
This is work-in-progress thinking, not a final conclusion.
References (5)
- 1. "Structure First, Reason Next: Enhancing a Large Language Model Using Knowledge Graph for Numerical Reasoning in Financial Documents" (arxiv.org). Paper proposing KG-augmented reasoning for financial numerical QA.
- 2. "FinQA: A Dataset of Numerical Reasoning over Financial Data" (arxiv.org). Shows multi-step numerical reasoning failures often come from retrieval and grounding issues.
- 3. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (arxiv.org). Shows LLMs can do arithmetic but still depend on correct intermediate facts.
- 4. "TaPas: Weakly Supervised Table Parsing via Pre-training" (arxiv.org). Demonstrates the importance of structure-aware models for table reasoning.
- 5. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (arxiv.org). Classic paper on grounding LLMs in external knowledge to improve factuality.