* [LSF/MM/BPF TOPIC] Locally attached memory tiering
@ 2024-05-07  3:37 David Rientjes
  2024-05-07 11:52 ` Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: David Rientjes @ 2024-05-07  3:37 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-mm, Michal Hocko, Dan Williams, John Hubbard, Zi Yan,
	Bharata B Rao, Dave Jiang, Aneesh Kumar K.V, Huang, Ying,
	Alistair Popple, Christoph Lameter, Andrew Morton,
	Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm,
	Gregory Price, Wei Xu, Johannes Weiner, SeongJae Park,
	David Hildenbrand, Davidlohr Bueso

Hi all,

I think it would be very worthwhile to have a block set aside for 
discussion on locally attached memory tiering extensions at LSF/MM/BPF 
2024.

Primarily interested in discussing Linux enlightenment for CXL 1.1 and 
later type-3 memory expansion devices (CXL.mem).  I think we could touch 
on CXL 2.0 and later memory pooling architectures if we have time and 
there is interest, but the primary focus here would be locally attached memory.

Based on the premise for a Memory Tiering Working Group[1], there is 
widespread interest in the foundational topics for generally useful Linux 
enlightenment:

 - Decoupling CPU balancing from memory balancing (or obsoleting CPU
   balancing entirely)

   + John Hubbard notes this would be useful for GPUs:

      a) GPUs have their own processors that are invisible to the kernel's
         NUMA "which tasks are active on which NUMA nodes" calculations,
         and

      b) Similar to where CXL is generally going, we have already built
         fully memory-coherent hardware, which include memory-only NUMA
         nodes.

 - In-kernel hot memory abstraction, informed by hardware hinting drivers
   (incl some architectures like Power10), usable as a NUMA Balancing
   backend for promotion and other areas of the kernel like transparent
   hugepage utilization

 - NUMA and memory tiering enlightenment for accelerators, such as for
   optimal use of GPU memory, extremely important for a cloud provider
   (hint hint :)

 - Asynchronous memory promotion independent of task_numa_fault() while
   considering the cost of page migration (due to identifying cold memory)

 - What the role of userspace plays in this decision-making and how we can
   extend the default policy and mechanisms in the kernel to allow for it
   if necessary
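
For concreteness, several of the items above already map onto
userspace-visible knobs in current kernels; a minimal sketch of the
existing mechanisms (assuming a kernel with memory tiering support; the
workload name is hypothetical):

  # Run NUMA balancing in memory tiering mode, i.e. treat faults on
  # slow-tier pages as promotion candidates:
  echo 2 > /proc/sys/kernel/numa_balancing

  # Let reclaim demote cold pages to a slower tier instead of
  # discarding or swapping them:
  echo 1 > /sys/kernel/mm/numa/demotion_enabled

  # Userspace-driven migration: move a process's pages from the slow
  # node 1 to the fast node 0 (migratepages(8) from the numactl suite):
  migratepages $(pidof workload) 1 0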

Additional topics that you find interesting are also very helpful!

I'm biased toward a generally useful solution that would leverage the 
kernel as the ultimate source of truth for page hotness that can be 
extended for multiple use cases, one of which is memory tiering support.  
But certainly if there are other approaches, we can discuss that as well.

A few main goals from this discussion:

 - Ensure that proposals address, or can be extended to address, the 
   emerging needs of the various use cases that users may have

 - Surface any constraints that stakeholders may find to be prohibitive
   for support in the core MM subsystem

 - Alignment and division of work for developers who are actively looking
   to contribute to this area

As I'm just one of many stakeholders for this discussion, I'd nominate 
Michal Hocko to moderate it if he's willing to do so.  If he's so willing, 
we'd be in good hands :)

 [1] https://lore.kernel.org/linux-mm/45d850ec-623b-7c07-c266-e948cdbf1f62@linux.com/T/



* Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering
  2024-05-07  3:37 [LSF/MM/BPF TOPIC] Locally attached memory tiering David Rientjes
@ 2024-05-07 11:52 ` Michal Hocko
  2024-05-07 20:09   ` David Rientjes
  2024-05-08  4:14 ` Huang, Ying
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2024-05-07 11:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: lsf-pc, linux-mm, Dan Williams, John Hubbard, Zi Yan,
	Bharata B Rao, Dave Jiang, Aneesh Kumar K.V, Huang, Ying,
	Alistair Popple, Christoph Lameter, Andrew Morton,
	Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm,
	Gregory Price, Wei Xu, Johannes Weiner, SeongJae Park,
	David Hildenbrand, Davidlohr Bueso

On Mon 06-05-24 20:37:19, David Rientjes wrote:
> Hi all,
> 
> I think it would be very worthwhile to have a block set aside for 
> discussion on locally attached memory tiering extensions at LSF/MM/BPF 
> 2024.
> 
> Primarily interested in discussing Linux enlightenment for CXL 1.1 and 
> later type-3 memory expansion devices (CXL.mem).  I think we could touch 
> on CXL 2.0 and later memory pooling architectures if we have time and 
> there is interest, but the primary focus here would be locally attached memory.
> 
> Based on the premise for a Memory Tiering Working Group[1], there is 
> widespread interest in the foundational topics for generally useful Linux 
> enlightenment:
> 
>  - Decoupling CPU balancing from memory balancing (or obsoleting CPU
>    balancing entirely)
> 
>    + John Hubbard notes this would be useful for GPUs:
> 
>       a) GPUs have their own processors that are invisible to the kernel's
>          NUMA "which tasks are active on which NUMA nodes" calculations,
>          and
> 
>       b) Similar to where CXL is generally going, we have already built
>          fully memory-coherent hardware, which include memory-only NUMA
>          nodes.
> 
>  - In-kernel hot memory abstraction, informed by hardware hinting drivers
>    (incl some architectures like Power10), usable as a NUMA Balancing
>    backend for promotion and other areas of the kernel like transparent
>    hugepage utilization
> 
>  - NUMA and memory tiering enlightenment for accelerators, such as for
>    optimal use of GPU memory, extremely important for a cloud provider
>    (hint hint :)
> 
>  - Asynchronous memory promotion independent of task_numa_fault() while
>    considering the cost of page migration (due to identifying cold memory)
> 
>  - What the role of userspace plays in this decision-making and how we can
>    extend the default policy and mechanisms in the kernel to allow for it
>    if necessary
> 
> Additional topics that you find interesting are also very helpful!
> 
> I'm biased toward a generally useful solution that would leverage the 
> kernel as the ultimate source of truth for page hotness that can be 
> extended for multiple use cases, one of which is memory tiering support.  
> But certainly if there are other approaches, we can discuss that as well.
> 
> A few main goals from this discussion:
> 
>  - Ensure that proposals address, or can be extended to address, the 
>    emerging needs of the various use cases that users may have
> 
>  - Surface any constraints that stakeholders may find to be prohibitive
>    for support in the core MM subsystem
> 
>  - Alignment and division of work for developers who are actively looking
>    to contribute to this area

Do you think having 2 contiguous slots would be sufficient for these
topics?

> As I'm just one of many stakeholders for this discussion, I'd nominate 
> Michal Hocko to moderate it if he's willing to do so.

Sure I can help out with that.

-- 
Michal Hocko
SUSE Labs



* Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering
  2024-05-07 11:52 ` Michal Hocko
@ 2024-05-07 20:09   ` David Rientjes
  0 siblings, 0 replies; 11+ messages in thread
From: David Rientjes @ 2024-05-07 20:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: lsf-pc, linux-mm, Dan Williams, John Hubbard, Zi Yan,
	Bharata B Rao, Dave Jiang, Aneesh Kumar K.V, Huang, Ying,
	Alistair Popple, Christoph Lameter, Andrew Morton,
	Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm,
	Gregory Price, Wei Xu, Johannes Weiner, SeongJae Park,
	David Hildenbrand, Davidlohr Bueso

On Tue, 7 May 2024, Michal Hocko wrote:

> On Mon 06-05-24 20:37:19, David Rientjes wrote:
> > Hi all,
> > 
> > I think it would be very worthwhile to have a block set aside for 
> > discussion on locally attached memory tiering extensions at LSF/MM/BPF 
> > 2024.
> > 
> > Primarily interested in discussing Linux enlightenment for CXL 1.1 and 
> > later type-3 memory expansion devices (CXL.mem).  I think we could touch 
> > on CXL 2.0 and later memory pooling architectures if we have time and 
> > there is interest, but the primary focus here would be locally attached memory.
> > 
> > Based on the premise for a Memory Tiering Working Group[1], there is 
> > widespread interest in the foundational topics for generally useful Linux 
> > enlightenment:
> > 
> >  - Decoupling CPU balancing from memory balancing (or obsoleting CPU
> >    balancing entirely)
> > 
> >    + John Hubbard notes this would be useful for GPUs:
> > 
> >       a) GPUs have their own processors that are invisible to the kernel's
> >          NUMA "which tasks are active on which NUMA nodes" calculations,
> >          and
> > 
> >       b) Similar to where CXL is generally going, we have already built
> >          fully memory-coherent hardware, which include memory-only NUMA
> >          nodes.
> > 
> >  - In-kernel hot memory abstraction, informed by hardware hinting drivers
> >    (incl some architectures like Power10), usable as a NUMA Balancing
> >    backend for promotion and other areas of the kernel like transparent
> >    hugepage utilization
> > 
> >  - NUMA and memory tiering enlightenment for accelerators, such as for
> >    optimal use of GPU memory, extremely important for a cloud provider
> >    (hint hint :)
> > 
> >  - Asynchronous memory promotion independent of task_numa_fault() while
> >    considering the cost of page migration (due to identifying cold memory)
> > 
> >  - What the role of userspace plays in this decision-making and how we can
> >    extend the default policy and mechanisms in the kernel to allow for it
> >    if necessary
> > 
> > Additional topics that you find interesting are also very helpful!
> > 
> > I'm biased toward a generally useful solution that would leverage the 
> > kernel as the ultimate source of truth for page hotness that can be 
> > extended for multiple use cases, one of which is memory tiering support.  
> > But certainly if there are other approaches, we can discuss that as well.
> > 
> > A few main goals from this discussion:
> > 
> >  - Ensure that proposals address, or can be extended to address, the 
> >    emerging needs of the various use cases that users may have
> > 
> >  - Surface any constraints that stakeholders may find to be prohibitive
> >    for support in the core MM subsystem
> > 
> >  - Alignment and division of work for developers who are actively looking
> >    to contribute to this area
> 
> Do you think having 2 contiguous slots would be sufficient for these
> topics?
> 

Yes, I think that makes perfect sense.

> > As I'm just one of many stakeholders for this discussion, I'd nominate 
> > Michal Hocko to moderate it if he's willing to do so.
> 
> Sure I can help out with that.
> 

Thank you Michal!

All: if there are additional topics that we should discuss in this block 
beyond what is listed above, that would be great feedback.  We can make 
sure it is all covered in the session, so please expand upon the above 
with anything we should cover.

Thanks!



* Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering
  2024-05-07  3:37 [LSF/MM/BPF TOPIC] Locally attached memory tiering David Rientjes
  2024-05-07 11:52 ` Michal Hocko
@ 2024-05-08  4:14 ` Huang, Ying
  2024-05-10  3:10   ` David Rientjes
  2024-05-08 21:39 ` Davidlohr Bueso
       [not found] ` <CGME20240509173529uscas1p1b6e43b169514d36915cd2bc8aabc4200@uscas1p1.samsung.com>
  3 siblings, 1 reply; 11+ messages in thread
From: Huang, Ying @ 2024-05-08  4:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: lsf-pc, linux-mm, Michal Hocko, Dan Williams, John Hubbard,
	Zi Yan, Bharata B Rao, Dave Jiang, Aneesh Kumar K.V,
	Alistair Popple, Christoph Lameter, Andrew Morton,
	Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm,
	Gregory Price, Wei Xu, Johannes Weiner, SeongJae Park,
	David Hildenbrand, Davidlohr Bueso

Hi, David,

Thanks!  This is a great summary!

David Rientjes <rientjes@google.com> writes:

> Hi all,
>
> I think it would be very worthwhile to have a block set aside for 
> discussion on locally attached memory tiering extensions at LSF/MM/BPF 
> 2024.
>
> Primarily interested in discussing Linux enlightenment for CXL 1.1 and 
> later type-3 memory expansion devices (CXL.mem).  I think we could touch 
> on CXL 2.0 and later memory pooling architectures if we have time and 
> there is interest, but the primary focus here would be locally attached memory.
>
> Based on the premise for a Memory Tiering Working Group[1], there is 
> widespread interest in the foundational topics for generally useful Linux 
> enlightenment:
>
>  - Decoupling CPU balancing from memory balancing (or obsoleting CPU
>    balancing entirely)
>
>    + John Hubbard notes this would be useful for GPUs:
>
>       a) GPUs have their own processors that are invisible to the kernel's
>          NUMA "which tasks are active on which NUMA nodes" calculations,
>          and
>
>       b) Similar to where CXL is generally going, we have already built
>          fully memory-coherent hardware, which include memory-only NUMA
>          nodes.
>
>  - In-kernel hot memory abstraction, informed by hardware hinting drivers
>    (incl some architectures like Power10), usable as a NUMA Balancing
>    backend for promotion and other areas of the kernel like transparent
>    hugepage utilization
>
>  - NUMA and memory tiering enlightenment for accelerators, such as for
>    optimal use of GPU memory, extremely important for a cloud provider
>    (hint hint :)
>
>  - Asynchronous memory promotion independent of task_numa_fault() while
>    considering the cost of page migration (due to identifying cold memory)
>
>  - What the role of userspace plays in this decision-making and how we can
>    extend the default policy and mechanisms in the kernel to allow for it
>    if necessary
>
> Additional topics that you find interesting are also very helpful!

In addition to hot memory identification and promotion, I think that we
should also consider cold memory identification and demotion as part of a
full solution.  The existing method based on the page table accessed bit
may be good enough, but we still need to consider the full solution in
the context of general NUMA balancing.
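
As a concrete reference point for the accessed-bit method: the kernel
already exposes it to userspace via idle page tracking.  A minimal sketch,
assuming CONFIG_IDLE_PAGE_TRACKING, root, and an arbitrary example PFN
range:

  # Mark PFNs 0-63 idle by writing an all-ones 64-bit word, let the
  # workload run, then re-read: bits still set belong to pages whose
  # accessed bit was never set in the interval, i.e. cold pages.
  printf '\377\377\377\377\377\377\377\377' |
      dd of=/sys/kernel/mm/page_idle/bitmap bs=8 count=1
  sleep 60
  dd if=/sys/kernel/mm/page_idle/bitmap bs=8 count=1 2>/dev/null | od -tx8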

> I'm biased toward a generally useful solution that would leverage the 
> kernel as the ultimate source of truth for page hotness that can be 
> extended for multiple use cases, one of which is memory tiering support.  
> But certainly if there are other approaches, we can discuss that as well.
>
> A few main goals from this discussion:
>
>  - Ensure that proposals address, or can be extended to address, the 
>    emerging needs of the various use cases that users may have
>
>  - Surface any constraints that stakeholders may find to be prohibitive
>    for support in the core MM subsystem
>
>  - Alignment and division of work for developers who are actively looking
>    to contribute to this area
>
> As I'm just one of many stakeholders for this discussion, I'd nominate 
> Michal Hocko to moderate it if he's willing to do so.  If he's so willing, 
> we'd be in good hands :)
>
>  [1] https://lore.kernel.org/linux-mm/45d850ec-623b-7c07-c266-e948cdbf1f62@linux.com/T/

--
Best Regards,
Huang, Ying



* Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering
  2024-05-07  3:37 [LSF/MM/BPF TOPIC] Locally attached memory tiering David Rientjes
  2024-05-07 11:52 ` Michal Hocko
  2024-05-08  4:14 ` Huang, Ying
@ 2024-05-08 21:39 ` Davidlohr Bueso
  2024-05-09  1:42   ` Huang, Ying
       [not found] ` <CGME20240509173529uscas1p1b6e43b169514d36915cd2bc8aabc4200@uscas1p1.samsung.com>
  3 siblings, 1 reply; 11+ messages in thread
From: Davidlohr Bueso @ 2024-05-08 21:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: lsf-pc, linux-mm, Michal Hocko, Dan Williams, John Hubbard,
	Zi Yan, Bharata B Rao, Dave Jiang, Aneesh Kumar K.V, Huang, Ying,
	Alistair Popple, Christoph Lameter, Andrew Morton,
	Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm,
	Gregory Price, Wei Xu, Johannes Weiner, SeongJae Park,
	David Hildenbrand, peterz, a.manzanares

On Mon, 06 May 2024, David Rientjes wrote:

>Hi all,
>
>I think it would be very worthwhile to have a block set aside for
>discussion on locally attached memory tiering extensions at LSF/MM/BPF
>2024.

+1

fyi Adam's proposal which touches on both cxl and tiering:

https://lore.kernel.org/all/9bf86b97-319f-4f58-b658-1fe3ed0b1993@nmtadam.samsung/

>Primarily interested in discussing Linux enlightenment for CXL 1.1 and
>later type-3 memory expansion devices (CXL.mem).  I think we could touch
>on CXL 2.0 and later memory pooling architectures if we have time and
>there is interest, but the primary focus here would be locally attached memory.
>
>Based on the premise for a Memory Tiering Working Group[1], there is
>widespread interest in the foundational topics for generally useful Linux
>enlightenment:
>
> - Decoupling CPU balancing from memory balancing (or obsoleting CPU
>   balancing entirely)
>
>   + John Hubbard notes this would be useful for GPUs:
>
>      a) GPUs have their own processors that are invisible to the kernel's
>         NUMA "which tasks are active on which NUMA nodes" calculations,
>         and
>
>      b) Similar to where CXL is generally going, we have already built
>         fully memory-coherent hardware, which include memory-only NUMA
>         nodes.

+Cc peterz

> - In-kernel hot memory abstraction, informed by hardware hinting drivers
>   (incl some architectures like Power10), usable as a NUMA Balancing
>   backend for promotion and other areas of the kernel like transparent
>   hugepage utilization
>
> - NUMA and memory tiering enlightenment for accelerators, such as for
>   optimal use of GPU memory, extremely important for a cloud provider
>   (hint hint :)
>
> - Asynchronous memory promotion independent of task_numa_fault() while
>   considering the cost of page migration (due to identifying cold memory)

This would be nice for users who like to disable NUMA balancing. But overall
when compared to anything hardware can give us (ala ppc, without the required
kernel overhead of x86-based counters), I fear that software solutions will
always be found wanting. And, afaik, numa balancing based promotion is still
one of the top pain points in memory tiering.

So, of course, improving the software approach is still a good thing. Fyi,
along these lines, improving/optimizing the current numa balancing approach
has proven largely irrelevant at the scale of full benchmarks, afaik. For
example, (active) LRU based promotion instead of blindly promoting the
faulting page, which could be rarely used. Benchmarks show a significant
reduction in promote/demote traffic for the ping-pong cases, but
unfortunately little to no tangible performance wins in actual benchmark
numbers. Similarly, the proposed migrc[1] shows great TLB flushing benefits
but minimal benchmark (XSBench) improvement.

... which brings me to the topic of benchmarking. What are the workloads
people care about, beyond pmbench? I tend to use OLTP-based database
workloads with wss/buffers larger than the total capacity of the fast
memory nodes.
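
Whatever the workload, one hedged suggestion: sample the tiering counters
in /proc/vmstat around the run, so the migration traffic hiding behind the
headline score is visible:

  # Deltas of these counters across the run give the promote/demote
  # traffic that the benchmark score alone does not show:
  grep -E 'pgpromote_|pgdemote_|numa_pages_migrated' /proc/vmstat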

> - What the role of userspace plays in this decision-making and how we can
>   extend the default policy and mechanisms in the kernel to allow for it
>   if necessary
>
>Additional topics that you find interesting are also very helpful!
>
>I'm biased toward a generally useful solution that would leverage the
>kernel as the ultimate source of truth for page hotness that can be
>extended for multiple use cases, one of which is memory tiering support.
>But certainly if there are other approaches, we can discuss that as well.
>
>A few main goals from this discussion:
>
> - Ensure that proposals address, or can be extended to address, the
>   emerging needs of the various use cases that users may have
>
> - Surface any constraints that stakeholders may find to be prohibitive
>   for support in the core MM subsystem
>
> - Alignment and division of work for developers who are actively looking
>   to contribute to this area
>
>As I'm just one of many stakeholders for this discussion, I'd nominate
>Michal Hocko to moderate it if he's willing to do so.  If he's so willing,
>we'd be in good hands :)

>
> [1] https://lore.kernel.org/linux-mm/45d850ec-623b-7c07-c266-e948cdbf1f62@linux.com/T/
>

Thanks,
Davidlohr

[1] https://lore.kernel.org/linux-mm/20240226030613.22366-1-byungchul@sk.com/



* Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering
  2024-05-08 21:39 ` Davidlohr Bueso
@ 2024-05-09  1:42   ` Huang, Ying
  2024-05-13  1:49     ` Davidlohr Bueso
  0 siblings, 1 reply; 11+ messages in thread
From: Huang, Ying @ 2024-05-09  1:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: lsf-pc, linux-mm, Michal Hocko, Dan Williams, John Hubbard,
	Zi Yan, Bharata B Rao, Dave Jiang, Aneesh Kumar K.V,
	Alistair Popple, Christoph Lameter, Andrew Morton,
	Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm,
	Gregory Price, Wei Xu, Johannes Weiner, SeongJae Park,
	David Hildenbrand, peterz, a.manzanares

Hi, Davidlohr,

Davidlohr Bueso <dave@stgolabs.net> writes:

> On Mon, 06 May 2024, David Rientjes wrote:
>
>>Hi all,
>>
>>I think it would be very worthwhile to have a block set aside for
>>discussion on locally attached memory tiering extensions at LSF/MM/BPF
>>2024.
>
> +1
>
> fyi Adam's proposal which touches on both cxl and tiering:
>
> https://lore.kernel.org/all/9bf86b97-319f-4f58-b658-1fe3ed0b1993@nmtadam.samsung/
>
>>Primarily interested in discussing Linux enlightenment for CXL 1.1 and
>>later type-3 memory expansion devices (CXL.mem).  I think we could touch
>>on CXL 2.0 and later memory pooling architectures if we have time and
>>there is interest, but the primary focus here would be locally attached memory.
>>
>>Based on the premise for a Memory Tiering Working Group[1], there is
>>widespread interest in the foundational topics for generally useful Linux
>>enlightenment:
>>
>> - Decoupling CPU balancing from memory balancing (or obsoleting CPU
>>   balancing entirely)
>>
>>   + John Hubbard notes this would be useful for GPUs:
>>
>>      a) GPUs have their own processors that are invisible to the kernel's
>>         NUMA "which tasks are active on which NUMA nodes" calculations,
>>         and
>>
>>      b) Similar to where CXL is generally going, we have already built
>>         fully memory-coherent hardware, which include memory-only NUMA
>>         nodes.
>
> +Cc peterz
>
>> - In-kernel hot memory abstraction, informed by hardware hinting drivers
>>   (incl some architectures like Power10), usable as a NUMA Balancing
>>   backend for promotion and other areas of the kernel like transparent
>>   hugepage utilization
>>
>> - NUMA and memory tiering enlightenment for accelerators, such as for
>>   optimal use of GPU memory, extremely important for a cloud provider
>>   (hint hint :)
>>
>> - Asynchronous memory promotion independent of task_numa_fault() while
>>   considering the cost of page migration (due to identifying cold memory)
>
> This would be nice for users who like to disable NUMA balancing. But overall
> when compared to anything hardware can give us (ala ppc, without the required
> kernel overhead of x86-based counters), I fear that software solutions will
> always be found wanting. And, afaik, numa balancing based promotion is still
> one of the top pain points in memory tiering.
>
> So, of course, improving the software approach is still a good thing. Fyi,
> along these lines, improving/optimizing the current numa balancing approach
> has proven largely irrelevant at the scale of full benchmarks, afaik. For
> example, (active) LRU based promotion instead of blindly promoting the
> faulting page, which could be rarely used.

With the default configuration, the current NUMA balancing based promotion
solution will try to promote almost any faulting page.  To select hot
pages to promote and to control thrashing between NUMA nodes, the promote
rate limit needs to be configured, for example via:

echo 200 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps

With this, up to 200MB of hot pages will be selected and promoted every
second.  Can you try it?
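
(IIUC, the rate limit only takes effect when NUMA balancing runs in memory
tiering mode, i.e. with

echo 2 > /proc/sys/kernel/numa_balancing

set beforehand, so that faults on slow-tier pages are considered for
promotion at all.)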

> Benchmarks show a significant reduction in promote/demote traffic for
> the ping-pong cases, but unfortunately little to no tangible performance
> wins in actual benchmark numbers. Similarly, the proposed migrc[1] shows
> great TLB flushing benefits but minimal benchmark (XSBench) improvement.
>
> ... which brings me to the topic of benchmarking. What are the workloads
> people care about, beyond pmbench? I tend to use OLTP-based database
> workloads with wss/buffers larger than the total capacity of the fast
> memory nodes.
>
>> - What the role of userspace plays in this decision-making and how we can
>>   extend the default policy and mechanisms in the kernel to allow for it
>>   if necessary
>>
>>Additional topics that you find interesting are also very helpful!
>>
>>I'm biased toward a generally useful solution that would leverage the
>>kernel as the ultimate source of truth for page hotness that can be
>>extended for multiple use cases, one of which is memory tiering support.
>>But certainly if there are other approaches, we can discuss that as well.
>>
>>A few main goals from this discussion:
>>
>> - Ensure that proposals address, or can be extended to address, the
>>   emerging needs of the various use cases that users may have
>>
>> - Surface any constraints that stakeholders may find to be prohibitive
>>   for support in the core MM subsystem
>>
>> - Alignment and division of work for developers who are actively looking
>>   to contribute to this area
>>
>>As I'm just one of many stakeholders for this discussion, I'd nominate
>>Michal Hocko to moderate it if he's willing to do so.  If he's so willing,
>>we'd be in good hands :)
>
>>
>> [1] https://lore.kernel.org/linux-mm/45d850ec-623b-7c07-c266-e948cdbf1f62@linux.com/T/
>>
>
> Thanks,
> Davidlohr
>
> [1] https://lore.kernel.org/linux-mm/20240226030613.22366-1-byungchul@sk.com/

--
Best Regards,
Huang, Ying



* Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering
       [not found] ` <CGME20240509173529uscas1p1b6e43b169514d36915cd2bc8aabc4200@uscas1p1.samsung.com>
@ 2024-05-09 17:35   ` Adam Manzanares
  0 siblings, 0 replies; 11+ messages in thread
From: Adam Manzanares @ 2024-05-09 17:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: lsf-pc, linux-mm, Michal Hocko, Dan Williams, John Hubbard,
	Zi Yan, Bharata B Rao, Dave Jiang, Aneesh Kumar K.V, Huang, Ying,
	Alistair Popple, Christoph Lameter, Andrew Morton,
	Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm,
	Gregory Price, Wei Xu, Johannes Weiner, SeongJae Park,
	David Hildenbrand, Davidlohr Bueso, mcgrof

On Mon, May 06, 2024 at 08:37:19PM -0700, David Rientjes wrote:
> Hi all,
> 
> I think it would be very worthwhile to have a block set aside for 
> discussion on locally attached memory tiering extensions at LSF/MM/BPF 
> 2024.
> 

Agreed.

> Primarily interested in discussing Linux enlightenment for CXL 1.1 and 
> later type-3 memory expansion devices (CXL.mem).  I think we could touch 
> on CXL 2.0 and later memory pooling architectures if we have time and 
> there is interest, but the primary focus here would be locally attached memory.

Same thought here as well, but I tend to decouple the CXL specification 
version from the CXL device type. I see the CXL 2.0 feature of hot 
add/remove as being controversial, but there is not a hard requirement to 
hot add/remove CXL 2.0 capable devices from systems. Type 3 memory devices 
are a type of CXL device that can be compatible with different CXL 
specification versions.

What I do like about CXL 2.0 is the push for more OS control of the device 
and of the CXL hierarchy (HDM decoder programming). IMO the notion of 
locally attached is also not as important as the performance 
characteristics of the link.
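
For what it's worth, a minimal sketch of how a locally attached type 3
device typically surfaces today (device names are examples only):

  # Enumerate CXL memory devices with cxl-cli from the ndctl project:
  cxl list -M

  # Online the CXL-backed dax device as a hotplugged, memory-only NUMA
  # node via the kmem driver:
  daxctl reconfigure-device --mode=system-ram dax0.0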

> 
> Based on the premise for a Memory Tiering Working Group[1], there is 
> widespread interest in the foundational topics for generally useful Linux 
> enlightenment:
> 
>  - Decoupling CPU balancing from memory balancing (or obsoleting CPU
>    balancing entirely)
> 
>    + John Hubbard notes this would be useful for GPUs:
> 
>       a) GPUs have their own processors that are invisible to the kernel's
>          NUMA "which tasks are active on which NUMA nodes" calculations,
>          and
> 
>       b) Similar to where CXL is generally going, we have already built
>          fully memory-coherent hardware, which include memory-only NUMA
>          nodes.
> 
>  - In-kernel hot memory abstraction, informed by hardware hinting drivers
>    (incl some architectures like Power10), usable as a NUMA Balancing
>    backend for promotion and other areas of the kernel like transparent
>    hugepage utilization
> 
>  - NUMA and memory tiering enlightenment for accelerators, such as for
>    optimal use of GPU memory, extremely important for a cloud provider
>    (hint hint :)
> 
>  - Asynchronous memory promotion independent of task_numa_fault() while
>    considering the cost of page migration (due to identifying cold memory)
> 
>  - What the role of userspace plays in this decision-making and how we can
>    extend the default policy and mechanisms in the kernel to allow for it
>    if necessary
> 
> Additional topics that you find interesting are also very helpful!
> 
> I'm biased toward a generally useful solution that would leverage the 
> kernel as the ultimate source of truth for page hotness that can be 
> extended for multiple use cases, one of which is memory tiering support.  
> But certainly if there are other approaches, we can discuss that as well.
> 
> A few main goals from this discussion:
> 
>  - Ensure that proposals address, or can be extended to address, the 
>    emerging needs of the various use cases that users may have
> 
>  - Surface any constraints that stakeholders may find to be prohibitive
>    for support in the core MM subsystem
> 
>  - Alignment and division of work for developers who are actively looking
>    to contribute to this area

Luis has done a great job doing this in the large block effort. If he can
join this discussion, I think his input would be valuable.

> 
> As I'm just one of many stakeholders for this discussion, I'd nominate 
> Michal Hocko to moderate it if he's willing to do so.  If he's so willing, 
> we'd be in good hands :)
> 
>  [1] https://lore.kernel.org/linux-mm/45d850ec-623b-7c07-c266-e948cdbf1f62@linux.com/T/
> 
> 


* Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering
  2024-05-08  4:14 ` Huang, Ying
@ 2024-05-10  3:10   ` David Rientjes
  0 siblings, 0 replies; 11+ messages in thread
From: David Rientjes @ 2024-05-10  3:10 UTC (permalink / raw)
  To: Huang, Ying
  Cc: lsf-pc, linux-mm, Michal Hocko, Dan Williams, John Hubbard,
	Zi Yan, Bharata B Rao, Dave Jiang, Aneesh Kumar K.V,
	Alistair Popple, Christoph Lameter, Andrew Morton,
	Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm,
	Gregory Price, Wei Xu, Johannes Weiner, SeongJae Park,
	David Hildenbrand, Davidlohr Bueso, Yuanchu Xie

On Wed, 8 May 2024, Huang, Ying wrote:

> > Hi all,
> >
> > I think it would be very worthwhile to have a block set aside for 
> > discussion on locally attached memory tiering extensions at LSF/MM/BPF 
> > 2024.
> >
> > Primarily interested in discussing Linux enlightenment for CXL 1.1 and 
> > later type-3 memory expansion devices (CXL.mem).  I think we could touch 
> > on CXL 2.0 and later memory pooling architectures if we have time and 
> > there is interest, but the primary focus here would be locally attached memory.
> >
> > Based on the premise for a Memory Tiering Working Group[1], there is 
> > widespread interest in the foundational topics for generally useful Linux 
> > enlightenment:
> >
> >  - Decoupling CPU balancing from memory balancing (or obsoleting CPU
> >    balancing entirely)
> >
> >    + John Hubbard notes this would be useful for GPUs:
> >
> >       a) GPUs have their own processors that are invisible to the kernel's
> >          NUMA "which tasks are active on which NUMA nodes" calculations,
> >          and
> >
> >       b) Similar to where CXL is generally going, we have already built
> >          fully memory-coherent hardware, which include memory-only NUMA
> >          nodes.
> >
> >  - In-kernel hot memory abstraction, informed by hardware hinting drivers
> >    (incl some architectures like Power10), usable as a NUMA Balancing
> >    backend for promotion and other areas of the kernel like transparent
> >    hugepage utilization
> >
> >  - NUMA and memory tiering enlightenment for accelerators, such as for
> >    optimal use of GPU memory, extremely important for a cloud provider
> >    (hint hint :)
> >
> >  - Asynchronous memory promotion independent of task_numa_fault() while
> >    considering the cost of page migration (due to identifying cold memory)
> >
> >  - What the role of userspace plays in this decision-making and how we can
> >    extend the default policy and mechanisms in the kernel to allow for it
> >    if necessary
> >
> > Additional topics that you find interesting are also very helpful!
> 
> In addition to hot memory identification and promotion, I think that we
> should also consider cold memory identification and demotion as part of a
> full solution.  The existing method based on the page table accessed bit
> may be good enough, but we still need to consider the full solution in
> the context of general NUMA balancing.
> 

I think that's a great suggestion!  We'll be able to cover the approach 
taken by workingset reporting[*], which is quite powerful for the purposes 
of proactive reclaim through memory.reclaim and would also be very useful 
for identifying cold memory for demotion.

 [*] https://lore.kernel.org/linux-mm/20240504073011.4000534-1-yuanchu@google.com/T/
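
A hedged sketch of that interface (the cgroup path is an example only):

  # Ask the kernel to proactively reclaim 1G from this cgroup; with
  # demotion enabled on a tiered system, cold pages can be demoted to
  # the slow node rather than swapped or dropped:
  echo 1G > /sys/fs/cgroup/workload/memory.reclaim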

> > I'm biased toward a generally useful solution that would leverage the 
> > kernel as the ultimate source of truth for page hotness that can be 
> > extended for multiple use cases, one of which is memory tiering support.  
> > But certainly if there are other approaches, we can discuss that as well.
> >
> > A few main goals from this discussion:
> >
> >  - Ensure that proposals address, or can be extended to address, the 
> >    emerging needs of the various use cases that users may have
> >
> >  - Surface any constraints that stakeholders may find to be prohibitive
> >    for support in the core MM subsystem
> >
> >  - Alignment and division of work for developers who are actively looking
> >    to contribute to this area
> >
> > As I'm just one of many stakeholders for this discussion, I'd nominate 
> > Michal Hocko to moderate it if he's willing to do so.  If he's so willing, 
> > we'd be in good hands :)
> >
> >  [1] https://lore.kernel.org/linux-mm/45d850ec-623b-7c07-c266-e948cdbf1f62@linux.com/T/
> 
> --
> Best Regards,
> Huang, Ying
> 



* Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering
  2024-05-09  1:42   ` Huang, Ying
@ 2024-05-13  1:49     ` Davidlohr Bueso
  2024-05-13  3:28       ` Bharata B Rao
  2024-05-13  7:48       ` Huang, Ying
  0 siblings, 2 replies; 11+ messages in thread
From: Davidlohr Bueso @ 2024-05-13  1:49 UTC (permalink / raw)
  To: Huang, Ying
  Cc: David Rientjes, lsf-pc, linux-mm, Michal Hocko, Dan Williams,
	John Hubbard, Zi Yan, Bharata B Rao, Dave Jiang,
	Aneesh Kumar K.V, Alistair Popple, Christoph Lameter,
	Andrew Morton, Linus Torvalds, Dave Hansen, Mel Gorman,
	Jon Grimm, Gregory Price, Wei Xu, Johannes Weiner, SeongJae Park,
	David Hildenbrand, peterz, a.manzanares

On Thu, 09 May 2024, Huang, Ying wrote:

>With the default configuration, the current NUMA balancing based promotion
>solution will try to promote almost any faulting page.  To select hot
>pages to promote and to control thrashing between NUMA nodes, the promote
>rate limit needs to be configured, for example via:
>
>echo 200 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps
>
>With this, up to 200MB of hot pages will be selected and promoted every
>second.  Can you try it?

Yes, I've played with this tunable and, just like the LRU approach, it
shows nice micro wins (fewer promotions/demotions) but little in actual
benchmark improvements at a higher level, merely noise-level or very
subtle wins. In fact, the actual data from that series for this parameter
was a ~2% pmbench win with the rate limiting, but a 69% decrease in the
promotion rate.

And this is really my point: how much effort do we want to put into
optimizing software mechanisms for hot page detection? Are there other
benchmarks we should be using? And perhaps doing the async promotion,
not incurring the numa balancing overhead, and comparing the cost of
migration before promoting would yield some better numbers, but that also
might be easy to get wrong when weighed against the relative hotness of
the page.

Thanks,
Davidlohr



* Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering
  2024-05-13  1:49     ` Davidlohr Bueso
@ 2024-05-13  3:28       ` Bharata B Rao
  2024-05-13  7:48       ` Huang, Ying
  1 sibling, 0 replies; 11+ messages in thread
From: Bharata B Rao @ 2024-05-13  3:28 UTC (permalink / raw)
  To: Huang, Ying, David Rientjes, lsf-pc, linux-mm, Michal Hocko,
	Dan Williams, John Hubbard, Zi Yan, Dave Jiang, Aneesh Kumar K.V,
	Alistair Popple, Christoph Lameter, Andrew Morton,
	Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm,
	Gregory Price, Wei Xu, Johannes Weiner, SeongJae Park,
	David Hildenbrand, peterz, a.manzanares



On 5/13/2024 7:19 AM, Davidlohr Bueso wrote:
> On Thu, 09 May 2024, Huang, Ying wrote:
> 
>> With the default configuration, the current NUMA balancing based
>> promotion solution will try to promote almost any faulting page.  To
>> select hot pages to promote and to control thrashing between NUMA nodes,
>> the promote rate limit needs to be configured, for example via:
>>
>> echo 200 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps
>>
>> With this, up to 200MB of hot pages will be selected and promoted every
>> second.  Can you try it?
> 
> Yes, I've played with this tunable and, just like the LRU approach, it
> shows nice micro wins (fewer promotions/demotions) but little in actual
> benchmark improvements at a higher level, merely noise-level or very
> subtle wins. In fact, the actual data from that series for this parameter
> was a ~2% pmbench win with the rate limiting, but a 69% decrease in the
> promotion rate.
> 
> And this is really my point: how much effort do we want to put into
> optimizing software mechanisms for hot page detection? Are there other
> benchmarks we should be using?

Yes, some representative benchmarks to evaluate the effectiveness of hot 
page promotion would be useful.

Recently there was a discussion about the effectiveness of hot page 
detection in the context of a micro-benchmark. More details here:

https://lore.kernel.org/linux-mm/929b22ca-bb51-4307-855f-9b4ae0a102e3@amd.com/T/#m04eb5d9dfb30133156d4dcb33b09b89a4e9299ea

> And perhaps doing the async promotion, not incurring the numa balancing
> overhead, and comparing the cost of migration before promoting would
> yield some better numbers, but that also might be easy to get wrong when
> weighed against the relative hotness of the page.

Regards,
Bharata.



* Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering
  2024-05-13  1:49     ` Davidlohr Bueso
  2024-05-13  3:28       ` Bharata B Rao
@ 2024-05-13  7:48       ` Huang, Ying
  1 sibling, 0 replies; 11+ messages in thread
From: Huang, Ying @ 2024-05-13  7:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: lsf-pc, linux-mm, Michal Hocko, Dan Williams, John Hubbard,
	Zi Yan, Bharata B Rao, Dave Jiang, Aneesh Kumar K.V,
	Alistair Popple, Christoph Lameter, Andrew Morton,
	Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm,
	Gregory Price, Wei Xu, Johannes Weiner, SeongJae Park,
	David Hildenbrand, peterz, a.manzanares

Davidlohr Bueso <dave@stgolabs.net> writes:

> On Thu, 09 May 2024, Huang, Ying wrote:
>
>>With the default configuration, the current NUMA balancing based promotion
>>solution will try to promote almost any faulting page.  To select hot
>>pages to promote and to control thrashing between NUMA nodes, the promote
>>rate limit needs to be configured, for example via:
>>
>>echo 200 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps
>>
>>With this, up to 200MB of hot pages will be selected and promoted every
>>second.  Can you try it?
>
> Yes, I've played with this tunable and, just like the LRU approach, it
> shows nice micro wins (fewer promotions/demotions) but little in actual
> benchmark improvements at a higher level, merely noise-level or very
> subtle wins. In fact, the actual data from that series for this parameter
> was a ~2% pmbench win with the rate limiting, but a 69% decrease in the
> promotion rate.

Thanks a lot for the update!

IIUC, page promotion/demotion only helps performance if there are hot
pages in the slow memory and cold pages in the fast memory.  This may not
be true for quite a few workload configurations.

For example, the default allocation mechanism is local first.  In the
context of memory tiering, that means fast memory first.  In various
workloads, it's quite normal that hot pages will be allocated first.
This makes it unnecessary to optimize the page placement until there is
some configuration change in the system.

So, to evaluate the optimization, we need to

1) check the overhead of the optimization when page placement is almost
optimal already.

2) find configurations where the page placement isn't good enough, and
check whether memory tiering optimization works.
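
A hedged sketch of constructing case 2), where node numbers and the
workload are examples only:

  # Bind initial allocations to the slow, CPU-less node so hot pages
  # start in the wrong tier, then watch how quickly promotion repairs
  # the placement (e.g. via pgpromote_success in /proc/vmstat):
  numactl --membind=1 ./workload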

> And this is really my point: how much effort do we want to put into
> optimizing software mechanisms for hot page detection? Are there other
> benchmarks we should be using? And perhaps doing the async promotion,
> not incurring the numa balancing overhead, and comparing the cost of
> migration before promoting would yield some better numbers, but that also
> might be easy to get wrong when weighed against the relative hotness of
> the page.

I believe that there is still quite some room to optimize the
software mechanisms.  The current implementation is in fact as simple as
possible.

--
Best Regards,
Huang, Ying


